very tedious to precisely count cache misses
instead, approximations:
good or bad temporal/spatial locality
with nested loops: what does inner loop use?
/* version 1 */
for (int i = 0; i < N; ++i)
for (int j = 0; j < N; ++j)
A[i] += B[j] * C[i * N + j]
/* version 2 */
for (int j = 0; j < N; ++j)
for (int i = 0; i < N; ++i)
A[i] += B[j] * C[i * N + j];
exercise: which has better temporal locality in A? in B? in C?
how about spatial locality?
for (int i = 0; i < N; ++i)
for (int j = 0; j < N; ++j)
A[i] += B[j] * C[i * N + j]
Assume: 4 array elements per block, N very large, nothing in cache at beginning.
Example: \(N/4\) estimated misses for A accesses:
Exericse: estimate # of misses for B, C
float A_2d_array[N][N];
float *A_flat = malloc(N * N);
A_flat[i * N + j] === A_2d_array[i][j]
[ C_{ij} = {k=1}^n A{ik}B_{kj} ]
/* version 1: inner loop is k, middle is j */
for (int i = 0; i < N; ++i)
for (int j = 0; j < N; ++j)
for (int k = 0; k < N; ++k)
C[i * N + j] += A[i * N + k] * B[k * N + j];
[ C_{ij} = {k=1}^n A{ik}B_{kj} ]
/* version 1: inner loop is k, middle is j*/
for (int i = 0; i < N; ++i)
for (int j = 0; j < N; ++j)
for (int k = 0; k < N; ++k)
C[i*N+j] += A\[i * N + k] * B\[k * N + j];
/* version 2: outer loop is k, middle is i */
for (int k = 0; k < N; ++k)
for (int i = 0; i < N; ++i)
for (int j = 0; j < N; ++j)
C\[i*N+j] += A[i * N + k] * B\[k * N + j];
[ C_{ij} = {k=1}^n A{ik}B_{kj} ]
/* version 1: inner loop is k, middle is j*/
for (int i = 0; i < N; ++i)
for (int j = 0; j < N; ++j)
for (int k = 0; k < N; ++k)
C[i*N+j] += A\[i * N + k] * B\[k * N + j];
/* version 2: outer loop is k, middle is i */
for (int k = 0; k < N; ++k)
for (int i = 0; i < N; ++i)
for (int j = 0; j < N; ++j)
C\[i*N+j] += A[i * N + k] * B\[k * N + j];
[ C_{ij} = {k=1}^n A{ik}B_{kj} ]
/* version 1: inner loop is k, middle is j*/
for (int i = 0; i < N; ++i)
for (int j = 0; j < N; ++j)
for (int k = 0; k < N; ++k)
C[i*N+j] += A\[i * N + k] * B\[k * N + j];
/* version 2: outer loop is k, middle is i */
for (int k = 0; k < N; ++k)
for (int i = 0; i < N; ++i)
for (int j = 0; j < N; ++j)
C\[i*N+j] += A[i * N + k] * B\[k * N + j];
exercise: Which version has better spatial/temporal locality for…
… accesses to C? … accesses to A? … accesses to B?
for (int i = 0; i < N; ++i)
for (int j = 0; j < N; ++j)
for (int k = 0; k < N; ++k)
C[i * N + j] += A[i * N + k] * B[k * N + j];
if \(N\) really large
for A: about \(N\\div{} \\text{block size}\) misses per k-loop
for B: about \(N\) misses per k-loop
for C: about \(1\\div{}\\text{block size}\) miss per k-loop
for (int k = 0; k < N; ++k)
for (int i = 0; i < N; ++i)
for (int j = 0; j < N; ++j)
C[i * N + j] += A[i * N + k] * B[k * N + j];
for A: about \(1\) misses per j-loop
for B: about \(N\\div{}\\text{block size}\) miss per j-loop
for C: about \(N\\div{}\\text{block size}\) miss per j-loop
for (int k = 0; k < 1000; k += 1)
for (int i = 0; i < 1000; i += 1)
for (int j = 0; j < 1000; j += 1)
A[k*N+j] += B[i*N+j];
B[k*114+j]
|
is at |
10 0000 0000 0100
|
B[k*114+j+1]
|
is at |
10 0000 0000 1000
|
B[(k+1)*114+j]
|
is at |
10 0011 1001 0100
|
B[(k+2)*114+j]
|
is at |
10 0101 0101 1100
|
| … | ||
B[(k+9)*114+j]
|
is at |
11 0000 0000 1100
|
powers of two — lower order bits unchanged
B[k*93+j] and B[(k+11)*93+j]:
64 sets in L1 cache: usually maps to same set
B[k*93+(j+1)] will not be cached (next \(i\) loop)
even if in same block as B[k*93+j]
how to fix? improve spatial locality
/* version 2 */
for (int i = 0; i < N; ++i)
for (int j = 0; j < N; ++j)
A[i] += B[j] * C[i * N + j]
/* version 3 */
for (int ii = 0; ii < N; ii += 32)
for (int jj = 0; jj < N; jj += 32)
for (int i = ii; i < ii + 32; ++i)
for (int j = jj; j < jj + 32; ++j)
A[i] += B[j] * C[i * N + j];
exercise: which has better temporal locality in A? in B? in C?
how about spatial locality?
for (int k = 0; k < N; k += 1)
for (int i = 0; i < N; ++i)
for (int j = 0; j < N; ++j)
C\[i*N+j] += A\[i*N+k] * B\[k*N+j];
for (int kk = 0; kk < N; kk += 2)
for (int k = kk; k < kk + 2; ++k)
for (int i = 0; i < N; ++i)
for (int j = 0; j < N; ++j)
C\[i*N+j] += A\[i*N+k] * B\[k*N+j];
split the loop over \(k\) — should be exactly the same
for (int kk = 0; kk < N; kk += 2)
/* was here: for (int k = kk; k < kk + 2; ++k) */
for (int i = 0; i < N; ++i)
for (int j = 0; j < N; ++j)
/* load Aik, Aik+1 into cache and process: */
[[for (int k = kk; k < kk + 2; ++k)]{.fragment fragment-index=2 .custom .myem-only}]{.fragment fragment-index=3 .custom .myem-only}
C\[i*N+j] += A\[i*N+k] * B\[k*N+j];
for (int kk = 0; kk < N; kk += 2) {
for (int i = 0; i < N; ++i) {
for (int j = 0; j < N; ++j) {
/* process a "block" of 2 k values: */
C[i*N+j] += A[i*N+kk+0] * B[(kk+0)*N+j];
C[i*N+j] += A[i*N+kk+1] * B\[(kk+1)*N+j];
}
}
}
for (int kk = 0; kk < N; kk += 2)
for (int i = 0; i < N; i += 1)
for (int j = 0; j < N; ++j) {
C[i*N+j] += A[i*N+kk+0] * B[(kk+0)*N+j];
C[i*N+j] += A[i*N+kk+1] * B[(kk+1)*N+j];
}
access pattern for A:
A[0*N+0], A[0*N+1], A[0*N+0], A[0*N+1] … (repeats N times)
A[1*N+0], A[1*N+1], A[1*N+0], A[1*N+1] … (repeats N times)
…
…
A[0*N+0], A[0*N+1], A[0*N+0], A[0*N+1] … (repeats N times)
A[1*N+0], A[1*N+1], A[1*N+0], A[1*N+1] … (repeats N times)
…
…
likely cache misses: only first iterations of \(j\) loop
how many cache misses per iteration? usually one
about \(\\frac{N2}\\cdot{} N\) misses total
for (int kk = 0; kk < N; kk += 2)
for (int i = 0; i < N; i += 1)
for (int j = 0; j < N; ++j) {
C[i*N+j] += A[i*N+kk+0] * B[(kk+0)*N+j];
C[i*N+j] += A[i*N+kk+1] * B[(kk+1)*N+j];
}
access pattern for B:
B[0*N+0], B[1*N+0], … B[0*N+(N-1)], B[1*N+(N-1)]
B[2*N+0], B[3*N+0], … B[2*N+(N-1)], B[3*N+(N-1)]
B[4*N+0], B[5*N+0], … B[4*N+(N-1)], B[5*N+(N-1)]
…
B[0*N+0], B[1*N+0], … B[0*N+(N-1)], B[1*N+(N-1)]
…
access pattern for B:
B[0*N+0], B[1*N+0], … B[0*N+(N-1)], B[1*N+(N-1)]
B[2*N+0], B[3*N+0], … B[2*N+(N-1)], B[3*N+(N-1)]
B[4*N+0], B[5*N+0], … B[4*N+(N-1)], B[5*N+(N-1)]
…
B[0*N+0], B[1*N+0], … B[0*N+(N-1)], B[1*N+(N-1)]
…
for (int kk = 0; kk < N; kk += 2)
for (int i = 0; i < N; i += 1)
for (int j = 0; j < N; ++j) {
C[i*N+j] += A[i*N+kk+0] * B[(kk+0)*N+j];
C[i*N+j] += A[i*N+kk+1] * B[(kk+1)*N+j];
}
\(\\frac{N2}\\cdot{} N\) j-loop executions and (assuming \(N\) large):
about \(1\) misses from \(A\) per j-loop
about \(2N\\div{}\\text{block size}\) misses from \(B\) per j-loop
about \(N\\div{}\\text{block size}\) misses from \(C\) per j-loop
for (int kk = 0; kk < N; kk += 2) {
for (int ii = 0; ii < N; ii += 2) {
for (int j = 0; j < N; ++j) {
/* process a "block": */
for (int k = kk; k < kk + 2; ++k)
for (int i = 0; i < ii + 2; ++i)
C\[i*N+j] += A\[i*N+k] * B\[k*N+j];
}
}
}
for (int k = 0; k < N; k += 2) {
for (int i = 0; i < N; i += 2) {
/* load a block around Aik */
for (int j = 0; j < N; ++j) {
/* process a "block": */
@\normalsize$C_{i+0,j}$@ += @\normalsize$A_{i+0,k+0}$@ * @\normalsize\myemph{$B_{k+0,j}$}@
@\normalsize$C_{i+0,j}$@ += @\normalsize$A_{i+0,k+1}$@ * @\normalsize$B_{k+1,j}$@
@\normalsize$C_{i+1,j}$@ += @\normalsize$A_{i+1,k+0}$@ * @\normalsize\myemph{$B_{k+0,j}$}@
@\normalsize$C_{i+1,j}$@ += @\normalsize$A_{i+1,k+1}$@ * @\normalsize$B_{k+1,j}$@
}
}
}
now: more temporal locality in \(B\)
for (int k = 0; k < N; k += 2)
for (int i = 0; i < N; i += 2)
for (int j = 0; j < N; ++j) {
@\normalsize$C_{i+0,j}$@ += @\normalsize$A_{i+0,k+0}$@ * @\normalsize$B_{k+0,j}$@
@\normalsize$C_{i+0,j}$@ += @\normalsize$A_{i+0,k+1}$@ * @\normalsize$B_{k+1,j}$@
@\normalsize$C_{i+1,j}$@ += @\normalsize$A_{i+1,k+0}$@ * @\normalsize$B_{k+0,j}$@
@\normalsize$C_{i+1,j}$@ += @\normalsize$A_{i+1,k+1}$@ * @\normalsize$B_{k+1,j}$@
}
\(\\frac{N2}\\cdot{}\\frac{N2}\) iterations of \(j\) loop
likely 2 misses per loop with \(A\) (2 cache blocks)
for (int k = 0; k < N; k += 2)
for (int i = 0; i < N; i += 2)
for (int j = 0; j < N; ++j) {
@\normalsize$C_{i+0,j}$@ += @\normalsize$A_{i+0,k+0}$@ * @\normalsize\myemph{$B_{k+0,j}$}@
@\normalsize$C_{i+0,j}$@ += @\normalsize$A_{i+0,k+1}$@ * @\normalsize$B_{k+1,j}$@
@\normalsize$C_{i+1,j}$@ += @\normalsize$A_{i+1,k+0}$@ * @\normalsize\myemph{$B_{k+0,j}$}@
@\normalsize$C_{i+1,j}$@ += @\normalsize$A_{i+1,k+1}$@ * @\normalsize$B_{k+1,j}$@
}
\(\\frac{N2}\\cdot{}\\frac{N2}\) iterations of \(j\) loop
likely \(2\\div{}\\text{block size}\) misses per iteration with \(B\)
for (int k = 0; k < N; k += 2)
for (int i = 0; i < N; i += 2)
for (int j = 0; j < N; ++j) {
@\normalsize$C_{i+0,j}$@ += @\normalsize$A_{i+0,k+0}$@ * @\normalsize{$B_{k+0,j}$}@
@\normalsize$C_{i+0,j}$@ += @\normalsize$A_{i+0,k+1}$@ * @\normalsize$B_{k+1,j}$@
@\normalsize$C_{i+1,j}$@ += @\normalsize$A_{i+1,k+0}$@ * @\normalsize{$B_{k+0,j}$}@
@\normalsize$C_{i+1,j}$@ += @\normalsize$A_{i+1,k+1}$@ * @\normalsize$B_{k+1,j}$@
}
\(\\frac{N2}\\cdot{}\\frac{N2}\) iterations of \(j\) loop
likely \(\\frac{2\\text{block size}}\) misses per iteration with \(C\)
for (int k = 0; k < N; k += 2)
for (int i = 0; i < N; i += 2)
for (int j = 0; j < N; ++j) {
@\normalsize$C_{i+0,j}$@ += @\normalsize$A_{i+0,k+0}$@ * @\normalsize{$B_{k+0,j}$}@
@\normalsize$C_{i+0,j}$@ += @\normalsize$A_{i+0,k+1}$@ * @\normalsize$B_{k+1,j}$@
@\normalsize$C_{i+1,j}$@ += @\normalsize$A_{i+1,k+0}$@ * @\normalsize{$B_{k+0,j}$}@
@\normalsize$C_{i+1,j}$@ += @\normalsize$A_{i+1,k+1}$@ * @\normalsize$B_{k+1,j}$@
}
partial_matrixmultiply(float *A, float *B, float *C
int startI, int endI, ...) {
for (int i = startI; i < endI; ++i) {
for (int j = startJ; j < endJ; ++j) {
for (int k = startK; k < endK; ++k) {
...
}
matrix_multiply(float *A, float *B, float *C, int N) {
for (int ii = 0; ii < N; ii += BLOCK_I)
for (int jj = 0; jj < N; jj += BLOCK_J)
for (int kk = 0; kk < N; kk += BLOCK_K)
...
/* do everything for segment of A, B, C
that fits in cache! */
partial_matmul(A, B, C,
ii, ii + BLOCK_I, jj, jj + BLOCK_J,
kk, kk + BLOCK_K)
}
for each of \(N^3/IJK\) matrix blocks:
load \(I\\times{} K\) elements of \(A_ik\):
load \(K\\times{} J\) elements of \(B_kj\):
load \(I\\times{} J\) elements of \(C_ij\):
bigger blocks — more work per load!
catch: \(IK+KJ+IJ\) elements must fit in cache
for (int k = 0; k < N; ++k) {
for (int i = 0; i < N; ++i) {
@$A_{ik}$ loaded once in this loop:@
for (int j = 0; j < N; ++j)
@$C_{ij}$, $B_{kj}$ loaded each iteration (if $N$ big):@
B[i*N+j] += A[i*N+k] * A[k*N+j];
values from \(A_ik\) used \(N\) times per load
values from \(B_kj\) used \(1\) times per load
values from \(C_ij\) used \(1\) times per load
for (int kk = 0; kk < 1000; kk += 10)
for (int jj = 0; jj < 1000; jj += 10)
for (int i = 0; i < 1000; i += 1)
for (int j = jj; j < jj+10; j += 1)
for (int k = kk; k < kk + 10; k += 1)
A[k*N+j] += B[i*N+j];
i looploop ordering forces compromises:
for k: for i: for j: c[i,j] += a[i,k] * b[j,k]
perfect temporal locality in a[i,k]
bad temporal locality for c[i,j], b[j,k]
perfect spatial locality in c[i,j]
bad spatial locality in b[j,k], a[i,k]
cache blocking: work on blocks rather than rows/columns
no perfect loop order? work on rectangular matrix blocks
size amount used in inner loops based on cache size
in practice:
for (int kk = 0; kk < N; kk += 3)
for (int i = 0; i < N; i += 1)
for (int j = 0; j < N; ++j) {
C[i*N+j] += A[i*N+kk+0] * B[(kk+0)*N+j];
C[i*N+j] += A[i*N+kk+1] * B[(kk+1)*N+j];
C[i*N+j] += A[i*N+kk+2] * B[(kk+2)*N+j];
}
\(\\frac{N<em>3</em>}\\cdot{} N\) j-loop iterations, and (assuming \(N\) large):
about \(1\) misses from \(A\) per j-loop iteration
about \(3N\\div{}\\text{block size}\) misses from \(B\) per j-loop iteration
about \(3N\\div{}\\text{block size}\) misses from \(C\) per j-loop iteration
can we just keep doing this increase from 3 to some large \(X\)? …
assumption: \(X\) values from A would stay in cache
assumption: \(X\) blocks from B would help with spatial locality
can’t explicitly ensure values are kept in cache
… but reusing values effectively does this
cache optimization ideas: choose what’s in the cache