caching

approximate miss analysis

very tedious to precisely count cache misses
- even more tedious when we take advanced cache optimizations into account

instead, approximations:
good or bad temporal/spatial locality
- good temporal locality: value stays in cache
- good spatial locality: use all parts of cache block
with nested loops: what does inner loop use?
- intuition: values used in inner loop loaded into cache once
- (that is, once each time the inner loop is run)
- … if they can all fit in the cache

locality exercise (1)

/* version 1 */
for (int i = 0; i < N; ++i)
    for (int j = 0; j < N; ++j)
        A[i] += B[j] * C[i * N + j]

/* version 2 */
for (int j = 0; j < N; ++j)
    for (int i = 0; i < N; ++i)
        A[i] += B[j] * C[i * N + j];

exercise: which has better temporal locality in A? in B? in C?
how about spatial locality?

exercise: miss estimating (1)

for (int i = 0; i < N; ++i)
    for (int j = 0; j < N; ++j)
        A[i] += B[j] * C[i * N + j]

Assume: 4 array elements per block, N very large, nothing in cache at beginning.
Example: \(N/4\) estimated misses for A accesses:
- A[i] should always be hit on all but first iteration of inner-most loop.
- first iter: A[i] should be hit about 3/4s of the time (same block as A[i-1] that often)
Exericse: estimate # of misses for B, C

a note on matrix storage

\(A\) — \(N\\times{} N\) matrix
represent as array
makes dynamic sizes easier:

float A_2d_array[N][N];
float *A_flat = malloc(N * N);

A_flat[i * N + j] === A_2d_array[i][j]

convertion re: rows/columns

going to call the first index rows
\(A_i,j\) is A row i, column j
rows are stored together
this is an arbitrary choice

5x5 array and 4-element cache blocks

matrix multiply

[ C_{ij} = {k=1}^n A{ik}B_{kj} ]

/* version 1: inner loop is k, middle is j */
for (int i = 0; i < N; ++i)
  for (int j = 0; j < N; ++j)
    for (int k = 0; k < N; ++k)
      C[i * N + j] += A[i * N + k] * B[k * N + j];

matrix multiply

[ C_{ij} = {k=1}^n A{ik}B_{kj} ]

/* version 1: inner loop is k, middle is j*/
for (int i = 0; i < N; ++i)
  for (int j = 0; j < N; ++j)
    for (int k = 0; k < N; ++k)
      C[i*N+j] += A\[i * N + k] * B\[k * N + j];

/* version 2: outer loop is k, middle is i */
for (int k = 0; k < N; ++k)
  for (int i = 0; i < N; ++i)
    for (int j = 0; j < N; ++j)
      C\[i*N+j] += A[i * N + k] * B\[k * N + j];

loop orders and locality

loop body: \(C_ij += A_ikB_kj\)
\(ki[j]{.fragment fragment-index=2 .custom .myem-only}\) order: \(C_i[j]{.fragment fragment-index=2 .custom .myem-only}\), \(B_k[j]{.fragment fragment-index=2 .custom .myem-only}\) have spatial locality
\(kij\) order: \(A_ik\) has temporal locality
… better than …
\(ij[k]{.fragment fragment-index=2 .custom .myem-only}\) order: \(A_i[k]{.fragment fragment-index=2 .custom .myem-only}\) has spatial locality
\(ijk\) order: \(C_ij\) has temporal locality

matrix multiply

[ C_{ij} = {k=1}^n A{ik}B_{kj} ]

/* version 1: inner loop is k, middle is j*/
for (int i = 0; i < N; ++i)
  for (int j = 0; j < N; ++j)
    for (int k = 0; k < N; ++k)
      C[i*N+j] += A\[i * N + k] * B\[k * N + j];

/* version 2: outer loop is k, middle is i */
for (int k = 0; k < N; ++k)
  for (int i = 0; i < N; ++i)
    for (int j = 0; j < N; ++j)
      C\[i*N+j] += A[i * N + k] * B\[k * N + j];

which is better?

[ C_{ij} = {k=1}^n A{ik}B_{kj} ]

/* version 1: inner loop is k, middle is j*/
for (int i = 0; i < N; ++i)
  for (int j = 0; j < N; ++j)
    for (int k = 0; k < N; ++k)
      C[i*N+j] += A\[i * N + k] * B\[k * N + j];

/* version 2: outer loop is k, middle is i */
for (int k = 0; k < N; ++k)
  for (int i = 0; i < N; ++i)
    for (int j = 0; j < N; ++j)
      C\[i*N+j] += A[i * N + k] * B\[k * N + j];

exercise: Which version has better spatial/temporal locality for…
… accesses to C? … accesses to A? … accesses to B?

array usage: \(ijk\) order

array usage: \(kij\) order

againframe(matrixMultTwoVersions)

performance (with A=B)

alternate view 1: cycles/instruction

alternate view 2: cycles/operation

counting misses: version 1

for (int i = 0; i < N; ++i)
  for (int j = 0; j < N; ++j)
    for (int k = 0; k < N; ++k)
      C[i * N + j] += A[i * N + k] * B[k * N + j];

if \(N\) really large
- assumption: can’t get close to storing \(N\) values in cache at once
for A: about \(N\\div{} \\text{block size}\) misses per k-loop
- total misses: \(N^3\\div{}\\text{block size}\)
for B: about \(N\) misses per k-loop
- total misses: \(N^3\)
for C: about \(1\\div{}\\text{block size}\) miss per k-loop
- total misses: \(N^2\\div{}\\text{block size}\)

counting misses: version 2

for (int k = 0; k < N; ++k)
  for (int i = 0; i < N; ++i)
    for (int j = 0; j < N; ++j)
      C[i * N + j] += A[i * N + k] * B[k * N + j];

for A: about \(1\) misses per j-loop
- total misses: \(N^2\)
for B: about \(N\\div{}\\text{block size}\) miss per j-loop
- total misses: \(N^3\\div{}\\text{block size}\)
for C: about \(N\\div{}\\text{block size}\) miss per j-loop
- total misses: \(N^3\\div{}\\text{block size}\)

exercise: miss estimating (2)

for (int k = 0; k < 1000; k += 1)
    for (int i = 0; i < 1000; i += 1)
        for (int j = 0; j < 1000; j += 1)
            A[k*N+j] += B[i*N+j];

assuming: 4 elements per block
assuming: cache not close to big enough to hold 1K elements
estimate: approximately how many misses for A, B?

L1 misses (with A=B)

L1 miss detail (1)

L1 miss detail (2)

addresses

`B[k*114+j]`	is at	`10 0000 0000 0100`
`B[k*114+j+1]`	is at	`10 0000 0000 1000`
`B[(k+1)*114+j]`	is at	`10 0011 1001 0100`
`B[(k+2)*114+j]`	is at	`10 0101 0101 1100`
…
`B[(k+9)*114+j]`	is at	`11 0000 0000 1100`

test system L1 cache: 6 block offset bits

conflict misses

powers of two — lower order bits unchanged
B[k*93+j] and B[(k+11)*93+j]:
- 1023 elements apart (4092 bytes; 63.9 cache blocks)
64 sets in L1 cache: usually maps to same set
B[k*93+(j+1)] will not be cached (next \(i\) loop)
even if in same block as B[k*93+j]
how to fix? improve spatial locality
- (maybe even if it requires copying)

locality exercise (2)

/* version 2 */
for (int i = 0; i < N; ++i)
    for (int j = 0; j < N; ++j)
        A[i] += B[j] * C[i * N + j]

/* version 3 */
for (int ii = 0; ii < N; ii += 32)
    for (int jj = 0; jj < N; jj += 32)
        for (int i = ii; i < ii + 32; ++i)
            for (int j = jj; j < jj + 32; ++j)
                A[i] += B[j] * C[i * N + j];

exercise: which has better temporal locality in A? in B? in C?
how about spatial locality?

a transformation

for (int k = 0; k < N; k += 1)
      for (int i = 0; i < N; ++i)
        for (int j = 0; j < N; ++j)
          C\[i*N+j] += A\[i*N+k] * B\[k*N+j];

for (int kk = 0; kk < N; kk += 2)
  for (int k = kk; k < kk + 2; ++k)
      for (int i = 0; i < N; ++i)
        for (int j = 0; j < N; ++j)
          C\[i*N+j] += A\[i*N+k] * B\[k*N+j];

split the loop over \(k\) — should be exactly the same
- (assuming even \(N\))

simple blocking

for (int kk = 0; kk < N; kk += 2)
  /* was here: for (int k = kk; k < kk + 2; ++k) */
    for (int i = 0; i < N; ++i)
      for (int j = 0; j < N; ++j)
        /* load Aik, Aik+1 into cache and process: */
        [[for (int k = kk; k < kk + 2; ++k)]{.fragment fragment-index=2 .custom .myem-only}]{.fragment fragment-index=3 .custom .myem-only}
            C\[i*N+j] += A\[i*N+k] * B\[k*N+j];

now reorder split loop — same calculations
now handle \(B_ij\) for \(k+1\) right after \(B_ij\) for \(k\)
(previously: \(B_i,j+1\) for \(k\) right after \(B_ij\) for \(k\))

simple blocking – expanded

for (int kk = 0; kk < N; kk += 2) {
  for (int i = 0; i < N; ++i) {
    for (int j = 0; j < N; ++j) {
      /* process a "block" of 2 k values: */
      C[i*N+j] += A[i*N+kk+0] * B[(kk+0)*N+j];
      C[i*N+j] += A[i*N+kk+1] * B\[(kk+1)*N+j];
    }
  }
}

counting misses for A (1)

for (int kk = 0; kk < N; kk += 2)
  for (int i = 0; i < N; i += 1)
    for (int j = 0; j < N; ++j) {
      C[i*N+j] += A[i*N+kk+0] * B[(kk+0)*N+j];
      C[i*N+j] += A[i*N+kk+1] * B[(kk+1)*N+j];
    }

access pattern for A:
A[0*N+0], A[0*N+1], A[0*N+0], A[0*N+1] … (repeats N times)
A[1*N+0], A[1*N+1], A[1*N+0], A[1*N+1] … (repeats N times)
…

…

counting misses for A (2)

A[0*N+0], A[0*N+1], A[0*N+0], A[0*N+1] … (repeats N times)
A[1*N+0], A[1*N+1], A[1*N+0], A[1*N+1] … (repeats N times)
…

…

likely cache misses: only first iterations of \(j\) loop
how many cache misses per iteration? usually one
- A[0*N+0] and A[0*N+1] usually in same cache block
about \(\\frac{N2}\\cdot{} N\) misses total

counting misses for B (1)

for (int kk = 0; kk < N; kk += 2)
  for (int i = 0; i < N; i += 1)
    for (int j = 0; j < N; ++j) {
      C[i*N+j] += A[i*N+kk+0] * B[(kk+0)*N+j];
      C[i*N+j] += A[i*N+kk+1] * B[(kk+1)*N+j];
    }

access pattern for B:
B[0*N+0], B[1*N+0], … B[0*N+(N-1)], B[1*N+(N-1)]
B[2*N+0], B[3*N+0], … B[2*N+(N-1)], B[3*N+(N-1)]
B[4*N+0], B[5*N+0], … B[4*N+(N-1)], B[5*N+(N-1)]
…
B[0*N+0], B[1*N+0], … B[0*N+(N-1)], B[1*N+(N-1)]
…

counting misses for B (2)

likely cache misses: any access, each time
how many cache misses per iteration? equal to # cache blocks in 2 rows
about \(\\frac{N2}\\cdot{} N\\cdot{}\\frac{2N\\text{block size}}=N^3\\div{}\\text{block size}\) misses

simple blocking – counting misses

for (int kk = 0; kk < N; kk += 2)
  for (int i = 0; i < N; i += 1)
    for (int j = 0; j < N; ++j) {
      C[i*N+j] += A[i*N+kk+0] * B[(kk+0)*N+j];
      C[i*N+j] += A[i*N+kk+1] * B[(kk+1)*N+j];
    }

\(\\frac{N2}\\cdot{} N\) j-loop executions and (assuming \(N\) large):
about \(1\) misses from \(A\) per j-loop
- \(N^2/2\) total misses (before blocking: \(N^2\))
about \(2N\\div{}\\text{block size}\) misses from \(B\) per j-loop
- \(N^3\\div{}\\text{block size}\) total misses (same as before blocking)
about \(N\\div{}\\text{block size}\) misses from \(C\) per j-loop
- \(N^3\\div{}(2\\cdot{}\\text{block size})\) total misses (before: \(N^3\\div{}\\text{block size}\))

improvement in read misses

simple blocking (2)

same thing for \(i\) in addition to \(k\)?

for (int kk = 0; kk < N; kk += 2) {
  for (int ii = 0; ii < N; ii += 2) {
    for (int j = 0; j < N; ++j) {
      /* process a "block": */
      for (int k = kk; k < kk + 2; ++k)
        for (int i = 0; i < ii + 2; ++i)
            C\[i*N+j] += A\[i*N+k] * B\[k*N+j];
    }
  }
}

simple blocking — locality

for (int k = 0; k < N; k += 2) {
  for (int i = 0; i < N; i += 2) {
    /* load a block around Aik */
    for (int j = 0; j < N; ++j) {
      /* process a "block": */
      @\normalsize$C_{i+0,j}$@ += @\normalsize$A_{i+0,k+0}$@ * @\normalsize\myemph{$B_{k+0,j}$}@
      @\normalsize$C_{i+0,j}$@ += @\normalsize$A_{i+0,k+1}$@ * @\normalsize$B_{k+1,j}$@
      @\normalsize$C_{i+1,j}$@ += @\normalsize$A_{i+1,k+0}$@ * @\normalsize\myemph{$B_{k+0,j}$}@
      @\normalsize$C_{i+1,j}$@ += @\normalsize$A_{i+1,k+1}$@ * @\normalsize$B_{k+1,j}$@
    }
  }
}

now: more temporal locality in \(B\)
- previously: access \(B_kj\), then don’t use it again for a long time

simple blocking — counting misses for A

for (int k = 0; k < N; k += 2)
  for (int i = 0; i < N; i += 2)
    for (int j = 0; j < N; ++j) {
      @\normalsize$C_{i+0,j}$@ += @\normalsize$A_{i+0,k+0}$@ * @\normalsize$B_{k+0,j}$@
      @\normalsize$C_{i+0,j}$@ += @\normalsize$A_{i+0,k+1}$@ * @\normalsize$B_{k+1,j}$@
      @\normalsize$C_{i+1,j}$@ += @\normalsize$A_{i+1,k+0}$@ * @\normalsize$B_{k+0,j}$@
      @\normalsize$C_{i+1,j}$@ += @\normalsize$A_{i+1,k+1}$@ * @\normalsize$B_{k+1,j}$@
    }

\(\\frac{N2}\\cdot{}\\frac{N2}\) iterations of \(j\) loop
likely 2 misses per loop with \(A\) (2 cache blocks)
- total misses: \(\\frac{N^22}\) (same as only blocking in K)

simple blocking — counting misses for B

for (int k = 0; k < N; k += 2)
  for (int i = 0; i < N; i += 2)
    for (int j = 0; j < N; ++j) {
      @\normalsize$C_{i+0,j}$@ += @\normalsize$A_{i+0,k+0}$@ * @\normalsize\myemph{$B_{k+0,j}$}@
      @\normalsize$C_{i+0,j}$@ += @\normalsize$A_{i+0,k+1}$@ * @\normalsize$B_{k+1,j}$@
      @\normalsize$C_{i+1,j}$@ += @\normalsize$A_{i+1,k+0}$@ * @\normalsize\myemph{$B_{k+0,j}$}@
      @\normalsize$C_{i+1,j}$@ += @\normalsize$A_{i+1,k+1}$@ * @\normalsize$B_{k+1,j}$@
    }

\(\\frac{N2}\\cdot{}\\frac{N2}\) iterations of \(j\) loop
likely \(2\\div{}\\text{block size}\) misses per iteration with \(B\)
- total misses: \(\\frac{N^32\\cdot{}\\text{block size}}\) (before: \(\\frac{N^3\\text{block size}}\))

simple blocking — counting misses for C

for (int k = 0; k < N; k += 2)
  for (int i = 0; i < N; i += 2)
    for (int j = 0; j < N; ++j) {
      @\normalsize$C_{i+0,j}$@ += @\normalsize$A_{i+0,k+0}$@ * @\normalsize{$B_{k+0,j}$}@
      @\normalsize$C_{i+0,j}$@ += @\normalsize$A_{i+0,k+1}$@ * @\normalsize$B_{k+1,j}$@
      @\normalsize$C_{i+1,j}$@ += @\normalsize$A_{i+1,k+0}$@ * @\normalsize{$B_{k+0,j}$}@
      @\normalsize$C_{i+1,j}$@ += @\normalsize$A_{i+1,k+1}$@ * @\normalsize$B_{k+1,j}$@
    }

\(\\frac{N2}\\cdot{}\\frac{N2}\) iterations of \(j\) loop
likely \(\\frac{2\\text{block size}}\) misses per iteration with \(C\)
- total misses: \(\\frac{N^32\\cdot{}\\text{block size}}\) (same as blocking only in K)

simple blocking — counting misses (total)

for (int k = 0; k < N; k += 2)
  for (int i = 0; i < N; i += 2)
    for (int j = 0; j < N; ++j) {
      @\normalsize$C_{i+0,j}$@ += @\normalsize$A_{i+0,k+0}$@ * @\normalsize{$B_{k+0,j}$}@
      @\normalsize$C_{i+0,j}$@ += @\normalsize$A_{i+0,k+1}$@ * @\normalsize$B_{k+1,j}$@
      @\normalsize$C_{i+1,j}$@ += @\normalsize$A_{i+1,k+0}$@ * @\normalsize{$B_{k+0,j}$}@
      @\normalsize$C_{i+1,j}$@ += @\normalsize$A_{i+1,k+1}$@ * @\normalsize$B_{k+1,j}$@
    }

before:
A: \(\\frac{N^22}\); B: \(\\frac{N^31\\cdot{}\\text{block size}}\); C \(\\frac{N^31\\cdot{}\\text{block size}}\)
after:
A: \(\\frac{N^22}\); B: \(\\frac{N^32\\cdot{}\\text{block size}}\); C \(\\frac{N^32\\cdot{}\\text{block size}}\)

generalizing: divide and conquer

partial_matrixmultiply(float *A, float *B, float *C
               int startI, int endI, ...) {
  for (int i = startI; i < endI; ++i) {
    for (int j = startJ; j < endJ; ++j) {
      for (int k = startK; k < endK; ++k) {
        ...
}
matrix_multiply(float *A, float *B, float *C, int N) {
  for (int ii = 0; ii < N; ii += BLOCK_I)
    for (int jj = 0; jj < N; jj += BLOCK_J)
      for (int kk = 0; kk < N; kk += BLOCK_K)
         ...
         /* do everything for segment of A, B, C
            that fits in cache! */
         partial_matmul(A, B, C,
               ii, ii + BLOCK_I, jj, jj + BLOCK_J,
               kk, kk + BLOCK_K)
}

array usage: matrix block

cache blocking efficiency

for each of \(N^3/IJK\) matrix blocks:
load \(I\\times{} K\) elements of \(A_ik\):
- \(\\approx{} IK\\div{} \\text{block size}\) misses per matrix block
- \(\\approx{} N^3/(J\\cdot{}\\text{blocksize})\) misses total
load \(K\\times{} J\) elements of \(B_kj\):
- \(\\approx{} N^3/(I\\cdot{}\\text{blocksize})\) misses total
load \(I\\times{} J\) elements of \(C_ij\):
- \(\\approx{} N^3/(K\\cdot{}\\text{blocksize})\) misses total
bigger blocks — more work per load!
catch: \(IK+KJ+IJ\) elements must fit in cache
- otherwise estimates above don’t work

cache blocking rule of thumb

fill the most of the cache with useful data
and do as much work as possible from that
example: my desktop 32KB L1 cache
\(I=J=K=48\) uses \(48^2\\times{} 3\) elements, or \(27\)KB.
assumption: conflict misses aren’t important

systematic approach

for (int k = 0; k < N; ++k) {
  for (int i = 0; i < N; ++i) {
    @$A_{ik}$ loaded once in this loop:@
    for (int j = 0; j < N; ++j)
      @$C_{ij}$, $B_{kj}$ loaded each iteration (if $N$ big):@
      B[i*N+j] += A[i*N+k] * A[k*N+j];

values from \(A_ik\) used \(N\) times per load
values from \(B_kj\) used \(1\) times per load
- but good spatial locality, so cache block of \(B_kj\) together
values from \(C_ij\) used \(1\) times per load
- but good spatial locality, so cache block of \(C_ij\) together

exercise: miss estimating (3)

for (int kk = 0; kk < 1000; kk += 10)
    for (int jj = 0; jj < 1000; jj += 10)
        for (int i = 0; i < 1000; i += 1)
            for (int j = jj; j < jj+10; j += 1)
                for (int k = kk; k < kk + 10; k += 1)
                    A[k*N+j] += B[i*N+j];

assuming: 4 elements per block
assuming: cache not close to big enough to hold 1K elements, but big enough to hold 500 or so
estimate: approximately how many misses for A, B?
hint 1: part of A, B loaded in two inner-most loops only needs to be loaded once
hint 2: part of A can be reused between iterations of i loop

loop ordering compromises

loop ordering forces compromises:
for k: for i: for j: c[i,j] += a[i,k] * b[j,k]
perfect temporal locality in a[i,k]
bad temporal locality for c[i,j], b[j,k]
perfect spatial locality in c[i,j]
bad spatial locality in b[j,k], a[i,k]
cache blocking: work on blocks rather than rows/columns
- have some temporal, spatial locality in everything

cache blocking pattern

no perfect loop order? work on rectangular matrix blocks
size amount used in inner loops based on cache size
in practice:
- test performance to determine ‘size’ of blocks

Backup slides

simple blocking – with 3?

for (int kk = 0; kk < N; kk += 3)
  for (int i = 0; i < N; i += 1)
    for (int j = 0; j < N; ++j) {
      C[i*N+j] += A[i*N+kk+0] * B[(kk+0)*N+j];
      C[i*N+j] += A[i*N+kk+1] * B[(kk+1)*N+j];
      C[i*N+j] += A[i*N+kk+2] * B[(kk+2)*N+j];
    }

\(\\frac{N<em>3</em>}\\cdot{} N\) j-loop iterations, and (assuming \(N\) large):
about \(1\) misses from \(A\) per j-loop iteration
- \(N^2/<em>3</em>\) total misses (before blocking: \(N^2\))
about \(3N\\div{}\\text{block size}\) misses from \(B\) per j-loop iteration
- \(N^3\\div{}\\text{block size}\) total misses (same as before)
about \(3N\\div{}\\text{block size}\) misses from \(C\) per j-loop iteration
- \(N^3\\div{}\\text{block size}\) total misses (same as before)

more than 3?

can we just keep doing this increase from 3 to some large \(X\)? …
assumption: \(X\) values from A would stay in cache
- \(X\) too large — cache not big enough
assumption: \(X\) blocks from B would help with spatial locality
- \(X\) too large — evicted from cache before next iteration

array usage (2 \(k\) at a time)

keeping values in cache

can’t explicitly ensure values are kept in cache
… but reusing values effectively does this
- cache will try to keep recently used values

cache optimization ideas: choose what’s in the cache
- for thinking about it: load values explicitly
- for implementing it: access only values we want loaded