caching

approximate miss analysis

  • very tedious to precisely count cache misses

    • even more tedious when we take advanced cache optimizations into account

  • instead, approximations:

  • good or bad temporal/spatial locality

    • good temporal locality: value stays in cache
    • good spatial locality: use all parts of cache block
  • with nested loops: what does inner loop use?

    • intuition: values used in inner loop loaded into cache once
    • (that is, once each time the inner loop is run)
    • … if they can all fit in the cache

locality exercise (1)

/* version 1 */
for (int i = 0; i < N; ++i)
    for (int j = 0; j < N; ++j)
        A[i] += B[j] * C[i * N + j]

/* version 2 */
for (int j = 0; j < N; ++j)
    for (int i = 0; i < N; ++i)
        A[i] += B[j] * C[i * N + j];

exercise: which has better temporal locality in A? in B? in C?
how about spatial locality?

exercise: miss estimating (1)

for (int i = 0; i < N; ++i)
    for (int j = 0; j < N; ++j)
        A[i] += B[j] * C[i * N + j]
  • Assume: 4 array elements per block, N very large, nothing in cache at beginning.

  • Example: \(N/4\) estimated misses for A accesses:

    • A[i] should always be hit on all but first iteration of inner-most loop.
    • first iter: A[i] should be hit about 3/4s of the time (same block as A[i-1] that often)
  • Exericse: estimate # of misses for B, C

a note on matrix storage

  • \(A\)\(N\\times{} N\) matrix
  • represent as array
  • makes dynamic sizes easier:
float A_2d_array[N][N];
float *A_flat = malloc(N * N);

A_flat[i * N + j] === A_2d_array[i][j]

convertion re: rows/columns

  • going to call the first index rows
  • \(A_i,j\) is A row i, column j
  • rows are stored together
  • this is an arbitrary choice

5x5 array and 4-element cache blocks

matrix multiply

[ C_{ij} = {k=1}^n A{ik}B_{kj} ]

/* version 1: inner loop is k, middle is j */
for (int i = 0; i < N; ++i)
  for (int j = 0; j < N; ++j)
    for (int k = 0; k < N; ++k)
      C[i * N + j] += A[i * N + k] * B[k * N + j];

matrix multiply

[ C_{ij} = {k=1}^n A{ik}B_{kj} ]

/* version 1: inner loop is k, middle is j*/
for (int i = 0; i < N; ++i)
  for (int j = 0; j < N; ++j)
    for (int k = 0; k < N; ++k)
      C[i*N+j] += A\[i * N + k] * B\[k * N + j];

/* version 2: outer loop is k, middle is i */
for (int k = 0; k < N; ++k)
  for (int i = 0; i < N; ++i)
    for (int j = 0; j < N; ++j)
      C\[i*N+j] += A[i * N + k] * B\[k * N + j];

loop orders and locality

  • loop body: \(C_ij += A_ikB_kj\)
  • \(ki[j]{.fragment fragment-index=2 .custom .myem-only}\) order: \(C_i[j]{.fragment fragment-index=2 .custom .myem-only}\), \(B_k[j]{.fragment fragment-index=2 .custom .myem-only}\) have spatial locality
  • \(kij\) order: \(A_ik\) has temporal locality
  • … better than …
  • \(ij[k]{.fragment fragment-index=2 .custom .myem-only}\) order: \(A_i[k]{.fragment fragment-index=2 .custom .myem-only}\) has spatial locality
  • \(ijk\) order: \(C_ij\) has temporal locality

matrix multiply

[ C_{ij} = {k=1}^n A{ik}B_{kj} ]

/* version 1: inner loop is k, middle is j*/
for (int i = 0; i < N; ++i)
  for (int j = 0; j < N; ++j)
    for (int k = 0; k < N; ++k)
      C[i*N+j] += A\[i * N + k] * B\[k * N + j];

/* version 2: outer loop is k, middle is i */
for (int k = 0; k < N; ++k)
  for (int i = 0; i < N; ++i)
    for (int j = 0; j < N; ++j)
      C\[i*N+j] += A[i * N + k] * B\[k * N + j];

which is better?

[ C_{ij} = {k=1}^n A{ik}B_{kj} ]

/* version 1: inner loop is k, middle is j*/
for (int i = 0; i < N; ++i)
  for (int j = 0; j < N; ++j)
    for (int k = 0; k < N; ++k)
      C[i*N+j] += A\[i * N + k] * B\[k * N + j];

/* version 2: outer loop is k, middle is i */
for (int k = 0; k < N; ++k)
  for (int i = 0; i < N; ++i)
    for (int j = 0; j < N; ++j)
      C\[i*N+j] += A[i * N + k] * B\[k * N + j];

exercise: Which version has better spatial/temporal locality for…
… accesses to C? … accesses to A? … accesses to B?

array usage: \(ijk\) order

array usage: \(kij\) order

againframe(matrixMultTwoVersions)

performance (with A=B)


alternate view 1: cycles/instruction

alternate view 2: cycles/operation

counting misses: version 1

for (int i = 0; i < N; ++i)
  for (int j = 0; j < N; ++j)
    for (int k = 0; k < N; ++k)
      C[i * N + j] += A[i * N + k] * B[k * N + j];
  • if \(N\) really large

    • assumption: can’t get close to storing \(N\) values in cache at once
  • for A: about \(N\\div{} \\text{block size}\) misses per k-loop

    • total misses: \(N^3\\div{}\\text{block size}\)
  • for B: about \(N\) misses per k-loop

    • total misses: \(N^3\)
  • for C: about \(1\\div{}\\text{block size}\) miss per k-loop

    • total misses: \(N^2\\div{}\\text{block size}\)

counting misses: version 2

for (int k = 0; k < N; ++k)
  for (int i = 0; i < N; ++i)
    for (int j = 0; j < N; ++j)
      C[i * N + j] += A[i * N + k] * B[k * N + j];
  • for A: about \(1\) misses per j-loop

    • total misses: \(N^2\)
  • for B: about \(N\\div{}\\text{block size}\) miss per j-loop

    • total misses: \(N^3\\div{}\\text{block size}\)
  • for C: about \(N\\div{}\\text{block size}\) miss per j-loop

    • total misses: \(N^3\\div{}\\text{block size}\)

exercise: miss estimating (2)

for (int k = 0; k < 1000; k += 1)
    for (int i = 0; i < 1000; i += 1)
        for (int j = 0; j < 1000; j += 1)
            A[k*N+j] += B[i*N+j];
  • assuming: 4 elements per block
  • assuming: cache not close to big enough to hold 1K elements
  • estimate: approximately how many misses for A, B?

L1 misses (with A=B)

L1 miss detail (1)

L1 miss detail (2)

addresses

B[k*114+j] is at 10 0000 0000 0100
B[k*114+j+1] is at 10 0000 0000 1000
B[(k+1)*114+j] is at 10 0011 1001 0100
B[(k+2)*114+j] is at 10 0101 0101 1100
B[(k+9)*114+j] is at 11 0000 0000 1100

  • test system L1 cache: 6 block offset bits

conflict misses

  • powers of two — lower order bits unchanged

  • B[k*93+j] and B[(k+11)*93+j]:

    • 1023 elements apart (4092 bytes; 63.9 cache blocks)
  • 64 sets in L1 cache: usually maps to same set

  • B[k*93+(j+1)] will not be cached (next \(i\) loop)

  • even if in same block as B[k*93+j]


  • how to fix? improve spatial locality

    • (maybe even if it requires copying)

locality exercise (2)

/* version 2 */
for (int i = 0; i < N; ++i)
    for (int j = 0; j < N; ++j)
        A[i] += B[j] * C[i * N + j]

/* version 3 */
for (int ii = 0; ii < N; ii += 32)
    for (int jj = 0; jj < N; jj += 32)
        for (int i = ii; i < ii + 32; ++i)
            for (int j = jj; j < jj + 32; ++j)
                A[i] += B[j] * C[i * N + j];

exercise: which has better temporal locality in A? in B? in C?
how about spatial locality?

a transformation

for (int k = 0; k < N; k += 1)
      for (int i = 0; i < N; ++i)
        for (int j = 0; j < N; ++j)
          C\[i*N+j] += A\[i*N+k] * B\[k*N+j];

for (int kk = 0; kk < N; kk += 2)
  for (int k = kk; k < kk + 2; ++k)
      for (int i = 0; i < N; ++i)
        for (int j = 0; j < N; ++j)
          C\[i*N+j] += A\[i*N+k] * B\[k*N+j];
  • split the loop over \(k\) — should be exactly the same

    • (assuming even \(N\))

simple blocking

for (int kk = 0; kk < N; kk += 2)
  /* was here: for (int k = kk; k < kk + 2; ++k) */
    for (int i = 0; i < N; ++i)
      for (int j = 0; j < N; ++j)
        /* load Aik, Aik+1 into cache and process: */
        [[for (int k = kk; k < kk + 2; ++k)]{.fragment fragment-index=2 .custom .myem-only}]{.fragment fragment-index=3 .custom .myem-only}
            C\[i*N+j] += A\[i*N+k] * B\[k*N+j];
  • now reorder split loop — same calculations
  • now handle \(B_ij\) for \(k+1\) right after \(B_ij\) for \(k\)
  • (previously: \(B_i,j+1\) for \(k\) right after \(B_ij\) for \(k\))

simple blocking – expanded

for (int kk = 0; kk < N; kk += 2) {
  for (int i = 0; i < N; ++i) {
    for (int j = 0; j < N; ++j) {
      /* process a "block" of 2 k values: */
      C[i*N+j] += A[i*N+kk+0] * B[(kk+0)*N+j];
      C[i*N+j] += A[i*N+kk+1] * B\[(kk+1)*N+j];
    }
  }
}

counting misses for A (1)

for (int kk = 0; kk < N; kk += 2)
  for (int i = 0; i < N; i += 1)
    for (int j = 0; j < N; ++j) {
      C[i*N+j] += A[i*N+kk+0] * B[(kk+0)*N+j];
      C[i*N+j] += A[i*N+kk+1] * B[(kk+1)*N+j];
    }

access pattern for A:
A[0*N+0], A[0*N+1], A[0*N+0], A[0*N+1] … (repeats N times)
A[1*N+0], A[1*N+1], A[1*N+0], A[1*N+1] … (repeats N times)

 
 

counting misses for A (2)

A[0*N+0], A[0*N+1], A[0*N+0], A[0*N+1] … (repeats N times)
A[1*N+0], A[1*N+1], A[1*N+0], A[1*N+1] … (repeats N times)

 
 

  • likely cache misses: only first iterations of \(j\) loop

  • how many cache misses per iteration? usually one

    • A[0*N+0] and A[0*N+1] usually in same cache block
  • about \(\\frac{N2}\\cdot{} N\) misses total

counting misses for B (1)

for (int kk = 0; kk < N; kk += 2)
  for (int i = 0; i < N; i += 1)
    for (int j = 0; j < N; ++j) {
      C[i*N+j] += A[i*N+kk+0] * B[(kk+0)*N+j];
      C[i*N+j] += A[i*N+kk+1] * B[(kk+1)*N+j];
    }

access pattern for B:
B[0*N+0], B[1*N+0], … B[0*N+(N-1)], B[1*N+(N-1)]
B[2*N+0], B[3*N+0], … B[2*N+(N-1)], B[3*N+(N-1)]
B[4*N+0], B[5*N+0], … B[4*N+(N-1)], B[5*N+(N-1)]

B[0*N+0], B[1*N+0], … B[0*N+(N-1)], B[1*N+(N-1)]

counting misses for B (2)

access pattern for B:
B[0*N+0], B[1*N+0], … B[0*N+(N-1)], B[1*N+(N-1)]
B[2*N+0], B[3*N+0], … B[2*N+(N-1)], B[3*N+(N-1)]
B[4*N+0], B[5*N+0], … B[4*N+(N-1)], B[5*N+(N-1)]

B[0*N+0], B[1*N+0], … B[0*N+(N-1)], B[1*N+(N-1)]

  • likely cache misses: any access, each time
  • how many cache misses per iteration? equal to # cache blocks in 2 rows
  • about \(\\frac{N2}\\cdot{} N\\cdot{}\\frac{2N\\text{block size}}=N^3\\div{}\\text{block size}\) misses

simple blocking – counting misses

for (int kk = 0; kk < N; kk += 2)
  for (int i = 0; i < N; i += 1)
    for (int j = 0; j < N; ++j) {
      C[i*N+j] += A[i*N+kk+0] * B[(kk+0)*N+j];
      C[i*N+j] += A[i*N+kk+1] * B[(kk+1)*N+j];
    }
  • \(\\frac{N2}\\cdot{} N\) j-loop executions and (assuming \(N\) large):

  • about \(1\) misses from \(A\) per j-loop

    • \(N^2/2\) total misses (before blocking: \(N^2\))
  • about \(2N\\div{}\\text{block size}\) misses from \(B\) per j-loop

    • \(N^3\\div{}\\text{block size}\) total misses (same as before blocking)
  • about \(N\\div{}\\text{block size}\) misses from \(C\) per j-loop

    • \(N^3\\div{}(2\\cdot{}\\text{block size})\) total misses (before: \(N^3\\div{}\\text{block size}\))

improvement in read misses

simple blocking (2)

  • same thing for \(i\) in addition to \(k\)?
for (int kk = 0; kk < N; kk += 2) {
  for (int ii = 0; ii < N; ii += 2) {
    for (int j = 0; j < N; ++j) {
      /* process a "block": */
      for (int k = kk; k < kk + 2; ++k)
        for (int i = 0; i < ii + 2; ++i)
            C\[i*N+j] += A\[i*N+k] * B\[k*N+j];
    }
  }
}

simple blocking — locality

for (int k = 0; k < N; k += 2) {
  for (int i = 0; i < N; i += 2) {
    /* load a block around Aik */
    for (int j = 0; j < N; ++j) {
      /* process a "block": */
      @\normalsize$C_{i+0,j}$@ += @\normalsize$A_{i+0,k+0}$@ * @\normalsize\myemph{$B_{k+0,j}$}@
      @\normalsize$C_{i+0,j}$@ += @\normalsize$A_{i+0,k+1}$@ * @\normalsize$B_{k+1,j}$@
      @\normalsize$C_{i+1,j}$@ += @\normalsize$A_{i+1,k+0}$@ * @\normalsize\myemph{$B_{k+0,j}$}@
      @\normalsize$C_{i+1,j}$@ += @\normalsize$A_{i+1,k+1}$@ * @\normalsize$B_{k+1,j}$@
    }
  }
}
  • now: more temporal locality in \(B\)

    • previously: access \(B_kj\), then don’t use it again for a long time

simple blocking — counting misses for A

for (int k = 0; k < N; k += 2)
  for (int i = 0; i < N; i += 2)
    for (int j = 0; j < N; ++j) {
      @\normalsize$C_{i+0,j}$@ += @\normalsize$A_{i+0,k+0}$@ * @\normalsize$B_{k+0,j}$@
      @\normalsize$C_{i+0,j}$@ += @\normalsize$A_{i+0,k+1}$@ * @\normalsize$B_{k+1,j}$@
      @\normalsize$C_{i+1,j}$@ += @\normalsize$A_{i+1,k+0}$@ * @\normalsize$B_{k+0,j}$@
      @\normalsize$C_{i+1,j}$@ += @\normalsize$A_{i+1,k+1}$@ * @\normalsize$B_{k+1,j}$@
    }
  • \(\\frac{N2}\\cdot{}\\frac{N2}\) iterations of \(j\) loop

  • likely 2 misses per loop with \(A\) (2 cache blocks)

    • total misses: \(\\frac{N^22}\) (same as only blocking in K)

simple blocking — counting misses for B

for (int k = 0; k < N; k += 2)
  for (int i = 0; i < N; i += 2)
    for (int j = 0; j < N; ++j) {
      @\normalsize$C_{i+0,j}$@ += @\normalsize$A_{i+0,k+0}$@ * @\normalsize\myemph{$B_{k+0,j}$}@
      @\normalsize$C_{i+0,j}$@ += @\normalsize$A_{i+0,k+1}$@ * @\normalsize$B_{k+1,j}$@
      @\normalsize$C_{i+1,j}$@ += @\normalsize$A_{i+1,k+0}$@ * @\normalsize\myemph{$B_{k+0,j}$}@
      @\normalsize$C_{i+1,j}$@ += @\normalsize$A_{i+1,k+1}$@ * @\normalsize$B_{k+1,j}$@
    }
  • \(\\frac{N2}\\cdot{}\\frac{N2}\) iterations of \(j\) loop

  • likely \(2\\div{}\\text{block size}\) misses per iteration with \(B\)

    • total misses: \(\\frac{N^32\\cdot{}\\text{block size}}\) (before: \(\\frac{N^3\\text{block size}}\))

simple blocking — counting misses for C

for (int k = 0; k < N; k += 2)
  for (int i = 0; i < N; i += 2)
    for (int j = 0; j < N; ++j) {
      @\normalsize$C_{i+0,j}$@ += @\normalsize$A_{i+0,k+0}$@ * @\normalsize{$B_{k+0,j}$}@
      @\normalsize$C_{i+0,j}$@ += @\normalsize$A_{i+0,k+1}$@ * @\normalsize$B_{k+1,j}$@
      @\normalsize$C_{i+1,j}$@ += @\normalsize$A_{i+1,k+0}$@ * @\normalsize{$B_{k+0,j}$}@
      @\normalsize$C_{i+1,j}$@ += @\normalsize$A_{i+1,k+1}$@ * @\normalsize$B_{k+1,j}$@
    }
  • \(\\frac{N2}\\cdot{}\\frac{N2}\) iterations of \(j\) loop

  • likely \(\\frac{2\\text{block size}}\) misses per iteration with \(C\)

    • total misses: \(\\frac{N^32\\cdot{}\\text{block size}}\) (same as blocking only in K)

simple blocking — counting misses (total)

for (int k = 0; k < N; k += 2)
  for (int i = 0; i < N; i += 2)
    for (int j = 0; j < N; ++j) {
      @\normalsize$C_{i+0,j}$@ += @\normalsize$A_{i+0,k+0}$@ * @\normalsize{$B_{k+0,j}$}@
      @\normalsize$C_{i+0,j}$@ += @\normalsize$A_{i+0,k+1}$@ * @\normalsize$B_{k+1,j}$@
      @\normalsize$C_{i+1,j}$@ += @\normalsize$A_{i+1,k+0}$@ * @\normalsize{$B_{k+0,j}$}@
      @\normalsize$C_{i+1,j}$@ += @\normalsize$A_{i+1,k+1}$@ * @\normalsize$B_{k+1,j}$@
    }
  • before:
    A: \(\\frac{N^22}\); B: \(\\frac{N^31\\cdot{}\\text{block size}}\); C \(\\frac{N^31\\cdot{}\\text{block size}}\)
  • after:
    A: \(\\frac{N^22}\); B: \(\\frac{N^32\\cdot{}\\text{block size}}\); C \(\\frac{N^32\\cdot{}\\text{block size}}\)

generalizing: divide and conquer

partial_matrixmultiply(float *A, float *B, float *C
               int startI, int endI, ...) {
  for (int i = startI; i < endI; ++i) {
    for (int j = startJ; j < endJ; ++j) {
      for (int k = startK; k < endK; ++k) {
        ...
}
matrix_multiply(float *A, float *B, float *C, int N) {
  for (int ii = 0; ii < N; ii += BLOCK_I)
    for (int jj = 0; jj < N; jj += BLOCK_J)
      for (int kk = 0; kk < N; kk += BLOCK_K)
         ...
         /* do everything for segment of A, B, C
            that fits in cache! */
         partial_matmul(A, B, C,
               ii, ii + BLOCK_I, jj, jj + BLOCK_J,
               kk, kk + BLOCK_K)
}

array usage: matrix block

cache blocking efficiency

  • for each of \(N^3/IJK\) matrix blocks:

  • load \(I\\times{} K\) elements of \(A_ik\):

    • \(\\approx{} IK\\div{} \\text{block size}\) misses per matrix block
    • \(\\approx{} N^3/(J\\cdot{}\\text{blocksize})\) misses total
  • load \(K\\times{} J\) elements of \(B_kj\):

    • \(\\approx{} N^3/(I\\cdot{}\\text{blocksize})\) misses total
  • load \(I\\times{} J\) elements of \(C_ij\):

    • \(\\approx{} N^3/(K\\cdot{}\\text{blocksize})\) misses total
  • bigger blocks — more work per load!

  • catch: \(IK+KJ+IJ\) elements must fit in cache

    • otherwise estimates above don’t work

cache blocking rule of thumb

  • fill the most of the cache with useful data
  • and do as much work as possible from that
  • example: my desktop 32KB L1 cache
  • \(I=J=K=48\) uses \(48^2\\times{} 3\) elements, or \(27\)KB.
  • assumption: conflict misses aren’t important

systematic approach

for (int k = 0; k < N; ++k) {
  for (int i = 0; i < N; ++i) {
    @$A_{ik}$ loaded once in this loop:@
    for (int j = 0; j < N; ++j)
      @$C_{ij}$, $B_{kj}$ loaded each iteration (if $N$ big):@
      B[i*N+j] += A[i*N+k] * A[k*N+j];
  • values from \(A_ik\) used \(N\) times per load

  • values from \(B_kj\) used \(1\) times per load

    • but good spatial locality, so cache block of \(B_kj\) together
  • values from \(C_ij\) used \(1\) times per load

    • but good spatial locality, so cache block of \(C_ij\) together

exercise: miss estimating (3)

for (int kk = 0; kk < 1000; kk += 10)
    for (int jj = 0; jj < 1000; jj += 10)
        for (int i = 0; i < 1000; i += 1)
            for (int j = jj; j < jj+10; j += 1)
                for (int k = kk; k < kk + 10; k += 1)
                    A[k*N+j] += B[i*N+j];
  • assuming: 4 elements per block
  • assuming: cache not close to big enough to hold 1K elements, but big enough to hold 500 or so
  • estimate: approximately how many misses for A, B?
  • hint 1: part of A, B loaded in two inner-most loops only needs to be loaded once
  • hint 2: part of A can be reused between iterations of i loop

loop ordering compromises

  • loop ordering forces compromises:

  • for k: for i: for j: c[i,j] += a[i,k] * b[j,k]


  • perfect temporal locality in a[i,k]

  • bad temporal locality for c[i,j], b[j,k]

  • perfect spatial locality in c[i,j]

  • bad spatial locality in b[j,k], a[i,k]


  • cache blocking: work on blocks rather than rows/columns

    • have some temporal, spatial locality in everything

cache blocking pattern

  • no perfect loop order? work on rectangular matrix blocks


  • size amount used in inner loops based on cache size

  • in practice:

    • test performance to determine ‘size’ of blocks

Backup slides

simple blocking – with 3?

for (int kk = 0; kk < N; kk += 3)
  for (int i = 0; i < N; i += 1)
    for (int j = 0; j < N; ++j) {
      C[i*N+j] += A[i*N+kk+0] * B[(kk+0)*N+j];
      C[i*N+j] += A[i*N+kk+1] * B[(kk+1)*N+j];
      C[i*N+j] += A[i*N+kk+2] * B[(kk+2)*N+j];
    }
  • \(\\frac{N<em>3</em>}\\cdot{} N\) j-loop iterations, and (assuming \(N\) large):

  • about \(1\) misses from \(A\) per j-loop iteration

    • \(N^2/<em>3</em>\) total misses (before blocking: \(N^2\))
  • about \(3N\\div{}\\text{block size}\) misses from \(B\) per j-loop iteration

    • \(N^3\\div{}\\text{block size}\) total misses (same as before)
  • about \(3N\\div{}\\text{block size}\) misses from \(C\) per j-loop iteration

    • \(N^3\\div{}\\text{block size}\) total misses (same as before)

more than 3?

  • can we just keep doing this increase from 3 to some large \(X\)? …

  • assumption: \(X\) values from A would stay in cache

    • \(X\) too large — cache not big enough
  • assumption: \(X\) blocks from B would help with spatial locality

    • \(X\) too large — evicted from cache before next iteration

array usage (2 \(k\) at a time)

keeping values in cache

  • can’t explicitly ensure values are kept in cache

  • … but reusing values effectively does this

    • cache will try to keep recently used values

  • cache optimization ideas: choose what’s in the cache

    • for thinking about it: load values explicitly
    • for implementing it: access only values we want loaded