## last time

write policies
thinking about tradeoffs in cache design
average memory access time
hit time vs. miss rate vs. miss penalty
miss types (compulsory/capacity/conflict)
data misses and C code

## quiz Q1

8 bits for offset
4 bits for index
$64-12=52$ bits for tag
1 tag (52 bits) + valid bit per block
16 blocks
$16 * 53=848$ bits

## quiz Q2

write to L1
L1 is write-no-allocate: nothing stored in L1, just sent to next level (L2)

L2 is write-allocate: something stored
L2 is write-back: marked dirty when stored (instead of being sent to next level)

## quiz Q4

read from 0x0 - bring in 0x0-0xF
write to 0x4 - mark 0x0-0xF as dirty
write to address 0x2004:
write-allocate - so need to add to cache:
first must evict 0x0-0xF - write whole thing to memory
bring in 0x2000-0x2003 and 0x2005-0x200F - read from memory

## C and cache misses (warmup 1)

```
int array[4];
int even_sum = 0, odd_sum = 0;
even_sum += array[0];
odd_sum += array[1];
even_sum += array[2];
odd_sum += array[3];
```

Assume everything but array is kept in registers (and the compiler does not do anything funny).

How many data cache misses on a 1-set direct-mapped cache with 8B blocks?

## some possibilities

Q1: how do cache blocks correspond to array elements?
not enough information provided!

## some possibilities

if array[0] starts at beginning of a cache block...
array split across two cache blocks

| memory access | cache contents afterwards |
| :---: | :---: |
| - | (empty) |
| read array[0] (miss) | {array[0], array[1]} |
| read array[1] (hit) | {array[0], array[1]} |
| read array[2] (miss) | {array[2], array[3]} |
| read array[3] (hit) | {array[2], array[3]} |

## some possibilities

if array[0] starts right in the middle of a cache block...
array split across three cache blocks

| memory access | cache contents afterwards |
| :--- | :--- |
| - | (empty) |
| read array[0] (miss) | {\*\*\*\*, array[0]} |
| read array[1] (miss) | {array[1], array[2]} |
| read array[2] (hit) | {array[1], array[2]} |
| read array[3] (miss) | {array[3], ++++} |

## some possibilities

if array[0] starts at an odd place in a cache block,
need to read two cache blocks to get most array elements

| memory access | cache contents afterwards |
| :---: | :---: |
| - | (empty) |
| read array[0] byte 0 (miss) | {\*\*\*\*, array[0] byte 0} |
| read array[0] bytes 1-3 (miss) | {array[0] bytes 1-3, array[1], array[2] byte 0} |
| read array[1] (hit) | {array[0] bytes 1-3, array[1], array[2] byte 0} |
| read array[2] byte 0 (hit) | {array[0] bytes 1-3, array[1], array[2] byte 0} |
| read array[2] bytes 1-3 (miss) | {array[2] bytes 1-3, array[3], ++++} |
| read array[3] (hit) | {array[2] bytes 1-3, array[3], ++++} |

## aside: alignment

compilers and malloc/new implementations usually try to align values
align = make address be a multiple of something
most important reason: don't cross cache block boundaries

## C and cache misses (warmup 2)

```
int array[4];
int even_sum = 0, odd_sum = 0;
even_sum += array[0];
even_sum += array[2];
odd_sum += array[1];
odd_sum += array[3];
```

Assume everything but array is kept in registers (and the compiler does not do anything funny).

Assume array[0] at beginning of cache block.
How many data cache misses on a 1-set direct-mapped cache with 8B blocks?

## exercise solution

(figure: array[0] and array[1] in one cache block; array[2] and array[3] in the next)

| memory access | cache contents afterwards |
| :---: | :---: |
| - | (empty) |
| read array[0] (miss) | {array[0], array[1]} |
| read array[2] (miss) | {array[2], array[3]} |
| read array[1] (miss) | {array[0], array[1]} |
| read array[3] (miss) | {array[2], array[3]} |

## C and cache misses (warmup 3)

```
int array[8];
int even_sum = 0, odd_sum = 0;
even_sum += array[0];
odd_sum += array[1];
even_sum += array[2];
odd_sum += array[3];
even_sum += array[4];
odd_sum += array[5];
even_sum += array[6];
odd_sum += array[7];
```

Assume everything but array is kept in registers (and the compiler does not do anything funny), and array[0] at beginning of cache block.

How many data cache misses on a 2-set direct-mapped cache with 8B blocks?

## exercise solution

(figure: four consecutive cache blocks hold array[0]-array[1] (index 0), array[2]-array[3] (index 1), array[4]-array[5] (index 0), array[6]-array[7] (index 1))

| memory access | set 0 afterwards | set 1 afterwards |
| :---: | :---: | :---: |
| - | (empty) | (empty) |
| read array[0] (miss) | {array[0], array[1]} | (empty) |
| read array[1] (hit) | {array[0], array[1]} | (empty) |
| read array[2] (miss) | {array[0], array[1]} | {array[2], array[3]} |
| read array[3] (hit) | {array[0], array[1]} | {array[2], array[3]} |
| read array[4] (miss) | {array[4], array[5]} | {array[2], array[3]} |
| read array[5] (hit) | {array[4], array[5]} | {array[2], array[3]} |
| read array[6] (miss) | {array[4], array[5]} | {array[6], array[7]} |
| read array[7] (hit) | {array[4], array[5]} | {array[6], array[7]} |

## exercise solution

observation: what happens in set 0 doesn't affect set 1
set 0 only sees array[0], array[1], array[4], array[5]
set 1 only sees array[2], array[3], array[6], array[7]

## C and cache misses (warmup 4)

```
int array[8];
int even_sum = 0, odd_sum = 0;
even_sum += array[0];
even_sum += array[2];
even_sum += array[4];
even_sum += array[6];
odd_sum += array[1];
odd_sum += array[3];
odd_sum += array[5];
odd_sum += array[7];
```

Assume everything but array is kept in registers (and the compiler does not do anything funny).

How many data cache misses on a 2-set direct-mapped cache with 8B blocks?

## exercise solution

(figure: four consecutive cache blocks hold array[0]-array[1] (index 0), array[2]-array[3] (index 1), array[4]-array[5] (index 0), array[6]-array[7] (index 1))

| memory access | set 0 afterwards | set 1 afterwards |
| :---: | :---: | :---: |
| - | (empty) | (empty) |
| read array[0] (miss) | {array[0], array[1]} | (empty) |
| read array[2] (miss) | {array[0], array[1]} | {array[2], array[3]} |
| read array[4] (miss) | {array[4], array[5]} | {array[2], array[3]} |
| read array[6] (miss) | {array[4], array[5]} | {array[6], array[7]} |
| read array[1] (miss) | {array[0], array[1]} | {array[6], array[7]} |
| read array[3] (miss) | {array[0], array[1]} | {array[2], array[3]} |
| read array[5] (miss) | {array[4], array[5]} | {array[2], array[3]} |
| read array[7] (miss) | {array[4], array[5]} | {array[6], array[7]} |

## arrays and cache misses (1)

```
int array[1024]; // 4KB array
int even_sum = 0, odd_sum = 0;
for (int i = 0; i < 1000; i += 2) {
    even_sum += array[i + 0];
    odd_sum += array[i + 1];
}
```

Assume everything but array is kept in registers (and the compiler does not do anything funny).

How many data cache misses on a 2 KB direct-mapped cache with 16B cache blocks?

## arrays and cache misses (2)

```
int array[1024]; // 4KB array
int even_sum = 0, odd_sum = 0;
for (int i = 0; i < 1024; i += 2)
    even_sum += array[i + 0];
for (int i = 0; i < 1024; i += 2)
    odd_sum += array[i + 1];
```

Assume everything but array is kept in registers (and the compiler does not do anything funny).

How many data cache misses on a 2 KB direct-mapped cache with 16B cache blocks?
Would a set-associative cache be better?
What if the array had 1000 elements?

## approximate miss analysis

very tedious to precisely count cache misses
even more tedious when we take advanced cache optimizations into account
instead, approximations:
good or bad temporal/spatial locality
good temporal locality: value stays in cache
good spatial locality: use all parts of cache block
with nested loops: what does inner loop use?
intuition: values used in inner loop loaded into cache once
(that is, once each time the inner loop is run)
...if they can all fit in the cache

## locality exercise (1)

```
/* version 1 */
for (int i = 0; i < N; ++i)
    for (int j = 0; j < N; ++j)
        A[i] += B[j] * C[i * N + j];
/* version 2 */
for (int j = 0; j < N; ++j)
    for (int i = 0; i < N; ++i)
        A[i] += B[j] * C[i * N + j];
```

exercise: which has better temporal locality in A? in B? in C? how about spatial locality?

## exercise: miss estimating (1)

```
for (int i = 0; i < N; ++i)
    for (int j = 0; j < N; ++j)
        A[i] += B[j] * C[i * N + j];
```

Assume: 4 array elements per block, N very large, nothing in cache at beginning.

Example: $N / 4$ estimated misses for A accesses:
A[i] should always be a hit on all but the first iteration of the innermost loop.
first iteration: A[i] should be a hit about 3/4 of the time (same block as A[i-1] that often)

Exercise: estimate # of misses for B, C

## a note on matrix storage

$A$: $N \times N$ matrix
represent as array
makes dynamic sizes easier:

```
float A_2d_array[N][N];
float *A_flat = malloc(N * N * sizeof(float));
```

A_flat[i * N + j] === A_2d_array[i][j]

## convention re: rows/columns

going to call the first index rows
$A_{i, j}$ is A row i, column j
rows are stored together
this is an arbitrary choice

## $5 \times 5$ array and 4-element cache blocks

| $\operatorname{array}[0 \star 5+0]$ | $\operatorname{array}[0 \star 5+1]$ | $\operatorname{array}[0 \star 5+2]$ | $\operatorname{array}[0 \star 5+3]$ | $\operatorname{array}[0 \star 5+4]$ |
| :--- | :--- | :--- | :--- | :--- |
| $\operatorname{array}[1 \star 5+0]$ | $\operatorname{array}[1 \star 5+1]$ | $\operatorname{array}[1 \star 5+2]$ | $\operatorname{array}[1 \star 5+3]$ | $\operatorname{array}[1 \star 5+4]$ |
| $\operatorname{array}[2 \star 5+0]$ | $\operatorname{array}[2 \star 5+1]$ | $\operatorname{array}[2 \star 5+2]$ | $\operatorname{array}[2 \star 5+3]$ | $\operatorname{array}[2 \star 5+4]$ |
| $\operatorname{array}[3 \star 5+0]$ | $\operatorname{array}[3 \star 5+1]$ | $\operatorname{array}[3 \star 5+2]$ | $\operatorname{array}[3 \star 5+3]$ | $\operatorname{array}[3 \star 5+4]$ |
| $\operatorname{array}[4 \star 5+0]$ | $\operatorname{array}[4 \star 5+1]$ | $\operatorname{array}[4 \star 5+2]$ | $\operatorname{array}[4 \star 5+3]$ | $\operatorname{array}[4 \star 5+4]$ |

if array starts on a cache block boundary:
first cache block = first 4 elements, all together in one row!

second cache block:
1 from row 0
3 from row 1

generally: cache blocks contain data from 1 or 2 rows
$\rightarrow$ better performance from reusing rows

## matrix multiply

$$
C_{i j}=\sum_{k=1}^{n} A_{i k} \times B_{k j}
$$

    /* version 1: inner loop is k, middle is j */
    for (int i = 0; i < N; ++i)
        for (int j = 0; j < N; ++j)
            for (int k = 0; k < N; ++k)
                C[i * N + j] += A[i * N + k] * B[k * N + j];
    /* version 2: outer loop is k, middle is i */
    for (int k = 0; k < N; ++k)
        for (int i = 0; i < N; ++i)
            for (int j = 0; j < N; ++j)
                C[i * N + j] += A[i * N + k] * B[k * N + j];

## loop orders and locality

loop body: $C_{i j}+=A_{i k} B_{k j}$
kij order: $C_{i j}, B_{k j}$ have spatial locality
kij order: $A_{i k}$ has temporal locality
... better than ...
$i j k$ order: $A_{i k}$ has spatial locality
$i j k$ order: $C_{i j}$ has temporal locality

## which is better?

$$
C_{i j}=\sum_{k=1}^{n} A_{i k} \times B_{k j}
$$

```
/* version 1: inner loop is k, middle is j */
for (int i = 0; i < N; ++i)
    for (int j = 0; j < N; ++j)
        for (int k = 0; k < N; ++k)
            C[i * N + j] += A[i * N + k] * B[k * N + j];
/* version 2: outer loop is k, middle is i */
for (int k = 0; k < N; ++k)
    for (int i = 0; i < N; ++i)
        for (int j = 0; j < N; ++j)
            C[i * N + j] += A[i * N + k] * B[k * N + j];
```

exercise: Which version has better spatial/temporal locality for...
...accesses to C?
...accesses to A?
...accesses to B?

## array usage: $i j k$ order

for all $i$:
for all $j$:
for all $k$:

$$
C_{i j}+=A_{i k} \times B_{k j}
$$


looking only at innermost loop: temporal locality in C
bad temporal locality in everything else (everything accessed exactly once)

## array usage: $i j k$ order

for all $i$:
for all $j$:
for all $k$:

$$
C_{i j}+=A_{i k} \times B_{k j}
$$

looking only at innermost loop:
row of A (elements used once)
column of B (elements used once)
single element of C (used many times)

## array usage: $i j k$ order


looking only at two innermost loops together:
some temporal locality in A (column reused)
some temporal locality in B (row reused)
some temporal locality in C (row reused)

## array usage: kij order


for all $k$ :
for all $i$ :
for all $j$ :

$$
C_{i j}+=A_{i k} \times B_{k j}
$$


if $N$ large:
using $C_{i j}$ once per load into cache (but using $C_{i, j+1}$ right after)
using $A_{i k}$ many times per load into cache
using $B_{k j}$ once per load into cache (but using $B_{k, j+1}$ right after)

## array usage: kij order


for all $k$ :
for all $i$ :
for all $j$ :
$C_{i j}+=A_{i k} \times B_{k j}$
looking only at innermost loop:
spatial locality in B, C (use most of loaded B, C cache blocks)
no useful spatial locality in A (rest of A's cache block wasted)

## array usage: kij order


for all $k$:
for all $i$:
for all $j$:

$$
C_{i j}+=A_{i k} \times B_{k j}
$$

looking only at innermost loop:
temporal locality in A
no temporal locality in B, C
($B$, $C$ values used exactly once)

## array usage: kij order


$C_{i j}+=A_{i k} \times B_{k j}$

looking only at innermost loop:
processing one element of A (use many times)
row of B (each element used once)
row of C (each element used once)

## array usage: kij order


looking only at two innermost loops together:
for all $i$:
for all $j$:
$C_{i j}+=A_{i k} \times B_{k j}$

good temporal locality in A (column reused)
good temporal locality in B (row reused)
bad temporal locality in C (nothing reused)

## performance (with $A=B$ )



## alternate view 1: cycles/instruction



## alternate view 2: cycles/operation



## counting misses: version 1

```
for (int i = 0; i < N; ++i)
    for (int j = 0; j < N; ++j)
        for (int k = 0; k < N; ++k)
            C[i * N + j] += A[i * N + k] * B[k * N + j];
```

if $N$ really large
assumption: can't get close to storing $N$ values in cache at once
for A: about $N \div$ block size misses per k-loop
total misses: $N^{3} \div$ block size
for B: about $N$ misses per k-loop
total misses: $N^{3}$
for C: about $1 \div$ block size misses per k-loop
total misses: $N^{2} \div$ block size

## counting misses: version 2

```
for (int k = 0; k < N; ++k)
    for (int i = 0; i < N; ++i)
        for (int j = 0; j < N; ++j)
            C[i * N + j] += A[i * N + k] * B[k * N + j];
```

for A: about 1 miss per j-loop
total misses: $N^{2}$
for B: about $N \div$ block size misses per j-loop
total misses: $N^{3} \div$ block size
for C: about $N \div$ block size misses per j-loop
total misses: $N^{3} \div$ block size

## backup slides

## exercise: miss estimating (2)

```
for (int k = 0; k < 1000; k += 1)
    for (int i = 0; i < 1000; i += 1)
        for (int j = 0; j < 1000; j += 1)
            A[k*N+j] += B[i*N+j];
```

assuming: 4 elements per block
assuming: cache not close to big enough to hold 1 K elements
estimate: approximately how many misses for $A, B$ ?

## misses with skipping

```
int array1[512]; int array2[512];
int sum = 0;
for (int i = 0; i < 512; i += 1) {
    sum += array1[i] * array2[i];
}
```

Assume everything but array1, array2 is kept in registers (and the compiler does not do anything funny).

About how many data cache misses on a 2 KB direct-mapped cache with 16B cache blocks?
Hint: depends on relative placement of array1, array2

## best/worst case

array1[i] and array2[i] always different sets:
= distance from array1 to array2 not multiple of # sets $\times$ bytes/set
2 misses every 4 i
blocks of 4 array1[X] values loaded, then used 4 times before loading next block
(and same for array2[X])
array1[i] and array2[i] same sets:
= distance from array1 to array2 is multiple of # sets $\times$ bytes/set
2 misses every i
block of 4 array1[X] values loaded, one value used from it,
then block of 4 array2[X] values replaces it, one value used from it, ...

## worst case in practice?

two rows of matrix?
often sizeof(row) bytes apart
if the row size is multiple of number of sets $\times$ bytes per block, oops!

## cache organization and miss rate

depends on program; one example:
SPEC CPU2000 benchmarks, 64B block size
LRU replacement policies
data cache miss rates:

| Cache size | direct-mapped | 2-way | 8-way | fully assoc. |
| :--- | ---: | ---: | ---: | ---: |
| 1KB | $8.63 \%$ | $6.97 \%$ | $5.63 \%$ | $5.34 \%$ |
| 2KB | $5.71 \%$ | $4.23 \%$ | $3.30 \%$ | $3.05 \%$ |
| 4KB | $3.70 \%$ | $2.60 \%$ | $2.03 \%$ | $1.90 \%$ |
| 16 KB | $1.59 \%$ | $0.86 \%$ | $0.56 \%$ | $0.50 \%$ |
| 64 KB | $0.66 \%$ | $0.37 \%$ | $0.10 \%$ | $0.001 \%$ |
| 128 KB | $0.27 \%$ | $0.001 \%$ | $0.0006 \%$ | $0.0006 \%$ |

## $L 1$ misses (with $A=B$ )



## L1 miss detail (1)



## L1 miss detail (2)

read misses/1K instruction


## addresses

| element | address (binary) |
| :--- | :--- |
| $B[k * 114+j]$ | 10 0000 0000 0100 |
| $B[k * 114+j+1]$ | 10 0000 0000 1000 |
| $B[(k+2) * 114+j]$ | 10 0011 1001 0100 |
| $B[(k+3) * 114+j]$ | 10 0101 0101 1100 |
| $\cdots$ |  |
| $B[(k+9) * 114+j]$ | 11 0000 0000 1100 |
test system L1 cache: 6 index bits, 6 block offset bits

## conflict misses

powers of two - lower order bits unchanged
$B[k * 93+j]$ and $B[(k+11) * 93+j]$:
1023 elements apart (4092 bytes; 63.9 cache blocks)
64 sets in L1 cache: usually maps to same set
$B[k * 93+(j+1)]$ will not be cached (next $i$ loop)
even if in same block as $B[k * 93+j]$
how to fix? improve spatial locality
(maybe even if it requires copying)

## split caches; multiple cores



## hierarchy and instruction/data caches

typically separate data and instruction caches for L1
(almost) never going to read instructions as data or vice-versa
avoids instructions evicting data and vice-versa
can optimize instruction cache for different access pattern
easier to build fast caches that handle fewer accesses at a time

## inclusive versus exclusive

L2 inclusive of L1:
everything in L1 cache duplicated in L2
adding to L1 also adds to L2

L2 exclusive of L1:
L2 contains different data than L1
adding to L1 must remove from L2
probably evicting from L1 adds to L2

## inclusive versus exclusive

inclusive policy:
no extra work on eviction
but duplicated data
easier to explain when L$k$ shared by multiple L$(k-1)$ caches?

## inclusive versus exclusive

exclusive policy:
avoid duplicated data
sometimes called victim cache (contains cache eviction victims)
makes less sense with multicore

## exercise (1)

initial cache: 64-byte blocks, 64 sets, 8 ways/set

If we leave the other parameters listed above unchanged, which will probably reduce the number of capacity misses in a typical program? (Multiple may be correct.)
A. quadrupling the block size (256-byte blocks, 64 sets, 8 ways/set)
B. quadrupling the number of sets
C. quadrupling the number of ways/set

## exercise (2)

initial cache: 64-byte blocks, 8 ways/set, 64 KB cache

If we leave the other parameters listed above unchanged, which will probably reduce the number of capacity misses in a typical program? (Multiple may be correct.)
A. quadrupling the block size (256-byte blocks, 8 ways/set, 64 KB cache)
B. quadrupling the number of ways/set
C. quadrupling the cache size

## exercise (3)

initial cache: 64-byte blocks, 8 ways/set, 64 KB cache

If we leave the other parameters listed above unchanged, which will probably reduce the number of conflict misses in a typical program? (Multiple may be correct.)
A. quadrupling the block size (256-byte blocks, 8 ways/set, 64 KB cache)
B. quadrupling the number of ways/set
C. quadrupling the cache size

## prefetching

seems like we can't really improve cold misses...
have to have a miss to bring value into the cache?
solution: don't require miss: 'prefetch' the value before it's accessed
remaining problem: how do we know what to fetch?

## common access patterns

suppose recently accessed 16B cache blocks are at: 0x48010, 0x48020, 0x48030, 0x48040
guess what's accessed next
common pattern with instruction fetches and array accesses

## prefetching idea

look for sequential accesses
bring in guess at next-to-be-accessed value
if right: no cache miss (even if never accessed before)
if wrong: possibly evicted something else - could cause more misses
fortunately, sequential access guesses almost always right

## quiz exercise solution

(figure: consecutive cache blocks alternate between set index 0 and set index 1; array[0] and array[1] share a block with set index 0)

| memory access | set 0 afterwards | set 1 afterwards |
| :---: | :---: | :---: |
| - | (empty) | (empty) |
| read array[0] (miss) | {array[0], array[1]} | (empty) |
| read array[3] (miss) | {array[0], array[1]} | {array[2], array[3]} |
| read array[6] (miss) | {array[0], array[1]} | {array[6], array[7]} |
| read array[1] (hit) | {array[0], array[1]} | {array[6], array[7]} |
| read array[4] (miss) | {array[4], array[5]} | {array[6], array[7]} |
| read array[7] (hit) | {array[4], array[5]} | {array[6], array[7]} |
| read array[2] (miss) | {array[4], array[5]} | {array[2], array[3]} |
| read array[5] (hit) | {array[4], array[5]} | {array[2], array[3]} |
| read array[8] (miss) | {array[8], array[9]} | {array[2], array[3]} |

## quiz exercise solution

accesses that map to set 0:

| memory access | set 0 afterwards |
| :--- | :--- |
| - | (empty) |
| read array[0] (miss) | \{array[0], array[1]\} |
| read array[1] (hit) | \{array[0], array[1]\} |
| read array[4] (miss) | \{array[4], array[5]\} |
| read array[5] (hit) | \{array[4], array[5]\} |
| read array[8] (miss) | \{array[8], array[9]\} |

## quiz exercise solution

accesses that map to set 1:

| memory access | set 1 afterwards |
| :--- | :--- |
| - | (empty) |
| read array[3] (miss) | \{array[2], array[3]\} |
| read array[6] (miss) | \{array[6], array[7]\} |
| read array[7] (hit) | \{array[6], array[7]\} |
| read array[2] (miss) | \{array[2], array[3]\} |

## not the quiz problem

(figure: array laid out across cache blocks, two elements per block)

if 1-set 2-way cache instead of 2-set 1-way cache:

| memory access | single set with 2 ways, LRU first |
| :---: | :---: |
| - | ---, --- |
| read array[0] (miss) | ---, \{array[0], array[1]\} |
| read array[3] (miss) | \{array[0], array[1]\}, \{array[2], array[3]\} |
| read array[6] (miss) | \{array[2], array[3]\}, \{array[6], array[7]\} |
| read array[1] (miss) | \{array[6], array[7]\}, \{array[0], array[1]\} |
| read array[4] (miss) | \{array[0], array[1]\}, \{array[4], array[5]\} |
| read array[7] (miss) | \{array[4], array[5]\}, \{array[6], array[7]\} |
| read array[2] (miss) | \{array[6], array[7]\}, \{array[2], array[3]\} |
| read array[5] (miss) | \{array[2], array[3]\}, \{array[4], array[5]\} |
| read array[8] (miss) | \{array[4], array[5]\}, \{array[8], array[9]\} |

## mapping of sets to memory (direct-mapped)

(figure: diagram of memory and the cache sets it maps to)

## mapping of sets to memory (3-way)

(figure: diagram of memory and the cache sets it maps to, three ways per set)

## C and cache misses (4)

```
typedef struct {
    int a_value, b_value;
    int other_values[6];
} item;
item items[5];
int a_sum = 0, b_sum = 0;
for (int i = 0; i < 5; ++i)
    a_sum += items[i].a_value;
for (int i = 0; i < 5; ++i)
    b_sum += items[i].b_value;
```

Assume everything but items is kept in registers (and the compiler does not do anything funny).

## C and cache misses (4, rewrite)

```
int array[40];
int a_sum = 0, b_sum = 0;
for (int i = 0; i < 40; i += 8)
    a_sum += array[i];
for (int i = 1; i < 40; i += 8)
    b_sum += array[i];
```

Assume everything but array is kept in registers (and the compiler does not do anything funny) and array starts at beginning of cache block.

How many data cache misses on a 2-way set associative 128B cache with 16B cache blocks and LRU replacement?

## C and cache misses (4, solution pt 1)

ints 4 byte $\rightarrow$ array[0 to 3] and array[16 to 19] in same cache set
$64 \mathrm{~B}=16$ ints stored per way
4 sets total
accessing array indices $0,8,16,24,32,1,9,17,25,33$
0 (set 0), 8 (set 2), 16 (set 0), 24 (set 2), 32 (set 0)
1 (set 0), 9 (set 2), 17 (set 0), 25 (set 2), 33 (set 0)

## C and cache misses (4, solution pt 2)

| access | set 0 after (LRU first) | result |
| :--- | :--- | :--- |
| - | -, - |  |
| array[0] | -, array[0 to 3] | miss |
| array[16] | array[0 to 3], array[16 to 19] | miss |
| array[32] | array[16 to 19], array[32 to 35] | miss |
| array[1] | array[32 to 35], array[0 to 3] | miss |
| array[17] | array[0 to 3], array[16 to 19] | miss |
| array[33] | array[16 to 19], array[32 to 35] | miss |

6 misses for set 0

## C and cache misses (4, solution pt 3)

| access | set 2 after (LRU first) | result |
| :--- | :--- | :--- |
| - | -, - |  |
| array[8] | -, array[8 to 11] | miss |
| array[24] | array[8 to 11], array[24 to 27] | miss |
| array[9] | array[24 to 27], array[8 to 11] | hit |
| array[25] | array[8 to 11], array[24 to 27] | hit |

2 misses for set 2

## C and cache misses (3)

```
typedef struct {
    int a_value, b_value;
    int other_values[10];
} item;
item items[5];
int a_sum = 0, b_sum = 0;
for (int i = 0; i < 5; ++i)
    a_sum += items[i].a_value;
for (int i = 0; i < 5; ++i)
    b_sum += items[i].b_value;
```

observation: 12 ints in struct: only first two used
equivalent to accessing array[0], array[12], array[24], etc.
...then accessing array[1], array[13], array[25], etc.

## C and cache misses (3, rewritten?)

```
int array[60];
int a_sum = 0, b_sum = 0;
for (int i = 0; i < 60; i += 12)
    a_sum += array[i];
for (int i = 1; i < 60; i += 12)
    b_sum += array[i];
```

Assume everything but array is kept in registers (and the compiler does not do anything funny) and array starts at beginning of cache block.

How many data cache misses on a 128B two-way set associative cache with 16B cache blocks and LRU replacement?
observation 1: first loop has 5 misses (first accesses to blocks)
observation 2: array[0] and array[1], array[12] and array[13], etc. in same cache block

## C and cache misses (3, solution)

ints 4 byte $\rightarrow$ array[0 to 3] and array[16 to 19] in same cache set
$64 \mathrm{~B}=16$ ints stored per way
4 sets total
accessing array indices $0,12,24,36,48,1,13,25,37,49$
0 (set 0, array[0 to 3]), 12 (set 3), 24 (set 2), 36 (set 1), 48 (set 0)
each set used at most twice: no replacement needed
so accesses to $1,13,25,37,49$ are all hits:
set 0 contains block with array[0 to 3]
set 3 contains block with array[12 to 15]
etc.

## C and cache misses (3)

```
typedef struct {
    int a_value, b_value;
    int boring_values[126];
} item;
item items[8]; // 4 KB array
int a_sum = 0, b_sum = 0;
for (int i = 0; i < 8; ++i)
    a_sum += items[i].a_value;
for (int i = 0; i < 8; ++i)
    b_sum += items[i].b_value;
```

Assume everything but items is kept in registers (and the compiler does not do anything funny).

How many data cache misses on a 2 KB direct-mapped cache with 16B cache blocks?

## C and cache misses (3, rewritten?)

int array[1024]; // 4 KB array
int a_sum = 0, b_sum = 0;
for (int i = 0; i < 1024; i += 128)
a_sum += array[i];
for (int i = 1; i < 1024; i += 128)
b_sum += array[i];

## C and cache misses (4)

```
typedef struct {
    int a_value, b_value;
    int boring_values[126];
} item;
item items[8]; // 4 KB array
int a_sum = 0, b_sum = 0;
for (int i = 0; i < 8; ++i)
    a_sum += items[i].a_value;
for (int i = 0; i < 8; ++i)
    b_sum += items[i].b_value;
```

Assume everything but items is kept in registers (and the compiler does not do anything funny).

How many data cache misses on a 4-way set associative 2 KB cache with 16B cache blocks?

## thinking about cache storage (1)

2KB direct-mapped cache with 16B blocks:
set 0: address 0 to 15, (0 to 15) + 2KB, (0 to 15) + 4KB, ...
block at 0: array[0] through array[3]
block at 0 + 2KB: array[512] through array[515]
set 1: address 16 to 31, (16 to 31) + 2KB, (16 to 31) + 4KB, ...
block at 16: array[4] through array[7]
block at 16 + 2KB: array[516] through array[519]
set 127: address 2032 to 2047, (2032 to 2047) + 2KB, ...
block at 2032: array[508] through array[511]
block at 2032 + 2KB: array[1020] through array[1023]

## thinking about cache storage (2)

2KB 2-way set associative cache with 16B blocks: block addresses
set 0: address 0, 0 + 1KB, 0 + 2KB, ...
block at 0: array[0] through array[3]
block at 0 + 1KB: array[256] through array[259]
block at 0 + 2KB: array[512] through array[515]
set 1: address 16, 16 + 1KB, 16 + 2KB, ...
block at 16: array[4] through array[7]
set 63: address 1008, 1008 + 1KB, 1008 + 2KB, ...
block at 1008: array[252] through array[255]

## array usage: $i j k$ order


for all $i$:
for all $j$:
for all $k$:
$C_{i j}+=A_{i k} \times B_{k j}$
looking only at two innermost loops together:
good spatial locality in $A$
poor spatial locality in $B$
good spatial locality in $C$

## array usage: kij order



## simple blocking - with 3?

```
for (int kk = 0; kk < N; kk += 3)
    for (int i = 0; i < N; i += 1)
        for (int j = 0; j < N; ++j) {
            C[i*N+j] += A[i*N+kk+0] * B[(kk+0)*N+j];
            C[i*N+j] += A[i*N+kk+1] * B[(kk+1)*N+j];
            C[i*N+j] += A[i*N+kk+2] * B[(kk+2)*N+j];
        }
```

$\frac{N}{3} \cdot N$ executions of the j loop, and (assuming $N$ large):
about 1 miss from $A$ per execution of the j loop
$N^{2} / 3$ total misses (before blocking: $N^{2}$)
about $3 N \div$ block size misses from $B$ per execution of the j loop
$N^{3} \div$ block size total misses (same as before)
about $N \div$ block size misses from $C$ per execution of the j loop (each element of row $i$ is touched once, then reused for all three $k$ values)
$N^{3} \div (3 \cdot \text{block size})$ total misses (before blocking: $N^{3} \div$ block size)

## more than 3?

can we just keep doing this, increasing from 3 to some large $X$? ...
assumption: $X$ values from $A$ would stay in cache
$X$ too large: cache not big enough
assumption: $X$ blocks from $B$ would help with spatial locality
$X$ too large: evicted from cache before next iteration

## array usage (2 $k$ at a time)

for each $kk$:
for each $i$:
for each $j$:
for $k=kk, kk+1$:
$C_{i j}+=A_{i k} \cdot B_{k j}$

within innermost loop:
good spatial locality in $A$
bad locality in $B$
good temporal locality in $C$

loop over $j$: better spatial locality over $A$ than before; still good temporal locality for $A$

loop over $j$: spatial locality over $B$ is worse but probably not more misses
cache needs to keep two cache blocks for next iter instead of one (probably has the space left over!)

have more than 4 cache blocks? increasing $kk$ increment would use more of them
right now: only really care about keeping 4 cache blocks in $j$ loop

## keeping values in cache

can't explicitly ensure values are kept in cache
...but reusing values effectively does this
cache will try to keep recently used values
cache optimization ideas: choose what's in the cache
for thinking about it: load values explicitly
for implementing it: access only values we want loaded

## inclusive versus exclusive

L2 inclusive of L1:
everything in L1 cache duplicated in L2
adding to L1 also adds to L2

L2 exclusive of L1:
L2 contains different data than L1
adding to L1 must remove from L2
probably evicting from L1 adds to L2

## inclusive versus exclusive

inclusive policy:
no extra work on eviction
but duplicated data
easier to explain when $\mathrm{L}k$ shared by multiple $\mathrm{L}(k-1)$ caches?

exclusive policy:
avoid duplicated data
sometimes called victim cache (contains cache eviction victims)
makes less sense with multicore

## locality exercise (2)

```
/* version 2 */
for (int i = 0; i < N; ++i)
    for (int j = 0; j < N; ++j)
        A[i] += B[j] * C[i * N + j];
/* version 3 */
for (int ii = 0; ii < N; ii += 32)
    for (int jj = 0; jj < N; jj += 32)
        for (int i = ii; i < ii + 32; ++i)
            for (int j = jj; j < jj + 32; ++j)
                A[i] += B[j] * C[i * N + j];
```

exercise: which has better temporal locality in $A$? in $B$? in $C$? how about spatial locality?

