## last time

file descriptors: index into array of open file pointers
open(): add pointer to array
dup2(j, i): open_files[i] = open_files[j]
close(): set pointer to NULL
convention: 0 = stdin, 1 = stdout, 2 = stderr
redirection pattern
pipe(): special file to connect two processes

## anonymous feedback (1)

"Can you please shorten the length of your power point slides or consolidate them before you upload them?"

I'll think about how to do this (my slide source files are organized as all the slides for a topic together)
slide PDFs have an outline (usually visible by enabling a sidebar in your PDF viewer)
(though sometimes I'm not careful about adjusting that, esp. for 'backup' slides)

## anonymous feedback (2)

"Could you discuss a capacity miss a bit more in depth? The reading made sense for the cold miss and [conflict?] miss but I couldn't quite grasp the capacity miss. Thank you!"'
we probably won't cover this topic explicitly in lecture today
different reasons why values might not be in cache
can try to assign 'reason' for each miss (but corner cases are tricky/depend on precise definition)
cold/compulsory: values not loaded yet
conflict: cache not flexible enough (more detail later)
capacity: cache not big enough (even if it was very flexible)

## reading note

pointed out: nothing in reading on dup2, etc.
is covered a little in writeup for fork HW
also added something to threads reading

## 2004 CPU

(figure: AMD Opteron)

## the place of cache (1)



## memory hierarchy goals

performance of the fastest (smallest) memory
hide 100x latency difference? 99+% hit (= value found in cache) rate
capacity of the largest (slowest) memory

## memory hierarchy assumptions

temporal locality
"if a value is accessed now, it will be accessed again soon"
caches should keep recently accessed values

spatial locality
"if a value is accessed now, adjacent values will be accessed soon"
caches should store adjacent values at the same time

natural properties of programs: think about loops

## locality examples

```
double computeMean(int length, double *values) {
    double total = 0.0;
    for (int i = 0; i < length; ++i) {
        total += values[i];
    }
    return total / length;
}
```

temporal locality: machine code of the loop
spatial locality: machine code of most consecutive instructions
temporal locality: total, i, length accessed repeatedly
spatial locality: values[i+1] accessed after values[i]

## split caches; multiple cores (one design)



## hierarchy and instruction/data caches

typically separate data and instruction caches for L1
(almost) never going to read instructions as data or vice versa
avoids instructions evicting data and vice versa
can optimize instruction cache for different access pattern
easier to build fast caches that handle fewer accesses at a time

## one-block cache

decision: divide memory into two-byte blocks
put exactly one of these blocks in the cache

Memory

| addresses | bytes |
| :--- | :--- |
| 00000-00001 | 00 11 |
| 00010-00011 | 22 33 |
| 00100-00101 | 55 55 |
| 00110-00111 | 66 77 |
| 01000-01001 | 88 99 |
| 01010-01011 | AA BB |
| 01100-01101 | CC DD |
| 01110-01111 | EE FF |
| 10000-10001 | F0 F1 |
| ... | ... |

read byte at 01011?

Cache (initially)

| value |
| :---: |
| 00 00 |

is this even a value? need extra bit to know

| valid | value |
| :---: | :---: |
| 0 | 00 00 |

invalid, fetch

| valid | value |
| :---: | :---: |
| 1 | AA BB |

value from 01010 or 00000? need tag to know

| valid | tag | value |
| :---: | :---: | :---: |
| 1 | 0101 | AA BB |


## building a (direct-mapped) cache

cache block: 2 bytes
direct-mapped: exactly one place for each address
spread out what can go in a block

Memory (with the cache index each block maps to)

| addresses | bytes | cache index |
| :--- | :--- | :---: |
| 00000-00001 | 00 11 | 00 |
| 00010-00011 | 22 33 | 01 |
| 00100-00101 | 55 55 | 10 |
| 00110-00111 | 66 77 | 11 |
| 01000-01001 | 88 99 | 00 |
| 01010-01011 | AA BB | 01 |
| 01100-01101 | CC DD | 10 |
| 01110-01111 | EE FF | 11 |
| 10000-10001 | F0 F1 | 00 |
| ... | ... | ... |

read byte at 01011?

Cache (initially)

| index | value |
| :---: | :---: |
| 00 | 00 00 |
| 01 | 00 00 |
| 10 | 00 00 |
| 11 | 00 00 |

is this even a value? need extra bit to know

| index | valid | value |
| :---: | :---: | :---: |
| 00 | 0 | 00 00 |
| 01 | 0 | 00 00 |
| 10 | 0 | 00 00 |
| 11 | 0 | 00 00 |

invalid, fetch

| index | valid | value |
| :---: | :---: | :---: |
| 00 | 0 | 00 00 |
| 01 | 1 | AA BB |
| 10 | 0 | 00 00 |
| 11 | 0 | 00 00 |

value from 01010 or from another block with index 01? need tag to know

| index | valid | tag | value |
| :---: | :---: | :---: | :---: |
| 00 | 0 | 00 | 00 00 |
| 01 | 1 | 01 | AA BB |
| 10 | 0 | 00 | 00 00 |
| 11 | 0 | 00 | 00 00 |


## terminology

row $=$ set
preview: change how much is in a row

## Tag-Index-Offset (TIO)

address 001111 (stores value 0xFF)

| cache | tag | index | offset |
| :--- | :---: | :---: | :---: |
| 2 byte blocks, 4 sets | 001 | 11 | 1 |
| 2 byte blocks, 8 sets | 00 | 111 | 1 |
| 4 byte blocks, 2 sets | 001 | 1 | 11 |

offset: $2 = 2^1$ bytes in block, 1 bit to say which byte; $4 = 2^2$ bytes in block, 2 bits to say which byte
index: $2^2 = 4$ sets, 2 bits to index set; $2^3 = 8$ sets, 3 bits to index set; $2^1 = 2$ sets, 1 bit to index set
tag: whatever is left over

2 byte blocks, 4 sets

| index | valid | tag | value |
| :---: | :---: | :---: | :---: |
| 00 | 1 | 000 | 00 11 |
| 01 | 1 | 001 | AA BB |
| 10 | 0 | -- | -- -- |
| 11 | 1 | 001 | EE FF |

2 byte blocks, 8 sets

| index | valid | tag | value |
| :---: | :---: | :---: | :---: |
| 000 | 1 | 00 | 00 11 |
| 001 | 1 | 01 | F1 F2 |
| 010 | 0 | -- | -- -- |
| 011 | 0 | -- | -- -- |
| 100 | 0 | -- | -- -- |
| 101 | 1 | 00 | AA BB |
| 110 | 0 | -- | -- -- |
| 111 | 1 | 00 | EE FF |

4 byte blocks, 2 sets

| index | valid | tag | value |
| :---: | :---: | :---: | :---: |
| 0 | 1 | 000 | 00 11 22 33 |
| 1 | 1 | 001 | CC DD EE FF |


## cache size

cache size = amount of data in cache
not including metadata (tags, valid bits, etc.)

## Tag-Index-Offset formulas (direct-mapped)

(formulas derivable from prior slides)

| symbol | meaning |
| :--- | :--- |
| $m$ | memory address bits |
| $S = 2^{s}$ | number of sets |
| $s$ | (set) index bits |
| $B = 2^{b}$ | block size |
| $b$ | (block) offset bits |
| $t = m - (s + b)$ | tag bits |
| $C = B \times S$ | cache size (if direct-mapped) |
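
A minimal sketch of how these formulas could turn into code, assuming $B$ and $S$ are powers of two and $s + b$ is smaller than the address width. The names `struct tio`, `log2_exact`, and `split_address` are made up for illustration and are not from the slides.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper type: the three fields of an address. */
struct tio {
    uint64_t tag;
    uint64_t index;
    uint64_t offset;
};

/* Number of bits needed to select one of n items; n must be a power of two. */
unsigned log2_exact(uint64_t n) {
    assert(n != 0 && (n & (n - 1)) == 0);
    unsigned bits = 0;
    while (n > 1) {
        n >>= 1;
        bits += 1;
    }
    return bits;
}

/* Split an address for a direct-mapped cache with B-byte blocks and S sets. */
struct tio split_address(uint64_t address, uint64_t B, uint64_t S) {
    unsigned b = log2_exact(B);               /* (block) offset bits */
    unsigned s = log2_exact(S);               /* (set) index bits */
    struct tio result;
    result.offset = address & (B - 1);        /* low b bits */
    result.index = (address >> b) & (S - 1);  /* next s bits */
    result.tag = address >> (b + s);          /* whatever is left over */
    return result;
}
```

For example, `split_address(0x0F, 2, 4)` splits the 6-bit address 001111 from the earlier slides for the "2 byte blocks, 4 sets" cache.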

## TIO: exercise

64-byte blocks, 128 set cache
stores $64 \times 128 = 8192$ bytes (of data)
if addresses are 32 bits, then how many tag/index/offset bits?
which bytes are stored in the same block as byte from 0x1037?
A. byte from 0x1011
B. byte from 0x1021
C. byte from 0x1035
D. byte from 0x1041
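
One way to check an answer to the second question: two bytes share a block exactly when their addresses agree on everything except the offset bits. A small sketch, assuming a power-of-two block size; `same_block` is a hypothetical helper name.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* True if addresses a and b fall in the same block_size-byte block. */
bool same_block(uint64_t a, uint64_t b, uint64_t block_size) {
    assert(block_size != 0 && (block_size & (block_size - 1)) == 0);
    uint64_t offset_mask = block_size - 1;  /* low bits = block offset */
    return (a & ~offset_mask) == (b & ~offset_mask);
}
```

Calling `same_block(0x1037, candidate, 64)` for each option gives the comparison the exercise asks for.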

## example access pattern (1)

2 byte blocks, 4 sets

$B = 2 = 2^{b}$ byte block size
$b = 1$ (block) offset bits
$S = 4 = 2^{s}$ sets
$s = 2$ (set) index bits
$m = 8$ bit addresses
$t = m - (s + b) = 5$ tag bits

| address (binary, hex) | tag | index | offset | result | cache change |
| :--- | :---: | :---: | :---: | :--- | :--- |
| 00000000 (00) | 00000 | 00 | 0 | miss | set 00 loads mem[0x00], mem[0x01] |
| 00000001 (01) | 00000 | 00 | 1 | hit | |
| 01100011 (63) | 01100 | 01 | 1 | miss | set 01 loads mem[0x62], mem[0x63] |
| 01100001 (61) | 01100 | 00 | 1 | miss | set 00 replaced by mem[0x60], mem[0x61] |
| 01100010 (62) | 01100 | 01 | 0 | hit | |
| 00000000 (00) | 00000 | 00 | 0 | miss (caused by conflict) | set 00 replaced by mem[0x00], mem[0x01] |
| 01100100 (64) | 01100 | 10 | 0 | miss | set 10 loads mem[0x64], mem[0x65] |

cache contents after all accesses:

| index | valid | tag | value |
| :--- | :---: | :---: | :---: |
| 00 | 1 | 00000 | mem[0x00] mem[0x01] |
| 01 | 1 | 01100 | mem[0x62] mem[0x63] |
| 10 | 1 | 01100 | mem[0x64] mem[0x65] |
| 11 | 0 | | |
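
The trace above can be replayed with a small simulator. This is only a sketch of a direct-mapped cache with 2-byte blocks and 4 sets that tracks valid bits and tags (not data); the names `struct cache`, `cache_access`, and `replay_example` are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define NUM_SETS 4      /* S = 4 sets, so s = 2 index bits */
#define BLOCK_SIZE 2    /* B = 2 bytes, so b = 1 offset bit */

struct cache_line {
    bool valid;
    uint8_t tag;
};

struct cache {
    struct cache_line sets[NUM_SETS];
};

/* Returns true on a hit; on a miss, loads the block (replacing the old one). */
bool cache_access(struct cache *c, uint8_t address) {
    assert(c != NULL);
    unsigned index = (address / BLOCK_SIZE) % NUM_SETS;
    uint8_t tag = address / (BLOCK_SIZE * NUM_SETS);
    struct cache_line *line = &c->sets[index];
    if (line->valid && line->tag == tag)
        return true;
    line->valid = true;
    line->tag = tag;
    return false;
}

/* Replays the seven accesses from the slide into results[];
   returns the number of hits. */
unsigned replay_example(bool results[7]) {
    static const uint8_t trace[7] = {0x00, 0x01, 0x63, 0x61, 0x62, 0x00, 0x64};
    struct cache c = {0};  /* all lines start invalid */
    unsigned hits = 0;
    for (int i = 0; i < 7; ++i) {
        results[i] = cache_access(&c, trace[i]);
        hits += results[i];
    }
    return hits;
}
```

Changing `BLOCK_SIZE` to 4 gives a way to check the 4-byte-block exercise that follows.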

## exercise

4 byte blocks, 4 sets

| address (binary, hex) | result |
| :--- | :--- |
| 00000000 (00) | |
| 00000001 (01) | |
| 01100011 (63) | |
| 01100001 (61) | |
| 01100010 (62) | |
| 00000000 (00) | |
| 01100100 (64) | |

| index | valid | tag | value |
| :---: | :---: | :---: | :---: |
| 00 | | | |
| 01 | | | |
| 10 | | | |
| 11 | | | |

how is the 8-bit address 61 (01100001) split up into tag/index/offset?
$b$ block offset bits; $B = 2^{b}$ byte block size
$s$ set index bits; $S = 2^{s}$ sets
$t = m - (s + b)$ tag bits (leftover)

exercise: which accesses are hits?

## cache accesses and C code (1)

```
int scaleFactor;
int scaleByFactor(int value) {
    return value * scaleFactor;
}
```

```
scaleByFactor:
    movl scaleFactor, %eax
    imull %edi, %eax
    ret
```

exercise: what data cache accesses does this function do?
4-byte read of scaleFactor
8-byte read of return address

## possible scaleFactor use

```
for (int i = 0; i < size; ++i) {
    array[i] = scaleByFactor(array[i]);
}
```


## misses and code (2)

```
scaleByFactor:
    movl scaleFactor, %eax
    imull %edi, %eax
    ret
```

suppose each time this is called in the loop:
return address located at address 0x7ffffffe43b8
scaleFactor located at address 0x6bc3a0

with direct-mapped 32 KB cache w/ 64 B blocks, what is their:

|  | return address | scaleFactor |
| :--- | :--- | :--- |
| tag | 0xfffffffc | 0xd7 |
| index | 0x10e | 0x10e |
| offset | 0x38 | 0x20 |
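
A sketch of the arithmetic behind that table, assuming a direct-mapped cache (so $C = B \times S$); the function names `set_index`, `tag_of`, and `conflicts` are made up for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Set index of an address in a direct-mapped cache. */
uint64_t set_index(uint64_t address, uint64_t cache_size, uint64_t block_size) {
    assert(block_size != 0 && cache_size % block_size == 0);
    uint64_t num_sets = cache_size / block_size;  /* direct-mapped: C = B * S */
    return (address / block_size) % num_sets;
}

/* Tag of an address: the bits above the index and offset bits. */
uint64_t tag_of(uint64_t address, uint64_t cache_size) {
    assert(cache_size != 0);
    return address / cache_size;
}

/* Two addresses replace each other's blocks in a direct-mapped cache when
   they map to the same set but have different tags. */
bool conflicts(uint64_t a, uint64_t b, uint64_t cache_size, uint64_t block_size) {
    return set_index(a, cache_size, block_size) == set_index(b, cache_size, block_size)
        && tag_of(a, cache_size) != tag_of(b, cache_size);
}
```

With `cache_size = 32768` and `block_size = 64`, `conflicts(0x7ffffffe43b8, 0x6bc3a0, ...)` holds, so each access in the function would replace the block the other one needs.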

## conflict miss coincidences?

obviously I set that up to have the same index
have to use exactly the right amount of stack space...
but this is one of the reasons we'll want something better than a direct-mapped cache

## C and cache misses (warmup 1)

```
int array[4];
int even_sum = 0, odd_sum = 0;
even_sum += array[0];
odd_sum += array[1];
even_sum += array[2];
odd_sum += array[3];
```

Assume everything but array is kept in registers (and the compiler does not do anything funny).

How many data cache misses on a 1-set direct-mapped cache with 8B blocks?

## some possibilities

Q1: how do cache blocks correspond to array elements?
not enough information provided!

## aside: alignment

compilers and malloc/new implementations usually try to align values
align = make address be a multiple of something
most important reason: don't cross cache block boundaries
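
A small sketch of what "crossing a cache block boundary" means for a value of a given size at a given address; `crosses_block_boundary` is a hypothetical helper, not something from the slides.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* True if the bytes [address, address + size) span more than one block. */
bool crosses_block_boundary(uint64_t address, uint64_t size, uint64_t block_size) {
    assert(size > 0 && block_size > 0);
    uint64_t first_block = address / block_size;
    uint64_t last_block = (address + size - 1) / block_size;
    return first_block != last_block;
}
```

With 64-byte blocks, a 4-byte value at a multiple of 4 never crosses a boundary, while the same value at an address like 0x103e would need two blocks.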

## C and cache misses (warmup 2)

```
int array[4];
int even_sum = 0, odd_sum = 0;
even_sum += array[0];
even_sum += array[2];
odd_sum += array[1];
odd_sum += array[3];
```

Assume everything but array is kept in registers (and the compiler does not do anything funny).

Assume array[0] at beginning of cache block.
How many data cache misses on a 1-set direct-mapped cache with 8B blocks?

## C and cache misses (warmup 3)

```
int array[8];
int even_sum = 0, odd_sum = 0;
even_sum += array[0];
odd_sum += array[1];
even_sum += array[2];
odd_sum += array[3];
even_sum += array[4];
odd_sum += array[5];
even_sum += array[6];
odd_sum += array[7];
```

Assume everything but array is kept in registers (and the compiler does not do anything funny), and array[0] at beginning of cache block.

How many data cache misses on a 2-set direct-mapped cache with 8B blocks?

## C and cache misses (warmup 4)

```
int array[8];
int even_sum = 0, odd_sum = 0;
even_sum += array[0];
even_sum += array[2];
even_sum += array[4];
even_sum += array[6];
odd_sum += array[1];
odd_sum += array[3];
odd_sum += array[5];
odd_sum += array[7];
```

Assume everything but array is kept in registers (and the compiler does not do anything funny).

How many data cache misses on a 2-set direct-mapped cache with 8B blocks?

## arrays and cache misses (1)

```
int array[1024]; // 4KB array
int even_sum = 0, odd_sum = 0;
for (int i = 0; i < 1024; i += 2) {
    even_sum += array[i + 0];
    odd_sum += array[i + 1];
}
```

Assume everything but array is kept in registers (and the compiler does not do anything funny).

How many data cache misses on initially empty 2 KB direct-mapped cache with 16B cache blocks?

## arrays and cache misses (2)

```
int array[1024]; // 4KB array
int even_sum = 0, odd_sum = 0;
for (int i = 0; i < 1024; i += 2)
    even_sum += array[i + 0];
for (int i = 0; i < 1024; i += 2)
    odd_sum += array[i + 1];
```

Assume everything but array is kept in registers (and the compiler does not do anything funny).

How many data cache misses on initially empty 2 KB direct-mapped cache with 16B cache blocks?

## arrays and cache misses (2b)

```
int array[1024]; // 4KB array
int even_sum = 0, odd_sum = 0;
for (int i = 0; i < 1024; i += 2)
    even_sum += array[i + 0];
for (int i = 0; i < 1024; i += 2)
    odd_sum += array[i + 1];
```

Assume everything but array is kept in registers (and the compiler does not do anything funny).

How many data cache misses on initially empty 4KB direct-mapped cache with 16B cache blocks?
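
One way to check answers to the last few array questions is to count misses with a tiny simulator. The sketch below assumes 4-byte `int`, an array starting at a cache block boundary (address 0 here), 16B blocks, and an initially empty direct-mapped cache; `struct dm_cache`, `dm_init`, `dm_access`, `misses_one_loop`, and `misses_two_loops` are all hypothetical names.

```c
#include <assert.h>
#include <stdint.h>

#define MAX_SETS 1024
#define EMPTY UINT64_MAX  /* marks a set that holds no block yet */

struct dm_cache {
    uint64_t block_size;
    uint64_t num_sets;
    uint64_t block_in_set[MAX_SETS];  /* block number held by each set */
    unsigned long misses;
};

void dm_init(struct dm_cache *c, uint64_t cache_size, uint64_t block_size) {
    assert(block_size != 0 && cache_size / block_size > 0
           && cache_size / block_size <= MAX_SETS);
    c->block_size = block_size;
    c->num_sets = cache_size / block_size;
    c->misses = 0;
    for (uint64_t i = 0; i < c->num_sets; ++i)
        c->block_in_set[i] = EMPTY;
}

/* Records one access; counts a miss if the set holds a different block. */
void dm_access(struct dm_cache *c, uint64_t address) {
    uint64_t block = address / c->block_size;
    uint64_t set = block % c->num_sets;
    if (c->block_in_set[set] != block) {
        c->block_in_set[set] = block;
        c->misses += 1;
    }
}

/* Single loop reading array[i + 0] and array[i + 1] together. */
unsigned long misses_one_loop(uint64_t cache_size) {
    struct dm_cache c;
    dm_init(&c, cache_size, 16);
    for (uint64_t i = 0; i < 1024; i += 2) {
        dm_access(&c, 4 * i);        /* array[i + 0] */
        dm_access(&c, 4 * (i + 1));  /* array[i + 1] */
    }
    return c.misses;
}

/* Two loops: all even elements first, then all odd elements. */
unsigned long misses_two_loops(uint64_t cache_size) {
    struct dm_cache c;
    dm_init(&c, cache_size, 16);
    for (uint64_t i = 0; i < 1024; i += 2)
        dm_access(&c, 4 * i);
    for (uint64_t i = 0; i < 1024; i += 2)
        dm_access(&c, 4 * (i + 1));
    return c.misses;
}
```

Comparing `misses_two_loops(2048)` with `misses_two_loops(4096)` shows how much the second pass depends on whether the whole array fits in the cache.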

## arrays and cache misses (3)

```
int sum; int array[1024]; // 4KB array
for (int i = 8; i < 1016; i += 1) {
    int local_sum = 0;
    for (int j = i - 8; j < i + 8; j += 1) {
        local_sum += array[i] * (j - i);
    }
    sum += (local_sum - array[i]);
}
```

Assume everything but array is kept in registers (and the compiler does not do anything funny).

How many data cache misses on initially empty 2 KB direct-mapped cache with 16B cache blocks?

## simulated misses: BST lookups



(simulated 16KB direct-mapped data cache; excluding BST setup)

## actual misses: BST lookups


(actual 32 KB more complex data cache)
(only one set of measurements + other things on machine + excluding initial load)

## simulated misses: matrix multiplies


(simulated 16KB direct-mapped data cache; excluding initial load)

## actual misses: matrix multiplies


(actual 32 KB more complex data cache; excluding matrix initial load) (only one set of measurements + other things on machine)

## misses with skipping

int array1[512]; int array2[512];
for (int i = 0; i < 512; i += 1) {
    sum += array1[i] * array2[i];
}
Assume everything but array1, array2 is kept in registers (and the compiler does not do anything funny).

About how many data cache misses on a 2 KB direct-mapped cache with 16 B cache blocks?
Hint: depends on relative placement of array1, array2

## best/worst case

array1[i] and array2[i] always in different sets:
= distance from array1 to array2 is not a multiple of (# sets $\times$ bytes/set)
2 misses every 4 i
blocks of 4 array1[X] values loaded, then used 4 times before loading next block (and same for array2[X])

array1[i] and array2[i] in the same sets:
= distance from array1 to array2 is a multiple of (# sets $\times$ bytes/set)
2 misses every i
block of 4 array1[X] values loaded, one value used from it, then block of 4 array2[X] values replaces it, one value used from it, ...

## worst case in practice?

two rows of matrix?
often sizeof(row) bytes apart
if the row size is a multiple of (number of sets $\times$ bytes per block), oops!

## adding associativity

2-way set associative, 2 byte blocks, 2 sets

| index | valid | tag | value | valid | tag | value |
| :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| 0 | 0 |  |  | 0 |  |  |
| 1 | 0 |  |  | 0 |  |  |

multiple places to put values with same index
avoid misses from two active values using same set ("conflict misses")

## adding associativity

2-way set associative, 2 byte blocks, 2 sets

| index | valid | tag | value | valid | tag | value |
| :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| 0 (set 0) | 0 |  |  | 0 |  |  |
| 1 (set 1) | 0 |  |  | 0 |  |  |

## adding associativity

2-way set associative, 2 byte blocks, 2 sets

| index | valid | tag | value | valid | tag | value |
| :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| 0 | 0 |  |  | 0 |  |  |
| 1 | 0 |  |  | 0 |  |  |

way 0: first valid/tag/value columns; way 1: second valid/tag/value columns

## adding associativity

2-way set associative, 2 byte blocks, 2 sets

| index | valid | tag | value | valid | tag | value |
| :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| 0 | 0 |  |  | 0 |  |  |
| 1 | 0 |  |  | 0 |  |  |

$m=8$ bit addresses
$S=2=2^{s}$ sets
$s=1$ (set) index bits
$B=2=2^{b}$ byte block size
$b=1$ (block) offset bits
$t=m-(s+b)=6$ tag bits

## adding associativity

2-way set associative, 2 byte blocks, 2 sets

| index | valid | tag | value | valid | tag | value |
| :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| 0 | 1 | 000000 | mem[0x00]<br>mem[0x01] | 0 |  |  |
| 1 | 0 |  |  | 0 |  |  |


| address (hex) | result |
| :--- | :--- |
| $00000000(00)$ | miss |
| $00000001(01)$ |  |
| $01100011(63)$ |  |
| $01100001(61)$ |  |
| $01100010(62)$ |  |
| $00000000(00)$ |  |
| $01100100(64)$ |  |
| tag indexoffset |  |

## adding associativity

2-way set associative, 2 byte blocks, 2 sets

| index | valid | tag | value | valid | tag | value |
| :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| 0 | 1 | 000000 | mem[0x00]<br>mem[0x01] | 0 |  |  |
| 1 | 0 |  |  | 0 |  |  |


| address (hex) | result |
| :--- | :--- |
| $00000000(00)$ | miss |
| $00000001(01)$ | hit |
| $01100011(63)$ |  |
| $01100001(61)$ |  |
| $01100010(62)$ |  |
| $00000000(00)$ |  |
| $01100100(64)$ |  |
| tag indexoffset |  |

## adding associativity

2-way set associative, 2 byte blocks, 2 sets

| index | valid | tag | value | valid | tag | value |
| :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| 0 | 1 | 000000 | mem[0x00]<br>mem[0x01] | 0 |  |  |
| 1 | 1 | 011000 | mem[0x62]<br>mem[0x63] | 0 |  |  |


| address (hex) | result |
| :--- | :--- |
| $00000000(00)$ | miss |
| $00000001(01)$ | hit |
| $01100011(63)$ | miss |
| $01100001(61)$ |  |
| $01100010(62)$ |  |
| $00000000(00)$ |  |
| $01100100(64)$ |  |
| tag indexoffset |  |

## adding associativity

2-way set associative, 2 byte blocks, 2 sets

| index | valid | tag | value | valid | tag | value |
| :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| 0 | 1 | 000000 | mem[0x00]<br>mem[0x01] | 1 | 011000 | mem[0x60]<br>mem[0x61] |
| 1 | 1 | 011000 | mem[0x62]<br>mem[0x63] | 0 |  |  |


| address (hex) | result |
| :--- | :--- |
| $00000000(00)$ | miss |
| $00000001(01)$ | hit |
| $01100011(63)$ | miss |
| $01100001(61)$ | miss |
| $01100010(62)$ |  |
| $00000000(00)$ |  |
| $01100100(64)$ |  |
| tag indexoffset |  |

## adding associativity

2-way set associative, 2 byte blocks, 2 sets

| index | valid | tag | value | valid | tag | value |
| :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| 0 | 1 | 000000 | mem[0x00]<br>mem[0x01] | 1 | 011000 | mem[0x60]<br>mem[0x61] |
| 1 | 1 | 011000 | mem[0x62]<br>mem[0x63] | 0 |  |  |


| address (hex) | result |
| :--- | :--- |
| $00000000(00)$ | miss |
| $00000001(01)$ | hit |
| $01100011(63)$ | miss |
| $01100001(61)$ | miss |
| $01100010(62)$ | hit |
| $00000000(00)$ |  |
| $01100100(64)$ |  |
| tag indexoffset |  |

## adding associativity

2-way set associative, 2 byte blocks, 2 sets

| index | valid | tag | value | valid | tag | value |
| :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| 0 | 1 | 000000 | mem[0x00]<br>mem[0x01] | 1 | 011000 | mem[0x60]<br>mem[0x61] |
| 1 | 1 | 011000 | mem[0x62]<br>mem[0x63] | 0 |  |  |


| address (hex) | result |
| :--- | :--- |
| $00000000(00)$ | miss |
| $00000001(01)$ | hit |
| $01100011(63)$ | miss |
| $01100001(61)$ | miss |
| $01100010(62)$ | hit |
| $00000000(00)$ | hit |
| $01100100(64)$ |  |
| tag indexoffset |  |

## adding associativity

2-way set associative, 2 byte blocks, 2 sets

| index | valid | tag | value | valid | tag | value |
| :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| 0 | 1 | 000000 | mem[0x00]<br>mem[0x01] | 1 | 011000 | mem[0x60]<br>mem[0x61] |
| 1 | 1 | 011000 | mem[0x62]<br>mem[0x63] | 0 |  |  |


| address (hex) | result |
| :--- | :--- |
| $00000000(00)$ | miss |
| $00000001(01)$ | hit |
| $01100011(63)$ | miss |
| $01100001(61)$ | miss |
| $01100010(62)$ | hit |
| $00000000(00)$ | hit |
| $01100100(64)$ | miss (need to replace block in set 0!) |
| tag index offset |  |

## adding associativity

2-way set associative, 2 byte blocks, 2 sets

| index | valid | tag | value | valid | tag | value |
| :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| 0 | 1 | 000000 | mem[0x00]<br>mem[0x01] | 1 | 011000 | mem[0x60]<br>mem[0x61] |
| 1 | 1 | 011000 | mem[0x62]<br>mem[0x63] | 0 |  |  |


| address (hex) | result |
| :--- | :--- |
| $00000000(00)$ | miss |
| $00000001(01)$ | hit |
| $01100011(63)$ | miss |
| $01100001(61)$ | miss |
| $01100010(62)$ | hit |
| $00000000(00)$ | hit |
| $01100100(64)$ | miss |

(address bits, left to right: tag, index, offset)

## cache operation (associative)

111001


## associative lookup possibilities

none of the blocks for the index are valid
none of the valid blocks for the index match the tag
(something else is stored there)
one of the blocks for the index is valid and matches the tag

## replacement policies

2-way set associative, 2 byte blocks, 2 sets


## example replacement policies

least recently used
take advantage of temporal locality
at least $\left\lceil\log _{2}(E!)\right\rceil$ bits per set for $E$-way cache (need to store order of all blocks)
approximations of least recently used
implementing least recently used is expensive
really just need "avoid recently used" - much faster/simpler
good approximations: $E$ to $2 E$ bits
first-in, first-out
counter per set - where to replace next
(pseudo-)random
no extra information!
actually works pretty well in practice

## associativity terminology

direct-mapped - one block per set
$E$-way set associative - $E$ blocks per set
$E$ ways in the cache
fully associative - one set total (everything in one set)

## Tag-Index-Offset formulas

$m \quad$ memory address bits
$E \quad$ number of blocks per set ("ways")
$S=2^{s} \quad$ number of sets
$s \quad$ (set) index bits
$B=2^{b} \quad$ block size
$b \quad$ (block) offset bits
$t=m-(s+b) \quad$ tag bits
$C=B \times S \times E \quad$ cache size (excluding metadata)

## Tag-Index-Offset exercise

$m \quad$ memory address bits
$E \quad$ number of blocks per set ("ways")
$S=2^{s} \quad$ number of sets
$s \quad$ (set) index bits
$B=2^{b} \quad$ block size
$b \quad$ (block) offset bits
$t=m-(s+b) \quad$ tag bits
$C=B \times S \times E \quad$ cache size (excluding metadata)

My desktop:
L1 Data Cache: 32 KB, 8 blocks/set, 64 byte blocks
L2 Cache: 256 KB, 4 blocks/set, 64 byte blocks
L3 Cache: 8 MB, 16 blocks/set, 64 byte blocks
Divide the address 0x34567 into tag, index, offset for each cache.

## T-I-O exercise: L1

## T-I-O results

## T-I-O: splitting

## misses with skipping

int array1[512]; int array2[512];
for (int i = 0; i < 512; i += 1) {
    sum += array1[i] * array2[i];
}
Assume everything but array1, array2 is kept in registers (and the compiler does not do anything funny).

About how many data cache misses on a 2 KB direct-mapped cache with 16 B cache blocks?
Hint: depends on relative placement of array1, array2
How about on a two-way set associative cache?

## arrays and cache misses (2)

```
int array[1024]; // 4KB array
int even_sum = 0, odd_sum = 0;
for (int i = 0; i < 1024; i += 2)
    even_sum += array[i + 0];
for (int i = 0; i < 1024; i += 2)
    odd_sum += array[i + 1];
```

Assume everything but array is kept in registers (and the compiler does not do anything funny).

How many data cache misses on initially empty 2 KB direct-mapped cache with 16B cache blocks?
Would a set-associative cache be better?

## simulated misses: BST lookups



## simulated misses: matrix multiplies



## handling writes

what about writing to the cache?
two decision points:
if the value is not in cache, do we add it?
if yes: need to load rest of block
if no: missing out on locality?
if value is in cache, when do we update next level?
if immediately: extra writing
if later: need to remember to do so

## allocate on write?

processor writes less than whole cache block
block not yet in cache
two options:

write-allocate
fetch rest of cache block, replace written part
(then follow write-through or write-back policy)

write-no-allocate
don't use cache at all (send write to memory instead)
guess: not read soon?

## write-through v. write-back

option 1: write-through
option 2: write-back

## writeback policy

2-way set associative, 2 byte blocks, 2 sets

| index | valid | tag | value | dirty | valid | tag | value | dirty | LRU |
| :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| 0 | 1 | 000000 | mem[0x00]<br>mem[0x01] | 0 | 1 | 011000 | mem[0x60]\*<br>mem[0x61]\* | 1 | 1 |
| 1 | 1 | 011000 | mem[0x62]<br>mem[0x63] | 0 | 0 |  |  |  | 0 |

\* = changed value!
dirty = 1: different than memory, needs to be written if evicted

## write-allocate + write-back

2-way set associative, LRU, writeback

| index | valid | tag | value | dirty | valid | tag | value | dirty | LRU |
| :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| 0 | 1 | 000000 | mem[0x00]<br>mem[0x01] | 0 | 1 | 011000 | mem[0x60]\*<br>mem[0x61]\* | 1 | 1 |
| 1 | 1 | 011000 | mem[0x62]<br>mem[0x63] | 0 | 0 |  |  |  | 0 |

writing 0xFF into address 0x04?
index 0, tag 000001

## write-allocate + write-back

2-way set associative, LRU, writeback

| index | valid | tag | value | dirty | valid | tag | value | dirty | LRU |
| :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| 0 | 1 | 000000 | mem[0x00]<br>mem[0x01] | 0 | 1 | 011000 | mem[0x60]\*<br>mem[0x61]\* | 1 | 1 |
| 1 | 1 | 011000 | mem[0x62]<br>mem[0x63] | 0 | 0 |  |  |  | 0 |

writing 0xFF into address 0x04?
index 0, tag 000001
step 1: find least recently used block

## write-allocate + write-back

2-way set associative, LRU, writeback

| index | valid | tag | value | dirty | valid | tag | value | dirty | LRU |
| :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| 0 | 1 | 000000 | mem[0x00]<br>mem[0x01] | 0 | 1 | 011000 | mem[0x60]\*<br>mem[0x61]\* | 1 | 1 |
| 1 | 1 | 011000 | mem[0x62]<br>mem[0x63] | 0 | 0 |  |  |  | 0 |

writing 0xFF into address 0x04?
index 0, tag 000001
step 1: find least recently used block
step 2: possibly writeback old block

## write-allocate + write-back

2-way set associative, LRU, writeback

| index | valid | tag | value | dirty | valid | tag | value | dirty | LRU |
| :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| 0 | 1 | 000000 | mem[0x00]<br>mem[0x01] | 0 | 1 | 000001 | 0xFF<br>mem[0x05] | 1 | 0 |
| 1 | 1 | 011000 | mem[0x62]<br>mem[0x63] | 0 | 0 |  |  |  | 0 |

writing 0xFF into address 0x04?
index 0, tag 000001
step 1: find least recently used block
step 2: possibly writeback old block
step 3a: read in new block - to get mem[0x05]
step 3b: update LRU information

## write-no-allocate + write-back

2-way set associative, LRU, writeback

| index | valid | tag | value | dirty | valid | tag | value | dirty | LRU |
| :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| 0 | 1 | 000000 | mem[0x00]<br>mem[0x01] | 0 | 1 | 011000 | mem[0x60]\*<br>mem[0x61]\* | 1 | 1 |
| 1 | 1 | 011000 | mem[0x62]<br>mem[0x63] | 0 | 0 |  |  |  | 0 |

writing 0xFF into address 0x04?
step 1: is it in cache yet?
step 2: no, just send it to memory

## exercise (1)

2-way set associative, LRU, write-allocate, writeback

| index | valid | tag | value | dirty | valid | tag | value | dirty | LRU |
| :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| 0 | 1 | 001100 | mem[0x30]<br>mem[0x31] | 0 | 1 | 010000 | mem[0x40]\*<br>mem[0x41]\* | 1 | 0 |
| 1 | 1 | 011000 | mem[0x62]<br>mem[0x63] | 0 | 1 | 001100 | mem[0x32]\*<br>mem[0x33]\* | 1 | 1 |

for each of the following accesses, performed alone, would it require (a) reading a value from memory (or next level of cache) and (b) writing a value to the memory (or next level of cache)?
writing 1 byte to 0x33
reading 1 byte from 0x52
reading 1 byte from 0x50

## exercise (2)

2-way set associative, LRU, write-no-allocate, write-through

| index | valid | tag | value | valid | tag | value | LRU |
| :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| 0 | 1 | 001100 | mem[0x30]<br>mem[0x31] | 1 | 010000 | mem[0x40]<br>mem[0x41] | 0 |
| 1 | 1 | 011000 | mem[0x62]<br>mem[0x63] | 1 | 001100 | mem[0x32]<br>mem[0x33] | 1 |

for each of the following accesses, performed alone, would it require (a) reading a value from memory and (b) writing a value to the memory?
writing 1 byte to 0x33
reading 1 byte from 0x52
reading 1 byte from 0x50

## fast writes


write appears to complete immediately when placed in buffer
memory can be much slower

## cache miss types

common to categorize misses:
roughly "cause" of miss, assuming cache block size fixed
compulsory (or cold) - first time accessing something
adding more sets or blocks/set wouldn't change
conflict - sets aren't big/flexible enough
a fully-associative (1-set) cache of the same size would have done better
capacity - cache was not big enough
coherence - from sync'ing cache with other caches
only an issue with multiple cores

## making any cache look bad

1. access enough blocks, to fill the cache
2. access an additional block, replacing something
3. access last block replaced
4. access last block replaced
5. access last block replaced
but - typical real programs have locality

## cache optimizations

(assuming typical locality + keeping cache size constant if possible...)

|  | miss rate | hit time | miss penalty |
| :--- | :---: | :---: | :---: |
| increase cache size | better | worse | - |
| increase associativity | better | worse | worse? |
| increase block size | depends | worse | worse |
| add secondary cache | - | - | better |
| write-allocate | better | - | ? |
| writeback | - | - | ? |
| LRU replacement | better | ? | worse? |
| prefetching | better |  |  |

prefetching = guess what program will use, access in advance
average time = hit time + miss rate $\times$ miss penalty


## cache optimizations by miss type

(assuming other listed parameters remain constant)
capacity conflict
fewer misses fewer misses
more misses?
increase cache size fewer misses increase associativity increase block size

LRU replacement prefetching
fewer misses
more misses?
fewer misses
-
compulsory
fewer misses
fewer misses

## another view



## two-level page table lookup

page table
base register
virtual address
11010101001011000011011111 .

memory (really cache)

## cache accesses and multi-level PTs

four-level page tables - five cache accesses per program memory access

L1 cache hits - typically a couple cycles each?
so add 8 cycles to each program memory access?
not acceptable

## program memory active sets



0xFFFF FFFF FFFF FFFF
0xFFFF 8000 0000 0000
0x7F...
0x0000 0000 0040 0000
small areas of memory active at a time
one or two pages in each area?


## page table entries and locality

page table entries have excellent temporal locality
typically one or two pages of the stack active
typically one or two pages of code active
typically one or two pages of heap/globals active
each page contains whole functions, arrays, stack frames, etc.

needed page table entries are very small

## page table entry cache

called a TLB (translation lookaside buffer)
very small cache of page table entries

| L1 cache | TLB |
| :--- | :--- |
| physical addresses | virtual page numbers |
| bytes from memory | page table entries |
| tens of bytes per block | one page table entry per block |
| usually thousands of blocks | usually tens of entries |

(generally) just entries from the last-level page tables

not much spatial locality between page table entries
(they're used for kilobytes of data already)
(and if spatial locality, maybe use larger page size?)

few active page table entries at a time
enables highly associative cache designs

## TLB and multi-level page tables

TLB caches valid last-level page table entries
doesn't matter which last-level page table
means TLB output can be used directly to form address

## TLB and two-level lookup

page table base register
virtual address
TLB miss


## TLB organization (2-way set associative)

$\overbrace{11100010110}^{\text {VPN page offset }}$ (program address)


## address splitting for TLBs (1)

my desktop:
4KB ( $2^{12}$ byte) pages; 48-bit virtual address
64-entry, 4-way L1 data TLB

TLB index bits?
TLB tag bits?

## address splitting for TLBs (2)

my desktop:
4KB ( $2^{12}$ byte) pages; 48-bit virtual address
1536-entry $\left(3 \cdot 2^{9}\right), 12$-way L2 TLB

TLB index bits?
TLB tag bits?

## exercise: TLB access pattern (setup)

4-entry, 2-way TLB, LRU replacement policy, initially empty
4096 byte pages
how many index bits?
TLB index of virtual address 0x12345?

## exercise: TLB access pattern

4-entry, 2-way TLB, LRU replacement policy, initially empty
4096 byte pages

| type | virtual | physical |
| :--- | :--- | :--- |
| read | 0x440030 | 0x554030 |
| write | 0x440034 | 0x554034 |
| read | 0x7FFFE008 | 0x556008 |
| read | 0x7FFFE000 | 0x556000 |
| read | 0x7FFFDFF8 | 0x5F8FF8 |
| read | 0x664080 | 0x5F9080 |
| read | 0x440038 | 0x554038 |
| write | 0x7FFFDFF0 | 0x5F8FF0 |

which are TLB hits? which are TLB misses? final contents of TLB?

## changing page tables

what happens to TLB when page table base pointer is changed?
e.g. context switch
most entries in TLB refer to things from wrong process
oops - read from the wrong process's stack?

option 1: invalidate all TLB entries
side effect on "change page table base register" instruction

option 2: TLB entries contain process ID
set by OS (special register)
checked by TLB in addition to TLB tag, valid bit

## editing page tables

what happens to TLB when OS changes a page table entry?
most common choice: has to be handled in software

invalid to valid - nothing needed
TLB doesn't contain invalid entries
MMU will check memory again
valid to invalid - OS needs to tell processor to invalidate it
special instruction (x86: invlpg)
valid to other valid - OS needs to tell processor to invalidate it

## backup slides

## inclusive versus exclusive

L2 inclusive of L1
everything in L1 cache duplicated in L2
adding to L1 also adds to L2

L2 cache


## L2 exclusive of L1

L2 contains different data than L1
adding to L1 must remove from L2
probably evicting from L1 adds to L2

L2 cache


## inclusive versus exclusive

inclusive policy:
no extra work on eviction
but duplicated data
easier to explain when L$k$ shared by multiple L$(k-1)$ caches?

## inclusive versus exclusive

exclusive policy:
avoid duplicated data
sometimes called victim cache (contains cache eviction victims)
makes less sense with multicore

## Tag-Index-Offset formulas (direct-mapped)

(formulas derivable from prior slides)
$S=2^{s} \quad$ number of sets
$s \quad$ (set) index bits
$B=2^{b} \quad$ block size
$b \quad$ (block) offset bits
$m \quad$ memory address bits
$t=m-(s+b) \quad$ tag bits
$C=B \times S \quad$ cache size (if direct-mapped)

## backup slides - cache performance

## average memory access time

AMAT $=$ hit time + miss penalty $\times$ miss rate
or $\mathrm{AMAT}=$ hit time $\times$ hit rate + miss time $\times$ miss rate
effective speed of memory

## AMAT exercise (1)

$90 \%$ cache hit rate
hit time is 2 cycles
30 cycle miss penalty
what is the average memory access time?
suppose we could increase hit rate by increasing its size, but it would increase the hit time to 3 cycles
how much do we have to increase the hit rate for this to not increase AMAT?

## exercise: AMAT and multi-level caches

suppose we have L1 cache with
3 cycle hit time
90\% hit rate
and an L2 cache with
10 cycle hit time
$80 \%$ hit rate (for accesses that make it this far)
(assume all accesses come via this L1)
and main memory has a 100 cycle access time
assume when there's a cache miss, the next level access starts after the hit time
e.g. an access that misses in L1 and hits in L2 will take $10+3$ cycles
what is the average memory access time for the L1 cache?

## approximate miss analysis

very tedious to precisely count cache misses
even more tedious when we take advanced cache optimizations into account
instead, approximations:
good or bad temporal/spatial locality
good temporal locality: value stays in cache
good spatial locality: use all parts of cache block
with nested loops: what does inner loop use?
intuition: values used in inner loop loaded into cache once
(that is, once each time the inner loop is run)
...if they can all fit in the cache

## locality exercise (1)

```
/* version 1 */
for (int i = 0; i < N; ++i)
    for (int j = 0; j < N; ++j)
        A[i] += B[j] * C[i * N + j];
/* version 2 */
for (int j = 0; j < N; ++j)
    for (int i = 0; i < N; ++i)
        A[i] += B[j] * C[i * N + j];
```

exercise: which has better temporal locality in $A$ ? in $B$ ? in $C$ ? how about spatial locality?

## exercise: miss estimating (1)

```
for (int i = 0; i < N; ++i)
    for (int j = 0; j < N; ++j)
        A[i] += B[j] * C[i * N + j];
```

Assume: 4 array elements per block, N very large, nothing in cache at beginning.

Example: $N / 4$ estimated misses for A accesses:
`A[i]` should always hit on all but the first iteration of the inner-most loop.
first iteration: `A[i]` should hit about $3/4$ of the time (same block as `A[i-1]` that often)

Exercise: estimate \# of misses for B, C

## a note on matrix storage

$A$: $N \times N$ matrix
represent as array
makes dynamic sizes easier:

```
float A_2d_array[N][N];
float *A_flat = malloc(N * N * sizeof(float));
```

`A_flat[i * N + j]` === `A_2d_array[i][j]`
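
The equivalence between the two layouts can be checked directly. The following is a small sketch, assuming a fixed illustrative size `N = 4`; the helper names `flat_index` and `layouts_agree` are hypothetical and not from the slides.

```c
#include <assert.h>
#include <stdlib.h>

enum { N = 4 };  /* illustrative size, not from the slides */

/* Flat index of row i, column j in a row-major N x N matrix. */
size_t flat_index(size_t i, size_t j) {
    assert(i < N && j < N);
    return i * N + j;
}

/* Returns 1 if a 2D array and a flat array filled the same way agree
 * element by element, and each 2D element sits at the offset (counted
 * in floats from the start of the array) that flat_index predicts. */
int layouts_agree(void) {
    float a_2d[N][N];
    float *a_flat = malloc(N * N * sizeof(float));
    int ok = (a_flat != NULL);
    for (size_t i = 0; ok && i < N; ++i) {
        for (size_t j = 0; j < N; ++j) {
            a_2d[i][j] = (float)(10 * i + j);
            a_flat[flat_index(i, j)] = (float)(10 * i + j);
        }
    }
    for (size_t i = 0; ok && i < N; ++i) {
        for (size_t j = 0; j < N; ++j) {
            size_t byte_offset = (size_t)((char *)&a_2d[i][j] - (char *)a_2d);
            if (byte_offset / sizeof(float) != flat_index(i, j)) ok = 0;
            if (a_2d[i][j] != a_flat[flat_index(i, j)]) ok = 0;
        }
    }
    free(a_flat);
    return ok;
}
```

The flat form is the one used in the rest of these slides because its size can be chosen at run time.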

## convention re: rows/columns

going to call the first index rows
$A_{i, j}$ is A row i, column j
rows are stored together
this is an arbitrary choice

## $5 \times 5$ array and 4-element cache blocks

| `array[0*5+0]` | `array[0*5+1]` | `array[0*5+2]` | `array[0*5+3]` | `array[0*5+4]` |
| :--- | :--- | :--- | :--- | :--- |
| `array[1*5+0]` | `array[1*5+1]` | `array[1*5+2]` | `array[1*5+3]` | `array[1*5+4]` |
| `array[2*5+0]` | `array[2*5+1]` | `array[2*5+2]` | `array[2*5+3]` | `array[2*5+4]` |
| `array[3*5+0]` | `array[3*5+1]` | `array[3*5+2]` | `array[3*5+3]` | `array[3*5+4]` |
| `array[4*5+0]` | `array[4*5+1]` | `array[4*5+2]` | `array[4*5+3]` | `array[4*5+4]` |

if array starts on a cache block boundary:
first cache block $=$ first 4 elements, all together in one row!

second cache block:
1 from row 0
3 from row 1

generally: cache blocks contain data from 1 or 2 rows $\rightarrow$ better performance from reusing rows
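
The "1 or 2 rows" observation can be reproduced with a little index arithmetic. This is a sketch under the same assumptions as the slides (5 columns, 4 elements per block, array starting exactly at a block boundary); the function names are invented for illustration.

```c
#include <assert.h>

enum { COLS = 5, BLOCK_ELEMS = 4 };  /* the 5x5 array and 4-element blocks */

/* Row that flat element index e belongs to (row-major layout). */
int row_of(int e) { return e / COLS; }

/* First and last row touched by cache block number b, assuming the
 * array starts exactly at a cache block boundary. */
int first_row_of_block(int b) { return row_of(b * BLOCK_ELEMS); }
int last_row_of_block(int b)  { return row_of(b * BLOCK_ELEMS + BLOCK_ELEMS - 1); }

/* Number of elements of block b that fall in the first row it touches. */
int elems_in_first_row(int b) {
    assert(b >= 0);
    int start = b * BLOCK_ELEMS;
    int row_end = (row_of(start) + 1) * COLS;  /* one past the row's last element */
    int in_row = row_end - start;
    return in_row < BLOCK_ELEMS ? in_row : BLOCK_ELEMS;
}
```

For block 1 this gives one element from row 0 and the remaining three from row 1, matching the slide. The last block (block 6) runs past the end of the 25-element array, so the helpers are only meaningful for blocks 0 through 5.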

## matrix multiply

$$
C_{i j}=\sum_{k=1}^{n} A_{i k} \times B_{k j}
$$

```
/* version 1: inner loop is k, middle is j */
for (int i = 0; i < N; ++i)
    for (int j = 0; j < N; ++j)
        for (int k = 0; k < N; ++k)
            C[i * N + j] += A[i * N + k] * B[k * N + j];
/* version 2: outer loop is k, middle is i */
for (int k = 0; k < N; ++k)
    for (int i = 0; i < N; ++i)
        for (int j = 0; j < N; ++j)
            C[i * N + j] += A[i * N + k] * B[k * N + j];
```

## loop orders and locality

loop body: $C_{ij} += A_{ik} B_{kj}$
$kij$ order: $C_{ij}$, $B_{kj}$ have spatial locality
$kij$ order: $A_{ik}$ has temporal locality
... better than ...
$ijk$ order: $A_{ik}$ has spatial locality
$ijk$ order: $C_{ij}$ has temporal locality
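
Changing the loop order only changes the order of memory accesses, not the result. The sketch below wraps the two loop nests in functions and compares their output on a small test matrix; the wrapper names, the test size `TEST_N`, and the fill pattern are assumptions made for illustration.

```c
#include <assert.h>
#include <string.h>

/* version 1: i outer, j middle, k inner */
void matmul_ijk(int n, const double *A, const double *B, double *C) {
    assert(n > 0);
    for (int i = 0; i < n; ++i)
        for (int j = 0; j < n; ++j)
            for (int k = 0; k < n; ++k)
                C[i * n + j] += A[i * n + k] * B[k * n + j];
}

/* version 2: k outer, i middle, j inner */
void matmul_kij(int n, const double *A, const double *B, double *C) {
    assert(n > 0);
    for (int k = 0; k < n; ++k)
        for (int i = 0; i < n; ++i)
            for (int j = 0; j < n; ++j)
                C[i * n + j] += A[i * n + k] * B[k * n + j];
}

/* Hypothetical helper: runs both orders on small test matrices and
 * returns 1 if every element of the two results matches. */
int orders_agree(void) {
    enum { TEST_N = 8 };
    double A[TEST_N * TEST_N], B[TEST_N * TEST_N];
    double C1[TEST_N * TEST_N], C2[TEST_N * TEST_N];
    for (int x = 0; x < TEST_N * TEST_N; ++x) {
        A[x] = (double)(x % 7);        /* small integers: exact in double */
        B[x] = (double)((x * 3) % 5);
    }
    memset(C1, 0, sizeof C1);
    memset(C2, 0, sizeof C2);
    matmul_ijk(TEST_N, A, B, C1);
    matmul_kij(TEST_N, A, B, C2);
    return memcmp(C1, C2, sizeof C1) == 0;
}
```

Both functions accumulate into `C`, so the caller is expected to zero `C` first, as `orders_agree` does.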

## which is better?

$$
C_{i j}=\sum_{k=1}^{n} A_{i k} \times B_{k j}
$$

```
/* version 1: inner loop is k, middle is j */
for (int i = 0; i < N; ++i)
    for (int j = 0; j < N; ++j)
        for (int k = 0; k < N; ++k)
            C[i * N + j] += A[i * N + k] * B[k * N + j];
/* version 2: outer loop is k, middle is i */
for (int k = 0; k < N; ++k)
    for (int i = 0; i < N; ++i)
        for (int j = 0; j < N; ++j)
            C[i * N + j] += A[i * N + k] * B[k * N + j];
```

exercise: Which version has better spatial/temporal locality for...
... accesses to C?
... accesses to A?
... accesses to B?

## array usage: $i j k$ order

for all $i$:
for all $j$:
for all $k$:

$$
C_{i j}+=A_{i k} \times B_{k j}
$$

looking only at innermost loop:
temporal locality in C
bad temporal locality in everything else (everything accessed exactly once)

looking only at innermost loop:
row of A (elements used once)
column of B (elements used once)
single element of C (used many times)

looking only at two innermost loops together:
some temporal locality in A (column reused)
some temporal locality in B (row reused)
some temporal locality in C (row reused)

## array usage: kij order


for all $k$ :
for all $i$ :
for all $j$ :

$$
C_{i j}+=A_{i k} \times B_{k j}
$$


if $N$ large:
using $C_{ij}$ once per load into cache (but using $C_{i, j+1}$ right after)
using $A_{ik}$ many times per load into cache
using $B_{kj}$ once per load into cache (but using $B_{k, j+1}$ right after)

looking only at innermost loop:
spatial locality in B, C (use most of loaded B, C cache blocks)
no useful spatial locality in A (rest of A's cache block wasted)

looking only at innermost loop:
temporal locality in A
no temporal locality in B, C ($B$, $C$ values used exactly once)

looking only at innermost loop, processing:
one element of A (used many times)
row of B (each element used once)
row of C (each element used once)

looking only at two innermost loops together (for all $i$, for all $j$):
good temporal locality in A (column reused)
good temporal locality in B (row reused)
bad temporal locality in C (nothing reused)

## performance (with $A=B$ )



## alternate view 1: cycles/instruction



## alternate view 2: cycles/operation



## counting misses: version 1

```
for (int i = 0; i < N; ++i)
    for (int j = 0; j < N; ++j)
        for (int k = 0; k < N; ++k)
            C[i * N + j] += A[i * N + k] * B[k * N + j];
```

if $N$ really large
assumption: can't get close to storing $N$ values in cache at once
for A: about $N \div$ block size misses per k-loop
total misses: $N^{3} \div$ block size
for B: about $N$ misses per k-loop
total misses: $N^{3}$
for C : about $1 \div$ block size miss per k -loop
total misses: $N^{2} \div$ block size

## counting misses: version 2

```
for (int k = 0; k < N; ++k)
    for (int i = 0; i < N; ++i)
        for (int j = 0; j < N; ++j)
            C[i * N + j] += A[i * N + k] * B[k * N + j];
```

for A: about 1 miss per j-loop
total misses: $N^{2}$
for B: about $N \div$ block size misses per j-loop
total misses: $N^{3} \div$ block size
for C: about $N \div$ block size misses per j-loop
total misses: $N^{3} \div$ block size
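
These estimates can be sanity-checked with a toy cache simulation. The sketch below is not a model of any real L1: it assumes a tiny fully associative LRU cache (so there are no conflict misses), illustrative sizes `N = 64`, 4 elements per block, and 8 blocks of capacity, and it places A, B, and C back to back in one index space. All names are made up for this example.

```c
#include <assert.h>

enum {
    N = 64,            /* matrix dimension (illustrative) */
    BLOCK_ELEMS = 4,   /* array elements per cache block */
    CACHE_BLOCKS = 8   /* toy capacity: less than one row (16 blocks) */
};

static long cached_block[CACHE_BLOCKS];  /* block number held by each slot */
static long last_used[CACHE_BLOCKS];     /* time of most recent use */
static long now;

static void cache_reset(void) {
    for (int s = 0; s < CACHE_BLOCKS; ++s) {
        cached_block[s] = -1;   /* empty slot */
        last_used[s] = 0;
    }
    now = 0;
}

/* Simulates one access to element index elem; returns 1 on a miss. */
static int cache_access(long elem) {
    long block = elem / BLOCK_ELEMS;
    int victim = 0;
    ++now;
    for (int s = 0; s < CACHE_BLOCKS; ++s) {
        if (cached_block[s] == block) { last_used[s] = now; return 0; }
        if (last_used[s] < last_used[victim]) victim = s;
    }
    cached_block[victim] = block;   /* replace the least recently used slot */
    last_used[victim] = now;
    return 1;
}

/* Counts misses for A, B, C into misses[0], misses[1], misses[2].
 * kij_order == 0 simulates version 1 (ijk); nonzero simulates version 2 (kij). */
void count_misses(int kij_order, long misses[3]) {
    assert(misses != 0);
    cache_reset();
    misses[0] = misses[1] = misses[2] = 0;
    for (int x = 0; x < N; ++x)
        for (int y = 0; y < N; ++y)
            for (int z = 0; z < N; ++z) {
                int i, j, k;
                if (kij_order) { k = x; i = y; j = z; }
                else           { i = x; j = y; k = z; }
                misses[0] += cache_access((long)i * N + k);                  /* A */
                misses[1] += cache_access((long)N * N + (long)k * N + j);    /* B */
                misses[2] += cache_access(2L * N * N + (long)i * N + j);     /* C */
            }
}
```

Under these assumptions the simulated kij order should show far fewer misses for A and B and more for C than ijk, with a lower total, which is the pattern the two counting slides predict.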

## exercise: miss estimating (2)

```
for (int k = 0; k < 1000; k += 1)
    for (int i = 0; i < 1000; i += 1)
        for (int j = 0; j < 1000; j += 1)
            A[k*N+j] += B[i*N+j];
```

assuming: 4 elements per block
assuming: cache not close to big enough to hold 1 K elements
estimate: approximately how many misses for $A, B$ ?

## L1 misses (with $A=B$)



## L1 miss detail (1)



## L1 miss detail (2)

read misses/1K instruction


## addresses

| array element | address (binary) |
| :--- | :--- |
| `B[k*114+j]` | `10 0000 0000 0100` |
| `B[k*114+j+1]` | `10 0000 0000 1000` |
| `B[(k+1)*114+j]` | `10 0011 1001 0100` |
| `B[(k+2)*114+j]` | `10 0101 0101 1100` |
| ... |  |
| `B[(k+9)*114+j]` | `11 0000 0000 1100` |

test system L1 cache: 6 index bits, 6 block offset bits

## conflict misses

powers of two: lower order bits unchanged
`B[k*93+j]` and `B[(k+11)*93+j]`:
1023 elements apart (4092 bytes; 63.9 cache blocks)
64 sets in L1 cache: usually maps to same set
`B[k*93+(j+1)]` will not be cached (next $i$ loop)
even if in same block as `B[k*93+j]`
how to fix? improve spatial locality
(maybe even if it requires copying)
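
To see why "usually maps to same set" holds, the set index can be computed for two addresses 4092 bytes apart. This is a sketch assuming the cache geometry quoted for the test system (6 block offset bits, 6 index bits) and 4-byte elements; `set_index` and `same_set_count` are hypothetical helper names.

```c
#include <assert.h>
#include <stdint.h>

enum {
    OFFSET_BITS = 6,   /* 64-byte blocks */
    INDEX_BITS = 6     /* 64 sets */
};

/* Set index of a byte address: the INDEX_BITS bits just above the block offset. */
unsigned set_index(uint64_t addr) {
    return (unsigned)((addr >> OFFSET_BITS) & ((1u << INDEX_BITS) - 1));
}

/* For each of the 16 positions a 4-byte element can take inside the
 * 64-byte block starting at base, checks whether the element `distance`
 * bytes later lands in the same set. Returns how many positions do. */
int same_set_count(uint64_t base, uint64_t distance) {
    int same = 0;
    assert(base % 64 == 0);
    for (uint64_t off = 0; off < 64; off += 4)
        if (set_index(base + off) == set_index(base + off + distance))
            ++same;
    return same;
}
```

With a distance of 4092 bytes, 15 of the 16 positions collide; only an element in the first 4 bytes of its block escapes, because adding 4092 to it stays just short of 64 whole blocks. A distance of exactly 4096 bytes collides in every position.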

## locality exercise (2)

```
/* version 2 */
for (int i = 0; i < N; ++i)
    for (int j = 0; j < N; ++j)
        A[i] += B[j] * C[i * N + j];
/* version 3 */
for (int ii = 0; ii < N; ii += 32)
    for (int jj = 0; jj < N; jj += 32)
        for (int i = ii; i < ii + 32; ++i)
            for (int j = jj; j < jj + 32; ++j)
                A[i] += B[j] * C[i * N + j];
```

exercise: which has better temporal locality in $A$? in $B$? in $C$? how about spatial locality?

## a transformation

```
for (int k = 0; k < N; k += 1)
    for (int i = 0; i < N; ++i)
        for (int j = 0; j < N; ++j)
            C[i * N + j] += A[i * N + k] * B[k * N + j];
```

```
for (int kk = 0; kk < N; kk += 2)
    for (int k = kk; k < kk + 2; ++k)
        for (int i = 0; i < N; ++i)
            for (int j = 0; j < N; ++j)
                C[i * N + j] += A[i * N + k] * B[k * N + j];
```

split the loop over $k$: should be exactly the same (assuming even $N$)

## simple blocking

```
for (int kk = 0; kk < N; kk += 2)
    /* was here: for (int k = kk; k < kk + 2; ++k) */
    for (int i = 0; i < N; ++i)
        for (int j = 0; j < N; ++j)
            /* load Aik, Aik+1 into cache and process: */
            for (int k = kk; k < kk + 2; ++k)
                C[i*N+j] += A[i*N+k] * B[k*N+j];
```

now reorder split loop: same calculations

now handle $B_{i j}$ for $k+1$ right after $B_{i j}$ for $k$
(previously: $B_{i, j+1}$ for $k$ right after $B_{i j}$ for $k$ )

## simple blocking - expanded

```
for (int kk = 0; kk < N; kk += 2) {
    for (int i = 0; i < N; ++i) {
        for (int j = 0; j < N; ++j) {
            /* process a "block" of 2 k values: */
            C[i*N+j] += A[i*N+kk+0] * B[(kk+0)*N+j];
            C[i*N+j] += A[i*N+kk+1] * B[(kk+1)*N+j];
        }
    }
}
```


Temporal locality in $C_{ij}$s


More spatial locality in $A_{i k}$

Still have good spatial locality in $B_{k j}, C_{i j}$
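
Since blocking only reorders the work, the blocked loop should produce the same matrix as the plain kij loop. A minimal check, assuming even `n`; the function names and the test data are invented for illustration.

```c
#include <assert.h>
#include <string.h>

/* kij order, before blocking */
void matmul_kij(int n, const double *A, const double *B, double *C) {
    for (int k = 0; k < n; ++k)
        for (int i = 0; i < n; ++i)
            for (int j = 0; j < n; ++j)
                C[i * n + j] += A[i * n + k] * B[k * n + j];
}

/* k blocked by 2, written out as in the expanded slide; requires even n */
void matmul_kblock2(int n, const double *A, const double *B, double *C) {
    assert(n % 2 == 0);
    for (int kk = 0; kk < n; kk += 2)
        for (int i = 0; i < n; ++i)
            for (int j = 0; j < n; ++j) {
                /* process a "block" of 2 k values: */
                C[i * n + j] += A[i * n + kk + 0] * B[(kk + 0) * n + j];
                C[i * n + j] += A[i * n + kk + 1] * B[(kk + 1) * n + j];
            }
}

/* Hypothetical helper: returns 1 if both versions give identical results. */
int blocking_preserves_result(void) {
    enum { TEST_N = 8 };
    double A[TEST_N * TEST_N], B[TEST_N * TEST_N];
    double C1[TEST_N * TEST_N], C2[TEST_N * TEST_N];
    for (int x = 0; x < TEST_N * TEST_N; ++x) {
        A[x] = (double)(x % 5);        /* small integers: exact in double */
        B[x] = (double)((x * 7) % 3);
    }
    memset(C1, 0, sizeof C1);
    memset(C2, 0, sizeof C2);
    matmul_kij(TEST_N, A, B, C1);
    matmul_kblock2(TEST_N, A, B, C2);
    return memcmp(C1, C2, sizeof C1) == 0;
}
```

For each `C[i*n+j]` the products are still added in increasing `k` order, so the two results match exactly, not just approximately.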

## counting misses for $\mathbf{A}$ (1)

```
for (int kk = 0; kk < N; kk += 2)
    for (int i = 0; i < N; i += 1)
        for (int j = 0; j < N; ++j) {
            C[i*N+j] += A[i*N+kk+0] * B[(kk+0)*N+j];
            C[i*N+j] += A[i*N+kk+1] * B[(kk+1)*N+j];
        }
```

access pattern for A:
`A[0*N+0]`, `A[0*N+1]`, `A[0*N+0]`, `A[0*N+1]` ... (repeats N times)
`A[1*N+0]`, `A[1*N+1]`, `A[1*N+0]`, `A[1*N+1]` ... (repeats N times)
...
`A[(N-1)*N+0]`, `A[(N-1)*N+1]`, `A[(N-1)*N+0]`, `A[(N-1)*N+1]` ...
`A[0*N+2]`, `A[0*N+3]`, `A[0*N+2]`, `A[0*N+3]` ...

## counting misses for $\mathbf{A}$ (2)

`A[0*N+0]`, `A[0*N+1]`, `A[0*N+0]`, `A[0*N+1]` ... (repeats N times)
`A[1*N+0]`, `A[1*N+1]`, `A[1*N+0]`, `A[1*N+1]` ... (repeats N times)
...
`A[(N-1)*N+0]`, `A[(N-1)*N+1]`, `A[(N-1)*N+0]`, `A[(N-1)*N+1]` ...
`A[0*N+2]`, `A[0*N+3]`, `A[0*N+2]`, `A[0*N+3]` ...
likely cache misses: only first iterations of $j$ loop
how many cache misses per iteration? usually one
`A[0*N+0]` and `A[0*N+1]` usually in same cache block
about $\frac{N}{2} \cdot N$ misses total

## counting misses for $B$ (1)

```
for (int kk = 0; kk < N; kk += 2)
    for (int i = 0; i < N; i += 1)
        for (int j = 0; j < N; ++j) {
            C[i*N+j] += A[i*N+kk+0] * B[(kk+0)*N+j];
            C[i*N+j] += A[i*N+kk+1] * B[(kk+1)*N+j];
        }
```

access pattern for B:
`B[0*N+0]`, `B[1*N+0]`, ... `B[0*N+(N-1)]`, `B[1*N+(N-1)]`
`B[2*N+0]`, `B[3*N+0]`, ... `B[2*N+(N-1)]`, `B[3*N+(N-1)]`
`B[4*N+0]`, `B[5*N+0]`, ... `B[4*N+(N-1)]`, `B[5*N+(N-1)]`
`B[0*N+0]`, `B[1*N+0]`, ... `B[0*N+(N-1)]`, `B[1*N+(N-1)]`

## counting misses for $B$ (2)

access pattern for B:
`B[0*N+0]`, `B[1*N+0]`, ... `B[0*N+(N-1)]`, `B[1*N+(N-1)]`
`B[2*N+0]`, `B[3*N+0]`, ... `B[2*N+(N-1)]`, `B[3*N+(N-1)]`
`B[4*N+0]`, `B[5*N+0]`, ... `B[4*N+(N-1)]`, `B[5*N+(N-1)]`
`B[0*N+0]`, `B[1*N+0]`, ... `B[0*N+(N-1)]`, `B[1*N+(N-1)]`
likely cache misses: any access, each time
how many cache misses per iteration? equal to \# cache blocks in 2 rows
about $\frac{N}{2} \cdot N \cdot \frac{2N}{\text{block size}} = N^{3} \div$ block size misses

## simple blocking - counting misses

```
for (int kk = 0; kk < N; kk += 2)
    for (int i = 0; i < N; i += 1)
        for (int j = 0; j < N; ++j) {
            C[i*N+j] += A[i*N+kk+0] * B[(kk+0)*N+j];
            C[i*N+j] += A[i*N+kk+1] * B[(kk+1)*N+j];
        }
```

$\frac{N}{2} \cdot N$ j-loop executions and (assuming $N$ large):
about 1 miss from $A$ per j-loop
$N^{2} / 2$ total misses (before blocking: $N^{2}$)
about $2N \div$ block size misses from $B$ per j-loop
$N^{3} \div$ block size total misses (same as before blocking)
about $N \div$ block size misses from $C$ per j-loop
$N^{3} \div (2 \cdot \text{block size})$ total misses (before: $N^{3} \div$ block size)
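
To get a feel for the numbers, the estimates above can be evaluated for a concrete size. This sketch just encodes the per-j-loop counts from the slides (large $N$ assumed); the struct and function names are made up for illustration.

```c
#include <assert.h>

/* Estimated miss counts for the three arrays. */
struct miss_estimate { double a, b, c; };

/* kij order without blocking: N * N executions of the j-loop. */
struct miss_estimate estimate_kij(double n, double block_elems) {
    struct miss_estimate e;
    assert(n > 0 && block_elems > 0);
    double j_loops = n * n;
    e.a = j_loops * 1;                    /* about 1 miss per j-loop */
    e.b = j_loops * (n / block_elems);    /* one row of B per j-loop */
    e.c = j_loops * (n / block_elems);    /* one row of C per j-loop */
    return e;
}

/* k blocked by 2: (N / 2) * N executions of the j-loop. */
struct miss_estimate estimate_kblock2(double n, double block_elems) {
    struct miss_estimate e;
    assert(n > 0 && block_elems > 0);
    double j_loops = (n / 2) * n;
    e.a = j_loops * 1;                        /* about 1 miss per j-loop */
    e.b = j_loops * (2 * n / block_elems);    /* two rows of B per j-loop */
    e.c = j_loops * (n / block_elems);        /* one row of C per j-loop */
    return e;
}
```

For example, with $N = 100$ and 4 elements per block, the estimate for C drops from 250000 to 125000 misses while the estimate for B stays at 250000.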

## improvement in read misses



## simple blocking (2)

same thing for $i$ in addition to $k$ ?

```
for (int kk = 0; kk < N; kk += 2) {
    for (int ii = 0; ii < N; ii += 2) {
        for (int j = 0; j < N; ++j) {
            /* process a "block": */
            for (int k = kk; k < kk + 2; ++k)
                for (int i = ii; i < ii + 2; ++i)
                    C[i*N+j] += A[i*N+k] * B[k*N+j];
        }
    }
}
```


## simple blocking - locality

```
for (int k = 0; k < N; k += 2) {
    for (int i = 0; i < N; i += 2) {
        /* load a block around Aik */
        for (int j = 0; j < N; ++j) {
            /* process a "block": */
            C[(i+0)*N+j] += A[(i+0)*N+k+0] * B[(k+0)*N+j];
            C[(i+0)*N+j] += A[(i+0)*N+k+1] * B[(k+1)*N+j];
            C[(i+1)*N+j] += A[(i+1)*N+k+0] * B[(k+0)*N+j];
            C[(i+1)*N+j] += A[(i+1)*N+k+1] * B[(k+1)*N+j];
        }
    }
}
```


now: more temporal locality in $B$
previously: access $B_{kj}$, then don't use it again for a long time

## simple blocking - counting misses for $A$

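$$
\begin{aligned}
& \text{for (int } k=0 ; k<N ; k+=2 \text{)} \\
& \quad \text{for (int } i=0 ; i<N ; i+=2 \text{)} \\
& \quad \quad \text{for (int } j=0 ; j<N ;++j \text{) \{} \\
& \quad \quad \quad C_{i+0, j}+=A_{i+0, k+0} * B_{k+0, j} \\
& \quad \quad \quad C_{i+0, j}+=A_{i+0, k+1} * B_{k+1, j} \\
& \quad \quad \quad C_{i+1, j}+=A_{i+1, k+0} * B_{k+0, j} \\
& \quad \quad \quad C_{i+1, j}+=A_{i+1, k+1} * B_{k+1, j} \\
& \quad \quad \text{\}}
\end{aligned}
$$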

$\frac{N}{2} \cdot \frac{N}{2}$ iterations of $j$ loop
likely 2 misses per loop with $A$ (2 cache blocks)
total misses: $\frac{N^{2}}{2}$ (same as only blocking in K )

## simple blocking - counting misses for $B$

```
for (int k = 0; k < N; k += 2)
    for (int i = 0; i < N; i += 2)
        for (int j = 0; j < N; ++j) {
            C[(i+0)*N+j] += A[(i+0)*N+k+0] * B[(k+0)*N+j];
            C[(i+0)*N+j] += A[(i+0)*N+k+1] * B[(k+1)*N+j];
            C[(i+1)*N+j] += A[(i+1)*N+k+0] * B[(k+0)*N+j];
            C[(i+1)*N+j] += A[(i+1)*N+k+1] * B[(k+1)*N+j];
        }
```

$\frac{N}{2} \cdot \frac{N}{2}$ iterations of $j$ loop
likely $2 \div$ block size misses per iteration with $B$
total misses: $\frac{N^{3}}{2 \cdot \text { block size }}$ (before: $\frac{N^{3}}{\text { block size }}$ )

## simple blocking - counting misses for C

$$
\begin{aligned}
& \text{for (int } k=0 ; k<N ; k+=2 \text{)} \\
& \quad \text{for (int } i=0 ; i<N ; i+=2 \text{)} \\
& \quad \quad \text{for (int } j=0 ; j<N ;++j \text{) \{} \\
& \quad \quad \quad C_{i+0, j}+=A_{i+0, k+0} * B_{k+0, j} \\
& \quad \quad \quad C_{i+0, j}+=A_{i+0, k+1} * B_{k+1, j} \\
& \quad \quad \quad C_{i+1, j}+=A_{i+1, k+0} * B_{k+0, j} \\
& \quad \quad \quad C_{i+1, j}+=A_{i+1, k+1} * B_{k+1, j} \\
& \quad \quad \text{\}}
\end{aligned}
$$

$\frac{N}{2} \cdot \frac{N}{2}$ iterations of $j$ loop
likely $\frac{2}{\text { block size }}$ misses per iteration with $C$
total misses: $\frac{N^{3}}{2 \cdot \text { block size }}$ (same as blocking only in K)

## simple blocking - counting misses (total)

for (int k = 0; k < N; k += 2)
for (int i = 0; i < N; i += 2)
for (int j = 0; j < N; ++j) {
$C_{i+0, j} += A_{i+0, k+0} * B_{k+0, j}$
$C_{i+0, j} += A_{i+0, k+1} * B_{k+1, j}$
$C_{i+1, j} += A_{i+1, k+0} * B_{k+0, j}$
$C_{i+1, j} += A_{i+1, k+1} * B_{k+1, j}$
}
before (blocking only in $k$):
A: $\frac{N^{2}}{2}$; B: $\frac{N^{3}}{1 \cdot \text{block size}}$; C: $\frac{N^{3}}{2 \cdot \text{block size}}$
after:
A: $\frac{N^{2}}{2}$; B: $\frac{N^{3}}{2 \cdot \text{block size}}$; C: $\frac{N^{3}}{2 \cdot \text{block size}}$
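a runnable sketch of the loop being counted above, next to the unblocked kij loop it is compared with. Assumptions of this sketch (not from the slides): row-major `float` arrays, even `N`, and the hypothetical function names `matmul_kij` and `matmul_blocked2`.

```c
/* Unblocked kij-order multiply: C += A * B, all N x N, row-major. */
void matmul_kij(const float *A, const float *B, float *C, int N) {
    for (int k = 0; k < N; ++k)
        for (int i = 0; i < N; ++i)
            for (int j = 0; j < N; ++j)
                C[i * N + j] += A[i * N + k] * B[k * N + j];
}

/* Simple blocking: two k values and two i values per pass over j.
   Assumes N is even (an assumption of this sketch). */
void matmul_blocked2(const float *A, const float *B, float *C, int N) {
    for (int k = 0; k < N; k += 2)
        for (int i = 0; i < N; i += 2)
            for (int j = 0; j < N; ++j) {
                C[(i + 0) * N + j] += A[(i + 0) * N + k + 0] * B[(k + 0) * N + j];
                C[(i + 0) * N + j] += A[(i + 0) * N + k + 1] * B[(k + 1) * N + j];
                C[(i + 1) * N + j] += A[(i + 1) * N + k + 0] * B[(k + 0) * N + j];
                C[(i + 1) * N + j] += A[(i + 1) * N + k + 1] * B[(k + 1) * N + j];
            }
}
```

both functions do the same arithmetic; only the order of memory accesses (and so the miss counts) differs.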

## generalizing: divide and conquer

```
/* BLOCK_I, BLOCK_J, BLOCK_K: block dimensions, chosen so the blocks fit in cache */
void partial_matmul(float *A, float *B, float *C, int N,
                    int startI, int endI, int startJ, int endJ,
                    int startK, int endK) {
    for (int i = startI; i < endI; ++i) {
        for (int j = startJ; j < endJ; ++j) {
            for (int k = startK; k < endK; ++k) {
                C[i * N + j] += A[i * N + k] * B[k * N + j];
            }
        }
    }
}

/* assumes N is a multiple of BLOCK_I, BLOCK_J, and BLOCK_K */
void matrix_multiply(float *A, float *B, float *C, int N) {
    for (int ii = 0; ii < N; ii += BLOCK_I)
        for (int jj = 0; jj < N; jj += BLOCK_J)
            for (int kk = 0; kk < N; kk += BLOCK_K)
                /* do everything for segment of A, B, C
                   that fits in cache! */
                partial_matmul(A, B, C, N,
                               ii, ii + BLOCK_I, jj, jj + BLOCK_J,
                               kk, kk + BLOCK_K);
}
```


## array usage: matrix block $C_{ij} += A_{ik} \cdot B_{kj}$

[figure: matrix blocks of $A$, $B$, and $C$; the $C_{ij}$ block is $I \times J$]
inner loops work on "matrix block" of $A$, $B$, $C$
rather than rows of some, little blocks of others
blocks fit into cache (b/c we choose $I$, $K$, $J$) where previous rows might not
now (versus loop ordering example):
some spatial locality in $A$, $B$, and $C$
some temporal locality in $A$, $B$, and $C$
$C_{ij}$ calculation uses strips from $A$, $B$
$K$ calculations for one cache miss
good temporal locality!
$A_{ik}$ used with entire strip of $B$
$J$ calculations for one cache miss
good temporal locality!
(approx.) $KIJ$ fully cached calculations
for $KI + IJ + KJ$ values that need to be loaded per "matrix block" (assuming everything stays in cache)

## cache blocking efficiency

for each of $N^{3} / I J K$ matrix blocks:
load $I \times K$ elements of $A_{i k}$ :
$\approx I K \div$ block size misses per matrix block
$\approx N^{3} /(J \cdot$ blocksize $)$ misses total
load $K \times J$ elements of $B_{k j}$ :
$\approx N^{3} /(I \cdot$ blocksize $)$ misses total
load $I \times J$ elements of $C_{i j}$ :
$\approx N^{3} /(K \cdot$ blocksize $)$ misses total
bigger blocks - more work per load!
catch: $IK + KJ + IJ$ elements must fit in cache
otherwise estimates above don't work
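the estimates above can be written out as a small calculation. This is only a sketch: `miss_estimate` and `estimate_blocked_misses` are hypothetical names, "block size" is taken to mean array elements per cache block, and (as the catch above says) the numbers only mean something if the three blocks fit in cache.

```c
/* Estimated cache misses for an N x N blocked matrix multiply with
   I x K blocks of A, K x J blocks of B, and I x J blocks of C.
   block_elems = array elements per cache block. */
typedef struct {
    double a_misses, b_misses, c_misses;
} miss_estimate;

miss_estimate estimate_blocked_misses(double N, double I, double J, double K,
                                      double block_elems) {
    /* number of (I, J, K) matrix blocks processed: N^3 / (I*J*K) */
    double matrix_blocks = (N * N * N) / (I * J * K);
    miss_estimate est;
    est.a_misses = matrix_blocks * (I * K / block_elems); /* N^3 / (J * block_elems) */
    est.b_misses = matrix_blocks * (K * J / block_elems); /* N^3 / (I * block_elems) */
    est.c_misses = matrix_blocks * (I * J / block_elems); /* N^3 / (K * block_elems) */
    return est;
}
```

doubling any one of `I`, `J`, `K` halves one of the three totals, which is the "more work per load" effect.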

## cache blocking rule of thumb

fill most of the cache with useful data
and do as much work as possible from that
example: my desktop 32 KB L1 cache
$I=J=K=48$ uses $48^{2} \times 3$ elements, or 27 KB (with 4-byte elements)
assumption: conflict misses aren't important
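a sketch of the arithmetic behind the example. The helper names and the idea of searching in multiples of `step` are assumptions of this sketch, not necessarily how the 48 above was chosen; in practice block sizes are tuned by measuring.

```c
#include <stddef.h>

/* Bytes occupied by one I x K block of A, one K x J block of B,
   and one I x J block of C. */
size_t block_footprint_bytes(size_t I, size_t J, size_t K, size_t elem_bytes) {
    return (I * K + K * J + I * J) * elem_bytes;
}

/* Largest square block dimension (a multiple of step, step > 0) whose
   three blocks fit in cache_bytes; returns 0 if none fits. */
size_t largest_square_block(size_t cache_bytes, size_t elem_bytes, size_t step) {
    size_t best = 0;
    for (size_t b = step;
         block_footprint_bytes(b, b, b, elem_bytes) <= cache_bytes;
         b += step)
        best = b;
    return best;
}
```

with 4-byte elements, `block_footprint_bytes(48, 48, 48, 4)` is 27648 bytes (27 KB), leaving some of a 32 KB cache for everything else.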

## systematic approach

for (int k = 0; k < N; ++k) {
for (int i = 0; i < N; ++i) {
$A_{ik}$ loaded once in this loop:
for (int j = 0; j < N; ++j)
$C_{ij}$, $B_{kj}$ loaded each iteration (if $N$ big):
C[i * N + j] += A[i * N + k] * B[k * N + j];
}
}
values from $A_{ik}$ used $N$ times per load
values from $B_{kj}$ used 1 time per load
but good spatial locality, so values in a cache block of $B_{kj}$ are used together
values from $C_{ij}$ used 1 time per load
but good spatial locality, so values in a cache block of $C_{ij}$ are used together

## exercise: miss estimating (3)

for (int kk = 0; kk < 1000; kk += 10)
for (int jj = 0; jj < 1000; jj += 10)
for (int i = 0; i < 1000; i += 1)
for (int j = jj; j < jj + 10; j += 1)
for (int k = kk; k < kk + 10; k += 1)
A[k * N + j] += B[i * N + j];

assuming: 4 elements per block
assuming: cache not close to big enough to hold 1 K elements, but big enough to hold 500 or so
estimate: approximately how many misses for $A, B$ ?
hint 1: part of $A, B$ loaded in two inner-most loops only needs to be loaded once

## loop ordering compromises

loop ordering forces compromises:
for $k$: for $i$: for $j$: $c[i, j] += a[i, k] * b[j, k]$
perfect temporal locality in $a[i, k]$
bad temporal locality for $c[i, j], b[j, k]$
perfect spatial locality in $c[i, j]$
bad spatial locality in $b[j, k], a[i, k]$
cache blocking: work on blocks rather than rows/columns
have some temporal, spatial locality in everything

## cache blocking pattern

no perfect loop order? work on rectangular matrix blocks
size amount used in inner loops based on cache size
in practice:
test performance to determine 'size' of blocks

## backup slides

## cache organization and miss rate

depends on program; one example:
SPEC CPU2000 benchmarks, 64B block size
LRU replacement policies
data cache miss rates:

| Cache size | direct-mapped | 2-way | 8-way | fully assoc. |
| :--- | ---: | ---: | ---: | ---: |
| 1 KB | $8.63 \%$ | $6.97 \%$ | $5.63 \%$ | $5.34 \%$ |
| 2 KB | $5.71 \%$ | $4.23 \%$ | $3.30 \%$ | $3.05 \%$ |
| 4 KB | $3.70 \%$ | $2.60 \%$ | $2.03 \%$ | $1.90 \%$ |
| 16 KB | $1.59 \%$ | $0.86 \%$ | $0.56 \%$ | $0.50 \%$ |
| 64 KB | $0.66 \%$ | $0.37 \%$ | $0.10 \%$ | $0.001 \%$ |
| 128 KB | $0.27 \%$ | $0.001 \%$ | $0.0006 \%$ | $0.0006 \%$ |

## exercise (1)

initial cache: 64-byte blocks, 64 sets, 8 ways/set

If we leave the other parameters listed above unchanged, which will probably reduce the number of capacity misses in a typical program? (Multiple may be correct.)
A. quadrupling the block size (256-byte blocks, 64 sets, 8 ways/set)
B. quadrupling the number of sets
C. quadrupling the number of ways/set
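when comparing options like these, it can help to work out what each does to total capacity and to the number of sets. A minimal sketch (the function names are hypothetical):

```c
/* Total data capacity of a cache in bytes. */
unsigned long cache_bytes(unsigned long block_bytes, unsigned long sets,
                          unsigned long ways) {
    return block_bytes * sets * ways;
}

/* Number of sets when the total size is what is held fixed instead. */
unsigned long cache_sets(unsigned long total_bytes, unsigned long block_bytes,
                         unsigned long ways) {
    return total_bytes / (block_bytes * ways);
}
```

for the initial cache here, `cache_bytes(64, 64, 8)` is 32768 (32 KB); each of A, B, and C multiplies one factor by 4, giving 131072 (128 KB).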

## exercise (2)

initial cache: 64-byte blocks, 8 ways/set, 64 KB cache

If we leave the other parameters listed above unchanged, which will probably reduce the number of capacity misses in a typical program? (Multiple may be correct.)
A. quadrupling the block size (256-byte blocks, 8 ways/set, 64 KB cache)
B. quadrupling the number of ways/set
C. quadrupling the cache size

## exercise (3)

initial cache: 64-byte blocks, 8 ways/set, 64 KB cache

If we leave the other parameters listed above unchanged, which will probably reduce the number of conflict misses in a typical program? (Multiple may be correct.)
A. quadrupling the block size (256-byte blocks, 8 ways/set, 64 KB cache)
B. quadrupling the number of ways/set
C. quadrupling the cache size

## prefetching

seems like we can't really improve cold misses...
have to have a miss to bring value into the cache?
solution: don't require miss: 'prefetch' the value before it's accessed
remaining problem: how do we know what to fetch?

## common access patterns

suppose recently accessed 16B cache blocks are at: 0x48010, 0x48020, 0x48030, 0x48040
guess what's accessed next
common pattern with instruction fetches and array accesses

## prefetching idea

look for sequential accesses
bring in guess at next-to-be-accessed value
if right: no cache miss (even if never accessed before)
if wrong: possibly evicted something else - could cause more misses
fortunately, sequential access guesses almost always right
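a toy version of the "look for sequential accesses" step, as a sketch only: real hardware prefetchers are more elaborate, and `guess_next_block` is a hypothetical name.

```c
#include <stdbool.h>
#include <stdint.h>

/* If the last n block addresses are consecutive blocks (each one
   block_bytes after the previous), store the guessed next block address
   in *next and return true; otherwise return false (no guess). */
bool guess_next_block(const uint64_t *recent, int n, uint64_t block_bytes,
                      uint64_t *next) {
    if (n < 2)
        return false;
    for (int i = 1; i < n; ++i)
        if (recent[i] != recent[i - 1] + block_bytes)
            return false;
    *next = recent[n - 1] + block_bytes;
    return true;
}
```

with 16B blocks at 0x48010, 0x48020, 0x48030, 0x48040, the guess is 0x48050.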

## mapping of sets to memory (direct-mapped)

[figure: memory, with the cache set each part maps to]

## mapping of sets to memory (3-way)

[figure: memory, with the cache set each part maps to]

## C and cache misses (4)

```
typedef struct {
    int a_value, b_value;
    int other_values[6];
} item;
item items[5];
int a_sum = 0, b_sum = 0;
for (int i = 0; i < 5; ++i)
    a_sum += items[i].a_value;
for (int i = 0; i < 5; ++i)
    b_sum += items[i].b_value;
```

Assume everything but items is kept in registers (and the compiler does not do anything funny).

## C and cache misses (4, rewrite)

```
int array[40];
int a_sum = 0, b_sum = 0;
for (int i = 0; i < 40; i += 8)
    a_sum += array[i];
for (int i = 1; i < 40; i += 8)
    b_sum += array[i];
```

Assume everything but array is kept in registers (and the compiler does not do anything funny) and array starts at beginning of cache block.

How many data cache misses on a 2-way set associative 128B cache with 16B cache blocks and LRU replacement?

## C and cache misses (4, solution pt 1)

ints 4 byte → array[0 to 3] and array[16 to 19] in same cache set
64B = 16 ints stored per way
4 sets total
accessing array indices 0, 8, 16, 24, 32, 1, 9, 17, 25, 33
0 (set 0), 8 (set 2), 16 (set 0), 24 (set 2), 32 (set 0)
1 (set 0), 9 (set 2), 17 (set 0), 25 (set 2), 33 (set 0)

## C and cache misses (4, solution pt 2)

| access | set 0 after (LRU first) | result |
| :--- | :--- | :--- |
| - | -, - |  |
| array[0] | -, array[0 to 3] | miss |
| array[16] | array[0 to 3], array[16 to 19] | miss |
| array[32] | array[16 to 19], array[32 to 35] | miss |
| array[1] | array[32 to 35], array[0 to 3] | miss |
| array[17] | array[0 to 3], array[16 to 19] | miss |
| array[33] | array[16 to 19], array[32 to 35] | miss |

6 misses for set 0

## C and cache misses (4, solution pt 3)

| access | set 2 after (LRU first) | result |
| :--- | :--- | :--- |
| - | -, - |  |
| array[8] | -, array[8 to 11] | miss |
| array[24] | array[8 to 11], array[24 to 27] | miss |
| array[9] | array[24 to 27], array[8 to 11] | hit |
| array[25] | array[8 to 11], array[24 to 27] | hit |

2 misses for set 2
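the hand trace can be checked with a tiny simulator. This is a sketch under the exercise's assumptions (128B cache, 2 ways, 16B blocks, LRU, 4-byte ints, array starting at a block boundary, cache initially empty); `count_misses` and the constants are hypothetical names.

```c
#include <string.h>

#define NUM_SETS 4      /* 128B cache / (16B blocks * 2 ways) */
#define NUM_WAYS 2
#define BLOCK_BYTES 16
#define INT_BYTES 4     /* exercise assumes 4-byte ints */

/* Count misses for accesses to array[indices[0]], array[indices[1]], ... */
int count_misses(const int *indices, int n) {
    long tags[NUM_SETS][NUM_WAYS];      /* block number held by each way */
    int valid[NUM_SETS][NUM_WAYS];
    int last_used[NUM_SETS][NUM_WAYS];  /* time of most recent access */
    int misses = 0;
    memset(valid, 0, sizeof valid);
    memset(last_used, 0, sizeof last_used);
    for (int t = 0; t < n; ++t) {
        long block = (long)indices[t] * INT_BYTES / BLOCK_BYTES;
        int set = (int)(block % NUM_SETS);
        int hit_way = -1;
        for (int w = 0; w < NUM_WAYS; ++w)
            if (valid[set][w] && tags[set][w] == block)
                hit_way = w;
        if (hit_way < 0) {
            ++misses;
            /* choose an empty way if there is one, otherwise the LRU way */
            int victim = 0;
            for (int w = 0; w < NUM_WAYS; ++w) {
                if (!valid[set][w]) { victim = w; break; }
                if (last_used[set][w] < last_used[set][victim])
                    victim = w;
            }
            valid[set][victim] = 1;
            tags[set][victim] = block;
            hit_way = victim;
        }
        last_used[set][hit_way] = t;
    }
    return misses;
}
```

for the access order 0, 8, 16, 24, 32, 1, 9, 17, 25, 33 this counts 8 misses: 6 in set 0 and 2 in set 2.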

## C and cache misses (3)

```
typedef struct {
    int a_value, b_value;
    int other_values[10];
} item;
item items[5];
int a_sum = 0, b_sum = 0;
for (int i = 0; i < 5; ++i)
    a_sum += items[i].a_value;
for (int i = 0; i < 5; ++i)
    b_sum += items[i].b_value;
```

observation: 12 ints in struct: only first two used
equivalent to accessing array[0], array[12], array[24], etc.
...then accessing array[1], array[13], array[25], etc.

## C and cache misses (3, rewritten?)

```
int array[60];
int a_sum = 0, b_sum = 0;
for (int i = 0; i < 60; i += 12)
    a_sum += array[i];
for (int i = 1; i < 60; i += 12)
    b_sum += array[i];
```

Assume everything but array is kept in registers (and the compiler does not do anything funny) and array at beginning of cache block.

How many data cache misses on a 128B two-way set associative cache with 16B cache blocks and LRU replacement?
observation 1: first loop has 5 misses - first accesses to blocks
observation 2: array[0] and array[1], array[12] and array[13], etc. in same cache block

## C and cache misses (3, solution)

ints 4 byte → array[0 to 3] and array[16 to 19] in same cache set
64B = 16 ints stored per way
4 sets total
accessing array indices 0, 12, 24, 36, 48, 1, 13, 25, 37, 49
0 (set 0, array[0 to 3]), 12 (set 3), 24 (set 2), 36 (set 1), 48 (set 0)
each set used at most twice
no replacement needed
so access to 1, 13, 25, 37, 49 all hits:
set 0 contains block with array[0 to 3]
set 3 contains block with array[12 to 15]
etc.

## C and cache misses (3)

```
typedef struct {
    int a_value, b_value;
    int boring_values[126];
} item;
item items[8]; // 4 KB array
int a_sum = 0, b_sum = 0;
for (int i = 0; i < 8; ++i)
    a_sum += items[i].a_value;
for (int i = 0; i < 8; ++i)
    b_sum += items[i].b_value;
```

Assume everything but items is kept in registers (and the compiler does not do anything funny).

How many data cache misses on a 2 KB direct-mapped cache with 16B cache blocks?

## C and cache misses (3, rewritten?)

int array[1024]; // 4 KB array
int a_sum = 0, b_sum = 0;
for (int i = 0; i < 1024; i += 128)
a_sum += array[i];
for (int i = 1; i < 1024; i += 128)
b_sum += array[i];

## C and cache misses (4)

```
typedef struct {
    int a_value, b_value;
    int boring_values[126];
} item;
item items[8]; // 4 KB array
int a_sum = 0, b_sum = 0;
for (int i = 0; i < 8; ++i)
    a_sum += items[i].a_value;
for (int i = 0; i < 8; ++i)
    b_sum += items[i].b_value;
```

Assume everything but items is kept in registers (and the compiler does not do anything funny).

How many data cache misses on a 4-way set associative 2 KB cache with 16B cache blocks?

## thinking about cache storage (1)

2KB direct-mapped cache with 16B blocks
set 0: address 0 to 15, (0 to 15) + 2KB, (0 to 15) + 4KB, ...
block at 0: array[0] through array[3]
block at 0 + 2KB: array[512] through array[515]
set 1: address 16 to 31, (16 to 31) + 2KB, (16 to 31) + 4KB, ...
block at 16: array[4] through array[7]
block at 16 + 2KB: array[516] through array[519]
set 127: address 2032 to 2047, (2032 to 2047) + 2KB, ...
block at 2032: array[508] through array[511]
block at 2032 + 2KB: array[1020] through array[1023]

## thinking about cache storage (2)

2KB 2-way set associative cache with 16B blocks: block addresses
set 0: address 0, 0 + 1KB, 0 + 2KB, ...
block at 0: array[0] through array[3]
block at 0 + 1KB: array[256] through array[259]
block at 0 + 2KB: array[512] through array[515]
set 1: address 16, 16 + 1KB, 16 + 2KB, ...
block at 16: array[4] through array[7]
set 63: address 1008, 1008 + 1KB, 1008 + 2KB, ...
block at 1008: array[252] through array[255]
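the set lists for both caches come from the same calculation: block number modulo number of sets. A minimal sketch (hypothetical function names; addresses are byte offsets from a block-aligned start):

```c
#include <stdint.h>

/* Number of sets in a cache with the given total size, block size,
   and associativity (ways = 1 for direct-mapped). */
uint64_t num_sets(uint64_t cache_bytes, uint64_t block_bytes, uint64_t ways) {
    return cache_bytes / (block_bytes * ways);
}

/* Set index that a byte address maps to. */
uint64_t set_index(uint64_t addr, uint64_t block_bytes, uint64_t sets) {
    return (addr / block_bytes) % sets;
}
```

the 2KB direct-mapped cache has `num_sets(2048, 16, 1)` = 128 sets, so addresses 2KB apart share a set; the 2-way version has 64 sets, so addresses 1KB apart share a set.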

## array usage: $i j k$ order


[figure: row of $A$ from $A_{x0}$ to $A_{xN}$]
for all $i$:
for all $j$:
for all $k$:
$C_{ij} += A_{ik} \times B_{kj}$
looking only at two innermost loops together:
good spatial locality in $A$
poor spatial locality in $B$
good spatial locality in $C$

## array usage: kij order



## simple blocking - with 3 ?

```
for (int kk = 0; kk < N; kk += 3)
    for (int i = 0; i < N; i += 1)
        for (int j = 0; j < N; ++j) {
            C[i*N+j] += A[i*N+kk+0] * B[(kk+0)*N+j];
            C[i*N+j] += A[i*N+kk+1] * B[(kk+1)*N+j];
            C[i*N+j] += A[i*N+kk+2] * B[(kk+2)*N+j];
        }
```

$\frac{N}{3} \cdot N$ j-loop iterations, and (assuming $N$ large):
about 1 miss from $A$ per j-loop iteration
$N^{2} / 3$ total misses (before blocking: $N^{2}$)
about $3N \div$ block size misses from $B$ per j-loop iteration
$N^{3} \div$ block size total misses (same as before)
about $N \div$ block size misses from $C$ per j-loop iteration
$N^{3} \div (3 \cdot \text{block size})$ total misses (before blocking: $N^{3} \div$ block size)

## more than 3 ?

can we just keep doing this?
increase from 3 to some large $X$? ...
assumption: $X$ values from $A$ would stay in cache
$X$ too large - cache not big enough
assumption: $X$ blocks from $B$ would help with spatial locality
$X$ too large - evicted from cache before next iteration

## array usage (2 $k$ at a time)

[figure label: $B_{ki}$ to $B_{k+1,i}$]

for each kk:
for each i:
for each j:
for k = kk, kk+1:
$C_{ij} += A_{ik} \cdot B_{kj}$

within innermost loop:
good spatial locality in $A$
bad locality in $B$
good temporal locality in $C$

loop over $j$: better spatial locality over $A$ than before; still good temporal locality for $A$

loop over $j$: spatial locality over $B$ is worse but probably not more misses
cache needs to keep two cache blocks for next iter instead of one (probably has the space left over!)

right now: only really care about keeping 4 cache blocks in $j$ loop
have more than 4 cache blocks? increasing kk increment would use more of them

## keeping values in cache

can't explicitly ensure values are kept in cache
...but reusing values effectively does this
cache will try to keep recently used values
cache optimization ideas: choose what's in the cache
for thinking about it: load values explicitly
for implementing it: access only values we want loaded

## TLB and the MMU (1)

[figure: TLB and the MMU]

## TLB and the MMU (2)

[figure: TLB and the MMU, with the data or instruction cache]
TLB miss: page table access happens
TLB miss: TLB gets a copy of the page table entry


## changing page tables

what happens to TLB when page table base pointer is changed?
e.g. context switch
most entries in TLB refer to things from wrong process
oops - read from the wrong process's stack?
option 1: invalidate all TLB entries
side effect on "change page table base register" instruction
option 2: TLB entries contain process ID
set by OS (special register)
checked by TLB in addition to TLB tag, valid bit

## editing page tables

what happens to TLB when OS changes a page table entry?
most common choice: has to be handled in software
invalid to valid - nothing needed
TLB doesn't contain invalid entries
MMU will check memory again
valid to invalid - OS needs to tell processor to invalidate it
special instruction (x86: invlpg)
valid to other valid - OS needs to tell processor to invalidate it
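a software model of the TLB options above, as a sketch only. It is not how any particular processor implements its TLB: the 16-entry direct-mapped layout and all names (`tlb`, `tlb_lookup`, `tlb_fill`, `tlb_flush_all`, `tlb_invalidate_page`) are assumptions made for illustration.

```c
#include <stdbool.h>
#include <stdint.h>

#define TLB_ENTRIES 16   /* hypothetical size; direct-mapped for simplicity */

typedef struct {
    bool valid;
    uint32_t process_id;  /* option 2: which process the entry belongs to */
    uint64_t vpn;         /* virtual page number (the TLB tag) */
    uint64_t ppn;         /* physical page number copied from the page table entry */
} tlb_entry;

typedef struct {
    tlb_entry entries[TLB_ENTRIES];
    uint32_t current_process_id;  /* special register, set by the OS */
} tlb;

/* Hit only if valid bit, tag, and process ID all match. */
bool tlb_lookup(const tlb *t, uint64_t vpn, uint64_t *ppn) {
    const tlb_entry *e = &t->entries[vpn % TLB_ENTRIES];
    if (e->valid && e->vpn == vpn && e->process_id == t->current_process_id) {
        *ppn = e->ppn;
        return true;
    }
    return false;  /* TLB miss: page table access happens */
}

/* After a miss, the TLB gets a copy of the page table entry. */
void tlb_fill(tlb *t, uint64_t vpn, uint64_t ppn) {
    tlb_entry *e = &t->entries[vpn % TLB_ENTRIES];
    e->valid = true;
    e->process_id = t->current_process_id;
    e->vpn = vpn;
    e->ppn = ppn;
}

/* Option 1: invalidate all entries when the page table base changes. */
void tlb_flush_all(tlb *t) {
    for (int i = 0; i < TLB_ENTRIES; ++i)
        t->entries[i].valid = false;
}

/* What the OS requests after editing one page table entry
   (in the spirit of x86 invlpg). */
void tlb_invalidate_page(tlb *t, uint64_t vpn) {
    tlb_entry *e = &t->entries[vpn % TLB_ENTRIES];
    if (e->valid && e->vpn == vpn && e->process_id == t->current_process_id)
        e->valid = false;
}
```

with process IDs in the entries, a context switch only changes `current_process_id`; the other process's entries stay in the TLB but stop matching.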

