## last time

### direct-mapped cache design

divide cache, memory into 'blocks'

power of two number of rows ('sets') with one block each

split address into tag / [set] index / [block] offset

always store whole blocks (addresses with offset 0 to offset MAX)

always store what was just read

## anonymous feedback

# licenses.txt (for part 2 of pagetable)

want you to know open source code online is not just free for all

understand how authors intend things be used

and purported legal restrictions for use

not a class on what restrictions are enforceable, etc.

look at [at least] three licenses, decide what matches your goals for code

2 byte blocks, 4 sets

| address (hex) | result |
|---------------|--------|
| 00000000 (00) |        |
| 00000001 (01) |        |
| 01100011 (63) |        |
| 01100001 (61) |        |
| 01100010 (62) |        |
| 00000000 (00) |        |
| 01100100 (64) |        |

| index | valid | tag | value |
|-------|-------|-----|-------|
| 00    | 0     |     |       |
| 01    | 0     |     |       |
| 10    | 0     |     |       |
| 11    | 0     |     |       |


$B = 2 = 2^{b}$ byte block size; $b = 1$ (block) offset bits; $S = 4 = 2^{s}$ sets; $s = 2$ (set) index bits; $m = 8$ bit addresses

2 byte blocks, 4 sets

| address (hex) | tag   | index | offset | result |
|---------------|-------|-------|--------|--------|
| 00000000 (00) | 00000 | 00    | 0      |        |
| 00000001 (01) | 00000 | 00    | 1      |        |
| 01100011 (63) | 01100 | 01    | 1      |        |
| 01100001 (61) | 01100 | 00    | 1      |        |
| 01100010 (62) | 01100 | 01    | 0      |        |
| 00000000 (00) | 00000 | 00    | 0      |        |
| 01100100 (64) | 01100 | 10    | 0      |        |

$B = 2 = 2^{b}$ byte block size; $b = 1$ (block) offset bits; $S = 4 = 2^{s}$ sets; $s = 2$ (set) index bits; $m = 8$ bit addresses

| index | valid | tag | value |
|-------|-------|-----|-------|
| 00    | 0     |     |       |
| 01    | 0     |     |       |
| 10    | 0     |     |       |
| 11    | 0     |     |       |

2 byte blocks, 4 sets

| address (hex) | tag   | index | offset | result |
|---------------|-------|-------|--------|--------|
| 00000000 (00) | 00000 | 00    | 0      | miss   |
| 00000001 (01) | 00000 | 00    | 1      |        |
| 01100011 (63) | 01100 | 01    | 1      |        |
| 01100001 (61) | 01100 | 00    | 1      |        |
| 01100010 (62) | 01100 | 01    | 0      |        |
| 00000000 (00) | 00000 | 00    | 0      |        |
| 01100100 (64) | 01100 | 10    | 0      |        |

$B = 2 = 2^{b}$ byte block size; $b = 1$ (block) offset bits; $S = 4 = 2^{s}$ sets; $s = 2$ (set) index bits; $m = 8$ bit addresses

| index | valid | tag   | value               |
|-------|-------|-------|---------------------|
| 00    | 1     | 00000 | mem[0x00] mem[0x01] |
| 01    | 0     |       |                     |
| 10    | 0     |       |                     |
| 11    | 0     |       |                     |

2 byte blocks, 4 sets

| address (hex) | tag   | index | offset | result |
|---------------|-------|-------|--------|--------|
| 00000000 (00) | 00000 | 00    | 0      | miss   |
| 00000001 (01) | 00000 | 00    | 1      | hit    |
| 01100011 (63) | 01100 | 01    | 1      |        |
| 01100001 (61) | 01100 | 00    | 1      |        |
| 01100010 (62) | 01100 | 01    | 0      |        |
| 00000000 (00) | 00000 | 00    | 0      |        |
| 01100100 (64) | 01100 | 10    | 0      |        |

$B = 2 = 2^{b}$ byte block size; $b = 1$ (block) offset bits; $S = 4 = 2^{s}$ sets; $s = 2$ (set) index bits; $m = 8$ bit addresses

| index | valid | tag   | value               |
|-------|-------|-------|---------------------|
| 00    | 1     | 00000 | mem[0x00] mem[0x01] |
| 01    | 0     |       |                     |
| 10    | 0     |       |                     |
| 11    | 0     |       |                     |

2 byte blocks, 4 sets

| address (hex) | tag   | index | offset | result |
|---------------|-------|-------|--------|--------|
| 00000000 (00) | 00000 | 00    | 0      | miss   |
| 00000001 (01) | 00000 | 00    | 1      | hit    |
| 01100011 (63) | 01100 | 01    | 1      | miss   |
| 01100001 (61) | 01100 | 00    | 1      |        |
| 01100010 (62) | 01100 | 01    | 0      |        |
| 00000000 (00) | 00000 | 00    | 0      |        |
| 01100100 (64) | 01100 | 10    | 0      |        |

$B = 2 = 2^{b}$ byte block size; $b = 1$ (block) offset bits; $S = 4 = 2^{s}$ sets; $s = 2$ (set) index bits; $m = 8$ bit addresses

| index | valid | tag   | value               |
|-------|-------|-------|---------------------|
| 00    | 1     | 00000 | mem[0x00] mem[0x01] |
| 01    | 1     | 01100 | mem[0x62] mem[0x63] |
| 10    | 0     |       |                     |
| 11    | 0     |       |                     |

$t = m - (s + b) = 5$ tag bits

2 byte blocks, 4 sets

| address (hex) | tag   | index | offset | result |
|---------------|-------|-------|--------|--------|
| 00000000 (00) | 00000 | 00    | 0      | miss   |
| 00000001 (01) | 00000 | 00    | 1      | hit    |
| 01100011 (63) | 01100 | 01    | 1      | miss   |
| 01100001 (61) | 01100 | 00    | 1      | miss   |
| 01100010 (62) | 01100 | 01    | 0      |        |
| 00000000 (00) | 00000 | 00    | 0      |        |
| 01100100 (64) | 01100 | 10    | 0      |        |

$B = 2 = 2^{b}$ byte block size; $b = 1$ (block) offset bits; $S = 4 = 2^{s}$ sets; $s = 2$ (set) index bits; $m = 8$ bit addresses

| index | valid | tag   | value               |
|-------|-------|-------|---------------------|
| 00    | 1     | 01100 | mem[0x60] mem[0x61] |
| 01    | 1     | 01100 | mem[0x62] mem[0x63] |
| 10    | 0     |       |                     |
| 11    | 0     |       |                     |

2 byte blocks, 4 sets

| address (hex) | tag   | index | offset | result |
|---------------|-------|-------|--------|--------|
| 00000000 (00) | 00000 | 00    | 0      | miss   |
| 00000001 (01) | 00000 | 00    | 1      | hit    |
| 01100011 (63) | 01100 | 01    | 1      | miss   |
| 01100001 (61) | 01100 | 00    | 1      | miss   |
| 01100010 (62) | 01100 | 01    | 0      | hit    |
| 00000000 (00) | 00000 | 00    | 0      |        |
| 01100100 (64) | 01100 | 10    | 0      |        |

$B = 2 = 2^{b}$ byte block size; $b = 1$ (block) offset bits; $S = 4 = 2^{s}$ sets; $s = 2$ (set) index bits; $m = 8$ bit addresses

| index | valid | tag   | value               |
|-------|-------|-------|---------------------|
| 00    | 1     | 01100 | mem[0x60] mem[0x61] |
| 01    | 1     | 01100 | mem[0x62] mem[0x63] |
| 10    | 0     |       |                     |
| 11    | 0     |       |                     |

2 byte blocks, 4 sets

| address (hex) | tag   | index | offset | result |
|---------------|-------|-------|--------|--------|
| 00000000 (00) | 00000 | 00    | 0      | miss   |
| 00000001 (01) | 00000 | 00    | 1      | hit    |
| 01100011 (63) | 01100 | 01    | 1      | miss   |
| 01100001 (61) | 01100 | 00    | 1      | miss   |
| 01100010 (62) | 01100 | 01    | 0      | hit    |
| 00000000 (00) | 00000 | 00    | 0      | miss   |
| 01100100 (64) | 01100 | 10    | 0      |        |

$B = 2 = 2^{b}$ byte block size; $b = 1$ (block) offset bits; $S = 4 = 2^{s}$ sets; $s = 2$ (set) index bits; $m = 8$ bit addresses

| index | valid | tag   | value               |
|-------|-------|-------|---------------------|
| 00    | 1     | 00000 | mem[0x00] mem[0x01] |
| 01    | 1     | 01100 | mem[0x62] mem[0x63] |
| 10    | 0     |       |                     |
| 11    | 0     |       |                     |

2 byte blocks, 4 sets

| address (hex) | tag   | index | offset | result |
|---------------|-------|-------|--------|--------|
| 00000000 (00) | 00000 | 00    | 0      | miss   |
| 00000001 (01) | 00000 | 00    | 1      | hit    |
| 01100011 (63) | 01100 | 01    | 1      | miss   |
| 01100001 (61) | 01100 | 00    | 1      | miss   |
| 01100010 (62) | 01100 | 01    | 0      | hit    |
| 00000000 (00) | 00000 | 00    | 0      | miss   |
| 01100100 (64) | 01100 | 10    | 0      | miss   |

$B = 2 = 2^{b}$ byte block size; $b = 1$ (block) offset bits; $S = 4 = 2^{s}$ sets; $s = 2$ (set) index bits; $m = 8$ bit addresses

| index | valid | tag   | value               |
|-------|-------|-------|---------------------|
| 00    | 1     | 00000 | mem[0x00] mem[0x01] |
| 01    | 1     | 01100 | mem[0x62] mem[0x63] |
| 10    | 1     | 01100 | mem[0x64] mem[0x65] |
| 11    | 0     |       |                     |


2 byte blocks, 4 sets

| address (hex) | tag   | index | offset | result                    |
|---------------|-------|-------|--------|---------------------------|
| 00000000 (00) | 00000 | 00    | 0      | miss                      |
| 00000001 (01) | 00000 | 00    | 1      | hit                       |
| 01100011 (63) | 01100 | 01    | 1      | miss                      |
| 01100001 (61) | 01100 | 00    | 1      | miss                      |
| 01100010 (62) | 01100 | 01    | 0      | hit                       |
| 00000000 (00) | 00000 | 00    | 0      | miss (caused by conflict) |
| 01100100 (64) | 01100 | 10    | 0      | miss                      |

| index | valid | tag   | value               |
|-------|-------|-------|---------------------|
| 00    | 1     | 00000 | mem[0x00] mem[0x01] |
| 01    | 1     | 01100 | mem[0x62] mem[0x63] |
| 10    | 1     | 01100 | mem[0x64] mem[0x65] |
| 11    | 0     |       |                     |

$B = 2 = 2^{b}$ byte block size; $b = 1$ (block) offset bits; $S = 4 = 2^{s}$ sets; $s = 2$ (set) index bits; $m = 8$ bit addresses

#### 4 byte blocks, 4 sets

| address (hex) | result |
|---------------|--------|
| 00000000 (00) |        |
| 00000001 (01) |        |
| 01100011 (63) |        |
| 01100001 (61) |        |
| 01100010 (62) |        |
| 00000000 (00) |        |
| 01100100 (64) |        |

| index | valid | tag | value |
|-------|-------|-----|-------|
| 00    |       |     |       |
| 01    |       |     |       |
| 10    |       |     |       |
| 11    |       |     |       |

how is the 8-bit address 61 (01100001) split up into tag/index/offset?

b block offset bits; $B = 2^{b}$ byte block size; s set index bits; $S = 2^{s}$ sets; $t = m - (s + b)$ tag bits (leftover)

$B = 4 = 2^{b}$ byte block size; $b = 2$ (block) offset bits; $S = 4 = 2^{s}$ sets; $s = 2$ (set) index bits

m = 8 bit addresses

#### 4 byte blocks, 4 sets

| address (hex) | tag  | index | offset | result |
|---------------|------|-------|--------|--------|
| 00000000 (00) | 0000 | 00    | 00     |        |
| 00000001 (01) | 0000 | 00    | 01     |        |
| 01100011 (63) | 0110 | 00    | 11     |        |
| 01100001 (61) | 0110 | 00    | 01     |        |
| 01100010 (62) | 0110 | 00    | 10     |        |
| 00000000 (00) | 0000 | 00    | 00     |        |
| 01100100 (64) | 0110 | 01    | 00     |        |

$B = 4 = 2^{b}$ byte block size; $b = 2$ (block) offset bits; $S = 4 = 2^{s}$ sets; $s = 2$ (set) index bits; $m = 8$ bit addresses

| index | valid | tag | value |
|-------|-------|-----|-------|
| 00    |       |     |       |
| 01    |       |     |       |
| 10    |       |     |       |
| 11    |       |     |       |

#### 4 byte blocks, 4 sets

| address (hex) | tag  | index | offset | result |
|---------------|------|-------|--------|--------|
| 00000000 (00) | 0000 | 00    | 00     |        |
| 00000001 (01) | 0000 | 00    | 01     |        |
| 01100011 (63) | 0110 | 00    | 11     |        |
| 01100001 (61) | 0110 | 00    | 01     |        |
| 01100010 (62) | 0110 | 00    | 10     |        |
| 00000000 (00) | 0000 | 00    | 00     |        |
| 01100100 (64) | 0110 | 01    | 00     |        |

exercise: which accesses are hits?

# mapping of sets to memory (direct-mapped)

values which would be stored in the same set are (cache size) bytes apart in memory

## simulated misses: BST lookups



(simulated 16KB direct-mapped data cache; excluding BST setup)

## actual misses: BST lookups



(actual 32KB more complex data cache) (only one set of measurements + other things on machine + excluding initial load)

## simulated misses: matrix multiplies



(simulated 16KB direct-mapped data cache; excluding initial load)

## actual misses: matrix multiplies



(actual 32KB more complex data cache; excluding matrix initial load) (only one set of measurements + other things on machine)


2-way set associative, 2 byte blocks, 2 sets

| index | valid | tag | value | valid | tag | value |
|-------|-------|-----|-------|-------|-----|-------|
| 0     | 0     |     |       | 0     |     |       |
| 1     | 0     |     |       | 0     |     |       |

multiple places to put values with same index

avoid misses from two active values using same set ("conflict misses")

2-way set associative, 2 byte blocks, 2 sets

| index     | valid | tag | value | valid | tag | value |
|-----------|-------|-----|-------|-------|-----|-------|
| 0 (set 0) | 0     |     |       | 0     |     |       |
| 1 (set 1) | 0     |     |       | 0     |     |       |



2-way set associative, 2 byte blocks, 2 sets

| index | valid | tag | value | valid | tag | value |
|-------|-------|-----|-------|-------|-----|-------|
| 0     | 0     |     |       | 0     |     |       |
| 1     | 0     |     |       | 0     |     |       |

$m = 8$ bit addresses; $S = 2 = 2^{s}$ sets; $s = 1$ (set) index bits; $B = 2 = 2^{b}$ byte block size; $b = 1$ (block) offset bits; $t = m - (s + b) = 6$ tag bits

2-way set associative, 2 byte blocks, 2 sets

| index | valid | tag    | value               | valid | tag | value |
|-------|-------|--------|---------------------|-------|-----|-------|
| 0     | 1     | 000000 | mem[0x00] mem[0x01] | 0     |     |       |
| 1     | 0     |        |                     | 0     |     |       |

| address (hex) | result |
|---------------|--------|
| 00000000 (00) | miss   |
| 00000001 (01) |        |
| 01100011 (63) |        |
| 01100001 (61) |        |
| 01100010 (62) |        |
| 00000000 (00) |        |
| 01100100 (64) |        |

2-way set associative, 2 byte blocks, 2 sets

| index | valid | tag    | value               | valid | tag | value |
|-------|-------|--------|---------------------|-------|-----|-------|
| 0     | 1     | 000000 | mem[0x00] mem[0x01] | 0     |     |       |
| 1     | 0     |        |                     | 0     |     |       |

| address (hex) | result |
|---------------|--------|
| 00000000 (00) | miss   |
| 00000001 (01) | hit    |
| 01100011 (63) |        |
| 01100001 (61) |        |
| 01100010 (62) |        |
| 00000000 (00) |        |
| 01100100 (64) |        |

2-way set associative, 2 byte blocks, 2 sets

| index | valid | tag    | value               | valid | tag | value |
|-------|-------|--------|---------------------|-------|-----|-------|
| 0     | 1     | 000000 | mem[0x00] mem[0x01] | 0     |     |       |
| 1     | 1     | 011000 | mem[0x62] mem[0x63] | 0     |     |       |

| address (hex) | result |
|---------------|--------|
| 00000000 (00) | miss   |
| 00000001 (01) | hit    |
| 01100011 (63) | miss   |
| 01100001 (61) |        |
| 01100010 (62) |        |
| 00000000 (00) |        |
| 01100100 (64) |        |

2-way set associative, 2 byte blocks, 2 sets

| index | valid | tag    | value               | valid | tag    | value               |
|-------|-------|--------|---------------------|-------|--------|---------------------|
| 0     | 1     | 000000 | mem[0x00] mem[0x01] | 1     | 011000 | mem[0x60] mem[0x61] |
| 1     | 1     | 011000 | mem[0x62] mem[0x63] | 0     |        |                     |

| address (hex) | result |
|---------------|--------|
| 00000000 (00) | miss   |
| 00000001 (01) | hit    |
| 01100011 (63) | miss   |
| 01100001 (61) | miss   |
| 01100010 (62) |        |
| 00000000 (00) |        |
| 01100100 (64) |        |

2-way set associative, 2 byte blocks, 2 sets

| index | valid | tag    | value               | valid | tag    | value               |
|-------|-------|--------|---------------------|-------|--------|---------------------|
| 0     | 1     | 000000 | mem[0x00] mem[0x01] | 1     | 011000 | mem[0x60] mem[0x61] |
| 1     | 1     | 011000 | mem[0x62] mem[0x63] | 0     |        |                     |

| address (hex) | result |
|---------------|--------|
| 00000000 (00) | miss   |
| 00000001 (01) | hit    |
| 01100011 (63) | miss   |
| 01100001 (61) | miss   |
| 01100010 (62) | hit    |
| 00000000 (00) |        |
| 01100100 (64) |        |

2-way set associative, 2 byte blocks, 2 sets

| index | valid | tag    | value               | valid | tag    | value               |
|-------|-------|--------|---------------------|-------|--------|---------------------|
| 0     | 1     | 000000 | mem[0x00] mem[0x01] | 1     | 011000 | mem[0x60] mem[0x61] |
| 1     | 1     | 011000 | mem[0x62] mem[0x63] | 0     |        |                     |

| address (hex) | result |
|---------------|--------|
| 00000000 (00) | miss   |
| 00000001 (01) | hit    |
| 01100011 (63) | miss   |
| 01100001 (61) | miss   |
| 01100010 (62) | hit    |
| 00000000 (00) | hit    |
| 01100100 (64) |        |

2-way set associative, 2 byte blocks, 2 sets

| index | valid | tag    | value               | valid | tag    | value               |
|-------|-------|--------|---------------------|-------|--------|---------------------|
| 0     | 1     | 000000 | mem[0x00] mem[0x01] | 1     | 011000 | mem[0x60] mem[0x61] |
| 1     | 1     | 011000 | mem[0x62] mem[0x63] | 0     |        |                     |

| address (hex) | result |
|---------------|--------|
| 00000000 (00) | miss   |
| 00000001 (01) | hit    |
| 01100011 (63) | miss   |
| 01100001 (61) | miss   |
| 01100010 (62) | hit    |
| 00000000 (00) | hit    |
| 01100100 (64) | miss   |

#### associative lookup possibilities

none of the blocks for the index are valid

none of the valid blocks for the index match the tag

something else is stored there

one of the blocks for the index is valid and matches the tag

## cache operation (associative)

(figure: lookup in a 2-way set associative cache, with valid, tag, and data fields per way and the address split into tag and index)

#### replacement policies

2-way set associative, 2 byte blocks, 2 sets

| index | valid | tag    | value               | valid | tag    | value               |
|-------|-------|--------|---------------------|-------|--------|---------------------|
| 0     | 1     | 000000 | mem[0x00] mem[0x01] | 1     | 011000 | mem[0x60] mem[0x61] |
| 1     | 1     | 011000 | mem[0x62] mem[0x63] | 0     |        |                     |

| address (hex) | result |
|---------------|--------|
| 00000000 (00) | miss   |
| 00000001 (01) | hit    |
| 01100011 (63) | miss   |
| 01100001 (61) | miss   |
| 01100010 (62) | hit    |
| 00000000 (00) | hit    |
| 01100100 (64) | miss   |

how to decide where to insert 0x64?

#### replacement policies

2-way set associative, 2 byte blocks, 2 sets

| index | valid | tag    | value               | valid | tag    | value               | LRU |
|-------|-------|--------|---------------------|-------|--------|---------------------|-----|
| 0     | 1     | 000000 | mem[0x00] mem[0x01] | 1     | 011000 | mem[0x60] mem[0x61] | 1   |
| 1     | 1     | 011000 | mem[0x62] mem[0x63] | 0     |        |                     | 1   |

| address (hex) | result |
|---------------|--------|
| 00000000 (00) | miss   |
| 00000001 (01) | hit    |
| 01100011 (63) | miss   |
| 01100001 (61) | miss   |
| 01100010 (62) | hit    |
| 00000000 (00) | hit    |
| 01100100 (64) | miss   |

track which block was read least recently

updated on every access

## example replacement policies

least recently used

take advantage of temporal locality

at least $\lceil \log_2(E!) \rceil$ bits per set for *E*-way cache (need to store order of all blocks)

approximations of least recently used

implementing least recently used is expensive

really just need "avoid recently used": much faster/simpler

good approximations: E to 2E bits

first-in, first-out

counter per set: where to replace next

(pseudo-)random

no extra information!

actually works pretty well in practice

## associativity terminology

direct-mapped: one block per set

E-way set associative: E blocks per set

E ways in the cache

fully associative: one set total (everything in one set)

# **Tag-Index-Offset formulas**

| symbol                    | meaning                           |
|---------------------------|-----------------------------------|
| $m$                       | memory address bits               |
| $E$                       | number of blocks per set ("ways") |
| $S = 2^s$                 | number of sets                    |
| $s$                       | (set) index bits                  |
| $B = 2^b$                 | block size                        |
| $b$                       | (block) offset bits               |
| $t = m - (s + b)$         | tag bits                          |
| $C = B \times S \times E$ | cache size (excluding metadata)   |

# cache accesses and C code (1)

```
int scaleFactor;

int scaleByFactor(int value) {
    return value * scaleFactor;
}
```

```
scaleByFactor:
    movl scaleFactor, %eax
    imull %edi, %eax
    ret
```

exercise: what data cache accesses does this function do?

4-byte read of scaleFactor

8-byte read of return address

#### possible scaleFactor use

```
for (int i = 0; i < size; ++i) {
    array[i] = scaleByFactor(array[i]);
}
```

# misses and code (2)

```
scaleByFactor:
    movl scaleFactor, %eax
    imull %edi, %eax
    ret
```

suppose each time this is called in the loop:

- return address located at address 0x7ffffffe43b8
- scaleFactor located at address 0x6bc3a0

with a direct-mapped 32KB cache w/ 64B blocks, what are the tag, index, and offset of the return address and of scaleFactor?

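with a direct-mapped 32KB cache w/ 64B blocks (512 sets, so 9 index bits and 6 offset bits):

|        | return address | scaleFactor |
|--------|----------------|-------------|
| tag    | 0xfffffffc     | 0xd7        |
| index  | 0x10e          | 0x10e       |
| offset | 0x38           | 0x20        |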
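The split above is just shifting and masking. A minimal sketch, assuming the 32KB direct-mapped cache with 64B blocks from this slide (6 offset bits, 9 index bits); the function names are hypothetical:

```c
#include <stdint.h>

#define OFFSET_BITS 6   /* 64-byte blocks: 2^6 */
#define INDEX_BITS  9   /* 32KB / 64B = 512 sets: 2^9 */

/* low OFFSET_BITS bits: position within the block */
static uint64_t addr_offset(uint64_t addr) {
    return addr & ((1ULL << OFFSET_BITS) - 1);
}

/* next INDEX_BITS bits: which set the block maps to */
static uint64_t addr_index(uint64_t addr) {
    return (addr >> OFFSET_BITS) & ((1ULL << INDEX_BITS) - 1);
}

/* everything above the index: stored in the cache to identify the block */
static uint64_t addr_tag(uint64_t addr) {
    return addr >> (OFFSET_BITS + INDEX_BITS);
}
```

Both addresses produce index 0x10e with different tags, so in a direct-mapped cache they compete for the same block.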

#### conflict miss coincidences?

obviously I set that up to have the same index

have to use exactly the right amount of stack space...

but this is one of the reasons we'll want something better than a direct-mapped cache

# C and cache misses (warmup 1)

```
int array[4];
...
int even_sum = 0, odd_sum = 0;
even_sum += array[0];
odd_sum += array[1];
even_sum += array[2];
odd_sum += array[3];
```

Assume everything but array is kept in registers (and the compiler does not do anything funny).

How many data cache misses on a 1-set direct-mapped cache with 8B blocks?

#### some possibilities



Q1: how do cache blocks correspond to array elements? not enough information provided!

if array[0] starts at the beginning of a cache block, the array is split across two cache blocks:

| memory access        | cache contents afterwards       |
|----------------------|---------------------------------|
|                      | (empty)                         |
| read array[0] (miss) | {array[0], array[1]}            |
| read array[1] (hit)  | {array[0], array[1]}            |
| read array[2] (miss) | {array[2], array[3]}            |
| read array[3] (hit)  | {array[2], array[3]}            |

if array[0] starts right in the middle of a cache block, the array is split across three cache blocks:

| memory access        | cache contents afterwards |
|----------------------|---------------------------|
|                      | (empty)                   |
| read array[0] (miss) | {****, array[0]}          |
| read array[1] (miss) | {array[1], array[2]}      |
| read array[2] (hit)  | {array[1], array[2]}      |
| read array[3] (miss) | {array[3], ++++}          |

if array[0] starts at an odd place in a cache block, need to read two cache blocks to get most array elements:

| memory access                  | cache contents afterwards                        |
|--------------------------------|--------------------------------------------------|
|                                | (empty)                                          |
| read array[0] byte 0 (miss)    | {****, array[0] byte 0}                          |
| read array[0] bytes 1-3 (miss) | {array[0] bytes 1-3, array[1], array[2] byte 0}  |
| read array[1] (hit)            | {array[0] bytes 1-3, array[1], array[2] byte 0}  |
| read array[2] byte 0 (hit)     | {array[0] bytes 1-3, array[1], array[2] byte 0}  |
| read array[2] bytes 1-3 (miss) | {part of array[2], array[3], ++++}               |
| read array[3] (hit)            | {part of array[2], array[3], ++++}               |

## aside: alignment

compilers and malloc/new implementations usually try to align values

align = make address be multiple of something

most important reason: don't cross cache block boundaries
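A rough way to see why alignment helps: check whether an access touches more than one cache block. This is a sketch, not how any particular compiler or allocator decides alignment; `crosses_block` and `is_aligned` are hypothetical helpers, and the block size is assumed to be a power of two.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* align = address is a multiple of `alignment` */
static bool is_aligned(uint64_t addr, uint64_t alignment) {
    return addr % alignment == 0;
}

/* true if the `size`-byte access starting at `addr` spans two or more
   cache blocks; assumes size >= 1 */
static bool crosses_block(uint64_t addr, size_t size, uint64_t block_size) {
    uint64_t first_block = addr / block_size;
    uint64_t last_block = (addr + size - 1) / block_size;
    return first_block != last_block;
}
```

With 8-byte blocks, a 4-byte int at address 4 stays inside one block, while the same int at address 7 (the "odd place" case above) needs two blocks.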

# C and cache misses (warmup 2)

```
int array[4];
int even_sum = 0, odd_sum = 0;
even_sum += array[0];
even_sum += array[2];
odd_sum += array[1];
odd_sum += array[3];
```

Assume everything but array is kept in registers (and the compiler does not do anything funny).

Assume array[0] at beginning of cache block.

How many data cache misses on a 1-set direct-mapped cache with 8B blocks?

#### exercise solution

(figure: array[0] and array[1] share one cache block; array[2] and array[3] share the next)

| memory access        | cache contents afterwards |
|----------------------|---------------------------|
|                      | (empty)                   |
| read array[0] (miss) | {array[0], array[1]}      |
| read array[2] (miss) | {array[2], array[3]}      |
| read array[1] (miss) | {array[0], array[1]}      |
| read array[3] (miss) | {array[2], array[3]}      |
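The two warmups differ only in access order, which a tiny simulation makes visible. This sketch models a cache that holds exactly one 8-byte block and assumes 4-byte ints, an array starting at byte address 0, and accesses that never straddle a block; `count_misses_one_block` is a hypothetical name.

```c
#include <stddef.h>
#include <stdint.h>

#define BLOCK_SIZE 8   /* bytes per cache block */

/* count misses for a cache holding exactly one block (1 set, direct-mapped);
   addrs are the byte addresses accessed, in order; the cache starts empty */
static int count_misses_one_block(const uint64_t *addrs, size_t n) {
    int misses = 0;
    int valid = 0;              /* is anything cached yet? */
    uint64_t cached_block = 0;  /* block number currently held */
    for (size_t i = 0; i < n; i++) {
        uint64_t block = addrs[i] / BLOCK_SIZE;
        if (!valid || block != cached_block) {
            misses++;               /* miss: replace the only block */
            cached_block = block;
            valid = 1;
        }
    }
    return misses;
}
```

Passing the addresses for array[0], array[1], array[2], array[3] (0, 4, 8, 12) gives 2 misses; the warmup 2 order array[0], array[2], array[1], array[3] (0, 8, 4, 12) gives 4.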

## backup slides

# arrays and cache misses (1)

```
int array[1024]; // 4KB array
int even_sum = 0, odd_sum = 0;
for (int i = 0; i < 1024; i += 2) {
    even_sum += array[i + 0];
    odd_sum += array[i + 1];
}
```

Assume everything but array is kept in registers (and the compiler does not do anything funny).

How many *data cache misses* on initially empty 2KB direct-mapped cache with 16B cache blocks?
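One way to check an answer to this kind of question is to replay the loop's reads through a small direct-mapped cache model. The sketch below assumes 4-byte ints and takes the array's starting byte address as a parameter; `struct dm_cache`, `dm_access`, and `loop_misses` are hypothetical names for this illustration.

```c
#include <stdint.h>
#include <string.h>

#define CACHE_SIZE 2048                      /* 2KB */
#define BLOCK_SIZE 16                        /* 16B blocks */
#define NUM_SETS (CACHE_SIZE / BLOCK_SIZE)   /* 128 sets, one block each */

struct dm_cache {
    int valid[NUM_SETS];
    uint64_t tag[NUM_SETS];
    long misses;
};

/* simulate one read; on a miss, the block replaces whatever is in its set */
static void dm_access(struct dm_cache *c, uint64_t addr) {
    uint64_t block = addr / BLOCK_SIZE;
    uint64_t index = block % NUM_SETS;
    uint64_t tag = block / NUM_SETS;
    if (!c->valid[index] || c->tag[index] != tag) {
        c->misses++;
        c->valid[index] = 1;
        c->tag[index] = tag;
    }
}

/* replay the loop's reads of array[i + 0] and array[i + 1], starting
   from an empty cache; `base` is the assumed address of array[0] */
static long loop_misses(uint64_t base) {
    struct dm_cache c;
    memset(&c, 0, sizeof c);
    for (int i = 0; i < 1024; i += 2) {
        dm_access(&c, base + 4 * (uint64_t)(i + 0));
        dm_access(&c, base + 4 * (uint64_t)(i + 1));
    }
    return c.misses;
}
```

Calling `loop_misses(0)` models an array that starts at the beginning of a cache block; trying a `base` such as 8 shows how the count changes when it does not.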

# arrays and cache misses (2)

Assume everything but array is kept in registers (and the compiler does not do anything funny).

How many *data cache misses* on initially empty 2KB direct-mapped cache with 16B cache blocks?

# arrays and cache misses (2b)

Assume everything but array is kept in registers (and the compiler does not do anything funny).

How many *data cache misses* on initially empty 4KB direct-mapped cache with 16B cache blocks?

## inclusive versus exclusive

L2 inclusive of L1

- everything in L1 cache is duplicated in L2
- adding to L1 also adds to L2

L2 exclusive of L1

- L2 contains different data than L1
- adding to L1 must remove from L2
- probably evicting from L1 adds to L2



inclusive policy: no extra work on eviction but duplicated data

easier to explain when Lk shared by multiple L(k-1) caches?

exclusive policy: avoid duplicated data

sometimes called *victim cache* (contains cache eviction victims)

makes less sense with multicore


# Tag-Index-Offset formulas (direct-mapped)

(formulas derivable from prior slides)

| symbol            | meaning                       |
|-------------------|-------------------------------|
| $S = 2^s$         | number of sets                |
| $s$               | (set) index bits              |
| $B = 2^b$         | block size                    |
| $b$               | (block) offset bits           |
| $m$               | memory address bits           |
| $t = m - (s + b)$ | tag bits                      |
| $C = B \times S$  | cache size (if direct-mapped) |

#### cache organization and miss rate

depends on program; one example:

SPEC CPU2000 benchmarks, 64B block size

LRU replacement policies

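data cache miss rates:

| cache size | direct-mapped | 2-way   | 8-way   | fully assoc. |
|------------|---------------|---------|---------|--------------|
| 1KB        | 8.63%         | 6.97%   | 5.63%   | 5.34%        |
| 2KB        | 5.71%         | 4.23%   | 3.30%   | 3.05%        |
| 4KB        | 3.70%         | 2.60%   | 2.03%   | 1.90%        |
| 16KB       | 1.59%         | 0.86%   | 0.56%   | 0.50%        |
| 64KB       | 0.66%         | 0.37%   | 0.10%   | 0.001%       |
| 128KB      | 0.27%         | 0.001%  | 0.0006% | 0.0006%      |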

Data: Cantin and Hill, "Cache Performance for SPEC CPU2000 Benchmarks" http://research.cs.wisc.edu/multifacet/misc/spec2000cache-data/

# exercise (1)

initial cache: 64-byte blocks, 64 sets, 8 ways/set

If we leave the other parameters listed above unchanged, which will probably reduce the number of capacity misses in a typical program? (Multiple may be correct.)

- A. quadrupling the block size (256-byte blocks, 64 sets, 8 ways/set)
- B. quadrupling the number of sets
- C. quadrupling the number of ways/set

# exercise (2)

initial cache: 64-byte blocks, 8 ways/set, 64KB cache

If we leave the other parameters listed above unchanged, which will probably reduce the number of capacity misses in a typical program? (Multiple may be correct.)

- A. quadrupling the block size (256-byte block, 8 ways/set, 64KB cache)
- B. quadrupling the number of ways/set
- C. quadrupling the cache size

# exercise (3)

initial cache: 64-byte blocks, 8 ways/set, 64KB cache

If we leave the other parameters listed above unchanged, which will probably reduce the number of conflict misses in a typical program? (Multiple may be correct.)

- A. quadrupling the block size (256-byte block, 8 ways/set, 64KB cache)
- B. quadrupling the number of ways/set
- C. quadrupling the cache size

#### prefetching

seems like we can't really improve cold misses...

have to have a miss to bring value into the cache?

solution: don't require miss: 'prefetch' the value before it's accessed

remaining problem: how do we know what to fetch?

#### common access patterns

suppose recently accessed 16B cache blocks are at: 0x48010, 0x48020, 0x48030, 0x48040

guess what's accessed next

common pattern with instruction fetches and array accesses

#### prefetching idea

look for sequential accesses

bring in guess at next-to-be-accessed value

if right: no cache miss (even if never accessed before)

if wrong: possibly evicted something else — could cause more misses

fortunately, sequential access guesses almost always right
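The idea can be sketched with a toy model: on every access, also bring in the next sequential block. This is a simplification under stated assumptions (a real prefetcher would first detect the sequential pattern, and timing is ignored); the 64-set direct-mapped cache and all names here are hypothetical.

```c
#include <stdint.h>
#include <string.h>

#define BLOCK_SIZE 16   /* 16B blocks, as in the access-pattern example */
#define NUM_SETS   64   /* assumed direct-mapped cache size for this sketch */

struct cache {
    int valid[NUM_SETS];
    uint64_t block[NUM_SETS];   /* full block number stored, for simplicity */
};

static int cache_contains(const struct cache *c, uint64_t block) {
    uint64_t index = block % NUM_SETS;
    return c->valid[index] && c->block[index] == block;
}

/* bring a block in, evicting whatever was in its set */
static void cache_fill(struct cache *c, uint64_t block) {
    uint64_t index = block % NUM_SETS;
    c->valid[index] = 1;
    c->block[index] = block;
}

/* count misses when reading n_blocks consecutive blocks starting at
   address `start`, with or without next-block prefetching */
static int sequential_misses(uint64_t start, int n_blocks, int prefetch_next) {
    struct cache c;
    memset(&c, 0, sizeof c);
    int misses = 0;
    for (int i = 0; i < n_blocks; i++) {
        uint64_t block = start / BLOCK_SIZE + (uint64_t)i;
        if (!cache_contains(&c, block)) {
            misses++;
            cache_fill(&c, block);
        }
        /* guess that the next sequential block is accessed next */
        if (prefetch_next && !cache_contains(&c, block + 1))
            cache_fill(&c, block + 1);
    }
    return misses;
}
```

For the four blocks starting at 0x48010, the model reports 4 misses without prefetching and 1 with it: only the very first access misses.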

#### quiz exercise solution

...

(figure: each cache block holds two array elements; blocks alternate between set index 0 and set index 1, so {array[0], array[1]} and {array[4], array[5]} map to set 0 while {array[2], array[3]} and {array[6], array[7]} map to set 1)

| memory access        | set 0 afterwards     | set 1 afterwards     |
|----------------------|----------------------|----------------------|
|                      | (empty)              | (empty)              |
| read array[0] (miss) | {array[0], array[1]} | (empty)              |
| read array[3] (miss) | {array[0], array[1]} | {array[2], array[3]} |
| read array[6] (miss) | {array[0], array[1]} | {array[6], array[7]} |
| read array[1] (hit)  | {array[0], array[1]} | {array[6], array[7]} |
| read array[4] (miss) | {array[4], array[5]} | {array[6], array[7]} |
| read array[7] (hit)  | {array[4], array[5]} | {array[6], array[7]} |
| read array[2] (miss) | {array[4], array[5]} | {array[2], array[3]} |
| read array[5] (hit)  | {array[4], array[5]} | {array[2], array[3]} |

...

#### quiz exercise solution

...

| memory access        | set 0 afterwards     | set 1 afterwards     |
|----------------------|----------------------|----------------------|
|                      | (empty)              | (empty)              |
| read array[0] (miss) | {array[0], array[1]} | (empty)              |
| read array[3] (miss) | {array[0], array[1]} | {array[2], array[3]} |
| read array[6] (miss) | {array[0], array[1]} | {array[6], array[7]} |
| read array[1] (hit)  | {array[0], array[1]} | {array[6], array[7]} |
| read array[4] (miss) | {array[4], array[5]}                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                   
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            | {array[6], array[7]} |
| read array[7] (hit)  | {array[4],array[5]}                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            | {array[6],array[7]}  |
|                      | {array[4],array[5]}                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            | {array[2],array[3]}  |
| mand array [E] (hit) | $\left[ 2 - \frac{1}{2} - \frac{1}{2}$ | [orroy[6] orroy[7]]  |

•••

#### quiz exercise solution

one cache block one cache block one cache block one cache block
(set index 1) (set index 0) (set index 1) (set index 0)
array[0]array[1]array[2]array[3]array[4]array[5]array[6]array[7]array

•••

| memory access        | set 0 afterwards                | set 1 afterwards                |
|----------------------|---------------------------------|---------------------------------|
| —                    | (empty)                         | (empty)                         |
| read array[0] (miss) | {array[0], array[1]}            | (empty)                         |
| read array[3] (miss) | {array[0], array[1]}            | {array[2], array[3]}            |
| read array[6] (miss) | {array[0], array[1]}            | {array[6], array[7]}            |
| read array[1] (hit)  | {array[0], array[1]}            | {array[6], array[7]}            |
| read array[4] (miss) | {array[4], array[5]}            | {array[6], array[7]}            |
| read array[7] (hit)  | {array[4], array[5]}            | {array[6], array[7]}            |
| read array[2] (miss) | {array[4], array[5]}            | {array[2], array[3]}            |
| read array[5] (hit)  | {array[4], array[5]}            | {array[2], array[3]}            |

#### not the quiz problem

...

one cache block one cache block one cache bloc one cache block

array[0]array[1]array[2]array[3]array[4]array[5]array[6]array[7]arra

if 1-set 2-way cache instead of 2-set 1-way cache:

| memory access        | single set with 2 ways, LRU first          |
|----------------------|--------------------------------------------|
| —                    | (empty), (empty)                           |
| read array[0] (miss) | (empty), {array[0], array[1]}              |
| read array[3] (miss) | {array[0], array[1]}, {array[2], array[3]} |
| read array[6] (miss) | {array[2], array[3]}, {array[6], array[7]} |
| read array[1] (miss) | {array[6], array[7]}, {array[0], array[1]} |
| read array[4] (miss) | {array[0], array[1]}, {array[4], array[5]} |
| read array[7] (miss) | {array[4], array[5]}, {array[6], array[7]} |
| read array[2] (miss) | {array[6], array[7]}, {array[2], array[3]} |
| read array[5] (miss) | {array[2], array[3]}, {array[4], array[5]} |
| read array[8] (miss) | {array[4], array[5]}, {array[8], array[9]} |
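
The two tables above can be reproduced mechanically. The following is a minimal sketch, not the course's own tool: `count_misses` is a hypothetical helper that assumes the array starts at the beginning of a cache block, that each block holds `elems_per_block` array elements, and that replacement is LRU.

```c
#include <assert.h>

#define MAX_SETS 8
#define MAX_WAYS 8

/* Hypothetical helper (not from the slides): count the misses caused by a
 * trace of array indices on a cache with `sets` sets and `ways` blocks per
 * set, using LRU replacement. Assumes array[0] starts a cache block. */
int count_misses(const int *trace, int n, int sets, int ways, int elems_per_block)
{
    /* blocks[s][0] is the least recently used block of set s,
     * blocks[s][used[s] - 1] the most recently used one */
    int blocks[MAX_SETS][MAX_WAYS];
    int used[MAX_SETS] = {0};
    int misses = 0;

    assert(sets >= 1 && sets <= MAX_SETS);
    assert(ways >= 1 && ways <= MAX_WAYS);
    assert(elems_per_block >= 1);

    for (int t = 0; t < n; ++t) {
        int block = trace[t] / elems_per_block;  /* block number in memory */
        int set = block % sets;                  /* set index */
        int pos = -1;
        for (int w = 0; w < used[set]; ++w)
            if (blocks[set][w] == block)
                pos = w;
        if (pos < 0) {                           /* miss */
            ++misses;
            if (used[set] < ways)
                pos = used[set]++;               /* fill an empty way */
            else
                pos = 0;                         /* evict the LRU block */
        }
        /* move the accessed block to the most-recently-used position */
        for (int w = pos; w + 1 < used[set]; ++w)
            blocks[set][w] = blocks[set][w + 1];
        blocks[set][used[set] - 1] = block;
    }
    return misses;
}
```

Called with the quiz trace 0, 3, 6, 1, 4, 7, 2, 5 and `sets = 2, ways = 1, elems_per_block = 2`, this counts the same 5 misses as the direct-mapped table; with `sets = 1, ways = 2` every access misses, as in the 2-way table.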

# C and cache misses (4)

```
typedef struct {
    int a_value, b_value;
    int other_values[6];
} item;
item items[5];
int a_sum = 0, b_sum = 0;
for (int i = 0; i < 5; ++i)
    a_sum += items[i].a_value;
for (int i = 0; i < 5; ++i)
    b_sum += items[i].b_value;
```

Assume everything but items is kept in registers (and the compiler does not do anything funny).

## C and cache misses (4, rewrite)

Assume everything but array is kept in registers (and the compiler does not do anything funny) and array starts at beginning of cache block.

How many *data cache misses* on a 2-way set associative 128B cache with 16B cache blocks and LRU replacement?

## C and cache misses (4, solution pt 1)

- ints are 4 bytes $\rightarrow$ array[0 to 3] and array[16 to 19] are in the same cache set
  - 64B = 16 ints stored per way
  - 4 sets total
- accessing array indices 0, 8, 16, 24, 32, 1, 9, 17, 25, 33
- 0 (set 0), 8 (set 2), 16 (set 0), 24 (set 2), 32 (set 0)
- 1 (set 0), 9 (set 2), 17 (set 0), 25 (set 2), 33 (set 0)
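
The set assignments listed above follow from dividing the byte offset by the block size and taking the result modulo the number of sets. A small sketch, assuming 4-byte ints and that array[0] sits at the start of a block that maps to set 0 (`set_of_index` and the constant names are made up for illustration):

```c
#include <assert.h>

/* 128B cache / 16B blocks = 8 blocks; 8 blocks / 2 ways = 4 sets */
enum { BLOCK_BYTES = 16, NUM_SETS = 4 };

/* Set index that array[index] maps to, assuming 4-byte ints and that
 * array[0] is at the start of a cache block mapping to set 0. */
int set_of_index(int index)
{
    assert(index >= 0);
    int byte_offset = index * 4;                   /* 4-byte ints */
    int block_number = byte_offset / BLOCK_BYTES;  /* which 16B block */
    return block_number % NUM_SETS;
}
```

For example, `set_of_index(8)` and `set_of_index(24)` both give set 2, while indices 0, 16 and 32 all land in set 0.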

## C and cache misses (4, solution pt 2)

| access    | set 0 after (LRU first)          | result |
|-----------|----------------------------------|--------|
| —         | —, —                             |        |
| array[0]  | —, array[0 to 3]                 | miss   |
| array[16] | array[0 to 3], array[16 to 19]   | miss   |
| array[32] | array[16 to 19], array[32 to 35] | miss   |
| array[1]  | array[32 to 35], array[0 to 3]   | miss   |
| array[17] | array[0 to 3], array[16 to 19]   | miss   |
| array[33] | array[16 to 19], array[32 to 35] | miss   |

6 misses for set 0

## C and cache misses (4, solution pt 3)

| access    | set 2 after (LRU first)         | result |
|-----------|---------------------------------|--------|
| —         | —, —                            |        |
| array[8]  | —, array[8 to 11]               | miss   |
| array[24] | array[8 to 11], array[24 to 27] | miss   |
| array[9]  | array[24 to 27], array[8 to 11] | hit    |
| array[25] | array[8 to 11], array[24 to 27] | hit    |

2 misses for set 2

# C and cache misses (3)

```
typedef struct {
    int a_value, b_value;
    int other_values[10];
} item;
item items[5];
int a_sum = 0, b_sum = 0;
for (int i = 0; i < 5; ++i)
    a_sum += items[i].a_value;
for (int i = 0; i < 5; ++i)
    b_sum += items[i].b_value;
```

observation: 12 ints in struct, only first two used

equivalent to accessing array[0], array[12], array[24], etc.

...then accessing array[1], array[13], array[25], etc.
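
The equivalence between struct fields and flat int indices can be checked with `offsetof`. This is a sketch under the assumption that the struct has no padding (reasonable here, since every member is an int); the helper names `a_value_index` and `b_value_index` are invented for illustration.

```c
#include <assert.h>
#include <stddef.h>

typedef struct {
    int a_value, b_value;
    int other_values[10];
} item;

/* Every member is an int, so no padding is expected: 12 ints per item. */
static_assert(sizeof(item) == 12 * sizeof(int), "unexpected padding in item");

/* Index items[i].a_value would have if the items array were viewed as
 * one flat array of ints. */
size_t a_value_index(size_t i)
{
    return (i * sizeof(item) + offsetof(item, a_value)) / sizeof(int);
}

/* Same, for items[i].b_value. */
size_t b_value_index(size_t i)
{
    return (i * sizeof(item) + offsetof(item, b_value)) / sizeof(int);
}
```

For instance, `a_value_index(2)` is 24 and `b_value_index(1)` is 13, matching the access pattern described above.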

## C and cache misses (3, rewritten?)

Assume everything but array is kept in registers (and the compiler does not do anything funny) and array at beginning of cache block.

How many *data cache misses* on a 128B two-way set associative cache with 16B cache blocks and LRU replacement?

observation 1: first loop has 5 misses — first accesses to blocks

observation 2: array[0] and array[1], array[12] and array[13], etc. are in the same cache block

## C and cache misses (3, solution)

- ints are 4 bytes $\rightarrow$ array[0 to 3] and array[16 to 19] are in the same cache set
  - 64B = 16 ints stored per way
  - 4 sets total
- accessing array indices 0, 12, 24, 36, 48, 1, 13, 25, 37, 49
- 0 (set 0, array[0 to 3]), 12 (set 3), 24 (set 2), 36 (set 1), 48 (set 0)
  - each set used at most twice, so no replacement needed

```
so access to 1, 13, 25, 37, 49 all hits:
set 0 contains block with array[0 to 3]
set 3 contains block with array[12 to 15]
etc.
```

# C and cache misses (3)

```
typedef struct {
    int a_value, b_value;
    int boring_values[126];
} item;
item items[8]; // 4 KB array
int a_sum = 0, b_sum = 0;
for (int i = 0; i < 8; ++i)
    a_sum += items[i].a_value;
for (int i = 0; i < 8; ++i)
    b_sum += items[i].b_value;
```

Assume everything but items is kept in registers (and the compiler does not do anything funny).

How many *data cache misses* on a 2KB direct-mapped cache with 16B cache blocks?

# C and cache misses (4)

```
typedef struct {
    int a_value, b_value;
    int boring_values[126];
} item;
item items[8]; // 4 KB array
int a_sum = 0, b_sum = 0;
for (int i = 0; i < 8; ++i)
    a_sum += items[i].a_value;
for (int i = 0; i < 8; ++i)
    b_sum += items[i].b_value;
```

Assume everything but items is kept in registers (and the compiler does not do anything funny).

How many *data cache misses* on a 4-way set associative 2KB cache with 16B cache blocks?

2KB direct-mapped cache with 16B blocks:

- set 0: address 0 to 15, (0 to 15) + 2KB, (0 to 15) + 4KB, ...
  - block at 0: array[0] through array[3]
  - block at 0+2KB: array[512] through array[515]
- set 1: address 16 to 31, (16 to 31) + 2KB, (16 to 31) + 4KB, ...
  - block at 16: array[4] through array[7]
  - block at 16+2KB: array[516] through array[519]
- ...
- set 127: address 2032 to 2047, (2032 to 2047) + 2KB, ...
  - block at 2032: array[508] through array[511]
  - block at 2032+2KB: array[1020] through array[1023]
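
The block-to-set mapping above is enough to simulate the two loops over `items` on the direct-mapped cache. A rough sketch, with several assumptions that are not stated in the slides: `items` starts at byte address 0, `a_value` is at offset 0 and `b_value` at offset 4 within each item, and `direct_mapped_misses` is a made-up helper name.

```c
#include <assert.h>

enum { CACHE_BYTES = 2048, BLOCK_BYTES = 16, SETS = CACHE_BYTES / BLOCK_BYTES };

/* Count misses for "read every a_value, then read every b_value" on a
 * direct-mapped cache. Assumes items starts at byte address 0 and that
 * a_value / b_value are at offsets 0 / 4 of each item_bytes-long item. */
int direct_mapped_misses(int count, int item_bytes)
{
    long tags[SETS];
    int valid[SETS] = {0};
    int misses = 0;

    assert(count > 0 && item_bytes >= 8);
    for (int field = 0; field < 2; ++field) {   /* a_value loop, then b_value loop */
        for (int i = 0; i < count; ++i) {
            long address = (long)i * item_bytes + 4 * field;
            long block = address / BLOCK_BYTES;
            int set = (int)(block % SETS);
            long tag = block / SETS;
            if (!valid[set] || tags[set] != tag) {  /* miss: load the block */
                ++misses;
                valid[set] = 1;
                tags[set] = tag;
            }
        }
    }
    return misses;
}
```

Under these assumptions, `direct_mapped_misses(8, 512)` reports a miss on every one of the 16 accesses, because items[i] and items[i+4] are exactly 2KB apart and keep evicting each other. With 256-byte items the whole array would fit, and only the first loop would miss.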

2KB 2-way set associative cache with 16B blocks: block addresses

```
set 0: address 0, 0 + 1KB, 0 + 2KB, ...
block at 0: array[0] through array[3]
block at 0+1KB: array[256] through array[259]
block at 0+2KB: array[512] through array[515]
```

```
set 1: address 16, 16 + 1KB, 16 + 2KB, ...
address 16: array[4] through array[7]
```

...

set 63: address 1008, 1008 + 1KB, 1008 + 2KB, ... address 1008: array[252] through array[255]

## arrays and cache misses (3)

```
int sum = 0; int array[1024]; // 4KB array
for (int i = 8; i < 1016; i += 1) {
    int local_sum = 0;
    for (int j = i - 8; j < i + 8; j += 1) {
        local_sum += array[i] * (j - i);
    }
    sum += (local_sum - array[i]);
}
```

Assume everything but array is kept in registers (and the compiler does not do anything funny).

How many *data cache misses* on initially empty 2KB direct-mapped cache with 16B cache blocks?

## **Tag-Index-Offset exercise**

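| symbol                | meaning                           |
|-----------------------|-----------------------------------|
| $m$                   | memory address bits (Y86-64: 64)  |
| $E$                   | number of blocks per set ("ways") |
| $S = 2^s$             | number of sets                    |
| $s$                   | (set) index bits                  |
| $B = 2^b$             | block size                        |
| $b$                   | (block) offset bits               |
| $t = m - (s + b)$     | tag bits                          |
| $C = B \times S \times E$ | cache size (excluding metadata) |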

My desktop:

- L1 Data Cache: 32 KB, 8 blocks/set, 64 byte blocks
- L2 Cache: 256 KB, 4 blocks/set, 64 byte blocks
- L3 Cache: 8 MB, 16 blocks/set, 64 byte blocks

Divide the address 0x34567 into tag, index, offset for each cache.

| quantity           | value for L1                                                    |
|--------------------|-----------------------------------------------------------------|
| block size (given) | $B = 64$ bytes                                                  |
|                    | $B = 2^b$ (b: block offset bits)                                |
| block offset bits  | $b = 6$                                                         |
| blocks/set (given) | $E = 8$                                                         |
| cache size (given) | $C = 32\text{KB} = E \times B \times S$                         |
|                    | $S = \frac{C}{B \times E}$ (S: number of sets)                  |
| number of sets     | $S = \frac{32\text{KB}}{64\text{ bytes} \times 8} = 64$         |
|                    | $S = 2^s$ (s: set index bits)                                   |
| set index bits     | $s = \log_2(64) = 6$                                            |

## **T-I-O results**

|                   | L1         | L2   | L3   |
|-------------------|------------|------|------|
| sets              | 64         | 1024 | 8192 |
| block offset bits | 6          | 6    | 6    |
| set index bits    | 6          | 10   | 13   |
| tag bits          | (the rest) |      |      |
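
With the bit counts from the table, splitting an address is a matter of shifting and masking. A minimal sketch (the type `address_parts` and function `split_address` are names invented here, not part of the exercise):

```c
#include <assert.h>
#include <stdint.h>

typedef struct {
    uint64_t tag, index, offset;
} address_parts;

/* Split an address given the number of set index bits (s) and block
 * offset bits (b); every bit above those is part of the tag. */
address_parts split_address(uint64_t address, unsigned s, unsigned b)
{
    assert(s + b < 64);
    address_parts parts;
    parts.offset = address & ((UINT64_C(1) << b) - 1);        /* low b bits */
    parts.index = (address >> b) & ((UINT64_C(1) << s) - 1);  /* next s bits */
    parts.tag = address >> (s + b);                           /* the rest */
    return parts;
}
```

For the address 0x34567 the offset is 0x27 in all three caches. The L1 split (s = 6) gives index 0x15 and tag 0x34, the L2 split (s = 10) gives index 0x115 and tag 0x3, and the L3 split (s = 13) gives index 0xD15 and tag 0x0.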















## cache operation (associative)




### backup slides — cache performance

## cache miss types

common to categorize misses: roughly "cause" of miss, assuming cache block size fixed

- *compulsory* (or *cold*): first time accessing something; adding more sets or blocks/set wouldn't change this
- *capacity*: cache was not big enough
- *coherence*: from sync'ing cache with other caches; only an issue with multiple cores

## making any cache look bad

1. access enough blocks to fill the cache
2. access an additional block, replacing something
3. access last block replaced
4. access last block replaced
5. access last block replaced

...

but — typical real programs have locality

## cache optimizations

(assuming typical locality + keeping cache size constant if possible...)

effect on miss rate, hit time, miss penalty:

- increase cache size: better, worse
- increase associativity: better, worse?, worse
- increase block size: depends, worse, worse
- add secondary cache: better
- write-allocate: better
- writeback: ?, ?
- LRU replacement: better, worse?
- prefetching: better

prefetching = guess what program will use, access in advance

average time = hit time + miss rate  $\times$  miss penalty

## cache optimizations by miss type

(assuming other listed parameters remain constant) conflict compulsory capacity fewer misses fewer misses increase cache size fewer misses increase associativity increase block size more misses? more misses? fewer misses LRU replacement fewer misses prefetching fewer misses

#### average memory access time

$$\begin{aligned} \text{AMAT} &= \text{hit time} + \text{miss penalty} \times \text{miss rate} \\ \text{or AMAT} &= \text{hit time} \times \text{hit rate} + \text{miss time} \times \text{miss rate} \end{aligned}$$

effective speed of memory

## AMAT exercise (1)

- 90% cache hit rate
- hit time is 2 cycles
- 30 cycle miss penalty
- what is the average memory access time?
- 5 cycles

suppose we could increase the hit rate by increasing the cache's size, but doing so would increase the hit time to 3 cycles

how much do we have to increase the hit rate for this to not increase AMAT?
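
A worked version of this exercise, using AMAT = hit time + miss rate × miss penalty and assuming the 30 cycle miss penalty stays the same for the larger cache (the symbol $m$ for the new miss rate is introduced here only for illustration):

$$\begin{aligned} \text{AMAT}_{\text{old}} &= 2 + 0.10 \times 30 = 5 \text{ cycles} \\ \text{AMAT}_{\text{new}} &= 3 + m \times 30 \le 5 \;\Longrightarrow\; m \le \tfrac{2}{30} \approx 6.7\% \end{aligned}$$

So the hit rate would need to rise from 90% to roughly 93.3% or more, about 3.3 percentage points, for the larger cache not to increase AMAT.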

### exercise: AMAT and multi-level caches

- suppose we have an L1 cache with
  - 3 cycle hit time
  - 90% hit rate
- and an L2 cache with
  - 10 cycle hit time
  - 80% hit rate (for accesses that make it this far)
  - (assume all accesses come via this L1)
- and main memory has a 100 cycle access time
- assume when there's a cache miss, the next level access starts after the hit time
  - e.g. an access that misses in L1 and hits in L2 will take 10+3 cycles

what is the average memory access time for the L1 cache?
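
One way to organize the calculation is to treat the L2 and main memory together as the L1's miss penalty. The sketch below assumes exactly the timing model stated in the exercise (each level's access starts after the previous level's hit time); `amat_two_level` is a hypothetical helper, and rates are fractions between 0 and 1.

```c
#include <assert.h>

/* AMAT seen by the L1 when it is backed by an L2 and main memory.
 * Assumes each level's access starts only after the previous level's
 * hit time has elapsed, as stated in the exercise. */
double amat_two_level(double l1_hit_time, double l1_hit_rate,
                      double l2_hit_time, double l2_hit_rate,
                      double memory_time)
{
    assert(l1_hit_rate >= 0.0 && l1_hit_rate <= 1.0);
    assert(l2_hit_rate >= 0.0 && l2_hit_rate <= 1.0);
    /* average cost of an L1 miss: the L2 lookup, plus memory on an L2 miss */
    double l1_miss_penalty = l2_hit_time + (1.0 - l2_hit_rate) * memory_time;
    return l1_hit_time + (1.0 - l1_hit_rate) * l1_miss_penalty;
}
```

With the numbers above, the L1 miss penalty is 10 + 0.2 × 100 = 30 cycles, so the result is 3 + 0.1 × 30 = 6 cycles.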

### approximate miss analysis

very tedious to precisely count cache misses; even more tedious when we take advanced cache optimizations into account

instead, approximations:

- good or bad temporal/spatial locality
  - good temporal locality: value stays in cache
  - good spatial locality: use all parts of cache block
- with nested loops: what does inner loop use?
  - intuition: values used in inner loop loaded into cache once (that is, once each time the inner loop is run)
  - ...if they can all fit in the cache

## locality exercise (1)

exercise: which has better temporal locality in A? in B? in C? how about spatial locality?

## exercise: miss estimating (1)

Assume: 4 array elements per block, N very large, nothing in cache at beginning.

Example: N/4 estimated misses for A accesses:

- A[i] should always be a hit on all but the first iteration of the inner-most loop
- first iteration: A[i] should be a hit about 3/4 of the time (same block as A[i-1] that often)

Exercise: estimate # of misses for B, C

### a note on matrix storage

 $A - N \times N \text{ matrix}$ 

represent as array

makes dynamic sizes easier:

float A\_2d\_array[N][N];
float \*A\_flat = malloc(N \* N \* sizeof(float));

A\_flat[i \* N + j] === A\_2d\_array[i][j]
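
A small sketch of the flat representation, assuming row-major storage as described here. The helper names `matrix_alloc` and `matrix_index` are made up for illustration, and `calloc` is used so the matrix starts out zeroed.

```c
#include <assert.h>
#include <stdlib.h>

/* Allocate an n x n matrix of floats as one flat, row-major array.
 * Returns NULL on allocation failure; the caller frees the result. */
float *matrix_alloc(size_t n)
{
    return calloc(n * n, sizeof(float));
}

/* Position of row i, column j within the flat array. */
size_t matrix_index(size_t n, size_t i, size_t j)
{
    assert(i < n && j < n);
    return i * n + j;
}
```

Usage would look like `A[matrix_index(N, i, j)] = value;`, which touches the same element that `A_2d_array[i][j]` would in the fixed-size version.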

## convention re: rows/columns

going to call the first index rows

```
A_{i,j} is A row i, column j
```

rows are stored together

this is an arbitrary choice

| array[0*5 + 0] | array[0*5 + 1] | array[0*5 + 2] | array[0*5 + 3] | array[0*5 + 4] |
|----------------|----------------|----------------|----------------|----------------|
| array[1*5 + 0] | array[1*5 + 1] | array[1*5 + 2] | array[1*5 + 3] | array[1*5 + 4] |
| array[2*5 + 0] | array[2*5 + 1] | array[2*5 + 2] | array[2*5 + 3] | array[2*5 + 4] |
| array[3*5 + 0] | array[3*5 + 1] | array[3*5 + 2] | array[3*5 + 3] | array[3*5 + 4] |
| array[4*5 + 0] | array[4*5 + 1] | array[4*5 + 2] | array[4*5 + 3] | array[4*5 + 4] |

if the array starts on a cache block boundary: first cache block = first 4 elements, all together in one row!

second cache block: 1 element from row 0, 3 from row 1

generally: cache blocks contain data from 1 or 2 rows  $\rightarrow$  better performance from reusing rows

$$C_{ij} = \sum_{k=1}^{n} A_{ik} \times B_{kj}$$

/\* version 1: inner loop is k, middle is j \*/  
for (int i = 0; i < N; ++i)  
 for (int j = 0; j < N; ++j)  
 for (int k = 0; k < N; ++k)  
 C[i \* N + j] += A[i \* N + k] \* B[k \* N + j];

/\* version 2: outer loop is k, middle is i \*/  
for (int k = 0; k < N; ++k)  
 for (int i = 0; i < N; ++i)  
 for (int j = 0; j < N; ++j)  
 C[i \* N + j] += A[i \* N + k] \* B[k \* N + j];

## loop orders and locality

loop body: $C_{ij} \mathrel{+}= A_{ik} B_{kj}$

kij order:  $C_{ij}$ ,  $B_{kj}$  have spatial locality

kij order:  $A_{ik}$  has temporal locality

... better than ...

ijk order:  $A_{ik}$  has spatial locality

ijk order:  $C_{ij}$  has temporal locality
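One way to read the locality claims above is to look at how far each array index moves between consecutive iterations of the innermost loop: a stride of 0 means the same element is reused (temporal locality), a stride of 1 means adjacent elements (spatial locality), and a stride of N means neither when N is large. The sketch below computes those strides from the index expressions in the loop body; the struct and function names are hypothetical.

```c
#include <assert.h>

/* Index expressions from the loop body C[i*N+j] += A[i*N+k] * B[k*N+j]. */
static int index_a(int i, int k, int N) { return i * N + k; }
static int index_b(int k, int j, int N) { return k * N + j; }
static int index_c(int i, int j, int N) { return i * N + j; }

struct strides { int a, b, c; };

/* ijk order: innermost loop variable is k. */
struct strides inner_strides_ijk(int N) {
    assert(N > 0);
    struct strides s;
    s.a = index_a(0, 1, N) - index_a(0, 0, N);
    s.b = index_b(1, 0, N) - index_b(0, 0, N);
    s.c = index_c(0, 0, N) - index_c(0, 0, N);  /* does not depend on k */
    return s;
}

/* kij order: innermost loop variable is j. */
struct strides inner_strides_kij(int N) {
    assert(N > 0);
    struct strides s;
    s.a = index_a(0, 0, N) - index_a(0, 0, N);  /* does not depend on j */
    s.b = index_b(0, 1, N) - index_b(0, 0, N);
    s.c = index_c(0, 1, N) - index_c(0, 0, N);
    return s;
}
```

In ijk order only A gets stride 1 and B gets stride N. In kij order both B and C get stride 1 and A gets stride 0, which matches the "better than" comparison above.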

#### which is better?

$$C_{ij} = \sum_{k=1}^{n} A_{ik} \times B_{kj}$$

exercise: Which version has better spatial/temporal locality for...






















$$C_{ij} = \sum_{k=1}^{n} A_{ik} \times B_{kj}$$

/\* version 1: inner loop is k, middle is j \*/  
for (int i = 0; i < N; ++i)  
 for (int j = 0; j < N; ++j)  
 for (int k = 0; k < N; ++k)  
 C[i \* N + j] += A[i \* N + k] \* B[k \* N + j];

/\* version 2: outer loop is k, middle is i \*/  
for (int k = 0; k < N; ++k)  
 for (int i = 0; i < N; ++i)  
 for (int j = 0; j < N; ++j)  
 C[i \* N + j] += A[i \* N + k] \* B[k \* N + j];

## performance (with A=B)



## alternate view 1: cycles/instruction



#### alternate view 2: cycles/operation



### counting misses: version 1

if N really large

assumption: can't get close to storing N values in cache at once

- for A: about $N \div \text{block size}$ misses per k-loop; total misses: $N^3 \div \text{block size}$
- for B: about $N$ misses per k-loop; total misses: $N^3$
- for C: about $1 \div \text{block size}$ misses per k-loop; total misses: $N^2 \div \text{block size}$

## counting misses: version 2

for A: about 1 miss per j-loop; total misses: $N^2$

for B: about $N \div \text{block size}$ misses per j-loop; total misses: $N^3 \div \text{block size}$

for C: about $N \div \text{block size}$ misses per j-loop; total misses: $N^3 \div \text{block size}$
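The estimates for the two versions can be put side by side as plain arithmetic. The sketch below just evaluates the formulas from the two "counting misses" slides under their stated assumption (N large, cache nowhere near able to hold N values); `block_elems` is the block size measured in array elements, and the struct and function names are hypothetical.

```c
#include <assert.h>

struct miss_estimate { double a, b, c; };

/* Version 1 (ijk order): the k-loop runs N^2 times. */
struct miss_estimate estimate_ijk(double n, double block_elems) {
    assert(n > 0 && block_elems > 0);
    struct miss_estimate m;
    m.a = n * n * n / block_elems;  /* N / block size per k-loop */
    m.b = n * n * n;                /* N per k-loop */
    m.c = n * n / block_elems;      /* 1 / block size per k-loop */
    return m;
}

/* Version 2 (kij order): the j-loop runs N^2 times. */
struct miss_estimate estimate_kij(double n, double block_elems) {
    assert(n > 0 && block_elems > 0);
    struct miss_estimate m;
    m.a = n * n;                    /* 1 per j-loop */
    m.b = n * n * n / block_elems;  /* N / block size per j-loop */
    m.c = n * n * n / block_elems;  /* N / block size per j-loop */
    return m;
}
```

With these formulas the dominant term for version 1 is the $N^3$ misses on B, while version 2 has no term larger than $N^3 \div \text{block size}$.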

## exercise: miss estimating (2)

assuming: 4 elements per block

assuming: cache not close to big enough to hold 1K elements

estimate: approximately how many misses for A, B?

## L1 misses (with A=B)



# L1 miss detail (1)



# L1 miss detail (2)



#### addresses

B[k\*114+j] is at 10 0000 0000 0100  
B[k\*114+j+1] is at 10 0000 0000 1000  
B[(k+2)\*114+j] is at 10 0011 1001 0100  
B[(k+3)\*114+j] is at 10 0101 0101 1100  
...  
B[(k+9)\*114+j] is at 11 0000 0000 1100

test system L1 cache: 6 index bits, 6 block offset bits
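To see why these addresses matter, the set index can be extracted from each one. The sketch below assumes the cache parameters stated above (6 index bits, 6 block offset bits) and 4-byte elements (inferred from B[k\*114+j+1] being 4 bytes after B[k\*114+j]); the function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

enum { OFFSET_BITS = 6, INDEX_BITS = 6 };

/* Set index: the INDEX_BITS bits just above the block offset. */
unsigned set_index(uint64_t address) {
    return (unsigned)((address >> OFFSET_BITS) & ((1u << INDEX_BITS) - 1));
}

/* Address of the element `rows` rows and `cols` columns after the
   element at `base`, assuming 4-byte elements (an assumption). */
uint64_t element_address(uint64_t base, int rows, int cols, int row_len) {
    assert(rows >= 0 && cols >= 0 && row_len > 0);
    return base + 4u * ((uint64_t)rows * (uint64_t)row_len + (uint64_t)cols);
}
```

Taking `0x2004` (binary 10 0000 0000 0100) as the address of B[k\*114+j], `element_address(0x2004, 9, 0, 114)` gives `0x300C`, and both addresses have set index 0, so they compete for the same set.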

#### conflict misses

powers of two - lower order bits unchanged

B[k\*93+j] and B[(k+11)\*93+j]: 1023 elements apart (4092 bytes; 63.9 cache blocks)

64 sets in L1 cache: usually maps to same set

B[k\*93+(j+1)] will not be cached (next *i* loop)

even if in same block as B[k\*93+j]

how to fix? improve spatial locality (maybe even if it requires copying)
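The "usually maps to same set" statement can be checked by brute force over the element positions inside one cache block. This is a sketch assuming 64-byte blocks, 64 sets, and 4-byte elements; `same_set_count` is a hypothetical helper.

```c
#include <assert.h>
#include <stdint.h>

static unsigned set_of(uint64_t address) {
    return (unsigned)((address >> 6) & 63u);  /* 6 offset bits, 6 index bits */
}

/* For each 4-byte element position in the 64-byte block starting at
   `base`, check whether the element `distance_bytes` later falls in
   the same cache set. Returns how many of the 16 positions do. */
int same_set_count(uint64_t base, unsigned distance_bytes) {
    assert(base % 64 == 0);
    int count = 0;
    for (unsigned offset = 0; offset < 64; offset += 4) {
        uint64_t first = base + offset;
        uint64_t second = first + distance_bytes;
        if (set_of(first) == set_of(second))
            ++count;
    }
    return count;
}
```

For the 4092-byte distance between B[k\*93+j] and B[(k+11)\*93+j], 15 of the 16 positions land in the same set; a distance of exactly 4096 bytes would make it all 16.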

# locality exercise (2)

exercise: which has better temporal locality in A? in B? in C? how about spatial locality?

#### a transformation

split the loop over k — should be exactly the same (assuming even N)
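The code for this slide did not survive extraction, so the following is only a sketch of what splitting the k loop could look like, assuming the starting point is the kij-order loop (version 2). The function names are hypothetical. Each C element still receives its contributions in increasing k order, which is why the results should match exactly.

```c
#include <assert.h>

/* kij order (version 2): C += A * B for N x N row-major matrices. */
void matmul_kij(const float *A, const float *B, float *C, int N) {
    for (int k = 0; k < N; ++k)
        for (int i = 0; i < N; ++i)
            for (int j = 0; j < N; ++j)
                C[i*N+j] += A[i*N+k] * B[k*N+j];
}

/* Same computation with the k loop split into steps of 2:
   each outer iteration handles k = kk, then k = kk + 1. */
void matmul_split_k(const float *A, const float *B, float *C, int N) {
    assert(N % 2 == 0);  /* the slide assumes even N */
    for (int kk = 0; kk < N; kk += 2) {
        for (int i = 0; i < N; ++i)
            for (int j = 0; j < N; ++j)
                C[i*N+j] += A[i*N+kk+0] * B[(kk+0)*N+j];
        for (int i = 0; i < N; ++i)
            for (int j = 0; j < N; ++j)
                C[i*N+j] += A[i*N+kk+1] * B[(kk+1)*N+j];
    }
}
```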


# simple blocking

now reorder split loop — same calculations

now handle  $B_{ij}$  for k+1 right after  $B_{ij}$  for k

(previously:  $B_{i,j+1}$  for k right after  $B_{ij}$  for k)

```
for (int kk = 0; kk < N; kk += 2) {
  for (int i = 0; i < N; ++i) {
    for (int j = 0; j < N; ++j) {
      /* process a "block" of 2 k values: */
      C[i*N+j] += A[i*N+kk+0] * B[(kk+0)*N+j];
      C[i*N+j] += A[i*N+kk+1] * B[(kk+1)*N+j];
    }
  }
}
```

Temporal locality in $C_{ij}$s

More spatial locality in  $A_{ik}$ 

Still have good spatial locality in  $B_{kj}$ ,  $C_{ij}$ 

# counting misses for A (1)

access pattern for A:

A[0\*N+0], A[0\*N+1], A[0\*N+0], A[0\*N+1] ... (repeats N times)  
A[1\*N+0], A[1\*N+1], A[1\*N+0], A[1\*N+1] ... (repeats N times)  
...  
A[(N-1)\*N+0], A[(N-1)\*N+1], A[(N-1)\*N+0], A[(N-1)\*N+1] ...  
A[0\*N+2], A[0\*N+3], A[0\*N+2], A[0\*N+3] ...  
...

likely cache misses: only first iterations of $j$ loop

how many cache misses per iteration? usually one: A[0\*N+0] and A[0\*N+1] usually in same cache block

about $\frac{N}{2} \cdot N$ misses total

access pattern for B:

B[0\*N+0], B[1\*N+0], ... B[0\*N+(N-1)], B[1\*N+(N-1)]  
B[2\*N+0], B[3\*N+0], ... B[2\*N+(N-1)], B[3\*N+(N-1)]  
B[4\*N+0], B[5\*N+0], ... B[4\*N+(N-1)], B[5\*N+(N-1)]  
...  
B[0\*N+0], B[1\*N+0], ... B[0\*N+(N-1)], B[1\*N+(N-1)]  
...

likely cache misses: any access, each time

how many cache misses per iteration? equal to # cache blocks in 2 rows

about $\frac{N}{2} \cdot N \cdot \frac{2N}{\text{block size}} = N^3 \div \text{block size}$ misses

# simple blocking – counting misses

for (int kk = 0; kk < N; kk += 2)  
 for (int i = 0; i < N; i += 1)  
 for (int j = 0; j < N; ++j) {  
 C[i\*N+j] += A[i\*N+kk+0] \* B[(kk+0)\*N+j];  
 C[i\*N+j] += A[i\*N+kk+1] \* B[(kk+1)\*N+j];  
 }

$\frac{N}{2} \cdot N$ j-loop executions and (assuming N large):

about 1 miss from A per j-loop: $N^2/2$ total misses (before blocking: $N^2$)

about $2N \div \text{block size}$ misses from B per j-loop: $N^3 \div \text{block size}$ total misses (same as before blocking)

about $N \div \text{block size}$ misses from C per j-loop: $N^3 \div (2 \cdot \text{block size})$ total misses (before blocking: $N^3 \div \text{block size}$)

#### improvement in read misses



# simple blocking (2)

same thing for i in addition to k?

# simple blocking — locality

for (int k = 0; k < N; k += 2) {  
 for (int i = 0; i < N; i += 2) {  
 /\* load a block around Aik \*/  
 for (int j = 0; j < N; ++j) {  
 /\* process a "block": \*/  
 $C_{i+0,j}$ += $A_{i+0,k+0}$ \* $B_{k+0,j}$  
 $C_{i+0,j}$ += $A_{i+0,k+1}$ \* $B_{k+1,j}$  
 $C_{i+1,j}$ += $A_{i+1,k+0}$ \* $B_{k+0,j}$  
 $C_{i+1,j}$ += $A_{i+1,k+1}$ \* $B_{k+1,j}$  
 }  
 }  
}

now: more temporal locality in B

previously: access $B_{kj}$, then don't use it again for a long time

# simple blocking — counting misses for A

for (int k = 0; k < N; k += 2)  
 for (int i = 0; i < N; i += 2)  
 for (int j = 0; j < N; ++j) {  
 $C_{i+0,j}$ += $A_{i+0,k+0}$ \* $B_{k+0,j}$  
 $C_{i+0,j}$ += $A_{i+0,k+1}$ \* $B_{k+1,j}$  
 $C_{i+1,j}$ += $A_{i+1,k+0}$ \* $B_{k+0,j}$  
 $C_{i+1,j}$ += $A_{i+1,k+1}$ \* $B_{k+1,j}$  
 }

$\frac{N}{2} \cdot \frac{N}{2}$ executions of j loop

likely 2 misses per j loop with A (2 cache blocks)

total misses: $\frac{N^2}{2}$ (same as only blocking in K)

# simple blocking — counting misses for B

for (int k = 0; k < N; k += 2)  
 for (int i = 0; i < N; i += 2)  
 for (int j = 0; j < N; ++j) {  
 $C_{i+0,j}$ += $A_{i+0,k+0}$ \* $B_{k+0,j}$  
 $C_{i+0,j}$ += $A_{i+0,k+1}$ \* $B_{k+1,j}$  
 $C_{i+1,j}$ += $A_{i+1,k+0}$ \* $B_{k+0,j}$  
 $C_{i+1,j}$ += $A_{i+1,k+1}$ \* $B_{k+1,j}$  
 }

$\frac{N}{2} \cdot \frac{N}{2}$ executions of j loop ($N$ iterations each)

likely $2 \div \text{block size}$ misses per iteration with B

total misses: $\frac{N^3}{2 \cdot \text{block size}}$ (before: $\frac{N^3}{\text{block size}}$)

# simple blocking — counting misses for C

for (int k = 0; k < N; k += 2)  
 for (int i = 0; i < N; i += 2)  
 for (int j = 0; j < N; ++j) {  
 $C_{i+0,j}$ += $A_{i+0,k+0}$ \* $B_{k+0,j}$  
 $C_{i+0,j}$ += $A_{i+0,k+1}$ \* $B_{k+1,j}$  
 $C_{i+1,j}$ += $A_{i+1,k+0}$ \* $B_{k+0,j}$  
 $C_{i+1,j}$ += $A_{i+1,k+1}$ \* $B_{k+1,j}$  
 }

$\frac{N}{2} \cdot \frac{N}{2}$ executions of j loop ($N$ iterations each)

likely $\frac{2}{\text{block size}}$ misses per iteration with C

total misses: $\frac{N^3}{2 \cdot \text{block size}}$ (same as blocking only in K)

# simple blocking — counting misses (total)

for (int k = 0; k < N; k += 2)  
 for (int i = 0; i < N; i += 2)  
 for (int j = 0; j < N; ++j) {  
 $C_{i+0,j}$ += $A_{i+0,k+0}$ \* $B_{k+0,j}$  
 $C_{i+0,j}$ += $A_{i+0,k+1}$ \* $B_{k+1,j}$  
 $C_{i+1,j}$ += $A_{i+1,k+0}$ \* $B_{k+0,j}$  
 $C_{i+1,j}$ += $A_{i+1,k+1}$ \* $B_{k+1,j}$  
 }

before (blocking only in K):  
A: $\frac{N^2}{2}$; B: $\frac{N^3}{1 \cdot \text{block size}}$; C: $\frac{N^3}{2 \cdot \text{block size}}$

after:  
A: $\frac{N^2}{2}$; B: $\frac{N^3}{2 \cdot \text{block size}}$; C: $\frac{N^3}{2 \cdot \text{block size}}$
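Written out with explicit array indexing instead of subscripts, the loop blocked in both k and i might look like the sketch below. It assumes N is even and row-major N x N matrices, as in the earlier slides; the function name `matmul_block_ki` is hypothetical.

```c
#include <assert.h>

/* C += A * B, blocked by 2 in both k and i. Each pass over j reuses
   the 2x2 block of A and the two B rows k and k+1 for two C rows. */
void matmul_block_ki(const float *A, const float *B, float *C, int N) {
    assert(N % 2 == 0);  /* assumption: even N, so no cleanup loop needed */
    for (int k = 0; k < N; k += 2)
        for (int i = 0; i < N; i += 2)
            for (int j = 0; j < N; ++j) {
                C[(i+0)*N+j] += A[(i+0)*N+k+0] * B[(k+0)*N+j];
                C[(i+0)*N+j] += A[(i+0)*N+k+1] * B[(k+1)*N+j];
                C[(i+1)*N+j] += A[(i+1)*N+k+0] * B[(k+0)*N+j];
                C[(i+1)*N+j] += A[(i+1)*N+k+1] * B[(k+1)*N+j];
            }
}
```

Each loaded value of B is now used for two rows of C before the loop moves on, which is where the halving of the B miss estimate comes from.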

#### generalizing: divide and conquer

```
void partial_matmul(float *A, float *B, float *C,
                    int startI, int endI, ...) {
  for (int i = startI; i < endI; ++i) {
    for (int j = startJ; j < endJ; ++j) {
      for (int k = startK; k < endK; ++k) {
        ...
}
void matrix_multiply(float *A, float *B, float *C, int N) {
  for (int ii = 0; ii < N; ii += BLOCK_I)
    for (int jj = 0; jj < N; jj += BLOCK_J)
      for (int kk = 0; kk < N; kk += BLOCK_K)
         ...
         /* do everything for segment of A, B, C
            that fits in cache! */
         partial_matmul(A, B, C,
```
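The outline above elides the loop body and most of the argument list. A complete version could look like the following sketch. The extra `N` parameter, the start/end parameters for J and K, the clamping with `min_int`, and the block sizes of 4 are all assumptions made for illustration; real block sizes would be chosen to fit the cache.

```c
#include <assert.h>

/* Hypothetical block sizes; in practice chosen based on cache size. */
enum { BLOCK_I = 4, BLOCK_J = 4, BLOCK_K = 4 };

static int min_int(int a, int b) { return a < b ? a : b; }

/* Multiply one "matrix block": rows [startI, endI) of A and C,
   columns [startJ, endJ) of B and C, and k in [startK, endK). */
void partial_matmul(const float *A, const float *B, float *C, int N,
                    int startI, int endI, int startJ, int endJ,
                    int startK, int endK) {
    for (int i = startI; i < endI; ++i)
        for (int j = startJ; j < endJ; ++j)
            for (int k = startK; k < endK; ++k)
                C[i*N+j] += A[i*N+k] * B[k*N+j];
}

/* C += A * B for N x N row-major matrices, one matrix block at a time. */
void matrix_multiply(const float *A, const float *B, float *C, int N) {
    assert(N > 0);
    for (int ii = 0; ii < N; ii += BLOCK_I)
        for (int jj = 0; jj < N; jj += BLOCK_J)
            for (int kk = 0; kk < N; kk += BLOCK_K)
                partial_matmul(A, B, C, N,
                               ii, min_int(ii + BLOCK_I, N),
                               jj, min_int(jj + BLOCK_J, N),
                               kk, min_int(kk + BLOCK_K, N));
}
```

Clamping the end indices lets this sketch handle values of N that are not a multiple of the block size.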




inner loops work on "matrix block" of A, B, C, rather than rows of some, little blocks of others

blocks fit into cache (b/c we choose I, K, J)

now (versus loop ordering example): some spatial locality in A, B, and C; some temporal locality in A, B, and C

$C_{ij}$ calculation uses strips from $A$, $B$: $K$ calculations for one cache miss, good temporal locality!

$A_{ik}$ used with entire strip of $B$: $J$ calculations for one cache miss, good temporal locality!

(approx.) $KIJ$ fully cached calculations for $KI + IJ + KJ$ values that need to be loaded per "matrix block" (assuming everything stays in cache)

### cache blocking efficiency

for each of $N^3/IJK$ matrix blocks:

load $I \times K$ elements of $A_{ik}$:  
≈ $IK \div \text{block size}$ misses per matrix block  
≈ $N^3/(J \cdot \text{block size})$ misses total

load $K \times J$ elements of $B_{kj}$: ≈ $N^3/(I \cdot \text{block size})$ misses total

load $I \times J$ elements of $C_{ij}$: ≈ $N^3/(K \cdot \text{block size})$ misses total

bigger blocks: more work per load!

catch: $IK + KJ + IJ$ elements must fit in cache, otherwise estimates above don't work
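The three totals above can be summed to compare candidate block sizes. This is only a sketch of that arithmetic, valid under the slide's assumption that each matrix block stays in cache; `blocked_misses` is a hypothetical name and `block_elems` is the cache block size in elements.

```c
#include <assert.h>

/* Estimated total misses for cache-blocked matrix multiply with
   matrix blocks of I x K (A), K x J (B), and I x J (C) elements. */
double blocked_misses(double n, double bi, double bj, double bk,
                      double block_elems) {
    assert(n > 0 && bi > 0 && bj > 0 && bk > 0 && block_elems > 0);
    double n3 = n * n * n;
    double from_a = n3 / (bj * block_elems);
    double from_b = n3 / (bi * block_elems);
    double from_c = n3 / (bk * block_elems);
    return from_a + from_b + from_c;
}
```

Doubling all three block dimensions halves the estimate, which is the "bigger blocks, more work per load" point, up to the limit where the blocks no longer fit.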

### cache blocking rule of thumb

fill most of the cache with useful data

and do as much work as possible from that

example: my desktop 32KB L1 cache

I = J = K = 48 uses  $48^2 \times 3$  elements, or 27KB.

assumption: conflict misses aren't important
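The 27KB figure can be reproduced with a little arithmetic. The sketch below assumes 4-byte elements (not stated on this slide) and square blocks with I = J = K; both function names are hypothetical.

```c
#include <assert.h>

/* Bytes needed for three square b x b matrix blocks (one each from
   A, B, and C) with elem_bytes bytes per element. */
long footprint_bytes(long b, long elem_bytes) {
    assert(b > 0 && elem_bytes > 0);
    return 3 * b * b * elem_bytes;
}

/* Largest b for which the three b x b blocks fit in cache_bytes. */
long largest_square_block(long cache_bytes, long elem_bytes) {
    assert(cache_bytes > 0 && elem_bytes > 0);
    long b = 0;
    while (3 * (b + 1) * (b + 1) * elem_bytes <= cache_bytes)
        ++b;
    return b;
}
```

`footprint_bytes(48, 4)` is 27648 bytes, which is exactly 27KB. The upper bound for a 32KB cache under these assumptions is 52, so 48 leaves part of the cache free for other data.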

### systematic approach

values from $A_{ik}$ used N times per load

values from $B_{kj}$ used 1 time per load, but good spatial locality, so cache block of $B_{kj}$ together

values from $C_{ij}$ used 1 time per load, but good spatial locality, so cache block of $C_{ij}$ together

### exercise: miss estimating (3)

assuming: 4 elements per block

assuming: cache not close to big enough to hold 1K elements, but big enough to hold 500 or so

estimate: approximately how many misses for A, B?

### loop ordering compromises

loop ordering forces compromises:

for k: for i: for j: c[i,j] += a[i,k] \* b[j,k]

perfect temporal locality in a[i,k]

bad temporal locality for c[i,j], b[j,k]

perfect spatial locality in c[i,j]

bad spatial locality in b[j,k], a[i,k]

cache blocking: work on blocks rather than rows/columns

have some temporal, spatial locality in everything

### cache blocking pattern

no perfect loop order? work on rectangular matrix blocks

size amount used in inner loops based on cache size

in practice:

test performance to determine 'size' of blocks

### backup slides

### cache organization and miss rate

depends on program; one example:

SPEC CPU2000 benchmarks, 64B block size

LRU replacement policies

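data cache miss rates:

| Cache size | direct-mapped | 2-way   | 8-way   | fully assoc. |
|------------|---------------|---------|---------|--------------|
| 1KB        | 8.63%         | 6.97%   | 5.63%   | 5.34%        |
| 2KB        | 5.71%         | 4.23%   | 3.30%   | 3.05%        |
| 4KB        | 3.70%         | 2.60%   | 2.03%   | 1.90%        |
| 16KB       | 1.59%         | 0.86%   | 0.56%   | 0.50%        |
| 64KB       | 0.66%         | 0.37%   | 0.10%   | 0.001%       |
| 128KB      | 0.27%         | 0.001%  | 0.0006% | 0.0006%      |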

Data: Cantin and Hill, "Cache Performance for SPEC CPU2000 Benchmarks" http://research.cs.wisc.edu/multifacet/misc/spec2000cache-data/

## exercise (1)

initial cache: 64-byte blocks, 64 sets, 8 ways/set

If we leave the other parameters listed above unchanged, which will probably reduce the number of capacity misses in a typical program? (Multiple may be correct.)

- A. quadrupling the block size (256-byte blocks, 64 sets, 8 ways/set)
- B. quadrupling the number of sets
- C. quadrupling the number of ways/set

# exercise (2)

initial cache: 64-byte blocks, 8 ways/set, 64KB cache

If we leave the other parameters listed above unchanged, which will probably reduce the number of capacity misses in a typical program? (Multiple may be correct.)

- A. quadrupling the block size (256-byte block, 8 ways/set, 64KB cache)
- B. quadrupling the number of ways/set
- C. quadrupling the cache size

# exercise (3)

initial cache: 64-byte blocks, 8 ways/set, 64KB cache

If we leave the other parameters listed above unchanged, which will probably reduce the number of conflict misses in a typical program? (Multiple may be correct.)

- A. quadrupling the block size (256-byte block, 8 ways/set, 64KB cache)
- B. quadrupling the number of ways/set
- C. quadrupling the cache size

### prefetching

seems like we can't really improve cold misses...

have to have a miss to bring value into the cache?

solution: don't require miss: 'prefetch' the value before it's accessed

remaining problem: how do we know what to fetch?

#### common access patterns

suppose recently accessed 16B cache blocks are at: 0x48010, 0x48020, 0x48030, 0x48040

guess what's accessed next

common pattern with instruction fetches and array accesses

### prefetching idea

look for sequential accesses

bring in guess at next-to-be-accessed value

if right: no cache miss (even if never accessed before)

if wrong: possibly evicted something else — could cause more misses

fortunately, sequential access guesses almost always right
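A software model of the sequential guess might look like the sketch below. Real prefetchers are hardware and their detection logic varies; this only illustrates the idea of "look for sequential accesses, then guess the next block". The function name and interface are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* If the `count` most recently accessed block addresses (oldest first)
   are consecutive blocks of block_size bytes, store the address of the
   following block in *guess and return 1; otherwise return 0. */
int guess_next_block(const uint64_t *recent, size_t count,
                     uint64_t block_size, uint64_t *guess) {
    assert(recent != NULL && guess != NULL && block_size > 0);
    if (count < 2)
        return 0;  /* not enough history to call it sequential */
    for (size_t x = 1; x < count; ++x)
        if (recent[x] != recent[x - 1] + block_size)
            return 0;
    *guess = recent[count - 1] + block_size;
    return 1;
}
```

For the 16B blocks at 0x48010, 0x48020, 0x48030, 0x48040 from the earlier slide, the guess is 0x48050.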





### simple blocking – with 3?

for (int kk = 0; kk < N; kk += 3)  
 for (int i = 0; i < N; i += 1)  
 for (int j = 0; j < N; ++j) {  
 C[i\*N+j] += A[i\*N+kk+0] \* B[(kk+0)\*N+j];  
 C[i\*N+j] += A[i\*N+kk+1] \* B[(kk+1)\*N+j];  
 C[i\*N+j] += A[i\*N+kk+2] \* B[(kk+2)\*N+j];  
 }

$\frac{N}{3} \cdot N$ j-loop executions, and (assuming N large):

about 1 miss from A per j-loop: $N^2/3$ total misses (before blocking: $N^2$)

about $3N \div \text{block size}$ misses from B per j-loop: $N^3 \div \text{block size}$ total misses (same as before)

about $N \div \text{block size}$ misses from C per j-loop: $N^3 \div (3 \cdot \text{block size})$ total misses

### more than 3?

can we just keep doing this: increase from 3 to some large X? ...

assumption: X values from A would stay in cache  
X too large: cache not big enough

assumption: X blocks from B would help with spatial locality  
X too large: evicted from cache before next iteration
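The estimates for blocking factors 2 and 3 follow one pattern, sketched below for a general factor X. The C term is derived the same way as the others ($\frac{N}{X} \cdot N$ j-loops times $N \div \text{block size}$ misses each). The formulas only hold while both assumptions above hold, so they say nothing useful about very large X. The names are hypothetical and `block_elems` is the block size in elements.

```c
#include <assert.h>

struct miss_estimate { double a, b, c; };

/* Estimated misses when the k loop is blocked by a factor of x
   (x = 1 is the unblocked kij order). */
struct miss_estimate estimate_k_blocked(double n, double x,
                                        double block_elems) {
    assert(n > 0 && x > 0 && block_elems > 0);
    struct miss_estimate m;
    m.a = n * n / x;                      /* 1 miss per j-loop */
    m.b = n * n * n / block_elems;        /* x*N / block size per j-loop */
    m.c = n * n * n / (x * block_elems);  /* N / block size per j-loop */
    return m;
}
```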





within innermost loop: good spatial locality in A; bad locality in B; good temporal locality in C

loop over j: better spatial locality over A than before; still good temporal locality for A

loop over j: spatial locality over B is worse, but probably not more misses: cache needs to keep two cache blocks for next iter instead of one (probably has the space left over!)

for each kk: for each i: for each j: for k=kk,kk+1: $C_{ij} \mathrel{+}= A_{ik} \cdot B_{kj}$

right now: only really care about keeping 4 cache blocks in j loop

have more than 4 cache blocks? increasing kk increment would use more of them

### keeping values in cache

can't explicitly ensure values are kept in cache

...but reusing values *effectively* does this: cache will try to keep recently used values

cache optimization ideas: choose what's in the cache  
for thinking about it: load values explicitly  
for implementing it: access only values we want loaded



# TLB and the MMU (2)






what happens to TLB when page table base pointer is changed? e.g. context switch

most entries in TLB refer to things from wrong process  
oops: read from the wrong process's stack?

option 1: invalidate all TLB entries  
side effect on "change page table base register" instruction

option 2: TLB entries contain process ID  
set by OS (special register)  
checked by TLB in addition to TLB tag, valid bit
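The two options can be modeled in a few lines of software. This is only a sketch of the idea: real TLBs are hardware, and the struct layout and function names here are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct tlb_entry {
    int valid;
    unsigned process_id;     /* only meaningful for option 2 */
    uint64_t tag;
    uint64_t physical_page;
};

/* Option 1: on a page table base change, invalidate everything. */
void tlb_invalidate_all(struct tlb_entry *entries, size_t count) {
    assert(entries != NULL || count == 0);
    for (size_t x = 0; x < count; ++x)
        entries[x].valid = 0;
}

/* Option 2: an entry only hits if it is valid, its tag matches, and
   it belongs to the process ID the OS set for the running process. */
int tlb_entry_hits(const struct tlb_entry *entry,
                   unsigned current_process_id, uint64_t tag) {
    return entry->valid
        && entry->process_id == current_process_id
        && entry->tag == tag;
}
```

With option 2, entries from another process can stay in the TLB across a context switch without ever producing a hit for the wrong process.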

what happens to TLB when OS changes a page table entry?

most common choice: has to be handled in software

invalid to valid: nothing needed  
TLB doesn't contain invalid entries  
MMU will check memory again

valid to invalid: OS needs to tell processor to invalidate it  
special instruction (x86: invlpg)

valid to other valid: OS needs to tell processor to invalidate it

# address splitting for TLBs (1)

my desktop:

4KB ($2^{12}$ byte) pages; 48-bit virtual address

64-entry, 4-way L1 data TLB

TLB index bits? 64/4 = 16 sets, so 4 bits

TLB tag bits? 48 - 12 = 36 bit virtual page number, so 36 - 4 = 32 bit TLB tag

# address splitting for TLBs (2)

my desktop:

4KB ($2^{12}$ byte) pages; 48-bit virtual address

1536-entry ($3 \cdot 2^9$), 12-way L2 TLB

TLB index bits? 1536/12 = 128 sets, so 7 bits

TLB tag bits? 48 - 12 = 36 bit virtual page number, so 36 - 7 = 29 bit TLB tag
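The same arithmetic works for any TLB whose number of sets is a power of two. The sketch below reproduces both desktop examples; the function names are hypothetical.

```c
#include <assert.h>

/* log2 of a power of two. */
static int log2_exact(unsigned value) {
    assert(value != 0 && (value & (value - 1)) == 0);
    int bits = 0;
    while (value > 1) {
        value >>= 1;
        ++bits;
    }
    return bits;
}

/* Index bits: log2 of the number of sets (entries / ways). */
int tlb_index_bits(unsigned entries, unsigned ways) {
    assert(ways > 0 && entries % ways == 0);
    return log2_exact(entries / ways);
}

/* Tag bits: virtual page number bits that are not index bits. */
int tlb_tag_bits(int va_bits, int page_offset_bits,
                 unsigned entries, unsigned ways) {
    return va_bits - page_offset_bits - tlb_index_bits(entries, ways);
}
```

Note that 1536 entries is not a power of two, but 1536/12 = 128 sets is, which is all the index calculation needs.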
