Circles: Processor
Diamonds: Memory
Image from “Computer Architecture: A Quantitative Approach”
The “Memory Wall”
1989: Intel 80486
with 8 KB cache
Image: approx 2004 AMD press image of Opteron die;
approx register location via chip-architect.org (Hans de Vries)
performance of the fastest (smallest) memory
capacity of the largest (slowest) memory
how to hide 100x latency difference?
temporal locality
“if a value is accessed now, it will be accessed again soon”
spatial locality
“if a value is accessed now, adjacent values will be accessed soon”
total, i, length accessed repeatedly
values[i+1] accessed after values[i]
exercise: which has better temporal locality in A? in B? in C?
how about spatial locality?
better spatial/temporal locality?
block size is 1 book page
(formulas derivable from prior slides)
| \(S=2^s\) | number of sets |
| \(s\) | (set) index bits |
| \(B=2^b\) | block size |
| \(b\) | (block) offset bits |
| \(m\) | memory address bits |
| \(t = m - (s+b)\) | tag bits |
| \(C = B \times{} S\) | cache size (if direct-mapped) |
64-byte blocks, 128 set cache
stores \(64 \times 128 = 8192\) bytes (of data)
if addresses 32-bits, then how many tag/index/offset bits?
which bytes are stored in the same block as byte from 0x1037?
0x1011
0x1021
0x1035
0x1041
A system uses a direct-mapped cache with 32 byte blocks.
Two memory accesses are made:
0x08249
0x0c658
What is the fewest number of cache sets this cache could have to prevent the second read from evicting the first?
What is the size of this cache?
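One way to check answers to the exercises above is to apply the formulas mechanically. The following is a minimal sketch, not part of the original slides; the function names (`log2_pow2`, `tag_bits`, `same_block`, `fewest_sets_without_conflict`) are hypothetical, and it assumes block sizes and set counts are powers of two.

```c
#include <assert.h>
#include <stdint.h>

/* log2 of a power of two, e.g. a block size or a number of sets. */
unsigned log2_pow2(uint64_t x) {
    assert(x != 0 && (x & (x - 1)) == 0);
    unsigned bits = 0;
    while (x > 1) {
        x >>= 1;
        bits++;
    }
    return bits;
}

/* t = m - (s + b) */
unsigned tag_bits(unsigned address_bits, uint64_t block_size, uint64_t num_sets) {
    return address_bits - (log2_pow2(num_sets) + log2_pow2(block_size));
}

/* Two bytes share a block when their addresses agree above the offset bits. */
int same_block(uint64_t addr1, uint64_t addr2, uint64_t block_size) {
    return addr1 / block_size == addr2 / block_size;
}

/* Smallest power-of-two set count for which the two addresses get different
   set indices in a direct-mapped cache; 0 if they are in the same block
   (then no number of sets separates them). */
uint64_t fewest_sets_without_conflict(uint64_t addr1, uint64_t addr2,
                                      uint64_t block_size) {
    assert(block_size >= 2);
    uint64_t block1 = addr1 / block_size, block2 = addr2 / block_size;
    if (block1 == block2)
        return 0;
    uint64_t sets = 1;
    while (block1 % sets == block2 % sets)
        sets *= 2;
    return sets;
}
```

For example, `tag_bits(32, 64, 128)` covers the 64-byte block, 128-set case, and `same_block(0x1037, candidate, 64)` tests each candidate address. The cache size in the last exercise is then the returned set count times the block size.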
simulated 16KB direct-mapped cache; excluding BST setup
actual 32KB more complex cache (only one set of measurements + other things on the machine + excluding initial load)
simulated 16KB direct-mapped data cache; excluding initial load
actual 32KB more complex cache; excluding initial matrix load
none of the blocks for the index are valid
none of the valid blocks for the index match the tag
one of the blocks for the index is valid and matches the tag
need to track total order
worst case: bunch of new accesses to same set
direct-mapped — one block per set
\(E\)-way set associative — \(E\) blocks per set
fully associative — one set total (everything in one set)
| \(m\) | memory address bits |
| \(E\) | number of blocks per set (“ways”) |
| \(S=2^s\) | number of sets |
| \(s\) | (set) index bits |
| \(B=2^b\) | block size |
| \(b\) | (block) offset bits |
| \(t = m - (s+b)\) | tag bits |
| \(C = B \times S \times E\) | cache size (excluding metadata) |
Assume everything but array is kept in registers (and the compiler does not do anything funny), and array[0] is at the beginning of a cache block.
observation: array[0] and array[512] exactly 2KB apart
Assume everything but array1, array2 is kept in registers (and the compiler does not do anything funny).
array1[i] and array2[i] always different sets:
distance from array1 to array2 not a multiple of #sets \(\times\) bytes/set
block of array1[X] values loaded, then used 4 times before loading next block
(and same for array2[X])
array1[i] and array2[i] same sets:
distance from array1 to array2 is a multiple of #sets \(\times\) bytes/set
block of array1[X] values loaded, one value used from it,
block of array2[X] values replaces it, one value used from it, …
can have worst spacing happen by accident:
two structs/arrays placed at start of newly allocated page
columns of matrix with power-of-two width
Assume everything but array is kept in registers (and the compiler does not do anything funny).
observation: array[0], array[256], array[512], array[768] in same set
for each of the following accesses, performed alone, would it require (a) reading a value from memory and (b) writing a value to the memory?
deciding cache size, associativity, etc.?
lots of tradeoffs:
details depend on programs run
simulation to assess impact of designs
| Cache size | direct-mapped | 2-way | 8-way | fully assoc. |
|---|---|---|---|---|
| 1KB | 8.63% | 6.97% | 5.63% | 5.34% |
| 2KB | 5.71% | 4.23% | 3.30% | 3.05% |
| 4KB | 3.70% | 2.60% | 2.03% | 1.90% |
| 16KB | 1.59% | 0.86% | 0.56% | 0.50% |
| 64KB | 0.66% | 0.37% | 0.10% | 0.001% |
| 128KB | 0.27% | 0.001% | 0.0006% | 0.0006% |
| L1 cache | TLB |
|---|---|
| physical addresses | virtual page numbers |
| bytes from memory | page table entries |
| tens of bytes per block | one page table entry per block |
| usually thousands of blocks | usually tens of entries |
only caches the page table lookup itself
(generally) just entries from the last-level page tables
virtual page number divided into index + tag
not much spatial locality between page table entries
(they’re used for kilobytes of data already)
0 block offset bits
few active page table entries at a time
enables highly associative cache designs
| type | virtual | physical | result | set 0 | set 1 |
|---|---|---|---|---|---|
| read | 0x440030 | 0x554030 | | | |
| write | 0x440034 | 0x554034 | | | |
| read | 0x7FFFE008 | 0x556008 | | | |
| read | 0x7FFFE000 | 0x556000 | | | |
| read | 0x7FFFDFF8 | 0x5F8FF8 | | | |
| read | 0x664080 | 0x5F9080 | | | |
| read | 0x440038 | 0x554038 | | | |
| write | 0x7FFFDFF0 | 0x5F8FF0 | | | |
| type | virtual | physical | result | set 0 | set 1 |
|---|---|---|---|---|---|
| read | 0x440030 | 0x554030 | miss | 0x440 | |
| write | 0x440034 | 0x554034 | hit | 0x440 | |
| read | 0x7FFFE008 | 0x556008 | miss | 0x440, 0x7FFFE | |
| read | 0x7FFFE000 | 0x556000 | hit | 0x440, 0x7FFFE | |
| read | 0x7FFFDFF8 | 0x5F8FF8 | miss | 0x440, 0x7FFFE | 0x7FFFD |
| read | 0x664080 | 0x5F9080 | miss | 0x664, 0x7FFFE | 0x7FFFD |
| read | 0x440038 | 0x554038 | miss | 0x664, 0x440 | 0x7FFFD |
| write | 0x7FFFDFF0 | 0x5F8FF0 | hit | 0x664, 0x440 | 0x7FFFD |
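The hit/miss column above can be reproduced with a small simulation. This is a sketch under assumptions read off the worked table rather than stated in it: 4KB pages, 2 sets selected by the low bit of the virtual page number, 2 entries per set, and LRU replacement. The names `tlb`, `tlb_entry`, and `tlb_access` are hypothetical.

```c
#include <stdint.h>

#define TLB_SETS 2   /* assumed: low bit of the virtual page number picks the set */
#define TLB_WAYS 2   /* assumed: set 0 holds two entries at once in the table */
#define PAGE_BITS 12 /* assumed: 4KB pages */

struct tlb_entry {
    int valid;
    uint64_t vpn;       /* full VPN kept for readability; hardware would store only the tag */
    uint64_t last_used; /* larger = more recently used */
};

struct tlb {
    struct tlb_entry e[TLB_SETS][TLB_WAYS];
    uint64_t clock;
};

/* Returns 1 on a hit, 0 on a miss. On a miss the entry is filled,
   evicting the least recently used entry if the set is full. */
int tlb_access(struct tlb *t, uint64_t vaddr) {
    uint64_t vpn = vaddr >> PAGE_BITS;
    struct tlb_entry *set = t->e[vpn % TLB_SETS];
    t->clock++;
    for (int w = 0; w < TLB_WAYS; w++) {
        if (set[w].valid && set[w].vpn == vpn) {
            set[w].last_used = t->clock;
            return 1;
        }
    }
    /* Miss: prefer an invalid way, otherwise the least recently used one. */
    int victim = 0;
    for (int w = 0; w < TLB_WAYS; w++) {
        if (!set[w].valid) {
            victim = w;
            break;
        }
        if (set[w].last_used < set[victim].last_used)
            victim = w;
    }
    set[victim].valid = 1;
    set[victim].vpn = vpn;
    set[victim].last_used = t->clock;
    return 0;
}
```

Starting from a zero-initialized `struct tlb` and calling `tlb_access` on the eight virtual addresses in order gives the result for each row. Reads and writes are treated the same here, since only the translation is cached.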
| set idx | V | tag | physical page | write? | user? | … | LRU? |
|---|---|---|---|---|---|---|---|
| 0 | 1 | 0x00220 (0x00440 >> 1) | 0x554 | 1 | 1 | … | no |
| 0 | 1 | 0x00332 (0x00664 >> 1) | 0x5F9 | 1 | 1 | … | yes |
| 1 | 1 | 0x3FFFE (0x7FFFD >> 1) | 0x5F8 | 1 | 1 | … | no |
| 1 | 0 | … | … | … | … | … | yes |
| | return address | scaleFactor |
|---|---|---|
| address | 0x7ffffffe43b8 | 0x6bc3a0 |
| tag | 0xfffffffc | 0xd7 |
| index | 0x10e | 0x10e |
| offset | 0x38 | 0x20 |
obviously I set that up to have the same index
but this is one of the reasons we’ll want something better than a direct-mapped cache
int array[8];
...
int even_sum = 0, odd_sum = 0;
even_sum += array[0];
odd_sum += array[1];
even_sum += array[2];
odd_sum += array[3];
even_sum += array[4];
odd_sum += array[5];
even_sum += array[6];
odd_sum += array[7];
Assume everything but array is kept in registers (and the compiler does not do anything funny), and array[0] at beginning of cache block.
int array[1024]; // 4KB array
int even_sum = 0, odd_sum = 0;
for (int i = 0; i < 1024; i += 2) {
    even_sum += array[i + 0];
    odd_sum += array[i + 1];
}
Assume everything but array is kept in registers (and the compiler does not do anything funny).
int array[1024]; // 4KB array
int even_sum = 0, odd_sum = 0;
for (int i = 0; i < 1024; i += 2)
    even_sum += array[i + 0];
for (int i = 0; i < 1024; i += 2)
    odd_sum += array[i + 1];
Assume everything but array is kept in registers (and the compiler does not do anything funny).
| A. | quadrupling the block size (256-byte blocks, 64 sets, 8 ways/set) |
| B. | quadrupling the number of sets |
| C. | quadrupling the number of ways/set |
| A. | quadrupling the block size (256-byte block, 8 ways/set, 64KB cache) |
| B. | quadrupling the number of ways/set |
| C. | quadrupling the cache size |
suppose recently accessed 16B cache blocks are at:
guess what’s accessed next
common pattern with instruction fetches and array accesses
look for sequential accesses
bring in guess at next-to-be-accessed value
if right: no cache miss (even if never accessed before)
if wrong: possibly evicted something else — could cause more misses
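A minimal sketch of the simplest form of this idea, next-block prefetching, follows. It is not from the original slides: the cache is a hypothetical 2KB direct-mapped cache with 16B blocks, the prefetcher brings in block X+1 on every access to block X, and all names (`pf_cache`, `pf_access`, `count_misses_sequential`) are made up for illustration.

```c
#include <stdint.h>
#include <string.h>

#define PF_BLOCK_BYTES 16
#define PF_NUM_SETS 128 /* hypothetical 2KB direct-mapped cache */

struct pf_cache {
    int valid[PF_NUM_SETS];
    uint64_t block[PF_NUM_SETS]; /* block number currently stored in each set */
};

/* Place a block in its set, replacing whatever was there. */
static void pf_fill(struct pf_cache *c, uint64_t block) {
    c->valid[block % PF_NUM_SETS] = 1;
    c->block[block % PF_NUM_SETS] = block;
}

/* Returns 1 on a hit, 0 on a miss. If prefetch is nonzero, every access
   also brings in the next sequential block. */
int pf_access(struct pf_cache *c, uint64_t addr, int prefetch) {
    uint64_t block = addr / PF_BLOCK_BYTES;
    unsigned set = (unsigned)(block % PF_NUM_SETS);
    int hit = c->valid[set] && c->block[set] == block;
    if (!hit)
        pf_fill(c, block);
    if (prefetch)
        pf_fill(c, block + 1); /* a guess: if wrong, this may evict a useful block */
    return hit;
}

/* Misses when reading num_bytes sequentially, 4 bytes at a time, from address 0. */
unsigned count_misses_sequential(unsigned num_bytes, int prefetch) {
    struct pf_cache c;
    memset(&c, 0, sizeof c);
    unsigned misses = 0;
    for (unsigned addr = 0; addr < num_bytes; addr += 4)
        misses += !pf_access(&c, addr, prefetch);
    return misses;
}
```

On a purely sequential scan the prefetched block is always the one needed next, so only the very first block misses. The sketch ignores the time a prefetch takes to arrive, which a real design has to account for.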
Assume everything but items is kept in registers (and the compiler does not do anything funny).
Assume everything but array is kept in registers (and the compiler does not do anything funny) and array starts at beginning of cache block.
| access | set 0 after (LRU first) | result |
|---|---|---|
| — | —, — | |
| array[0] | —, array[0 to 3] | miss |
| array[16] | array[0 to 3], array[16 to 19] | miss |
| array[32] | array[16 to 19], array[32 to 35] | miss |
| array[1] | array[32 to 35], array[0 to 3] | miss |
| array[17] | array[0 to 3], array[16 to 19] | miss |
| array[32] | array[16 to 19], array[32 to 35] | miss |
6 misses for set 0
| access | set 2 after (LRU first) | result |
|---|---|---|
| — | —, — | |
| array[8] | —, array[8 to 11] | miss |
| array[24] | array[8 to 11], array[24 to 27] | miss |
| array[9] | array[24 to 27], array[8 to 11] | hit |
| array[25] | array[8 to 11], array[24 to 27] | hit |
2 misses for set 2
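The per-set miss counts in the two tables above can be checked with a short simulation. This sketch assumes a geometry inferred from the tables rather than stated in them: 16B blocks (4 ints), 4 sets, 2 ways, LRU replacement, and array[0] at byte address 0. The struct and function names are hypothetical, and the interleaving of set 0 and set 2 accesses in the usage note is a guess (the sets do not affect each other).

```c
#include <stdint.h>

#define INT_BYTES 4    /* ints are 4 bytes */
#define BLOCK_BYTES 16 /* 4 ints per block, as in array[0 to 3] */
#define NUM_SETS 4     /* assumed: array[0] and array[16], 64 bytes apart, share set 0 */
#define NUM_WAYS 2     /* two blocks listed per set */

struct line {
    int valid;
    uint64_t tag;
    uint64_t last_used; /* larger = more recently used */
};

struct cache {
    struct line lines[NUM_SETS][NUM_WAYS];
    uint64_t clock;
    unsigned misses[NUM_SETS];
};

/* Access array[array_index], assuming array[0] is at byte address 0.
   Returns 1 on a hit and 0 on a miss; misses are counted per set. */
int access_element(struct cache *c, unsigned array_index) {
    uint64_t block = (uint64_t)array_index * INT_BYTES / BLOCK_BYTES;
    unsigned set = (unsigned)(block % NUM_SETS);
    uint64_t tag = block / NUM_SETS;
    struct line *ways = c->lines[set];
    c->clock++;
    for (int w = 0; w < NUM_WAYS; w++) {
        if (ways[w].valid && ways[w].tag == tag) {
            ways[w].last_used = c->clock;
            return 1;
        }
    }
    /* Miss: fill an invalid way if there is one, otherwise evict the LRU block. */
    int victim = 0;
    for (int w = 0; w < NUM_WAYS; w++) {
        if (!ways[w].valid) {
            victim = w;
            break;
        }
        if (ways[w].last_used < ways[victim].last_used)
            victim = w;
    }
    ways[victim].valid = 1;
    ways[victim].tag = tag;
    ways[victim].last_used = c->clock;
    c->misses[set]++;
    return 0;
}
```

Starting from a zero-initialized `struct cache`, accessing indices 0, 8, 16, 24, 32, 1, 9, 17, 25, 32 in that order leaves the per-set counts in `misses[0]` and `misses[2]`.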
Assume everything but array is kept in registers (and the compiler does not do anything funny) and array at beginning of cache block.
ints 4 byte \(\rightarrow{}\) array[0 to 3] and array[16 to 19] in same cache set
accessing array indices 0, 12, 24, 36, 48, 1, 13, 25, 37, 49
0 (set 0, array[0 to 3]), 12 (set 3), 24 (set 2), 36 (set 1), 48 (set 0)
so accesses to 1, 13, 25, 37, 49 are all hits
Assume everything but items is kept in registers (and the compiler does not do anything funny).
2KB direct-mapped cache with 16B blocks:
set 0: address 0 to 15, (0 to 15) + 2KB, (0 to 15) + 4KB, …
set 1: address 16 to 31, (16 to 31) + 2KB, (16 to 31) + 4KB, …
…
set 127: address 2032 to 2047, (2032 to 2047) + 2KB, …
2KB 2-way set associative cache with 16B blocks, by block address:
set 0: address 0, 0 + 1KB, 0 + 2KB, …
set 1: address 16, 16 + 1KB, 16 + 2KB, …
…
set 63: address 1008, 1008 + 1KB, 1008 + 2KB, …
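The two mappings above follow from one calculation: number of sets = cache size / (ways × block size), then set index = block number mod number of sets. A small sketch, with a hypothetical function name `set_index`:

```c
#include <assert.h>
#include <stdint.h>

/* Set index of addr in a cache with cache_bytes of data, `ways` blocks per
   set, and block_bytes per block. */
uint64_t set_index(uint64_t addr, uint64_t cache_bytes, uint64_t ways,
                   uint64_t block_bytes) {
    assert(ways > 0 && block_bytes > 0);
    assert(cache_bytes % (ways * block_bytes) == 0);
    uint64_t num_sets = cache_bytes / (ways * block_bytes);
    return (addr / block_bytes) % num_sets;
}
```

With `cache_bytes = 2048` and `block_bytes = 16`, passing `ways = 1` gives the direct-mapped mapping (128 sets, addresses 2KB apart share a set) and `ways = 2` gives the 2-way mapping (64 sets, addresses 1KB apart share a set).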
int sum = 0; int array[1024]; // 4KB array
for (int i = 8; i < 1016; i += 1) {
    int local_sum = 0;
    for (int j = i - 8; j < i + 8; j += 1) {
        local_sum += array[i] * (j - i);
    }
    sum += (local_sum - array[i]);
}
Assume everything but array is kept in registers (and the compiler does not do anything funny).
| \(m\) | memory address bits (Y86-64: 64) |
| \(E\) | number of blocks per set (“ways”) |
| \(S=2^s\) | number of sets |
| \(s\) | (set) index bits |
| \(B=2^b\) | block size |
| \(b\) | (block) offset bits |
| \(t = m - (s+b)\) | tag bits |
| \(C = B \times{} S \times{} E\) | cache size (excluding metadata) |
My desktop:
L1 Data Cache: 32 KB, 8 blocks/set, 64 byte blocks
L2 Cache: 256 KB, 4 blocks/set, 64 byte blocks
L3 Cache: 8 MB, 16 blocks/set, 64 byte blocks
Divide the address 0x34567 into tag, index, offset for each cache.
| quantity | value for L1 |
|---|---|
| block size (given) | \(B=64\text{Byte}\) |
| \(B=2^b\) (\(b\): block offset bits) | |
| block offset bits | \(b=\)\(6\) |
| blocks/set (given) | \(E=8\) |
| cache size (given) | \(C = 32\text{KB} = E \times{} B \times{} S\) |
| \(S = \frac{C}{B\times E}\) (\(S\): number of sets) | |
| number of sets | \(S = \frac{32\text{KB}}{64\text{Byte}\times 8} =\) \(64\) |
| \(S = 2^s\) (\(s\): set index bits) | |
| set index bits | \(s = \text{log}_2(64)=\) \(6\) |
| | L1 | L2 | L3 |
|---|---|---|---|
| sets | 64 | 1024 | 8192 |
| block offset bits | 6 | 6 | 6 |
| set index bits | 6 | 10 | 13 |
| tag bits | (the rest) | | |
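The bit counts in the table above, and the resulting tag/index/offset split, can be computed directly from the three given quantities (cache size, blocks per set, block size). A minimal sketch, with hypothetical names (`geometry`, `cache_geometry`, `offset_of`, `index_of`, `tag_of`) and assuming all three quantities are powers of two:

```c
#include <assert.h>
#include <stdint.h>

struct geometry {
    uint64_t sets;        /* S = C / (B * E) */
    unsigned offset_bits; /* b = log2(B) */
    unsigned index_bits;  /* s = log2(S) */
};

/* log2 of a power of two. */
unsigned ilog2(uint64_t x) {
    assert(x != 0 && (x & (x - 1)) == 0);
    unsigned bits = 0;
    while (x >>= 1)
        bits++;
    return bits;
}

struct geometry cache_geometry(uint64_t cache_bytes, uint64_t ways,
                               uint64_t block_bytes) {
    struct geometry g;
    g.sets = cache_bytes / (block_bytes * ways);
    g.offset_bits = ilog2(block_bytes);
    g.index_bits = ilog2(g.sets);
    return g;
}

/* Low b bits of the address. */
uint64_t offset_of(uint64_t addr, struct geometry g) {
    return addr & ((UINT64_C(1) << g.offset_bits) - 1);
}

/* Next s bits above the offset. */
uint64_t index_of(uint64_t addr, struct geometry g) {
    return (addr >> g.offset_bits) & (g.sets - 1);
}

/* Everything above the index bits. */
uint64_t tag_of(uint64_t addr, struct geometry g) {
    return addr >> (g.offset_bits + g.index_bits);
}
```

For the L1 cache, `cache_geometry(32 * 1024, 8, 64)` gives the geometry, and `tag_of`, `index_of`, and `offset_of` applied to `0x34567` give the three fields. The L2 and L3 cases only change the first two arguments.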
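0x34567:
| hex | 3 | 4 | 5 | 6 | 7 |
|---|---|---|---|---|---|
| binary | 0011 | 0100 | 0101 | 0110 | 0111 |

| | L1 | L2 | L3 |
|---|---|---|---|
| block offset | 100111 = 0x27 | 100111 = 0x27 | 100111 = 0x27 |
| set index | 01 0101 = 0x15 | 01 0001 0101 = 0x115 | 0 1101 0001 0101 = 0xD15 |
| tag | 0x34 | 0x3 | 0x0 |

common to categorize misses: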
compulsory (or cold) — first time accessing something
conflict — sets aren’t big/flexible enough
capacity — cache was not big enough
coherence — from sync’ing cache with other caches
| | miss rate | hit time | miss penalty |
|---|---|---|---|
| increase cache size | good | bad | — |
| increase associativity | good | bad | bad? |
| increase block size | depends | bad | bad |
| add secondary cache | — | — | good |
| write-allocate | good | — | ? |
| writeback | — | — | ? |
| LRU replacement | good | ? | bad? |
| prefetching | good | — | — |

prefetching = guess what program will use, access in advance
\[ \text{average time} = \text{hit time} + \text{miss rate} \times \text{miss penalty} \]
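As a quick illustration of how the formula trades off its three terms, here is a one-function sketch. The function name and the sample numbers in the usage note are assumptions for illustration, not measurements from the slides.

```c
/* average time = hit time + miss rate * miss penalty
   (hit_time and miss_penalty in the same unit, miss_rate as a fraction) */
double average_access_time(double hit_time, double miss_rate,
                           double miss_penalty) {
    return hit_time + miss_rate * miss_penalty;
}
```

For instance, with a hypothetical 4-cycle hit time and 100-cycle miss penalty, a 1.59% miss rate gives 4 + 0.0159 × 100 = 5.59 cycles on average, while a 0.56% miss rate gives 4.56 cycles. A change that lowers the miss rate but raises the hit time only helps if the second term shrinks by more than the first grows.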
| | capacity | conflict | compulsory |
|---|---|---|---|
| increase cache size | good | good | — |
| increase associativity | — | good | — |
| increase block size | bad? | bad? | good |
| LRU replacement | — | good | — |
| prefetching | — | — | good |
what happens to TLB when page table base pointer is changed?
most entries in TLB refer to things from wrong process
option 1: invalidate all TLB entries
option 2: TLB entries contain process ID
what happens to TLB when OS changes a page table entry?
most common choice: has to be handled in software
invalid to valid — nothing needed
valid to invalid — OS needs to tell processor to invalidate it
(on x86: invlpg)
valid to other valid — OS needs to tell processor to invalidate it