caching

Imagine you are working at Intel in the 1980s

Circles: Processor
Diamonds: Memory

Image from “Computer Architecture: A Quantitative Approach”

Imagine you are working at Intel in the 1980s

The “Memory Wall”

Image from “Computer Architecture: A Quantitative Approach”

1989: Intel 80486
with 8 KB cache

the place of cache

memory hierarchy

Image: approx 2004 AMD press image of Opteron die;
approx register location via chip-architect.org (Hans de Vries)

memory hierarchy goals

performance of the fastest (smallest) memory
capacity of the largest (slowest) memory
how to hide 100x latency difference?
- 99+% hit rate (“hit” means value found in cache)

memory hierarchy assumptions

temporal locality
‘‘if a value is accessed now, it will be accessed again soon’’
- caches should keep recently accessed values

spatial locality
‘‘if a value is accessed now, adjacent values will be accessed soon’’
- caches should store adjacent values at the same time

natural properties of programs — think about loops

locality examples

double computeMean(int length, double *values) {
    double total = 0.0;
    for (int i = 0; i < length; ++i) {
        total += values[i];
    }
    return total / length;
}

temporal locality: machine code of the loop
spatial locality: machine code of most consecutive instructions
temporal locality: total, i, length accessed repeatedly
spatial locality: values[i+1] accessed after values[i]

locality exercise (1)

/* version 1 */
for (int i = 0; i < N; ++i)
    for (int j = 0; j < N; ++j)
        A[i] += B[j] * C[i * N + j];

/* version 2 */
for (int j = 0; j < N; ++j)
    for (int i = 0; i < N; ++i)
        A[i] += B[j] * C[i * N + j];

exercise: which has better temporal locality in A? in B? in C?
how about spatial locality?

locality exercise (2)

/* version 1 */
for (int i = 0; i < N; ++i)
    for (int j = 0; j < N; ++j)
        A[i] += B[j] * C[i * N + j];

/* version 3 */
for (int ii = 0; ii < N; ii += 32)
    for (int jj = 0; jj < N; jj += 32)
        for (int i = ii; i < ii + 32; ++i)
            for (int j = jj; j < jj + 32; ++j)
                A[i] += B[j] * C[i * N + j];

exercise: which has better temporal locality in A? in B? in C?
how about spatial locality?
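Version 3 as written assumes N is a multiple of 32. A minimal sketch of the same tiled traversal that also handles other sizes is below; the function name `accumulate_blocked`, the helper `min_size`, the constant `BLOCK`, and the choice of `double` elements are assumptions for illustration, not part of the original exercise.

```c
#include <stddef.h>

enum { BLOCK = 32 };  /* tile edge, matching the 32 used in version 3 */

/* Smaller of two sizes; used to clamp a tile at the edge of the matrix. */
static size_t min_size(size_t a, size_t b) { return a < b ? a : b; }

/* Performs A[i] += B[j] * C[i * n + j] for all i, j, one tile at a time. */
void accumulate_blocked(double *A, const double *B, const double *C, size_t n)
{
    for (size_t ii = 0; ii < n; ii += BLOCK) {
        size_t i_end = min_size(ii + BLOCK, n);
        for (size_t jj = 0; jj < n; jj += BLOCK) {
            size_t j_end = min_size(jj + BLOCK, n);
            for (size_t i = ii; i < i_end; ++i)
                for (size_t j = jj; j < j_end; ++j)
                    A[i] += B[j] * C[i * n + j];
        }
    }
}
```

For a fixed i, the values of j are still visited in increasing order, so the result matches version 1; only the order in which memory is touched changes.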

locality exercise (3)

struct Student {
    char name[128]; long id; float grade;
};
struct Student students[1000];

float averageGrade() {
    float sum = 0.0;
    for (int i = 0; i < 1000; i += 1)
        sum += students[i].grade;
    return sum / 1000.0;
}

char studentNames[1000][128];
long studentIds[1000];
float studentGrades[1000];

float average() {
    float sum = 0.0;
    for (int i = 0; i < 1000; i += 1)
        sum += studentGrades[i];
    return sum / 1000.0;
}

better spatial/temporal locality?
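One way to compare the two layouts is to count how many cache blocks each loop touches. The sketch below is only a counting aid: `blocks_touched` is a hypothetical helper, and it assumes the array starts at the beginning of a cache block and that accesses move forward through memory.

```c
#include <stddef.h>

/* Counts the distinct cache blocks containing the first byte of each of
   `count` accesses, starting at byte `first` and spaced `stride` bytes apart.
   Assumes stride > 0, so block numbers never decrease. */
size_t blocks_touched(size_t first, size_t stride, size_t count, size_t block_size)
{
    size_t touched = 0;
    size_t last_block = 0;
    for (size_t k = 0; k < count; ++k) {
        size_t block = (first + k * stride) / block_size;
        if (k == 0 || block != last_block) {
            ++touched;
            last_block = block;
        }
    }
    return touched;
}
```

For the struct version, `first` would be `offsetof(struct Student, grade)` and `stride` would be `sizeof(struct Student)`. On a typical 64-bit ABI these are likely 136 and 144, but the layout is implementation-defined. For the separate `studentGrades` array, `first` is 0 and `stride` is `sizeof(float)`.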

split caches; multiple cores (one design)

hierarchy and instruction/data caches

typically separate data and instruction caches for L1
(almost) never going to read instructions as data or vice-versa
avoids instructions evicting data and vice-versa
can optimize instruction cache for different access pattern
easier to build fast caches: handle fewer accesses at a time

cache analogy: you finding words in books

library: main memory
- contains all the words in all books
copy of books you took home: L2 cache
- faster to look in books already at home
copy of individual book pages in your backpack: L1 cache
- fastest to look at book pages you have with you

one-block cache

building a (direct-mapped) cache

terminology

row = set
- preview: change how much is in a row

cache analogy: you finding words in books

block size is 1 book page

book title: tag
page number: index
word index on book page: offset

Tag-Index-Offset (TIO)

cache size

cache size = amount of data in cache
does not include metadata (tags, valid bits, etc.)

Tag-Index-Offset formulas (direct-mapped)

(formulas derivable from prior slides)

$S=2^s$	number of sets
$s$	(set) index bits
$B=2^b$	block size
$b$	(block) offset bits
$m$	memory address bits
$t = m - (s+b)$	tag bits
$C = B \times{} S$	cache size (if direct-mapped)
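The formulas above can be sketched as a small address splitter. The type `address_parts` and the function `split_address` are names made up for this illustration; `s` and `b` are the index and offset bit counts from the table.

```c
#include <stdint.h>

typedef struct {
    uint64_t tag;
    uint64_t index;
    uint64_t offset;
} address_parts;

/* Splits `address` using b block-offset bits and s set-index bits.
   Assumes s + b < 64 so that every shift is well defined. */
address_parts split_address(uint64_t address, unsigned s, unsigned b)
{
    address_parts parts;
    parts.offset = address & ((UINT64_C(1) << b) - 1);        /* low b bits */
    parts.index = (address >> b) & ((UINT64_C(1) << s) - 1);  /* next s bits */
    parts.tag = address >> (s + b);                           /* everything else */
    return parts;
}
```

For example, `split_address(0x34567, 6, 6)` gives offset 0x27, index 0x15, and tag 0x34.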

TIO: exercise

64-byte blocks, 128 set cache
stores $64 \times 128 = 8192$ bytes (of data)
if addresses 32-bits, then how many tag/index/offset bits?
which bytes are stored in the same block as byte from 0x1037?
- A. byte from 0x1011
- B. byte from 0x1021
- C. byte from 0x1035
- D. byte from 0x1041

cache size exercise

A system uses a direct-mapped cache with 32 byte blocks.
Two memory accesses are made:
- Read 0x08249
- Read 0x0c658
What is the fewest number of cache sets this cache could have to prevent the second read from evicting the first?
What is the size of this cache?

example access pattern (1)

exercise

mapping of sets to memory (direct-mapped)

simulated misses: BST lookups

simulated 16KB direct-mapped cache; excluding BST setup

actual misses: BST lookups

actual 32KB more complex cache (only one set of measurements + other things on the machine + excluding initial load)

simulated misses: matrix multiplies

simulated 16KB direct-mapped data cache; excluding initial load

actual misses: matrix multiplies

actual 32KB more complex cache; excluding initial matrix load

adding associativity

associative lookup possibilities

none of the blocks for the index are valid
none of the valid blocks for the index match the tag
- something else is stored there
one of the blocks for the index is valid and matches the tag

cache operation (associative)

replacement policies

example replacement policies

least recently used
- take advantage of temporal locality
- at least $\left\lceil{}\log_2(E!)\right\rceil$ bits per set for $E$-way cache
  - (need to store order of all blocks)
approximations of least recently used
- implementing least recently used is expensive
- really just need ‘‘avoid recently used’’ — much faster/simpler
- good approximations: $E$ to $2E$ bits
first-in, first-out
- counter per set — where to replace next
(pseudo-)random
- no extra information!
- actually works pretty well in practice

LRU bit updating

LRU w/ more than two ways?

need to track total order
worst case: bunch of new accesses to same set
- first replaces least recently used
- next replaces next least recently used
- etc.

hard to track, so frequently only approximated

associativity terminology

direct-mapped — one block per set
$E$-way set associative — $E$ blocks per set
- $E$ ways in the cache
fully associative — one set total (everything in one set)

Tag-Index-Offset formulas

$m$	memory address bits
$E$	number of blocks per set (‘‘ways’’)
$S=2^s$	number of sets
$s$	(set) index bits
$B=2^b$	block size
$b$	(block) offset bits
$t = m - (s+b)$	tag bits
$C = B \times S \times E$	cache size (excluding metadata)
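A minimal sketch of a cache with $S$ sets, $E$ ways, and $B$-byte blocks, useful for checking miss counts by hand. It tracks only tags (no data) and implements LRU with per-block timestamps, which is a simulation convenience rather than how hardware stores the order. All type and function names here are assumptions for illustration.

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>

typedef struct {
    bool valid;
    uint64_t tag;
    uint64_t last_used;   /* time of most recent access; 0 if never used */
} cache_line;

typedef struct {
    unsigned sets, ways, block_size;   /* S, E, B */
    uint64_t clock;                    /* counts accesses */
    cache_line *lines;                 /* sets * ways entries, set by set */
} cache;

/* Allocates an empty cache; returns false if allocation fails. */
bool cache_init(cache *c, unsigned sets, unsigned ways, unsigned block_size)
{
    c->sets = sets;
    c->ways = ways;
    c->block_size = block_size;
    c->clock = 0;
    c->lines = calloc((size_t)sets * ways, sizeof *c->lines);
    return c->lines != NULL;
}

void cache_free(cache *c) { free(c->lines); }

/* Returns true on a hit. On a miss, replaces the least recently used way;
   never-used (invalid) ways have timestamp 0, so they are chosen first. */
bool cache_access(cache *c, uint64_t address)
{
    uint64_t block = address / c->block_size;
    uint64_t index = block % c->sets;
    uint64_t tag = block / c->sets;
    cache_line *set = &c->lines[index * c->ways];
    cache_line *victim = &set[0];
    ++c->clock;
    for (unsigned w = 0; w < c->ways; ++w) {
        if (set[w].valid && set[w].tag == tag) {
            set[w].last_used = c->clock;
            return true;
        }
        if (set[w].last_used < victim->last_used)
            victim = &set[w];
    }
    victim->valid = true;
    victim->tag = tag;
    victim->last_used = c->clock;
    return false;
}
```

With `ways` set to 1 this behaves as a direct-mapped cache. To model `array[i]` for a block-aligned `int` array with 4-byte ints, pass `i * 4` as the address.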

C and cache misses (1)

int array[4];
...
int even_sum = 0, odd_sum = 0;
even_sum += array[0];
odd_sum += array[1];
even_sum += array[2];
odd_sum += array[3];

Assume everything but array is kept in registers (and the compiler does not do anything funny).
How many data cache misses on a 1-set direct-mapped cache with 8B blocks?

some possibilities

aside: alignment

compilers and malloc/new implementations usually try to align values
align = make address be multiple of something
most important reason: don’t cross cache block boundaries
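A small sketch of both ideas: rounding an address up to a multiple, and checking whether a value would cross a cache block boundary. The function names `align_up` and `crosses_block` are hypothetical, and `align_up` assumes the alignment is a power of two.

```c
#include <stdbool.h>
#include <stdint.h>

/* Rounds `address` up to the next multiple of `alignment`.
   Assumes `alignment` is a power of two. */
uint64_t align_up(uint64_t address, uint64_t alignment)
{
    return (address + alignment - 1) & ~(alignment - 1);
}

/* True if the `size` bytes starting at `address` span more than one
   cache block. Assumes size > 0. */
bool crosses_block(uint64_t address, uint64_t size, uint64_t block_size)
{
    return address / block_size != (address + size - 1) / block_size;
}
```

For example, an 8-byte value at 0x3C crosses a 64-byte block boundary, while the same value at `align_up(0x3C, 8)`, which is 0x40, does not.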

C and cache misses (2)

int array[4];
int even_sum = 0, odd_sum = 0;
even_sum += array[0];
even_sum += array[2];
odd_sum += array[1];
odd_sum += array[3];

Assume everything but array is kept in registers (and the compiler does not do anything funny).
Assume array[0] at beginning of cache block.
How many data cache misses on a 1-set direct-mapped cache with 8B blocks?

exercise solution

C and cache misses (3)

int array[8]; /* assume aligned */
...
int even_sum = 0, odd_sum = 0;
even_sum += array[0];
even_sum += array[2];
even_sum += array[4];
even_sum += array[6];
odd_sum += array[1];
odd_sum += array[3];
odd_sum += array[5];
odd_sum += array[7];

Assume everything but array is kept in registers (and the compiler does not do anything funny).
How many data cache misses on a 2-set direct-mapped cache with 8B blocks?

exercise solution

C and cache misses (4)

int array[8]; /* assume aligned */
...
int even_sum = 0, odd_sum = 0;
even_sum += array[0];
odd_sum += array[3];
even_sum += array[6];
odd_sum += array[1];
even_sum += array[4];
odd_sum += array[7];
even_sum += array[2];
odd_sum += array[5];

Assume everything but array is kept in registers (and the compiler does not do anything funny).
How many data cache misses on a 2-set direct-mapped cache with 8B blocks?

C and cache misses (5)

int array[1024]; /* assume aligned */ int even = 0, odd = 0;
even += array[0];
even += array[2];
even += array[512];
even += array[514];
odd += array[1];
odd += array[3];
odd += array[511];
odd += array[513];

Assume everything but array is kept in registers (and the compiler does not do anything funny).
observation: array[0] and array[512] exactly 2KB apart

How many data cache misses on a 2KB direct mapped cache with 16B blocks?

C and cache misses (6)

int array[1024]; /* assume aligned */ int even = 0, odd = 0;
even += array[0];
even += array[2];
even += array[500];
even += array[502];
odd += array[1];
odd += array[3];
odd += array[501];
odd += array[503];

Assume everything but array is kept in registers (and the compiler does not do anything funny).

How many data cache misses on a 2KB direct mapped cache with 16B blocks?

misses with skipping

int array1[512]; int array2[512];
...
for (int i = 0; i < 512; i += 1) {
    sum += array1[i] * array2[i];
}

Assume everything but array1, array2 is kept in registers (and the compiler does not do anything funny).
About how many data cache misses on a 2KB direct-mapped cache with 16B cache blocks?
Hint: depends on relative placement of array1, array2

best/worst case

array1[i] and array2[i] always different sets:
- = distance from array1 to array2 not multiple of #sets $\times$ bytes/set
- 2 misses every 4 i
- blocks of 4 array1[X] values loaded, then used 4 times before loading next block
- (and same for array2[X])
array1[i] and array2[i] same sets:
- = distance from array1 to array2 is multiple of #sets $\times$ bytes/set
- 2 misses every i
- block of 4 array1[X] values loaded, one value used from it,
- then, block of 4 array2[X] values replaces it, one value used from it, …
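The best and worst cases can be checked with a small simulation of the 2KB direct-mapped cache with 16B blocks. This is a sketch under stated assumptions: `array1` starts at byte address 0 (block-aligned), `array2` starts `distance` bytes later, ints are 4 bytes, and only the two array reads touch the cache. The names `dm_cache`, `dm_access`, and `dot_product_misses` are made up for illustration.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

enum { SETS = 128, BLOCK_BYTES = 16, ELEMENTS = 512, INT_BYTES = 4 };

/* Direct-mapped cache: one tag per set, no data stored. */
typedef struct {
    bool valid[SETS];
    uint64_t tag[SETS];
} dm_cache;

/* Returns true on a hit; on a miss, replaces whatever the set held. */
static bool dm_access(dm_cache *c, uint64_t address)
{
    uint64_t block = address / BLOCK_BYTES;
    unsigned index = (unsigned)(block % SETS);
    uint64_t tag = block / SETS;
    if (c->valid[index] && c->tag[index] == tag)
        return true;
    c->valid[index] = true;
    c->tag[index] = tag;
    return false;
}

/* Misses for: sum += array1[i] * array2[i], for i from 0 to 511. */
unsigned dot_product_misses(uint64_t distance)
{
    dm_cache c;
    unsigned misses = 0;
    memset(&c, 0, sizeof c);   /* start with an empty cache */
    for (unsigned i = 0; i < ELEMENTS; ++i) {
        if (!dm_access(&c, (uint64_t)i * INT_BYTES)) ++misses;
        if (!dm_access(&c, distance + (uint64_t)i * INT_BYTES)) ++misses;
    }
    return misses;
}
```

A distance that is a multiple of 2048 bytes (#sets times bytes/set) puts `array1[i]` and `array2[i]` in the same set; a distance such as 2048 + 16 keeps them in different sets.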

worst case in practice?

can have worst spacing happen by accident:
two structs/arrays placed at start of newly allocated page
- worst-case spacing likely small multiple of page size
columns of matrix with power-of-two width

mapping of sets to memory (3-way)

C and cache misses (assoc)

int array[1024]; /* assume aligned */
int even_sum = 0, odd_sum = 0;
even_sum += array[0];
even_sum += array[2];
even_sum += array[512];
even_sum += array[514];
odd_sum += array[1];
odd_sum += array[3];
odd_sum += array[511];
odd_sum += array[513];

Assume everything but array is kept in registers (and the compiler does not do anything funny).
observation: array[0], array[256], array[512], array[768] in same set

How many data cache misses on a 2KB 2-way set associative cache with 16B blocks?

C and cache misses (assoc)

int array[1024]; /* assume aligned */
int even_sum = 0, odd_sum = 0;
even_sum += array[0];
even_sum += array[256];
even_sum += array[512];
even_sum += array[768];
odd_sum += array[1];
odd_sum += array[257];
odd_sum += array[513];
odd_sum += array[769];

Assume everything but array is kept in registers (and the compiler does not do anything funny).
observation: array[0], array[256], array[512], array[768] in same set

How many data cache misses on a 2KB 2-way set associative cache with 16B blocks?

handling writes

what about writing to the cache?
two decision points:
if the value is not in cache, do we add it?
- if yes: need to load rest of block — write-allocate
- if no: missing out on locality? write-no-allocate
if value is in cache, when do we update next level?
- if immediately: extra writing write-through
- if later: need to remember to do so write-back

allocate on write?

say processor writes less than whole cache block
and block not yet in cache
two options:
write-allocate
- fetch rest of cache block, replace written part
- (then follow write-through or write-back policy)
write-no-allocate
- don’t use cache at all (send write to memory instead)
- guess: not read soon?
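The policy combinations can be summarized by how much next-level traffic a single partial-block write causes. The sketch below is a simplification, not a model of any specific processor: `write_traffic` and `next_level_traffic` are hypothetical names, and `victim_dirty` only matters when a write-back cache must evict a block to make room.

```c
#include <stdbool.h>

typedef struct {
    int reads;    /* block reads from the next level */
    int writes;   /* writes sent to the next level */
} next_level_traffic;

/* Traffic caused by one write of less than a whole cache block. */
next_level_traffic write_traffic(bool hit, bool victim_dirty,
                                 bool write_allocate, bool write_back)
{
    next_level_traffic t = {0, 0};
    if (!hit) {
        if (!write_allocate) {
            t.writes = 1;        /* skip the cache, send the write onward */
            return t;
        }
        t.reads = 1;             /* fetch the rest of the block */
        if (write_back && victim_dirty)
            t.writes += 1;       /* evicted dirty block must be written back */
    }
    if (!write_back)
        t.writes += 1;           /* write-through: forward the store now */
    /* write-back: the block is only marked dirty and written when evicted */
    return t;
}
```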

write-allocate v. write-no-allocate

write-through v. write-back

write-back policy

write-allocate + write-back

write-no-allocate + write-back

cache write exercise (1)

for each of the following accesses, performed alone, would it require (a) reading a value from memory (or next level of cache) and (b) writing a value to the memory (or next level of cache)?
- writing 1 byte to 0x33
- reading 1 byte from 0x52
- reading 1 byte from 0x50

cache write exercise (1, solution)

writing 1 byte to 0x33: (set 1, offset 1) no next-level read or write
reading 1 byte from 0x52: (set 1, offset 0) write back 0x32-0x33; read 0x52-0x53
reading 1 byte from 0x50: (set 0, offset 0) replace 0x30-0x31 (no write back); read 0x50-0x51

cache write exercise (2)

for each of the following accesses, performed alone, would it require (a) reading a value from memory and (b) writing a value to the memory?
- writing 1 byte to 0x33
- reading 1 byte from 0x52
- reading 1 byte from 0x50

cache write exercise (2, solution)

writing 1 byte to 0x33: (set 1, offset 1) write-through 0x33 modification
reading 1 byte from 0x52: (set 1, offset 0) replace 0x32-0x33; read 0x52-0x53
reading 1 byte from 0x50: (set 0, offset 0) replace 0x30-0x31; read 0x50-0x51

fast writes with write-through caches

cache tradeoffs briefly

deciding cache size, associativity, etc.?
lots of tradeoffs:
- more cache hits v. slower cache hits?
- faster cache hits v. fewer cache hits?
- (N+1)th-level cache v. larger Nth level cache?
- …
details depend on programs run
- how often is same block used again?
- how often is same index bits used?
- how much {temporal,spatial} locality to take advantage of?
simulation to assess impact of designs

cache organization and miss rate

depends on program; one example:
SPEC CPU2000 benchmarks, 64B block size, LRU replacement

data cache miss rates:

Cache size	direct-mapped	2-way	8-way	fully assoc.
1KB	8.63%	6.97%	5.63%	5.34%
2KB	5.71%	4.23%	3.30%	3.05%
4KB	3.70%	2.60%	2.03%	1.90%
16KB	1.59%	0.86%	0.56%	0.50%
64KB	0.66%	0.37%	0.10%	0.001%
128KB	0.27%	0.001%	0.0006%	0.0006%

average memory access time

$\text{AMAT} = \text{hit time} + \text{miss penalty} \times{} \text{miss rate}$
- or $\text{AMAT} = \text{hit time} \times \text{hit rate} + \text{miss time} \times \text{miss rate}$
effective speed of memory
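The first AMAT formula as a short sketch, plus its rearrangement for the largest miss rate that still meets a target. The function names are assumptions for illustration; times are in cycles and rates are fractions between 0 and 1.

```c
/* AMAT = hit time + miss penalty * miss rate. */
double amat(double hit_time, double miss_penalty, double miss_rate)
{
    return hit_time + miss_penalty * miss_rate;
}

/* Largest miss rate that keeps AMAT at or below `target`
   (the formula above solved for miss rate). */
double max_miss_rate(double target, double hit_time, double miss_penalty)
{
    return (target - hit_time) / miss_penalty;
}
```

For example, `amat(2, 30, 0.10)` is 5 cycles.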

AMAT exercise (1)

90% cache hit rate
hit time is 2 cycles
30 cycle miss penalty
what is the average memory access time?
5 cycles
suppose we could increase hit rate by increasing its size, but it would increase the hit time to 3 cycles
how much do we have to increase the hit rate for this to not increase AMAT?
to miss rate of 2/30 $\rightarrow$ to approx 93% hit rate

two-level page table lookup

The virtual address is split into a VPN and a page offset. The VPN is further split into two parts. In this example, those parts are equally sized, but that is not required.

The first (most significant) part of the VPN is multiplied by the PTE (page table entry) size, then added to the page table base register to yield the 1st PTE address.

The first PTE is retrieved from that address, split into parts, and checked for validity and permissions (possibly causing a fault instead of continuing the page table lookup).

The physical page number from the 1st PTE is multiplied by the page size to convert it to a physical address. This physical address is used similarly to how the page table base register was used for the 1st PTE: the second part of the VPN is multiplied by the PTE size and added to this physical address to form the 2nd PTE address.

The multiplication by the page size converts the physical page number to a physical address, which we need for the array lookup operation. An equivalent way of doing this conversion would be to append an all-zeroes page offset to the physical page number to form the physical address.

The second PTE is split into parts, checked for validity and permissions. If no fault occurs, the physical page number from the lookup is used as part of the final physical address.

The PPN from the last PTE is combined with the page offset from the original virtual address to form the final physical address, and memory is accessed there.

In this two-level lookup, there are three memory (or cache) accesses.

The part of the processor that does the whole lookup is called the memory management unit (MMU).
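The lookup steps above can be sketched in software. Everything concrete here is an assumption chosen for illustration: 4KB pages, 9-bit VPN parts, 8-byte PTEs, and a PTE layout with the valid bit in bit 0 and the physical page number in bits 12 and up. Permission checks are omitted, and `memory` stands in for physical memory.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical parameters: 4KB pages, 9-bit VPN parts, 8-byte PTEs. */
enum { OFFSET_BITS = 12, VPN_PART_BITS = 9, PTE_BYTES = 8 };

/* Hypothetical PTE layout: bit 0 = valid, bits 12 and up = PPN. */
static bool pte_valid(uint64_t pte) { return (pte & 1) != 0; }
static uint64_t pte_ppn(uint64_t pte) { return pte >> OFFSET_BITS; }
uint64_t make_pte(uint64_t ppn) { return (ppn << OFFSET_BITS) | 1; }

/* Reads one PTE from simulated physical memory. */
static uint64_t read_pte(const uint8_t *memory, uint64_t address)
{
    uint64_t pte;
    memcpy(&pte, memory + address, sizeof pte);
    return pte;
}

/* Translates vaddr; returns false where hardware would raise a fault. */
bool translate(const uint8_t *memory, uint64_t ptbr,
               uint64_t vaddr, uint64_t *paddr)
{
    uint64_t part_mask = (UINT64_C(1) << VPN_PART_BITS) - 1;
    uint64_t offset = vaddr & ((UINT64_C(1) << OFFSET_BITS) - 1);
    uint64_t vpn = vaddr >> OFFSET_BITS;
    uint64_t vpn1 = (vpn >> VPN_PART_BITS) & part_mask;  /* most significant part */
    uint64_t vpn2 = vpn & part_mask;

    uint64_t pte1 = read_pte(memory, ptbr + vpn1 * PTE_BYTES);    /* access 1 */
    if (!pte_valid(pte1))
        return false;
    uint64_t table2 = pte_ppn(pte1) << OFFSET_BITS;  /* PPN times page size */
    uint64_t pte2 = read_pte(memory, table2 + vpn2 * PTE_BYTES);  /* access 2 */
    if (!pte_valid(pte2))
        return false;
    *paddr = (pte_ppn(pte2) << OFFSET_BITS) | offset;  /* access 3 goes here */
    return true;
}
```

The shift by `OFFSET_BITS` is the "multiply by the page size" step, and the final OR is the "combine with the page offset" step.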

another view

cache accesses and multi-level PTs

four-level page tables — five cache accesses per program memory access
L1 cache hits — typically a couple cycles each?
so add 8 cycles to each program memory access?
not acceptable

program memory active sets

page table entries and locality

page table entries have excellent temporal locality
typically one or two pages of the stack active
typically one or two pages of code active
typically one or two pages of heap/globals active
each page contains whole functions, arrays, stack frames, etc.
the set of needed page table entries is very small

page table entry cache

called a TLB (translation lookaside buffer)

(usually very small) cache of page table entries

L1 cache	TLB
physical addresses	virtual page numbers
bytes from memory	page table entries
tens of bytes per block	one page table entry per block
usually thousands of blocks	usually tens of entries

only caches the page table lookup itself
(generally) just entries from the last-level page tables

virtual page number divided into index + tag

not much spatial locality between page table entries
(they’re used for kilobytes of data already)

0 block offset bits

few active page table entries at a time
enables highly associative cache designs

TLB and multi-level page tables

TLB caches valid last-level page table entries
doesn’t matter which last-level page table
means TLB output can be used directly to form address
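A minimal sketch of a 2-set, 2-way TLB with LRU replacement, assuming 4096-byte pages. The structure and function names are hypothetical, permission bits are left out, and with only two ways the LRU state is just "which way was used less recently".

```c
#include <stdbool.h>
#include <stdint.h>

enum { TLB_SETS = 2, TLB_WAYS = 2, PAGE_OFFSET_BITS = 12 };

typedef struct {
    bool valid;
    uint64_t tag;   /* virtual page number without the index bits */
    uint64_t ppn;   /* physical page number from the last-level PTE */
} tlb_entry;

typedef struct {
    tlb_entry way[TLB_SETS][TLB_WAYS];
    int lru[TLB_SETS];   /* least recently used way of each set */
} tlb;

/* Returns true on a TLB hit and stores the physical page number. */
bool tlb_lookup(tlb *t, uint64_t vaddr, uint64_t *ppn)
{
    uint64_t vpn = vaddr >> PAGE_OFFSET_BITS;
    uint64_t set = vpn % TLB_SETS;   /* low VPN bits are the index */
    uint64_t tag = vpn / TLB_SETS;   /* remaining VPN bits are the tag */
    for (int w = 0; w < TLB_WAYS; ++w) {
        tlb_entry *e = &t->way[set][w];
        if (e->valid && e->tag == tag) {
            *ppn = e->ppn;
            t->lru[set] = 1 - w;     /* the other way is now the LRU one */
            return true;
        }
    }
    return false;
}

/* After a miss and a page table lookup, caches the translation. */
void tlb_fill(tlb *t, uint64_t vaddr, uint64_t ppn)
{
    uint64_t vpn = vaddr >> PAGE_OFFSET_BITS;
    uint64_t set = vpn % TLB_SETS;
    int victim = t->lru[set];
    t->way[set][victim].valid = true;
    t->way[set][victim].tag = vpn / TLB_SETS;
    t->way[set][victim].ppn = ppn;
    t->lru[set] = 1 - victim;
}
```

There is no block offset: the page offset bits of the virtual address are dropped before the lookup and reattached to the physical page number afterward.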

TLB and two-level lookup

TLB organization (2-way set associative)

exercise: TLB access pattern (setup)

4-entry, 2-way TLB, LRU replacement policy, initially empty
4096 byte pages
how many index bits?
TLB index of virtual address 0x12345?

exercise: TLB access pattern

4-entry, 2-way TLB, LRU replacement policy, initially empty
4096 byte pages

type	virtual	physical
read	0x440030	0x554030
write	0x440034	0x554034
read	0x7FFFE008	0x556008
read	0x7FFFE000	0x556000
read	0x7FFFDFF8	0x5F8FF8
read	0x664080	0x5F9080
read	0x440038	0x554038
write	0x7FFFDFF0	0x5F8FF0

which are TLB hits? which are TLB misses? final contents of TLB?

solution: TLB access pattern

type	virtual	physical	result	set 0	set 1
read	0x440030	0x554030	miss	0x440
write	0x440034	0x554034	hit	0x440
read	0x7FFFE008	0x556008	miss	0x440, 0x7FFFE
read	0x7FFFE000	0x556000	hit	0x440, 0x7FFFE
read	0x7FFFDFF8	0x5F8FF8	miss	0x440, 0x7FFFE	0x7FFFD
read	0x664080	0x5F9080	miss	0x664, 0x7FFFE	0x7FFFD
read	0x440038	0x554038	miss	0x664, 0x440	0x7FFFD
write	0x7FFFDFF0	0x5F8FF0	hit	0x664, 0x440	0x7FFFD

set idx	V	tag	physical page	write?	user?	…	LRU?
0	1	0x00220 (0x00440 >> 1)	0x554	1	1	…	no
0	1	0x00332 (0x00664 >> 1)	0x5F9	1	1	…	yes
1	1	0x3FFFE (0x7FFFD >> 1)	0x5F8	1	1	…	no
1	0					…	yes

Backup slides

cache accesses and C code (1)

int scaleFactor;

int scaleByFactor(int value) {
    return value * scaleFactor;
}

scaleByFactor:
    movl scaleFactor, %eax
    imull %edi, %eax
    ret

exercise: what data cache accesses does this function do?
- 4-byte read of scaleFactor
- 8-byte read of return address

possible scaleFactor use

for (int i = 0; i < size; ++i) {
    array[i] = scaleByFactor(array[i]);
}

misses and code (2)

scaleByFactor:
    movl scaleFactor, %eax
    imull %edi, %eax
    ret

suppose each time this is called in the loop:
- return address located at address 0x7ffffffe43b8
- scaleFactor located at address 0x6bc3a0
with direct-mapped 32KB cache w/64 B blocks, what is their:

	return address	scaleFactor
tag	0xfffffffc	0xd7
index	0x10e	0x10e
offset	0x38	0x20

conflict miss coincidences?

obviously I set that up to have the same index
- have to use exactly the right amount of stack space…
but one of the reasons we’ll want something better than direct-mapped cache

C and cache misses (warmup 3)

int array[8];
...
int even_sum = 0, odd_sum = 0;
even_sum += array[0];
odd_sum += array[1];
even_sum += array[2];
odd_sum += array[3];
even_sum += array[4];
odd_sum += array[5];
even_sum += array[6];
odd_sum += array[7];

Assume everything but array is kept in registers (and the compiler does not do anything funny), and array[0] at beginning of cache block.
How many data cache misses on a 2-set direct-mapped cache with 8B blocks?

exercise solution

arrays and cache misses (1)

int array[1024]; // 4KB array
int even_sum = 0, odd_sum = 0;
for (int i = 0; i < 1024; i += 2) {
    even_sum += array[i + 0];
    odd_sum +=  array[i + 1];
}

Assume everything but array is kept in registers (and the compiler does not do anything funny).
How many data cache misses on initially empty 2KB direct-mapped cache with 16B cache blocks?

arrays and cache misses (2)

int array[1024]; // 4KB array
int even_sum = 0, odd_sum = 0;
for (int i = 0; i < 1024; i += 2)
    even_sum += array[i + 0];
for (int i = 0; i < 1024; i += 2)
    odd_sum +=  array[i + 1];

Assume everything but array is kept in registers (and the compiler does not do anything funny).
How many data cache misses on initially empty 2KB direct-mapped cache with 16B cache blocks?

explanation

2-way, 2KB set associative cache, 16B blocks
4 offset bits, 6 index bits
so addresses that are multiples of $2^{10}$ bytes apart differ only in tag bits
example: array[0$\rightarrow{}$3], array[256$\rightarrow{}$259], array[512$\rightarrow{}$515], array[768$\rightarrow{}$771]
those all use the same set
but each set only holds 2 blocks
all misses

arrays and cache misses (2b)

int array[1024]; // 4KB array
int even_sum = 0, odd_sum = 0;
for (int i = 0; i < 1024; i += 2)
    even_sum += array[i + 0];
for (int i = 0; i < 1024; i += 2)
    odd_sum +=  array[i + 1];

Assume everything but array is kept in registers (and the compiler does not do anything funny).
How many data cache misses on initially empty 4KB direct-mapped cache with 16B cache blocks?

simulated misses: BST lookups

simulated misses: matrix multiplies

inclusive versus exclusive

example access pattern (1)

exercise (1)

initial cache: 64-byte blocks, 64 sets, 8 ways/set
If we leave the other parameters listed above unchanged, which will probably reduce the number of capacity misses in a typical program? (Multiple may be correct.)

A. quadrupling the block size (256-byte blocks, 64 sets, 8 ways/set)

B. quadrupling the number of sets

C. quadrupling the number of ways/set

exercise (2)

initial cache: 64-byte blocks, 8 ways/set, 64KB cache
If we leave the other parameters listed above unchanged, which will probably reduce the number of capacity misses in a typical program? (Multiple may be correct.)

A. quadrupling the block size (256-byte block, 8 ways/set, 64KB cache)

B. quadrupling the number of ways/set

C. quadrupling the cache size

exercise (3)

initial cache: 64-byte blocks, 8 ways/set, 64KB cache
If we leave the other parameters listed above unchanged, which will probably reduce the number of conflict misses in a typical program? (Multiple may be correct.)

A. quadrupling the block size (256-byte block, 8 ways/set, 64KB cache)

B. quadrupling the number of ways/set

C. quadrupling the cache size

prefetching

seems like we can’t really improve cold misses…
have to have a miss to bring value into the cache?
solution: don’t require miss: ‘prefetch’ the value before it’s accessed
remaining problem: how do we know what to fetch?

common access patterns

suppose recently accessed 16B cache blocks are at:
- 0x48010, 0x48020, 0x48030, 0x48040
guess what’s accessed next
common pattern with instruction fetches and array accesses

prefetching idea

look for sequential accesses
bring in guess at next-to-be-accessed value
if right: no cache miss (even if never accessed before)
if wrong: possibly evicted something else — could cause more misses
- fortunately, sequential access guesses almost always right
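A minimal sketch of the sequential guess, assuming the prefetcher remembers the addresses of the most recently accessed blocks, oldest first. The function name `predict_next_block` is hypothetical, and real prefetchers track more patterns than this.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* If the `count` most recent block addresses (oldest first) are consecutive
   blocks, stores the address of the next block in *guess and returns true.
   Returns false when there is no sequential pattern to extend. */
bool predict_next_block(const uint64_t *recent, size_t count,
                        uint64_t block_size, uint64_t *guess)
{
    if (count < 2)
        return false;
    for (size_t k = 1; k < count; ++k)
        if (recent[k] != recent[k - 1] + block_size)
            return false;
    *guess = recent[count - 1] + block_size;
    return true;
}
```

For recent 16B blocks at 0x48010, 0x48020, 0x48030, 0x48040, the guess is 0x48050.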

quiz exercise solution

not the quiz problem

C and cache misses (4)

typedef struct {
    int a_value, b_value;
    int other_values[6];
} item;
item items[5];
int a_sum = 0, b_sum = 0;
for (int i = 0; i < 5; ++i)
    a_sum += items[i].a_value;
for (int i = 0; i < 5; ++i)
    b_sum += items[i].b_value;

Assume everything but items is kept in registers (and the compiler does not do anything funny).

C and cache misses (4, rewrite)

int array[40];
int a_sum = 0, b_sum = 0;
for (int i = 0; i < 40; i += 8)
    a_sum += array[i];
for (int i = 1; i < 40; i += 8)
    b_sum += array[i];

Assume everything but array is kept in registers (and the compiler does not do anything funny) and array starts at beginning of cache block.
How many data cache misses on a 2-way set associative 128B cache with 16B cache blocks and LRU replacement?

C and cache misses (4, solution pt 1)

ints 4 byte $\rightarrow$ array[0 to 3] and array[16 to 19] in same cache set
- 64B = 16 ints stored per way
- 4 sets total
accessing 0, 8, 16, 24, 32, 1, 9, 17, 25, 33
0 (set 0), 8 (set 2), 16 (set 0), 24 (set 2), 32 (set 0)
1 (set 0), 9 (set 2), 17 (set 0), 25 (set 2), 33 (set 0)

C and cache misses (4, solution pt 2)

access	set 0 after (LRU first)	result
—	—, —
array[0]	—, array[0 to 3]	miss
array[16]	array[0 to 3], array[16 to 19]	miss
array[32]	array[16 to 19], array[32 to 35]	miss
array[1]	array[32 to 35], array[0 to 3]	miss
array[17]	array[0 to 3], array[16 to 19]	miss
array[33]	array[16 to 19], array[32 to 35]	miss

6 misses for set 0

C and cache misses (4, solution pt 3)

access	set 2 after (LRU first)	result
—	—, —
array[8]	—, array[8 to 11]	miss
array[24]	array[8 to 11], array[24 to 27]	miss
array[9]	array[24 to 27], array[8 to 11]	hit
array[25]	array[8 to 11], array[24 to 27]	hit

2 misses for set 2

C and cache misses (3)

typedef struct {
    int a_value, b_value;
    int other_values[10];
} item;
item items[5];
int a_sum = 0, b_sum = 0;
for (int i = 0; i < 5; ++i)
    a_sum += items[i].a_value;
for (int i = 0; i < 5; ++i)
    b_sum += items[i].b_value;

observation: 12 ints in struct: only first two used
equivalent to accessing array[0], array[12], array[24], etc.
… then accessing array[1], array[13], array[25], etc.

C and cache misses (3, rewritten?)

int array[60];
int a_sum = 0, b_sum = 0;
for (int i = 0; i < 60; i += 12)
    a_sum += array[i];
for (int i = 1; i < 60; i += 12)
    b_sum += array[i];

Assume everything but array is kept in registers (and the compiler does not do anything funny) and array at beginning of cache block.
How many data cache misses on a 128B two-way set associative cache with 16B cache blocks and LRU replacement?
observation 1: first loop has 5 misses — first accesses to blocks
observation 2: array[0] and array[1], array[12] and array[13], etc. in same cache block

C and cache misses (3, solution)

ints 4 byte $\rightarrow{}$ array[0 to 3] and array[16 to 19] in same cache set
- 64B = 16 ints stored per way
- 4 sets total
accessing array indices 0, 12, 24, 36, 48, 1, 13, 25, 37, 49
0 (set 0, array[0 to 3]), 12 (set 3), 24 (set 2), 36 (set 1), 48 (set 0)
- each set used at most twice
- no replacement needed
so accesses to 1, 13, 25, 37, 49 are all hits:
- set 0 contains block with array[0 to 3]
- set 3 contains block with array[12 to 15]
- etc.

C and cache misses (3)

typedef struct {
    int a_value, b_value;
    int boring_values[126];
} item;
item items[8]; // 4 KB array
int a_sum = 0, b_sum = 0;
for (int i = 0; i < 8; ++i)
    a_sum += items[i].a_value;
for (int i = 0; i < 8; ++i)
    b_sum += items[i].b_value;

Assume everything but items is kept in registers (and the compiler does not do anything funny).
How many data cache misses on a 2KB direct-mapped cache with 16B cache blocks?

C and cache misses (3, rewritten?)

int array[1024]; // 4 KB array
int a_sum = 0, b_sum = 0;
for (int i = 0; i < 1024; i += 128)
    a_sum += array[i];
for (int i = 1; i < 1024; i += 128)
    b_sum += array[i];

C and cache misses (4)

typedef struct {
    int a_value, b_value;
    int boring_values[126];
} item;
item items[8]; // 4 KB array
int a_sum = 0, b_sum = 0;
for (int i = 0; i < 8; ++i)
    a_sum += items[i].a_value;
for (int i = 0; i < 8; ++i)
    b_sum += items[i].b_value;

Assume everything but items is kept in registers (and the compiler does not do anything funny).
How many data cache misses on a 4-way set associative 2KB cache with 16B cache blocks?

thinking about cache storage (1)

2KB direct-mapped cache with 16B blocks —
set 0: address 0 to 15, (0 to 15) + 2KB, (0 to 15) + 4KB, …
- block at 0: array[0] through array[3]
- block at 0+2KB: array[512] through array[515]
set 1: address 16 to 31, (16 to 31) + 2KB, (16 to 31) + 4KB, …
- block at 16: array[4] through array[7]
- block at 16+2KB: array[516] through array[519]
…
set 127: address 2032 to 2047, (2032 to 2047) + 2KB, …
- block at 2032: array[508] through array[511]
- block at 2032+2KB: array[1020] through array[1023]

thinking about cache storage (2)

2KB 2-way set associative cache with 16B blocks: block addresses —
set 0: address 0, 0 + 1KB, 0 + 2KB, …
- block at 0: array[0] through array[3]
- block at 0+1KB: array[256] through array[259]
- block at 0+2KB: array[512] through array[515]
- …
set 1: address 16, 16 + 1KB, 16 + 2KB, …
- address 16: array[4] through array[7]
…
set 63: address 1008, 1008 + 1KB, 1008 + 2KB, …
- address 1008: array[252] through array[255]

arrays and cache misses (3)

int sum; int array[1024]; // 4KB array
for (int i = 8; i < 1016; i += 1) {
    int local_sum = 0;
    for (int j = i - 8; j < i + 8; j += 1) {
        local_sum += array[i] * (j - i);
    }
    sum += (local_sum - array[i]);
}

Assume everything but array is kept in registers (and the compiler does not do anything funny).
How many data cache misses on initially empty 2KB direct-mapped cache with 16B cache blocks?

Tag-Index-Offset exercise

$m$	memory address bits (Y86-64: 64)
$E$	number of blocks per set (‘‘ways’’)
$S=2^s$	number of sets
$s$	(set) index bits
$B=2^b$	block size
$b$	(block) offset bits
$t = m - (s+b)$	tag bits
$C = B \times{} S \times{} E$	cache size (excluding metadata)

My desktop:
- L1 Data Cache: 32 KB, 8 blocks/set, 64 byte blocks
- L2 Cache: 256 KB, 4 blocks/set, 64 byte blocks
- L3 Cache: 8 MB, 16 blocks/set, 64 byte blocks
- Divide the address 0x34567 into tag, index, offset for each cache.

T-I-O exercise: L1

quantity	value for L1
block size (given)	$B=64\text{Byte}$
$B=2^b$ ($b$: block offset bits)
block offset bits	$b=6$
blocks/set (given)	$E=8$
cache size (given)	$C = 32\text{KB} = E \times{} B \times{} S$
$S = \frac{C}{B\times E}$ ($S$: number of sets)
number of sets	$S = \frac{32\text{KB}}{64\text{Byte}\times 8} = 64$
$S = 2^s$ ($s$: set index bits)
set index bits	$s = \log_2(64) = 6$

T-I-O results

	L1	L2	L3
sets	64	1024	8192
block offset bits	6	6	6
set index bits	6	10	13
tag bits	(the rest)

T-I-O: splitting

	L1	L2	L3
block offset bits	6	6	6
set index bits	6	10	13
tag bits	(the rest)

0x34567:

3 4 5 6 7

0011 0100 0101 0110 0111
bits 0-5 (all offsets): 100111 = 0x27
L1:
- bits 6-11 (L1 set): 01 0101 = 0x15
- bits 12- (L1 tag): 0x34
L2:
- bits 6-15 (set for L2): 01 0001 0101 = 0x115
- bits 16-: 0x3
L3:
- bits 6-18 (set for L3): 0 1101 0001 0101 = 0xD15
- bits 19-: 0x0
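
The split can also be done with shifts and masks. This is a sketch, with AddressParts and split_address as hypothetical names; it takes the index and offset bit counts from the table above.

```c
#include <assert.h>
#include <stdint.h>

typedef struct {
    uint64_t tag;
    uint64_t index;
    uint64_t offset;
} AddressParts;

/* Split addr into tag / set index / block offset.
   The offset is the lowest offset_bits bits, the index is the next
   index_bits bits, and the tag is everything above that. */
AddressParts split_address(uint64_t addr, unsigned index_bits,
                           unsigned offset_bits) {
    assert(index_bits + offset_bits < 64);
    AddressParts p;
    p.offset = addr & ((UINT64_C(1) << offset_bits) - 1);
    p.index = (addr >> offset_bits) & ((UINT64_C(1) << index_bits) - 1);
    p.tag = addr >> (offset_bits + index_bits);
    return p;
}
```

For instance, split_address(0x34567, 6, 6) gives the L1 values listed above (offset 0x27, set 0x15, tag 0x34).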

cache operation (associative)

backup slides — cache performance

cache miss types

common to categorize misses:
- roughly ‘‘cause’’ of miss assuming cache block size fixed

compulsory (or cold) — first time accessing something
- adding more sets or blocks/set wouldn’t change
conflict — sets aren’t big/flexible enough
- a fully associative (1-set) cache of the same size would have done better
capacity — cache was not big enough
coherence — from sync’ing cache with other caches
- only issue with multiple cores

making any cache look bad

1. access enough blocks to fill the cache
2. access an additional block, replacing something
3. access last block replaced
4. access last block replaced
5. access last block replaced
…
but — typical real programs have locality
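
The pattern can be tried on a toy cache. The sketch below assumes a single-set (fully associative) cache with LRU replacement and only 4 blocks. Both choices are assumptions for illustration, since the slide's argument does not depend on a particular policy. LruCache, lru_access, and adversarial_misses are hypothetical names.

```c
#include <assert.h>
#include <stdint.h>

enum { NUM_WAYS = 4 };  /* hypothetical tiny cache: one set, 4 blocks */

typedef struct {
    uint64_t block[NUM_WAYS];  /* block numbers, most recently used first */
    int used;                  /* number of valid entries */
    uint64_t last_evicted;     /* most recently replaced block */
} LruCache;

/* Returns 1 on a miss, 0 on a hit; moves the block to the front. */
int lru_access(LruCache *c, uint64_t blk) {
    assert(c);
    int pos = -1;
    for (int k = 0; k < c->used; k += 1) {
        if (c->block[k] == blk) pos = k;
    }
    int miss = (pos < 0);
    if (miss) {
        if (c->used == NUM_WAYS) {
            /* Full: replace the least recently used block (last slot). */
            c->last_evicted = c->block[NUM_WAYS - 1];
            pos = NUM_WAYS - 1;
        } else {
            pos = c->used;
            c->used += 1;
        }
    }
    for (int k = pos; k > 0; k -= 1) {
        c->block[k] = c->block[k - 1];
    }
    c->block[0] = blk;
    return miss;
}

/* Runs the access pattern from the list and returns the miss count. */
int adversarial_misses(int extra) {
    LruCache c = {0};
    int misses = 0;
    for (uint64_t b = 0; b < NUM_WAYS; b += 1) {
        misses += lru_access(&c, b);               /* fill the cache */
    }
    misses += lru_access(&c, NUM_WAYS);            /* one additional block */
    for (int k = 0; k < extra; k += 1) {
        misses += lru_access(&c, c.last_evicted);  /* last block replaced */
    }
    return misses;
}
```

Under these assumptions every access in the pattern misses, however long it runs.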

cache optimizations

(assuming typical locality + keeping cache size constant if possible…)

	miss rate	hit time	miss penalty
increase cache size	good	bad	—
increase associativity	good	bad	bad?
increase block size	depends	bad	bad
add secondary cache	—	—	good
write-allocate	good	—	?
writeback	—	—	?
LRU replacement	good	?	bad?
prefetching	good	—	—
prefetching = guess what program will use, access in advance

\[ \text{average time} = \text{hit time} + \text{miss rate} \times \text{miss penalty} \]

cache optimizations by miss type

(assuming other listed parameters remain constant)

	capacity	conflict	compulsory
increase cache size	good	good	—
increase associativity	—	good	—
increase block size	bad?	bad?	good

LRU replacement	—	good	—
prefetching	—	—	good

average memory access time

$\text{AMAT} = \text{hit time} + \text{miss penalty} \times{} \text{miss rate}$
- or $\text{AMAT} = \text{hit time} \times \text{hit rate} + \text{miss time} \times \text{miss rate}$ (where miss time = hit time + miss penalty)
effective speed of memory
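
Both forms of the formula, sketched as C functions. The names amat and amat_weighted are hypothetical. The second form assumes that "miss time" means hit time plus miss penalty, which is what makes the two forms agree.

```c
#include <assert.h>

/* AMAT = hit time + miss rate * miss penalty (times in cycles). */
double amat(double hit_time, double miss_rate, double miss_penalty) {
    assert(miss_rate >= 0.0 && miss_rate <= 1.0);
    return hit_time + miss_rate * miss_penalty;
}

/* Weighted form: hit time * hit rate + miss time * miss rate.
   Assumption: miss time = hit time + miss penalty. */
double amat_weighted(double hit_time, double miss_rate, double miss_penalty) {
    assert(miss_rate >= 0.0 && miss_rate <= 1.0);
    double hit_rate = 1.0 - miss_rate;
    double miss_time = hit_time + miss_penalty;
    return hit_time * hit_rate + miss_time * miss_rate;
}
```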

AMAT exercise (1)

90% cache hit rate
hit time is 2 cycles
30 cycle miss penalty
what is the average memory access time?
5 cycles
suppose we could increase the hit rate by making the cache larger, but doing so would increase the hit time to 3 cycles
how much do we have to increase the hit rate for this to not increase AMAT?
to a miss rate of at most 2/30 $\rightarrow$ approx 93% hit rate
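
Worked out step by step, using the AMAT formula with the numbers given in this exercise:

\[ \text{AMAT} = 2 + 0.10 \times 30 = 5 \text{ cycles} \]

\[ 3 + \text{miss rate} \times 30 \le 5 \;\Rightarrow\; \text{miss rate} \le \frac{2}{30} \approx 6.7\% \;\Rightarrow\; \text{hit rate} \ge 93.3\% \]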

exercise: AMAT and multi-level caches

suppose we have L1 cache with
- 3 cycle hit time, 90% hit rate
and an L2 cache with
- 10 cycle hit time, 80% hit rate (for accesses that make it this far)
- (assume all accesses come via this L1)
and main memory has a 100 cycle access time
assume when there’s a cache miss, the next level access starts after the hit time
- e.g. an access that misses in L1 and hits in L2 will take 10+3 cycles
what is the average memory access time for the L1 cache?
- $3 + 0.1\cdot(10 + 0.2\cdot 100) = 6$ cycles
- L1 miss penalty is $10 + 0.2\cdot100 = 30$ cycles
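
The two-level calculation can be sketched as nested AMAT computations: the L1 miss penalty is the AMAT of the L2. The names l1_miss_penalty and two_level_amat are hypothetical. The sketch follows the exercise's assumption that each level's access starts only after the previous level's hit time.

```c
#include <assert.h>

/* L1 miss penalty = AMAT of the L2 = L2 hit time + L2 miss rate * memory time. */
double l1_miss_penalty(double l2_hit_time, double l2_hit_rate,
                       double memory_time) {
    assert(l2_hit_rate >= 0.0 && l2_hit_rate <= 1.0);
    return l2_hit_time + (1.0 - l2_hit_rate) * memory_time;
}

/* AMAT seen by the processor at the L1. */
double two_level_amat(double l1_hit_time, double l1_hit_rate,
                      double l2_hit_time, double l2_hit_rate,
                      double memory_time) {
    assert(l1_hit_rate >= 0.0 && l1_hit_rate <= 1.0);
    double penalty = l1_miss_penalty(l2_hit_time, l2_hit_rate, memory_time);
    return l1_hit_time + (1.0 - l1_hit_rate) * penalty;
}
```

With the exercise's numbers, two_level_amat(3, 0.9, 10, 0.8, 100) evaluates the same expression as the first bullet above.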

example access pattern (1)

cache organization and miss rate

depends on program; one example:
SPEC CPU2000 benchmarks, 64B block size, LRU replacement

data cache miss rates:

Cache size	direct-mapped	2-way	8-way	fully assoc.
1KB	8.63%	6.97%	5.63%	5.34%
2KB	5.71%	4.23%	3.30%	3.05%
4KB	3.70%	2.60%	2.03%	1.90%
16KB	1.59%	0.86%	0.56%	0.50%
64KB	0.66%	0.37%	0.10%	0.001%
128KB	0.27%	0.001%	0.0006%	0.0006%

exercise (1)

initial cache: 64-byte blocks, 64 sets, 8 ways/set
If we leave the other parameters listed above unchanged, which will probably reduce the number of capacity misses in a typical program? (Multiple may be correct.)

A. quadrupling the block size (256-byte blocks, 64 sets, 8 ways/set)

B. quadrupling the number of sets

C. quadrupling the number of ways/set

exercise (2)

initial cache: 64-byte blocks, 8 ways/set, 64KB cache
If we leave the other parameters listed above unchanged, which will probably reduce the number of capacity misses in a typical program? (Multiple may be correct.)

A. quadrupling the block size (256-byte block, 8 ways/set, 64KB cache)

B. quadrupling the number of ways/set

C. quadrupling the cache size

exercise (3)

initial cache: 64-byte blocks, 8 ways/set, 64KB cache
If we leave the other parameters listed above unchanged, which will probably reduce the number of conflict misses in a typical program? (Multiple may be correct.)

A. quadrupling the block size (256-byte block, 8 ways/set, 64KB cache)

B. quadrupling the number of ways/set

C. quadrupling the cache size

prefetching

seems like we can’t really improve cold misses…
have to have a miss to bring value into the cache?
solution: don’t require miss: ‘prefetch’ the value before it’s accessed
remaining problem: how do we know what to fetch?

common access patterns

suppose recently accessed 16B cache blocks are at:
- 0x48010, 0x48020, 0x48030, 0x48040
guess what’s accessed next
common pattern with instruction fetches and array accesses

prefetching idea

look for sequential accesses
bring in guess at next-to-be-accessed value
if right: no cache miss (even if never accessed before)
if wrong: possibly evicted something else — could cause more misses
- fortunately, sequential access guesses almost always right
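
A minimal sketch of the "look for sequential accesses" idea: if the recent block addresses advance by a constant stride, guess that the next one continues it. Real prefetchers are more elaborate. predict_next_block and example_guess are hypothetical names, and the example reuses the block addresses from the common access patterns slide.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* If the recent block addresses advance by a constant nonzero stride,
   store the predicted next address in *next and return 1; else return 0. */
int predict_next_block(const uint64_t *recent, size_t count, uint64_t *next) {
    assert(recent != NULL && next != NULL);
    if (count < 2) return 0;
    int64_t stride = (int64_t)(recent[1] - recent[0]);
    if (stride == 0) return 0;
    for (size_t k = 2; k < count; k += 1) {
        if ((int64_t)(recent[k] - recent[k - 1]) != stride) return 0;
    }
    *next = recent[count - 1] + (uint64_t)stride;
    return 1;
}

/* Guess for the 16B block addresses listed on the earlier slide;
   returns 0 if no constant stride is detected. */
uint64_t example_guess(void) {
    const uint64_t recent[] = { 0x48010, 0x48020, 0x48030, 0x48040 };
    uint64_t next = 0;
    if (!predict_next_block(recent, 4, &next)) return 0;
    return next;
}
```

A hardware prefetcher would then start fetching the guessed block before the program asks for it; a wrong guess costs bandwidth and may evict something useful.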

TLB and the MMU (1)

TLB and the MMU (2)

changing page tables

what happens to TLB when page table base pointer is changed?
- e.g. context switch
most entries in TLB refer to things from wrong process
- oops — read from the wrong process’s stack?

option 1: invalidate all TLB entries
- side effect on ‘‘change page table base register’’ instruction
option 2: TLB entries contain process ID
- set by OS (special register)
- checked by TLB in addition to TLB tag, valid bit
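
Both options can be sketched with a toy TLB model. This is an illustration only: the struct layout, the 8-entry fully associative organization, and the names Tlb, TlbEntry, tlb_invalidate_all, tlb_lookup, and hits_after_switch are all assumptions, not a description of any real processor.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

enum { TLB_ENTRIES = 8 };  /* hypothetical tiny fully associative TLB */

typedef struct {
    bool valid;
    uint16_t process_id;  /* option 2: entry is tagged with its process */
    uint64_t vpn;         /* virtual page number (the TLB tag here) */
    uint64_t ppn;         /* physical page number */
} TlbEntry;

typedef struct {
    TlbEntry entries[TLB_ENTRIES];
    uint16_t current_process_id;  /* the special register set by the OS */
} Tlb;

/* Option 1: drop every entry when the page table base changes. */
void tlb_invalidate_all(Tlb *tlb) {
    for (int k = 0; k < TLB_ENTRIES; k += 1) {
        tlb->entries[k].valid = false;
    }
}

/* Option 2: hit only if valid bit, tag, and process ID all match. */
bool tlb_lookup(const Tlb *tlb, uint64_t vpn, uint64_t *ppn_out) {
    assert(tlb && ppn_out);
    for (int k = 0; k < TLB_ENTRIES; k += 1) {
        const TlbEntry *e = &tlb->entries[k];
        if (e->valid && e->vpn == vpn &&
            e->process_id == tlb->current_process_id) {
            *ppn_out = e->ppn;
            return true;
        }
    }
    return false;
}

/* Caches one translation for process 1, switches to new_pid, and
   reports whether a lookup of the same page still hits. */
bool hits_after_switch(uint16_t new_pid, bool flush_on_switch) {
    Tlb tlb = {0};
    uint64_t ppn = 0;
    tlb.entries[0] = (TlbEntry){ true, 1, 0x42, 0x99 };
    tlb.current_process_id = new_pid;
    if (flush_on_switch) tlb_invalidate_all(&tlb);
    return tlb_lookup(&tlb, 0x42, &ppn);
}
```

With process IDs, switching back to process 1 can still hit on its old entries, while a switch to another process cannot read them. With option 1, nothing survives the switch.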

editing page tables

what happens to TLB when OS changes a page table entry?
most common choice: has to be handled in software
invalid to valid — nothing needed
- TLB doesn’t contain invalid entries
- MMU will check memory again
valid to invalid — OS needs to tell processor to invalidate it
- special instruction (x86: invlpg)
valid to other valid — OS needs to tell processor to invalidate it
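
The "valid to other valid" case can be illustrated with a software model. This is a sketch under stated assumptions: a 4-entry direct-mapped TLB and a flat 16-entry page table, with Machine, translate, invalidate_page, and frame_after_remap as hypothetical names. invalidate_page only models the role that invlpg plays on x86; it is not the instruction itself.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

enum { TLB_ENTRIES = 4, NUM_PAGES = 16 };  /* hypothetical sizes */

typedef struct { bool valid; uint64_t ppn; } PageTableEntry;
typedef struct { bool valid; uint64_t vpn; uint64_t ppn; } TlbEntry;

typedef struct {
    PageTableEntry page_table[NUM_PAGES];  /* stands in for the in-memory table */
    TlbEntry tlb[TLB_ENTRIES];
} Machine;

/* Translate vpn: use the TLB if possible, else read the page table and
   cache the result. Returns false on an invalid page table entry. */
bool translate(Machine *m, uint64_t vpn, uint64_t *ppn_out) {
    assert(m && ppn_out && vpn < NUM_PAGES);
    TlbEntry *slot = &m->tlb[vpn % TLB_ENTRIES];
    if (slot->valid && slot->vpn == vpn) {
        *ppn_out = slot->ppn;          /* TLB hit: page table not consulted */
        return true;
    }
    if (!m->page_table[vpn].valid) return false;  /* invalid: never cached */
    slot->valid = true;
    slot->vpn = vpn;
    slot->ppn = m->page_table[vpn].ppn;
    *ppn_out = slot->ppn;
    return true;
}

/* Software model of a single-page TLB invalidation. */
void invalidate_page(Machine *m, uint64_t vpn) {
    TlbEntry *slot = &m->tlb[vpn % TLB_ENTRIES];
    if (slot->valid && slot->vpn == vpn) slot->valid = false;
}

/* Remaps page 3 from frame 7 to frame 9 and returns the frame that a
   later translation sees. */
uint64_t frame_after_remap(bool os_invalidates) {
    Machine m = {0};
    uint64_t ppn = 0;
    m.page_table[3] = (PageTableEntry){ true, 7 };
    translate(&m, 3, &ppn);        /* loads the TLB with 3 -> 7 */
    m.page_table[3].ppn = 9;       /* valid to other valid */
    if (os_invalidates) invalidate_page(&m, 3);
    translate(&m, 3, &ppn);
    return ppn;
}
```

Without the invalidation the model keeps returning the stale frame, which is why the OS has to tell the processor about the change.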

address splitting for TLBs (1)

my desktop:
4KB ($2^{12}$ byte) pages; 48-bit virtual address
64-entry, 4-way L1 data TLB
TLB index bits? $64/4 = 16$ sets — 4 bits
TLB tag bits? $48-12=36$ bit virtual page number — $36-4=32$ bit TLB tag

address splitting for TLBs (2)

my desktop:
4KB ($2^{12}$ byte) pages; 48-bit virtual address
1536-entry ($3\cdot 2^9$), 12-way L2 TLB
TLB index bits? $1536/12 = 128$ sets — 7 bits
TLB tag bits? $48-12=36$ bit virtual page number — $36-7=29$ bit TLB tag
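
The TLB splitting arithmetic for both TLBs, sketched as one function. TlbSplit and tlb_split are hypothetical names; the sketch assumes the number of sets is a power of two.

```c
#include <assert.h>

typedef struct {
    unsigned sets;        /* entries / ways */
    unsigned index_bits;  /* log2(sets) */
    unsigned vpn_bits;    /* virtual address bits - page offset bits */
    unsigned tag_bits;    /* vpn_bits - index_bits */
} TlbSplit;

TlbSplit tlb_split(unsigned entries, unsigned ways,
                   unsigned va_bits, unsigned page_offset_bits) {
    assert(ways > 0 && entries % ways == 0);
    TlbSplit s;
    s.sets = entries / ways;
    assert(s.sets != 0 && (s.sets & (s.sets - 1)) == 0);
    s.index_bits = 0;
    for (unsigned n = s.sets; n > 1; n >>= 1) {
        s.index_bits += 1;
    }
    s.vpn_bits = va_bits - page_offset_bits;
    s.tag_bits = s.vpn_bits - s.index_bits;
    return s;
}
```

tlb_split(64, 4, 48, 12) and tlb_split(1536, 12, 48, 12) correspond to the L1 data TLB and L2 TLB described above.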

\(S=2^s\)	number of sets
\(s\)	(set) index bits
\(B=2^b\)	block size
\(b\)	(block) offset bits
\(m\)	memory address bits
\(t = m - (s+b)\)	tag bits
\(C = B \times{} S\)	cache size (if direct-mapped)

	return address	scaleFactor
tag	0xfffffffc	0xd7
index	0x10e	0x10e
offset	0x38	0x20
