- what material is covered
  - pipelining and caching are most of the exam (80+%)
  - a small number of questions on non-cache performance
  - calculating # of cache misses, like the post-quiz
- different types of caches
  - associativity: # of ways --- "width" in # of blocks
    - special cases:
      - 1-way --> direct-mapped (has (# blocks) sets)
      - (# blocks)-way --> fully associative (has 1 set)
- write policies in caches
  - write-back vs. write-through: when we write a value, do we send it to memory right away?
    - write-back: no --- keep track of values we still need to send (dirty bit)
      - on write: store the value in the cache
      - on a read or write that replaces the written value: send the stored value to the next level
    - write-through: yes
  - write-allocate vs. write-no-allocate: when we write a value that isn't cached, do we add it to the cache?
    - write-allocate: yes
    - write-no-allocate: no
  - can have ALL combinations
- TIO (tag/index/offset)
  - can just memorize formulas
  - can reason from geometry --- rows = # of sets, columns = # of ways, parts of columns = block size
    --> rows * columns * parts = cache size
  - bits in address: # index bits = log_2(# sets); # offset bits = log_2(# bytes in block)
  - strategy: write down what you know (incl. formulas), do algebra, try to solve for everything you might need
- replacement policies
  - FIFO -- first in, first out -- "a queue"
    - e.g. fill A B C D, then: replace A, replace B, replace C, replace D, replace A, replace B, ...
    - don't care about how often A, B, C, D are accessed
    - just a counter: which block do we replace next?
  - LRU -- least recently used
    - do care about non-replacement accesses to blocks
    - need to update the ordering of when things are used, e.g.
      if A is being read a lot, it might never be replaced
    - track the ordering of when blocks were last accessed (2 blocks/set -- 1 bit is enough)
- locality and loops
  - temporal locality: accessing things multiple times
      for (int i = 0; i < N; ++i)
        for (int j = 0; j < N; ++j)
          A[i*N + j] = 1;
    each element of A is accessed exactly once --- no temporal locality
      for (int i = 0; i < N; ++i)
        for (int j = 0; j < N; ++j)
          A[i] = 1;
    each element of A is accessed repeatedly, N times in a row --- very good temporal locality
      for (int i = 0; i < N; ++i)
        for (int j = 0; j < N; ++j)
          A[j] = 1;
    each element of A is accessed again only after accessing all the other elements of A --- very poor temporal locality
  - spatial locality: accessing things that are near other things
      for (int i = 0; i < N; ++i)
        for (int j = 0; j < N; ++j)
          A[i*N + j] = 1;
    each element of A is 1 element from the previous one --- very good spatial locality
      for (int i = 0; i < N; ++i)
        for (int j = 0; j < N; ++j)
          A[j*N + i] = 1;
    each element of A is N elements from the previous one --- very poor spatial locality
- forwarding paths between instructions
      addq %rax, %rax    F D Ev M  W
      addq %rax, %rax      F Dv*E M  W
  - the path must run from after the value is computed to before the value is used
  - and the path can't travel backwards in time:
      popq %rax          F D E  M* W
     [addq %rax, %rax      F D *E  M  W    <--- NOT POSSIBLE]
      addq %rax, %rax      F D  D *E  M  W    (stall one cycle in decode instead)
- pipeline hazards in general
  - dependencies --> one instruction uses a value from another
  - a dependency is a HAZARD if the "natural" pipeline can't handle it
    (no forwarding, no stalling, no branch prediction, ...)
  - two kinds of hazards:
    - control hazards --- can't do it, because we can't compute the address to fetch
    - data hazards --- can't do it, because a value (a register value) isn't available in time
      (we would read an old value if we did nothing)
  - no-hazard examples:
      irmovq $10, %rax
      irmovq $12, %rax

      addq %rax, %rax
      nop
      nop
      nop
      nop
      addq %rax, %rax
- pros/cons of optimizations (function inlining, loop unrolling, cache blocking)
  - most of these optimizations are orthogonal --- can and will do all of them
  - many optimizations INCREASE CODE SIZE (might increase instruction cache misses):
    - cache blocking
    - loop unrolling
    - function inlining
  - some don't always help --- sometimes you can tell, otherwise test
    - loop unrolling --- sometimes doesn't help with loops with very few iterations
      (unless you can get rid of all the loop index tests)
    - cache blocking --- only helps if you have enough data that the cache matters
- branch prediction
  - in general: the processor guesses what instruction will run next
    - modern processors --- do something really complicated to make good guesses
    - PIPE: guess that jumps are taken (fetch from the jump target), and stall for rets
  - fetch the guess and start running it
  - figure out the actual result later
    - PIPE: figure out the result during the jump's execute stage
    - if right: do nothing
    - if wrong: cancel ("squash") all instructions that were fetched because of the wrong guess
      - PIPE: replace the decode and execute stages with pipeline bubbles
        (instead of allowing the fetch and decode stages, which contain the wrong instructions, to advance)
      - then start fetching the right instructions
        - PIPE: fetch the right instruction while the jump is in the memory stage
- cache blocking
  - identify good/bad locality in a loop that uses cache blocking
  - what kinds of scenarios does cache blocking help?
    - enough data that the cache matters
    - can find locality that you wouldn't get with just loop ordering