- coverage versus lectures
  ~ approx. proportional to how much time was spent in lecture since the last exam
  ~ may incidentally cover stuff from beforehand
- clarifications on the questions
  ~ put an asterisk in the box
  ~ a good clarification: an assumption you think you needed to make
- precise exceptions (S2017 Q9)
  ~ not covered yet; wait for Exam 3
- HCL code on the exam
  ~ no writing it, but we might describe a circuit with HCL code, etc.
- types of misses
  - compulsory
    ~~ miss even if cache size = infinity
    ~~ first time we access a block, ever
  - capacity
    ~~ miss even if fully associative (# ways = total # blocks)
    ~~ not enough room; no extra space, even in other sets
  - conflict
    ~~ anything else
    ~~ miss, but there was room in other sets
- counting cache misses
  (1) figure out how to find the sets used
      (a) figure out the index bits and write down the indexes
      (b) figure out the distance between things that map to the same set
          (= # bytes per way = 2^(# index bits + # offset bits) bytes)
          e.g. 1MB direct-mapped cache, address 0 and 0 + 1MB --> same set
               b block offset bits, s set index bits
               2^b * 2^s = 1MB --> b + s = 20
               x + 2^20 --> doesn't change the bottom 20 bits
          e.g. 2-way 2MB set-associative cache, address 0 and 0 + 1MB --> same set
      either way, identify accesses to the SAME SET
      (a small C sketch of this step follows this list)
  (2) go through the accesses:
      (a) if something was never accessed before, then it's a miss
      (b) if something was accessed before, check what else FROM THE SAME SET was
          accessed since then and consider the replacement policy
          - if LRU (Least Recently Used):
            too many things used more recently --> miss, otherwise hit
            ("too many" --> more than blocks/set)
          - if FIFO (First In, First Out):
            too many things brought in (= missed) more recently --> miss, otherwise hit
            ("too many" --> more than blocks/set)
          - if random: ???
- cache misses, replacement policies, etc.: see F2017 Q5
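(added sketch, not from lecture) step (1) as a small C program: compute each address's
set index and tag so that accesses to the SAME SET can be grouped before applying step (2).
The cache parameters are assumptions for illustration (1MB direct-mapped with 64-byte
blocks --> 6 offset bits, 14 index bits); adjust them to the cache in the question.

    #include <stdio.h>
    #include <inttypes.h>

    /* assumed parameters: 1MB direct-mapped cache, 64-byte blocks
     * 2^6 = 64 bytes/block --> b = 6 offset bits
     * 2^(b+s) = 1MB        --> s = 14 index bits                  */
    #define OFFSET_BITS 6
    #define INDEX_BITS  14

    static uint64_t set_index(uint64_t addr) {
        /* drop the block offset, then keep only the index bits */
        return (addr >> OFFSET_BITS) & ((1ULL << INDEX_BITS) - 1);
    }

    static uint64_t tag(uint64_t addr) {
        /* everything above the index bits */
        return addr >> (OFFSET_BITS + INDEX_BITS);
    }

    int main(void) {
        /* 0 and 0 + 1MB differ only above bit 19 --> same set, different tag */
        uint64_t addrs[] = { 0x0, 0x100000, 0x40, 0x100040 };
        for (int i = 0; i < 4; i++)
            printf("addr 0x%06" PRIx64 ": set %5" PRIu64 ", tag %" PRIu64 "\n",
                   addrs[i], set_index(addrs[i]), tag(addrs[i]));
        return 0;
    }

Accesses whose set indexes match compete for the same blocks; only those need to be
checked against the replacement policy in step (2).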
- write misses
  - write miss: the address we wrote to was not in the cache
  - write policies: when writing:
    [write-allocate]       if it's not in the cache, should I bring it in?
    [write-back/through]   if it's in my cache (now), should I just update my cache,
                           or should I also update memory?
- write-back and dirty bits
  - dirty bit --> value in cache != value in memory; memory needs to be updated
    - only needed for write-back
    - used to send updates to memory when the block is evicted
- dependencies versus hazards
  - dependency --- DOES NOT DEPEND ON THE CPU
    - X depends on Y (or "has a dependency on Y"):
      we need Y's result to compute X
  - hazard --- FOR A PARTICULAR CPU DESIGN
    - the natural implementation (no forwarding, stalling, branch prediction, etc.)
      would do the wrong thing or not know what to do
- forwarding
  - instead of writing a value to the register file and reading it back
    (which requires waiting for the writeback stage), send the value directly

        mrmovq (%rax), %rax
        addq %rax, %rax

    (1) when is the value computed?
        near the end of mrmovq's memory stage
        without forwarding: at the end of mrmovq's writeback
    (2) when is the value used?
        near the beginning of addq's execute stage
        without forwarding: near the beginning of addq's decode

    with forwarding (send the value WITHOUT going through the register file):
        mrmovq   F  D  E  M1 W
        addq        F  D  D  2E M  W
        (1 cycle of stalling; 1 = where the value is computed, 2 = where it is used)
    without forwarding (send the value through the register file):
        mrmovq   F  D  E  M  W1
        addq        F  D  D  D  2D E  M  W
        --> make sure we stall enough to get one after the other

    (3) after we stall enough, do we have to do something besides read the register file?
        this is "forwarding": in addq, instead of taking %rax from the register file,
        take it from mrmovq's pipeline register input

        d_valA = [
            ...
            ... : m_valM   /* == mem_output */
            ...
        ]
        d_valB (same thing)

    alternate example:
        addq %rax, %rax   F  D  E1 M  W
        nop                  F  D  E  M  W
        nop                     F  D  E  M  W
        addq %rax, %rbx            F  D  2E M  W

        d_valA = [
            ...
            ... : W_valE   /* == reg_inputE */
            ...
        ]
- bubbles and stalls
  - stalling AN INSTRUCTION: stopping the instruction from advancing to another pipeline stage
  - stalling A REGISTER: keeping the register from being updated

                          F  D  E  M  W
        [this cycle:]     e  d  c  b  a
        [normal advance:] f  e  d  c  b
        [stall d:]        e  d  *  c  b
                                ^--- fill with a NOP

    to implement this stall of instruction d:
        stall the PC register (fetch e again)   [caveats: mispredictions]
        stall the fetch-to-decode registers (use d's values again)
        bubble the decode-to-execute registers
        ^^^^^^
            \------ have the registers output the values for a nop
- aliasing
  - two names for the same value

        int A[100];
        int *x = &A[10];
        int *y = &A[20];

    x[10] and y[0] are the same value
  - and loop unrolling
    - not very related
    - loop unrolling: the compiler needs to know the loop index and bound

        for (int i = 0; i < n; ++i) { BODY }
      becomes
        for (int i = 0; i < n; i += 4) {
            BODY   /* using i     */
            BODY   /* using i + 1 */
            BODY   /* using i + 2 */
            BODY   /* using i + 3 */
        }

      loop index: i
      loop bound: n
    - maybe aliasing could make it hard to tell how i is changed
      (e.g. does the loop body change i?)
      ... but then i would need to be something which has a pointer to it
  - and cache blocking

        for (int i = 0; i < N; ++i)
            for (int j = 0; j < N; ++j)
                A[i * 100 + j] += f(B[j * 100 + i]);

    A and B are probably pointers; the compiler can't do cache blocking without
    worrying that they point to the same array
    cache blocking: change the order in which the (i, j) iterations happen
    (process the arrays in blocks; see the sketch at the end of these notes)
- fully associative versus direct-mapped
  - fully associative: associativity = # blocks
    - most flexible possible
    - really expensive to implement (lots of tag comparisons for every lookup)
    - almost never used for data/instruction caches
    - good for reasoning about conflict misses
  - direct-mapped: associativity = 1
    - least flexible possible
    - really easy/fast to implement
- metadata in caches
  - valid bit for every block (need a way to represent an empty cache)
    - valid = 1: block is used
    - valid = 0: block is UNUSED --- ignore everything else
  - tag bits --- multiple things map to a set -- is this the right one?
    NB: needed regardless of associativity
  - [if write-back] dirty bit -- do I need to update memory eventually?
    OR: is my value different than the one in memory?
  - bits to implement replacement policies
    e.g. for LRU: an ordered list of blocks (most to least recently accessed)
- S2017 exam 2 Q1 -- branch prediction

        andq %rax, %rax
        jle later          [NOT TAKEN]
        irmovq
        subq
    later:
        addq
        halt

        andq      F  D  E  M  W
        jle          F  D  E* M  W
        later+0         F  D
        later+1            F
        irmovq                F  D  E  M  W
        subq                     F  D  E  M  W

    * find out we made a misprediction; later+0 (addq) and later+1 (halt) are squashed
      squashing now keeps later+0's addq from changing the condition codes

    Q: REGISTER bubble/stall signals during jle's execute to cause addq/halt to be squashed
       bubble/stall signals are for what happens in the next cycle --> during jle's memory

        during jle's execute:             during jle's memory:
            fetch:     later+1                fetch:     irmovq (new)
            decode:    later+0                decode:    (nothing)
            execute:   jle                    execute:   (nothing)
            memory:    andq                   memory:    jle
            writeback: [old irmovq]           writeback: andq

        fetch->fetch (PC):  normal
        fetch->decode:      bubble --- output a nop instead of the normal value (later+1)
        decode->execute:    bubble --- output a nop instead of the normal value (later+0)
        execute->memory:    normal
        memory->writeback:  normal
- loop unrolling tradeoffs
  + fewer comparison / index-management instructions for a large # of iterations
  + fewer conditional jumps for a large # of iterations
  - maybe more work for a small # of iterations due to "leftovers" (see the sketch below)
  - more space in the instruction cache and in binaries
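(added sketch, not from lecture) to make the "leftovers" tradeoff concrete: one common
way to unroll by 4 in C, with a cleanup loop for the iterations left over when n is not
a multiple of 4. The summation body and function name are made-up placeholders.

    /* original loop: for (int i = 0; i < n; ++i) sum += A[i]; */
    long sum_unrolled(const long *A, int n) {
        long sum = 0;
        int i;
        /* main loop: 4 iterations of work per compare + conditional jump */
        for (i = 0; i + 4 <= n; i += 4) {
            sum += A[i];
            sum += A[i + 1];
            sum += A[i + 2];
            sum += A[i + 3];
        }
        /* "leftovers": up to 3 iterations remain when n % 4 != 0 */
        for (; i < n; ++i)
            sum += A[i];
        return sum;
    }

For a large n the cleanup loop is negligible; for a tiny n it can be most of the work,
which is the downside noted above.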
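(added sketch, not from lecture) for the "and cache blocking" item above: one way the
blocked version of that loop nest could look. The tile size BS, the restrict qualifiers,
and the body of f are assumptions/placeholders; N = 100 matches the stride in the example.

    #define N  100   /* matches the row length in the example above */
    #define BS 16    /* tile size: picked so a BS x BS tile fits in cache (assumption) */

    static long f(long x) { return 2 * x; }   /* placeholder for whatever f does */

    /* restrict promises A and B do not alias -- exactly the guarantee the
     * compiler would need before doing this transformation on its own */
    void blocked(long *restrict A, const long *restrict B) {
        for (int ii = 0; ii < N; ii += BS)
            for (int jj = 0; jj < N; jj += BS)
                /* visit the (i, j) pairs one BS x BS tile at a time, so the
                 * parts of A and B being touched stay small enough to cache */
                for (int i = ii; i < ii + BS && i < N; ++i)
                    for (int j = jj; j < jj + BS && j < N; ++j)
                        A[i * N + j] += f(B[j * N + i]);
    }

Same (i, j) pairs as the original nest, just visited tile by tile, so the column of B
being read stays in the cache across the rows of a tile.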