- what's on the exam
  if we covered it in lecture since the last exam: probably
  if we covered it before: maybe, incidentally
  loop unrolling will wait till exam 3

- addq with %rax = -5, %rbx = 5 --> condition codes and je
  result is 0, so ZF = 1
  je: ZF ?= 1 ---> "if the result of a subtraction would have meant equal" ---> "if the result was 0"

- S2018 Q5 ~ cycles stalled from hazards

      mrmovq -> %r11     F D E M W
      addq %r11, %r8     x F D E      (same as F D D E)  <-- needs %r11 from mrmovq
      subq %r8, %r11       F D E ...  <-- needs %r8 (addq), needs %r11 (mrmovq)
      popq %r8             F D E ...  <-- needs %rsp

  stalling OF AN INSTRUCTION or OF THE PIPELINE: keeping an instruction where it is / keeping part of the pipeline where it is
  stalling OF A REGISTER BANK: keeping the register bank outputting the same thing
  bubbling OF A REGISTER BANK: making the register bank output a "bubble" = NOP

- post-quiz: counting cache misses
  values being accessed = "elements"
  Q: how many elements per block? which elements are in the same block?
     (maybe we access different parts of the block and have only one miss)
     block size / size of value = # elements per block
     assuming element 0 is at the beginning of a block: elements 0 to (# per block - 1) are in one block, (# per block) to (2 * # per block - 1) in the next, ...
  Q: which elements map to the same set?
     find # bytes per way
     adding K * block size to an address --> advance by K sets in our index bits
         (if we go off the "bottom" --> wrap around to the top again)
     adding # bytes per way --> end up at the same set
     adding X * # bytes per way --> end up at the same set
     so if elements are X * # bytes per way apart --> they map to the same set
     we can figure out cache misses by thinking about exactly one set at a time

  16 bytes/block --> 4 elements/block
     a[0], a[1], a[2], a[3] are in one block, a[4], a[5], a[6], a[7] are in the next block, etc.
  direct-mapped (= 1-way) 2048-byte cache --> 2048 bytes per way
     2048 bytes / (4 bytes per int) = 512 ints per way
     values 512 ints apart ---> map to the same set
     {a[0], a[1], a[2], a[3]} and {a[512], a[513], a[514], a[515]} map to the same set
     a[4] and a[512 + 4] map to the same set, ...

      int a[1024];
      ...
      int sum = 0;
      for (int j = 0; j < 32; j++)
          for (int i = 0; i < 32; i++)
              sum += a[i * 32 + j];

  the set containing a[0 to 3] and a[512 to 515]:
      a[0]   (i = 0,  j = 0): a miss, a[0 to 3] never accessed before
      a[512] (i = 16, j = 0): a miss, a[512 to 515] never accessed before
      a[1]   (i = 0,  j = 1): could benefit from a[0] being accessed before? but a miss due to conflict with a[512] accessed in between
      a[513] (i = 16, j = 1): could benefit from a[512] being accessed, but a miss due to conflict with a[1] accessed in between
      ditto for a[2], then a[514], then a[3], then a[515], ...
  answer: 1024 misses (everything misses)

  suppose two ways instead --- is this better? 1024 bytes per way now
     a[0] and a[256] and a[512] and a[768] map to the same set
      a[0]   (i = 0,  j = 0)
      a[256] (i = 8,  j = 0)
      a[512] (i = 16, j = 0)
      a[768] (i = 24, j = 0)
      a[1]   (i = 0,  j = 1): probably a MISS because we accessed a[256], a[512], a[768] since accessing a[0] (we can only store 2 blocks per set, and we have three extra blocks to fit in the cache)
      a[257] (i = 8,  j = 1): ...

  fully-associative --- usually better, most flexibility
     but only guaranteed better if the replacement policy chooses things that make sense for the access pattern
     example bad case: scanning a giant array, so we're usually accessing the thing accessed furthest back in time, with an LRU replacement policy ---> same # of misses or more than some other designs

      int a[1024];
      ...
      int sum = 0;
      for (int i = 0; i < 1024; i++)
          sum += a[i];

  a[0 to 3] is one block; accessing a[1] IMMEDIATELY after a[0] MUST BE A HIT, accessing a[2] IMMEDIATELY after a[1] MUST BE A HIT
  1 miss, 3 hits per block
  1024 / 4 = 256 blocks --> 1 miss per block --> 256 misses
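  as a sanity check on hand counts like these, here is a minimal C sketch (my own, not from lecture; the constants and names like NUM_SETS are assumptions) that simulates the 2048-byte direct-mapped cache on the column-major loop and counts misses, assuming a[] starts at a block-aligned address:

      #include <stdio.h>

      #define BLOCK_SIZE 16                       /* bytes per block */
      #define CACHE_SIZE 2048                     /* total bytes, direct-mapped = 1 way */
      #define NUM_SETS (CACHE_SIZE / BLOCK_SIZE)  /* 128 sets */

      int main(void) {
          long tags[NUM_SETS];
          int valid[NUM_SETS] = {0};
          long misses = 0;
          for (int j = 0; j < 32; j++) {
              for (int i = 0; i < 32; i++) {
                  /* byte address of a[i * 32 + j], taking a[] to start at address 0 */
                  long addr = (long)(i * 32 + j) * sizeof(int);
                  long block = addr / BLOCK_SIZE;
                  long set = block % NUM_SETS;    /* index bits */
                  long tag = block / NUM_SETS;    /* tag bits */
                  if (!valid[set] || tags[set] != tag) {  /* not cached --> miss */
                      misses++;
                      valid[set] = 1;
                      tags[set] = tag;
                  }
              }
          }
          printf("misses: %ld\n", misses);  /* prints 1024: everything misses */
          return 0;
      }

  changing the accessed element from a[i * 32 + j] to a[32 * j + i] (which scans the array sequentially, like the loop above) drops the count to 256, one miss per block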
- write-through / write-back
  write-through: always write TO MEMORY immediately
     pro: don't need to remember whether you need to write
     con: maybe you could have written less often
  write-back: if the value is cached, update the cached value; ONLY LATER, when the value is evicted, update memory
     pro: if we write a value multiple times, we send the value to memory once
     con: need to remember to write the value later on (even though the write may have happened a while ago)
          so reading values may trigger writes (a read miss can evict a dirty block)
  dirty bit = "is the data different than memory?" (do I have work to do on eviction to "fix" memory?)

- post-quiz: benefits/disadvantages of optimizations
  function inlining: take the function body, insert it where the function would be called
     --> no function call (no setting arguments, call instruction, return instruction, retrieving return value)
     NB: the eliminated overhead also includes saving values on the stack, etc.
     now we have multiple copies of the function body (one for the function called "normally" + one for each place it's inlined)
     when does the compiler inline? it wants to not make your program gigantic, so usually only if the function is quite small --> code gets only slightly larger
     (see the before/after sketch at the end of these notes)
  code size: amount of space the code takes up (on disk and/or in the instruction cache and/or in memory)
  number of instructions RUN: the number of times we fetch an instruction
     # of instructions run in a loop is based on # of loop iterations, not the size of the loop
  spatial locality and function inlining:
     w/o inlining: jump to a completely different part of memory where the function is
     w/ inlining: it's just the next instruction in memory

- cache blocking = the practice of processing data in blocks, like our matrix example, to make better use of the cache for matrix-like problems

- loop unrolling versus conditional branching
  conditional branching = something like a jXX instruction (that is not jmp)

      int i;
      for (i = 0; i + NUMBER_OF_TIMES_UNROLLED < N; i += NUMBER_OF_TIMES_UNROLLED) {
          MAIN LOOP
      }
      for (; i < N; ++i) {
          LEFTOVER
      }

  (i is declared outside the first loop so the LEFTOVER loop continues from where the MAIN LOOP stopped, rather than restarting at 0)
  small N --- end up spending extra time handling the "leftover"; that's extra overhead
  large N --- yes, we still have the "leftover", but we saved a lot of conditional branches b/c MAIN LOOP has 1 branch per NUMBER_OF_TIMES_UNROLLED iterations
  (see the concrete sketch at the end of these notes)

- when to stall/bubble --- general rule:
  [data dependencies] look for dependencies:
     data computed in stage X, needed in stage Y --> try to line up the stages so stage Y is after stage X
     e.g. with forwarding:

         popq %r8         F D E M W      value computed in MEMORY
         addq %r8, %r9      F D D E M W  value needed in EXECUTE (one stall, so E comes after popq's M)

     if no forwarding: values are effectively "computed" in WRITEBACK and "needed" in DECODE

         popq %r8         F D E M W*
         addq %r8, %r9      x x x F* D E M W
                            (same as F D D D D E M W)

  [control dependencies] look for dependencies:
     address of the next instruction to run is computed in stage X but needed in FETCH --> line up so the FETCH of the next instruction is after stage X

         ret              F D E M W      addr. of next instruction computed in MEMORY
         next instr.        x x x F
                            ^^^^^--- three cycles of stalling

         jle              F D E M W      computed in EXECUTE
         next instr.        x x F D E M W
                            ^^^--- two wasted cycles, BUT can be avoided with prediction (but only sometimes)

- S2018 Q19 ~~ cache miss counting
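  to make the inlining bullet above concrete, here is a minimal before/after sketch (the function names are mine, not from lecture):

      /* without inlining: every iteration pays for setting up the
         argument, the call instruction, the return instruction,
         and retrieving the return value */
      static int add1(int x) { return x + 1; }

      int sum_calls(int n) {
          int s = 0;
          for (int i = 0; i < n; i++)
              s += add1(i);
          return s;
      }

      /* with inlining: the body "x + 1" is inserted at the call site,
         so there is no call/return, and the next instruction executed
         is right here in memory (better spatial locality) */
      int sum_inlined(int n) {
          int s = 0;
          for (int i = 0; i < n; i++)
              s += i + 1;
          return s;
      }

  the compiler does this transformation itself for small functions like add1; the tradeoff is one extra copy of the body per inlined call site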
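  and here is a concrete version of the unrolling pattern (again my own sketch; the factor 4 is an arbitrary choice of NUMBER_OF_TIMES_UNROLLED, and I use <= so an exact multiple of 4 stays in the main loop --- the < version above also works, it just hands one extra chunk to the leftover loop):

      #include <stddef.h>

      /* sum n ints, unrolled by 4: the MAIN LOOP runs 1 conditional
         branch per 4 additions; the LEFTOVER loop handles the last
         n % 4 elements one at a time */
      long sum_unrolled(const int *a, size_t n) {
          long sum = 0;
          size_t i;
          for (i = 0; i + 4 <= n; i += 4) {  /* MAIN LOOP */
              sum += a[i];
              sum += a[i + 1];
              sum += a[i + 2];
              sum += a[i + 3];
          }
          for (; i < n; i++)                 /* LEFTOVER */
              sum += a[i];
          return sum;
      }

  for small n the leftover loop is pure overhead; for large n the main loop dominates and we pay far fewer branches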