- forwarding / stalling data hazard --- pipeline "naturally" reads the wrong version of a value

      time ------------>
                                    v---- when irmovq naturally writes
      irmovq $100, %rax   F D E * M W
      addq %rax, %rax       F D * E M W
                              ^---- when addq naturally reads %rax

      if we do nothing, wrong answer

      forwarding: "the value 100 is available, let's use it"
          shortcut the value from irmovq's execute to addq's decode
          implementation: need to check that the irmovq instruction is writing to the same register that addq is reading from
              HCL: something like e_dstE == reg_srcA / reg_srcB (also other cases)
              need to distinguish from
                  irmovq $100, %rbx
                  addq %rax, %rax

      stalling: generic solution to hazards
          well, if we did everything one-at-a-time, it would work correctly, so we can delay the operation to get the right answer
          ONLY using stalling:
              irmovq $100, %rax   F D E M W
              addq %rax, %rax       F F F F D
          sometimes forwarding is not enough, but adding stalling can always fix a hazard
              popq %rax           F D E M* W
              addq %rax, %rax       f dX e m      (wouldn't work)
              solve with stalling only:             F F F F D E
              solve with stalling + forwarding:     F F D* E
          multiple equivalent options:
              time ------------>
              popq %rax           F D E M W
              addq %rax, %rax       F F D E
              addq %rbx, %rbx           F D

              popq %rax           F D E M W
              addq %rax, %rax       F D D E
              addq %rbx, %rbx         F F D
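      a rough C sketch of the forward-vs-stall check described above (just an illustration, NOT
      the course's HCL; REG_NONE, the struct, and the e_value_ready flag are made-up names):

          #include <stdbool.h>

          #define REG_NONE 0xF    /* "no register" encoding, like Y86-64's 0xF */

          struct decision {
              bool forward_from_execute;  /* use the execute-stage result instead of the register file */
              bool stall;                 /* value isn't available anywhere yet: repeat fetch/decode */
          };

          /* e_dstE   = register the instruction in execute will write (irmovq's %rax above)
             reg_srcA = register the instruction in decode wants to read (addq's %rax above) */
          struct decision check_srcA(int e_dstE, int reg_srcA, bool e_value_ready) {
              struct decision d = { false, false };
              if (reg_srcA != REG_NONE && reg_srcA == e_dstE) {
                  if (e_value_ready)
                      d.forward_from_execute = true;  /* irmovq/addq case: value exists early, shortcut it */
                  else
                      d.stall = true;                 /* e.g. a load whose value only exists after memory */
              }
              return d;
          }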
- counting cache misses from C code
      the C code is "just" specifying a list of addresses that are accessed
          typically with a pattern that should help identify what goes on more quickly
      what addresses? generally, shouldn't matter much whether the array is at address 0 versus 1024 versus so on
          --- so long as it's aligned with cache blocks
      you can look at each cache set separately
          often different loop iterations (values of i, etc.) access completely different cache sets
          can just consider iterations that might access the same set
      sometimes you'll access the same block repeatedly
          --> same element of some array?
          --> or adjacent elements of the array which have the same index bits
              (e.g. if you have 32-byte cache blocks, sizeof(array[0]) == 4, and array[0] is at the
              beginning of a block, then array[0] through array[7] are in the same block)
      sometimes you'll access different blocks that use the same set --- worry about conflicts
          (how many ways are there for the set?)
          (e.g. if you have 1024 sets of 32-byte blocks, then elements that map to the same set are
          multiples of 1024*32 bytes apart = # bytes in a way
          [log_2(32) offset bits, log_2(1024) index bits, so for the index+offset bits not to change,
          need to add 2^(offset bits + index bits) = 2^(log_2(32) + log_2(1024))])
      look at the order accesses are in for each set, figure out what the contents of the set is
          direct-mapped cache: set contains only what was last accessed
          N-way cache: set contains N things, which ones depend on the replacement policy
      often loops mean that most sets have the same hit/miss pattern
      remember: the chances for hits are repeated accesses to the same block

      int array[1024 * 128]; /* <--- starts at beg. of cache block */
      ...
      for (int i = 0; i < 1024; ++i) {
          array[i] += array[i * 128];
      }   ^^^^^^^^    ^^^^^^^^^^^^^
          A           B
      two types of repeated accesses:
          accessing array[i] and array[i+1] when they're in the same block
              roughly half of the accesses to A are hits
          accessing array[i*128] and then accessing array[i'] when i' = i * 128
              these turn out to be misses (mostly?) because we have other accesses to the same set
              in the iterations in between

      WT, WNA, DM 16K cache, 8-byte blocks
      A:  i = 0, 1        --- set 0
          i = 2, 3        --- set 1
          i = 4, 5        --- set 2
          ...
          i = 1022, 1023  --- set 1022/2 = 511
          16K bytes --> 4K ints, so array[0], array[4K], array[8K] ---> set 0
      B:  i = 0, 32 (array[4K]), 64 (array[8K]), ... (every 32): set 0
          i = 1, 33 (array[128+4K]), 65, ...: set 64
              array[128] is 128 ints from the beginning
              or 128 * 4 = 512 bytes from the beginning
              or 512 / 8 = 64 blocks from the beginning --> set 64
          i = 2, 34 (array[256+4K]), 66, ...: set 128
          ...
      counting misses: two cases
          set 0, 64, 128, ...:
              i = 0 / set 0: two misses (+ write miss?)
              array[128]: --- set 64
                  i = 1 (array[128]) for B
                  i = 33 (array[128+4K]) for B
                  i = 65 (array[128+8K]) for B
                  ...
                  i = 128 (array[128]) for A
                      hit for A in set 64 when i = 129
              array[256]: --- set 128
                  i = 2 for B
                  i = 34 for B --- different value!
                  i = 256 for A --- miss! even though repeated
              ...

- code optimizations --- loop unrolling, function inlining are good when?
      loop unrolling
          removes loop overhead
              loop overhead = instructions to manage the loop index
                              + instructions to check whether we're done with the loop
          advantage if the loop overhead was slowing us down
              usually, yes it is
              example where it wouldn't: if we had really slow cache accesses that were executed in parallel
          disadvantage is that it means the program has more instructions
              (more instructions in the executable file, not more times we fetch an instruction)
              so more space in the instruction cache
              if our program is big, maybe we run out
          disadvantage is that we might run extra code if the # of loop iterations is very small
              because our unrolling assumed we'd usually do groups of K (K = how much we unrolled)
      function inlining
          removes function overhead
              function overhead = instructions to manage arguments, the return value,
                                  the return address, the stack, callee/caller-saved requirements
          advantage if the function overhead was slowing us down
          disadvantage is a larger number of instructions in the instruction cache
      second-order benefit of function inlining / loop unrolling: they expose other optimizations
          example: function inlining might make it easier to eliminate common subexpressions
              now that we (or the compiler) don't need to look across multiple functions to do it
              common subexpression == parts of expressions that are the same
                  (x * z + y) and (x * z + w) --- x * z is a common subexpression
                  a common compiler optimization --- compute x*z once
          example: loop unrolling is a prerequisite for doing multiple accumulators
              or most uses of vector instructions
              (a C sketch of unrolling with multiple accumulators is below, after the cache blocking notes)

- cache blocking
      reorder memory accesses to increase the amount of reuse close in time/space
      = reorder memory accesses to increase locality (temporal + spatial)
      mechanism: divide the input matrices into "blocks", try to keep a block accessed repeatedly
          better spatial locality: adjacent elements within a block used close together in time
          better temporal locality: use parts of a block repeatedly before moving on to the next block
      useful only if you have a calculation which actually reuses cache blocks at some point
      disadvantage is that if cache performance wasn't the problem,
          then your code becomes more complicated --> slower (more loop overhead)
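      a minimal sketch of the blocking idea above for C = C + A*B with n x n row-major matrices
      (BLOCK, the function name, and the flat array layout are assumptions, not from a specific question):

          #define BLOCK 32   /* picked so a few BLOCK x BLOCK tiles fit in the cache at once */

          void matmul_blocked(double *c, const double *a, const double *b, int n) {
              for (int ii = 0; ii < n; ii += BLOCK)
                  for (int kk = 0; kk < n; kk += BLOCK)
                      for (int jj = 0; jj < n; jj += BLOCK)
                          /* finish one BLOCK x BLOCK tile of work before moving on, so the
                             tiles of a, b, and c are reused while they are still cached */
                          for (int i = ii; i < ii + BLOCK && i < n; ++i)
                              for (int k = kk; k < kk + BLOCK && k < n; ++k)
                                  for (int j = jj; j < jj + BLOCK && j < n; ++j)
                                      c[i * n + j] += a[i * n + k] * b[k * n + j];
          }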
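      the unrolling + multiple accumulators sketch mentioned in the code optimizations notes above
      (sum_unrolled and the 4x unroll factor are illustrative choices, not from the exam):

          long sum_unrolled(const long *a, long n) {
              long acc0 = 0, acc1 = 0;
              long i;
              for (i = 0; i + 3 < n; i += 4) {      /* one index update + one end-of-loop test per 4 elements */
                  acc0 += a[i]     + a[i + 1];
                  acc1 += a[i + 2] + a[i + 3];      /* separate accumulator: no dependence on acc0 */
              }
              for (; i < n; ++i)                    /* leftover elements when n is not a multiple of 4 */
                  acc0 += a[i];
              return acc0 + acc1;
          }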
- vector instructions Q8 S2018
      for (...; i += 1) {
          A[i] += A[i - 1] * B[i / 2];
          total += A[i];
      }
      can't use vector instructions for this easily because of aliasing
      generally, using vector instructions for this:
      for (...; i += 2) {
          A[i] += A[i - 1] * B[i / 2];              /* <--- */
          A[i+1] += A[(i+1) - 1] * B[(i+1) / 2];
                    ^^^^^^^^^^^^
          total += A[i];
          total += A[i+1];
      }
      end up with vector code that's equivalent to this, where temp1/temp2 represent parts of a vector register:
          temp1 = A[i];    temp1 += A[i-1] * B[i/2];
              /* <--- what if A[i] and B[(i+1)/2] are the same element?
                      the previous (non-vector) code updated A[i] here! */
          temp2 = A[i+1];  temp2 += A[(i+1)-1] * B[(i+1)/2];
          A[i] = temp1;
          A[i+1] = temp2;

- vector instructions in general
      single instruction, multiple data
      same thing on pairs/groups of matching data
      vector = little array; a vector instruction acts on several little arrays:
          the same operation on element 0, element 1, element 2, etc.
      writing vector instructions: unroll and try to have groups of (vector size) identical things
          don't want to have to worry about dependencies between them
          don't want to have conditionally different operations

- accesses to physical addresses Q4 S2018
      - remember: caches are typically based on physical addresses

- calculating page table sizes, PTE sizes, etc. versus virtual/physical address layout
      # bits used to index into a page table = size of the part of the VPN used
          single-level table: # bits used = bits in VPN
          multi-level table: bits in VPN are divided between the levels of the table
              (typically evenly, but sometimes that's not possible)
      # bits to index the page table --> 2^(# bits) page table entries
      2^(# bits to index) * PTE size = page table size
      virtual address  = [ VPN | PO ]
          bits in virtual address = bits in VPN + bits in PO
      physical address = [ PPN | PO ]
          bits in physical address = bits in PPN + bits in PO
      PTE sizes: must be big enough for AT LEAST PPN + valid bit (and probably more)

- virtual memory lookup with multi-level tables
      (a C sketch of this walk is at the end of these notes)
      take the VPN from the address --- the most significant (MS) bits (# of bits? see above)
      split the VPN into parts, depending on table sizes / # of levels
      take the MS (most significant) part --- use it to find the index into the first-level table
          the address you check is (start of table) + index * PTE size
          (just like accessing an array of PTEs)
      use the PTE from the first-level table to get a *physical* page # that tells you where the second-level table is
      take the second MS part of the orig. addr --- use it to index into the second-level table
          (start of table) + index * PTE size
          ^^^^^^^^^^^^^^^^
          given as a physical page # in the PTE, not a byte address
          might need to convert
      repeat until you've run out of levels
      the last PTE you find gives the PPN that goes in the final physical address

- TLB hit versus miss
      - TLB hit: means we can SKIP the page table lookup
          b/c we have a copy of the page table entry we'd get *at the end*, saved from earlier
      - TLB miss: means we have to do the page table lookup
          it may or may not result in a valid page
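      a hedged C sketch of the multi-level lookup described above; every constant here (2 levels,
      4096-byte pages, 9 VPN bits per level, 8-byte PTEs) and the PTE layout are assumptions for
      illustration, not numbers from a particular exam question:

          #include <stdint.h>
          #include <string.h>

          #define PAGE_SIZE      4096u   /* => 12 page offset (PO) bits */
          #define PO_BITS        12
          #define BITS_PER_LEVEL 9       /* 2^9 = 512 PTEs per table; 512 * 8 bytes = 4096-byte tables */
          #define LEVELS         2
          #define PTE_SIZE       8

          extern uint8_t physical_memory[];   /* pretend physical memory, indexed by physical byte address */

          /* made-up PTE layout: bit 0 = valid, upper bits = PPN */
          static int      pte_valid(uint64_t pte) { return pte & 1; }
          static uint64_t pte_ppn(uint64_t pte)   { return pte >> PO_BITS; }

          /* returns the physical address for vaddr, or (uint64_t)-1 on a page fault */
          uint64_t translate(uint64_t table_base_pa, uint64_t vaddr) {
              uint64_t vpn = vaddr >> PO_BITS;            /* strip the page offset */
              uint64_t po  = vaddr & (PAGE_SIZE - 1);
              for (int level = 0; level < LEVELS; ++level) {
                  /* most significant chunk of the VPN indexes the first-level table */
                  int shift = (LEVELS - 1 - level) * BITS_PER_LEVEL;
                  uint64_t index = (vpn >> shift) & ((1u << BITS_PER_LEVEL) - 1);
                  /* (start of table) + index * PTE size, just like indexing an array of PTEs */
                  uint64_t pte_addr = table_base_pa + index * PTE_SIZE;
                  uint64_t pte;
                  memcpy(&pte, &physical_memory[pte_addr], sizeof pte);
                  if (!pte_valid(pte))
                      return (uint64_t)-1;                /* invalid PTE => page fault */
                  /* the PTE holds a physical page *number*, not a byte address: convert */
                  table_base_pa = pte_ppn(pte) * PAGE_SIZE;
              }
              /* after the last level, "table_base_pa" is actually the data page's address */
              return table_base_pa | po;
          }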