- stalling/forwarding -- only an issue when two instructions depend on each other

      time -------------->
                                      v------------- write value of %rbx here
      addq %rax, %rbx     F D E M W
                              ^---------------------- value of %rbx computed by here
      subq %rcx, %rbx       F D E M W

      if we do nothing:
      subq %rcx, %rbx       F D E M W
                              ^---------------------- read value of %rbx here

      if we do forwarding:
      subq %rcx, %rbx       F D E M W
                                  ^------------------ instead of using the value from the
                                                      register file, take the value from
                                                      addq's wires directly

      if we do stalling:
      subq %rcx, %rbx       F D D D E M W
                              ^---^------------------ detect that %rbx isn't ready and send
                                                      subq back into the fetch-to-decode
                                                      pipeline registers

  how to resolve this:
      most efficient way (in terms of cycles): forwarding
          we know that the new value of %rbx was already computed
          take that computed value from [where it is in our processor]
              [at the time we read it, that's in the memory stage of addq]
          and have it replace the value we would read from the register file
      less efficient way (in terms of cycles): stalling (alone)
          delay the subq instruction until the value would be ready from the register file

  let's say we had a pipeline with three execute stages,
  and assume addq's result is not available until the last execute stage:

      addq %rax, %rbx     F D E1 E2 E3 M W
                                    ^---------------- value of %rbx is computed here
      subq %rcx, %rbx       F D E1 E2 E3 M W
                              ^---------------------- value of %rbx is read here (if we do nothing)
                                    ^---------------- value is needed here

      the value is computed in the same cycle where we need it, so yes, we could
      technically forward it within the same cycle, but this would make the cycle
      time WAY longer --- not something we're ever going to do

      so stall, so that the value-needed stage comes after the value is computed:

                                           v--------- replace the value we would read from the
                                                      register file with the value that was just
                                                      computed, taken from the execute3 wires
      subq %rcx, %rbx       F D D E1 E2 E3 M W
                              ^---------------------- detect that the value we need isn't
                                                      computed yet and send subq back into the
                                                      fetch-to-decode pipeline registers

  how to resolve this:
      most efficient way (in terms of cycles): stalling + forwarding
          stall to make sure we wait until after the value is ready before we use it
          then take it from where the computed value is to where we want to use it
              (replacing some old version we would read from the register file, etc.)
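  to make the "detect and decide" step concrete, here is a minimal C sketch of the
  per-cycle choice the hazard-detection logic makes for a 5-stage pipeline; this is
  not the HCLRS code from lecture, and the struct fields, register numbering, and
  the load-needs-one-stall rule are assumptions chosen for illustration:

      #include <stdbool.h>
      #include <stdio.h>

      #define REG_NONE (-1)   /* marker: no register read/written */

      /* which registers the in-flight instructions touch this cycle
         (field names invented for this sketch) */
      typedef struct {
          int decode_src1, decode_src2; /* read by the instruction in decode */
          int execute_dst;              /* written by the instruction in execute */
          bool execute_is_load;         /* a load's value is ready a stage later */
          int memory_dst;               /* written by the instruction in memory */
      } pipeline_state;

      typedef enum { USE_REG_FILE, FORWARD, STALL } action;

      /* decide how the decode-stage instruction should get one source value */
      action resolve(const pipeline_state *p, int src) {
          if (src == REG_NONE)
              return USE_REG_FILE;
          if (src == p->execute_dst)
              /* an addq-style result exists by the end of execute, so we can
                 forward it; a load's value isn't ready until the end of memory,
                 so the dependent instruction must stall one cycle first */
              return p->execute_is_load ? STALL : FORWARD;
          if (src == p->memory_dst)
              return FORWARD;           /* take it from the memory-stage wires */
          return USE_REG_FILE;          /* register file value is up to date */
      }

      int main(void) {
          /* addq %rax, %rbx in execute while subq %rcx, %rbx is in decode;
             numbering registers rax=0, rcx=2, rbx=3 for this sketch */
          pipeline_state p = { .decode_src1 = 2, .decode_src2 = 3,
                               .execute_dst = 3, .execute_is_load = false,
                               .memory_dst = REG_NONE };
          printf("%d\n", resolve(&p, p.decode_src2)); /* prints 1 (FORWARD) */
          return 0;
      }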
- stalling an instruction: keeping the instruction in the same stage
  to implement this with our pipeline registers:
      have the register before the stage output the instruction again next cycle
          in HCLRS: stall_X signal
      have the register after the stage output a nop next cycle
      (otherwise we'll duplicate the instruction)
          in HCLRS: bubble_Y signal
- bubble_X: inserts a nop ("pipeline bubble") in the pipeline
      if we stall, because the stage after the stall needs something to do
      if we mispredict, because we need to replace our wrong guesses with something
- write-through/write-back
- write-allocate/write-no-allocate
- when we have a cache access that is a write:
  (1) check if the cache has the value
      if it does not, there are two options:
      (a) write-allocate: bring the value into the cache, then proceed as if it
          were already there
      (b) write-no-allocate: just send the value to the next level and forget about it
  (2) now (if we keep going), the cache has the value (perhaps because we loaded it)
  (3) update the cache's version of the value
  (4) then two options:
      (a) write-back: mark the cache's version of the value as dirty
          (defer the actual write until later)
          later on, if we ever remove something marked as dirty from the cache,
          "clean" the value by writing it to the next level
      (b) write-through: send the value to the next level
          (and don't need to track whether it's out-of-sync)
- mapping cache sets to memory
  - X = # bytes per way of the cache = # sets * block size
      = 2^(index bits) * 2^(offset bits) = 2^(index bits + offset bits)
  - values which are X bytes apart go to the same set:

        [ tag ][   index    ][  offset   ]
           1    000000000000   00000000000   = X

    so adding X to a value changes the tag, but not the index or offset
    so if array[0] through array[3] are a cache block and map to set 0:
        if my cache is a 6-way 12KB cache
            --> # bytes per way of the cache is 12KB/6 = 2KB
        whatever is 2KB after array[0] maps to set 0, too
        can also find out: array[4] through array[7] map to set 1, etc.
    generally: want to group values together by what set they map to
    and analyze each set separately (see the sketch below)
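  as a sanity check on this arithmetic, here is a small C sketch that splits an
  address into tag/index/offset and confirms that adding X changes only the tag;
  the 12KB 6-way example above doesn't fix a block size, so 16B blocks (and
  therefore 128 sets) are assumed here for illustration:

      #include <stdio.h>
      #include <stdint.h>

      typedef struct { uint64_t tag, index, offset; } addr_parts;

      /* split an address for a cache with 2^index_bits sets
         and 2^offset_bits bytes per block */
      addr_parts split_addr(uint64_t addr, int index_bits, int offset_bits) {
          addr_parts p;
          p.offset = addr & (((uint64_t)1 << offset_bits) - 1);
          p.index  = (addr >> offset_bits) & (((uint64_t)1 << index_bits) - 1);
          p.tag    = addr >> (offset_bits + index_bits);
          return p;
      }

      int main(void) {
          /* 6-way 12KB cache: bytes per way X = 12KB/6 = 2KB = 2^11, so
             offset bits + index bits = 11; with 16B blocks, offset = 4 bits
             and index = 7 bits (128 sets) */
          int offset_bits = 4, index_bits = 7;
          uint64_t X = (uint64_t)1 << (offset_bits + index_bits); /* 2048 */
          uint64_t a = 0x12340;                        /* arbitrary address */
          addr_parts p1 = split_addr(a,     index_bits, offset_bits);
          addr_parts p2 = split_addr(a + X, index_bits, offset_bits);
          printf("index: %llu vs %llu (same set)\n",
                 (unsigned long long)p1.index, (unsigned long long)p2.index);
          printf("tag:   %llu vs %llu (different)\n",
                 (unsigned long long)p1.tag, (unsigned long long)p2.tag);
          return 0;
      }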
- cache misses with the struct: 16B blocks, 2KB direct-mapped cache
  (= 2KB 1-way cache, so 2KB bytes per way)

      typedef struct {   /* layout: [a_value][b_value][boring_values[0]][boring_values[1]] ... */
          int a_value, b_value;
          int boring_values[126];
      } item;
      item items[8]; // 4KB array
      int a_sum = 0, b_sum = 0;
      for (int i = 0; i < 8; ++i)
          a_sum += items[i].a_value;
      for (int i = 0; i < 8; ++i)
          b_sum += items[i].b_value;

  sizeof(item) = (2 + 126) * sizeof(int) = 512B
  assuming items starts at the beginning of a cache block:
      a_value, b_value, boring_values[0], and boring_values[1] are all in the same block
      and because sizeof(item) is a multiple of the size of a cache block,
      this is true for every item in the array
  values which are 2KB apart map to the same set:
      items[0].a_value, items[0].b_value and items[4].a_value, items[4].b_value map to the same set
      items[1].a_value, items[1].b_value and items[5].a_value, items[5].b_value map to the same set
      ...

  to analyze this, we can look at one set, and know that every other set will have
  the same number of misses (because the pattern is exactly the same for each set)

  accesses to set 0:       set 0 contains afterwards:                                       result
      items[0].a_value     items[0].{a_value, b_value, boring_values[0], boring_values[1]}  miss
      items[4].a_value     items[4].{a_value, b_value, boring_values[0], boring_values[1]}  miss
      items[0].b_value     items[0].{a_value, b_value, boring_values[0], boring_values[1]}  miss
      items[4].b_value     items[4].{a_value, b_value, boring_values[0], boring_values[1]}  miss

  what if an LRU 4-way 2KB cache? --> bytes per way is 512
      so items[0].a_value and items[1].a_value and items[2].a_value etc. are all in the same set

  accesses to set 0:       set 0 contains afterwards:                                       result
      items[0].a_value     items[0].a_value                                                 miss
      items[1].a_value     items[0].a_value + items[1].a_value                              miss
      items[2].a_value     items[0].a_value + items[1].a_value + items[2].a_value           miss
      items[3].a_value     items[0..3].a_value                                              miss
      items[4].a_value     items[4].a_value + items[1..3].a_value (items[0] evicted)        miss
      items[5].a_value     ...                                                              miss
      items[6].a_value     ...                                                              miss
      items[7].a_value     ...                                                              miss
      items[0].b_value     ...                                                              miss
                           (b/c items[4] replaced items[0] due to LRU policy)

- S2018 Q19: how many misses?

      unsigned char A[1024 * 8];
      ...
      for (int i = 0; i < 2; ++i) {
          for (int j = 0; j < 8; ++j) {
              A[j * 1024] += j + A[j * 256 + 4];
          }
      }

  4-way 4KB set-associative cache with 4B blocks
  bytes per way = 4KB/4 ways = 1KB --> values 1KB apart map to the same set
  1KB = 1024 array elements (because sizeof(unsigned char) = 1)

  accesses (each happens twice, once per outer-loop iteration):
      A[0*1024]                       <-- set 0
      A[0*256+4] = A[4]               <-- set 1
      A[1*1024]                       <-- set 0
      A[1*256+4]                      <-- set 1+(256/4) = set 65
      A[2*1024]                       <-- set 0
      A[2*256+4]                      <-- set 1+(512/4) = set 129
      A[3*1024]                       <-- set 0
      A[3*256+4]                      <-- set 1+(768/4) = set 193
      A[4*1024]                       <-- set 0
      A[4*256+4] = A[1024+4]          <-- set 1 (1024 bytes past A[4])
      A[5*1024]                       <-- set 0
      A[5*256+4]                      <-- set 1+(256/4) = set 65
      A[6*1024]                       <-- set 0
      A[6*256+4] = A[4*256+2*256+4]   <-- set 1+(2*256/4) = set 129
      A[7*1024]                       <-- set 0
      A[7*256+4] = A[1792+4]          <-- same set as A[1792+4-1024] = A[768+4],
                                          so set 1+(768/4) = set 193

  set 0: we have 8 values we care about
      with an LRU policy, can only keep the 4 most recent
      but the 4 most recent are never what we're accessing next
      --> all misses --> 16 misses (8 accesses x 2 outer iterations)
  other sets (1, 65, 129, 193): we have 2 values each
      e.g. A[0*256+4] and A[4*256+4] are 1024 bytes apart
      know that we don't have to worry about evictions with 4 ways
      compulsory misses for these, but no other misses
      --> 8 misses for these values

- loop unrolling

      loop:
          addq (%r8, %rdi, 8), %rax   <--- "BODY" (what we actually want to compute)
                                           RAX <- RAX + memory[R8 + RDI * 8]
          incq %rdi                   <--- loop overhead (RDI <- RDI + 1)
          cmpq %rdi, %rsi             <--- loop overhead
          jne loop                    <--- loop overhead

  unrolled to do twice as much per iteration, but with the same loop overhead
  per iteration --> overall half the overhead:

      loop:
          addq (%r8, %rdi, 8), %rax        RAX <- RAX + memory[R8 + RDI * 8]
          addq 8(%r8, %rdi, 8), %rax       RAX <- RAX + memory[8 + R8 + RDI * 8]
          addq $2, %rdi
          cmpq %rdi, %rsi
          jne loop

  but: need to do a little extra work if the number of loop iterations is not a
  multiple of the number of times we duplicated the body (see the C sketch below)
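  the same transformation in C, including the "little extra work" as a cleanup
  loop; a minimal sketch (the function name and types are ours, not from the
  original example):

      /* 2x-unrolled version of "sum += a[i] for each i", with a cleanup
         loop for when n is not a multiple of 2 */
      long sum_unrolled(const long *a, long n) {
          long sum = 0;
          long i;
          for (i = 0; i + 1 < n; i += 2) {
              sum += a[i];       /* two copies of the body ...           */
              sum += a[i + 1];   /* ... per one copy of the loop overhead */
          }
          for (; i < n; ++i)     /* cleanup: leftover element if n is odd */
              sum += a[i];
          return sum;
      }

  note that both additions still feed the single accumulator sum, so unrolling
  reduces overhead but does not shorten the chain of dependent additions; that
  chain is what the latency bound below is about.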
- throughput/latency bound
  throughput bound: when we have a calculation, how fast can it be based on what
  the hardware is capable of doing, ignoring dependencies and how we have to move
  data around for them
      e.g. if I need to do 10000 loads and 5000 multiplies to process 5000 values,
      how fast can my hardware do that,
          if my hardware can start 1 load per cycle and 2 multiplies per cycle,
          and each load (optimally) takes 3 cycles and each multiply takes 5 cycles,
          and they are pipelined
          [meaning that we don't need to worry about the ones we started before]
      for the throughput bound: don't care how long they take, just how many we can do
          can do multiplies for 2 values/cycle:
              start 2 every cycle --> finish 2 every cycle (because pipelined)
          can do loads for 0.5 values/cycle:
              start 1 every cycle and need 2 per value
              --> start all the loads for a value every other cycle
                  (first load in cycle 1, second load in cycle 2)
              --> finish all the loads for a value every other cycle
      these are throughput bounds --- ignoring dependencies

  latency bound: given what I have to do one-after-the-other, how fast can it be,
  given how my hardware does that calculation
      e.g. if for each value, I need to use a loaded value to determine what value
      to load next:
          one load, THEN another, THEN another
          can start one of these loads only after the previous one finishes
          if loads take 5 cycles, that means we start one of these loads every
          5 cycles --> performance bound of 0.2 values/cycle
      "latency bound" --> based on the worst dependencies
      more generally: what is the longest chain of dependent operations
      (the "critical path")
          add up the time needed to complete the operations going through that chain
          not helped by being able to do two things in parallel along the chain

  based on these throughput bounds, the best performance I could get would be
  0.5 values/cycle (the tightest bound wins)

- register renaming
  we want a different name for each version of a register in the assembly
  do this by changing the "physical" name every time we change a register
  e.g.
      addq %r8, %r9
      addq %r10, %r9
      addq %r11, %r9

  going to have an initial name for the first version of %r8, %r9, %r10, %r11:
      let's say %x8, %x9, %x10, %x11
  and going to need new names to use: let's say we can use %x12, %x13, etc.

                           vvvv---------- new name for the result
      addq %x8, %x9   -->  %x12
                ^^^---------------------- read from the old version
      addq %x10, %x12  -->  %x13
      addq %x11, %x13  -->  %x14

  (see the renaming sketch below)

- instruction queue and dispatch
  instruction queue: contains renamed instructions
  alongside the instruction queue, we track which physical registers have values
  available
      (initially they don't (when renaming allocates the register),
      but after the corresponding instruction executes, they do)
  every cycle, we look for instructions in the instruction queue where all their
  inputs are ready, and we take as many of those ready instructions as we can
  execute and actually start executing them
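  a minimal C sketch of the rename table for the example above; the instruction
  encoding and table layout are invented for illustration, and it ignores things
  real hardware must handle, like recycling physical registers:

      #include <stdio.h>

      int main(void) {
          /* rename table: architectural register number -> physical name;
             initially %r8 -> %x8, %r9 -> %x9, etc. */
          int map[16];
          for (int i = 0; i < 16; ++i) map[i] = i;
          /* only %r8..%r11 appear in this example, so %x12 and up are free */
          int next_free = 12;

          /* the example: addq %r8,%r9 ; addq %r10,%r9 ; addq %r11,%r9
             encoded as {source, source-and-destination} register pairs */
          int instrs[3][2] = { {8, 9}, {10, 9}, {11, 9} };

          for (int i = 0; i < 3; ++i) {
              int src = map[instrs[i][0]];  /* read sources via the current map */
              int old = map[instrs[i][1]];  /* old version of the destination   */
              int dst = next_free++;        /* fresh physical name for result   */
              map[instrs[i][1]] = dst;      /* later reads of %r9 see new name  */
              printf("addq %%x%d, %%x%d --> %%x%d\n", src, old, dst);
          }
          /* prints:
             addq %x8, %x9 --> %x12
             addq %x10, %x12 --> %x13
             addq %x11, %x13 --> %x14 */
          return 0;
      }

----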