- branch prediction: jXX
    - processor makes a guess
        for our assignment: the guess is that jle foo will go to foo
    - processor fetches based on the guess even though it doesn't know if it's right
    - later on we figure out whether the guess is right
        how much later? example in our processor (pipehw2):

            cycle:  1 2 3
            jle     F D E
                        ^---- evaluate condition codes here
            guess1    F D
            guess2      F
                        ^---- correct wrong guess here

    - if the guess is right: we do nothing
    - if the guess is wrong: we cancel the instructions we guessed
        in our processor: just don't store them in pipeline registers (by storing a nop instead)
- branch prediction: call/ret
    - call doesn't need a prediction in our processor because we figure out the destination
      of the call in the fetch stage
        - if we couldn't (e.g. maybe we had two fetch stages), we'd need to predict it
    - ret: if we made a prediction, we'd predict what the return address is
        - done by many real processors by recording the location of the call in special registers
        - but in the textbook's design and pipehw2: stall instead of predicting
- direct-mapped v. fully-associative v. set-associative
    - these are all special cases of an X-way set-associative cache
        way: number of places to put a value within each set
        set: place to put values, determined from the address
    - if X = 1: we call this "direct-mapped"
        one way: no replacement policy
        # of sets = # of blocks: only one place to put values from each address
        # of index bits = log_2(# of sets)
    - if X = (# of blocks): we call this "fully associative"
        if # of sets = 1: we call this "fully associative" (same as X = # of blocks)
        one set: no set index bits
- capacity/compulsory/conflict
    - compulsory: first time we're accessing that block
        with this definition: changing the block size might change the # of compulsory misses
    - conflict: something about how the cache blocks are organized makes it miss
        if we had maximum flexibility and were maximally clever about where to place blocks,
        then we wouldn't have this miss
        maximum flexibility --> more ways --> fully associative cache
        being clever --> best possible replacement policy
    - capacity: we just don't have enough room to avoid these misses
        even with maximum flexibility + maximal cleverness, we couldn't avoid this miss
        without creating other/more misses
- write-allocate
    write miss: program tries to write and we don't have the value cached --- what do we do?
    write-allocate: bring it into our cache
        hope: the block it is in will be used later, so this will be faster
    write-no-allocate: don't bring it into our cache
        if we only write to the block, don't bother caching it
        (just send the write to the next level of cache)
- write-back/write-through (see the sketch after the miss-counting notes below)
    write-back: when we update our cache due to a write, we do not update the next level yet
        instead, we track that the block is "dirty" and update the next level before we get
        rid of the updated value
        NOTE: if we don't write to our cache at all, then there's nothing to track
    write-through: when we update our cache due to a write, update the next level immediately
        avoids doing updates later
- determining the number of misses with C code
    - if we need to be precise:
        - figure out what sets are accessed
            # of blocks per way in the cache = # of sets
                --> blocks before something maps to the same set
            # of bytes per way in the cache = # of sets * block size
                --> bytes before something maps to the same set
            for example: 6-way 12KB set-associative cache:
                12KB / 6 = 2KB per way in the cache: every 2KB maps to the same set
        - look at each set separately and determine what's in it throughout the loop
            if you know array[0-3] and array[1024-1027] and array[2048-2051] etc.
            are in the same set, write down a list of accesses to that set:
                array[0]     array[0-3] miss
                array[1]     array[0-3] hit
                array[1024]  array[1024-1027] miss    (if two-way: array[1024-1027], array[0-3])
                array[1025]  array[1024-1027] hit
                array[2]     array[0-3] miss
                array[3]
            annotate that list with what will be stored in the set and whether each access
            hits or misses
            if not direct-mapped, need to track (# of ways) values per set
        - either: go through all sets that are accessed, or
          look for a pattern that lets you avoid manually going through all sets
    - approximation: what's being accessed repeatedly in the inner-most loop(s)?
        if it's not too much data to fit in the cache, those accesses should be hits after
        the first access; question: how many first accesses are there?
        if even the innermost loop is too much to fit into the cache, look at subsets of the
        innermost loop to see if there's spatial locality between iterations, etc.
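- a small runnable C sketch of the set-index arithmetic used above; the geometry here
  (16-byte blocks, 256 sets, so 4096 bytes per way) is an assumed example chosen to match
  the array[0]/array[1024]/array[2048] discussion, not any particular machine's cache:

      #include <stdio.h>
      #include <stdint.h>

      #define BLOCK_SIZE 16   /* bytes per block: array[0..3] (4-byte ints) fit in one block */
      #define NUM_SETS   256  /* index bits = log2(256) = 8; 16 * 256 = 4096 bytes per way   */

      static void where(const char *name, uintptr_t addr) {
          uintptr_t offset = addr % BLOCK_SIZE;              /* byte within the block        */
          uintptr_t index  = (addr / BLOCK_SIZE) % NUM_SETS; /* which set                    */
          uintptr_t tag    = addr / BLOCK_SIZE / NUM_SETS;   /* identifies block within set  */
          printf("%-12s offset=%2lu  set index=%3lu  tag=%lu\n", name,
                 (unsigned long)offset, (unsigned long)index, (unsigned long)tag);
      }

      static _Alignas(16) int array[4096];

      int main(void) {
          /* array[0] and array[1024] are 4096 bytes apart = one way's worth of bytes, so */
          /* they land in the same set (same index, different tag) and can conflict       */
          where("array[0]",    (uintptr_t)&array[0]);
          where("array[3]",    (uintptr_t)&array[3]);    /* same block as array[0]  */
          where("array[1024]", (uintptr_t)&array[1024]); /* same set, different tag */
          where("array[2048]", (uintptr_t)&array[2048]);
          return 0;
      }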
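- a minimal sketch of the write-allocate + write-back combination described above, written
  as a toy direct-mapped cache in front of a small simulated next level; all of the names
  and sizes here are invented for illustration:

      #include <stdio.h>
      #include <string.h>
      #include <stdbool.h>

      #define BLOCK_SIZE 16
      #define NUM_SETS   4
      #define MEM_SIZE   1024

      static unsigned char next_level[MEM_SIZE];   /* stands in for the L2 / memory */

      struct line {
          bool valid, dirty;
          unsigned long tag;
          unsigned char data[BLOCK_SIZE];
      };
      static struct line cache[NUM_SETS];          /* direct-mapped: one line per set */

      static void write_byte(unsigned long addr, unsigned char value) {
          unsigned long index = (addr / BLOCK_SIZE) % NUM_SETS;
          unsigned long tag   = addr / BLOCK_SIZE / NUM_SETS;
          struct line *l = &cache[index];

          if (!l->valid || l->tag != tag) {        /* write miss */
              if (l->valid && l->dirty) {          /* write-back: flush the old block first */
                  unsigned long old_addr = (l->tag * NUM_SETS + index) * BLOCK_SIZE;
                  memcpy(&next_level[old_addr], l->data, BLOCK_SIZE);
              }
              /* write-allocate: bring the block containing addr into the cache */
              memcpy(l->data, &next_level[addr - addr % BLOCK_SIZE], BLOCK_SIZE);
              l->valid = true;
              l->tag   = tag;
              l->dirty = false;
          }
          l->data[addr % BLOCK_SIZE] = value;      /* update only the cached copy...     */
          l->dirty = true;                         /* ...and remember to update the next */
                                                   /* level when this block is evicted   */
          /* write-no-allocate + write-through would instead skip the fill above and do  */
          /* next_level[addr] = value immediately, leaving nothing to track              */
      }

      int main(void) {
          write_byte(5, 0xAA);                         /* miss: allocate block 0-15, mark dirty */
          write_byte(6, 0xBB);                         /* hit: still only in the cache          */
          write_byte(5 + NUM_SETS * BLOCK_SIZE, 0xCC); /* maps to the same set: the dirty block */
                                                       /* is written back, then replaced        */
          printf("next_level[5] = 0x%02X (written back on eviction)\n", (unsigned)next_level[5]);
          return 0;
      }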
- loop unrolling
    loops have "overhead":
        loop:
            addq (%r8, %rdi, 8), %rax    <-- "body"           M[r8 + rdi * 8]
            incq %rdi                    <-- "loop overhead"
            cmpq %rsi, %rdi              <-- "loop overhead"
            jle loop                     <-- "loop overhead"
    idea of loop unrolling: avoid doing this loop overhead for every single iteration of the
    loop by doing the iterations in groups
    example with groups of 2:
        loop:
            addq (%r8, %rdi, 8), %rax    <-- "body"
            addq 8(%r8, %rdi, 8), %rax   <-- "body"           M[8 + r8 + rdi * 8]
            addq $2, %rdi                <-- "loop overhead"
            cmpq %rsi, %rdi              <-- "loop overhead"
            jle loop                     <-- "loop overhead"
    since we work in groups of two: we execute the loop overhead half as many times overall
    BUT we need to handle the case where the number of iterations is not a multiple of two
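- the same unrolling-by-2 idea written in C (roughly what the assembly above is doing; the
  function and variable names are made up for this sketch), including the leftover iteration
  when the count is not a multiple of two:

      #include <stdio.h>

      long sum_unrolled(const long *a, long n) {
          long total = 0;
          long i;
          for (i = 0; i + 1 < n; i += 2) {   /* two copies of the body per trip, so the  */
              total += a[i];                 /* loop overhead runs about half as often   */
              total += a[i + 1];
          }
          for (; i < n; i++)                 /* leftover element when n is odd           */
              total += a[i];
          return total;
      }

      int main(void) {
          long a[5] = {1, 2, 3, 4, 5};           /* odd length exercises the leftover loop */
          printf("%ld\n", sum_unrolled(a, 5));   /* prints 15 */
          return 0;
      }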
- cache blocking v. loop unrolling
    cache blocking is about taking some nested loops and changing the order in which the
    inner body is executed
    given:
        for (i ...)
          for (j ...)
            for (k ...)
              BODY
    choose another way of iterating over i, j, k so we still execute BODY for each (i, j, k),
    but we get better locality
    vs loop unrolling: since we still have loops after we do this, we can unroll them ---
    and we will, because it's often faster
    vs loop unrolling: often we'll introduce extra loops, which gives more ways to loop unroll:
        for (i' by 16)
          for (j ...)
            for (k ...)
              for (i from i' to i'+16)    <-- new loop that's good to unroll
                BODY
    the new loop is especially good to unroll:
        we know the exact number of iterations (no leftovers)
        we can potentially unroll it completely ---> no loop overhead left
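- a concrete (hypothetical) instance of this blocking pattern, using the usual matrix-multiply
  example; N and the block size of 16 are assumed values, with N a multiple of 16 so the new
  inner loop has no leftovers:

      #define N 64

      /* same C[i][j] += A[i][k] * B[k][j] updates as the plain i/j/k loop nest (assuming */
      /* C starts zeroed); only the iteration order changes, chosen for better locality   */
      void matmul_blocked(const double A[N][N], const double B[N][N], double C[N][N]) {
          for (int ii = 0; ii < N; ii += 16)              /* "for (i' by 16)"                 */
              for (int j = 0; j < N; j++)
                  for (int k = 0; k < N; k++)
                      for (int i = ii; i < ii + 16; i++)  /* exactly 16 iterations each time,  */
                          C[i][j] += A[i][k] * B[k][j];   /* so it's easy to unroll completely */
      }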
- out-of-order processors: what can execute at the same time
    - assuming we have big enough instruction queues, reorder buffers, etc. and enough things
      that execute instructions (ALUs, read/write ports on our cache, etc.),
      the only restriction left is "do the instructions depend on each other?"
        b/c they could both be in the instruction queue and we'd just look at
        "are their inputs ready?"
        if they don't depend on each other: their inputs become ready at the same time
        (so they can execute at the same time)
- architectural versus physical registers
    - out-of-order processors have lots of types of data hazards that are more complicated
      than in our pipelined processor
    - with the pipelined processor: we had multiple versions of a register:
        - the version in the register file
        - the version that was just computed in execute, but not stored yet
        - the version that was just computed in memory or sent through the pipeline register
          to memory, but not stored yet
        - the version that was sent to the writeback stage but not stored yet
      we used forwarding to decide on the correct version
    - with an OOO processor: we also have multiple versions, but we track them EXPLICITLY
        - give a new name --- a physical register number --- to each version
            if we can't do this, stall until we can
        - use the physical register numbers to actually figure out how to execute instructions
        - physical register number: determines where the value is stored in our register file
        - typically way more physical registers available than registers in the ISA
    - architectural registers: "old name", "register in the ISA"
    - physical registers: "new name"
      (see the small renaming sketch at the end of these notes)
- inclusive/exclusive caches
    - examples with L1 and L2, but can also consider L2 and L3, etc.
    - inclusive cache: everything in the L1 is also stored in the L2
        if we evict from the L1, the value will still be in the L2 later
        (usually good --- because we often evict values which will still be used)
        if the L2 talks to multiple L1s, then either can get the value from the L2:
            for example: if we have one L2 for multiple processors and they share memory,
            then it doesn't matter which of the two processors is accessing the memory
            location for whether they'll find a value in the L2
        duplicated data: some of the L2 space is basically not used most of the time
    - exclusive cache: everything in the L1 is not stored in the L2
        if we evict from the L1, we probably store the value in the L2 then
            L1 eviction is more complicated
        if the L2 talks to multiple L1s, need to decide what to do when one of the L1 caches
        wants to store a value
        not duplicating data: more of the L2 stores data we might access in the future from it
    - can also have a non-inclusive, non-exclusive cache
        some things in the L1 are stored in the L2, but not everything
----
- latency/throughput bounds
    - throughput bound: "how much can the hardware do (if we don't worry about dependencies)?"
        if we have two ALUs that each take 5 cycles to do a multiply and are pipelined:
            we can start two multiplies every cycle
            even though in practice this might not be possible because of the multiplies
            depending on each other
        for figuring out the throughput bound for a calculation: what's the thing the
        calculation needs that's slowest, assuming we work to reorder things to avoid
        dependencies (e.g. with reassociation/multiple accumulators...)?
            for example: if we want to do adds + load data from the array, we need to consider
            how fast the processor can add and how fast the processor can load data from
            the array
    - latency bound: "how much can the hardware do, given that we have to do this thing one
      at a time?"
        have some operations that need to happen one after another
        performance is determined by how long it takes to get one result and then feed it
        into the next operation
        consider the time from starting a computation to finishing it
        NOT what pipelining helps with
        if we have two ALUs that take 5 cycles to do a multiply and are pipelined:
            we can start the next multiply 5 cycles after the first
            we end up doing one multiply every 5 cycles
        latency bound for a computation: what's the "critical path" in the data flow graph?
            --> critical path = things we need to do one after the other
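- a small C illustration of the latency bound vs throughput bound distinction above: the first
  loop is one long dependence chain, the second uses two accumulators (reassociation) so that
  independent multiplies can overlap; the names and the choice of two accumulators are just
  for this sketch:

      #include <stdio.h>

      /* one long chain: each multiply needs the previous result, so this runs at the   */
      /* latency bound (one multiply result every ~5 cycles in the example above)       */
      double product_single(const double *a, long n) {
          double p = 1.0;
          for (long i = 0; i < n; i++)
              p *= a[i];
          return p;
      }

      /* two independent chains (multiple accumulators): an out-of-order processor can  */
      /* overlap them, moving toward the throughput bound; note this reassociates the   */
      /* floating-point multiplies, which can change rounding slightly                  */
      double product_dual(const double *a, long n) {
          double p0 = 1.0, p1 = 1.0;
          long i;
          for (i = 0; i + 1 < n; i += 2) {
              p0 *= a[i];
              p1 *= a[i + 1];
          }
          for (; i < n; i++)      /* leftover element if n is odd */
              p0 *= a[i];
          return p0 * p1;
      }

      int main(void) {
          double a[5] = {1.5, 2.0, 3.0, 0.5, 4.0};
          printf("%f %f\n", product_single(a, 5), product_dual(a, 5));  /* both print 18.0 */
          return 0;
      }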
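- a toy sketch of the register renaming idea from the architectural-vs-physical-registers notes
  above; the table sizes and names are invented for this sketch, and a real processor tracks
  much more (freeing physical registers at retirement, recovering the map on mispredictions, etc.):

      #include <stdio.h>

      #define NUM_ARCH 16
      #define NUM_PHYS 64

      static int map_table[NUM_ARCH];   /* architectural register -> current physical register */
      static int free_list[NUM_PHYS];   /* stack of unused physical register numbers           */
      static int free_count;

      static void init(void) {
          for (int a = 0; a < NUM_ARCH; a++) map_table[a] = a;   /* r_a starts out in p_a */
          free_count = 0;
          for (int p = NUM_ARCH; p < NUM_PHYS; p++) free_list[free_count++] = p;
      }

      /* rename one instruction "dst = src1 op src2": sources read the current mapping; the   */
      /* destination gets a brand-new physical register, so older in-flight instructions that */
      /* still need the old version of dst are unaffected                                     */
      static void rename(int dst, int src1, int src2) {
          int p1 = map_table[src1], p2 = map_table[src2];
          if (free_count == 0) {                  /* no new name available: stall */
              printf("stall: no free physical registers\n");
              return;
          }
          int pd = free_list[--free_count];
          map_table[dst] = pd;
          printf("r%d = r%d op r%d   -->   p%d = p%d op p%d\n", dst, src1, src2, pd, p1, p2);
      }

      int main(void) {
          init();
          rename(1, 2, 3);   /* r1 = r2 op r3: r1 gets a new physical register              */
          rename(4, 1, 1);   /* r4 = r1 op r1: reads the physical register just given to r1 */
          rename(1, 1, 4);   /* r1 = r1 op r4: yet another physical register for r1         */
          return 0;
      }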