- relocations and symbol tables
  ~ in an object file, we have the machine code for the assembly, BUT:
    we need to know about references to addresses that aren't determined yet
    (not determined yet because we haven't decided where to put them in memory yet)
    >> RELOCATIONS are for this
       they say "we need to fix an address in the machine code at this location"
       "and here's the name you use to find out what the address will be"
    example: "call printf" --> need a relocation to say "when you figure out
       where printf is, fix the call to actually point to its address"
    example: if there's a global variable 'global_counter' and you do
       'movl %eax, global_counter', then we need a relocation to say
       "where in memory is the global variable 'global_counter' going to be stored"
       (and global_counter probably ends up stored in the data part of the
       program's memory)
  ~ we need to know about things in this file that other assembly might reference
    >> SYMBOL TABLES are mainly for this
    e.g. if assembly defines a function square:
        '.global square
         square: ...'
    then we'll have a symbol table entry to indicate where the machine code for
    square is in the object file
    (the machine code doesn't have a way of representing "square:")
- masks (bitwise)
  ~ choose what bits we want to do something to
    select those bits with a "mask", which is a number that has 1s in those bits
    and 0s otherwise
    (or sometimes 0s in those bits and 1s otherwise --- depending on how we're
    using the mask)
  ~ can think of bitwise operators as applying a mask
    bitwise AND with a mask with 1s for selected bits and 0s otherwise:
       "keep the selected bits and clear the other bits"
    bitwise OR with a mask with 1s for selected bits and 0s otherwise:
       "set the selected bits and don't change the other bits"
    bitwise XOR with a mask with 1s for selected bits and 0s otherwise:
       "flip the selected bits and don't change the other bits"
- pipelining, generally
  ~ we have a bunch of steps in an instruction
  ~ rather than doing step 1 of instruction A, step 2 of A, ..., step 5 of A,
    step 1 of B, step 2 of B, ...
    we'll do step 1 of A, {at the same time: step 2 of A, step 1 of B},
    {at the same time: step 3 of A, step 2 of B, step 1 of C}, ...
  ~ this works well because usually, the things we want to use for step 1 are
    different than for step 2 or step 3 or step 4
  * example:
      step 1 might use the instruction memory (fetch)
      step 2 uses the register-read part of the register file (decode)
      step 3 uses the ALU (execute)
      ---> these are all mostly independent, so we can do the different
      instructions' steps at the same time, mostly
- register delays and critical paths
  ~ pipelines within a circuit need some way of storing the values for the
    later stages (later steps) while the earlier stages are working
  ~ generally, to do this we'll have a "pipeline register"
  ---
  ~ performance of a circuit is determined by how much time it takes to do a cycle
  ~ in a cycle, we need to have enough time for values to get from our register
    outputs --- which change when the clock cycle starts (rises) ---
    to our register/storage inputs --- which need to be ready when the clock
    cycle finishes (next rising of the clock)
    [with the storage components we used in the class]
  ~ the circuit can do multiple things at once because if we have two wires,
    they both work at the same time if they're connected to different things
  ~ so what matters is which path from outputs to inputs takes the longest
    called the "critical path" --> determines cycle time
  ~ when we're considering how long a path from input to output takes, we can't
    assume the registers don't take any time
    (b/c registers are physical things that can't work instantaneously)
    so the register needs an amount of time we call the "register delay"
    [it turns out that some of this time is after the rising edge of the clock
    and some is before, but we never broke down how much is which part]
  ~ that "register delay" needs to be taken into account when computing the
    critical path time
  ~ usually just one register delay, b/c if we have a circuit that goes
        register ---> [stuff] ---> register
    we need to spend
        the time after the rising edge for the first register
      + the time before the rising edge for the second register
      = total register delay for a single register
- stalling and forwarding
  ~ sometimes we need values in a pipelined processor that
won't be ready yet:

                          0  1  2  3  4  5
    mrmovq ...., %rax     F  D  E  M  W
    addq %rax, %r8           F  D  E  M  W   <--- PROBLEM!

    if we do nothing special, then we're trying to read %rax in cycle 2
    (addq's D) even though mrmovq won't write it until cycle 4 (mrmovq's W)
  ~ simplest solution is to have the addq wait until the value's ready:

                          0  1  2  3  4  5  6  7  8
    mrmovq ...., %rax     F  D  E  M  W
    addq %rax, %r8           F  d  d  d  D  E  M  W

    in this case, we have three cycles where the addq can't make progress
    because it waits for the value of %rax to be ready to read
    to implement this:
    > we want to keep the fetch-to-decode registers outputting the info for
      the addq
    > we want the decode-to-execute registers not to output the values for the
      addq (if we did, we'd have duplicated the addq instruction)
      but we can't actually have the decode-to-execute registers output
      nothing, so we'll have them output the values for a nop, called a
      "pipeline bubble":

                          0  1  2  3  4  5  6  7  8
    mrmovq ...., %rax     F  D  E  M  W
    addq %rax, %r8           F  d  d  d  D  E  M  W
    "bubble"                          E  M  W
    "bubble"                             E  M  W
    "bubble"                                E  M  W

    we inserted this "bubble" = nop because we can't have the execute stage
    have no input (it always has some input)
  ~ the stalling-only solution is unsatisfying:

                          0  1  2  3  4  5  6  7  8
    mrmovq ...., %rax     F  D  E  M^ W
    addq %rax, %r8           F  d  d  d  D  E  M  W

    the value is read and available at the end of cycle 3 (the ^)
    [end of mrmovq's memory stage]
    BUT we don't actually get it to the addq instruction until later
    because we need to have it go through the register file
    partial fix: let's send it without going through the register file
    called "forwarding" ---> we'll add an extra wire and a MUX to select that
    wire to replace the value we would read from the register file
    after we've done this:

                          0  1  2  3  4  5  6
    mrmovq ...., %rax     F  D  E  M^ W
    addq %rax, %r8           F  d  D  E  M  W

    the value that will be stored in %rax is sent to the decode stage, and
    selected using a MUX instead of the register file output
    [forwarding from the end-of-memory to the end-of-decode]
    [could also do beginning-of-writeback to beginning-of-execute, but probably not what you
implemented in the HW and not what our textbook does]
- finding forwarding paths
  ~ "forwarding" --> sending the value without going through the register file
  ~ where can we forward from?
    when is the value that we're going to write to a register computed?
    (can't forward a value that isn't computed yet)
    (computed includes reading from memory, etc.)
  ~ where can we forward to?
    it needs to be when we would read it from the register file (so we know
    what to read) or sometime after this but before we've used it

                          0  1  2  3  4  5
    mrmovq ...., %rax     F  D  E  M* W
    addq %rax, %r8           F  D  E* M  W   <-- FORWARDING WOULD NOT WORK,
                                                 b/c it's too late
    but contrast:

                          0  1  2  3  4  5
    mrmovq ...., %rax     F  D  E  M* W
    rmmovq %rax, ....        F  D  E* M  W   <-- forwarding okay,
                                                 haven't used %rax yet
- cache miss types and cache design changes
  ~ "CCC" taxonomy of cache misses:
    (based on a fixed block size)
    "compulsory misses" --> first time we access a block
        compulsory because no change to the cache (that keeps the same block
        size) will avoid them
    "conflict misses" --> misses that would be avoided with more associativity
        would not be a miss in a fully associative cache (1-set cache) of the
        same size with the best replacement policy
        we're not flexible enough about how we use the space in our cache
    "capacity misses" --> misses that would not be avoided with more
        associativity, but could be avoided with a bigger cache
        (not the only access)
        our cache isn't big enough
  ~~~~
  ~ if we categorize our misses, we want different strategies to fix each type
    of miss:
    compulsory misses:
        good ideas for reducing misses:
          ~ prefetching?
          ~ larger block size?
          ...
        bad ideas for reducing misses:
          ~ bigger cache
          ~ fancier replacement policy
          ...
    conflict misses:
        good ideas for reducing misses:
          ~ more ways
          ~ smarter replacement policies
          ~ larger cache
    capacity misses:
        good ideas for reducing misses:
          ~ larger cache
        bad ideas:
          ~ smarter replacement policies
          ~ more ways
          ~ ...
- cache blocking and loops
  ~ cache blocking, given:
        for (i in ...)
          for (j in ...)
            f(A[i], B[j])
    we have bad temporal locality in B and good temporal locality in A
    and we could change things to be like:
        for (j in ...)
          for (i in ...)
            f(A[i], B[j])
    and then we'll have bad locality in A and good locality in B
    cache blocking is a pattern where we find a compromise that works well
    with both indices:
        for (range of Is in ...)
          for (range of Js in ...)
            for (i in range of Is)
              for (j in range of Js)
                f(A[i], B[j])
    because the ranges of Js are not that big, we have better locality in B
    than the first example
    (because after accessing some B[j], we'll access it relatively soon again
    when we advance to the next i, rather than waiting for us to go through
    all the other Js first)
    because the ranges of Is are not that big, we have better locality in A
    than the second example
    (we'll access some A[i] again pretty soon relative to going through all
    the other J values)
  ~ the same pattern applies to two-dimensional array indexing
    (where staying in the same row has better spatial locality than just
    staying in the same column, assuming rows are stored together)
  ~ another view of:
        for (range of Is in ...)
          for (range of Js in ...)
            ****
            for (i in range of Is)
              for (j in range of Js)
                f(A[i], B[j])
            ****
    is that the loops between the ****s can fit everything they work on in the
    cache, so we should have just one set of cache misses for all the A, B
    values each time they run, versus the original:
        for (i in ...)
          ****
          for (j in ...)
            f(A[i], B[j])
          ****
    where we'd have one set of cache misses for B every time we ran that
    innermost loop
- OOO, instruction queue
  ~ out-of-order execution:
    we fetch a whole bunch of instructions
    we do register renaming [or something similar] to figure out which
    instructions depend on which other instructions
    we find instructions where all their dependencies are ready, and run as
    many of them as possible as soon as possible
  ~ what's possible? how many ALUs, etc.
    do we have ("execution units")
    in practice: usually the dependencies are the main limitation:
    if we have 10 ALUs, but only one add which has its values ready, we can
    run just one add
  ~ instruction queue
    list of instructions that are processed, but may not have their
    dependencies ready
    the processor scans this list and checks which ones have become ready
    it selects as many as it can to actually run on the execution units
- exceptions and kernel/user mode
  ~ user mode: disallows "dangerous" operations
    (dangerous --- as defined by the processor maker)
  ~ kernel mode: allows "dangerous" operations
  ~ when an exception occurs, we run the OS and switch to kernel mode (if we
    weren't already)
    --> the OS code gets to use dangerous operations
  ~ when we return from an exception, by default, the processor will restore
    user mode (if the exception we're returning from occurred in user mode)

~~~ round 2 ~~~
- replacement policy types