- exam date 10 December: probably 12PM 10th --> 12PM 11th
- topics on exam
    - everything that we covered in lecture/quizzes/labs/HWs
- grading scale and CRs
    - what would be a C- in a "normal" semester will be a C/CR
    - minimally 70 --> C, 80 --> B-, etc.
- object file / executable contents
    ~ assembly files
        human-readable representation of instructions (not ready for the
        processor to use)
        labels to identify the locations of things in the same or other files
    ~ object files
        machine-readable representation of the instructions and of constant
        data (sometimes, multiple "segments" of machine code/data that could
        become part of an executable later)
        BUT labels can't all be resolved into the addresses the processor
        expects, so to deal with labels:
        (1) [relocations] where the labels are used
                example: "call printf" (uses label printf)
                example: "jle top_of_loop" (uses label top_of_loop)
            we put SOMETHING in the machine code for the instruction and
            generate a "relocation" saying that the SOMETHING is wrong and
            needs to be replaced
            information in the relocation --- separate table (not in machine
            code/data):
            * location of the SOMETHING (the placeholder)
            * name of the label, which we'll find by looking at the files
              used to make an executable
            (see the relocation-patching sketch in C after the bitwise notes
            below)
        (2) where the labels are defined:
                "top_of_loop: ..."
                "printf: ..."
            we need to note the location within the machine code of that
            label (the processor doesn't have a "mark label" instruction,
            so this is out-of-band)
            this is an entry in the *symbol table*, and the labels are
            called *symbols*
            information in the symbol table --- separate table (not in
            machine code/data):
            * name of the label
            * its location in the file
    ~ executables
        combine object files and replace relocations (placeholders) with
        addresses the processor will understand
        when we're done, we should have machine code+data that can just be
        loaded into memory and jumped to
- relocation versus symbol table
    (relocations = where labels are used; symbol table = where labels are
    defined --- see above)
- bitwise operation use
    "0x012 3456 789A"   extract bits (12+10+10=) 32 to 42 (bit 1 is LSB)
        >> 32 = 0x12         --- removes the bottom 32 bits
        & 0x3FF              --- keeps the bottom 10 bits
                                 0x3FF HEX = 0011 1111 1111 BIN (10 bits set)
        0x12 & 0x3FF = 0x12  --- bits 32 to 42 of the original value
    "0x012 3456 789A"   extract bits (12+10=) 22 to 32 (bit 1 is LSB)
        >> 22 = 0x48D1       --- removes the bottom 22 bits
        & 0x3FF = 0xD1
        -- alternative: keep bits 22 to 32, then shift over --
        & 0xFFC00000         (= 1111 1111 1100 0000 0000 0000 0000 0000 BIN)
          = 0x3440 0000
        >> 22 = 0xD1
    * to select particular bits: AND with them (examples above)
    * to do other things to particular bits:
        OR  --- to set them
        XOR --- to flip them
    * to move bits to the appropriate part of a number, bitshift with << and >>
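    a minimal sketch in C of the extractions above (extract_bits is a
    made-up helper name, not something from the course materials):

        #include <stdio.h>
        #include <stdint.h>

        /* keep `count` bits of `value`, after removing the bottom `low` bits */
        uint64_t extract_bits(uint64_t value, int low, int count) {
            uint64_t mask = ((uint64_t)1 << count) - 1;  /* count=10 --> 0x3FF */
            return (value >> low) & mask;
        }

        int main(void) {
            uint64_t va = 0x0123456789A;
            printf("%llx\n", (unsigned long long) extract_bits(va, 32, 10)); /* 12 */
            printf("%llx\n", (unsigned long long) extract_bits(va, 22, 10)); /* d1 */
            /* alternative order: mask first, then shift */
            printf("%llx\n", (unsigned long long) ((va & 0xFFC00000) >> 22)); /* d1 */
            return 0;
        }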
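    and a minimal sketch of how relocations + the symbol table get used when
    building an executable; the struct layouts and 4-byte absolute-address
    patches are simplifying assumptions, not the real ELF format:

        #include <stdint.h>
        #include <string.h>

        struct symbol     { const char *name; uint32_t address;  }; /* label defined */
        struct relocation { const char *name; uint32_t location; }; /* label used    */

        /* replace each placeholder (the SOMETHING) in the machine code
           with the address recorded for that symbol */
        void apply_relocations(uint8_t *code,
                               const struct relocation *rels, int nrels,
                               const struct symbol *syms, int nsyms) {
            for (int r = 0; r < nrels; ++r)
                for (int s = 0; s < nsyms; ++s)
                    if (strcmp(rels[r].name, syms[s].name) == 0) {
                        memcpy(code + rels[r].location,
                               &syms[s].address, sizeof(uint32_t));
                        break;
                    }
        }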
- RISC v CISC
    Reduced Instruction Set Computer v Complex Instruction Set Computer
    a RISC-like instruction set is easier for the hardware designer
    but typically makes more work for the assembly programmer/compiler
    common attributes of RISC:
    - instructions do less stuff
    - typically fewer ways of specifying operands to instructions
        - typically no instructions that both read+write from memory
        - typically no/fewer instructions that write multiple registers
        - instructions that do arithmetic don't also access memory
    - more general-purpose registers
        - makes up for fewer instructions being able to access memory
        - it's an easy way to use extra hardware to possibly make programs
          faster
    it's hard to find examples of things which are "definitely completely
    RISC" or "definitely completely CISC" --- usually somewhere in between
- amount of stalling due to data dependencies without forwarding
                    v--- when we write registers
    F | D | E | M | W
        ^^^ when we read registers

    addq %rax, %rbx                 F | D | E | M | W
    extra instruction 1                 F | D | E | M | W
    extra instruction 2                     F | D | E | M | W
    extra instruction 3                         F | D | E | M | W
    first instruction that can read                 F | D | E | M | W
      %rbx w/o forwarding                               ^^^

    with a combined execute+memory stage:
    F | D | EM| W

    addq %rax, %rbx                 F | D | EM| W
    extra instruction 1                 F | D | EM| W
    extra instruction 2                     F | D | EM| W
    first instruction that can read             F | D | EM| W
      %rbx w/o forwarding                           ^^^
- Quiz 8: setting stall/bubble
    cycle:          0   1
    W
    M | W
    E | M | W
    D | E | M | W
    F | D | E | M | M | W      <-- scenario in quiz question: need to
                                   "stay in memory stage until cycle 1"
                  | W          <-- inserted nop
        F | D | E | E | M | W
            F | D | D | E | M | W
                F | F | D | E | M | W
                ^^^
                 \-- quiz Q1-4: what's happening here
    fetch to decode     --> decode in cycle 1 is outputting the instruction
                            that was in decode in cycle 0
    decode to execute   --> execute in cycle 1 is outputting the instruction
                            that was in execute in cycle 0, which means that
                            cycle 1 repeats the same calculation with the
                            same inputs as cycle 0 did
    execute to memory   --> memory in cycle 1 is outputting the instruction
                            that was in memory in cycle 0
    memory to writeback --> "no instruction" in writeback in cycle 1?
                            --- must be an inserted nop
- write-through v write-back
    when we want to write a value using a cache:
    (1) we look up the value in the cache
    (2) if the value's not in the cache:
            if the cache is write-allocate, we add it to the cache
                [could involve eviction]
            if the cache is not write-allocate, we write it to the next
                level and stop
    (3) if the value is in the cache now (after lookup + any write-allocate):
            update the value in the cache
    (4a) if it is a write-through cache, we also write to the next level
    (4b) if it is a write-back cache, we update the dirty bit to remember to
         write it later
    when we want to evict a value from the cache:
    (1) if it is a write-back cache AND the value's dirty bit is 1, we write
        the value to the next level
    (2) we discard the value
    (C sketch of this write path below, after the fully-associative notes)
- fully associative
    - # sets = 1 --> 0 index bits
    - degenerate case of a set-associative cache where everything's in the
      same set and we have to compare all the tags (one for each block stored)
    - cache with no conflict miss problems
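    a minimal sketch in C of the write path above, for one cache block; the
    struct and the commented-out helper (write_next_level) are made up for
    illustration, not a full cache simulator:

        #include <stdbool.h>

        struct block { bool valid, dirty; long tag; /* ...data bytes... */ };

        void handle_write(struct block *b, long tag,
                          bool write_allocate, bool write_through) {
            bool hit = b->valid && b->tag == tag;      /* (1) lookup */
            if (!hit) {
                if (!write_allocate) {
                    /* (2) write it to the next level and stop */
                    /* write_next_level(tag); */
                    return;
                }
                /* (2) add to the cache; evicting a dirty block writes it back */
                if (b->valid && b->dirty) { /* write_next_level(b->tag); */ }
                b->valid = true; b->dirty = false; b->tag = tag;
            }
            /* (3) update the value in the cache here */
            if (write_through) { /* (4a) also write to the next level */ }
            else               { b->dirty = true; /* (4b) write it later */ }
        }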
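    and, leading into the Quiz 9 trace below, a sketch of how addresses
    split for that cache (2 sets --> 1 index bit, 8B blocks --> 3 offset
    bits); the variable names are just for illustration:

        #include <stdio.h>

        int main(void) {
            unsigned addrs[] = { 0x104, 0x101, 0x108, 0x110, 0x118 };
            for (int i = 0; i < 5; ++i) {
                unsigned a      = addrs[i];
                unsigned offset = a & 0x7;         /* bottom 3 bits: byte in block */
                unsigned index  = (a >> 3) & 0x1;  /* next 1 bit: which set */
                unsigned tag    = a >> 4;          /* everything else */
                printf("0x%03x -> set %u, offset %u, tag 0x%x\n",
                       a, index, offset, tag);
            }
            return 0;
        }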
- Quiz 9: when do we hit in a direct-mapped cache?
    Questions 3-4 --- 2 sets, 8B blocks, write-back, write-allocate
                      ^^ --- blocks are 0x0->0x7, 0x8->0xF, 0x10->0x17, ...
    read from 0x104   miss
        cache contains: set 0 [0x100-0x107]          set 1 []
    write to 0x101    hit
        cache contains: set 0 [0x100-0x107, dirty]   set 1 []
    write to 0x102    hit
        cache contains: set 0 [0x100-0x107, dirty]   set 1 []
    write to 0x108    miss
        cache contains: set 0 [0x100-0x107, dirty]   set 1 [0x108-0x10F, dirty]
    read from 0x109   hit
        cache contains: set 0 [0x100-0x107, dirty]   set 1 [0x108-0x10F, dirty]
    read from 0x10F   hit
        cache contains: set 0 [0x100-0x107, dirty]   set 1 [0x108-0x10F, dirty]
    read from 0x110   miss, requires replacing something dirty
                      --> writes back a value
        cache contains: set 0 [0x110-0x117]          set 1 [0x108-0x10F, dirty]
    read from 0x118   miss, requires replacing something dirty
                      --> writes back a value
        cache contains: set 0 [0x110-0x117]          set 1 [0x118-0x11F]
    write to 0x119    hit
        cache contains: set 0 [0x110-0x117]          set 1 [0x118-0x11F, dirty]
- loop orders and cache miss counts
    for (i = 0; i < N; ++i)
        for (j = 0; j < N; ++j)
            A[i], B[j], C[j*N], D[i*N]
    looking at the difference between what's accessed in the innermost loop
    as a quick estimate of misses:
    A[i]   --> we expect good temporal locality --> hit every time
               (~0 misses per iteration of the j loop)
    B[j]   --> we expect good spatial locality --> hit every time in the
               same cache block (~1/K misses per iteration of the j loop,
               where K is the # of items in a cache block)
    C[j*N] --> we expect bad locality of all kinds --> miss every time
               (1 miss per iteration of the j loop)
    D[i*N] --> we expect good temporal locality --> hit every time
               (~0 misses per iteration of the j loop)
    if we swap i and j --> we get different locality
        sometimes: obvious improvement
        other times: might have to run it to determine, if performance matters
    (loop-interchange sketch in C below, after the aliasing notes)
- aliasing
    for (i = 0; i < 3; ++i)
        for (j = 0; j < 3; ++j)
            for (k = 0; k < 3; ++k)
                A[i*3+k] += B[k*3+i];
    the loop order matters BECAUSE A and B could be parts of the same array
    if we set A and B to pointers to parts of the same array, then the
    compiler can't do optimizations like changing the loop order or keeping
    values in registers
    logic for figuring this out:
    > consider, e.g., A = &array[0], B = &array[K]; discover that you're
      writing things and then reading them back, and if you change the
      order of this, that's going to change the answer
    the compiler needs to worry about this b/c its code needs to take that
    into account
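    a sketch in C of the aliasing problem above; `restrict` (C99) is one way
    to promise the compiler that the pointers don't overlap (the function
    names here are made up):

        /* the compiler must assume writes through A may change later reads
           through B, so it can't reorder the loops or cache B in registers */
        void update(double *A, double *B) {
            for (int i = 0; i < 3; ++i)
                for (int j = 0; j < 3; ++j)
                    for (int k = 0; k < 3; ++k)
                        A[i*3+k] += B[k*3+i];
        }

        /* a legal call where they DO overlap (array needs >= 10 elements),
           so the compiler's caution is justified */
        void demo(double *array) { update(&array[0], &array[1]); }

        /* restrict = programmer's promise of no overlap; now the compiler
           may interchange loops / keep values in registers */
        void update_restrict(double *restrict A, double *restrict B) {
            for (int i = 0; i < 3; ++i)
                for (int j = 0; j < 3; ++j)
                    for (int k = 0; k < 3; ++k)
                        A[i*3+k] += B[k*3+i];
        }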
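    and a sketch of the loop-interchange idea from the miss-count notes
    above (N and the global arrays are placeholders):

        #define N 1024
        double A[N], B[N], C[N * N], D[N * N];

        /* i-then-j: C[j*N] strides by N in the inner loop --> ~1 miss per
           access; A[i] and D[i*N] don't change in the inner loop */
        void ij_order(void) {
            for (int i = 0; i < N; ++i)
                for (int j = 0; j < N; ++j)
                    A[i] += B[j] + C[j * N] + D[i * N];
        }

        /* j-then-i: now C[j*N] and B[j] are fixed in the inner loop, but
           D[i*N] strides by N instead --> might have to run both to see
           which wins */
        void ji_order(void) {
            for (int j = 0; j < N; ++j)
                for (int i = 0; i < N; ++i)
                    A[i] += B[j] + C[j * N] + D[i * N];
        }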
- Quiz 11 Q1 --- OOO processor latency
    1-cycle adder; 3-cycle multiplier, pipelined
    a = b + c;  d = e * f;  h = a + d;  i = a * d;

                1     2     3     4     5     6
    adder:     b+c               a+d
    mult s1:   e*f               a*d   <-- couldn't start any earlier
                                           b/c 'd' wasn't computed yet
    mult s2:         e*f               a*d
    mult s3:               e*f               a*d
               <---------------------------------> = 6 cycles of time needed
- processor versus OS jobs
    the processor is the hardware
        typically handles address translation
        calls the OS when programs do something weird, using an exception
        knows what the current registers + page table + exception base
        pointer are, but otherwise not what particular programs are running
    the OS (operating system) is software
        typically the OS tracks what programs are running, what memory is
        allocated, etc.
- first page table entry address in multi-level table in Quiz 13
    3 x 10-bit VPN parts, 12-bit PO
    VA 0x012 3456 789A
    PTBR 0x44 000
    VPN part 1 is bits 32 to 42 of the VA: 0x12
    address of PTE = PTBR + VPN part 1 * size of PTE
                   = 0x44000 + 0x12 * 4 = 0x44048
    quiz is miskeyed --> will fix after lecture
    (page-walk sketch in C at the end of these notes)
- page table size from Quiz 13
    20-bit VA, 4-byte PTEs, 1024-byte pages
    1024-byte pages --> 10-bit page offset
        <-------- 20-bit VA ------->
        |    VPN     |     PO      |
        <---- 10 ----><---- 10 --->
    minimum VPN = 0
    maximum VPN = 2^10 - 1
    number of virtual pages = 2^10
    single-level page table --> page table size
        = (number of virtual pages) * PTE size
        = 2^10 * 4 = 2^12 = 4096 bytes
- three-level lookup result Q in Quiz 13
    the page table entry for the first level was valid and contained
    physical page number 0x99, the page table entry for the second level
    was valid and contained physical page number 0xA4, and the page table
    entry for the third level was valid and contained physical page number
    0xC3, when looking up virtual address 0x012 3456 789A
    final PPN: 0xC3 (the earlier levels' PPNs just locate the next table)
                        vvvv--- same as in the VA
        final PA: | PPN | PO |
                   ^^^^^ 0xC3 from the third level
        = 0xC389A --- the PO is 0x89A because it's the last 12 bits of the
                      VA (12 bits was the size of the PO in this question)
- TLBs
    lookup using the full VPN
    when the lookup works: we don't access ANY page tables
        what we look up is the entry we'd get at the END of the page table
        lookup process
    when the lookup doesn't work: we take the value from the end of the
        page table lookup process and remember it for next time
    TLBs are stored like caches, splitting the VPN into tag and index
        (the offset is 0 bits; VPN instead of physical address)
    a TLB's data is a PTE (not bytes from program memory like in caches)
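    a sketch in C of the Quiz 13 address arithmetic above (the 4-byte PTE
    size comes from the quiz setup; the variable names are made up):

        #include <stdio.h>
        #include <stdint.h>

        int main(void) {
            uint64_t va   = 0x0123456789A;
            uint64_t ptbr = 0x44000;

            uint64_t vpn1 = (va >> 32) & 0x3FF;   /* 0x12  */
            uint64_t vpn2 = (va >> 22) & 0x3FF;   /* 0xD1  */
            uint64_t vpn3 = (va >> 12) & 0x3FF;   /* 0x167 */
            uint64_t po   = va & 0xFFF;           /* 0x89A */

            uint64_t first_pte = ptbr + vpn1 * 4;        /* 0x44048 */
            uint64_t ppn       = 0xC3;  /* from the third-level PTE */
            uint64_t pa        = (ppn << 12) | po;       /* 0xC389A */

            printf("vpn1=%llx first PTE at %llx, PA=%llx\n",
                   (unsigned long long) vpn1,
                   (unsigned long long) first_pte,
                   (unsigned long long) pa);
            return 0;
        }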
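    and a sketch of the TLB tag/index split described above (the 16-set
    direct-mapped shape is an assumption for illustration):

        #include <stdint.h>

        #define TLB_SETS 16     /* 16 sets --> 4 index bits, 0 offset bits */

        struct tlb_entry { int valid; uint64_t tag; uint64_t pte; };
        struct tlb_entry tlb[TLB_SETS];

        /* returns 1 and fills *pte on a hit (no page table access at all);
           on a miss, the caller does the full page-table walk and then
           stores the resulting PTE here for next time */
        int tlb_lookup(uint64_t vpn, uint64_t *pte) {
            uint64_t index = vpn % TLB_SETS;   /* low bits of the VPN */
            uint64_t tag   = vpn / TLB_SETS;   /* remaining high bits */
            if (tlb[index].valid && tlb[index].tag == tag) {
                *pte = tlb[index].pte;
                return 1;
            }
            return 0;
        }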