- is the test cumulative?
    YES -- with more questions (~half) from post-exam-2 material
    specific information from previous exams: no, but all prior lecture/HW material
        is fair game
- vector instructions --- what is tested
    no particular instructions or intrinsics to memorize; we will tell you what you need
        e.g. if you need to know about _mm_add_epi16 or paddd, etc., we will give
        a brief description with the name
- address splitting: VPN, TLB index bits, etc.
    virtual addresses    [ VPN | PO ]
                            |
                            \-- looked up in the TLB
                            v
    physical addresses   [ PPN | PO ]
    VPN = virtual page number
    PPN = physical page number = physical frame number = frame number
    PO = page offset (= VPO virtual page offset = PPO physical page offset)
    PO bits = # bits needed to find a byte in a page
        e.g. 2^12 byte pages --> 12-bit page offsets
    TLB:                        same idea as:
    [ VPN              ]        [ physical address                       ]
    [ TLB tag ][ index ]        [ cache tag ][ cache index ][ offset ]
               ^-- no offset: 1 entry/block
    TLB index = # bits needed to identify a TLB set
        e.g.
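The VPN/PO and tag/index splits above are just bit masking and shifting. A minimal sketch (the 2^12-byte pages, 16-set TLB, and the example address are hypothetical parameters, not fixed by the notes):

```python
# Sketch: split a virtual address into VPN and PO, then split the VPN
# into TLB tag and TLB index.  Parameters below are example choices.
PAGE_SIZE = 4096          # 2^12-byte pages --> 12 page-offset bits
TLB_SETS = 16             # 2^4 sets --> 4 TLB index bits

PO_BITS = PAGE_SIZE.bit_length() - 1      # log2(4096) = 12
INDEX_BITS = TLB_SETS.bit_length() - 1    # log2(16) = 4

def split_virtual_address(va):
    po = va & (PAGE_SIZE - 1)             # low PO_BITS bits
    vpn = va >> PO_BITS                   # everything above the page offset
    tlb_index = vpn & (TLB_SETS - 1)      # low INDEX_BITS bits of the VPN
    tlb_tag = vpn >> INDEX_BITS           # remaining VPN bits
    return vpn, po, tlb_index, tlb_tag

# 0xDEADBEEF: VPN = 0xDEADB, PO = 0xEEF; the VPN's low 4 bits (0xB)
# are the TLB index and the rest (0xDEAD) is the TLB tag.
print(split_virtual_address(0xDEADBEEF))
```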
        16-set TLB = 2^4-set TLB --> 4-bit TLB index
    finding # of sets --- same idea as # of sets in a cache
        VPN size = index bits + tag bits
        index bits = log_2(# sets)
        TLB size (entries) = # sets * entries/set
        TLB entry = page table entry --- never have multiple PTEs in a block
- multi-level page table lookup
    splitting [ VPN | PO ] on TLB miss:
        [ VPN part 1 | VPN part 2 | PO ]
          ^^^^^^^^^^
          \-- look up the first-level page table entry using
              PTBR + VPN part 1 * size of PTE
              its physical page # = location of the second-level page table
                      ^^^^^^^^^^
                      \-- look up the second-level page table entry using the
                          location of the second-level table (from the first level)
                          + VPN part 2 * size of PTE
        if last level (this example): contains PPN for the final address
        if not: contains another page table location
- TLB and multi-level page tables
    the TLB *only holds last-level page table entries*
    we always look up a FULL virtual page number in the TLB
        essentially: TLB lookup doesn't depend on how many levels the table has
    never store non-last-level PTEs in the TLB --> where would you store them?
        they aren't indexed by a full VPN
    on TLB hit:
        VPN --> TLB --> valid PTE
            --> use valid PTE + PO (from original address), access cache
    on TLB miss:
        VPN --> TLB --> miss!
        VPN part 1 * size of PTE + PTBR --> address of PTE --> read first-level PTE
        VPN part 2 * size of PTE + first-level PTE's PPN * page size --> read second-level PTE
        VPN part 3 * size of PTE + second-level PTE's PPN * page size --> read third-level PTE
        ...
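Each level of the walk has the same shape: index into the current table with one VPN part, then use the PTE's PPN to find the next table. A hedged sketch (the dict-based "memory", PTE size, and addresses here are all hypothetical stand-ins for real physical memory reads):

```python
# Sketch of the multi-level page table walk.  "memory" is a hypothetical
# dict mapping physical addresses to PTE contents (just the PPN here);
# a real walk reads PTEs from physical memory and checks valid bits.
PTE_SIZE = 8
PAGE_SIZE = 4096

def walk(memory, ptbr, vpn_parts):
    """Follow one PTE per VPN part; the last PTE's PPN locates the data page."""
    table_base = ptbr
    for part in vpn_parts:
        pte_addr = table_base + part * PTE_SIZE   # index into current table
        ppn = memory[pte_addr]                    # PPN stored in the PTE
        table_base = ppn * PAGE_SIZE              # next table (or, at the end, data page)
    return table_base                             # physical page holding the data

# two-level example: VPN part 1 = 1, VPN part 2 = 3, PTBR = 0x1000
memory = {
    0x1000 + 1 * PTE_SIZE: 0x5,   # first-level PTE  --> second-level table at 0x5000
    0x5000 + 3 * PTE_SIZE: 0x9,   # second-level PTE --> data page at 0x9000
}
print(hex(walk(memory, 0x1000, [1, 3])))   # --> 0x9000
```

The loop body is the "VPN part k * size of PTE + table base" step repeated; only the last iteration's result gets combined with the page offset.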
        VPN part N * size of PTE + (N-1)-level PTE's PPN * page size --> read last-level PTE
        last-level PTE combined with PO (from original address) --> access cache
- caching: write-through and write-back
    say the cache contains,  for addresses 0x10-0x11: 0xAB 0xCD
    and memory contains,     for addresses 0x10-0x11: 0xAB 0xCD
    -- and the CPU writes 0xFF to address 0x10 -- two options:
    option 1 (write-back):
        now the cache contains:  0x10-0x11: 0xFF 0xCD
        and memory contains:     0x10-0x11: 0xAB 0xCD
                                            ^^^^
                                            \-- not yet updated
        and later:
        the cache contains:      0x20-0x21: ... [evicted 0x10-0x11]
        and memory contains:     0x10-0x11: 0xFF 0xCD
                                            ^^^^
                                            \-- memory updated when cache contents
                                                are replaced
        need to mark blocks as dirty so we remember to update memory
            --> need a "dirty bit" for each block
    option 2 (write-through):
        now the cache contains:  0x10-0x11: 0xFF 0xCD
        and memory contains:     0x10-0x11: 0xFF 0xCD
                                            ^^^^
                                            \-- updated immediately
    pros/cons:
        write-back: probably faster --- fewer accesses to memory/the next level of cache
        write-through: less state --- no dirty bits
        write-through: when only one byte of a block is updated, only write one byte
            to memory; versus write-back: always end up writing the whole block
- counting cache misses
    first: figure out the tag-index-offset split of addresses
    heuristic: # bytes/way = 2^(index bits + offset bits)
        if two addresses are apart by 2^(index bits + offset bits) or a multiple,
        then they must map to the same set
        (because they have the exact same index/offset bits)
    what values map to the same set?
        [0 -> N], N/4 bytes per way --> 0, N/4, 2N/4, 3N/4 map to the same set
                                        1, 1+N/4, 1+2N/4, ... map to the same set
                                        ...
    what values are part of the same block?
        [0 -> N], 16 bytes per block --> 0-15, 16-31, ...
        are each part of the same block
    track what's in a set
        if you have a small number of accesses, you can just write down the contents
        of each set
    look for repeated accesses to the same block, and compare other accesses to the
        same set to see if there are "too many" (depends on the replacement policy)
    example: 16-byte blocks, 32 bytes/way, 2 ways --> 2 sets; blocks are 0-15, 16-31, ...
        4 block offset bits, 1 index bit --> 17 = 10001
        block 0-15 is in set 0, and things in the same set are 32 bytes apart,
        so 0+32 through 15+32 is the next block in that set
        3, 17, 32, 64, 3
        ^      ^^  ^^  ^ <-- all set 0 accesses
        \-- block containing 0-15, in set 0
        17: set 1
        32: set 0, block 32-47
        64: set 0, block 64-...
    things that can trip people up:
        watch for accesses to the same cache block --- for counting cache misses,
        all that matters is which block it is
        one way to handle this: convert addresses to the first address of the block
            3, 17, 32, 64, 3 --> 0, 16, 32, 64, 0
- counting # of exceptions
    exceptions are the only way the OS gets to run in kernel mode
        (e.g. --- read from a device, change the page tables, etc.)
    program A asks to read a character from the keyboard
        -- exception: system call
        program A can't do things with the keyboard itself
        program A can't switch to program B itself
    program A is paused and program B runs
        -- exception: interrupt (triggered by the keyboard) runs the OS
        program B can't and doesn't know to do things with the keyboard
        program B can't switch to program A itself (and has no reason to)
    keypress happens and the OS switches to program A
- processes, context switches
    processes = threads + address space
    threads ---- illusion of own CPU
        you get to use all the registers and keep executing without worrying about
        anything else that might need to share the core
        implementation: exceptions (to run the OS) and saving registers, etc.
        to the OS's memory temporarily
    address space ---- illusion of own memory
        you don't get to touch other programs' memory
        you don't have to place your stack someplace where other programs don't
        have stacks
        ...
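The miss-counting example earlier (16-byte blocks, 32 bytes/way, 2 ways, accesses 3, 17, 32, 64, 3) can be double-checked with a tiny simulation sketch. LRU replacement is an assumption of this sketch; the notes leave the policy open:

```python
# Sketch: simulate the 2-set, 2-way, 16-byte-block cache from the
# miss-counting example, assuming LRU replacement.
BLOCK_SIZE, NUM_SETS, WAYS = 16, 2, 2

def count_misses(addresses):
    sets = [[] for _ in range(NUM_SETS)]   # each set: block numbers, LRU first
    misses = 0
    for addr in addresses:
        block = addr // BLOCK_SIZE         # "convert to the block" step
        s = sets[block % NUM_SETS]         # index bit picks the set
        if block in s:
            s.remove(block)                # hit: will re-append as most recent
        else:
            misses += 1
            if len(s) == WAYS:             # set full: evict least recently used
                s.pop(0)
        s.append(block)
    return misses

# blocks: 0, 1, 2, 4 --> sets 0, 1, 0, 0, 0; the second access to 3
# misses because 64 evicted block 0 ("too many" set-0 accesses in between)
print(count_misses([3, 17, 32, 64, 3]))    # --> 5
```

This matches the hand method: convert each address to its block, group by set, then check whether a repeated block survived the other accesses to its set.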
        implementation: page tables and changing the page table base register
    context switch: saving registers, condition codes, etc. for one process to the
        OS's memory and restoring another process's registers, condition codes, etc.
    the registers, condition codes, etc. that need to be saved/restored are the "context"
    context:
        registers (%rax, %rsp, ...)
        condition codes
        program counter (where do we resume the program?)
        page table base register (when switching processes)
        (anything I'm forgetting?)
- types of exceptions
    the book's categories:
    - interrupts --- externally triggered events
        I/O devices (keyboards, hard drives, network interface devices, ...)
            signalling to the OS that they're ready
        a conceptually external timer that the OS controls
            used to do context switches even if a program has an infinite loop
    - faults --- events triggered by the instruction they stop,
        BUT it's not what the instruction "usually" does
        protection faults --- trying to do things that need kernel mode
            e.g. changing memory marked read-only
            e.g. accessing an I/O device directly from user mode
        page faults --- trying to use an invalid page in the page table
        division by zero
        ...
    - traps --- events triggered by the instruction they stop,
        AND the instruction is only meant to trigger this event
        system calls --- request OS services
    - aborts --- "the computer is broken"
        e.g. memory has been corrupted and is not recoverable
- optimizations --- combining
    - loop unrolling
        eliminated some "loop overhead":
            incrementing loop counters, doing comparisons, etc.
    - multiple accumulators
        essentially needed loop unrolling first
        exposed more parallelism, to take advantage of multiple copies of/pipelined ALUs
        special case of: reassociation
            changing the order of math to expose more parallelism
    - cache blocking
        reordering accesses to eliminate some cache misses by working on square-ish
        blocks of values and trying to do as much work on the blocks as possible
        before they get evicted from the cache
        worked well with loop unrolling, but is a separate change
        special case/extension of: loop reordering
            changing the order of loops to improve locality
    - function inlining
        eliminated some function call overhead (call/ret/saving and restoring registers)
        combines with loop unrolling, etc., but combining ---> much bigger executable
    - removing redundant operations
        example: strlen called in a loop that could be "hoisted" outside of it
- multiple accumulators: longest path and # of cycles
    - see picture
    - longest dependency chain ---> add the latencies of the functional units together
        heuristic assumption: hopefully everything else happens in parallel
            usually the case if there isn't a large number of things that can be done
            at once and/or a very long dependency chain
            BUT if you need to be sure, then you need to figure out if there are
            enough functional units, pipelining, etc. for that to happen
    - can look at one iteration of a loop --- what chain of stuff needs to be done
        to start the same chain in the next iteration
        e.g. adding into x --- then we don't care about the chain that adds into y:
            for (...) {
                x += a;
                x += b;
                x += c;
                y += x;
            }
        (adds to y can happen in parallel with the next iteration's adds to x)
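The multiple-accumulators transformation above can be sketched end to end. Python won't show the cycle-count win, but it does show the reassociation is sound: splitting the chain across two accumulators and combining at the end gives the same (associative) integer sum with half the chain length per accumulator. The function names and data are illustrative only:

```python
# Sketch: single long dependency chain vs. two accumulators (unroll by 2).
def sum_single(xs):
    acc = 0
    for x in xs:
        acc += x                         # one long chain: each add waits on the last
    return acc

def sum_two_accumulators(xs):
    acc0 = acc1 = 0
    for i in range(0, len(xs) - 1, 2):   # loop unrolled by 2
        acc0 += xs[i]                    # these two adds are independent,
        acc1 += xs[i + 1]                # so hardware could run them in parallel
    if len(xs) % 2:                      # leftover element if length is odd
        acc0 += xs[-1]
    return acc0 + acc1                   # reassociated: combine once at the end

data = list(range(101))
print(sum_single(data), sum_two_accumulators(data))   # same result: 5050 5050
```

Note the cleanup code for odd lengths: unrolling always needs a plan for leftover iterations, which is part of the "loop unrolling first" requirement the notes mention.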