- covered topics [POSSIBLY INCOMPLETE LIST] - early in the semester: C assembly Y86 and ISAs (RISC/CISC/etc.) bitwise operators SEQ Pipelining and PIPE Caching Locality in caches Cache blocking - later in semeseter (most of exam) Optimizations Loop unrolling, etc. Multiple accumulators Inlining Out-of-order processors / Functoinl units Exceptions Signals Processes Virtual Memory - not fork - explanations of the quizzes? - this would take more time than have between now and the exam - office hours? - there are office hours today - I have office hours tomorrow - if there are office hours on the calendar on Thursday, they are supposed to happen - Quiz26 on page table space required two-level page table [first-level PT] --> [second-level PT contains entrys] worst case: each entry in its own second-level PT --> 1 (first-level) + 4 PTs, each one page --> 5 BUT 0x10 and 0xFF are close [ VPN ][ PO ] [ VPN ][ VPN ] pt1 pt2 ^^^^^ determines which second-level table is used VPN pt1 -- log_2(4096) bits -- 12 bits VPN in total = 12 + 12 = 24 bits 0x10 and 0xFF --> VPN pt 1 is 0 1 second-level table for both 0x5000 and 0x5001 --> VPN pt 1 is 5 1 second-level table for both 3 total tables --> each one page --> 3 pages - overlapping TLB and cache accesses VPN = virtual page number PPN = physical page number PO = page offset CT = cache tag CI = cache (set) index CO = cache (block) offset [ VPN ][ PO ] [ PPN ][ PO ] [ CT ][ CI ][ CO ] ^^^^^^^ needed for first part of cache lookup CI includes part of PPN can't do set lookup w/o PPN [ PPN ][ PO ] [ CT ][ CI][CO] ^^^^^ CI can be computed with just PO can do set lookup w/o PPN C (cache size) = B (block size) * S (# sets) * E (associativity) C / E ("size of each way") = B * S log_2(C / E) = log_2(B * S) = log_2(block size) + log_2(# of sets) ^^^^^^^^^^^^^^^^^ # of block offset bits ^^^^^^^ # set of set index bits log_2(C / E) = CI + CO log_2(32 KB / (1-way)) --> 15 = CI + CO 15 > 14 bit PO --> PPN is part of CI and/or CO log_2(32 KB / 2-way) --> 14 = CI + CO <= 14 PO - F2016 Q1 (virtual/physical) 8KB (2^13 B) pages two-level PT PT entries contain: 0 present bit 1 read/exec 2 write 6-29 PPN size of physical address? look at the page size vvvvvvvv PA : [ PPN ][ PO ] ^^^^^^^^^^^^^^ look at how much space is used in the page table entries 29 - 6 + 1 = 24 bit PPN + 13 bit PO ---- 37 - S2017 Q7-11 (virtual memory) 8 KB (2^13 B) pages --> 13 bit PO 32-bit virtual addresses two-level PT, 1KB first-level tables, 8KB second-level tables 4 byte page table entries 16-entry, 8-way set associative TLB 16 entries / 8 entries/way --> 2 sets --> 1 TLB set index bit [ VPN ][ PO ] [ TLB ][ TLB] tag idx Q7: same TLB set as 0x12345678? does bit 13 (bit 0 is least significant) the same in each address? 0x99345678 ^ 0101 ^^-- bit 12 \-bit 13 Q8: how many bits in a VPN? 32-bits = VPN size + PO size = VPN size + 13 --> 32 - 13 = 19 Q9: how many bits of the PTE are used? PTEs contain: - PPN --> 28 - 13 = 15 bits - valid bit + 1 - readable bit + 1 - writeable bit + 1 - executable bit + 1 - kernel-only bit + 1 ---- 20 Q10: overlap TLB access with 32KB X-way associative cache. What does X need to be? log_2(cache size / associativity) = CI + CO ^^^^^^^ must be <= PO bits to start cache access early log_2(32KB / 1) --> 15 not <= 13 log_2(32KB / 2) --> 14 not <= 13 log_2(32KB / 4) --> 13 <= 13 <-- minimum associativty to start cache access early Q11: Which addresses share 2nd level page table [ VPN ][ PO ] [ VPN ][ VPN ][ PO ] pt1 pt2 ^^^^^^ if shared --> then use the same second-level table two-level PT, 1KB first-level tables, 8KB second-level tables 4 byte page table entries 1 KB / 4 byte PTE --> 256 entries in first-level tables --> log_2(256) = 8 bits to say "which entry in 1st level" --> VPN pt 1 is the first 8 bits of the address - exception handling process - detect that we have an exception - varies depending on kind (e.g. explicit instruction, invalid memory access, external event, ...) - save the program counter (and maybe more) - lookup the location of the exception handler using the exceptoin table --- an array of function pointers set by OS, referenced by special register - switch to kernel mode - exception handler runs (part of the OS) - usually: save the context of the program all registers, page table base pointer, etc. in case it wants to switch to another program - exception handler does whatever OS decided is a good idea - e.g. swap in a page on a page fault - e.g. crash a program on a divide by zero - e.g. read from a file on a system clal - different kinds of exceptions - textbook's terms - traps - triggered explicitly using an instruction whose only purpose is doing so e.g. system calls - interrupts - triggered externally --- e.g. timer, keypress, network activity, etc. - faults - triggered by instruciton that was supposed to do something else e.g. divide by zero, page fault, protection fault (writing read-only page or accessing kerenl-only instruction), etc. - aborts -- kind-of vague - serious hardware errors being signalled to the CPU - e.g." your RAM is broken" - maybe related to running instruciton, maybe not? - signals versus exceptions operating system feature hardware feature for user programs for operating systems handlers run in user mode handlers run in kernel mode (because part of the user program) (because part of the OS) runs when OS decides to run it runs when HW decides to have same calling conventoin have different calling convention because OS+libraries do extra work than normal functions usually before calling signal handlers (keep implementing HW simple; so they can be written in C handle kernel/user mode shift) - context switches context = information the CPU uses when runing a program that we need to save elsewhere when we switch away from it - all program registers (%rax, %rsp, ...) - the program counter - page table base register - condition codes context switch = OS copies information from CPU in the context to its memory [for program A] and OS copies informatoin from memory in the context to CPU [for program B] switch from running program A to running program B done as part of many exception handlers - vector instructions -- what do we need to know - not going to ask about specific vector instructions/intrinsics unless I tell you about them on the exam - important things to know: - vector instructions have YOU (or compiler generating them) specify what to do in parallel - fixed-sized arrays, pairwise operations - versus reassociation/multiple accumulators/out-of-order where CPu figures what computations to do in parallel - Exam 2 Q 5 (caches and misses) - Exam 2 cache miss counting 16-way w/ 64B block - compulsory misses: array[i * 128] 16 misses in iteration x = 0 array[i * 128 + i] is same block sa array[i * 128] - conflict misses with 16-way: 16 blocks, 16 ways, ---> no misses 1KB direc-tmapped w/ 1B block - compulsoary 31 array[i * 128] 16 misses in iteration x = 0 array[i * 128 + i] for i = 0 is same as array[i * 128] array[1 * 128 + 1] is different block than array[1 * 128] and so on 15 misses in iteration x = 0 - conflict misses: 16 conflict/iteration after first --> 16 * 9 = 144 to have conflict: blocks 1KB apart exactly "set index bits are the same" set index bits go up to bit 0-9, changing bit 10+ doesn't change the set two blocks which are # bytes/way apart conflict 8 * 128 array elements --> 1KB array[0 * 128] and array[8 * 128] are the same set --- set 0 array[1 * 128] and array[9 * 128] are the same set --- set 128 ... array[i * 128] misses every time array[i * 128 + i] i = 0 --> hit: because we just accessed array[0 * 128]` otherwise --> nothing else maps to same set no other accesses exactly 1KB apart array[0 * 128 + 0] --- set 0 array[1 * 128 + 1] --- set 129 array[2 * 128 + 2] --- set 258 +129 mod 1024 (# ssets) 144 + 31 = 175 - pipeline dependencies and dependencies v hazards - data dependencies need a value from a previous instructoin (generally register value) - control dependnecies need PC computed by previous instruction - hazards versus dependencies all hazards are depenedencies hazard -- with a PARTICULAR PIPELINE instruction would not execute correctly because of this depenedency example: data hazard --- would need to forward or stall control hazard --- would need to stall, predict, or run wrong instruction techniques to handle hazards: - stalling - forwarding - branch predictoin - S2017 Q 4 CPU with Fetch 1.0 ns Decode 0.8 ns Exec 0.5 ns Mem 1.2 ns WB 0.8 ns Q4: Split memory stage ---> 0.6 ns previously 2.0 cycles/instruction now 10% instruction stall 1 more cycle than before