- covered topics [POSSIBLY INCOMPLETE LIST]
  - early in the semester:
    - C
    - assembly
    - Y86 and ISAs (RISC/CISC/etc.)
    - bitwise operators
    - SEQ
    - pipelining and PIPE
    - caching
    - locality in caches
    - cache blocking
  - later in the semester (most of the exam):
    - optimizations
      - loop unrolling, etc.
      - multiple accumulators
      - inlining
    - out-of-order processors / functional units
    - exceptions
    - signals
    - processes
    - virtual memory
- S2017 Q22 (page faults)
- last quiz, Q1 (TLB tag?)
  - split the virtual address:
        [      VPN     ][ PO ]
        [TLB tag][TLB idx]
                  ^^^^^^^ (which TLB set to use?)
  - # TLB sets = 32-entry TLB / (4 entries/set) --> 8 sets --> 3 set index bits
  - for 0x00CAFECAFE:
    - remove the page offset (16KB pages --> 14 bits) to get the VPN
    - look at the bottom 3 bits of the VPN (the TLB index bits)
- last quiz, Q2 (page table entry address)
  - split the VPN:
        [        VPN       ][ PO ]
        [VPN pt1][ VPN pt2 ]
         ^^^^^^^ 4096 entries --> 12 bits to determine which entry
    - 4-byte page table entries are used for the first-level lookup
  - extract VPN pt 1 (12 bits) = the first 12 bits of the 38-bit VA
        VA starts 0x0012...:
            [00]00 0000 0001 0010 ...
             ^^ not part of the virtual address (b/c it's only 38 bits)
        first 12 VA bits: 0000 0000 0100 --> VPN pt 1 = 4
  - the first-level page table is an array of entries; we want index 4
        0x10000 is index 0
        0x10004 is index 1
        0x10008 is index 2
        ...
        0x10010 is index 4
- SEQ instruction knowledge
  - we expect you to know what all the instructions are supposed to do
  - we won't ask questions that depend on whether or not the ALU is used to add 0, etc.
- cache tag/offset formulas
  - there is a figure in the book
- fork
  - we covered this more in S2017 than this semester
  - won't be on the exam
- last quiz -- L1 cache and TLB overlap
  - abbreviations: VPN = virtual page number, PPN = physical page number,
    PO = page offset, CT = cache tag, CI = cache index, CO = cache offset
  - the address splits:
        [ VPN ][ PO ]
        [ PPN ][ PO ]
        [CT ][CI ][CO]
              ^^^^^ needed to look up the set -- the first step of a cache lookup
  - question: is the cache index part of the page offset or not?
  - if yes --> we know the cache index without knowing the PPN
        [  PPN  ][    PO    ]
        [    CT     ][CI][CO]
    - don't need the PPN until later (but it's as if we used the physical address)
  - if no --> we don't know the cache index until we find the PPN
        [  PPN  ][    PO    ]
        [  CT ][ CI  ][ CO  ]
  - the rule:
        size of a way = cache size / associativity
        2^(CI + CO) = size of a way
        log2(size of a way) = # of CI + CO bits
  - 32KB direct-mapped cache --> a way is 32KB = 2^15 B
        CI + CO = 15 > 14 (PO size)
        CI overlaps with the PPN --> no
  - 32KB, 2-way --> a way is 16KB = 2^14 B
        CI + CO = 14 <= 14
        CI does not overlap with the PPN
  - 128KB, 4-way --> a way is 32KB (same as the direct-mapped case above)
- S2017 Q10 (dividing a virtual address)
  - 8KB pages --> 13 page offset bits
  - 32-bit VAs, 28-bit PAs
  - 32KB cache --- can we overlap the accesses?
        [ PPN ][ PO ]
        [CT ][CI][CO]
  - what associativity do we need?
        1-way --> each way is 32KB --> CI + CO = 15, not <= 13
        2-way --> each way is 16KB --> CI + CO = 14, not <= 13
        4-way --> CI + CO = 13, and 13 <= 13: works
        (doubling the associativity --> remove an index bit)
- S2017 Q11
  - 8KB pages --> 13 page offset bits
  - 32-bit VAs, 28-bit PAs
  - 2-level page table: 1KB page tables at level 1, 8KB at level 2
  - 4-byte page table entries
  - 1KB first-level page tables with 4B entries
        --> 2^8 = 256 = 1KB / 4B entries in each first-level page table
        --> 8 bits to identify the index of an entry
  - which VAs (from a list) share the same second-level PT?
        <- 8 -><-- 11 --><- 13 ->
        [       VPN     ][  PO  ]
        [VPN pt1][VPN pt2][  PO  ]
         ^^^^^^^ lookup in the first-level page table finds the second-level table
                 ^^^^^^^ lookup in the second-level page table
  - which VAs share the same VPN pt 1 with 0x12345678?
        0x123FD000
          ^^ same top 8 bits --> yes
        0x00046678
          ^^ different --> no
- Q23 Q3 --- multi-level PTs
  - multilevel PTs are most efficient at representing:
    - (yes) address spaces with large, contiguous regions of unallocated memory
      - an entry in the 1st- or 2nd- (or other non-last-) level table is invalid
      - so there are no later page tables at all
    - (no) address spaces where the allocated addresses are evenly distributed
      - the addresses likely don't share the first part of the VPN
      - so most first/second-level page table entries are valid
      - so we don't get to omit many page tables (we need close to the maximum
        number of second-level page tables)
    - (no) address spaces with lots of virtual pages using the same physical page
      - no savings just because the PPNs happen to be the same
    - (no) address spaces with smaller pages
      - more page table entries
- S2017 Q12 (signals + fork)
  - we won't talk about fork on the exam
- S2017 Q1-2 (bitwise ops and caches)
  - 64KB 4-way set associative cache with 256B blocks
  - Q1: which C expression gets the tag?
    - 64KB cache / 4 ways --> 16KB per way --> CI + CO = 14
          C = B * S * E --- C / E = B * S = 16KB
          log2(C / E) = log2(B * S) = log2(B) + log2(S)
                      = block offset bits + set index bits
          log2(16K) = 14
    - CO = 8 (256B blocks)
    - 64KB / 256B = # blocks; # blocks / 4 = # sets --> 2^CI
    - the tag is bits 14 and up
    - look at the bitwise ops; see which one makes bit 14 become bit 0
    - make sure all the other higher bits of the address are shifted similarly
  - Q2: which expression gets the index?
    - CO = 8, CI = 6
    - index --> shift right by 8 and select 6 bits (& 0x3F selects the bottom 6 bits)
- S2017 Q18-19 (caching)
  - 32-byte 2-way set associative cache with 4-byte blocks and LRU replacement
  - 8 blocks --> 4 sets --> 2 set index bits
  - 4-byte blocks --> 2 block offset bits
  - the bottom 4 bits determine set + offset
    - 0, 16, 32, and 48 map to the same location
    - 1, 17, 33, and 49 map to the same location
      (because they share the bottom 4 bits)
  - 2-byte accesses to decimal addresses: 0 16 32 2 14 34 12 20 36
  - capacity/compulsory/conflict misses?
  - get the block address --- round down to a multiple of 4
        address:        0   16  32  2   14  34  12  20  36
        block address:  0   16  32  0   12  32  12  20  36
        set index:      0   0   0   0   3   0   3   1   1
        compulsory:     X   X   X       X           X   X   --> 6 compulsory misses
        conflict:                   X
                                    ^ conflicts with 32; higher associativity
                                      would have prevented this miss
  - which set does each byte range map to?
        set:     0     1     2      3       0       1
        bytes: [0-3] [4-7] [8-11] [12-15] [16-19] [20-23] ...
  - per-set contents, in access order:
        set 0: 0 [evicted for 32], 16 [evicted for 0], 32 [hit], 0
        set 1: 20, 36
        set 2: (unused)
        set 3: 12 [hit]
- another example: 16B cache
        accesses: 0 2 4 6 8 10 12 14 16 18 0 2 4 6 ...
                                            ^^^^^^^
        the repeated accesses miss -- the 20 bytes touched don't fit in a 16B cache
- vector instructions (what to know)
  - basic idea: one instruction says to do multiple computations on matching pairs
    of values from fixed-size arrays
  - versus multiple accumulators/reassociation, where the CPU discovers multiple
    parallel computations itself (vector instructions can be combined with that, too)
  - we won't ask about specific intrinsics/instructions without giving you a
    description of them on the exam
- pipelining and what each stage does
  - there are multiple pipelined CPU designs -- we talked about one
  - fetch --- read the instruction, split it up, compute the next predicted PC
  - decode --- read registers (probably a bad name)
  - execute --- use the ALU, set/read condition codes
  - memory --- use data memory
  - writeback --- write registers