- meeting with Prof Reiss
    my office hours are open-door, so we can't discuss things privately
    you can email me and we can probably schedule a time

- topic coverage (vectors)
    exam will be a majority on things from the last third of the semester
    about twice as long as a midterm

- likely: the 3:30pm section will be in Rice 130
    will send email with the actual policy

- storing values in the TLB

    the TLB is a cache:
                    vvvvv------ what the cache stores
        [ VPN ] --> [ PTE ]
        ^^^^^^^----- used to index and find the entry

    unlike a "normal" cache, the TLB:
    * uses the VPN instead of a physical address
    * splits the VPN into tag/index/offset
    * stores single page table entries instead of blocks of multiple bytes of data
        single page table entry --> 0 offset bits
    * relies on the OS running a special instruction to handle writes to PTEs
      (instead of detecting them automatically)

    values are stored in the TLB whenever we need to look up a page table entry
    for an access (and we don't find it in the TLB);
    stored based on the VPN used to look up the page table entry

    values are evicted when the OS runs a special instruction

- changing page tables
    whenever the OS changes a page table entry, it needs to make sure that no TLB
    contains a copy of it (or else the TLB might continue using that stale copy)
    on a multi-core system, there are multiple TLBs
    the special instruction to clear out parts of a TLB needs to run on the core
    with that TLB
    this means a multicore OS might need to run some code on a different core to
    edit a page table entry
    "TLB shootdown" --- trigger an interrupt just to run that piece of code on
    another core

- overlapping TLB and L1 access
    cache access steps:
    * look up the correct set of the cache            [index bits]
    * extract the correct part of the block           [offset bits]
    * compare the tag to determine the correct block  [tag bits]

    typically we will use only physical addresses for our caches
    (yes, there are processors which don't do this, but it creates other issues)
    tag/index/offset therefore are parts of the physical address

    virtual address:   [    VPN    ][ PO ]
    physical address:
                     vvvv------ same as in the virtual address (based on the page size)
    [    PPN    ][ PO ]
    [ cache tag ][ index ][ offset ]    (for the L1)

    we can figure out the page offset part of the physical address IMMEDIATELY
    therefore we can do the cache access steps that only require that piece of
    the address immediately
    if the index bits fit within the page offset, we can immediately look up the
    correct set
    --> start doing the cache lookup before we have the PPN (which we'll get
        from the TLB)

- how much space do page tables need for particular layouts [VM2 exercises]

    with a 4-level page table:

    - 1 valid page of data:
        first-level page table ---> second-level page table --->
        third-level page table ---> fourth-level page table ---> the single valid page
        ---> 4 page tables

    - 2 valid pages of data:
        Q: what are the virtual addresses of the pages?

        [                VPN                  ][ PO ]
        [ part 1 ][ part 2 ][ part 3 ][ part 4 ][ PO ]
          ^^^^^^^^ first-level lookup to find the second
                    ^^^^^^^^ second-level lookup to find the third

        if part 1 through part 3 are the same for both pages:
            then they share the first, second, third, and fourth-level tables
            ---> 4 page tables
        if part 1 is different for the two pages:
            then they share the first-level table and nothing else
            (doesn't matter if part 2 happens to be equal)
            one first-level, two second-levels, two third-levels, two fourth-levels
            ---> 7 page tables
        if part 1 is the same, but part 2 is different:
            then they share the first- and second-level tables
            one first-level, one second-level, two third-levels, two fourth-levels
            ---> 6 page tables

    - can generalize this to more pages: which page tables do we actually have
      to create?

- F2018 exam 3 #11 (second-level PTE)
    2^16 B pages (16-bit page offsets)
    40-bit VA: (40 - 16) = 24-bit VPN
    two-level page tables
    16-byte PTEs and an equal # of entries at each level
    --> 12-bit VPN parts for each level

    VA 0x13 4815 1234:
        first part of the VPN = 0x134, second part = 0x815, page offset = 0x1234

    the second-level page table is located at PA 0x10 0000
    this address must have been found by reading a page table entry, which was
    located using the page table base pointer + the first part of the VPN

    address of the second-level page table entry:
    use the base address we got from the first-level + second
    part of the VPN
        second part of the VPN = 0x815
        since each PTE is 16 bytes: 0x815 * 16 bytes from the beginning of the PT
        = 0x8150 bytes from the beginning
        0x10 0000 + 0x8150 = 0x10 8150

- page tables and normal caches [differences]
    page tables do not contain data (only pointers to data)
    TLBs, which cache page tables, do not contain data (only pointers to data)
    caches store data
    the page offset --- only used after the whole page table lookup;
    not really like a cache offset

- fork()
    Unix/POSIX choice for how to create processes
    makes a copy of the current process, including its memory and current registers
    a rationale: don't need to have a bunch of interfaces for specifying how the
    new process is set up

    typical implementations don't actually copy all the data in memory
    instead they make it shared read-only between the two processes
    and then they fix this whenever the OS is told that one of the two processes
    tried to write and couldn't
    fix by copying one page at a time
    "copy on write"

- latency and throughput bounds
    ~ both are ways of approximating the best performance you could have, given
      certain computational hardware
    ~ latency bound: what operations do we have to do one-after-another?
      how long is it from when we start each operation until it finishes?
      add this up for our list of operations

        for (int i = 0; i < N; ++i) { sum += a[i]; }

      need to do N add (+) operations, one after the other
    ~ throughput bound: how much stuff can we do versus how much work do we have
      to do?
      if I have to perform N multiplications and 2N additions,
      and I have two multipliers that can each start 1 multiplication/cycle
      and four adders that can each start 1 add/cycle:
      throughput bound for the multipliers: N/2 cycles to do the multiplications
      throughput bound for the adders: 2N/4 cycles to do the additions
      but this is assuming we don't have to worry about dependencies
    - the throughput bound ignores dependencies --- good for knowing how well you
      can do with optimizations that get rid of dependencies (e.g. multiple accumulators)
    - the latency bound accounts for dependencies --- good for estimating what
      performance should be given code with dependencies, or, if you can get rid
      of dependencies, for determining how close you are to what the hardware
      can do

- F2019 exam 2K #10 (cache blocking)
    4-byte ints in the matrix
    16B cache blocks (4 elements per block)
    fully-associative cache with an LRU policy

        int n = 160;
        for (i = 0; i < n; i += 4) {
            for (j = 0; j < n; j += 4) {
                for (iB = i; iB <= i + 3; iB++) {      // ----
                    for (jB = j; jB <= j + 3; jB++) {  // ----
                        matrix[jB * n + iB] *= i * j;
                    }
                }
            }
        }

    for each matrix block (in the loops marked ----) we access four cache blocks:
        matrix[j * n + i]     through matrix[j * n + i + 3]
        matrix[(j+1) * n + i] through matrix[(j+1) * n + i + 3]
        ...
    and we only access each matrix block once
    so assuming the matrix block fits in the cache, it will stay (LRU)
    and we'll have four misses per matrix block

    the cache is 256B --> 256/16 = 16 cache blocks --- enough to fit a matrix block
    we can't have fewer than four misses per matrix block, because we have to
    load each cache block in the matrix block at least once

    each matrix block is 4x4 out of 160x160 ---> 40x40 matrix blocks
    40 * 40 * 4 = 6400 misses

    some simplifications that helped in this question:
    * fully associative meant we didn't worry about sets/conflict misses
    * matrix blocks being whole cache blocks meant we didn't worry about keeping
      things in the cache between matrix blocks
    * matrix blocks only being used once / never overlapping meant we didn't
      worry about keeping things in the cache between matrix blocks
    BUT even for cases where the above doesn't hold, this can still be a useful
    *approximation* of how useful cache blocking is going to be

- counting cache misses w/ cache blocking

- F2018 exam 1 #22 - are these C expressions true? (bitwise)
    setup: x, y were unsigned ints between 0 and 1000 inclusive

    (x & y) < y
        (x & y) has a less-or-equal number of 1 bits than either x or y,
        and since unsigned --> equal or lower value than either x or y
        counterexample: x = y = 0 --> false

    (x | y)
    >= x
        (x | y) has a greater-or-equal number of 1 bits than either x or y,
        and since unsigned --> equal or greater value than x or y --> true

    ((x >> 3) & 0x7) == ((x & 0x3F) >> 3)
        x        = ????abcdefgh
        x >> 3   = ???????abcde
        & 0x7    = 000000000cde

        x        = ????abcdefgh
        & 0x3F   = 000000cdefgh
        >> 3     = 000000000cde
        --> the same --> true

    ((x + y) << 1) >= (x | (y << 1))
        left side:  (x + y) * 2 = x*2 + y*2
        right side: x | (y * 2)
        for non-negative a, b:  a + b >= a | b
        (a | b is like addition without carrying)
        so x | (y * 2) <= x + (y * 2) <= x*2 + y*2 --> true

- F2019 exam 1 #12 (bitwise)
    task: copy the low 12 bits of x over the next 12 bits, keeping the top 8
    bits; for example, the integer (specified in hexadecimal) `0x12345678`
    should become `0x12678678` (the middle 0x345 is replaced by a copy of the
    low 0x678; the top 0x12 stays)

    a.  `x = ((x & ~0xFFFFFF) | ((x << 12) | x) & 0xFFFFFF);`
            (x & ~0xFFFFFF) = 0x12000000
            (x << 12)       = 0x??678000
            but ((x << 12) | x) & 0xFFFFFF ORs the old middle bits back in
            --> wrong

    *a. `x = ((x & ~0xFFFFFF) | (((x & 0xFFF) << 12) | (x & 0xFFF)));`
            ((x & 0xFFF) << 12) = 0x678000
            (x & 0xFFF)         = 0x678
            0x12000000 | 0x678000 | 0x678 = 0x12678678 --> correct

    a.  `x = ((x ^ (x >> 12)) | ((x << 12) & 0xFFF000));`
            (x ^ (x >> 12)) changes bits outside the middle 12 --> wrong

    *a. `x = ((x << 12) & 0xFFF000) + (x & 0xFF000FFF);`
            (x << 12)            = 0x??678000
            (x << 12) & 0xFFF000 = 0x00678000
            (x & 0xFF000FFF)     = 0x12000678
            0x678000 + 0x12000678 = 0x12678678 --> correct
            (the two masks don't overlap, so + acts like |)