- GILMER HALL
- space used by page tables (e.g. 100% of virtual addresses occupied)
- cache locality and loops
    for (i = 0 to N)      <-- changes less frequently
        for (j = 0 to N)  <-- changes more frequently
            array[i]      // repeatedly access same place
                array[0] array[0] array[0] array[0] array[0]  <-- temporal
            array[j]      // accessing consecutive elements
                array[0], array[1], array[2], array[3]  <-- spatial
            array[i + j]
                array[0] array[1] array[2] array[3] ... array[N]
                (next i) array[1] array[2] array[3] ...
            array[i + j * 1000]
                array[0] array[1000] array[2000] ...
    for (j = 0 to N)
        for (i = 0 to N)
            array[i]
                array[0] array[1] array[2] array[3]
    for (j = 0 to N)
        for (i = 0 to N)
            for (k = 0 to 3)  // <--
                array[i + k]
                    array[0] [1] [2] [3]
                    array[1] [2] [3] [4]
- loop order in the rotate
    for (i <- 0 to N)
        for (j <- 0 to N)
            dst[i * N + j] = src[j * N + i]
            // bad locality in src
            // good locality in dst
    empirical observation: on our test machine, locality in reads was
        more important than locality in writes
    guess: writes can be done "in parallel", "in the background"
        also L1 is probably write-through to L2
        - can do writes of less than a block
- data flow - OOO
    - produce a dependency graph of the program
    - key ideas: things will be done in parallel if:
        - no dependencies (can get values in time)
        - enough functional units
            - two adds/cycle? (two adders)
            - one multiply issued every cycle, with three-cycle latency:
                (a * b) * (c * d)
                cycle:   1     2     3     4     5     6     7
                       MULT1 MULT1 MULT1                         <-- (a * b)
                             MULT2 MULT2 MULT2                   <-- (c * d)
                                   MULT3 MULT3 MULT3             <-- can't use! operands not ready yet
                                               MULT  MULT  MULT  <-- (a*b)*(c*d)
- ISA -- definition
    - instruction set architecture
    - what is in the ISA:
        - what the instructions do
        - how the instructions are encoded
        - everything functional about what instructions do
            - includes # of registers visible to instructions
              (might not be the # of registers in HW)
    - what is not in the ISA:
        - how the instructions are implemented
        - how fast the instructions are
- context switching
    - save context from CPU
        - program registers (%rax, %rsp, etc.)
        - condition codes
        - page table base register
        - all visible state in the CPU
    - restore previous context to CPU
        - copy values from OS data structure to CPU registers
- exceptions and pipeline state
    - stop after a particular instruction
        - NEVER expose partially finished instructions
    - mechanism: squash/bubble instructions after the exception
      ("precise exceptions")
        mrmovq  F D E M[fault]
        addq    F D E [squashed!]
            - can't have changed program registers (no W yet)
            - can't have changed memory (no M yet)
            - could be changing condition codes --- but can stop that!
        subq    F D [squashed!]
- TLB
    - cache for page table entries
    - normal caches:  address --> contents of that address
      TLBs:  virtual page # --> **PTE for that virtual page #**
    - always blocks of one PTE (no offset bits)
    - set-associative
        (index bits --- lower bits of VP#)
        (tag bits --- rest of VP#)
    - hardware managed  <-- almost all current systems
        hardware does the page table lookup and fills the TLB on demand
    - software managed
        OS does the page table lookup (in response to a fault) and runs
        a special instruction to add the entry to the TLB
- physical vs virtual caches
    - virtually-indexed/physically-tagged
        - cache index + offset bits are ONLY in the page offset
        - same set is looked up with the virtual OR the physical address
        - can store ONLY physical addresses in the cache, but still
          start the lookup with a virtual or physical address
          (start lookup = read the set of the cache)
        - still need the physical address to check the cache tag
        - condition: index + offset bits <= page offset bits
    - physically-indexed/physically-tagged
        - cache index + offset may overlap with the physical page #
        - can't look up the cache set without getting (part of) the
          physical page #
    - above options assume PHYSICAL addresses in the cache
        - almost all processors do this
    - alternate implementation: VIRTUAL addresses in the cache
        - advantages:
            - use any cache organization and do the cache lookup
              without a TLB/page table lookup
            - avoid TLB lookups in most cases
        - disadvantages:
            - synonyms --- two virtual addresses that are different in
              the programs could map to the same physical address
            - invalidate the cache a lot?
            - process #s in the cache tags?
            - what happens when the page table changes?
- page directories
    - Intel's name for the non-last-level page tables in their
      multi-level page tables
    - page tables that point to page tables, instead of to the
      physical pages that store program data