- how cumulative? ~half (+/-) will be from after exam 2
- counting cycles for pipelined processors
    one way: big timing diagram
        add     F D E M W
        sub       F D E M W
        xor         F D E M W
        mrmovq        F D E M W
        add             F D E M W
    another way: start with no stalling
        no stalling --> # cycles = # instructions + 4 [D E M W of last instruction]
        each cycle we need to stall --> +1 cycle
        need to check dependencies
            draw out where forwarding can happen
            Q: is this actually going forward in time
        textbook's CPU only had two cases where we needed to stall:
            ret
            load/use
        each time we mispredict a branch --> some penalty
            textbook's CPU --> wait until jmp's memory stage --> 2 wasted instructions
- counting cycles for multiple accumulator/OOO scenarios
    functional units:
        latency (time to answer)
        pipelined or not
            (if pipelined, new input/output every cycle;
             if not pipelined, one at a time)

    addq %rax, %rbx    (1)
    addq %rcx, %rbx    (2)
    addq %rdx, %rcx    (3)
    addq %r8, %r8      (A)
    addq %r9, %r9      (B)
    addq %r10, %r10    (C)
    ...
    addq %rbx, %rbx    (4)

    if we have EXACTLY ONE pipelined adder with 3 cycle latency:
        cycle:  1 2 3 4 5 6 7 8 9 10
        (1)     1 1 1
        (3)       3 3 3
        (A)         A A A
        (2)           2 2 2
        (B)             B B B
        (C)               C C C
        (4)                 4 4 4
        (D)                   D D D
    heuristic: longest path (1) --> (2) --> (4)
        minimum total needed time: time to do (1) THEN (2) THEN (4)
        3 cycle latency --> 9 cycles
        assumption: other operations aren't too much work
            e.g. you have a loop, and the longest chain is "enough" of the work
            so the other stuff can be done in parallel
        assumption: everything else can be done in parallel
    (+ see drawing)
- virtual -> physical addrs
    [ VPN | PO ]
       |     |
  page table |
       |     |
       v     v
    [ PPN | PO ]
    PO = page offset
        location within a page
        page size is a parameter
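The split-and-recombine in the diagram above can be sketched with bit operations. This is a minimal sketch, assuming 2^12-byte pages; the function names are made up for illustration:

```python
PAGE_OFFSET_BITS = 12  # assumed: 2^12-byte (4 KiB) pages

def split_vaddr(vaddr):
    """Split a virtual address into (VPN, page offset)."""
    vpn = vaddr >> PAGE_OFFSET_BITS
    po = vaddr & ((1 << PAGE_OFFSET_BITS) - 1)
    return vpn, po

def make_paddr(ppn, po):
    """Recombine the PPN (looked up via the page table) with the unchanged PO."""
    return (ppn << PAGE_OFFSET_BITS) | po
```

Note that only the VPN is translated; the page offset passes through unchanged, e.g. `split_vaddr(0x12345)` gives VPN `0x12` and offset `0x345`.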
    e.g. if page size is 2^12 bytes, page offset is 12 bits
    VPN = virtual page number
    PPN = physical page number = physical frame number = frame number
    page table: entry for each virtual page number
        [* slightly different for multi-level]
        containing a valid bit, PPN, some permission bits
    single-level page table:
        page table entry for VPN x is at address (page table base) + (entry size) * x
- TLB
    cache for page table entries
    maps VPNs (instead of physical addresses) to page table entries
        each block = 1 entry
    [        VPN       ]                [         Phys Addr           ]
    [TLB tag][TLB index]  same idea as  [cache tag][cache index][offset]
    # index bits = log_2(# sets)
    # entries / # ways = # sets
    on TLB hit:
        VPN --> TLB --> get out PPN (+ permission bits)
            --> use to form the physical address along with the PO
    on TLB miss: (PTE = page table entry)
        VPN --> TLB --> get out "miss"
        VPN --> combine with page table base to get PTE address
            --> fetch the PTE --> extract PPN (+ permission bits)
            --> use to form the address --> and update TLB
- multi-level page tables
    [          VPN        ][ PO ]
    [ VPN part 1][VPN pt 2][ PO ]
    VPN part 1 * page table entry size + page table base
        --> get a PTE from level 1
    use PPN from level 1's PTE to form level-2 page table base (that PPN * page size)
    VPN part 2 * page table entry size + level-2 page table base
        --> get a PTE from level 2
    if last level (this example): use PPN + PO to get final physical address
    otherwise:
        use PPN from level 2's PTE to form level-3 page table base (that PPN * page size)
        VPN part 3 ...
        ...
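The walk above repeats the same step per level, so it can be sketched as a loop. This is a toy model, not real hardware: memory is a dict from PTE address to a (valid, PPN) pair, and the 8-byte entry size and 2^12-byte page size are assumptions:

```python
PAGE_BITS = 12   # assumed: 2^12-byte pages
PTE_SIZE = 8     # assumed: 8-byte page table entries

def walk(memory, base, vpn_parts):
    """Multi-level page table walk (sketch).
    memory maps PTE addresses to (valid, PPN) pairs."""
    ppn = None
    for part in vpn_parts:
        pte_addr = base + PTE_SIZE * part   # index into this level's table
        valid, ppn = memory[pte_addr]
        if not valid:
            return None                     # page fault: "hole in memory"
        base = ppn << PAGE_BITS             # next level's table base
    return ppn                              # last level's PPN forms the phys addr

# two-level example: VPN split into parts (2, 3)
memory = {
    0x1000 + 8 * 2: (True, 0x5),    # level-1 PTE -> level-2 table at 0x5000
    0x5000 + 8 * 3: (True, 0x7),    # level-2 PTE -> final PPN 0x7
    0x1000 + 8 * 4: (False, 0x0),   # invalid level-1 PTE
}
```

Here `walk(memory, 0x1000, [2, 3])` returns PPN `0x7`, while `walk(memory, 0x1000, [4, 0])` hits the invalid entry at level 1 and returns `None` (a page fault), without ever reading a level-2 table.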
- and TLB
    TLB only stores last-level entries
        because it has to take the WHOLE VPN
    TLB works the same no matter how many levels the page table has
- and permission bits and valid bits
    we need the valid bit to be 1 --- OR we ignore the PPN
        valid = 0 --> "hole in memory"
    usually, we also check all permissions at each level
        (--> update permission bits based on higher levels before storing into TLB)
- swapping
    use of page tables: to use memory as a cache for disk
    illusion: really, program storage is whatever can be stored on disk
        and the OS places it in memory as needed
    "swap in": move from disk to memory
        e.g. loading
        e.g. after running out of space, but accessing something again
        often triggered by a _page fault_:
            you access an invalid page
            the OS gets a fault where the CPU says
                "your program tried to access address x but you told me nothing was there"
            the OS looks up what should be at address x
            OS finds out: it's allocated, but only stored on disk at the moment
            OS allocates a physical page
            OS reads from disk into that physical page
                maybe runs another program while reading
            OS updates the page table of the program
            OS restarts the program, RERUNNING the instruction that faulted
    "swap out": move from memory to disk
        e.g. after running out of space in memory
            probably triggered by swapping in or allocating memory or ...
            possibly in a different program
        OS updates the page table to mark the virtual page as invalid
            OS needs to make sure the TLB gets invalidated
                possibly on other cores
        if the physical page was modified since it was last written to disk,
            the OS may also need to write it to disk
        OS reuses the physical page that that virtual page used to map to
- types of exceptions
    book's categories:
        faults --- some particular instruction did something
                   (and it's not something the instruction normally does)
            page fault --- tried to access a virtual page marked invalid in the page tables
            protection fault --- tried to do something in user mode that needs kernel mode, e.g.
                talking directly to a device, changing the page table base register, ...
            divide by zero
            ...
        traps --- an instruction whose purpose is to trigger exceptions
            system calls
                request OS services (e.g. things the program can't do itself in user mode)
        interrupts --- external events
            I/O devices
                keypresses, network input, disk reads finishing, etc.
            timer
        aborts --- "the machine is broken"
            e.g. physical memory around byte X has wrong values and can't be recovered
- cache blocking
    extension of loop reordering: improving locality by changing when we access values
        e.g. columns versus rows
        e.g. do everything with a value once before moving on
    usually with cache blocking: can't get perfect locality for everything
        e.g. some values accessed row-by-row, some column-by-column
    cache blocking: try to work on a square-ish block of values
        square-ish --> best ratio of work : values in the cache
    plan what should stay in the cache
        try to make use of everything we have loaded as much as we can before moving on

    for (i <- 1 to N by I)
        for (j <- 1 to N by J)
            /* hopefully the I x J chunk of stuff is in the cache
               and we do all the work we can do with just that */
            for (ii <- i to i + I - 1) {
                for (jj <- j to j + J - 1) {
                    do stuff based on ii, jj
                    e.g. use A[ii * N + jj] and B[jj * N + ii]
                         and C[jj] and D[ii]
                }
            }

    I * J elements of A, J * I elements of B, J elements of C, I elements of D
        --> Q: is the size of all of these <= cache size? probably fits?
    in practice:
        BUT what about partial cache blocks
        BUT what about other values you access from cache
            (code, global variables, the stack, ...)
        BUT what about conflict misses
- compilation steps
    [C file] --- compile --> [assembly file] --- assemble --> [object file]

    [object file] \
    [object file] --- link --> [executable]
    [object file] /

    C file:
        high-level operations
            multiple options for how they get translated into instructions
        function and variable names
    assembly file:
        chosen particular instructions
        labels for instructions that refer to each other
            or for instructions that refer to data stored in memory
            that's allocated at program start
        functions and global variable names were turned into labels
    object file:
        instructions converted to machine code
        data constants converted to bytes
        but labels STILL EXIST:
            relocations for where labels are USED
                how to update machine code to have the correct address
            symbol table for where labels are DEFINED
                needed because labels can come from other files
                example: printf() used in main.o, but defined in some library object file
    executable:
        machine code and data FROM ALL THE OBJECT FILES
            because we didn't know about other object files until now,
            couldn't know where machine code, etc. would be in memory
        with labels converted to addresses (or whatever format is needed by the CPU)
        idea [if statically linked]: just load bytes into memory, and jump there
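The relocation/symbol-table idea above can be sketched as a toy linker. This is an illustration only, far simpler than a real linker (ELF has sections, relocation types, alignment, ...): each "object file" is made up here as a tuple of (code as a list of words, symbol table mapping labels to offsets, relocations listing slots that need a label's final address):

```python
def link(objects):
    """Toy linker: concatenate each object's code, record where labels are
    DEFINED (symbol tables) and USED (relocations), then patch the final
    addresses into the combined image once every label's location is known."""
    image, symbols, relocs = [], {}, []
    for code, syms, rels in objects:
        base = len(image)                       # this object's load address
        image.extend(code)
        for label, off in syms.items():
            symbols[label] = base + off         # where the label is DEFINED
        for off, label in rels:
            relocs.append((base + off, label))  # slot to patch later
    for addr, label in relocs:                  # only now are addresses known
        image[addr] = symbols[label]
    return image

# main.o: word 1 is a placeholder for the address of "f"
main_o = ([10, 0, 20], {"main": 0}, [(1, "f")])
# lib.o: defines "f" at its offset 0 (like printf living in a library object)
lib_o = ([30], {"f": 0}, [])
```

`link([main_o, lib_o])` places `lib_o`'s code at address 3, so the relocation patches the placeholder in `main_o`'s code to 3; linking the objects in the other order would patch in a different address, which is exactly why the patching can't happen before link time.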