- 5-stage pipeline stages ~ our textbook had one possible choice for splitting up the steps (and a common choice) but there are other options ~ fetch [perhaps better named "fetch and PC update"] read instruction memory compute next [predicted] PC (but tricky cases like conditoinal jumps and return really handled later) ~ decode [perhaps better named "register read"] read registers in instruction compute destination registers for later stages ~ execute did whatever the ALU needed to do arithmetic for OPq instructions (add, sub, xor, and) addition/subtraction for address computation (rmmovq, mrmovq, push, pop call, ret) wrote condition codes based on ALU operatoin (if required) evaluated condition codes (cmovXX, jXX) ~ memory read or wrote data memory (if required) ~ writeback wrote registers in the pipelined processor, each instruction does the stages in order b/c values have to go into a pipeline register bank before the instructoin can advance to the next stage - dependencies and how they affect OOO instruction X depends on instruction Y if we can't compute the result of instruction X until we have the result of instruction Y mostly this means that instruction X reads from a register that instruction Y writes but what about control flow? addq %rcx, %rdx <--- A jle foo <-- A' subq %rax, %rbx <--- B whether instruction B executes depends on the result of A and A' (do we jump over B?) but OOO will predict what the branch does and not care about this kind of dependnecy (assuming they predict correctly) but in we can't compute the result of A' until A finished (b/c A writes to the condition codes, and A' reads the conditoin codes) cases where it's not obvious whether there's a dependency for OOO purposes: * "hidden" register --- %rsp, conditoin codes * values in memory (if accessing same address, but computed from different values) * indirect dependencies addq %rax, %rbx <-- X subq %rbx, %rcx xorq %rcx, %rdx <-- Y Y depends on X * same name, but different value (what OOO use register renaming to deal with) addq %rax, %rbx <--- X mrmovq (%rcx), %rbx <--- Z rrmovq %rbx, %rdx <--- Y Y does NOT depend on X Y DOES depend on Z * doesn't actual read the register addq %rax, %rbx <--- X mrmovq (%rcx), %rbx <--- Y Y does NOT depend on X - exception handling procedure ~ two categories of exceptions synchronous exceptions: some instruction was executing and as part of the execution of that instruction, the processor detected a "special" condition e.g. system call (current program wants to run the OS) e.g. page fault (program trying to access memory that isn't setup) e.g. divide by zero ... effectively the instruction turns into a special function call asynchronous exceptions: something happens outside the processor (conceptually outside the processor --- not related to what instruction is being run) the processor chooses some instruction stop on and then effectively does a special function call ~ when the exception is triggered, the processor is going to run an _exception handler_ previously setup by the operating system * the processor saves the old program counter somewhere * (sometimes) the processor saves some additional regsiters (x86: save the old stack pointer) * the address of the exception handler will be determined from the _exception table_. The processor will know about the exceptoin table through a special register the OS sets up on boot the exceptoin table usually contains one pointer for each type of exception processor reads pointer from that table, and jumps there * processor enters "kernel mode" where extra operatoins are allowed example extra operations: * setting page tables * setting the special register that says where exception hnadlers are ~ then operating system code runs and decides what to do, and typically * save the state of the program * do some special action depending on the type of exception * choose what program to run next (could be the same program that was previously running) * return from the exception handler into that program (with state saved previously) - page tables ~ way of doing address translation: each process has addresses it uses which map in an OS-controlled way to real addresses on the hardware ~ the possible addresses are divided into fixed sized pages ~ table for each page in the addresses the process uses ("virtual addresses") saying which page in the addresses the hardware really uses ("physical addresses") should be used (if any) ~ e.g. on memory HW with the single-level examples, the page table was the part of the memory from the base address extending N bytes later where N was (PTE size * number of virtual pages). ~ gives the OS flexibility: it can put the program whereever it wants in memory without having the program need to be changed to accomodate this ~ by adding a few extra bits to page tables, we can use them to give variable acess to memory (e.g. make part of it read only) CPU is already looking up the translation, can lookup at the same time "is the program allowed to write this", etc. ~ common extra bits we might want in page tables: * valid bit -- b/c not every address the program makes up will be loaded in memory [not all of the falling bits exist on every platform:] * permission bits ("what operations can the program do") * user-mode/kernel-mode bit: "can the program access this in user mode" (versus reserved for the OS) * writable-bit: "can the program access this with instructions like rmmovq or push" * executable-bit: "can the program access this when it's fetching instructions" * bits to assist with swapping --- to help the OS tell how the memory is used * accessed/referenced bit: "has this page table entry been used since this bit was las tcleared" * dirty bit: "has this page table entry been used to perform a write since this bit was last cleared" * bits to control caching - where would page table handling fit in pipehw2 if we added page tables, we would need to do a page table lookup (or equivalent) before the instruction memory access and before the data memory access (or as part of those) simplest (but slow) solution: * to fetch an instruction: extract the virtual page number [get VPN] combine it with a page table base register (needs to be added) to get the locatoin of a page table entry read the page table entry extract the physical page number from the page table entry [get PPN] combine PPN with the page offset from the original address fetched then fetch that address and use the result of that fetch and something similar for data memory reads/writes [but some special cases: if multi-level page tables, need to find page table entries at each level if we are fetching 10 bytes from near the end of a page, we might need to lookup two pages and fetch the first K bytes separately from the sceond 10-K bytes and then combine them ] less slow solution: implement the simple solution the add a cache ("TLB") for the result of the VPN to PPN translation (everything between [get VPN] and [get PPN] above) - copy-on-write fork() -- appears to copy all memory in a process to make a new proces simple implementatoin: [real copy version] actually make a copy of everything, setup a new page table (for the new process) for that points to the copies let's say both processes never wrote to any of their memory after fork() if so, we have an even simpler implementation idea: just use the same page table compromise that does handle writes: [copy on write versoin] make a new page table for the new process that points the same memory set both page tables as read-only with these steps, either we get perfect, fast copy OR we detect that something needs to be fixed b/c we get a protection fault (exception for accessing a read-only page) when one of the processes tries to WRITE a read-only page, [from the protectoin fault handler:] actually make the copy of that page update the process's page table to use the copy and mark as writeable rerun the instruction that was trying to write the page from the program's point of view both the real copy version and the copy on write versoi are the same except for performance - swapping illusion to provide: the program has access to as much memory as it wants (even if that would really require using disk/etc.) reality: not all of the addresses the program can access are really loaded into memory when the program tries to access something that's not loaded: a page fault will happen (b/c it won't be valid in the page table) [from the page fault handler:] load that thing into memory (possibly after making room) [****] and then rerun the instruction that was accesses the thing that wasn't loaded from the program's point of view: same as if everything was always loaded except for performance [****] how do we make room for data? we have to choose something to remove from memory to make room often, this might be data that loaded for another/same program to make room, need to edit the page table to not point to that physical page might need to save data to disk so we don't forget it need informatoin to make a good choice want to detect if programs are using particular pages need informatoin to decide whether to save to data to disk want to detect if programs have written to pages since they were last saved to disk/generated - TLB (typicallly small) cache that we lookup VPNs in and get out page table entries this replaces the entire page table lookup on a TLB hit (even if it's a multi-level lookup) organized the same kinds of ways as L1 caches but: * instead of addresses to split into tag/index/offset have VPNs * instead of mutli-byte blocks have single-page table entry blocks (so always 0 offset bits) * instead of storing data the program cares about, store page table entries * instead of handling writes automatically, we rely on the OS executing a special instruction when the page table entries change - and overlapping with L1 lookup * typically L1 caches are going to use physical addresses (you could use virtual addresses, but it's complicated) * but we don't want to wait for TLB lookup to finsih to start our L1 lookup * divide the L1 lookup - access the correct set of the cache [index bits] - access the correct part of a block [offset bits] - compare the tags to find the correct block [tag bits] * can do each step as soon as we have the appropriate bits of the address virtual address: [ VPN ][ PO ] physical address: vvvvvvvvvvvvvvvvvv--------- available before we do any lookup [ PPN ][ PO ] [ cache tag ][ index ][ofst ] (for the L1) if index bits only include part of the page offset, we can do correct set lookup as soon as we have the virtual address, b/c we know the PO part of the physical address --- even though we don't know the rest of the physical address - mulit-level page tabl eaddress splitting with a multi-level page table we split the VPN into pieces the first piece is used for the first-level lookup the first-level lookup finds wher ethe second-level table is the second piece is used for the second-levle lookup the second-level lookup finds where third-level table is (if there is one) ... how big are the pieces? if the first-level table has N entries, then the first piece of the VPN must be log_2(N) bits (and same for other levels) if the table size is the same at each level and the VPN is N bits and there are K levels, then each piece of the VPN must be N/K