- 5-stage pipeline stages
    ~ our textbook had one possible choice for splitting up the steps
        (and a common choice)
      but there are other options
    ~ fetch
        [perhaps better named "fetch and PC update"]
        read instruction memory
        compute next [predicted] PC
            (but tricky cases like conditoinal jumps and return really
             handled later)
    ~ decode
        [perhaps better named "register read"]
        read registers in instruction
        compute destination registers for later stages
    ~ execute
        did whatever the ALU needed to do
            arithmetic for OPq instructions (add, sub, xor, and)
            addition/subtraction for address computation (rmmovq, mrmovq, push, pop
                call, ret)
        wrote condition codes based on ALU operatoin (if required)
        evaluated condition codes (cmovXX, jXX)
    ~ memory
        read or wrote data memory (if required)
    ~ writeback
        wrote registers

    in the pipelined processor, each instruction does the stages in order
        b/c values have to go into a pipeline register bank before the instructoin
        can advance to the next stage

- dependencies and how they affect OOO
    instruction X depends on instruction Y if we can't compute the result of instruction X
    until we have the result of instruction Y

    mostly this means that instruction X reads from a register that instruction Y writes

        but what about control flow?
            addq %rcx, %rdx <--- A
            jle foo <-- A'
            subq %rax, %rbx <--- B

        whether instruction B executes depends on the result of A and A'
            (do we jump over B?)
        but OOO will predict what the branch does and not care about this kind of
        dependnecy (assuming they predict correctly)

        but in we can't compute the result of A' until A finished
            (b/c A writes to the condition codes, and A' reads the conditoin codes)

    cases where it's not obvious whether there's a dependency for OOO purposes:
        * "hidden" register --- %rsp, conditoin codes
        * values in memory (if accessing same address, but computed from different values)
        * indirect dependencies
            addq %rax, %rbx <-- X
            subq %rbx, %rcx
            xorq %rcx, %rdx <-- Y
                Y depends on X
        * same name, but different value (what OOO use register renaming to deal with)
            addq %rax, %rbx <--- X
            mrmovq (%rcx), %rbx  <--- Z
            rrmovq %rbx, %rdx <--- Y
                Y does NOT depend on X
                Y DOES depend on Z
        * doesn't actual read the register
            addq %rax, %rbx <--- X
            mrmovq (%rcx), %rbx  <--- Y
                Y does NOT depend on X

- exception handling procedure
    ~ two categories of exceptions
        synchronous exceptions:
            some instruction was executing and as part of the execution of
            that instruction, the processor detected a "special" condition
                e.g. system call (current program wants to run the OS)
                e.g. page fault (program trying to access memory that isn't setup)
                e.g. divide by zero
                ...

            effectively the instruction turns into a special function call
        asynchronous exceptions:
            something happens outside the processor
                (conceptually outside the processor --- not related to what instruction
                is being run)
            the processor chooses some instruction stop on
            and then effectively does a special function call
    ~ when the exception is triggered, the processor is going to run an
        _exception handler_ previously setup by the operating system

        * the processor saves the old program counter somewhere

        * (sometimes) the processor saves some additional regsiters
            (x86: save the old stack pointer)

        * the address of the exception handler will be determined from the
            _exception table_. The processor will know about the exceptoin table
            through a special register the OS sets up on boot
            
          the exceptoin table usually contains one pointer for each type of exception

          processor reads pointer from that table, and jumps there

        * processor enters "kernel mode" where extra operatoins are allowed
            example extra operations:
                * setting page tables
                * setting the special register that says where exception hnadlers are

    ~ then operating system code runs and decides what to do, and typically
        * save the state of the program
        * do some special action depending on the type of exception
        * choose what program to run next (could be the same program that was previously
          running)
        * return from the exception handler into that program (with state saved previously)

- page tables
    ~ way of doing address translation:
        each process has addresses it uses which map in an OS-controlled way to
        real addresses on the hardware
    ~ the possible addresses are divided into fixed sized pages
    ~ table for each page in the addresses the process uses ("virtual addresses")
        saying which page in the addresses the hardware really uses ("physical addresses")
        should be used (if any)
        ~ e.g. on memory HW with the single-level examples, the page table
            was the part of the memory from the base address extending N bytes later where
            N was (PTE size * number of virtual pages).
    ~ gives the OS flexibility: it can put the program whereever it wants in memory
        without having the program need to be changed to accomodate this

    ~ by adding a few extra bits to page tables, we can use them to give variable
        acess to memory (e.g. make part of it read only)
            CPU is already looking up the translation, can lookup at the same time
            "is the program allowed to write this", etc.

    ~ common extra bits we might want in page tables:
        * valid bit -- b/c not every address the program makes up will be loaded in memory
        [not all of the falling bits exist on every platform:]
        * permission bits ("what operations can the program do")
            * user-mode/kernel-mode bit: "can the program access this in user mode"
                (versus reserved for the OS)
            * writable-bit: "can the program access this with instructions like
                rmmovq or push"
            * executable-bit: "can the program access this when it's fetching
                instructions"
        * bits to assist with swapping --- to help the OS tell how the memory is used
            * accessed/referenced bit: "has this page table entry been used
                since this bit was las tcleared"
            * dirty bit: "has this page table entry been used to perform a write
                since this bit was last cleared"
        * bits to control caching

- where would page table handling fit in pipehw2
    if we added page tables, we would need to do a page table lookup
            (or equivalent)
    before the instruction memory access and before the data memory access
        (or as part of those)
    simplest (but slow) solution:
        * to fetch an instruction:
            extract the virtual page number [get VPN]
            combine it with a page table base register (needs to be added)
                to get the locatoin of a page table entry
            read the page table entry
            extract the physical page number from the page table entry [get PPN]
            combine PPN with the page offset from the original address fetched
            then fetch that address and use the result of that fetch
        and something similar for data memory reads/writes

        [but some special cases:
            if multi-level page tables, need to find page table entries at each level
            if we are fetching 10 bytes from near the end of a page, we might need
                to lookup two pages and fetch the first K bytes separately from
                the sceond 10-K bytes and then combine them
        ]
    less slow solution:
        implement the simple solution
        the add a cache ("TLB") for the result of the VPN to PPN translation
            (everything between [get VPN] and [get PPN] above)
    
- copy-on-write
    fork() -- appears to copy all memory in a process to make a new proces
    simple implementatoin: [real copy version]
        actually make a copy of everything,
        setup a new page table (for the new process) for that points to the copies

    let's say both processes never wrote to any of their memory after fork()
        if so, we have an even simpler implementation idea:
                just use the same page table

    compromise that does handle writes: [copy on write versoin]
        make a new page table for the new process that points the same memory
        set both page tables as read-only
            with these steps, either we get perfect, fast copy OR
            we detect that something needs to be fixed b/c we get a protection fault
                (exception for accessing a read-only page)
        when one of the processes tries to WRITE a read-only page,
            [from the protectoin fault handler:]
            actually make the copy of that page
            update the process's page table to use the copy and mark as writeable
            rerun the instruction that was trying to write the page

    from the program's point of view both the real copy version and the copy on write versoi
        are the same except for performance

- swapping
    illusion to provide: the program has access to as much memory as it wants
        (even if that would really require using disk/etc.)
    reality: not all of the addresses the program can access are really loaded into memory
    
    when the program tries to access something that's not loaded:
        a page fault will happen (b/c it won't be valid in the page table)
        [from the page fault handler:]
        load that thing into memory
            (possibly after making room) [****]
        and then rerun the instruction that was accesses the thing that wasn't loaded
    
    from the program's point of view: same as if everything was always loaded
        except for performance

    [****] how do we make room for data?
        we have to choose something to remove from memory to make room
            often, this might be data that loaded for another/same program
            to make room, need to edit the page table to not point to that physical page
            might need to save data to disk so we don't forget it
        need informatoin to make a good choice
            want to detect if programs are using particular pages
        need informatoin to decide whether to save to data to disk
            want to detect if programs have written to pages since they were 
                last saved to disk/generated


- TLB
    (typicallly small) cache that we lookup VPNs in and get out page table entries
    this replaces the entire page table lookup on a TLB hit
        (even if it's a multi-level lookup)
    organized the same kinds of ways as L1 caches but:
        * instead of addresses to split into tag/index/offset have VPNs
        * instead of mutli-byte blocks have single-page table entry blocks
            (so always 0 offset bits)
        * instead of storing data the program cares about, store page table entries
        * instead of handling writes automatically, we rely on the OS
            executing a special instruction when the page table entries change

    - and overlapping with L1 lookup
        * typically L1 caches are going to use physical addresses
            (you could use virtual addresses, but it's complicated)
        * but we don't want to wait for TLB lookup to finsih to start our L1 lookup

        * divide the L1 lookup
            - access the correct set of the cache     [index bits]
            - access the correct part of a block    [offset bits]
            - compare the tags to find the correct block  [tag bits]
        * can do each step as soon as we have the appropriate bits of the address
            virtual address:
         [            VPN    ][           PO   ]
            physical address:
                              vvvvvvvvvvvvvvvvvv--------- available before we do any lookup
            [           PPN  ][           PO   ]
            [    cache tag     ][ index ][ofst ] (for the L1)

                if index bits only include part of the page offset,
                    we can do correct set lookup as soon as we have the virtual address,
                    b/c we know the PO part of the physical address ---
                    even though we don't know the rest of the physical address


- mulit-level page tabl eaddress splitting
    with a multi-level page table we split the VPN into pieces
        the first piece is used for the first-level lookup
        the first-level lookup finds wher ethe second-level table is
        the second piece is used for the second-levle lookup
        the second-level lookup finds where third-level table is (if there is one)
        ...

    how big are the pieces?
        if the first-level table has N entries,
            then the first piece of the VPN must be log_2(N) bits
        (and same for other levels)
         
        if the table size is the same at each level and the VPN is N bits
            and there are K levels,
            then each piece of the VPN must be N/K