- how cumulative? ~half (+/-) will be from after exam 2
- counting cycles for pipelined processors
    one way: big timing diagram
        add     F D E M W
        sub       F D E M W
        xor         F D E M W
        mrmovq        F D E M W
        add             F D E M W
    another way: start with no stalling
        no stalling --> # cycles = # instructions + 4 [D E M W of last instruction]
        each cycle we need to stall --> +1 cycle
        need to check dependencies
            draw out where forwarding can happen
            Q: is this actually going forward in time
        textbook's CPU only had two cases where we needed to stall:
            ret
            load/use
        each time we mispredict a branch --> some penalty
            textbook's CPU --> wait until jmp's memory stage --> 2 wasted instructions
- counting cycles for multiple accumulator/OOO scenarios
    functional units:
        latency (time to answer)
        pipelined or not
            (if pipelined, new input/output every cycle;
             if not pipelined, one at a time)

    addq %rax, %rbx    (1)
    addq %rcx, %rbx    (2)
    addq %rdx, %rcx    (3)
    addq %r8, %r8      (A)
    addq %r9, %r9      (B)
    addq %r10, %r10    (C)
    ...
    addq %rbx, %rbx    (4)

    if we have EXACTLY ONE pipelined adder with 3 cycle latency:
        cycle:  1 2 3 4 5 6 7 8 9 10
        (1)     1 1 1
        (3)       3 3 3
        (A)         A A A
        (2)           2 2 2
        (B)             B B B
        (C)               C C C
        (4)                 4 4 4
        (D)                   D D D
    heuristic: longest path (1) --> (2) --> (4)
        minimum total needed time: time to do (1) THEN (2) THEN (4)
        3 cycle latency --> 9 cycles
        assumption: other operations aren't too much work
            e.g. you have a loop, and the longest chain is "enough" of the work
            so the other stuff can be done in parallel
        assumption: everything else can be done in parallel
    (+ see drawing)
- virtual -> physical addrs
    [ VPN | PO ]
       |     |
  page table |
       |     |
       v     v
    [ PPN | PO ]
    PO = page offset
        location within a page
        page size is a parameter
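The split-and-recombine in the diagram above can be sketched with bit operations. This is a minimal sketch, assuming 2^12-byte pages; the function names are made up for illustration:

```python
PAGE_OFFSET_BITS = 12  # assumed: 2^12-byte (4 KiB) pages

def split_vaddr(vaddr):
    """Split a virtual address into (VPN, page offset)."""
    vpn = vaddr >> PAGE_OFFSET_BITS
    po = vaddr & ((1 << PAGE_OFFSET_BITS) - 1)
    return vpn, po

def make_paddr(ppn, po):
    """Recombine the PPN (looked up via the page table) with the unchanged PO."""
    return (ppn << PAGE_OFFSET_BITS) | po
```

Note that only the VPN is translated; the page offset passes through unchanged, e.g. `split_vaddr(0x12345)` gives VPN `0x12` and offset `0x345`.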
    e.g. if page size is 2^12 bytes, page offset is 12 bits
    VPN = virtual page number
    PPN = physical page number = physical frame number = frame number
    page table: entry for each virtual page number
        [* slightly different for multi-level]
        containing a valid bit, PPN, some permission bits
    single-level page table:
        page table entry for VPN x is at address (page table base) + (entry size) * x
- TLB
    cache for page table entries
    maps VPNs (instead of physical addresses) to page table entries
        each block = 1 entry
    [        VPN       ]                [         Phys Addr           ]
    [TLB tag][TLB index]  same idea as  [cache tag][cache index][offset]
    # index bits = log_2(# sets)
    # entries / # ways = # sets
    on TLB hit:
        VPN --> TLB --> get out PPN (+ permission bits)
            --> use to form the physical address along with the PO
    on TLB miss: (PTE = page table entry)
        VPN --> TLB --> get out "miss"
        VPN --> combine with page table base to get PTE address
            --> fetch the PTE --> extract PPN (+ permission bits)
            --> use to form the address --> and update TLB
- multi-level page tables
    [          VPN        ][ PO ]
    [ VPN part 1][VPN pt 2][ PO ]
    VPN part 1 * page table entry size + page table base
        --> get a PTE from level 1
    use PPN from level 1's PTE to form level-2 page table base (that PPN * page size)
    VPN part 2 * page table entry size + level-2 page table base
        --> get a PTE from level 2
    if last level (this example): use PPN + PO to get final physical address
    otherwise:
        use PPN from level 2's PTE to form level-3 page table base (that PPN * page size)
        VPN part 3 ...
        ...
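The walk above repeats the same step per level, so it can be sketched as a loop. This is a toy model, not real hardware: memory is a dict from PTE address to a (valid, PPN) pair, and the 8-byte entry size and 2^12-byte page size are assumptions:

```python
PAGE_BITS = 12   # assumed: 2^12-byte pages
PTE_SIZE = 8     # assumed: 8-byte page table entries

def walk(memory, base, vpn_parts):
    """Multi-level page table walk (sketch).
    memory maps PTE addresses to (valid, PPN) pairs."""
    ppn = None
    for part in vpn_parts:
        pte_addr = base + PTE_SIZE * part   # index into this level's table
        valid, ppn = memory[pte_addr]
        if not valid:
            return None                     # page fault: "hole in memory"
        base = ppn << PAGE_BITS             # next level's table base
    return ppn                              # last level's PPN forms the phys addr

# two-level example: VPN split into parts (2, 3)
memory = {
    0x1000 + 8 * 2: (True, 0x5),    # level-1 PTE -> level-2 table at 0x5000
    0x5000 + 8 * 3: (True, 0x7),    # level-2 PTE -> final PPN 0x7
    0x1000 + 8 * 4: (False, 0x0),   # invalid level-1 PTE
}
```

Here `walk(memory, 0x1000, [2, 3])` returns PPN `0x7`, while `walk(memory, 0x1000, [4, 0])` hits the invalid entry at level 1 and returns `None` (a page fault), without ever reading a level-2 table.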
- and TLB
    TLB only stores last-level entries
        because it has to take the WHOLE VPN
    TLB works the same no matter how many levels the page table has
- and permission bits and valid bits
    we need the valid bit to be 1 --- OR we ignore the PPN
        valid = 0 --> "hole in memory"
    usually, we also check all permissions at each level
        (--> update permission bits based on higher levels before storing into TLB)
- swapping
    use of page tables: to use memory as a cache for disk
    illusion: really, program storage is whatever can be stored on disk
        and the OS places it in memory as needed
    "swap in": move from disk to memory
        e.g. loading
        e.g. after running out of space, but accessing something again
        often triggered by a _page fault_:
            you access an invalid page
            the OS gets a fault where the CPU says
                "your program tried to access address x but you told me nothing was there"
            the OS looks up what should be at address x
            OS finds out: it's allocated, but only stored on disk at the moment
            OS allocates a physical page
            OS reads from disk into that physical page
                maybe runs another program while reading
            OS updates the page table of the program
            OS restarts the program, RERUNNING the instruction that faulted
    "swap out": move from memory to disk
        e.g. after running out of space in memory
            probably triggered by swapping in or allocating memory or ...
            possibly in a different program
        OS updates the page table to mark the virtual page as invalid
            OS needs to make sure the TLB gets invalidated
                possibly on other cores
        if the physical page was modified since it was last written to disk,
            the OS may also need to write it to disk
        OS reuses the physical page that that virtual page used to map to
- types of exceptions
    book's categories:
        faults --- some particular instruction did something
                   (and it's not something the instruction normally does)
            page fault --- tried to access a virtual page marked invalid in the page tables
            protection fault --- tried to do something in user mode that needs kernel mode, e.g.
                talking directly to a device, changing the page table base register, ...
            divide by zero
            ...
        traps --- an instruction whose purpose is to trigger exceptions
            system calls
                request OS services (e.g. things the program can't do itself in user mode)
        interrupts --- external events
            I/O devices
                keypresses, network input, disk reads finishing, etc.
            timer
        aborts --- "the machine is broken"
            e.g. physical memory around byte X has wrong values and can't be recovered
- cache blocking
    extension of loop reordering: improving locality by changing when we access values
        e.g. columns versus rows
        e.g. do everything with a value once before moving on
    usually with cache blocking: can't get perfect locality for everything
        e.g. some values accessed row-by-row, some column-by-column
    cache blocking: try to work on a square-ish block of values
        square-ish --> best ratio of work : values in the cache
    plan what should stay in the cache
        try to make use of everything we have loaded as much as we can before moving on

    for (i <- 1 to N by I)
        for (j <- 1 to N by J)
            /* hopefully the I x J chunk of stuff is in the cache
               and we do all the work we can do with just that */
            for (ii <- i to i + I - 1) {
                for (jj <- j to j + J - 1) {
                    do stuff based on ii, jj
                    e.g. use A[ii * N + jj] and B[jj * N + ii]
                         and C[jj] and D[ii]
                }
            }

    I * J elements of A, J * I elements of B, J elements of C, I elements of D
        --> Q: is the size of all of these <= cache size? probably fits?
    in practice:
        BUT what about partial cache blocks
        BUT what about other values you access from cache
            (code, global variables, the stack, ...)
        BUT what about conflict misses
- compilation steps
    [C file] --- compile --> [assembly file] --- assemble --> [object file]

    [object file] \
    [object file] --- link --> [executable]
    [object file] /

    C file:
        high-level operations
            multiple options for how they get translated into instructions
        function and variable names
    assembly file:
        chosen particular instructions
        labels for instructions that refer to each other
            or for instructions that refer to data stored in memory
            that's allocated at program start
        functions and global variable names were turned into labels
    object file:
        instructions converted to machine code
        data constants converted to bytes
        but labels STILL EXIST:
            relocations for where labels are USED
                how to update machine code to have the correct address
            symbol table for where labels are DEFINED
                needed because labels can come from other files
                example: printf() used in main.o, but defined in some library object file
    executable:
        machine code and data FROM ALL THE OBJECT FILES
            because we didn't know about other object files until now,
            couldn't know where machine code, etc. would be in memory
        with labels converted to addresses (or whatever format is needed by the CPU)
        idea [if statically linked]: just load bytes into memory, and jump there
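The relocation/symbol-table idea above can be sketched as a toy linker. This is an illustration only, far simpler than a real linker (ELF has sections, relocation types, alignment, ...): each "object file" is made up here as a tuple of (code as a list of words, symbol table mapping labels to offsets, relocations listing slots that need a label's final address):

```python
def link(objects):
    """Toy linker: concatenate each object's code, record where labels are
    DEFINED (symbol tables) and USED (relocations), then patch the final
    addresses into the combined image once every label's location is known."""
    image, symbols, relocs = [], {}, []
    for code, syms, rels in objects:
        base = len(image)                       # this object's load address
        image.extend(code)
        for label, off in syms.items():
            symbols[label] = base + off         # where the label is DEFINED
        for off, label in rels:
            relocs.append((base + off, label))  # slot to patch later
    for addr, label in relocs:                  # only now are addresses known
        image[addr] = symbols[label]
    return image

# main.o: word 1 is a placeholder for the address of "f"
main_o = ([10, 0, 20], {"main": 0}, [(1, "f")])
# lib.o: defines "f" at its offset 0 (like printf living in a library object)
lib_o = ([30], {"f": 0}, [])
```

`link([main_o, lib_o])` places `lib_o`'s code at address 3, so the relocation patches the placeholder in `main_o`'s code to 3; linking the objects in the other order would patch in a different address, which is exactly why the patching can't happen before link time.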