- GILMER HALL
- space used by page tables (e.g. 100% of virtual addresses occupied)
- cache locality and loops
    for (i = 0 to N)      <-- changes less frequently
        for (j = 0 to N)  <-- changes more frequently
            array[i]      // repeatedly access same place
                array[0] array[0] array[0] array[0] array[0]  <-- temporal
            array[j]      // accessing consecutive elements
                array[0], array[1], array[2], array[3]  <-- spatial
            array[i + j]
                array[0] array[1] array[2] array[3] ... array[N]
                (next i) array[1] array[2] array[3] ...
            array[i + j * 1000]
                array[0] array[1000] array[2000] ...
    for (j = 0 to N)
        for (i = 0 to N)
            array[i]
                array[0] array[1] array[2] array[3]
    for (j = 0 to N)
        for (i = 0 to N)
            for (k = 0 to 3)  // <--
                array[i + k]
                    array[0] [1] [2] [3]
                    array[1] [2] [3] [4]
- loop order in the rotate
    for (i <- 0 to N)
        for (j <- 0 to N)
            dst[i * N + j] = src[j * N + i]
            // bad locality in src
            // good locality in dst
    empirical observation: on our test machine, locality in reads was
        more important than locality in writes
    guess: writes can be done "in parallel", "in the background"
        also L1 is probably write-through to L2
        - can do writes of less than a block
- data flow - OOO
    - produce a dependency graph of the program
    - key ideas: things will be done in parallel if:
        - no dependencies (can get values in time)
        - enough functional units
            - two adds/cycle? (two adders)
            - one multiply issued every cycle, with three-cycle latency:
                (a * b) * (c * d)
                cycle:   1     2     3     4     5     6     7
                       MULT1 MULT1 MULT1                         <-- (a * b)
                             MULT2 MULT2 MULT2                   <-- (c * d)
                                   MULT3 MULT3 MULT3             <-- can't use! operands not ready yet
                                               MULT  MULT  MULT  <-- (a*b)*(c*d)
- ISA -- definition
    - instruction set architecture
    - what is in the ISA:
        - what the instructions do
        - how the instructions are encoded
        - everything functional about what instructions do
            - includes # of registers visible to instructions
              (might not be the # of registers in HW)
    - what is not in the ISA:
        - how the instructions are implemented
        - how fast the instructions are
- context switching
    - save context from CPU
        - program registers (%rax, %rsp, etc.)
        - condition codes
        - page table base register
        - all visible state in the CPU
    - restore previous context to CPU
        - copy values from OS data structure to CPU registers
- exceptions and pipeline state
    - stop after a particular instruction
        - NEVER expose partially finished instructions
    - mechanism: squash/bubble instructions after the exception
      ("precise exceptions")
        mrmovq  F D E M[fault]
        addq    F D E [squashed!]
            - can't have changed program registers (no W yet)
            - can't have changed memory (no M yet)
            - could be changing condition codes --- but can stop that!
        subq    F D [squashed!]
- TLB
    - cache for page table entries
    - normal caches:  address --> contents of that address
      TLBs:  virtual page # --> **PTE for that virtual page #**
    - always blocks of one PTE (no offset bits)
    - set-associative
        (index bits --- lower bits of VP#)
        (tag bits --- rest of VP#)
    - hardware managed  <-- almost all current systems
        hardware does the page table lookup and fills the TLB on demand
    - software managed
        OS does the page table lookup (in response to a fault) and runs
        a special instruction to add the entry to the TLB
- physical vs virtual caches
    - virtually-indexed/physically-tagged
        - cache index + offset bits are ONLY in the page offset
        - same set is looked up with the virtual OR the physical address
        - can store ONLY physical addresses in the cache, but still
          start the lookup with a virtual or physical address
          (start lookup = read the set of the cache)
        - still need the physical address to check the cache tag
        - condition: index + offset bits <= page offset bits
    - physically-indexed/physically-tagged
        - cache index + offset may overlap with the physical page #
        - can't look up the cache set without getting (part of) the
          physical page #
    - above options assume PHYSICAL addresses in the cache
        - almost all processors do this
    - alternate implementation: VIRTUAL addresses in the cache
        - advantages:
            - use any cache organization and do the cache lookup
              without a TLB/page table lookup
            - avoid TLB lookups in most cases
        - disadvantages:
            - synonyms --- two virtual addresses that are different in
              the programs could map to the same physical address
            - invalidate the cache a lot?
            - process #s in the cache tags?
            - what happens when the page table changes?
- page directories
    - Intel's name for the non-last-level page tables in their
      multi-level page tables
    - page tables that point to page tables, instead of to the
      physical pages that store program data