- what material is covered
  - pipelining and caching are most of the exam (80+%)
  - a small number of questions on non-cache performance
  - calculating # of cache misses, like the post-quiz
- different types of caches
  - associativity: # of ways --- "width" in # of blocks
    - special cases:
      - 1-way --> direct-mapped (has (# blocks) sets)
      - (# blocks)-way --> fully associative (has 1 set)
- write policies in caches
  - write-back vs. write-through: when we write a value, do we send it to memory right away?
    - write-back: no --- keep track of values we still need to send (dirty bit)
      - on write: store the value in the cache
      - on a read or write that replaces the written value: send the stored value to the next level
    - write-through: yes
  - write-allocate vs. write-no-allocate: when we write a value that isn't cached, do we add it to the cache?
    - write-allocate: yes
    - write-no-allocate: no
  - can have ALL combinations
- TIO (tag/index/offset)
  - can just memorize formulas
  - can reason from geometry --- rows = # of sets, columns = # of ways, parts of columns = block size
    --> rows * columns * parts = cache size
  - bits in address: # index bits = log_2(# sets); # offset bits = log_2(# bytes in block)
  - strategy: write down what you know (incl. formulas), do algebra, try to solve for everything you might need
- replacement policies
  - FIFO -- first in, first out -- "a queue"
    - e.g. fill A B C D, then: replace A, replace B, replace C, replace D, replace A, replace B, ...
    - don't care about how often A, B, C, D are accessed
    - just a counter: which block do we replace next?
  - LRU -- least recently used
    - do care about non-replacement accesses to blocks
    - need to update the ordering of when things are used, e.g.
      if A is being read a lot, it might never be replaced
    - track the ordering of when blocks were last accessed (2 blocks/set -- 1 bit is enough)
- locality and loops
  - temporal locality: accessing things multiple times
      for (int i = 0; i < N; ++i)
        for (int j = 0; j < N; ++j)
          A[i*N + j] = 1;
    each element of A is accessed exactly once --- no temporal locality
      for (int i = 0; i < N; ++i)
        for (int j = 0; j < N; ++j)
          A[i] = 1;
    each element of A is accessed repeatedly, N times in a row --- very good temporal locality
      for (int i = 0; i < N; ++i)
        for (int j = 0; j < N; ++j)
          A[j] = 1;
    each element of A is accessed again only after accessing all the other elements of A --- very poor temporal locality
  - spatial locality: accessing things that are near other things
      for (int i = 0; i < N; ++i)
        for (int j = 0; j < N; ++j)
          A[i*N + j] = 1;
    each element of A is 1 element from the previous one --- very good spatial locality
      for (int i = 0; i < N; ++i)
        for (int j = 0; j < N; ++j)
          A[j*N + i] = 1;
    each element of A is N elements from the previous one --- very poor spatial locality
- forwarding paths between instructions
      addq %rax, %rax    F D Ev M  W
      addq %rax, %rax      F Dv*E M  W
  - the path must run from after the value is computed to before the value is used
  - and the path can't travel backwards in time:
      popq %rax          F D E  M* W
     [addq %rax, %rax      F D *E  M  W    <--- NOT POSSIBLE]
      addq %rax, %rax      F D  D *E  M  W    (stall one cycle in decode instead)
- pipeline hazards in general
  - dependencies --> one instruction uses a value from another
  - a dependency is a HAZARD if the "natural" pipeline can't handle it
    (no forwarding, no stalling, no branch prediction, ...)
  - two kinds of hazards:
    - control hazards --- can't do it, because we can't compute the address to fetch
    - data hazards --- can't do it, because a value (a register value) isn't available in time
      (we would read an old value if we did nothing)
  - no-hazard examples:
      irmovq $10, %rax
      irmovq $12, %rax

      addq %rax, %rax
      nop
      nop
      nop
      nop
      addq %rax, %rax
- pros/cons of optimizations (function inlining, loop unrolling, cache blocking)
  - most of these optimizations are orthogonal --- can and will do all of them
  - many optimizations INCREASE CODE SIZE (might increase instruction cache misses):
    - cache blocking
    - loop unrolling
    - function inlining
  - some don't always help --- sometimes you can tell, otherwise test
    - loop unrolling --- sometimes doesn't help with loops with very few iterations
      (unless you can get rid of all the loop index tests)
    - cache blocking --- only helps if you have enough data that the cache matters
- branch prediction
  - in general: the processor guesses what instruction will run next
    - modern processors --- do something really complicated to make good guesses
    - PIPE: guess that jumps are taken (fetch from the jump target), and stall for rets
  - fetch the guess and start running it
  - figure out the actual result later
    - PIPE: figure out the result during the jump's execute stage
    - if right: do nothing
    - if wrong: cancel ("squash") all instructions that were fetched because of the wrong guess
      - PIPE: replace the decode and execute stages with pipeline bubbles
        (instead of allowing the fetch and decode stages, which contain the wrong instructions, to advance)
      - then start fetching the right instructions
        - PIPE: fetch the right instruction while the jump is in the memory stage
- cache blocking
  - identify good/bad locality in a loop that uses cache blocking
  - what kinds of scenarios does cache blocking help?
    - enough data that the cache matters
    - can find locality that you wouldn't get with just loop ordering