- coverage versus lectures
  ~ approx. proportional to how much time was spent in lecture since the last exam
  ~ may incidentally cover stuff from beforehand
- clarifications on the questions
  ~ put an asterisk in the box
  ~ a good clarification: an assumption you think you needed to make
- precise exceptions (S2017 Q9)
  ~ not covered yet; wait for Exam 3
- HCL code on the exam
  ~ no writing it, but we might describe a circuit with HCL code, etc.
- types of misses
  - compulsory
    ~~ miss even if cache size = infinity
    ~~ first time we access a block, ever
  - capacity
    ~~ miss even if fully associative (# ways = total # blocks)
    ~~ not enough room; no extra space, even in other sets
  - conflict
    ~~ anything else
    ~~ miss, but there was room in other sets
- counting cache misses
  (1) figure out how to find the sets used
      (a) figure out the index bits and write down the indexes
      (b) figure out the distance between things that map to the same set
          (= # bytes per way = 2^(# index bits + # offset bits) bytes)
          e.g. 1MB direct-mapped cache, address 0 and 0 + 1MB --> same set
               b block offset bits, s set index bits
               2^b * 2^s = 1MB --> b + s = 20
               x + 2^20 --> doesn't change the bottom 20 bits
          e.g. 2-way 2MB set-associative cache, address 0 and 0 + 1MB --> same set
      either way, identify accesses to the SAME SET
      (a small C sketch of this step follows this list)
  (2) go through the accesses:
      (a) if something was never accessed before, then it's a miss
      (b) if something was accessed before, check what else FROM THE SAME SET was
          accessed since then and consider the replacement policy
          - if LRU (Least Recently Used):
            too many things used more recently --> miss, otherwise hit
            ("too many" --> more than blocks/set)
          - if FIFO (First In, First Out):
            too many things brought in (= missed) more recently --> miss, otherwise hit
            ("too many" --> more than blocks/set)
          - if random: ???
- cache misses, replacement policies, etc.: see F2017 Q5
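(added sketch, not from lecture) step (1) as a small C program: compute each address's
set index and tag so that accesses to the SAME SET can be grouped before applying step (2).
The cache parameters are assumptions for illustration (1MB direct-mapped with 64-byte
blocks --> 6 offset bits, 14 index bits); adjust them to the cache in the question.

    #include <stdio.h>
    #include <inttypes.h>

    /* assumed parameters: 1MB direct-mapped cache, 64-byte blocks
     * 2^6 = 64 bytes/block --> b = 6 offset bits
     * 2^(b+s) = 1MB        --> s = 14 index bits                  */
    #define OFFSET_BITS 6
    #define INDEX_BITS  14

    static uint64_t set_index(uint64_t addr) {
        /* drop the block offset, then keep only the index bits */
        return (addr >> OFFSET_BITS) & ((1ULL << INDEX_BITS) - 1);
    }

    static uint64_t tag(uint64_t addr) {
        /* everything above the index bits */
        return addr >> (OFFSET_BITS + INDEX_BITS);
    }

    int main(void) {
        /* 0 and 0 + 1MB differ only above bit 19 --> same set, different tag */
        uint64_t addrs[] = { 0x0, 0x100000, 0x40, 0x100040 };
        for (int i = 0; i < 4; i++)
            printf("addr 0x%06" PRIx64 ": set %5" PRIu64 ", tag %" PRIu64 "\n",
                   addrs[i], set_index(addrs[i]), tag(addrs[i]));
        return 0;
    }

Accesses whose set indexes match compete for the same blocks; only those need to be
checked against the replacement policy in step (2).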
- write misses
  - write miss: the address we wrote to was not in the cache
  - write policies: when writing:
    [write-allocate]       if it's not in the cache, should I bring it in?
    [write-back/through]   if it's in my cache (now), should I just update my cache,
                           or should I also update memory?
- write-back and dirty bits
  - dirty bit --> value in cache != value in memory; memory needs to be updated
    - only needed for write-back
    - used to send updates to memory when the block is evicted
- dependencies versus hazards
  - dependency --- DOES NOT DEPEND ON THE CPU
    - X depends on Y (or "has a dependency on Y"):
      we need Y's result to compute X
  - hazard --- FOR A PARTICULAR CPU DESIGN
    - the natural implementation (no forwarding, stalling, branch prediction, etc.)
      would do the wrong thing or not know what to do
- forwarding
  - instead of writing a value to the register file and reading it back
    (which requires waiting for the writeback stage), send the value directly

        mrmovq (%rax), %rax
        addq %rax, %rax

    (1) when is the value computed?
        near the end of mrmovq's memory stage
        without forwarding: at the end of mrmovq's writeback
    (2) when is the value used?
        near the beginning of addq's execute stage
        without forwarding: near the beginning of addq's decode

    with forwarding (send the value WITHOUT going through the register file):
        mrmovq   F  D  E  M1 W
        addq        F  D  D  2E M  W
        (1 cycle of stalling; 1 = where the value is computed, 2 = where it is used)
    without forwarding (send the value through the register file):
        mrmovq   F  D  E  M  W1
        addq        F  D  D  D  2D E  M  W
        --> make sure we stall enough to get one after the other

    (3) after we stall enough, do we have to do something besides read the register file?
        this is "forwarding": in addq, instead of taking %rax from the register file,
        take it from mrmovq's pipeline register input

        d_valA = [
            ...
            ... : m_valM   /* == mem_output */
            ...
        ]
        d_valB (same thing)

    alternate example:
        addq %rax, %rax   F  D  E1 M  W
        nop                  F  D  E  M  W
        nop                     F  D  E  M  W
        addq %rax, %rbx            F  D  2E M  W

        d_valA = [
            ...
            ... : W_valE   /* == reg_inputE */
            ...
        ]
- bubbles and stalls
  - stalling AN INSTRUCTION: stopping the instruction from advancing to another pipeline stage
  - stalling A REGISTER: keeping the register from being updated

                          F  D  E  M  W
        [this cycle:]     e  d  c  b  a
        [normal advance:] f  e  d  c  b
        [stall d:]        e  d  *  c  b
                                ^--- fill with a NOP

    to implement this stall of instruction d:
        stall the PC register (fetch e again)   [caveats: mispredictions]
        stall the fetch-to-decode registers (use d's values again)
        bubble the decode-to-execute registers
        ^^^^^^
            \------ have the registers output the values for a nop
- aliasing
  - two names for the same value

        int A[100];
        int *x = &A[10];
        int *y = &A[20];

    x[10] and y[0] are the same value
  - and loop unrolling
    - not very related
    - loop unrolling: the compiler needs to know the loop index and bound

        for (int i = 0; i < n; ++i) { BODY }
      becomes
        for (int i = 0; i < n; i += 4) {
            BODY   /* using i     */
            BODY   /* using i + 1 */
            BODY   /* using i + 2 */
            BODY   /* using i + 3 */
        }

      loop index: i
      loop bound: n
    - maybe aliasing could make it hard to tell how i is changed
      (e.g. does the loop body change i?)
      ... but then i would need to be something which has a pointer to it
  - and cache blocking

        for (int i = 0; i < N; ++i)
            for (int j = 0; j < N; ++j)
                A[i * 100 + j] += f(B[j * 100 + i]);

    A and B are probably pointers; the compiler can't do cache blocking without
    worrying that they point to the same array
    cache blocking: change the order in which the (i, j) iterations happen
    (process the arrays in blocks; see the sketch at the end of these notes)
- fully associative versus direct-mapped
  - fully associative: associativity = # blocks
    - most flexible possible
    - really expensive to implement (lots of tag comparisons for every lookup)
    - almost never used for data/instruction caches
    - good for reasoning about conflict misses
  - direct-mapped: associativity = 1
    - least flexible possible
    - really easy/fast to implement
- metadata in caches
  - valid bit for every block (need a way to represent an empty cache)
    - valid = 1: block is used
    - valid = 0: block is UNUSED --- ignore everything else
  - tag bits --- multiple things map to a set -- is this the right one?
    NB: needed regardless of associativity
  - [if write-back] dirty bit -- do I need to update memory eventually?
    OR: is my value different than the one in memory?
  - bits to implement replacement policies
    e.g. for LRU: an ordered list of blocks (most to least recently accessed)
- S2017 exam 2 Q1 -- branch prediction

        andq %rax, %rax
        jle later          [NOT TAKEN]
        irmovq
        subq
    later:
        addq
        halt

        andq      F  D  E  M  W
        jle          F  D  E* M  W
        later+0         F  D
        later+1            F
        irmovq                F  D  E  M  W
        subq                     F  D  E  M  W

    * find out we made a misprediction; later+0 (addq) and later+1 (halt) are squashed
      squashing now keeps later+0's addq from changing the condition codes

    Q: REGISTER bubble/stall signals during jle's execute to cause addq/halt to be squashed
       bubble/stall signals are for what happens in the next cycle --> during jle's memory

        during jle's execute:             during jle's memory:
            fetch:     later+1                fetch:     irmovq (new)
            decode:    later+0                decode:    (nothing)
            execute:   jle                    execute:   (nothing)
            memory:    andq                   memory:    jle
            writeback: [old irmovq]           writeback: andq

        fetch->fetch (PC):  normal
        fetch->decode:      bubble --- output a nop instead of the normal value (later+1)
        decode->execute:    bubble --- output a nop instead of the normal value (later+0)
        execute->memory:    normal
        memory->writeback:  normal
- loop unrolling tradeoffs
  + fewer comparison / index-management instructions for a large # of iterations
  + fewer conditional jumps for a large # of iterations
  - maybe more work for a small # of iterations due to "leftovers" (see the sketch below)
  - more space in the instruction cache and in binaries
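(added sketch, not from lecture) to make the "leftovers" tradeoff concrete: one common
way to unroll by 4 in C, with a cleanup loop for the iterations left over when n is not
a multiple of 4. The summation body and function name are made-up placeholders.

    /* original loop: for (int i = 0; i < n; ++i) sum += A[i]; */
    long sum_unrolled(const long *A, int n) {
        long sum = 0;
        int i;
        /* main loop: 4 iterations of work per compare + conditional jump */
        for (i = 0; i + 4 <= n; i += 4) {
            sum += A[i];
            sum += A[i + 1];
            sum += A[i + 2];
            sum += A[i + 3];
        }
        /* "leftovers": up to 3 iterations remain when n % 4 != 0 */
        for (; i < n; ++i)
            sum += A[i];
        return sum;
    }

For a large n the cleanup loop is negligible; for a tiny n it can be most of the work,
which is the downside noted above.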
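(added sketch, not from lecture) for the "and cache blocking" item above: one way the
blocked version of that loop nest could look. The tile size BS, the restrict qualifiers,
and the body of f are assumptions/placeholders; N = 100 matches the stride in the example.

    #define N  100   /* matches the row length in the example above */
    #define BS 16    /* tile size: picked so a BS x BS tile fits in cache (assumption) */

    static long f(long x) { return 2 * x; }   /* placeholder for whatever f does */

    /* restrict promises A and B do not alias -- exactly the guarantee the
     * compiler would need before doing this transformation on its own */
    void blocked(long *restrict A, const long *restrict B) {
        for (int ii = 0; ii < N; ii += BS)
            for (int jj = 0; jj < N; jj += BS)
                /* visit the (i, j) pairs one BS x BS tile at a time, so the
                 * parts of A and B being touched stay small enough to cache */
                for (int i = ii; i < ii + BS && i < N; ++i)
                    for (int j = jj; j < jj + BS && j < N; ++j)
                        A[i * N + j] += f(B[j * N + i]);
    }

Same (i, j) pairs as the original nest, just visited tile by tile, so the column of B
being read stays in the cache across the rows of a tile.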