- precise exceptions, OOO (out-of-ordeR)
    - OOO: processors execute instructions out-of-order
    - precise exceptions: covered later if at all

- cache misses questions on most recent post quiz
- tag/index/offset/etc. questions
    - one way: memorize the formulas
    - picture of the cache:
        rows as sets
        columns as ways
        width of columns is block size
    - bits: log_2(number of things indexed)
    - you can do Algebra

    - example: compuitng tag without block size
        - tag bits = address size - (set index bits + block offset bits)
        - tag bits = address size - log_2(cache size / ways)
                = address size - log_2(# sets * block size)
                = address size - log_2(cache size / ways)

- bubbling and squashing
    - how do you set the bubble/stall signals
        - figure out what instructions should be running
          next
        - bubble signal if next instruction after regs should be
            nop
        - stall signal if next instruction after regs should be
            same as current
    - branch predictions:
        - compute the result of the branch
        - in *that cycle*, we set bubble signals so in
            the next cycle: "squashing"
            - there are nops replacing every instruction
              that came after a mispredicted branch
              (PIPE: instruction going to decode,
                     instruction going to execute)
        - in the next cycle, we fetch the correct instruction
- finding #s of stalls
    - forwarding requirements:
        - value must be computed before its needed
        - don't make any clock cycles much longer (no feeding memory results to the ALU)
    - example: ret: compute the return address near the end of memory
    - stalling: adjust instruction so forwarding, etc. works
        - example: ret stall fetch until return address computed
            --> stall fetch until ret's writeback
    -{ popq %rax
       addq %rax, %rax }-

       popq %rax  F D E M* W
       [addq %rax   F D E* M W <--- not possible]
       addq %rax    F D D  E M W
                        ^^^ <- nop in execute 
   - no stalling *with forwarding*
       addd %rax, %rax  F D E*M W
       addq %rax, %rax     F D E M W

    - stalling *without forwarding*
        addq %rax       F D E M W*
        addq %rax, %rax   F D D D D E M W
                              1 2 3

        popq %rax       F D E M W*
        addq %rax, %rax   F D D D D E M W

- Spring 2017 Q4 --- returns and pipeline hazards
    - return reads %rsp
    - wait until memory FINISHES to fetch next instruction

- good temporal locality/spatial locality?
    - temporal: is it access twice?
        if so, how frequently are duplicate accesses?
    - spatial: what is the distance between consecutive
       accesses?
- locality quests on most recent post quiz

- cache blocking (breaking up a loop)
    for (int i = 0; i < N; ++i) 
        for (int j = 0; j < N; ++j) 
            A[i * N + j] = B[j * N  + i];
    bad locality in B and good in A
        (swap loop orders: good locality in B and bad in A)
        -- common problem with loop orders
    
    for (int ii = 0; ii < N; i += 2) 
        for (int j = 0; j < N; ++j) 
            for (int i = ii; i < ii+2; i += 1) 
                A[i * N + j] = B[j * N  + i];
    better locality in B (distance between consecutive B acceses
        is often 1)
    worse locality in A (distance between consectuve A accesses
        is sometimes N)

        --> overall better locality
            --> expansions: better choice than "2"
            --> repeat for j
        --> worse locality doesn't matter --- q: what fit sin cache?

- loop unrolling
    for (int i = 0; i < N; ++i)
        foo(i);

    for (int i = 0; i < N; i += 2) {
        foo(i); 
        foo(i+1);
    }

    executes less instructions per index
        (except you need to handle the "leftover")
    but makes the code bigger (maybe more cache misses?)

- inlining
    foo(i) {
        something complicated
    }


    for (int i = 0; i < N; ++i)
        foo(i);

    -->
    for (int i = 0; i < N; ++i)
        something complicated

    makes code bigger, but less instructions for function call/return

- aliasing

    two things are different names for the same thing:

    foo(int *px, int *py) {
        for (int i = 0; i < 10; ++i) {
            *px += *py;
        }
    }

    foo(&x, &y)
    foo(&x, &x) <-- possible call