- forwarding / stalling data hazard --- pipeline "naturally" reads the wrong version of a value

      time ------------>
                                    v---- when irmovq naturally writes
      irmovq $100, %rax   F D E * M W
      addq %rax, %rax       F D * E M W
                              ^---- when addq naturally reads %rax

      if we do nothing, wrong answer

      forwarding: "the value 100 is available, let's use it"
          shortcut the value from irmovq's execute to addq's decode
          implementation: need to check that the irmovq instruction is writing to the same register that addq is reading from
              HCL: something like e_dstE == reg_srcA / reg_srcB (also other cases)
              need to distinguish from
                  irmovq $100, %rbx
                  addq %rax, %rax

      stalling: generic solution to hazards
          well, if we did everything one-at-a-time, it would work correctly, so we can delay the operation to get the right answer
          ONLY using stalling:
              irmovq $100, %rax   F D E M W
              addq %rax, %rax       F F F F D
          sometimes forwarding is not enough, but adding stalling can always fix a hazard
              popq %rax           F D E M* W
              addq %rax, %rax       f dX e m      (wouldn't work)
              solve with stalling only:             F F F F D E
              solve with stalling + forwarding:     F F D* E
          multiple equivalent options:
              time ------------>
              popq %rax           F D E M W
              addq %rax, %rax       F F D E
              addq %rbx, %rbx           F D

              popq %rax           F D E M W
              addq %rax, %rax       F D D E
              addq %rbx, %rbx         F F D
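      a rough C sketch of the forward-vs-stall check described above (just an illustration, NOT
      the course's HCL; REG_NONE, the struct, and the e_value_ready flag are made-up names):

          #include <stdbool.h>

          #define REG_NONE 0xF    /* "no register" encoding, like Y86-64's 0xF */

          struct decision {
              bool forward_from_execute;  /* use the execute-stage result instead of the register file */
              bool stall;                 /* value isn't available anywhere yet: repeat fetch/decode */
          };

          /* e_dstE   = register the instruction in execute will write (irmovq's %rax above)
             reg_srcA = register the instruction in decode wants to read (addq's %rax above) */
          struct decision check_srcA(int e_dstE, int reg_srcA, bool e_value_ready) {
              struct decision d = { false, false };
              if (reg_srcA != REG_NONE && reg_srcA == e_dstE) {
                  if (e_value_ready)
                      d.forward_from_execute = true;  /* irmovq/addq case: value exists early, shortcut it */
                  else
                      d.stall = true;                 /* e.g. a load whose value only exists after memory */
              }
              return d;
          }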
- counting cache misses from C code
      the C code is "just" specifying a list of addresses that are accessed
          typically with a pattern that should help identify what goes on more quickly
      what addresses? generally, shouldn't matter much whether the array is at address 0 versus 1024 versus so on
          --- so long as it's aligned with cache blocks
      you can look at each cache set separately
          often different loop iterations (values of i, etc.) access completely different cache sets
          can just consider iterations that might access the same set
      sometimes you'll access the same block repeatedly
          --> same element of some array?
          --> or adjacent elements of the array which have the same index bits
              (e.g. if you have 32-byte cache blocks, sizeof(array[0]) == 4, and array[0] is at the
              beginning of a block, then array[0] through array[7] are in the same block)
      sometimes you'll access different blocks that use the same set --- worry about conflicts
          (how many ways are there for the set?)
          (e.g. if you have 1024 sets of 32-byte blocks, then elements that map to the same set are
          multiples of 1024*32 bytes apart = # bytes in a way
          [log_2(32) offset bits, log_2(1024) index bits, so for the index+offset bits not to change,
          need to add 2^(offset bits + index bits) = 2^(log_2(32) + log_2(1024))])
      look at the order accesses are in for each set, figure out what the contents of the set is
          direct-mapped cache: set contains only what was last accessed
          N-way cache: set contains N things, which ones depend on the replacement policy
      often loops mean that most sets have the same hit/miss pattern
      remember: the chances for hits are repeated accesses to the same block

      int array[1024 * 128]; /* <--- starts at beg. of cache block */
      ...
      for (int i = 0; i < 1024; ++i) {
          array[i] += array[i * 128];
      }   ^^^^^^^^    ^^^^^^^^^^^^^
          A           B
      two types of repeated accesses:
          accessing array[i] and array[i+1] when they're in the same block
              roughly half of the accesses to A are hits
          accessing array[i*128] and then accessing array[i'] when i' = i * 128
              these turn out to be misses (mostly?) because we have other accesses to the same set
              in the iterations in between

      WT, WNA, DM 16K cache, 8-byte blocks
      A:  i = 0, 1        --- set 0
          i = 2, 3        --- set 1
          i = 4, 5        --- set 2
          ...
          i = 1022, 1023  --- set 1022/2 = 511
          16K bytes --> 4K ints, so array[0], array[4K], array[8K] ---> set 0
      B:  i = 0, 32 (array[4K]), 64 (array[8K]), ... (every 32): set 0
          i = 1, 33 (array[128+4K]), 65, ...: set 64
              array[128] is 128 ints from the beginning
              or 128 * 4 = 512 bytes from the beginning
              or 512 / 8 = 64 blocks from the beginning --> set 64
          i = 2, 34 (array[256+4K]), 66, ...: set 128
          ...
      counting misses: two cases
          set 0, 64, 128, ...:
              i = 0 / set 0: two misses (+ write miss?)
              array[128]: --- set 64
                  i = 1 (array[128]) for B
                  i = 33 (array[128+4K]) for B
                  i = 65 (array[128+8K]) for B
                  ...
                  i = 128 (array[128]) for A
                      hit for A in set 64 when i = 129
              array[256]: --- set 128
                  i = 2 for B
                  i = 34 for B --- different value!
                  i = 256 for A --- miss! even though repeated
              ...

- code optimizations --- loop unrolling, function inlining are good when?
      loop unrolling
          removes loop overhead
              loop overhead = instructions to manage the loop index
                              + instructions to check whether we're done with the loop
          advantage if the loop overhead was slowing us down
              usually, yes it is
              example where it wouldn't: if we had really slow cache accesses that were executed in parallel
          disadvantage is that it means the program has more instructions
              (more instructions in the executable file, not more times we fetch an instruction)
              so more space in the instruction cache
              if our program is big, maybe we run out
          disadvantage is that we might run extra code if the # of loop iterations is very small
              because our unrolling assumed we'd usually do groups of K (K = how much we unrolled)
      function inlining
          removes function overhead
              function overhead = instructions to manage arguments, the return value,
                                  the return address, the stack, callee/caller-saved requirements
          advantage if the function overhead was slowing us down
          disadvantage is a larger number of instructions in the instruction cache
      second-order benefit of function inlining / loop unrolling: they expose other optimizations
          example: function inlining might make it easier to eliminate common subexpressions
              now that we (or the compiler) don't need to look across multiple functions to do it
              common subexpression == parts of expressions that are the same
                  (x * z + y) and (x * z + w) --- x * z is a common subexpression
                  a common compiler optimization --- compute x*z once
          example: loop unrolling is a prerequisite for doing multiple accumulators
              or most uses of vector instructions
              (a C sketch of unrolling with multiple accumulators is below, after the cache blocking notes)

- cache blocking
      reorder memory accesses to increase the amount of reuse close in time/space
      = reorder memory accesses to increase locality (temporal + spatial)
      mechanism: divide the input matrices into "blocks", try to keep a block accessed repeatedly
          better spatial locality: adjacent elements within a block used close together in time
          better temporal locality: use parts of a block repeatedly before moving on to the next block
      useful only if you have a calculation which actually reuses cache blocks at some point
      disadvantage is that if cache performance wasn't the problem,
          then your code becomes more complicated --> slower (more loop overhead)
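      a minimal sketch of the blocking idea above for C = C + A*B with n x n row-major matrices
      (BLOCK, the function name, and the flat array layout are assumptions, not from a specific question):

          #define BLOCK 32   /* picked so a few BLOCK x BLOCK tiles fit in the cache at once */

          void matmul_blocked(double *c, const double *a, const double *b, int n) {
              for (int ii = 0; ii < n; ii += BLOCK)
                  for (int kk = 0; kk < n; kk += BLOCK)
                      for (int jj = 0; jj < n; jj += BLOCK)
                          /* finish one BLOCK x BLOCK tile of work before moving on, so the
                             tiles of a, b, and c are reused while they are still cached */
                          for (int i = ii; i < ii + BLOCK && i < n; ++i)
                              for (int k = kk; k < kk + BLOCK && k < n; ++k)
                                  for (int j = jj; j < jj + BLOCK && j < n; ++j)
                                      c[i * n + j] += a[i * n + k] * b[k * n + j];
          }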
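      the unrolling + multiple accumulators sketch mentioned in the code optimizations notes above
      (sum_unrolled and the 4x unroll factor are illustrative choices, not from the exam):

          long sum_unrolled(const long *a, long n) {
              long acc0 = 0, acc1 = 0;
              long i;
              for (i = 0; i + 3 < n; i += 4) {      /* one index update + one end-of-loop test per 4 elements */
                  acc0 += a[i]     + a[i + 1];
                  acc1 += a[i + 2] + a[i + 3];      /* separate accumulator: no dependence on acc0 */
              }
              for (; i < n; ++i)                    /* leftover elements when n is not a multiple of 4 */
                  acc0 += a[i];
              return acc0 + acc1;
          }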
- vector instructions Q8 S2018
      for (...; i += 1) {
          A[i] += A[i - 1] * B[i / 2];
          total += A[i];
      }
      can't use vector instructions for this easily because of aliasing
      generally, using vector instructions for this:
      for (...; i += 2) {
          A[i] += A[i - 1] * B[i / 2];              /* <--- */
          A[i+1] += A[(i+1) - 1] * B[(i+1) / 2];
                    ^^^^^^^^^^^^
          total += A[i];
          total += A[i+1];
      }
      end up with vector code that's equivalent to this, where temp1/temp2 represent parts of a vector register:
          temp1 = A[i];    temp1 += A[i-1] * B[i/2];
              /* <--- what if A[i] and B[(i+1)/2] are the same element?
                      the previous (non-vector) code updated A[i] here! */
          temp2 = A[i+1];  temp2 += A[(i+1)-1] * B[(i+1)/2];
          A[i] = temp1;
          A[i+1] = temp2;

- vector instructions in general
      single instruction, multiple data
      same thing on pairs/groups of matching data
      vector = little array; a vector instruction acts on several little arrays:
          the same operation on element 0, element 1, element 2, etc.
      writing vector instructions: unroll and try to have groups of (vector size) identical things
          don't want to have to worry about dependencies between them
          don't want to have conditionally different operations

- accesses to physical addresses Q4 S2018
      - remember: caches are typically based on physical addresses

- calculating page table sizes, PTE sizes, etc. versus virtual/physical address layout
      # bits used to index into a page table = size of the part of the VPN used
          single-level table: # bits used = bits in VPN
          multi-level table: bits in VPN are divided between the levels of the table
              (typically evenly, but sometimes that's not possible)
      # bits to index the page table --> 2^(# bits) page table entries
      2^(# bits to index) * PTE size = page table size
      virtual address  = [ VPN | PO ]
          bits in virtual address = bits in VPN + bits in PO
      physical address = [ PPN | PO ]
          bits in physical address = bits in PPN + bits in PO
      PTE sizes: must be big enough for AT LEAST PPN + valid bit (and probably more)

- virtual memory lookup with multi-level tables
      (a C sketch of this walk is at the end of these notes)
      take the VPN from the address --- the most significant (MS) bits (# of bits? see above)
      split the VPN into parts, depending on table sizes / # of levels
      take the MS (most significant) part --- use it to find the index into the first-level table
          the address you check is (start of table) + index * PTE size
          (just like accessing an array of PTEs)
      use the PTE from the first-level table to get a *physical* page # that tells you where the second-level table is
      take the second MS part of the orig. addr --- use it to index into the second-level table
          (start of table) + index * PTE size
          ^^^^^^^^^^^^^^^^
          given as a physical page # in the PTE, not a byte address
          might need to convert
      repeat until you've run out of levels
      the last PTE you find gives the PPN that goes in the final physical address

- TLB hit versus miss
      - TLB hit: means we can SKIP the page table lookup
          b/c we have a copy of the page table entry we'd get *at the end*, saved from earlier
      - TLB miss: means we have to do the page table lookup
          it may or may not result in a valid page
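      a hedged C sketch of the multi-level lookup described above; every constant here (2 levels,
      4096-byte pages, 9 VPN bits per level, 8-byte PTEs) and the PTE layout are assumptions for
      illustration, not numbers from a particular exam question:

          #include <stdint.h>
          #include <string.h>

          #define PAGE_SIZE      4096u   /* => 12 page offset (PO) bits */
          #define PO_BITS        12
          #define BITS_PER_LEVEL 9       /* 2^9 = 512 PTEs per table; 512 * 8 bytes = 4096-byte tables */
          #define LEVELS         2
          #define PTE_SIZE       8

          extern uint8_t physical_memory[];   /* pretend physical memory, indexed by physical byte address */

          /* made-up PTE layout: bit 0 = valid, upper bits = PPN */
          static int      pte_valid(uint64_t pte) { return pte & 1; }
          static uint64_t pte_ppn(uint64_t pte)   { return pte >> PO_BITS; }

          /* returns the physical address for vaddr, or (uint64_t)-1 on a page fault */
          uint64_t translate(uint64_t table_base_pa, uint64_t vaddr) {
              uint64_t vpn = vaddr >> PO_BITS;            /* strip the page offset */
              uint64_t po  = vaddr & (PAGE_SIZE - 1);
              for (int level = 0; level < LEVELS; ++level) {
                  /* most significant chunk of the VPN indexes the first-level table */
                  int shift = (LEVELS - 1 - level) * BITS_PER_LEVEL;
                  uint64_t index = (vpn >> shift) & ((1u << BITS_PER_LEVEL) - 1);
                  /* (start of table) + index * PTE size, just like indexing an array of PTEs */
                  uint64_t pte_addr = table_base_pa + index * PTE_SIZE;
                  uint64_t pte;
                  memcpy(&pte, &physical_memory[pte_addr], sizeof pte);
                  if (!pte_valid(pte))
                      return (uint64_t)-1;                /* invalid PTE => page fault */
                  /* the PTE holds a physical page *number*, not a byte address: convert */
                  table_base_pa = pte_ppn(pte) * PAGE_SIZE;
              }
              /* after the last level, "table_base_pa" is actually the data page's address */
              return table_base_pa | po;
          }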