- what's on the exam
  if we covered it in lecture since the last exam: probably
  if we covered it before: maybe, incidentally
  loop unrolling will wait till exam 3

- addq with %rax = -5, %rbx = 5 --> condition codes and je
  result is 0, so ZF = 1
  je: ZF ?= 1 ---> "if the result of a subtraction would have meant equal" ---> "if the result was 0"

- S2018 Q5 ~ cycles stalled from hazards

      mrmovq -> %r11     F D E M W
      addq %r11, %r8     x F D E      (same as F D D E)  <-- needs %r11 from mrmovq
      subq %r8, %r11       F D E ...  <-- needs %r8 (addq), needs %r11 (mrmovq)
      popq %r8             F D E ...  <-- needs %rsp

  stalling OF AN INSTRUCTION or OF THE PIPELINE: keeping an instruction where it is / keeping part of the pipeline where it is
  stalling OF A REGISTER BANK: keeping the register bank outputting the same thing
  bubbling OF A REGISTER BANK: making the register bank output a "bubble" = NOP

- post-quiz: counting cache misses
  values being accessed = "elements"
  Q: how many elements per block? which elements are in the same block?
     (maybe we access different parts of the block and have only one miss)
     block size / size of value = # elements per block
     assuming element 0 is at the beginning of a block: elements 0 to (# per block - 1) are in one block, (# per block) to (2 * # per block - 1) in the next, ...
  Q: which elements map to the same set?
     find # bytes per way
     adding K * block size to an address --> advance by K sets in our index bits
         (if we go off the "bottom" --> wrap around to the top again)
     adding # bytes per way --> end up at the same set
     adding X * # bytes per way --> end up at the same set
     so if elements are X * # bytes per way apart --> they map to the same set
     we can figure out cache misses by thinking about exactly one set at a time

  16 bytes/block --> 4 elements/block
     a[0], a[1], a[2], a[3] are in one block, a[4], a[5], a[6], a[7] are in the next block, etc.
  direct-mapped (= 1-way) 2048-byte cache --> 2048 bytes per way
     2048 bytes / (4 bytes per int) = 512 ints per way
     values 512 ints apart ---> map to the same set
     {a[0], a[1], a[2], a[3]} and {a[512], a[513], a[514], a[515]} map to the same set
     a[4] and a[512 + 4] map to the same set, ...

      int a[1024];
      ...
      int sum = 0;
      for (int j = 0; j < 32; j++)
          for (int i = 0; i < 32; i++)
              sum += a[i * 32 + j];

  the set containing a[0 to 3] and a[512 to 515]:
      a[0]   (i = 0,  j = 0): a miss, a[0 to 3] never accessed before
      a[512] (i = 16, j = 0): a miss, a[512 to 515] never accessed before
      a[1]   (i = 0,  j = 1): could benefit from a[0] being accessed before? but a miss due to conflict with a[512] accessed in between
      a[513] (i = 16, j = 1): could benefit from a[512] being accessed, but a miss due to conflict with a[1] accessed in between
      ditto for a[2], then a[514], then a[3], then a[515], ...
  answer: 1024 misses (everything misses)

  suppose two ways instead --- is this better? 1024 bytes per way now
     a[0] and a[256] and a[512] and a[768] map to the same set
      a[0]   (i = 0,  j = 0)
      a[256] (i = 8,  j = 0)
      a[512] (i = 16, j = 0)
      a[768] (i = 24, j = 0)
      a[1]   (i = 0,  j = 1): probably a MISS because we accessed a[256], a[512], a[768] since accessing a[0] (we can only store 2 blocks per set, and we have three extra blocks to fit in the cache)
      a[257] (i = 8,  j = 1): ...

  fully-associative --- usually better, most flexibility
     but only guaranteed better if the replacement policy chooses things that make sense for the access pattern
     example bad case: scanning a giant array, so we're usually accessing the thing accessed furthest back in time, with an LRU replacement policy ---> same # of misses or more than some other designs

      int a[1024];
      ...
      int sum = 0;
      for (int i = 0; i < 1024; i++)
          sum += a[i];

  a[0 to 3] is one block; accessing a[1] IMMEDIATELY after a[0] MUST BE A HIT, accessing a[2] IMMEDIATELY after a[1] MUST BE A HIT
  1 miss, 3 hits per block
  1024 / 4 = 256 blocks --> 1 miss per block --> 256 misses
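  as a sanity check on hand counts like these, here is a minimal C sketch (my own, not from lecture; the constants and names like NUM_SETS are assumptions) that simulates the 2048-byte direct-mapped cache on the column-major loop and counts misses, assuming a[] starts at a block-aligned address:

      #include <stdio.h>

      #define BLOCK_SIZE 16                       /* bytes per block */
      #define CACHE_SIZE 2048                     /* total bytes, direct-mapped = 1 way */
      #define NUM_SETS (CACHE_SIZE / BLOCK_SIZE)  /* 128 sets */

      int main(void) {
          long tags[NUM_SETS];
          int valid[NUM_SETS] = {0};
          long misses = 0;
          for (int j = 0; j < 32; j++) {
              for (int i = 0; i < 32; i++) {
                  /* byte address of a[i * 32 + j], taking a[] to start at address 0 */
                  long addr = (long)(i * 32 + j) * sizeof(int);
                  long block = addr / BLOCK_SIZE;
                  long set = block % NUM_SETS;    /* index bits */
                  long tag = block / NUM_SETS;    /* tag bits */
                  if (!valid[set] || tags[set] != tag) {  /* not cached --> miss */
                      misses++;
                      valid[set] = 1;
                      tags[set] = tag;
                  }
              }
          }
          printf("misses: %ld\n", misses);  /* prints 1024: everything misses */
          return 0;
      }

  changing the accessed element from a[i * 32 + j] to a[32 * j + i] (which scans the array sequentially, like the loop above) drops the count to 256, one miss per block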
- write-through / write-back
  write-through: always write TO MEMORY immediately
     pro: don't need to remember whether you need to write
     con: maybe you could have written less often
  write-back: if the value is cached, update the cached value; ONLY LATER, when the value is evicted, update memory
     pro: if we write a value multiple times, we send the value to memory once
     con: need to remember to write the value later on (even though the write may have happened a while ago)
          so reading values may trigger writes (a read miss can evict a dirty block)
  dirty bit = "is the data different than memory?" (do I have work to do on eviction to "fix" memory?)

- post-quiz: benefits/disadvantages of optimizations
  function inlining: take the function body, insert it where the function would be called
     --> no function call (no setting arguments, call instruction, return instruction, retrieving return value)
     NB: the eliminated overhead also includes saving values on the stack, etc.
     now we have multiple copies of the function body (one for the function called "normally" + one for each place it's inlined)
     when does the compiler inline? it wants to not make your program gigantic, so usually only if the function is quite small --> code gets only slightly larger
     (see the before/after sketch at the end of these notes)
  code size: amount of space the code takes up (on disk and/or in the instruction cache and/or in memory)
  number of instructions RUN: the number of times we fetch an instruction
     # of instructions run in a loop is based on # of loop iterations, not the size of the loop
  spatial locality and function inlining:
     w/o inlining: jump to a completely different part of memory where the function is
     w/ inlining: it's just the next instruction in memory

- cache blocking = the practice of processing data in blocks, like our matrix example, to make better use of the cache for matrix-like problems

- loop unrolling versus conditional branching
  conditional branching = something like a jXX instruction (that is not jmp)

      int i;
      for (i = 0; i + NUMBER_OF_TIMES_UNROLLED < N; i += NUMBER_OF_TIMES_UNROLLED) {
          MAIN LOOP
      }
      for (; i < N; ++i) {
          LEFTOVER
      }

  (i is declared outside the first loop so the LEFTOVER loop continues from where the MAIN LOOP stopped, rather than restarting at 0)
  small N --- end up spending extra time handling the "leftover"; that's extra overhead
  large N --- yes, we still have the "leftover", but we saved a lot of conditional branches b/c MAIN LOOP has 1 branch per NUMBER_OF_TIMES_UNROLLED iterations
  (see the concrete sketch at the end of these notes)

- when to stall/bubble --- general rule:
  [data dependencies] look for dependencies:
     data computed in stage X, needed in stage Y --> try to line up the stages so stage Y is after stage X
     e.g. with forwarding:

         popq %r8         F D E M W      value computed in MEMORY
         addq %r8, %r9      F D D E M W  value needed in EXECUTE (one stall, so E comes after popq's M)

     if no forwarding: values are effectively "computed" in WRITEBACK and "needed" in DECODE

         popq %r8         F D E M W*
         addq %r8, %r9      x x x F* D E M W
                            (same as F D D D D E M W)

  [control dependencies] look for dependencies:
     address of the next instruction to run is computed in stage X but needed in FETCH --> line up so the FETCH of the next instruction is after stage X

         ret              F D E M W      addr. of next instruction computed in MEMORY
         next instr.        x x x F
                            ^^^^^--- three cycles of stalling

         jle              F D E M W      computed in EXECUTE
         next instr.        x x F D E M W
                            ^^^--- two wasted cycles, BUT can be avoided with prediction (but only sometimes)

- S2018 Q19 ~~ cache miss counting
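  to make the inlining bullet above concrete, here is a minimal before/after sketch (the function names are mine, not from lecture):

      /* without inlining: every iteration pays for setting up the
         argument, the call instruction, the return instruction,
         and retrieving the return value */
      static int add1(int x) { return x + 1; }

      int sum_calls(int n) {
          int s = 0;
          for (int i = 0; i < n; i++)
              s += add1(i);
          return s;
      }

      /* with inlining: the body "x + 1" is inserted at the call site,
         so there is no call/return, and the next instruction executed
         is right here in memory (better spatial locality) */
      int sum_inlined(int n) {
          int s = 0;
          for (int i = 0; i < n; i++)
              s += i + 1;
          return s;
      }

  the compiler does this transformation itself for small functions like add1; the tradeoff is one extra copy of the body per inlined call site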
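  and here is a concrete version of the unrolling pattern (again my own sketch; the factor 4 is an arbitrary choice of NUMBER_OF_TIMES_UNROLLED, and I use <= so an exact multiple of 4 stays in the main loop --- the < version above also works, it just hands one extra chunk to the leftover loop):

      #include <stddef.h>

      /* sum n ints, unrolled by 4: the MAIN LOOP runs 1 conditional
         branch per 4 additions; the LEFTOVER loop handles the last
         n % 4 elements one at a time */
      long sum_unrolled(const int *a, size_t n) {
          long sum = 0;
          size_t i;
          for (i = 0; i + 4 <= n; i += 4) {  /* MAIN LOOP */
              sum += a[i];
              sum += a[i + 1];
              sum += a[i + 2];
              sum += a[i + 3];
          }
          for (; i < n; i++)                 /* LEFTOVER */
              sum += a[i];
          return sum;
      }

  for small n the leftover loop is pure overhead; for large n the main loop dominates and we pay far fewer branches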