- latency and throughput ~ calculating for SEQ v PIPE
    - latency: time from start to finish for some operation
    - throughput: rate at which operations complete
        usually operations/time; can also be written as time/operation completed (still throughput)

        time ----> (S = start operation, E = end operation)
        S 1 2 3 E
          S 1 2 3 E
            S 1 2 3 E
            ...
        --> throughput of 1 operation/time unit = 1 time unit/operation completed
        --> latency of 5 time units/operation
    - single-cycle processor: we do one instruction at a time, so
        time per instruction = time between when instructions complete
    - time per cycle = critical path: longest sequence of operations that
      needs to complete in a clock cycle
        - single-cycle: operations from getting the new PC (beginning of cycle)
          all the way to writing all the instruction's values
        - pipelined processor: operations for a pipeline stage, typically
          "between two pipeline registers" (except: the ends)
        - set cycle time = critical path length
    - subtlety with critical paths: not everything is on the critical path
        - single-cycle: some operations can be done in parallel
          for example: adding X to the PC while reading registers
        - pipelined: only the longest stage matters (and only the longest path through a stage)
    - calculating time: sum of things on the critical path
        - but remember that pipeline registers take time (if there are pipeline registers)
          "register delay" or "pipeline register delay"
- RET and Load/Use Hazards
    - in the five-stage processor we used:

        RET            F D E M W
                        ^^^^^^\
        (next instr)           F D E M W
             can't fetch yet --^
             result: 3 cycles of stalling

    - Load/Use hazard:

        MRMOVQ ..., %R9     F D E M W
             value not ready ^^^^^^\
        ADDQ %R9, %R10        x F D E M W
                              ^
                              \--- 1 cycle of waiting, even with forwarding

        happens anytime we read a value from memory and want to use it in execute;
        forwarding is probably from end of memory to end of decode
    - no-hazard examples:

        ADDQ %R8, %R9      F D E M W
                                \
        SUBQ %R9, %R10       F D E M W

        forward from end of execute to end of decode

        ADDQ %R8, %R9
        NOP
        SUBQ %R9, %R10

        forward from end of memory to end of decode
        ...
- S2018 Q5 ~~ hazards in a five-stage pipeline

    mrmovq ..., %r11    F D E M W
    addq %r11, %r8        x F D E M W    needs %r11 from mrmovq
    subq %r8, %r11            F D E ..   needs %r8 (addq), %r11 (mrmovq)
    popq %r8                    F D ..   needs %rsp

- S2018 Q13 ~~ addresses mapping to the same set
    4-way, 16KB cache, 16B blocks
    offset = location within block; 16 locations --> 4 bits (2^4 = 16)
    [set] index = number of the set; 256 sets --> 8 bits (2^8 = 256)
        16KB cache / 16B blocks --> 1K blocks
        4 blocks/set
        1K blocks / (4 blocks/set) = (1K/4) sets = 256 sets
    tag = everything else (in this case, 16-bit addresses: 16 - 8 - 4 = 4 bits)

    same set as 0x1234?
        tag / index / offset
        0x1   0x23   0x4
              ^^^^
              \---- determines which set to use
        same set:     0x1235, 0xF23F (index is 0x23)
        not same set: 0x1A34 (index 0xA3), 0x1004 (index 0x00)
- direct-mapped v set-associative
    direct-mapped ==== 1-way set associative
    N-way set associative means N places to put each block,
        i.e. can store N different blocks with the same index bits
    fully-associative ==== (# blocks in cache)-way set associative
- write strategies (write-alloc/no-alloc, etc.)
    write-allocate: when the program writes, bring the block into the cache
        pro: hopefully the program will read that value / something near it
        con: writes now become reads (for the rest of the block) + writes
    write-no-allocate: when the program writes, modify the value if it's already cached,
        otherwise send it to memory without any extra work
    write-back: when the program writes, don't send it to memory until needed (eviction)
        typically, but not always, paired with write-allocate
    write-through: when the program writes, always send it to memory immediately
- high-level strategy for cache miss counting:
    figure out how big blocks are in terms of what's being accessed
        what values map to the same block?
    figure out what values map to the same set
        rule: find the # of bytes in a way; values separated by K * (# bytes in a way)
        map to the same set
    can analyze each set independently
- post-quiz cache miss questions
    on the last post-quiz:

    int a[1024];
    ...
    int sum = 0;
    for (int j = 0; j < 32; j++)
        for (int i = 0; i < 32; i++)
            sum += a[i * 32 + j];

    // i = 0,  j = 0 --> a[0]
    // i = 16, j = 0 --> a[512]
    // i = 0,  j = 1 --> a[1]
    //     ALWAYS A MISS, because we accessed a[512] between a[0] and a[1]
    // i = 16, j = 1 --> a[513]
    //     ALWAYS A MISS, because we accessed a[1] between a[512] and a[513]
    // similar for j = 2, j = 3

    16-byte blocks --> 4 ints/block
        a[0], a[1], a[2], a[3] are in the same block, assuming blocks start
        at a[0] (which the question says they do)
        a[512], a[513], a[514], a[515] are in the same block
    DM 2048-byte cache:
        a[0] and a[512] are 512 * 4 = 2048 bytes apart,
        so a[0] and a[512] (and a[1024] and a[1024 + 512]) map to the same set
    suppose 2-way set associative (same total size):
        a[0], a[256], a[512], a[768] all map to the same set
        between accessing a[0] and a[1], we'd access a[256] and a[512],
        probably evicting a[0] (assuming LRU-like replacement)

    other quiz question:

    int a[1024];
    ...
    int sum = 0;
    for (int i = 0; i < 1024; i++)
        sum += a[i];

    access a[0], a[1], a[2], ...
        nothing in between accesses to the same block to evict them
    same as if:

    for (int i = 0; i < 32; i++)
        for (int j = 0; j < 32; j++)
            sum += a[i * 32 + j];

- S2018 Q7/Q8 ~~ cache miss counting
- F2017 Q9
- cache blocking [slides]
- function inlining
    - copy the contents of a function in place of a function call
    - compiler will do it if
        - the function is "small enough", and
        - the compiler doesn't need to look across .o files
- locality: spatial/temporal
    spatial locality: nearby values
        ~ caches take advantage of this with cache blocks being larger than one value
    temporal locality: repeated access to the same value
        ~ caches take advantage of this by adding things to the cache when they are
          accessed and trying to keep them if they've been recently accessed

    for (int i = 0; i < N; ++i)
        for (int j = 0; j < 100; ++j)
            use(A[i]);

    has more temporal locality than

    for (int i = 0; i < N; ++i)
        for (int j = 0; j < 10; ++j)
            use(A[i]);