- stalling/forwarding -- only an issue when two instructions depend on each other

      time -------------->
                                      v------------- write value of %rbx here
      addq %rax, %rbx     F D E M W
                              ^---------------------- value of %rbx computed by here
      subq %rcx, %rbx       F D E M W

      if we do nothing:
      subq %rcx, %rbx       F D E M W
                              ^---------------------- read value of %rbx here

      if we do forwarding:
      subq %rcx, %rbx       F D E M W
                                  ^------------------ instead of using the value from the
                                                      register file, take the value from
                                                      addq's wires directly

      if we do stalling:
      subq %rcx, %rbx       F D D D E M W
                              ^---^------------------ detect that %rbx isn't ready and send
                                                      subq back into the fetch-to-decode
                                                      pipeline registers

  how to resolve this:
      most efficient way (in terms of cycles): forwarding
          we know that the new value of %rbx was already computed
          take that computed value from [where it is in our processor]
              [at the time we read it, that's in the memory stage of addq]
          and have it replace the value we would read from the register file
      less efficient way (in terms of cycles): stalling (alone)
          delay the subq instruction until the value would be ready from the register file

  let's say we had a pipeline with three execute stages,
  and assume addq's result is not available until the last execute stage:

      addq %rax, %rbx     F D E1 E2 E3 M W
                                    ^---------------- value of %rbx is computed here
      subq %rcx, %rbx       F D E1 E2 E3 M W
                              ^---------------------- value of %rbx is read here (if we do nothing)
                                    ^---------------- value is needed here

      the value is computed in the same cycle where we need it, so yes, we could
      technically forward it within the same cycle, but this would make the cycle
      time WAY longer --- not something we're ever going to do

      so stall, so that the value-needed stage comes after the value is computed:

                                           v--------- replace the value we would read from the
                                                      register file with the value that was just
                                                      computed, taken from the execute3 wires
      subq %rcx, %rbx       F D D E1 E2 E3 M W
                              ^---------------------- detect that the value we need isn't
                                                      computed yet and send subq back into the
                                                      fetch-to-decode pipeline registers

  how to resolve this:
      most efficient way (in terms of cycles): stalling + forwarding
          stall to make sure we wait until after the value is ready before we use it
          then take it from where the computed value is to where we want to use it
              (replacing some old version we would read from the register file, etc.)
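  to make the "detect and decide" step concrete, here is a minimal C sketch of the
  per-cycle choice the hazard-detection logic makes for a 5-stage pipeline; this is
  not the HCLRS code from lecture, and the struct fields, register numbering, and
  the load-needs-one-stall rule are assumptions chosen for illustration:

      #include <stdbool.h>
      #include <stdio.h>

      #define REG_NONE (-1)   /* marker: no register read/written */

      /* which registers the in-flight instructions touch this cycle
         (field names invented for this sketch) */
      typedef struct {
          int decode_src1, decode_src2; /* read by the instruction in decode */
          int execute_dst;              /* written by the instruction in execute */
          bool execute_is_load;         /* a load's value is ready a stage later */
          int memory_dst;               /* written by the instruction in memory */
      } pipeline_state;

      typedef enum { USE_REG_FILE, FORWARD, STALL } action;

      /* decide how the decode-stage instruction should get one source value */
      action resolve(const pipeline_state *p, int src) {
          if (src == REG_NONE)
              return USE_REG_FILE;
          if (src == p->execute_dst)
              /* an addq-style result exists by the end of execute, so we can
                 forward it; a load's value isn't ready until the end of memory,
                 so the dependent instruction must stall one cycle first */
              return p->execute_is_load ? STALL : FORWARD;
          if (src == p->memory_dst)
              return FORWARD;           /* take it from the memory-stage wires */
          return USE_REG_FILE;          /* register file value is up to date */
      }

      int main(void) {
          /* addq %rax, %rbx in execute while subq %rcx, %rbx is in decode;
             numbering registers rax=0, rcx=2, rbx=3 for this sketch */
          pipeline_state p = { .decode_src1 = 2, .decode_src2 = 3,
                               .execute_dst = 3, .execute_is_load = false,
                               .memory_dst = REG_NONE };
          printf("%d\n", resolve(&p, p.decode_src2)); /* prints 1 (FORWARD) */
          return 0;
      }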
- stalling an instruction: keeping the instruction in the same stage
  to implement this with our pipeline registers:
      have the register before the stage output the instruction again next cycle
          in HCLRS: stall_X signal
      have the register after the stage output a nop next cycle
      (otherwise we'll duplicate the instruction)
          in HCLRS: bubble_Y signal
- bubble_X: inserts a nop ("pipeline bubble") in the pipeline
      if we stall, because the stage after the stall needs something to do
      if we mispredict, because we need to replace our wrong guesses with something
- write-through/write-back
- write-allocate/write-no-allocate
- when we have a cache access that is a write:
  (1) check if the cache has the value
      if it does not, there are two options:
      (a) write-allocate: bring the value into the cache, then proceed as if it
          were already there
      (b) write-no-allocate: just send the value to the next level and forget about it
  (2) now (if we keep going), the cache has the value (perhaps because we loaded it)
  (3) update the cache's version of the value
  (4) then two options:
      (a) write-back: mark the cache's version of the value as dirty
          (defer the actual write until later)
          later on, if we ever remove something marked as dirty from the cache,
          "clean" the value by writing it to the next level
      (b) write-through: send the value to the next level
          (and don't need to track whether it's out-of-sync)
- mapping cache sets to memory
  - X = # bytes per way of the cache = # sets * block size
      = 2^(index bits) * 2^(offset bits) = 2^(index bits + offset bits)
  - values which are X bytes apart go to the same set:

        [ tag ][   index    ][  offset   ]
           1    000000000000   00000000000   = X

    so adding X to a value changes the tag, but not the index or offset
    so if array[0] through array[3] are a cache block and map to set 0:
        if my cache is a 6-way 12KB cache
            --> # bytes per way of the cache is 12KB/6 = 2KB
        whatever is 2KB after array[0] maps to set 0, too
        can also find out: array[4] through array[7] map to set 1, etc.
    generally: want to group values together by what set they map to
    and analyze each set separately (see the sketch below)
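  as a sanity check on this arithmetic, here is a small C sketch that splits an
  address into tag/index/offset and confirms that adding X changes only the tag;
  the 12KB 6-way example above doesn't fix a block size, so 16B blocks (and
  therefore 128 sets) are assumed here for illustration:

      #include <stdio.h>
      #include <stdint.h>

      typedef struct { uint64_t tag, index, offset; } addr_parts;

      /* split an address for a cache with 2^index_bits sets
         and 2^offset_bits bytes per block */
      addr_parts split_addr(uint64_t addr, int index_bits, int offset_bits) {
          addr_parts p;
          p.offset = addr & (((uint64_t)1 << offset_bits) - 1);
          p.index  = (addr >> offset_bits) & (((uint64_t)1 << index_bits) - 1);
          p.tag    = addr >> (offset_bits + index_bits);
          return p;
      }

      int main(void) {
          /* 6-way 12KB cache: bytes per way X = 12KB/6 = 2KB = 2^11, so
             offset bits + index bits = 11; with 16B blocks, offset = 4 bits
             and index = 7 bits (128 sets) */
          int offset_bits = 4, index_bits = 7;
          uint64_t X = (uint64_t)1 << (offset_bits + index_bits); /* 2048 */
          uint64_t a = 0x12340;                        /* arbitrary address */
          addr_parts p1 = split_addr(a,     index_bits, offset_bits);
          addr_parts p2 = split_addr(a + X, index_bits, offset_bits);
          printf("index: %llu vs %llu (same set)\n",
                 (unsigned long long)p1.index, (unsigned long long)p2.index);
          printf("tag:   %llu vs %llu (different)\n",
                 (unsigned long long)p1.tag, (unsigned long long)p2.tag);
          return 0;
      }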
- cache misses with the struct: 16B blocks, 2KB direct-mapped cache
  (= 2KB 1-way cache, so 2KB bytes per way)

      typedef struct {   /* layout: [a_value][b_value][boring_values[0]][boring_values[1]] ... */
          int a_value, b_value;
          int boring_values[126];
      } item;
      item items[8]; // 4KB array
      int a_sum = 0, b_sum = 0;
      for (int i = 0; i < 8; ++i)
          a_sum += items[i].a_value;
      for (int i = 0; i < 8; ++i)
          b_sum += items[i].b_value;

  sizeof(item) = (2 + 126) * sizeof(int) = 512B
  assuming items starts at the beginning of a cache block:
      a_value, b_value, boring_values[0], and boring_values[1] are all in the same block
      and because sizeof(item) is a multiple of the size of a cache block,
      this is true for every item in the array
  values which are 2KB apart map to the same set:
      items[0].a_value, items[0].b_value and items[4].a_value, items[4].b_value map to the same set
      items[1].a_value, items[1].b_value and items[5].a_value, items[5].b_value map to the same set
      ...

  to analyze this, we can look at one set, and know that every other set will have
  the same number of misses (because the pattern is exactly the same for each set)

  accesses to set 0:       set 0 contains afterwards:                                       result
      items[0].a_value     items[0].{a_value, b_value, boring_values[0], boring_values[1]}  miss
      items[4].a_value     items[4].{a_value, b_value, boring_values[0], boring_values[1]}  miss
      items[0].b_value     items[0].{a_value, b_value, boring_values[0], boring_values[1]}  miss
      items[4].b_value     items[4].{a_value, b_value, boring_values[0], boring_values[1]}  miss

  what if an LRU 4-way 2KB cache? --> bytes per way is 512
      so items[0].a_value and items[1].a_value and items[2].a_value etc. are all in the same set

  accesses to set 0:       set 0 contains afterwards:                                       result
      items[0].a_value     items[0].a_value                                                 miss
      items[1].a_value     items[0].a_value + items[1].a_value                              miss
      items[2].a_value     items[0].a_value + items[1].a_value + items[2].a_value           miss
      items[3].a_value     items[0..3].a_value                                              miss
      items[4].a_value     items[4].a_value + items[1..3].a_value (items[0] evicted)        miss
      items[5].a_value     ...                                                              miss
      items[6].a_value     ...                                                              miss
      items[7].a_value     ...                                                              miss
      items[0].b_value     ...                                                              miss
                           (b/c items[4] replaced items[0] due to LRU policy)

- S2018 Q19: how many misses?

      unsigned char A[1024 * 8];
      ...
      for (int i = 0; i < 2; ++i) {
          for (int j = 0; j < 8; ++j) {
              A[j * 1024] += j + A[j * 256 + 4];
          }
      }

  4-way 4KB set-associative cache with 4B blocks
  bytes per way = 4KB/4 ways = 1KB --> values 1KB apart map to the same set
  1KB = 1024 array elements (because sizeof(unsigned char) = 1)

  accesses (each happens twice, once per outer-loop iteration):
      A[0*1024]                       <-- set 0
      A[0*256+4] = A[4]               <-- set 1
      A[1*1024]                       <-- set 0
      A[1*256+4]                      <-- set 1+(256/4) = set 65
      A[2*1024]                       <-- set 0
      A[2*256+4]                      <-- set 1+(512/4) = set 129
      A[3*1024]                       <-- set 0
      A[3*256+4]                      <-- set 1+(768/4) = set 193
      A[4*1024]                       <-- set 0
      A[4*256+4] = A[1024+4]          <-- set 1 (1024 bytes past A[4])
      A[5*1024]                       <-- set 0
      A[5*256+4]                      <-- set 1+(256/4) = set 65
      A[6*1024]                       <-- set 0
      A[6*256+4] = A[4*256+2*256+4]   <-- set 1+(2*256/4) = set 129
      A[7*1024]                       <-- set 0
      A[7*256+4] = A[1792+4]          <-- same set as A[1792+4-1024] = A[768+4],
                                          so set 1+(768/4) = set 193

  set 0: we have 8 values we care about
      with an LRU policy, can only keep the 4 most recent
      but the 4 most recent are never what we're accessing next
      --> all misses --> 16 misses (8 accesses x 2 outer iterations)
  other sets (1, 65, 129, 193): we have 2 values each
      e.g. A[0*256+4] and A[4*256+4] are 1024 bytes apart
      know that we don't have to worry about evictions with 4 ways
      compulsory misses for these, but no other misses
      --> 8 misses for these values

- loop unrolling

      loop:
          addq (%r8, %rdi, 8), %rax   <--- "BODY" (what we actually want to compute)
                                           RAX <- RAX + memory[R8 + RDI * 8]
          incq %rdi                   <--- loop overhead (RDI <- RDI + 1)
          cmpq %rdi, %rsi             <--- loop overhead
          jne loop                    <--- loop overhead

  unrolled to do twice as much per iteration, but with the same loop overhead
  per iteration --> overall half the overhead:

      loop:
          addq (%r8, %rdi, 8), %rax        RAX <- RAX + memory[R8 + RDI * 8]
          addq 8(%r8, %rdi, 8), %rax       RAX <- RAX + memory[8 + R8 + RDI * 8]
          addq $2, %rdi
          cmpq %rdi, %rsi
          jne loop

  but: need to do a little extra work if the number of loop iterations is not a
  multiple of the number of times we duplicated the body (see the C sketch below)
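  the same transformation in C, including the "little extra work" as a cleanup
  loop; a minimal sketch (the function name and types are ours, not from the
  original example):

      /* 2x-unrolled version of "sum += a[i] for each i", with a cleanup
         loop for when n is not a multiple of 2 */
      long sum_unrolled(const long *a, long n) {
          long sum = 0;
          long i;
          for (i = 0; i + 1 < n; i += 2) {
              sum += a[i];       /* two copies of the body ...           */
              sum += a[i + 1];   /* ... per one copy of the loop overhead */
          }
          for (; i < n; ++i)     /* cleanup: leftover element if n is odd */
              sum += a[i];
          return sum;
      }

  note that both additions still feed the single accumulator sum, so unrolling
  reduces overhead but does not shorten the chain of dependent additions; that
  chain is what the latency bound below is about.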
- throughput/latency bound
  throughput bound: when we have a calculation, how fast can it be based on what
  the hardware is capable of doing, ignoring dependencies and how we have to move
  data around for them
      e.g. if I need to do 10000 loads and 5000 multiplies to process 5000 values,
      how fast can my hardware do that,
          if my hardware can start 1 load per cycle and 2 multiplies per cycle,
          and each load (optimally) takes 3 cycles and each multiply takes 5 cycles,
          and they are pipelined
          [meaning that we don't need to worry about the ones we started before]
      for the throughput bound: don't care how long they take, just how many we can do
          can do multiplies for 2 values/cycle:
              start 2 every cycle --> finish 2 every cycle (because pipelined)
          can do loads for 0.5 values/cycle:
              start 1 every cycle and need 2 per value
              --> start all the loads for a value every other cycle
                  (first load in cycle 1, second load in cycle 2)
              --> finish all the loads for a value every other cycle
      these are throughput bounds --- ignoring dependencies

  latency bound: given what I have to do one-after-the-other, how fast can it be,
  given how my hardware does that calculation
      e.g. if for each value, I need to use a loaded value to determine what value
      to load next:
          one load, THEN another, THEN another
          can start one of these loads only after the previous one finishes
          if loads take 5 cycles, that means we start one of these loads every
          5 cycles --> performance bound of 0.2 values/cycle
      "latency bound" --> based on the worst dependencies
      more generally: what is the longest chain of dependent operations
      (the "critical path")
          add up the time needed to complete the operations going through that chain
          not helped by being able to do two things in parallel along the chain

  based on these throughput bounds, the best performance I could get would be
  0.5 values/cycle (the tightest bound wins)

- register renaming
  we want a different name for each version of a register in the assembly
  do this by changing the "physical" name every time we change a register
  e.g.
      addq %r8, %r9
      addq %r10, %r9
      addq %r11, %r9

  going to have an initial name for the first version of %r8, %r9, %r10, %r11:
      let's say %x8, %x9, %x10, %x11
  and going to need new names to use: let's say we can use %x12, %x13, etc.

                           vvvv---------- new name for the result
      addq %x8, %x9   -->  %x12
                ^^^---------------------- read from the old version
      addq %x10, %x12  -->  %x13
      addq %x11, %x13  -->  %x14

  (see the renaming sketch below)

- instruction queue and dispatch
  instruction queue: contains renamed instructions
  alongside the instruction queue, we track which physical registers have values
  available
      (initially they don't (when renaming allocates the register),
      but after the corresponding instruction executes, they do)
  every cycle, we look for instructions in the instruction queue where all their
  inputs are ready, and we take as many of those ready instructions as we can
  execute and actually start executing them
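  a minimal C sketch of the rename table for the example above; the instruction
  encoding and table layout are invented for illustration, and it ignores things
  real hardware must handle, like recycling physical registers:

      #include <stdio.h>

      int main(void) {
          /* rename table: architectural register number -> physical name;
             initially %r8 -> %x8, %r9 -> %x9, etc. */
          int map[16];
          for (int i = 0; i < 16; ++i) map[i] = i;
          /* only %r8..%r11 appear in this example, so %x12 and up are free */
          int next_free = 12;

          /* the example: addq %r8,%r9 ; addq %r10,%r9 ; addq %r11,%r9
             encoded as {source, source-and-destination} register pairs */
          int instrs[3][2] = { {8, 9}, {10, 9}, {11, 9} };

          for (int i = 0; i < 3; ++i) {
              int src = map[instrs[i][0]];  /* read sources via the current map */
              int old = map[instrs[i][1]];  /* old version of the destination   */
              int dst = next_free++;        /* fresh physical name for result   */
              map[instrs[i][1]] = dst;      /* later reads of %r9 see new name  */
              printf("addq %%x%d, %%x%d --> %%x%d\n", src, old, dst);
          }
          /* prints:
             addq %x8, %x9 --> %x12
             addq %x10, %x12 --> %x13
             addq %x11, %x13 --> %x14 */
          return 0;
      }

----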