- F2021 final, Q3b ~~ assembly and memory changing
        %rax 0x1000
        %rcx 0x12345678
         v--------------------------- "quad" -- 8 bytes
    rmmovq %rcx, 0x10(%rax)
                 ^^^^^^^^^^--- destination
                       displacement(base) syntax means "access memory at displacement + base"
                0x10+%rax = 0x10 + 0x1000 = 0x1010 is the mmeory address we write to
        value we write is 0x12345678
                            which is an 8-byte value because it's in %rcx
                            so %rcx must store some leading 0s
        the amount we write 8 bytes
        so we'll modify addresses
            0x1010 through 0x1017 inclusive
            little-endian --> least significant byte goes the lowest address
            0x1010: least significant byte of 0x12345678 --> 0x78 (1's place)
            0x1011: 0x56 <---
            0x1012: 0x34
            0x1013: 0x12
            0x1014-7: 0x00 <---

    What are the values at 0x100f, 0x1011, 0x1017 after this?
        0x100f: unchanged, 0x1011: 0x56, 0x1017: 0x00

- F2021 final, Q11b ~~ pipelining
    pipelined porcessor with 7 stages:
        F, D, E1, E2, M1, M2, W

    running
                                    0  1  2  3  4  5  6  7  8  9  10
        subq %r9, %r11              F  D  E1 E2 M1 M2 W
        andq %r12, %r13                F  D  E1 E2 M1 M2 W
        rmmovq %r12, 0(%r11)              F  D  E1 E2 M1 M2 W
        rmmovq %r13, 0(%r15)                 F  D  E1 E2 M1 M2 W
        ret                                     F  D  E1 E2 M1 M2 W

    how many cycles of stalling are needed?
        
        dependencies:
            [these might require us to wait, depending on how the pipeline works]
            %r11 written by subq and read by rmmovq (1st)
                the value of %r11 is available at end of cycle 3 = end of rmmovq (1st)'s decode 
                    so we'll forward it to decode and solve the problem
            %r13 written by andq and read by rmmovq (2nd)
                the value of %r13 is available at end of cycle 3 = end of rmmovq (2nd)'s decode 
                    so we'll forward it to decode and solve the problem

        0 from these dependencies

        BUT:
            in the assembly, %r9 (used by subq) is computed by an earlier addq:
                addq %r8, %r9
                jle label
                subq %r9, %r11
            
            we'll have fetched a bunch of mispredicted instructions during the jle,
            so the addq will be out of the pipeline by the time the subq reads %r9
                don't actually hvae a hazard there

        BUT:
            after ret we can't figure out what instruction to run next
                and isn't that part of running the ret

            on the exam: we accepted two interprations:
                1. the stall for ret was part of the next instructoin (0 cycles of stlaling)
                2. the stall for ret was part of the ret, and that stall took 5 cycles:
                    
                           1  2  3  4  5
                        F  D  E1 E2 M1 M2 W     ret
                                          F     instruction returned to


- loading vectors and the caches 
    vector-load that's contiguous (the one we've seen in the lab, etc.)
        are just a big cache read
            e.g. reading 256 bits instead of 64-bits
                but work the same way as a 256-bit read
        if the value we are reading is split across two cache blocks,
           then the processor needs to somehow turn thi sinto two cache accesses
           BUT this also happens with non-vector reads
                (e.g. reading an 8-byte integer that starts 4 bytes before the end of a cache block)

- F2021 final Q1b ~~ caching
    system with an L1 and L2 cache
        both had the same block size
    L1 cache: write-back, write-allocate
        16KB, 4-way, 64B block -->  6 index bits, 6 offset bits
    L2 cache: write-back, write-allocate
        512, 8-way, 64B block -->  10 index bits, 6 offset bits

            tag                 index   offset
         ------------------------------|------|
        |                    |         |      |  L1
        |--------------------------------------
        |               |              |      |  L2
        |---------------|--------------|------|

            if two addresses have same index in L1 --> may or may in L2
            if two addresses have same tag in L2 --> may or may in L1
            if two addresses have same tag in L1 --> same in L2

    when a value is written to the L1, which of these could happen?

    1. another value with the same index bits (but different tag bits) will be written to the L2
        YES --> writing a value to the L1 will allocate a block for that value
            alllocating a block for that value might require evicting a previously stored block
            that previously sstored block might be dirty (b/c write-back)
                and if so, we'd need to update the next level of cache
    2. another value with the same index bits (but different tag bits) will be written to main memory
        YES --> when we write to the L2 cache, the same thing in (1) could happen in the l2
            it will be in the same set in the L2
            it's possible that something which shares index bits in the L1
                also shares all the index bits in the L2
                (if the first tag bits in the L1 are the same)
            
    3. that value (the one written to the L1) is also written to the L2
        NO  --> write-back policy --> we're going to wait as long as possible
    4. that value is also written to main memory
        NO ---> write-back policy
    5. another value with the same tag bits but different index bits written to the L2
        NO --> we'll only evict values with the same index bits
            because each set of the cache works indepednently
    6. another value with the same tag bits but different index bits written to the main memory
        NO -->
            we know that the L2 cache might evict something,
            could that thing hvae the same tag bits but different index bits looking at L1
            if the L1's tag are all the same, then this means that the L2 tag bits are all the same
                (because the L2 has less tag bits than the L1)
            so this is still the same scenario as (5)

- what about the AVX stuff we expect you to know
    ~ we don't expect you to know any instruction or instrinsic names
    ~ we may ask about whether code is easy/hard to vectorize
    ~ we might give you the names/descriptions of instructions/instrinsics and expect you
        to use them or undertand how they're used

- exceptions and pipelines
    ~ we didn't really talk about how exceptions interact with pipelines/OOO
    ~ when an exception occurs, the processor should choose an instruction to stop at
        and cancel/undo the operatoins of everything at/after that instruction
            - "precise exceptions"
    ~ THEN the processor will run the OS
        
    ----
    ~ so, if the OS jumps back to the instructoin that we stopped at,
        that should run with the same register values, etc. that it originally had
        (unles the OS changed something)

~~~~~~~~~~~~~~~~~~~~round 2~~~~~~~~~~~~~~~~~~~


- Q14 Q1 (vectorization)
    which of these would benefit from a vector load instruction that
        rathre than taking one address and loading a bunch of things from there
        takes a vector of pointers and loads one thing from each of those pointers?
        [the "load vector of pointerS" instruction is going to be slower than the
            load a bunch from one pointer instruction]

    A: for (int i = 0; i < 8192; i += 1) { A[i] += B[i-1] + B[i] + B[i+1]; }
        to vectorize this, I'm going to try doing multiple i at a time:
       for (int i = 0; i < 8192; i += 4) {
            A[i+0] += B[i-1] + B[i+0] + B[i+1];
            A[i+1] += B[i+0] + B[i+1] + B[i+2];
            A[i+2] += B[i+1] + B[i+2] + B[i+3];
            A[i+3] += B[i+2] + B[i+3] + B[i+4];
        }          

        and I can do each of the adds vertically with one vector-add instruction
        and the values I want to load in vectors to do this are:
            A[i+0] through A[i+3]
            B[i-1] through B[i+2]
            B[i+0] through B[i+3]
            B[i+1] through B[i+4]
        which are all contiguous ---> I can use the "load a bvunch of things from one address"
            (though maybe I wnat to figure out a way of loading from B fewer times
                becaues the loads overlap ---> but I could do that with "shuffling" instructions
                that move values around in vectors in registers)
   
    B:
    for (int i = 0; i < 8192; i += 1) { 
        for (int j = 0; j < 8192; j += 1) {
            A[i*8192+j] = B[i*8192 + j] * C[j];
        }
    }
    to vectorize this, let's try doing multiple j
        (this is my first idea because it has better locality than multiple i
            for A, which is what we're writing to)
    for (int i = 0; i < 8192; i += 1) { 
        for (int j = 0; j < 8192; j += 4) {
            A[i*8192+j] = B[i*8192 + j] * C[j];
            A[i*8192+j+1] = B[i*8192 + j+1] * C[j+1];
            A[i*8192+j+2] = B[i*8192 + j+2] * C[j+2];
            A[i*8192+j+3] = B[i*8192 + j+3] * C[j+3];
        }
    }
    ---> looks like all the multiplies can be done with vector instructions well
    we want to load:

        A[i*8192+j] through A[i*8192+j+3] --> load a bunch GOOD
        B[i*8192+j] through B[i*8192+j+3] --> load a bunch GOOD
        C[j] through C[j+3] --> load a bunch GOOD


    C: 
    for (int i = 0; i < 8192; i += 1) {
        B[i] = A[T[i]];
    }

    for (int i = 0; i < 8192; i += 4) {
        B[i] = A[T[i]];
        B[i+1] = A[T[i+1]];
        B[i+2] = A[T[i+2]];
        B[i+3] = A[T[i+3]];
    }

    want to load:
        T[i] through T[i+3] --> load a bunch GOOD
        A[T[i]], A[T[i+1]], A[T[i+2]], A[T[i+3]]
            these could be A[X], A[X+54321], ... depending on T
            --> load a bunch is NOT GOOD
            --> probably we want to use the fancier load instruction
    

    D: 
        for (int i = 0; i < 8192; i *= 3) { A[i] *= B[i]; }
        question was meant to be:
        for (int i = 1; i < 8192; i *= 3) { A[i] *= B[i]; }
    
        for (int i = 1; i < 8192; i *= 3 * 3 * 3 * 3) { 
            A[i] *= B[i];
            A[i*3] *= B[i*3];
            A[i*3*3] *= B[i*3*3];
            A[i*3*3*3] *= B[i*3*3*3];
        }
        multiply we could do nicely with vector instructions, but loading/stoing
        is not easy:

        want to load A[i], A[i*3], A[i*3*3], ... --> load from one address instruction
            won't work

~~~~~~round 3~~~~

- bitwise operators: what do we need to know
    ~ bitwise AND (&), OR (|), XOR (^), complement (~)
    ~ left and right shift
    ~ arithemetic (copy sign bit) versus logic (add 0s) right shift
- bit masking
    ~ when we want to do something with bitwise operators, usually
        our primary idea is "select a few bits" THEN "do something do them"
        to select a few bits, we usually make a _mask_
            which has 1s in the selected bits and 0s otherwise
            (or sometimes 0s in selected and 1s otherwise)
    ~ we can view bitwise AND, OR, XOR as "applying" a mask
        ~ bitwise AND with a MASK with 1s for the selected bits
            "keep these bits" [and clear the non-selected bits]
        ~ bitwise XOR with a MASK with 1s for the selected bits
            "flip these bits" [and don't change the others]
        ~ bitwise OR with a MASK with 1s for the selected bits
            "set these bits (regardless of what they were before)" [and don't change the others]S
    ---
    ~ in addition to selecting bits, we often want to move bits to particular places,
        often we'll 
            1. select to keep those bits with a mask and bitwise AND; then
            2. use bit-shifting to move those bitsto a particular part of result