- S2018 E1 Q8 ~~ LEA
    int array[4];
    int *y = array;   <-- &array[0]
    y += 2;           <-- advance 2 elements in the array -- &array[2]

    long **x;     <-- "pointer to pointer to long"
    x += **x;     <-- add (8 times **x) to the address in x
                      where the 8 is sizeof(long *): pretend we're advancing through an array of long *s that x points to
    x is stored in %r8:
    mov (%r8), %rax        # %rax = *x
    mov (%rax), %rbx       # %rbx = **x  <-- how many array elements (that are ptrs) to advance %r8 by
    lea (%r8,%rbx,8), %r8  # %r8 = %r8 + %rbx * 8
    this is basically what option A does

    some wrong options:
    option B:
    leaq (,%r8,8), %rax       # %rax = x * 8  <-- doesn't make sense
                              # multiplying the wrong thing by 8, then carried through....
    option C:
    mov (%r8), %rax
    mov (%rax), %rax          # **x extracted --- good!
    lea (,%rax,8), %r8        # x = **x * 8  <-- but no +=; the old value of x is dropped
    option D:
    mov (%r8), %rax           # *x
    lea (%r8, %rax, 8), %rax  # temp <- x + *x*8  <-- wrong thing (*x, not **x) multiplied by 8

- F2022 Week 2 Q4/5 ~~ relocation tables
    upperstr.c: void upperstr(char *str) { ... }
    printbig.c: void printbig(char *str) { upperstr(str); puts(str); }
    in the object files, we need:
    SYMBOL TABLE ENTRIES
    ~ a way to find upperstr in upperstr.o
    ~ a way to find printbig in printbig.o
    RELOCATION TABLE ENTRIES
    ~ a way to find where to put the address of upperstr in printbig's machine code
    ~ a way to find where to put the address of puts in printbig's machine code

- F2022 Week 3 Q1 ~~ bitwise
    (((x & 0xFF0F) >> 2) & 0xFF0) == (((x & 0xFF0) >> 2) & 0xFF0F)
    x = abcd efgh ijkl mnop
    left-hand side:
        & 0xFF0F: abcd efgh 0000 mnop
        >> 2:     00ab cdef gh00 00mn
        & 0xFF0:  0000 cdef gh00 0000
    right-hand side:
        & 0xFF0:  0000 efgh ijkl 0000
        >> 2:     0000 00ef ghij kl00
        & 0xFF0F: 0000 00ef 0000 kl00
        VERSUS:   0000 cdef gh00 0000   <-- the left-hand side's result
    the two sides are equal exactly when cd, gh, and kl are all zero (the ef bits always match)
    to test if cd, gh, kl are zero we can use (x & Y) == 0 with
        x = abcd efgh ijkl mnop
        Y = 0011 0011 0011 0000 = 0x3330
    which is what the question claims: the equality is the same as (x & Y) == 0 for some Y
    (a quick brute-force check of this is sketched below, after the Y86 notes)

- Y86 machine code fields
    general format:
    first byte:
        upper bits: primary opcode ~~ "row of the table"
        lower bits: if needed, a secondary opcode to distinguish similar instructions
            like: the variants of jmp/jle/etc., rrmov/cmovle/etc., the arithmetic ops
    second byte (if relevant to the instruction): register indices
        up to two register indices, taken from the registers named in the assembly
        using the table mapping register names to indices
    later bytes (if relevant to the instruction):
        [starts at the third byte if there is a register byte, at the second byte otherwise]
        8-byte constant:
            target address of a jump or call (where to go to)
            displacement for an rmmov or mrmov
            immediate value for an irmov
        constant stored in little endian, meaning the byte w/ the lowest address is the least significant part

    for example:  jmp 0x123  --->  70 23 01 00 00 00 00 00 00   as a sequence of bytes
                                   ^^ \---------------------/
                                   ||       8-byte constant, written w/ lowest address on left
                                   |secondary opcode, if needed (here 0 for plain jmp)
                                   primary opcode (here 7)
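    as a sanity check of the byte order, here's a minimal C sketch (mine, not part of the
    original notes) that builds the 9-byte encoding of jmp 0x123 by hand; it assumes it runs
    on a little-endian host like x86, so memcpy of a uint64_t writes the least significant byte first:

        #include <stdint.h>
        #include <stdio.h>
        #include <string.h>

        int main(void) {
            uint8_t encoding[9];
            uint64_t dest = 0x123;           /* the jump target */
            encoding[0] = 0x70;              /* primary opcode 7 (jump), secondary opcode 0 */
            memcpy(&encoding[1], &dest, 8);  /* little-endian host => lowest address gets least significant byte */
            for (int i = 0; i < 9; i++)
                printf("%02x ", encoding[i]);
            printf("\n");                    /* should print: 70 23 01 00 00 00 00 00 00 */
            return 0;
        }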
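    and, going back to the F2022 Week 3 bitwise question above, a quick brute-force C check
    (again mine, not part of the question) that the two sides match exactly when
    (x & 0x3330) == 0, i.e. Y = 0x3330:

        #include <assert.h>
        #include <stdio.h>

        int main(void) {
            /* both sides only depend on the low 16 bits of x,
               so checking every 16-bit pattern covers all cases */
            for (unsigned x = 0; x <= 0xFFFF; x++) {
                unsigned lhs = ((x & 0xFF0F) >> 2) & 0xFF0;
                unsigned rhs = ((x & 0xFF0) >> 2) & 0xFF0F;
                assert((lhs == rhs) == ((x & 0x3330) == 0));
            }
            printf("LHS == RHS exactly when (x & 0x3330) == 0\n");
            return 0;
        }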
- pipelining ~~ hazard handling
    ~ hazards
        ~ we make our pipeline and find out it doesn't run instructions correctly,
          because it doesn't handle dependencies automatically when we run parts of
          instructions in different orders
          (for example:
           ~ might write a register after it would be read in the pipeline -- "data hazard"
           ~ might compute the address of the next instruction after we need to fetch it -- "control hazard"
          )
    ~ strategies we learned
        ~ stalling --- get the operations back in order by inserting nops (no-operations) between them
            delay the thing that pipelining made happen too early relative to the other instruction's work
            e.g. wait to read a register until after it's written
            e.g. wait to run the instruction after a conditional jump until after the jump is computed
        ~ forwarding --- for data hazards only
            ~~ get the value after it's computed but before it's stored
            solves a majority of data hazards
            BUT requires that the value be computed before we need it in the later instruction
            (which won't always be true)
            ~ can combine with stalling:
                instead of stalling to wait for a register to be written,
                stall to wait for a value to be computed so we can forward it
            ~ most commonly: replaces the "wrong" value we just read from the register file
              (but we could do forwarding at other times, too)
        ~ what forwarding is possible
            ~ we need to have the value computed already
            ~ we don't want adding the forwarding to extend our critical path much
                TYPICALLY we forward from the "end" of a cycle to the "end" of a cycle,
                OR from the "beginning" of a cycle to the "beginning" of a cycle,
                b/c this won't add much to what needs to happen in a cycle
                let's say we try to forward from the "end" of Memory to the "beginning" of Execute:
                this would not work well, because now one cycle needs to do
                    beginning of memory, middle of memory, end of memory,
                    then beginning of execute, middle of execute, end of execute
                --> much longer cycle time needed to accommodate this
                --> kinda defeats having separate pipeline stages for execute + memory

- F2022 8a ~~ data cache misses
    data cache misses for A+B with a 3-way cache w/ 16 sets, 8-byte blocks
    (blocks fit two elements of A or B; A and B are arrays of 4-byte ints)
    for (int ii = 0; ii < 1024; ii += 16) {
        for (int j = 0; j < 1024; j += 1) {        <-- 1 miss for B[j] on every other iteration of this loop
            for (int i = ii; i < ii + 16; i += 1) {
                result += A[i] * B[j];             <-- 1 miss for every other A[i] when j = 0
            }
        }
    }
        vvvvvvvvvvvvvvvv---- can store all of these in one of the ways of 8 sets of the cache
        A[0]A[1]...A[15]  }
        A[0]A[1]...A[15]  }  1024 times (once per value of j)
    ~~ for the j = 0 pass of each j loop: 16 * .5 of a miss for A[i] --> 8 misses
    ~~ for each iteration of the j loop: .5 of a miss for B[j]
    1024 / 16 = 64 iterations where j = 0 (one per value of ii)
    1024 / 16 * 8 = 512 misses for the A[i]s
    1024 / 16 * 1024 iterations of the j loop in total
    1024 / 16 * 1024 * .5 = 32768 misses for the B[j]s

- dirty bits
    ~ for writes to caches
        if we are caching a value, we have two options:
            write-through: always update the next level
            write-back: defer the update until later
                ~ hope: maybe we'll change it again and avoid sending extra updates to the next level
                        maybe we'll never actually need to access it other than through this cache
    with write-back, we need to actually do the update eventually so it doesn't get lost
        one strategy we could try: whenever we evict something, update the next level
            but this would be really slow b/c we evict lots of things that haven't changed
            --> would do lots of extra writes
        INSTEAD: we store a _dirty bit_ that tells us if something needs to be updated in the next level
            ~ we set it when we modify a block in our cache w/o updating the next level

- aliasing
    ~ "two names" for the same value
        most commonly something like A[x] and B[y]
    means that the compiler can't do some optimizations that might be "obvious"
        A[x] += B[0]
        A[x] += B[1]
        A[x] += B[2]
        A[x] += B[3]
    you might like the compiler to put A[x] in a register
    BUT the compiler needs to handle the case where A[x] is another name for B[2],
    and we should get the same result as:
        B[2] += B[0]
        B[2] += B[1]
        B[2] += B[2]
        B[2] += B[3]
    there are also other optimizations (e.g. vectorization, cache blocking, etc.)
    where this would be a concern
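    a small C sketch (my illustration, not from the exam; the function names and the element
    type long are made up) of two ways to get A[x] into a register anyway:

        /* promising "no aliasing" with restrict: the compiler may now assume A[x]
           is not another name for any B[i], and keep it in a register across the adds */
        void sum_no_alias(long *restrict A, long *restrict B, int x) {
            A[x] += B[0];
            A[x] += B[1];
            A[x] += B[2];
            A[x] += B[3];
        }

        /* doing the "register promotion" by hand, without any promise: */
        void sum_by_hand(long *A, long *B, int x) {
            long t = A[x];
            t += B[0];
            t += B[1];
            t += B[2];
            t += B[3];
            A[x] = t;   /* careful: this gives a different answer than the original
                           code if A[x] really is another name for one of the B[i] */
        }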
- system calls
    ~ the processor reserves some operations for the OS
        ~ generally: things involving shared resources
            I/O devices
            OS-managed data for things shared between users
            (e.g. I shouldn't be able to access your program's data on portal and vice-versa)
    ~ but there are many legitimate things programs want to do with shared resources
        ~ print to the screen
        ~ stop another program
        ~ ...
    ~ the interface to run the OS via its exception handlers is called "system calls" ---
      programs trigger them when they want the OS to help them
        ~ the handler for exceptions in general runs in kernel mode,
          so it has access to reserved operations/resources
    ~ related to but different from: exceptions NOT triggered by the program deliberately
        ~ example: the OS might switch away from your program even though your program didn't request it
            not a system call, since your program didn't ask for this behavior
            ~ e.g. triggered by a timer or by I/O coming in that the OS got to handle
              despite whatever your program was doing
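    for a concrete example, a tiny C program (a sketch; Linux/Unix assumed) that prints to
    the screen by asking the OS --- write() is the C library's thin wrapper around the
    write system call:

        #include <unistd.h>

        int main(void) {
            const char msg[] = "hello via a system call\n";
            /* file descriptor 1 = standard output; the wrapper traps into the kernel,
               whose handler runs in kernel mode and talks to the I/O device for us */
            write(STDOUT_FILENO, msg, sizeof(msg) - 1);
            return 0;
        }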