- exam date 10 December: probably 12PM 10th --> 12PM 11th
- topics on exam
    - everything that we covered in lecture/quizzes/labs/HWs
- grading scale and CRs
    - what would be a C- in a "normal" semester will be a C/CR
    - minimally 70 --> C, 80 --> B-, etc.
- object file / executable contents
    ~ assembly files
        human-readable representation of instructions (not ready for the
        processor to use)
        labels to identify the locations of things in the same or other files
    ~ object files
        machine-readable representation of the instructions and of constant
        data (sometimes, multiple "segments" of machine code/data that could
        become part of an executable later)
        BUT labels can't all be resolved into the addresses the processor
        expects, so to deal with labels:
        (1) [relocations] where the labels are used
                example: "call printf" (uses label printf)
                example: "jle top_of_loop" (uses label top_of_loop)
            we put SOMETHING in the machine code for the instruction and
            generate a "relocation" saying that the SOMETHING is wrong and
            needs to be replaced
            information in the relocation --- separate table (not in machine
            code/data):
            * location of the SOMETHING (the placeholder)
            * name of the label, which we'll find by looking at the files
              used to make an executable
            (see the relocation-patching sketch in C after the bitwise notes
            below)
        (2) where the labels are defined:
                "top_of_loop: ..."
                "printf: ..."
            we need to note the location within the machine code of that
            label (the processor doesn't have a "mark label" instruction,
            so this is out-of-band)
            this is an entry in the *symbol table*, and the labels are
            called *symbols*
            information in the symbol table --- separate table (not in
            machine code/data):
            * name of the label
            * its location in the file
    ~ executables
        combine object files and replace relocations (placeholders) with
        addresses the processor will understand
        when we're done, we should have machine code+data that can just be
        loaded into memory and jumped to
- relocation versus symbol table
    (relocations = where labels are used; symbol table = where labels are
    defined --- see above)
- bitwise operation use
    "0x012 3456 789A"   extract bits (12+10+10=) 32 to 42 (bit 1 is LSB)
        >> 32 = 0x12         --- removes the bottom 32 bits
        & 0x3FF              --- keeps the bottom 10 bits
                                 0x3FF HEX = 0011 1111 1111 BIN (10 bits set)
        0x12 & 0x3FF = 0x12  --- bits 32 to 42 of the original value
    "0x012 3456 789A"   extract bits (12+10=) 22 to 32 (bit 1 is LSB)
        >> 22 = 0x48D1       --- removes the bottom 22 bits
        & 0x3FF = 0xD1
        -- alternative: keep bits 22 to 32, then shift over --
        & 0xFFC00000         (= 1111 1111 1100 0000 0000 0000 0000 0000 BIN)
          = 0x3440 0000
        >> 22 = 0xD1
    * to select particular bits: AND with them (examples above)
    * to do other things to particular bits:
        OR  --- to set them
        XOR --- to flip them
    * to move bits to the appropriate part of a number, bitshift with << and >>
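    a minimal sketch in C of the extractions above (extract_bits is a
    made-up helper name, not something from the course materials):

        #include <stdio.h>
        #include <stdint.h>

        /* keep `count` bits of `value`, after removing the bottom `low` bits */
        uint64_t extract_bits(uint64_t value, int low, int count) {
            uint64_t mask = ((uint64_t)1 << count) - 1;  /* count=10 --> 0x3FF */
            return (value >> low) & mask;
        }

        int main(void) {
            uint64_t va = 0x0123456789A;
            printf("%llx\n", (unsigned long long) extract_bits(va, 32, 10)); /* 12 */
            printf("%llx\n", (unsigned long long) extract_bits(va, 22, 10)); /* d1 */
            /* alternative order: mask first, then shift */
            printf("%llx\n", (unsigned long long) ((va & 0xFFC00000) >> 22)); /* d1 */
            return 0;
        }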
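    and a minimal sketch of how relocations + the symbol table get used when
    building an executable; the struct layouts and 4-byte absolute-address
    patches are simplifying assumptions, not the real ELF format:

        #include <stdint.h>
        #include <string.h>

        struct symbol     { const char *name; uint32_t address;  }; /* label defined */
        struct relocation { const char *name; uint32_t location; }; /* label used    */

        /* replace each placeholder (the SOMETHING) in the machine code
           with the address recorded for that symbol */
        void apply_relocations(uint8_t *code,
                               const struct relocation *rels, int nrels,
                               const struct symbol *syms, int nsyms) {
            for (int r = 0; r < nrels; ++r)
                for (int s = 0; s < nsyms; ++s)
                    if (strcmp(rels[r].name, syms[s].name) == 0) {
                        memcpy(code + rels[r].location,
                               &syms[s].address, sizeof(uint32_t));
                        break;
                    }
        }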
- RISC v CISC
    Reduced Instruction Set Computer v Complex Instruction Set Computer
    a RISC-like instruction set is easier for the hardware designer
    but typically makes more work for the assembly programmer/compiler
    common attributes of RISC:
    - instructions do less stuff
    - typically fewer ways of specifying operands to instructions
        - typically no instructions that both read+write from memory
        - typically no/fewer instructions that write multiple registers
        - instructions that do arithmetic don't also access memory
    - more general-purpose registers
        - makes up for fewer instructions being able to access memory
        - it's an easy way to use extra hardware to possibly make programs
          faster
    it's hard to find examples of things which are "definitely completely
    RISC" or "definitely completely CISC" --- usually somewhere in between
- amount of stalling due to data dependencies without forwarding
                    v--- when we write registers
    F | D | E | M | W
        ^^^ when we read registers

    addq %rax, %rbx                 F | D | E | M | W
    extra instruction 1                 F | D | E | M | W
    extra instruction 2                     F | D | E | M | W
    extra instruction 3                         F | D | E | M | W
    first instruction that can read                 F | D | E | M | W
      %rbx w/o forwarding                               ^^^

    with a combined execute+memory stage:
    F | D | EM| W

    addq %rax, %rbx                 F | D | EM| W
    extra instruction 1                 F | D | EM| W
    extra instruction 2                     F | D | EM| W
    first instruction that can read             F | D | EM| W
      %rbx w/o forwarding                           ^^^
- Quiz 8: setting stall/bubble
    cycle:          0   1
    W
    M | W
    E | M | W
    D | E | M | W
    F | D | E | M | M | W      <-- scenario in quiz question: need to
                                   "stay in memory stage until cycle 1"
                  | W          <-- inserted nop
        F | D | E | E | M | W
            F | D | D | E | M | W
                F | F | D | E | M | W
                ^^^
                 \-- quiz Q1-4: what's happening here
    fetch to decode     --> decode in cycle 1 is outputting the instruction
                            that was in decode in cycle 0
    decode to execute   --> execute in cycle 1 is outputting the instruction
                            that was in execute in cycle 0, which means that
                            cycle 1 repeats the same calculation with the
                            same inputs as cycle 0 did
    execute to memory   --> memory in cycle 1 is outputting the instruction
                            that was in memory in cycle 0
    memory to writeback --> "no instruction" in writeback in cycle 1?
                            --- must be an inserted nop
- write-through v write-back
    when we want to write a value using a cache:
    (1) we look up the value in the cache
    (2) if the value's not in the cache:
            if the cache is write-allocate, we add it to the cache
                [could involve eviction]
            if the cache is not write-allocate, we write it to the next
                level and stop
    (3) if the value is in the cache now (after lookup + any write-allocate):
            update the value in the cache
    (4a) if it is a write-through cache, we also write to the next level
    (4b) if it is a write-back cache, we update the dirty bit to remember to
         write it later
    when we want to evict a value from the cache:
    (1) if it is a write-back cache AND the value's dirty bit is 1, we write
        the value to the next level
    (2) we discard the value
    (C sketch of this write path below, after the fully-associative notes)
- fully associative
    - # sets = 1 --> 0 index bits
    - degenerate case of a set-associative cache where everything's in the
      same set and we have to compare all the tags (one for each block stored)
    - cache with no conflict miss problems
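    a minimal sketch in C of the write path above, for one cache block; the
    struct and the commented-out helper (write_next_level) are made up for
    illustration, not a full cache simulator:

        #include <stdbool.h>

        struct block { bool valid, dirty; long tag; /* ...data bytes... */ };

        void handle_write(struct block *b, long tag,
                          bool write_allocate, bool write_through) {
            bool hit = b->valid && b->tag == tag;      /* (1) lookup */
            if (!hit) {
                if (!write_allocate) {
                    /* (2) write it to the next level and stop */
                    /* write_next_level(tag); */
                    return;
                }
                /* (2) add to the cache; evicting a dirty block writes it back */
                if (b->valid && b->dirty) { /* write_next_level(b->tag); */ }
                b->valid = true; b->dirty = false; b->tag = tag;
            }
            /* (3) update the value in the cache here */
            if (write_through) { /* (4a) also write to the next level */ }
            else               { b->dirty = true; /* (4b) write it later */ }
        }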
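    and, leading into the Quiz 9 trace below, a sketch of how addresses
    split for that cache (2 sets --> 1 index bit, 8B blocks --> 3 offset
    bits); the variable names are just for illustration:

        #include <stdio.h>

        int main(void) {
            unsigned addrs[] = { 0x104, 0x101, 0x108, 0x110, 0x118 };
            for (int i = 0; i < 5; ++i) {
                unsigned a      = addrs[i];
                unsigned offset = a & 0x7;         /* bottom 3 bits: byte in block */
                unsigned index  = (a >> 3) & 0x1;  /* next 1 bit: which set */
                unsigned tag    = a >> 4;          /* everything else */
                printf("0x%03x -> set %u, offset %u, tag 0x%x\n",
                       a, index, offset, tag);
            }
            return 0;
        }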
- Quiz 9: when do we hit in a direct-mapped cache?
    Questions 3-4 --- 2 sets, 8B blocks, write-back, write-allocate
                      ^^ --- blocks are 0x0->0x7, 0x8->0xF, 0x10->0x17, ...
    read from 0x104   miss
        cache contains: set 0 [0x100-0x107]          set 1 []
    write to 0x101    hit
        cache contains: set 0 [0x100-0x107, dirty]   set 1 []
    write to 0x102    hit
        cache contains: set 0 [0x100-0x107, dirty]   set 1 []
    write to 0x108    miss
        cache contains: set 0 [0x100-0x107, dirty]   set 1 [0x108-0x10F, dirty]
    read from 0x109   hit
        cache contains: set 0 [0x100-0x107, dirty]   set 1 [0x108-0x10F, dirty]
    read from 0x10F   hit
        cache contains: set 0 [0x100-0x107, dirty]   set 1 [0x108-0x10F, dirty]
    read from 0x110   miss, requires replacing something dirty
                      --> writes back a value
        cache contains: set 0 [0x110-0x117]          set 1 [0x108-0x10F, dirty]
    read from 0x118   miss, requires replacing something dirty
                      --> writes back a value
        cache contains: set 0 [0x110-0x117]          set 1 [0x118-0x11F]
    write to 0x119    hit
        cache contains: set 0 [0x110-0x117]          set 1 [0x118-0x11F, dirty]
- loop orders and cache miss counts
    for (i = 0; i < N; ++i)
        for (j = 0; j < N; ++j)
            A[i], B[j], C[j*N], D[i*N]
    looking at the difference between what's accessed in the innermost loop
    as a quick estimate of misses:
    A[i]   --> we expect good temporal locality --> hit every time
               (~0 misses per iteration of the j loop)
    B[j]   --> we expect good spatial locality --> hit every time in the
               same cache block (~1/K misses per iteration of the j loop,
               where K is the # of items in a cache block)
    C[j*N] --> we expect bad locality of all kinds --> miss every time
               (1 miss per iteration of the j loop)
    D[i*N] --> we expect good temporal locality --> hit every time
               (~0 misses per iteration of the j loop)
    if we swap i and j --> we get different locality
        sometimes: obvious improvement
        other times: might have to run it to determine, if performance matters
    (loop-interchange sketch in C below, after the aliasing notes)
- aliasing
    for (i = 0; i < 3; ++i)
        for (j = 0; j < 3; ++j)
            for (k = 0; k < 3; ++k)
                A[i*3+k] += B[k*3+i];
    the loop order matters BECAUSE A and B could be parts of the same array
    if we set A and B to pointers to parts of the same array, then the
    compiler can't do optimizations like changing the loop order or keeping
    values in registers
    logic for figuring this out:
    > consider, e.g., A = &array[0], B = &array[K]; discover that you're
      writing things and then reading them back, and if you change the
      order of this, that's going to change the answer
    the compiler needs to worry about this b/c its code needs to take that
    into account
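    a sketch in C of the aliasing problem above; `restrict` (C99) is one way
    to promise the compiler that the pointers don't overlap (the function
    names here are made up):

        /* the compiler must assume writes through A may change later reads
           through B, so it can't reorder the loops or cache B in registers */
        void update(double *A, double *B) {
            for (int i = 0; i < 3; ++i)
                for (int j = 0; j < 3; ++j)
                    for (int k = 0; k < 3; ++k)
                        A[i*3+k] += B[k*3+i];
        }

        /* a legal call where they DO overlap (array needs >= 10 elements),
           so the compiler's caution is justified */
        void demo(double *array) { update(&array[0], &array[1]); }

        /* restrict = programmer's promise of no overlap; now the compiler
           may interchange loops / keep values in registers */
        void update_restrict(double *restrict A, double *restrict B) {
            for (int i = 0; i < 3; ++i)
                for (int j = 0; j < 3; ++j)
                    for (int k = 0; k < 3; ++k)
                        A[i*3+k] += B[k*3+i];
        }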
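    and a sketch of the loop-interchange idea from the miss-count notes
    above (N and the global arrays are placeholders):

        #define N 1024
        double A[N], B[N], C[N * N], D[N * N];

        /* i-then-j: C[j*N] strides by N in the inner loop --> ~1 miss per
           access; A[i] and D[i*N] don't change in the inner loop */
        void ij_order(void) {
            for (int i = 0; i < N; ++i)
                for (int j = 0; j < N; ++j)
                    A[i] += B[j] + C[j * N] + D[i * N];
        }

        /* j-then-i: now C[j*N] and B[j] are fixed in the inner loop, but
           D[i*N] strides by N instead --> might have to run both to see
           which wins */
        void ji_order(void) {
            for (int j = 0; j < N; ++j)
                for (int i = 0; i < N; ++i)
                    A[i] += B[j] + C[j * N] + D[i * N];
        }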
- Quiz 11 Q1 --- OOO processor latency
    1-cycle adder; 3-cycle multiplier, pipelined
    a = b + c;  d = e * f;  h = a + d;  i = a * d;

                1     2     3     4     5     6
    adder:     b+c               a+d
    mult s1:   e*f               a*d   <-- couldn't start any earlier
                                           b/c 'd' wasn't computed yet
    mult s2:         e*f               a*d
    mult s3:               e*f               a*d
               <---------------------------------> = 6 cycles of time needed
- processor versus OS jobs
    the processor is the hardware
        typically handles address translation
        calls the OS when programs do something weird, using an exception
        knows what the current registers + page table + exception base
        pointer are, but otherwise not what particular programs are running
    the OS (operating system) is software
        typically the OS tracks what programs are running, what memory is
        allocated, etc.
- first page table entry address in multi-level table in Quiz 13
    3 x 10-bit VPN parts, 12-bit PO
    VA 0x012 3456 789A
    PTBR 0x44 000
    VPN part 1 is bits 32 to 42 of the VA: 0x12
    address of PTE = PTBR + VPN part 1 * size of PTE
                   = 0x44000 + 0x12 * 4 = 0x44048
    quiz is miskeyed --> will fix after lecture
    (page-walk sketch in C at the end of these notes)
- page table size from Quiz 13
    20-bit VA, 4-byte PTEs, 1024-byte pages
    1024-byte pages --> 10-bit page offset
        <-------- 20-bit VA ------->
        |    VPN     |     PO      |
        <---- 10 ----><---- 10 --->
    minimum VPN = 0
    maximum VPN = 2^10 - 1
    number of virtual pages = 2^10
    single-level page table --> page table size
        = (number of virtual pages) * PTE size
        = 2^10 * 4 = 2^12 = 4096 bytes
- three-level lookup result Q in Quiz 13
    the page table entry for the first level was valid and contained
    physical page number 0x99, the page table entry for the second level
    was valid and contained physical page number 0xA4, and the page table
    entry for the third level was valid and contained physical page number
    0xC3, when looking up virtual address 0x012 3456 789A
    final PPN: 0xC3 (the earlier levels' PPNs just locate the next table)
                        vvvv--- same as in the VA
        final PA: | PPN | PO |
                   ^^^^^ 0xC3 from the third level
        = 0xC389A --- the PO is 0x89A because it's the last 12 bits of the
                      VA (12 bits was the size of the PO in this question)
- TLBs
    lookup using the full VPN
    when the lookup works: we don't access ANY page tables
        what we look up is the entry we'd get at the END of the page table
        lookup process
    when the lookup doesn't work: we take the value from the end of the
        page table lookup process and remember it for next time
    TLBs are stored like caches, splitting the VPN into tag and index
        (the offset is 0 bits; VPN instead of physical address)
    a TLB's data is a PTE (not bytes from program memory like in caches)
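    a sketch in C of the Quiz 13 address arithmetic above (the 4-byte PTE
    size comes from the quiz setup; the variable names are made up):

        #include <stdio.h>
        #include <stdint.h>

        int main(void) {
            uint64_t va   = 0x0123456789A;
            uint64_t ptbr = 0x44000;

            uint64_t vpn1 = (va >> 32) & 0x3FF;   /* 0x12  */
            uint64_t vpn2 = (va >> 22) & 0x3FF;   /* 0xD1  */
            uint64_t vpn3 = (va >> 12) & 0x3FF;   /* 0x167 */
            uint64_t po   = va & 0xFFF;           /* 0x89A */

            uint64_t first_pte = ptbr + vpn1 * 4;        /* 0x44048 */
            uint64_t ppn       = 0xC3;  /* from the third-level PTE */
            uint64_t pa        = (ppn << 12) | po;       /* 0xC389A */

            printf("vpn1=%llx first PTE at %llx, PA=%llx\n",
                   (unsigned long long) vpn1,
                   (unsigned long long) first_pte,
                   (unsigned long long) pa);
            return 0;
        }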
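    and a sketch of the TLB tag/index split described above (the 16-set
    direct-mapped shape is an assumption for illustration):

        #include <stdint.h>

        #define TLB_SETS 16     /* 16 sets --> 4 index bits, 0 offset bits */

        struct tlb_entry { int valid; uint64_t tag; uint64_t pte; };
        struct tlb_entry tlb[TLB_SETS];

        /* returns 1 and fills *pte on a hit (no page table access at all);
           on a miss, the caller does the full page-table walk and then
           stores the resulting PTE here for next time */
        int tlb_lookup(uint64_t vpn, uint64_t *pte) {
            uint64_t index = vpn % TLB_SETS;   /* low bits of the VPN */
            uint64_t tag   = vpn / TLB_SETS;   /* remaining high bits */
            if (tlb[index].valid && tlb[index].tag == tag) {
                *pte = tlb[index].pte;
                return 1;
            }
            return 0;
        }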