- exam format generally
  - multiple choice and short answer questions, similar to the quizzes
  - generally avoid questions where I would expect you to use a compiler / look things up
  - no calculators
  - you should be able to convert between binary and hexadecimal
  - roughly evenly covering the semester
    (but some topics are more represented, especially if they aren't easy to cover in assignments)
    (and some topics are more represented because of what exam questions I wrote + space)
  - we will supply a reference sheet that's linked off the schedule
  - my estimate: equivalent to ~40-43ish quiz questions
- exam format: specificity of HCL questions
  - generally: not going to ask about HCL syntax details
  - not going to have you memorize HCL built-in components
- Q15 Q1-2 (allocate-on-demand)
  - allocation occurs when a page fault happens
  - so to "set up" allocate-on-demand: the OS needs to make sure a page fault will happen
    --> we configure the memory region as invalid in the page table
    (page table entries are always looked up by virtual address)
- Q15 Q3-4 (multi-level PT)
  - PTBR = 0x44440000; virtual address = 0x12345678
    first-level PTE: valid, PPN 0x1111 (stored at 0x44440???)
    second-level PTE: valid, PPN 0x4321 (stored in physical page 0x1111)
  - 8-byte page table entries (PTEs)
  - 8192-byte pages --> 2^13-byte pages --> 13-bit page offset
  - 1024 entries per table --> 2^10 entries per table --> 10-bit virtual page number parts
  - Q3: physical address of the first-level PTE
    PTBR + (first part of the VPN) * (PTE size)
    = PTBR + (0x12345678 >> 13 [remove page offset] >> 10 [remove the second part of the VPN]) * (8 [PTE size])
    = 0x44440000 + 0x24 * 8 = 0x44440120
  - Q4: final physical address
    0x1FFF --> 13 1s in binary
    (0x4321 [final PPN] << 13 [to move it into where the PPN goes in the address]) + (0x12345678 & 0x1FFF [extract page offset])
    = 0x8642000 + 0x1678 = 0x8643678
- Q15 Q5-6 (TLBs?)
- TLB
  - read a 1st-level PTE at index 0x0 of a table (VPN part 1 is 0)
  - read a 2nd-level PTE at index 0x14 of a table (VPN part 2 is 0x14)
    --> VPN 0x014 = 0001 0100 in binary
        low 4 bits (0100) --> TLB index = 0x4
        remaining bits (0001) --> TLB tag = 0x1
  - 32-entry 2-way set associative TLB --> 16 sets (32/2) --> 4 index bits
- bin to hex
  - every 4 bits is one hex digit
  - take groups of 4 bits from the least significant side, find the hex digit for each
- hex to bin
  - every hex digit is 4 bits
  - convert each hex digit to binary (starting from the least significant side)
  - if you want, e.g., bits 13-20 (zero-based), then you can skip entire hex digits:
    0x1234
       ^--- bits 0-3
      ^---- bits 4-7
     ^----- bits 8-11
    ^------ bits 12-15
    try to skip whole hex digits: e.g. for bits 13-20, you only need to look at the '1' above
    (and account for leading 0s that aren't explicitly written)
- static linking
  - end up with an executable we can copy into memory and jump to
  - to make this work, we need to make it so jump, call, etc. instructions have addresses (or offsets) filled in correctly
    - the linker needs to know where all the things like 'call foo' or 'movq $global_string_variable, %rdi' are
      (uses of what should end up being a hard-coded address/offset)
      so it can make these have the correct values in the machine code
      (even though the assembler doesn't know what these values are)
  - versus dynamic linking: same high-level idea, but we do part of the address fixup when we load the program,
    so we can choose what libraries, etc. we load at program start time
- object file contents
  - we can't know addresses of anything yet (and won't know most offsets)
  - symbol tables: if we have 'foo:' in our assembly, we need to remember the location of 'foo'
    --> symbol table entry for it: "hey, linker, 'foo' is at this location in the machine code/data"
  - relocations: (insert pointer at location X)
    if we use 'foo' (other than defining it), we need to tell the linker about this
    e.g. 'call foo', 'mov $foo, somewhere', 'jmp *foo+0x123(%rax,8)', ...
    --> relocation table entry for it: "hey, linker, put 'foo' here"
    (the linker doesn't understand machine code, it just does what the relocations say)
  - example:
    global_string_variable:                      <<< identifies a symbol table entry (and relocation table entries elsewhere)
        .string "This is the initial value."     <<< going to be part of the data in the .o/.exe
  - assembler directive ".global foo" --> "foo can be used from other files"
    (without this, the assembler might not put the name "foo" in the symbol table in a way that other files can use)
- ISA versus microarchitecture
  - instruction set architecture --> the interface supplied to software (operating systems, assemblers, etc.)
    generally not including performance
  - microarchitecture --> the actual hardware implementation
    includes: how fast instructions are, what can run in parallel, how register values are stored, ...
- pipelining
  - we divide our instructions into pieces
    how many pieces? # of pipeline stages (pipehw2: 5 stages)
  - instruction 1 does piece K while instruction 2 does piece K-1 while instruction 3 does piece K-2, ...
  - extra work for "hazards" --> cases where instructions depend on each other,
    so we have to do work besides dividing into stages (and passing instructions from one stage to the next)
- Q6 Q5 (throughput in pipeline)
  - "Suppose a 5-stage pipelined processor achieves a throughput of 1000 instructions/microsecond.
    If one converted this into a single-cycle processor, one would expect a throughput ____."
  - assuming no hazard-related stalls, the clock rate is 1000 cycles/microsecond
  - best possible pipelined processor case: the pipelined processor's stages all take the same amount of time,
    so the single-cycle version's cycle would be 5x as long --> 1000/5 = 200 instructions/microsecond
  - worst possible pipelined processor case: we're really doing all the work in one stage,
    so the single-cycle version would still get 1000 instructions/microsecond
  --> the answer should be in between the best/worst pipelined processor designs
- stalling and forwarding
  - forwarding: we need a value in stage K that hasn't been written yet, *but* the instruction
    that's supposed to write it is now in stage K+x and has computed it
    --> solution: take the value from stage K+x (even though it hasn't been written yet)
  - sometimes, we need a value that hasn't been written yet AND hasn't been computed yet
    --> we need to stall until it's been computed
    --> usually after we stall that much, it still hasn't been written
    --> we use forwarding to resolve the "hasn't been written" problem
- cache tag/index splitting
  - to find the tag: first find everything else --> the tag is what's "left over"
    - everything else = index size + offset size
    - if you have an address and know the index and offset are the least significant part,
      you can take that least significant part out and what remains is the tag
      (you might not be told the tag size directly)
  - to find the index --> count the number of sets
    think about "rows" in our cache diagram:
      --way 0-- --way 1-- ...
      | blk blk | blk blk |
    # of sets = cache size / block size [bytes in a block] / # of ways
    index bits = log2(# of sets)
- write policies (through/back/allocate)
  - if we've already decided we're storing a value in a cache, we have decided how to handle writes to that value
    --> NO MATTER OUR POLICY: we must update the copy in the cache
    --> POLICY QUESTION: do we also update the next level?
        --> write-through: yes
        --> write-back: no --> so we need to remember to update later (dirty bit)
  - if we haven't decided we're storing a value in a cache and there's a write to it:
    --> POLICY QUESTION: do we store it anyway?
        --> write-allocate: yes
        --> write-no-allocate: no
            (if we're not storing it, it had better be sent to the next level, so we don't forget about it)
- cache blocking
  - work on groups of values that fit in our cache (typically values from the matrices/arrays),
    and within a group, do as much as possible before moving on to the next group,
    to take the most advantage of what fits in the cache
  - improves on choosing the right "loop order"
    --> because we're hopefully going to avoid having values with very poor spatial/temporal locality
    e.g. in matrix multiply, we're not going to load a value from a matrix and use it just once before it gets evicted
- Q10 Q6-9 (spatial/temporal locality in C code)

    void foo1() {
        for (int i = 0; i < 1024; ++i) {
            for (int j = 0; j < 1024; ++j) {
                A[i + j] = B[j * 4] * C[i * 1024 + j];
                /* A[i + j]: 0 + 1 and 1 + 0 are the same index --> potential temporal locality */
            }
        }
    }

    void foo2() {
        for (int i_outer = 0; i_outer < 1024; i_outer += 16) {
            for (int j = 0; j < 1024; ++j) {
                for (int i = i_outer; i < i_outer + 16; ++i) {
                    A[i + j] = B[j * 4] * C[i * 1024 + j];
                    /* same potential temporal locality in A[i + j] */
                }
            }
        }
    }

  - temporal locality in A: better in version 2, because we access 0 + 1 and 1 + 0
    (and 1 + 1 and 0 + 2 and 2 + 0, and other pairs like this) in closer proximity
  - temporal locality in B: better in version 2, because we use the same j 16 times in a row
    (versus not repeating the same j for 1024 iterations)
  - spatial locality in C: better in version 1: we always advance j by 1 --> adjacent indices
    (versus version 2, where we skip by 1024)
  - temporal locality in C: we never access the same index twice --> same in both
- OOO architecture block diagram
  - in-order part:
    - fetch as usual (but multiple instructions at once)
    - branch prediction (but multiple instructions at once)
    - register renaming --> preprocess instructions to identify potential conflicts
  - out-of-order part:
    - (unordered) queue of instructions we could run
    - execution units that are given instructions as their inputs become available, and write back results
  - in-order part: "commit":
    - identifying what registers are available for use
    - handling exceptions, etc.
    - part of handling branch mispredictions/squashing
    - other bookkeeping
- exceptions
  - something happens that requires the OS to run (because the processor doesn't know what to do about it itself):
    - I/O (e.g. network input, keypress)
    - the program did something weird (accessed memory through an invalid page table entry)
    - system call
  - the processor remembers where the program was
  - jumps to the OS in kernel mode
- Q14 Q2-4 (single-level page tables)
  - Q2: "Suppose a system has 30-bit virtual addresses, 24-bit physical addresses, and 1024-byte pages.
    If this system uses a page table stored in memory as an array with 4-byte page table entries,
    how large is this page table in bytes? (Write your answer as a base-10 number.)"
    - 1024-byte pages --> 2^10-byte pages --> 10-bit page offsets
    - 30-bit virtual address: [ 20-bit virtual page number ][ 10-bit page offset ]
    - the page table is a single array of 4-byte page table entries --> need one for each virtual page number
    - 2^20 possible virtual pages, 2^20 * 4 bytes --> 2^22 bytes = 4194304
  - Q3: page table base register (PTBR) of 0x88000, VA 0x12345
    - 4096-byte pages --> 2^12-byte pages --> 12-bit page offset
    - 4-byte PTEs
    - compute the address of the PTE (1-level table):
      PTBR + (0x12345 >> 12 [remove the page offset]) * (4 [PTE size])
      = 0x88000 + 0x12 * 4 = 0x88048