- meeting with Prof Reiss
    my office hours are open-door, so we can't discuss things privately
    you can email me and we can probably schedule a time

- topic coverage (vectors)
    exam will be a majority on things from the last third of the semester
    about twice as long as a midterm

- likely: the 3:30pm section will be in Rice 130
    will send email with the actual policy

- storing values in the TLB

    the TLB is a cache:
                    vvvvv------ what the cache stores
        [ VPN ] --> [ PTE ]
        ^^^^^^^----- used to index and find the entry

    unlike a "normal" cache, the TLB:
    * uses the VPN instead of a physical address
    * splits the VPN into tag/index/offset
    * stores single page table entries instead of blocks of multiple bytes of data
        single page table entry --> 0 offset bits
    * relies on the OS running a special instruction to handle writes to PTEs
      (instead of detecting them automatically)

    values are stored in the TLB whenever we need to look up a page table entry
    for an access (and we don't find it in the TLB);
    stored based on the VPN used to look up the page table entry

    values are evicted when the OS runs a special instruction

- changing page tables
    whenever the OS changes a page table entry, it needs to make sure that no TLB
    contains a copy of it (or else the TLB might continue using that stale copy)
    on a multi-core system, there are multiple TLBs
    the special instruction to clear out parts of a TLB needs to run on the core
    with that TLB
    this means a multicore OS might need to run some code on a different core to
    edit a page table entry
    "TLB shootdown" --- trigger an interrupt just to run that piece of code on
    another core

- overlapping TLB and L1 access
    cache access steps:
    * look up the correct set of the cache            [index bits]
    * extract the correct part of the block           [offset bits]
    * compare the tag to determine the correct block  [tag bits]

    typically we will use only physical addresses for our caches
    (yes, there are processors which don't do this, but it creates other issues)
    tag/index/offset therefore are parts of the physical address

    virtual address:   [    VPN    ][ PO ]
    physical address:
                     vvvv------ same as in the virtual address (based on the page size)
    [    PPN    ][ PO ]
    [ cache tag ][ index ][ offset ]    (for the L1)

    we can figure out the page offset part of the physical address IMMEDIATELY
    therefore we can do the cache access steps that only require that piece of
    the address immediately
    if the index bits fit within the page offset, we can immediately look up the
    correct set
    --> start doing the cache lookup before we have the PPN (which we'll get
        from the TLB)

- how much space do page tables need for particular layouts [VM2 exercises]

    with a 4-level page table:

    - 1 valid page of data:
        first-level page table ---> second-level page table --->
        third-level page table ---> fourth-level page table ---> the single valid page
        ---> 4 page tables

    - 2 valid pages of data:
        Q: what are the virtual addresses of the pages?

        [                VPN                  ][ PO ]
        [ part 1 ][ part 2 ][ part 3 ][ part 4 ][ PO ]
          ^^^^^^^^ first-level lookup to find the second
                    ^^^^^^^^ second-level lookup to find the third

        if part 1 through part 3 are the same for both pages:
            then they share the first, second, third, and fourth-level tables
            ---> 4 page tables
        if part 1 is different for the two pages:
            then they share the first-level table and nothing else
            (doesn't matter if part 2 happens to be equal)
            one first-level, two second-levels, two third-levels, two fourth-levels
            ---> 7 page tables
        if part 1 is the same, but part 2 is different:
            then they share the first- and second-level tables
            one first-level, one second-level, two third-levels, two fourth-levels
            ---> 6 page tables

    - can generalize this to more pages: which page tables do we actually have
      to create?

- F2018 exam 3 #11 (second-level PTE)
    2^16 B pages (16-bit page offsets)
    40-bit VA: (40 - 16) = 24-bit VPN
    two-level page tables
    16-byte PTEs and an equal # of entries at each level
    --> 12-bit VPN parts for each level

    VA 0x13 4815 1234:
        first part of the VPN = 0x134, second part = 0x815, page offset = 0x1234

    the second-level page table is located at PA 0x10 0000
    this address must have been found by reading a page table entry, which was
    located using the page table base pointer + the first part of the VPN

    address of the second-level page table entry:
    use the base address we got from the first-level + second
    part of the VPN
        second part of the VPN = 0x815
        since each PTE is 16 bytes: 0x815 * 16 bytes from the beginning of the PT
        = 0x8150 bytes from the beginning
        0x10 0000 + 0x8150 = 0x10 8150

- page tables and normal caches [differences]
    page tables do not contain data (only pointers to data)
    TLBs, which cache page tables, do not contain data (only pointers to data)
    caches store data
    the page offset --- only used after the whole page table lookup;
    not really like a cache offset

- fork()
    Unix/POSIX choice for how to create processes
    makes a copy of the current process, including its memory and current registers
    a rationale: don't need to have a bunch of interfaces for specifying how the
    new process is set up

    typical implementations don't actually copy all the data in memory
    instead they make it shared read-only between the two processes
    and then they fix this whenever the OS is told that one of the two processes
    tried to write and couldn't
    fix by copying one page at a time
    "copy on write"

- latency and throughput bounds
    ~ both are ways of approximating the best performance you could have, given
      certain computational hardware
    ~ latency bound: what operations do we have to do one-after-another?
      how long is it from when we start each operation until it finishes?
      add this up for our list of operations

        for (int i = 0; i < N; ++i) { sum += a[i]; }

      need to do N add (+) operations, one after the other
    ~ throughput bound: how much stuff can we do versus how much work do we have
      to do?
      if I have to perform N multiplications and 2N additions,
      and I have two multipliers that can each start 1 multiplication/cycle
      and four adders that can each start 1 add/cycle:
      throughput bound for the multipliers: N/2 cycles to do the multiplications
      throughput bound for the adders: 2N/4 cycles to do the additions
      but this is assuming we don't have to worry about dependencies
    - the throughput bound ignores dependencies --- good for knowing how well you
      can do with optimizations that get rid of dependencies (e.g. multiple accumulators)
    - the latency bound accounts for dependencies --- good for estimating what
      performance should be given code with dependencies, or, if you can get rid
      of dependencies, for determining how close you are to what the hardware
      can do

- F2019 exam 2K #10 (cache blocking)
    4-byte ints in the matrix
    16B cache blocks (4 elements per block)
    fully-associative cache with an LRU policy

        int n = 160;
        for (i = 0; i < n; i += 4) {
            for (j = 0; j < n; j += 4) {
                for (iB = i; iB <= i + 3; iB++) {      // ----
                    for (jB = j; jB <= j + 3; jB++) {  // ----
                        matrix[jB * n + iB] *= i * j;
                    }
                }
            }
        }

    for each matrix block (in the loops marked ----) we access four cache blocks:
        matrix[j * n + i]     through matrix[j * n + i + 3]
        matrix[(j+1) * n + i] through matrix[(j+1) * n + i + 3]
        ...
    and we only access each matrix block once
    so assuming the matrix block fits in the cache, it will stay (LRU)
    and we'll have four misses per matrix block

    the cache is 256B --> 256/16 = 16 cache blocks --- enough to fit a matrix block
    we can't have fewer than four misses per matrix block, because we have to
    load each cache block in the matrix block at least once

    each matrix block is 4x4 out of 160x160 ---> 40x40 matrix blocks
    40 * 40 * 4 = 6400 misses

    some simplifications that helped in this question:
    * fully associative meant we didn't worry about sets/conflict misses
    * matrix blocks being whole cache blocks meant we didn't worry about keeping
      things in the cache between matrix blocks
    * matrix blocks only being used once / never overlapping meant we didn't
      worry about keeping things in the cache between matrix blocks
    BUT even for cases where the above doesn't hold, this can still be a useful
    *approximation* of how useful cache blocking is going to be

- counting cache misses w/ cache blocking

- F2018 exam 1 #22 - are these C expressions true? (bitwise)
    setup: x, y were unsigned ints between 0 and 1000 inclusive

    (x & y) < y
        (x & y) has a less-or-equal number of 1 bits than either x or y,
        and since unsigned --> equal or lower value than either x or y
        counterexample: x = y = 0 --> false

    (x | y)
    >= x
        (x | y) has a greater-or-equal number of 1 bits than either x or y,
        and since unsigned --> equal or greater value than x or y --> true

    ((x >> 3) & 0x7) == ((x & 0x3F) >> 3)
        x        = ????abcdefgh
        x >> 3   = ???????abcde
        & 0x7    = 000000000cde

        x        = ????abcdefgh
        & 0x3F   = 000000cdefgh
        >> 3     = 000000000cde
        --> the same --> true

    ((x + y) << 1) >= (x | (y << 1))
        left side:  (x + y) * 2 = x*2 + y*2
        right side: x | (y * 2)
        for non-negative a, b:  a + b >= a | b
        (a | b is like addition without carrying)
        so x | (y * 2) <= x + (y * 2) <= x*2 + y*2 --> true

- F2019 exam 1 #12 (bitwise)
    task: copy the low 12 bits of x over the next 12 bits, keeping the top 8
    bits; for example, the integer (specified in hexadecimal) `0x12345678`
    should become `0x12678678` (the middle 0x345 is replaced by a copy of the
    low 0x678; the top 0x12 stays)

    a.  `x = ((x & ~0xFFFFFF) | ((x << 12) | x) & 0xFFFFFF);`
            (x & ~0xFFFFFF) = 0x12000000
            (x << 12)       = 0x??678000
            but ((x << 12) | x) & 0xFFFFFF ORs the old middle bits back in
            --> wrong

    *a. `x = ((x & ~0xFFFFFF) | (((x & 0xFFF) << 12) | (x & 0xFFF)));`
            ((x & 0xFFF) << 12) = 0x678000
            (x & 0xFFF)         = 0x678
            0x12000000 | 0x678000 | 0x678 = 0x12678678 --> correct

    a.  `x = ((x ^ (x >> 12)) | ((x << 12) & 0xFFF000));`
            (x ^ (x >> 12)) changes bits outside the middle 12 --> wrong

    *a. `x = ((x << 12) & 0xFFF000) + (x & 0xFF000FFF);`
            (x << 12)            = 0x??678000
            (x << 12) & 0xFFF000 = 0x00678000
            (x & 0xFF000FFF)     = 0x12000678
            0x678000 + 0x12000678 = 0x12678678 --> correct
            (the two masks don't overlap, so + acts like |)