- exam format
    - a combination of quiz-like questions and questions with longer answers (longer answers ~ a sentence or two)
    - exam time: anticipating 12p 10 Dec -- 12p 11 Dec
- topics covered
    - anything we covered in lecture, lab, or homeworks
- pipeline processor stages
    - from our textbook:
        fetch:     read from the instruction cache; predict the address of the next instruction
        decode:    read from the register file (compute destination register numbers)
        execute:   do arithmetic; update and/or read condition codes
        memory:    access the data memory (if necessary)
        writeback: write to the register file
    - textbook's choice --- some real processors use these stages, but it is not the only design
- pipeline timing
    - Quiz 6 Q5: "Consider a three-stage pipelined processor [that] has a cycle time of 500 picoseconds. Suppose this processor runs 5 instructions in the following order: A (completes first), B, C, D, and E (completes last). What is the time between when instruction A starts its first stage and when instruction E starts its third stage?"

        cycle:  1  2  3  4  5  6  7
                v-----------------v
        A       S1 S2 S3
        B          S1 S2 S3
        C             S1 S2 S3
        D                S1 S2 S3
        E                   S1 S2 S3

      A starts S1 at the beginning of cycle 1; E starts S3 at the beginning of cycle 7
      --> 6 cycles elapse --> 6 cycles * 500 picoseconds/cycle = 3000 picoseconds
    - Quiz 7 Q1: "Suppose we have a six-stage pipelined processor with 500 ps cycle time. Running a particular benchmark program which executes 1 billion (10 to the 9th power) instructions in a simulator, we determine that 5% of its instructions require exactly one cycle of stalling and 10% require exactly two cycles of stalling and no instructions would require more than two cycles of stalling." "Assume we attribute each stall to exactly one instruction."
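One way to double-check the Quiz 6 Q5 arithmetic above: in a stall-free pipeline, instruction i (0-indexed) enters stage s (0-indexed) at the start of cycle i + s. A minimal Python sketch of that rule (the function name is ours, not from the course):

```python
def stage_start_ps(instr_index, stage_index, cycle_ps=500):
    # In a stall-free pipeline, instruction i enters stage s at the start of
    # cycle i + s (both 0-indexed), i.e. (i + s) * cycle_ps picoseconds after
    # the first instruction enters its first stage.
    return (instr_index + stage_index) * cycle_ps

# Quiz 6 Q5: E is the 5th instruction (index 4); its third stage is index 2.
elapsed = stage_start_ps(4, 2) - stage_start_ps(0, 0)
print(elapsed)  # 3000, matching 6 cycles * 500 ps/cycle
```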
      no stalling: B finishes one cycle after A

        A   F D E M W
        B     F D E M W

      1 cycle of stalling: B finishes two cycles after A

        A   F D E M W
        B     F D D E M W

      1 billion instructions with no stalling ---> would take about 1 billion cycles (finish one instruction/cycle)
          plus 5 extra cycles for the first instruction to finish
          (the first instruction needs to go from stage 1 to stage 6 before it finishes)
      each time there's a single cycle of stalling --> 1 extra cycle
          1 * 5% * 1 billion extra cycles
      each time there's exactly two cycles of stalling --> 2 extra cycles
          2 * 10% * 1 billion extra cycles
      1.25 billion cycles total * 500 ps/cycle = 625 milliseconds
- how many cycles to resolve data hazards with stalling -- Quiz 7 Q2

        cycle:             0  1  2  3  4  5  6  7  8
        addq %rax, %rbx    F  D  E  M  W
                                       ^-- value of %rbx is written (W)
        reads %rbx                     F  D  E  M  W
                                          ^-- first time we can read %rbx w/o forwarding (D)

        cycle:             0  1  2  3  4  5  6  7  8  9  10 11 12 13 14 15 16
        addq %rax, %rbx    F  D  E  M  W
        subq %rbx, %rdx       F  D  D  D  D  E  M  W
        xorq %rbx, %rcx          F  F  F  F  D  E  M  W
        rrmovq %rdx, %rcx                  F  D  D  D  E  M  W
        addq %rbx, %rcx                       F  F  F  D  D  D  D  E  M  W

        --> 16 (the final addq's W is in cycle 16)
- Quiz 9 access pattern question at end - see 27 October 2020 slides near beginning
    method:
      - note that {array[0], array[1]} are the blocks (we know array[0] is the start of a block from the question, and that blocks are two array elements from the given block size)
      - choose an assignment of blocks to set indexes
          if {array[0], array[1]} are in set K, then {array[2], array[3]} must use set K+1 mod (# sets)
          (in this problem # sets = 2)
          we can't decide to use a different set based on which sets are empty, b/c the address determines which set we use
      - go through the accesses in order, tracking what's in each set
          in the case of a DM cache, we can only store one block per set, so we have to replace that block whenever something new is accessed for a set
    NOTE: we can consider each set separately (and add up the results at the end), which might be useful if we had a more complicated problem b/c e.g.
    nothing that uses set 0 can cause any change in set 1
- exception tables
    - when a processor detects that an exception should occur (e.g. an external event sends a signal or the program "messes up"), it needs to jump to the OS
    - the exception table is a mechanism to tell the processor where to jump to
        - has a separate place for the processor to jump to for each kind of exception
        - idea: the OS provides different code for the processor to jump to for different kinds of events
    - the exception table is an array of pointers to machine code ("places to jump")
        - the processor's design chooses in advance that entry K of the array corresponds to a particular kind of exception
    - the processor needs to have a pointer to the exception table
        - common mechanism: a register that points to the exception table's first element
- virtual pages
    - virtual addresses: the memory the program sees
    - virtual addresses need to be mapped to physical addresses so they can act like memory
    - dividing memory into pages:
        * we divide the virtual addresses into fixed-size chunks called "virtual pages"
        * the first (index 0) chunk starts at virtual address 0
        * if pages have size K, the second (index 1) chunk starts at virtual address K
        * likewise, the third (index 2) chunk starts at virtual address 2K
        --> chunks start at virtual address (page number * K)
        and we choose K to be a power of two ---> K = 1000...000 in base two
        --> so virtual pages always begin at addresses whose lower log_2(K) bits are 0
        --> call these bits the "page offset"
        something in the middle of a virtual page has non-zero lower bits somewhere
        --> if we strip off the page offset bits, we get the virtual page number
- Quiz 13 (VM) Q1: address of a PTE

    20-bit virtual addresses:

        <---------- 20 ---------->
        |    VPN    |     PO     |
        <--- 10 ----><---- 10 --->

    1024-byte pages ---> log_2(1024) = 10 page offset bits

    locate the PTE of VA 0x1000:

        VA 0x1000 = 0000 0001 00 | 00 0000 0000 (binary)
                     ^^^ VPN = 4   ^^^ page offset = 0

    VPN 4 ---> index 4 in the page table

    given: a page table
    base pointer set to physical (byte) address 0x1000, with 4-byte page table entries
    (a single-level page table structure):

        PTE for VPN 0 @ 0x1000
        PTE for VPN 1 @ 0x1004 = 0x1000 + 4 * 1
        PTE for VPN 2 @ 0x1008 = 0x1000 + 4 * 2
        ...
        PTE for VPN 4 @ 0x1000 + 4 * 4 = 0x1010

- Quiz 13 (VM) Q4 (PA result of lookup)
    42-bit virtual addresses, with a 30-bit virtual page number and a 12-bit page offset
    30-bit physical addresses, with an 18-bit physical page number and a 12-bit page offset
    three-level page tables, where 10 bits of the virtual page number are used for a lookup at each level
    Suppose that when looking up page table entries for the virtual address 0x012 3456 789A:
        the page table entry for the first level was valid and contained physical page number 0x99,
        the page table entry for the second level was valid and contained physical page number 0xA4, and
        the page table entry for the third level was valid and contained physical page number 0xC3
    WANT: the corresponding PA of the actual data
    ---> based on the last-level page table entry (end of the multi-level lookup)

        PA: |  PPN  |  PO  |
             ^^^ 0xC3 --- the one we found in the last level of the lookup
                     ^^^ 12 bits, the same as in the VA
                         (last twelve bits of the VA = 0x89A)
        --> PA = 0xC389A

- and Q5 (TLB tag bits)
    when we have a TLB hit:
        we give the TLB a VPN
        it finds a "block" in the TLB which contains just the PTE for that VPN
        and we skip the entire multi-level lookup
    the VPN needs to be split into tag/index/offset
        offset: for a TLB, always empty, because the block only has one entry (2^0 = 1 --> 0 offset bits)
            the TLB contains PTEs, not bytes of data
            each PTE is identified by a particular VPN (like in normal caches, where each byte is identified by a particular address)
        index: we can derive that there are 3 index bits
            8 ways --> 8 blocks / set
            64 blocks (= # entries) / (8 blocks / set) --> 8 sets = 2^3 sets --> 3 index bits
        tag: must be everything left over
            the VPN had 30 bits total - 3 bits for index --> 27 bits left over
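The three Quiz 13 calculations above (PTE address, PA construction, TLB index/tag split) are all small bit manipulations. A hedged Python sketch, where the helper names are ours rather than anything from the course:

```python
def pte_address(base, vpn, entry_size=4):
    # Single-level lookup: the PTE for a VPN lives at base + vpn * entry_size.
    return base + vpn * entry_size

def physical_address(ppn, va, offset_bits=12):
    # PA = last-level PPN concatenated with the VA's page-offset bits.
    return (ppn << offset_bits) | (va & ((1 << offset_bits) - 1))

def tlb_split_bits(entries, ways, vpn_bits):
    # Offset bits are always 0 (one PTE per block); index bits come from the
    # number of sets; the tag is whatever is left of the VPN.
    sets = entries // ways
    index_bits = sets.bit_length() - 1   # log2, since sets is a power of two
    return index_bits, vpn_bits - index_bits

print(hex(pte_address(0x1000, 4)))                  # 0x1010   (Quiz 13 Q1)
print(hex(physical_address(0xC3, 0x0123456789A)))   # 0xc389a  (Quiz 13 Q4)
print(tlb_split_bits(64, 8, 30))                    # (3, 27)  (Quiz 13 Q5)
```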
- TLB accesses
    when we have a TLB hit:
        we give the TLB a VPN
        the TLB splits up the VPN into index + tag
        it reads the entries in the set corresponding to that index and sees if any tags match
        it finds a "block" in the TLB which contains just the PTE for that VPN
        and we skip the entire multi-level lookup
    when we have a TLB miss:
        we give the TLB a VPN
        it finds nothing ("miss")
        we do the (potentially multi-level) lookup normally
        we tell the TLB about the result, and it stores it for next time,
        possibly evicting a PTE with the same index it has cached

- Quiz 2 Q3 -- ZF after assembly snippet

    long example(long a, long b) {
        while (a > b) {
            a = a - b;
        }
        return a;
    }

    example:
        cmpq %rsi, %rdi     *** <--- must have set the flags:
                                     computes rdi - rsi (that is, a - b, where %rdi
                                     also holds the eventual return value)
                                     --> ZF is 1 iff rdi == rsi
        jle L_done          ***
        subq %rsi, %rdi
        jmp example
    L_done:
        movq %rdi, %rax     ***
        ret                 *** <--- ZF is 1 iff a == b when we're done,
                                     same as: "the return value == b when we're done"

- Quiz 2 Q5 -- what the linker does

    .global array
    array:
        .byte 1
        .byte 2
        .byte 3
        .byte 4
    .text
    .global bar
    bar:
        cmpq $0, %rdi
        je end_bar
        movq $array, %rdi
        call print_array
    end_bar:
        movq $array, %rax
        ret

    When using the resulting object file to produce an executable (using the static (non-dynamic) linking scheme we discussed in lecture), the linker will _______. Select all that apply.
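The Quiz 2 Q3 conclusion above (ZF at `ret` is 1 exactly when the return value equals b) can be sanity-checked with a small Python model of the loop; the flag tracking here is our own sketch, not course code:

```python
def example_with_flags(a, b):
    # Mirrors the assembly: cmpq %rsi, %rdi computes a - b and sets ZF iff
    # a == b; jle exits the loop; subq updates a. The last flag-setting
    # instruction before ret is always that final cmpq.
    while a > b:
        a = a - b
    zf = (a == b)   # flags left by the final cmpq before falling through to ret
    return a, zf

for a, b in [(10, 3), (9, 3), (2, 5), (7, 7)]:
    ret, zf = example_with_flags(a, b)
    assert zf == (ret == b)   # ZF set exactly when the return value == b
```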
    Yes: "write the four bytes that are stored after the label array to the executable file"
        > we need the value of the array to end up in the executable, since it's the initial value of the global variable array
        > this is going to happen because the executable contains that value, and the linker needs to write the entire executable
          (if the linker doesn't put something in the executable file, it won't be there)
        > BUT: the linker is writing these bytes because it copies them from the object file;
          the linker does not know/need to know that they represent an array
    Yes: "find a symbol table entry for print_array in some other object file"
        > the linker needs to write the "call print_array" instruction in the executable
          the call instruction needs to refer to the *address* (not the name) of print_array
          since print_array isn't defined in the shown assembly file, it won't end up in *that* object file --> there must be another one
    No: "write the memory address (in some format) chosen for the 'call print_array' instruction somewhere in the executable"
        > this is the memory address OF the call instruction, NOT the memory address it contains
        > nothing needs to point to the call instruction itself
        > BUT something is going to point to the function (bar) that instruction is in (just not to the particular instruction in the middle)
    Yes: "write the memory address for array (in some format) in the resulting executable somewhere"
        > we need this for the movq instruction, which loads that address into %rdi

- Quiz 6 Q6: cycle time when changing CPU designs
    "Suppose a processor designer took a three-stage pipelined processor design with a cycle time of 800 ps and turned it into a six-stage processor pipelined design that implemented the same ISA. Both the three-stage and six-stage processors were designed to achieve about the best cycle times possible (despite the constraints that lead to diminishing returns as one tries to add pipeline stages to a processor).
    Which of the following would be the most likely cycle time for the six-stage processor design?"
        > start with the 3-stage, 800 ps design
        > we make it a 6-stage processor
            - we can't do this perfectly; we run into constraints that lead to diminishing returns
            - if we could do this perfectly, then we'd end up with a 6-stage, 400 ps design
        > but we can't do that perfectly
            - we could easily achieve a 6-stage, 800 ps processor by adding three stages at the end that do nothing
        > but if we can do anything at all that takes advantage of pipelining when going from 3 to 6 stages, we'll have a better cycle time than that
        best option: 500 ps
            --- this is not perfect (perfect would be very unlikely)
            --- but better than the silly design of adding stages that do nothing (why would we add stages if that were all we could do?)

- Quiz 10 Q3

    for (int i = 0; i < N; ++i) {
        for (int j = i; j < N; ++j) {
            C[j] += A[i*N+j] * B[j];
        }
    }

    estimate the number of misses for B:
        each iteration of the j loop accesses the element one after the previous element of B we accessed
        --> cache miss 1/K of the time, where K is the number of elements that fit in a block
            (1/K of the time, B[j-1] and B[j] are in different blocks)
        approx N * N/2 iterations of the j loop
            (N/2 b/c the j loop runs N-i iterations, which is N/2 on average)
    result: N*N/2/K (with the particular values of N, K from the question)
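The Quiz 10 Q3 estimate can be written out as a tiny calculation. The values of N and K below are illustrative stand-ins (the quiz's actual values aren't in these notes), where K is the number of B elements per cache block:

```python
def estimated_b_misses(N, K):
    # Total j-loop iterations: sum over i of (N - i), which is about N*N/2.
    iterations = sum(N - i for i in range(N))
    # Sequential accesses to B miss roughly once every K elements.
    return iterations / K

# e.g. N = 1000 elements, K = 8 elements/block (illustrative, not quiz values):
print(estimated_b_misses(1000, 8))   # close to the N*N/2/K estimate of 62500
```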