- powers of two K/M/G/etc.
    ~ we didn't talk about this yet
    ~ K = 2^10
      M = 2^20
      G = 2^30
      T = 2^40

      2^14 = 2^4 * 2^10 = 2^4 * K --> 16K

      2G = 2^1 * 2^30 = 2^31
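
    The conversions above can be sketched in C with shifts. This is only an illustration: the names pow2, pow2_in_k and the *_SHIFT constants are made up for these notes.

```c
#include <assert.h>
#include <stdint.h>

/* Exponents for the binary prefixes: K = 2^10, M = 2^20, G = 2^30, T = 2^40. */
enum { K_SHIFT = 10, M_SHIFT = 20, G_SHIFT = 30, T_SHIFT = 40 };

/* Returns 2^exponent; only meaningful for exponents below 64. */
uint64_t pow2(unsigned exponent) {
    assert(exponent < 64);
    return (uint64_t)1 << exponent;
}

/* Expresses 2^exponent as a count of K, e.g. 2^14 = 2^4 * 2^10 gives 16 (K). */
uint64_t pow2_in_k(unsigned exponent) {
    assert(exponent >= K_SHIFT);
    return pow2(exponent - K_SHIFT);
}
```

    For example, pow2_in_k(14) gives 16 (so 2^14 = 16K), and 2 * pow2(G_SHIFT) equals pow2(31) (so 2G = 2^31).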

- pipeline question on quiz --- timing
    not pipelined took 10 ns
    pipelined: 2 ns per stage x 5 stages

    cycle#        stage
    (time)    1   2   3   4   5
    ------
      1      #1  --  --  --  --
      2      #2  #1  --  --  --
      3      #3  #2  #1  --  --
      4      #4  #3  #2  #1  --
      5      #5  #4  #3  #2  #1
      6      --  #5  #4  #3  #2
      7      --  --  #5  #4  #3
      8      --  --  --  #5  #4
      9      --  --  --  --  #5

    9 cycles * 2 ns per cycle = 18 ns for all 5 instructions

    two ways to count the 9 cycles:
        5 cycles to start everything,
        + 4 cycles for the last to finish
    or
        5 cycles for the first to finish,
        + 1 cycle for each of the other 4 to finish
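
    Both ways of counting give cycles = stages + (instructions - 1). A small C sketch of that formula, assuming no stalls and one instruction started per cycle (the function names are hypothetical):

```c
#include <assert.h>

/* Cycles to run `instructions` through a `stages`-deep pipeline:
   `stages` cycles for the first instruction to finish, then one more
   cycle for each remaining instruction. Assumes no stalls. */
unsigned long pipelined_cycles(unsigned long instructions, unsigned long stages) {
    assert(instructions > 0 && stages > 0);
    return stages + (instructions - 1);
}

/* Total time in ns, given the time per cycle (= time per stage). */
unsigned long pipelined_ns(unsigned long instructions, unsigned long stages,
                           unsigned long ns_per_cycle) {
    return pipelined_cycles(instructions, stages) * ns_per_cycle;
}
```

    With the quiz numbers, pipelined_ns(5, 5, 2) is 9 cycles * 2 ns = 18 ns.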

- S2018 Q8 ~ LEA
    C declaration: long **x;   x is a pointer to pointer to long
    x is stored in %r8

    x += **x;

        suppose x=%r8 contains 0x1000 (pointer to pointer to long)
        *x is the value in memory at address 0x1000
        suppose memory at 0x1000 contains 0x2000 (pointer to long)
        **x is the value in memory at address 0x2000
        suppose memory at 0x2000 contains 42 (long, not a pointer)
                                AKA 0x2a

    x += 42;  "advance x by 42 of whatever it points to"
        
        x becomes 0x1000 + 0x2a * sizeof(what x points to) = 0x1000 + 0x2a * 8 
            [8 is the sizeof(long*)]

        dereference 0x2000 (find **x)
        then use lea to add it to x

        leaq offset(base, index, scale), output ---> output = offset + base + index * scale
        movq (%r8), %rax     // rax <- *x
        movq (%rax), %rax    // rax <- **x
        leaq (%r8, %rax, 8), %r8  // r8 <- r8 + rax * 8

        movq ((%r8)), %rax --> assembler error (no addressing mode dereferences twice, so we need the two movq's above)
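
    The address arithmetic can be checked with plain integers. This C sketch models what leaq computes; it does not touch real memory, and the scale of 8 assumes 8-byte pointers (x86-64). The names lea and advance_x are made up for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Models leaq offset(base, index, scale), output:
   output = offset + base + index * scale.
   The hardware only allows a scale of 1, 2, 4, or 8. */
uint64_t lea(uint64_t offset, uint64_t base, uint64_t index, uint64_t scale) {
    assert(scale == 1 || scale == 2 || scale == 4 || scale == 8);
    return offset + base + index * scale;
}

/* Models x += **x for long **x: x advances by **x elements, and each
   element is a long* (assumed to be 8 bytes here). */
uint64_t advance_x(uint64_t x, uint64_t star_star_x) {
    return lea(0, x, star_star_x, 8);
}
```

    With x = 0x1000 and **x = 42, advance_x(0x1000, 42) gives 0x1000 + 0x2a * 8 = 0x1150.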

- conditional jumps and CCs
    ZF and SF (Y86)      / OF and CF (extra ones to handle overflow)

    jle  jump if last result was <= 0
        if no overflow:
            ZF = 1 --> result was 0
         or SF = 1 --> result was negative
    jg   jump if last result was > 0
            ZF = 0 --> result was not 0
        and SF = 0 --> result was positive

        if overflow OF
            then SF is wrong (if worried about signed overflow)
                SF ^ OF --> SF "corrected for signed overflow"

        if overflow CF
            then SF is wrong (if worried about unsigned overflow)

    last result: set by almost any arithmetic
        includes ALL OPq instructions in Y86 (addq, andq, subq, xorq)
        not reset because
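
    A C sketch of the rules above for a 64-bit subtraction a - b, including the "SF corrected for signed overflow" case. This is a model only: struct flags and the function names are hypothetical, and CF is left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

struct flags { bool zf, sf, of; };

/* Condition codes after computing a - b in 64 bits. */
struct flags flags_after_sub(int64_t a, int64_t b) {
    uint64_t ua = (uint64_t)a, ub = (uint64_t)b;
    uint64_t r = ua - ub;                 /* wraps modulo 2^64, no undefined behavior */
    struct flags f;
    f.zf = (r == 0);                      /* result was 0 */
    f.sf = (r >> 63) != 0;                /* sign bit of the (possibly wrapped) result */
    /* Signed overflow: a and b have different signs and the result's sign differs from a's. */
    f.of = (((ua ^ ub) & (ua ^ r)) >> 63) != 0;
    assert(!(f.zf && f.sf));              /* a zero result is never negative */
    return f;
}

/* jle: (SF ^ OF) says "really negative"; ZF says "zero". */
bool takes_jle(struct flags f) { return (f.sf != f.of) || f.zf; }

/* jg is the opposite: not negative (after correction) and not zero. */
bool takes_jg(struct flags f) { return (f.sf == f.of) && !f.zf; }
```

    For a = INT64_MIN and b = 1 the subtraction overflows: SF is 0 even though the true result is negative, but OF is 1, so SF ^ OF is 1 and jle is still taken.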

- Y86 mrmovq encoding question (from quiz)
    - read the encoding table
    - we won't require you to memorize rA/rB ordering exceptions (we'll give the relevant part of the table)
    - constants are always little endian
        lowest address byte has the least significant bits (including the 1's place)
        textbook's table has the lowest address in the column labelled "0"
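
    A short C sketch of little-endian encoding for an 8-byte constant (encode_le64 is a name made up for these notes). out[0] plays the role of the lowest address:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Stores value into out[0..7] in little-endian order:
   out[0] (lowest address) gets the least significant byte. */
void encode_le64(uint64_t value, uint8_t out[8]) {
    assert(out != NULL);
    for (size_t i = 0; i < 8; i++) {
        out[i] = (uint8_t)(value >> (8 * i));
    }
}
```

    Encoding 0x1234 this way gives the bytes 34 12 00 00 00 00 00 00, from lowest address to highest.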

- register file inputs and outputs
    - 2 "read ports" --- read a register during a cycle
        each read port (A, B) has:
            4-bit source register # input (which register to read)
            64-bit register value output (value read from that register)

    - 2 "write ports" --- write a register at the rising edge of the clock
        each write port (E, M) has:
            4-bit destination register # input (which register to write)
                15 means "no register" = 0xF = REG_NONE in HCLRS
            64-bit register value input (value to write to that register)
        

- timing for single-cycle processor
    - reads and computation happen between rising edges of the clock
    - writes happen at the rising edge of the clock ("end of the clock cycle")
    - every instruction takes one clock cycle
        - everything written for the instruction is written all at once
- stages in general
    - in single-cycle: the stages don't tell you when things actually happen
    - organizational division --- make it easier to think about processor design?
- stages for PUSH/POP
    - fetch [read instruction, split instruction into pieces, compute address of next instruction]
        PUSH/POP: read instruction, find icode
        extract rA
        compute PC + 2 (because PUSH/POP are both two bytes: icode/ifun byte + rA/rB byte)
    - decode [read registers]
        PUSH: 
            read rA
            read RSP (so we know where to write rA. Also we need to update RSP based on its old value)
        POP:
            read RSP (so we know where to read new value of rA. Also to update RSP)
    - execute [use ALU: address AND "normal" arithmetic]
        PUSH:
            compute new RSP = old RSP - 8
        POP:
            compute new RSP = old RSP + 8

                "%rsp points to the most recently pushed value, not to the next unused stack address"
    - memory [use data memory]
        PUSH:
            write rA's value into memory at address = new RSP = old RSP - 8
                                = ALU output
        POP:
            read from memory at address = old RSP (= register file output, NOT the ALU output)
    - writeback [send values to register file to update]
        PUSH:
            RSP <- new RSP = ALU output
        POP:
            rA <- value read from data memory = data memory output
            RSP <- new RSP = ALU output
    - PC update
        PC register input <- PC + 2 (computed in the fetch stage)
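
    The stage walk-through above can be sketched as a tiny C model. Assumptions, none of which come from these notes: struct machine, the 256-byte memory, and the do_pushq/do_popq names are hypothetical; valA/valB/valE/valM/valP follow the textbook's usual signal names; %rsp is register number 4.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum { RSP = 4, NUM_REGS = 15, MEM_SIZE = 256 };

struct machine {
    uint64_t reg[NUM_REGS];
    uint8_t mem[MEM_SIZE];
    uint64_t pc;
};

/* Little-endian 8-byte data memory accesses. */
static uint64_t mem_read64(const struct machine *m, uint64_t addr) {
    assert(addr <= MEM_SIZE - 8);
    uint64_t v = 0;
    for (size_t i = 0; i < 8; i++) v |= (uint64_t)m->mem[addr + i] << (8 * i);
    return v;
}

static void mem_write64(struct machine *m, uint64_t addr, uint64_t v) {
    assert(addr <= MEM_SIZE - 8);
    for (size_t i = 0; i < 8; i++) m->mem[addr + i] = (uint8_t)(v >> (8 * i));
}

void do_pushq(struct machine *m, unsigned rA) {
    uint64_t valP = m->pc + 2;          /* fetch: 2-byte instruction */
    uint64_t valA = m->reg[rA];         /* decode: read rA */
    uint64_t valB = m->reg[RSP];        /* decode: read RSP */
    uint64_t valE = valB - 8;           /* execute: new RSP */
    mem_write64(m, valE, valA);         /* memory: address = ALU output */
    m->reg[RSP] = valE;                 /* writeback: RSP <- ALU output */
    m->pc = valP;                       /* PC update */
}

void do_popq(struct machine *m, unsigned rA) {
    uint64_t valP = m->pc + 2;              /* fetch */
    uint64_t valB = m->reg[RSP];            /* decode: read RSP */
    uint64_t valE = valB + 8;               /* execute: new RSP */
    uint64_t valM = mem_read64(m, valB);    /* memory: address = OLD RSP */
    m->reg[RSP] = valE;                     /* writeback: RSP <- ALU output */
    m->reg[rA] = valM;                      /* writeback: rA <- memory output */
    m->pc = valP;                           /* PC update */
}
```

    Pushing a register and then popping into another register should copy the value and leave RSP where it started, with the PC advanced by 4.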
- what do we need to memorize about stages
    - you should know what each stage does (what components/operations are "part" of that)
    - you should be able to figure out a correct way of using the processor components to do something


------------

- format of the exam
    - similar to prior semesters
    - multiple choice or very short answer (one word/number)
    - 20-25Q

- icode versus opcode
    - I've been sloppy about this
    - sometimes we use opcode to mean the first byte which has
        icode (4-bit number indicating which instruction it is, counting jXX, cmovXX and OPq each as one instruction) and
        ifun (4-bit number indicating which OPq or jXX or cmovXX an instruction is)
    - sometimes we use opcode to only mean the icode

- compilation -- when do you know what
    - [C/C++/etc. source code] ---> [assembly] ---> [object file] ----[combined with other .o files]---> executable
                                    ^^^^^^^^^
                                    decided what instructions to use
                                                    ^^^^^^^^^^^^^
                                                    translated instructions to machine code,
                                                    but don't know where anything is in memory
                                                    also translated constants, etc. to bytes
                                                                                                       ^^^^^^^^^^^^
                                                                                        decided on the addresses of
                                                                                        everything, and filled in
                                                                                        any address fields in the machine
                                                                                        code
    - to make filling in addresses work, in the object file we have
        [relocations]: "at this part of the machine code, replace this with address of something"
            call printf ---> "replace bytes 2-10 with address of 'printf'"

        [symbol table entries]: "the name something is at this part of the machine code"
                ...
            printf: 
                pushq ...
            --> "'printf' is at byte 100 of the machine code of this object file"
    each object file has its own symbol table and its own relocation table
    linker combines all the object files' symbol and relocation tables together
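
    A heavily simplified C sketch of what "filling in addresses" means. Real object file formats carry much more (relocation types, sections, relative addresses); the structs and function names below are hypothetical, and every relocation is assumed to be an absolute 8-byte little-endian address.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* "the name something is at this address" (after the linker picks addresses) */
struct symbol { const char *name; uint64_t address; };

/* "at this offset in the machine code, fill in the address of name" */
struct relocation { size_t offset; const char *name; };

/* Returns 1 and stores the address if name is in the symbol table, else 0. */
int find_symbol(const struct symbol *symbols, size_t count,
                const char *name, uint64_t *address_out) {
    for (size_t i = 0; i < count; i++) {
        if (strcmp(symbols[i].name, name) == 0) {
            *address_out = symbols[i].address;
            return 1;
        }
    }
    return 0;
}

/* Patches each relocation site in code; returns how many names were not found. */
size_t apply_relocations(uint8_t *code, size_t code_len,
                         const struct relocation *relocs, size_t reloc_count,
                         const struct symbol *symbols, size_t symbol_count) {
    size_t unresolved = 0;
    for (size_t i = 0; i < reloc_count; i++) {
        uint64_t address;
        assert(relocs[i].offset + 8 <= code_len);
        if (!find_symbol(symbols, symbol_count, relocs[i].name, &address)) {
            unresolved++;               /* a real linker reports "undefined reference" */
            continue;
        }
        for (size_t b = 0; b < 8; b++) {
            code[relocs[i].offset + b] = (uint8_t)(address >> (8 * b));
        }
    }
    return unresolved;
}
```

    A name that is missing from the combined symbol table is counted as unresolved, which corresponds to an "undefined reference" error at link time.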

- how do things like printf get linked in
        gcc -o file.exe main.o
        --
        actually runs the linker with a lot of extra .o and .so files
            sometimes the linker can be told to only include .o or .so files if they're used

- dynamic and static linking
    - dynamic linking --- we do what the linker does, but at runtime instead of when producing the executable
    - main advantage: smaller executables -- load, e.g., the C library from common file at runtime
        (don't have N copies of the library among N executables)

- RISC versus CISC
    - RISC: simpler for the hardware maker (and who cares about the software)
            ~ simpler instructions
            ~ no accessing memory and computing something in the same instruction
            ~ ...

        ~ RISC: typically more registers
            more registers is an easy way of using extra HW to speed things up
    - CISC: do whatever's convenient for the software people
        (even if it makes it hard to make the hardware work)
            ~ adding instructions for common tasks, no matter how complicated the tasks are
                ("string in string" or "memcpy" or "push/pop")