- logistics: memorizing opcodes, etc.
    - we do not expect you to remember opcode numbers, function code numbers, etc.
         (we supply a table if you need them)
      we expect you to know what types of operands Y86 takes
        if register or displacement + register to make address etc
        but not whether it's rA or rB
    - K/M/G/T we might expect you to know

- contents of object files, etc.
    - C/C++/etc. source code ---> assembly ---> object file   ----> executable
                           complation    assembling         linking
    - after compilation into assembly:      .string "Hello, World"      .byte 'H', 'e', ...
        have names for instructions
        have labels to indicate functions to call, instruction to jump to,
            global variables to load, etc. 
    - after assembling into  object file:       
        have bytes (machine code + constants + intiial values of globals)
        have relocation table entries to represent "holes" to fill in
            in bytes
                e.g. "there is a call to printf to fill in with the locatoin of printf"
        have symbol table entries to represent labels that might be used elsewhere
            e.g. "byte 1000 of the file is where the main function starts"
    - after linking to executable:  [assuming static linking]
        (typically from multiple object files)
        have bytes to load into memory directly

- C pointer arithmetic and arrays
        int *p;
        ^^^--------- how compiler choses what += does

        p[X]  <-----> *(p + X)
        p = &a[0];
        p += 2;
        *p  ---> *(&a[0] + 2) --> a[2]
        p[1] --> *(p + 1) --> *((&a[0] + 2) + 1) --> *(&a[0] + 3) --> a[3]
    
- bitwise operators --- bit puzzle strategy
        AND:
            keep only certain bits
            clear certain bits
        OR:
            set certain bits
        XOR
            flip certain bits

        "certain bits" --> specify with a "mask" = number with those bits set to 0/1


    - mask+shift idea
        [bit 0 = least significant bit]
        extract bits 5-9 (inclusive) from x, make them bits 0-5 (iclusive)

            (x & [value with bits 5-9 set to 1]) >> 5
                11 1110 0000
            (x & 0x3E0) >> 5
                0b11_1110_0000 or 0b1111100000 --- not in C

    - divide and conquer
        for each piece of the number we care about, figure out how transform
                shift to the locatoin we want it
                turn it into a bias e.g. 0 or 1  /  0 or -1

- LEA versus MOV
    - lea __memory locatoin__, %rax   like mov __memory locatoin__, %rax,  but
        skip memory access of move (shortcut address value into register file write input)

    - lea (%rax), %rbx <---> mov %rax, %rbx
            mov (%rax), %rbx would read from memory at address R[# of RAX]
    - lea 1(%rax), %rbx <---> rbx becomes (rax + 1)
    - lea (%rax, %rbx), %rcx <--> rcx becomes (rax + rbx)
        b/c mov (%rax, %rbx), %rcx would access memory at address rax + rbx
    - lea (%rax, %rbx, 2), %rcx <---> rcx becomes (rax + rbx * 2)


    - lea %rax, %rbx -- SYNTAX ERROR (I hope)

- RISC v CISC
    - instruction set design choices can make it 
        easier/harder to make hardware
        easier/harder to write assembly

    - RISC generally: tries to make hardware simpler to design
        assuming that making assembly harder to write/require more instructions
        is okay

        "let the compiler do it"

    - RISC:  simpler hardware
        - simpler instructions
            - generally no instructions that write two values
        - fixed-length instructions (simpler to decode in hardware)
        - less instructions
        - separate instructions from accessing memory
            - more registers to hold values to make up for this
        - one way of specifying opreands

    - CISC: more complicated instructions,
        often designed for making particular things convenient in assembly

    - fuzzy line ---- very few ISAs are definitely RISC or definitely CISC

- endianness of Y86
    - it's little endian whenever it comes up
- decoding Y86
    - you need the table of opcodes, etc.
    - table has bytes of instructions labelled
        look in first byte for matching opcode
        table tells you which byte # of each instruction contains important values
        match the blanks to values in bytes of instruction
        
    0x123 = 0x0000 0123 = 0x0000 0000 0000 0123
                                             ^^
                                           ^^
                                        ^^
        little endian of that, least significant byte at lowest address
        address 0: 0x23
        address 1: 0x01
        address 2: 0x00
        ...

- Q5 on Fall 2018 --- assembly code and value in regs
0x000: 30 f4 30 00 00 00 00 00 00 00 | irmovq $0x30, %rsp   <--- RSP becomes 0x30
0x00a: 30[f0 25 00 00 00 00 00 00]00 | irmovq $0x25, %rax   <--- RAX becomes 0x25
          ^^
          least sig part (lowest address)
                                        0x00000025f0
0x014: 61 04                         | subq %rax, %rsp      <--- RSP = RSP - RAX
                                                                     = 0x30 - 0x25 = 0xB (11)
0x016: b0 0f                         | popq %rax            <----RAX becomes M[0xB]
                                                                 RSP becomes RSP + 8
                                                                           = 0x13 (19)
0x018: 00                            | halt


- Q5 on Fall 2016

- HCL regiser Q on quiz

register xY {
    foo : 64 = 1;
    bar : 64 = 1;
}
x_foo = Y_foo + Y_bar;
x_bar = x_foo + 1;

                        x_bar = x_foo + 1;
                        x_foo = Y_foo + Y_bar;

cycle 0: Y_foo = 1, Y_bar = 1
         x_foo = Y_foo + Y_bar --> 1 + 1 --> 2
         x_bar = x_foo + 1 --> (1 + 1) + 1 --> 3
--- rising edge of the clock ---- copy x_foo to Y_foo, etc.
cycle 1: Y_foo = 2, Y_bar = 3
         x_foo = Y_foo + Y_bar --> 2 + 3 --> 5
         x_bar = x_foo + 1 --> 5 + 1 --> 6
--- rising edge of the clock ---- copy x_foo to Y_foo, etc.
cycle 2: Y_foo = 5

- stages and order things happen
    - "fake"/conceptual stages for the single-cycle CPU
        things happened in an order because of
            dependencies --- can't read RBX for add RBX, RBX until we read
                the instruction to find out that it's RBX and not RAX
            updates triggered by the rising edge of the clock

        stages were just an organizational tool that sometimes matched up to the
            dependency order
    - pipelined CPU --- stages were the actual order things happened
        because we saved results of prior stage in pipeline registers before
            starting the next stage

            "fetch" -- read instr memory + compute addr of next instruction
                     + compute next PC (??? with special cases for jmps/ret?)
            "decode" -- read register file
            "execute" -- use the ALU
            "memory" -- use the data memory
            "writeback" -- write to the register file

- push/pop effects on registers and ordering of operations
- ALU for call
    - push/pop/call/ret all need to compute a new stack pointer
        will use the ALU to do this in our textbook's design
    - push: compute a new stack pointer and write to the new location
    - pop: compute a new stack pointer and read from the OLD location
        - ALU's output will be written to RSP
        - data memory's input will come from the old RSP (not the ALU)
            (from the register file more directly)

- timing with pipeline
    cycle length + number of cycles

    cycle length = length of longest path from register to register
                                    or register to memory or register to register file
                pipelined CPU --- "slowest stage" (including pipeline register time)
                                    e.g. 200 ps

    fetch   decode execute   writeback
    A                                   cycle 0
    B       A                           cycle 1
    C       B       A                   cycle 2
                                        -------
            C       B        A          cycle 3
                    C        B          cycle 4
                             C          cycle 5

    once things are started, one instruction finishes per cycle
        (alt: one instruction starts per cycle)
    plus extra 3 cycles to for first instructoin to finish
        (or extra three cycles for last instruction to finish)

    throughput = one per cycle (time for many)  = 1 instruction/200 ps
    latency = 4 cycles for each instruction == 800 ps
    total time for three instructions (from start to finish) 3  + 3 -- 6 * 200  =  1200 ps


- logical v arithmetic shift

    in C compilers we care about:
        x >> 4 is a logical shift if x is unsigned, arithmetic shift if x is signed
        arithmetic shift copies the sign bit to the shifted in bits
            divide by power of 2, round toward negative infinity for signed
        logical shift always shifts in 0s
            divide by power of 2 for unsigned
            get weird non-mathematical results if thinking of it as signed + negative