- dynamic linking
    - static linking: executable file contains all the stuff we load into memory
        con: many copies of common library routines like printf
    - dynamic linking: instead, (potentially) common library routines are loaded
      from another file when the application starts
        executable file doesn't contain a copy of printf, just a note about where
        to get a copy of printf (e.g. "read libc.so")
    - often extra indirection to limit changes to machine code at runtime:
      calling printf through a function pointer
        if we have 1000 calls to printf, we don't want to edit 1000 places in the
        machine code at load time
        alternative: those calls use a table of function pointers;
        at load time, set the entry in the table
        (a small C sketch of this idea appears after the bitwise sections below)
- object files and relocations/tables
    - bytes (machine code or data) to be copied into the executable
    - information to fix up "holes" in those bytes
        relocations: holes in the current file that need information from elsewhere
            e.g. "call to printf, which I don't know where it is"
        symbol table: names of things in this file that might be referenced by relocations
            e.g. "byte 1000 of this file is where the function main starts"
- bitwise operations
    - shifts
        << -- left shift
            arithmetic: multiply by a power of 2
            binary: add zeros to the end; if bits go off the front (most significant side), ignore them
        >> -- right shift; two types:
            arithmetic right shift: typically >> with signed numbers (incl. x86/us)
                signed divide by a power of 2, rounding towards -infinity
                    -5 >> 1 --> -3
                     5 >> 1 -->  2
                add copies of the sign bit at the beginning (most significant side);
                let bits go off the end (least significant side) + ignore them
            logical right shift: >> with unsigned numbers
                unsigned divide by a power of 2, rounding towards -infinity OR 0
                (same thing, since the value is non-negative)
                add 0s at the beginning (most sig); let bits go off the end (least sig)
    - bitwise logical operations
        bit-by-bit (on pairs of bits), do a single logic gate
        AND --- 1 only if both inputs are 1
            "keep only certain bits" --- AND with a mask where kept bits are 1
            "clear certain bits" --- AND with a mask where cleared bits are 0
            example: extract bits 5-3 (inclusive) of binary 00010111
                                     bit positions:        76543210
                    00010111
                AND 00111000   <---- mask: number with bits 5-3 (inclusive) set to 1 (hex 0x38)
                ------------
                    00010000
              then MOVE bits 5-3 into bits 0-2:
                    00010000 >> 3 --> 00000010
              overall: (x & 0x38) >> 3
        OR --- 1 if any input is 1
            "set certain bits"
        XOR --- 1 only if exactly one input was 1
            "flip certain bits"
    - pattern for bitwise problems: split into subproblems
        isolate parts of the number with AND + shifting
        (see the C sketch below)
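- two illustrative C sketches for the sections above
    - a minimal sketch of the "table of function pointers" idea behind dynamic linking;
      the names fake_loader and printf_entry are made up for illustration, and this is a
      simplified stand-in for the real loader machinery, not how it is actually implemented:

        #include <stdio.h>

        /* one slot in the hypothetical table; the dynamic loader would fill it in
           at load time with printf's address found in libc.so */
        static int (*printf_entry)(const char *, ...);

        static void fake_loader(void) {
            printf_entry = printf;   /* pretend lookup of printf in the shared library */
        }

        int main(void) {
            fake_loader();
            /* every call site goes through the table slot, so the loader patches one
               pointer instead of editing 1000 call instructions */
            printf_entry("hello via the function-pointer table\n");
            return 0;
        }

    - a small check of the shift and mask rules above (assuming the usual x86-style
      arithmetic >> for signed values; C leaves that choice implementation-defined):

        #include <stdio.h>

        int main(void) {
            int s = -5;
            unsigned u = 5;
            printf("%d\n", s >> 1);   /* arithmetic shift: -3 (rounds toward -infinity) */
            printf("%u\n", u >> 1);   /* logical shift: 2 (zeros come in at the top)    */

            unsigned x = 0x17;                 /* binary 00010111 */
            unsigned field = (x & 0x38) >> 3;  /* keep bits 5-3, then move them to bits 0-2 */
            printf("%u\n", field);             /* prints 2 (binary 010) */
            return 0;
        }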
- ISA versus microarchitecture
    - instruction set architecture (specification, like in our textbook)
        all you need to know about to write a compiler + assembler that produce correct, working programs:
            what instructions are there
            * what do the instructions do (changes to registers, memory, etc.)
            * what does their machine code look like
            what types of operands do the instructions take
            how are those operands represented in machine code
        notably NOT included: how fast is anything?
        many possible microarchitectures can implement a particular ISA
    - microarchitecture (specified by something like HCL code)
        particular hardware implementation
        decides how fast instructions are
        e.g. how is your register file designed?
        example: we've seen single-cycle + pipelined
            single-cycle: one cycle per instruction
            pipelined: multiple cycles per instruction;
                most of the time one instruction completes per cycle, but not always
- HCL register question on latest quiz
        register xY {
            foo : 64 = 1;
            bar : 64 = 1;
        }
        x_foo = Y_foo + Y_bar;
        x_bar = x_foo + 1;

    cycle 0: Y_foo = Y_bar = 1
        x_foo = 1 + 1 --> 2
        x_bar = x_foo + 1 --> (1 + 1) + 1 --> 3
    --- rising edge of the clock --- copies register inputs x_foo, x_bar to outputs Y_foo, Y_bar
    cycle 1: Y_foo = (old x_foo) 1 + 1 --> 2
             Y_bar = (old x_bar) 3
        x_foo = 2 + 3 --> 5
        x_bar = x_foo + 1 --> 5 + 1 --> 6
    --- rising edge of the clock ---
    cycle 2: Y_foo = (old x_foo) 5
             Y_bar = (old x_bar) 6
- rA/rB bit indexing in HCLRS
    since we read from memory in little endian:
        least significant byte is at (address we read from) + 0   <-- opcode/fn/cc byte
        2nd least significant byte is at (address we read from) + 1   <-- rA/rB byte
        3rd least significant byte is at (address we read from) + 2
        ...
    in HCLRS, we number bits starting at the least significant bit
        bits 0-7 (inclusive) of a number are the least significant byte
        bits 8-15 (inclusive) are the 2nd least significant byte
        ...
    0x[rA][rB] --> rA is the most significant 4 bits (bits 4 to 7 inclusive)
                   rB is the least significant 4 bits (bits 0 to 3 inclusive)
    i10bytes --- number from memory
        byte_containing_rA_and_rB = i10bytes[8..16];
        rA = byte_containing_rA_and_rB[4..8];
            --- same as ---
        rA = i10bytes[12..16];
        (see the C sketch after this section)
    the machine code table may be written in a weird order for bit numbering
    (op/fn byte first, so the bit numbers are not in decreasing order):
                     1111 11
         7654 3210   5432 1098
          op   fn     rA   rB
    could have been written differently, with bits in decreasing order
    (but trickier for hand encoding/decoding?):
         1111 11
         5432 1098   7654 3210
          rA   rB     op   fn
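- a small C sketch of the rA/rB extraction above, using shifts and masks in place of
  HCLRS bit slices; the i10bytes value 0x2360 is a made-up example (opcode byte 0x60,
  rA/rB byte 0x23), and only the low 2 of the 10 instruction bytes are shown:

    #include <stdint.h>
    #include <stdio.h>

    int main(void) {
        uint64_t i10bytes = 0x2360;             /* byte 0 = 0x60 (op/fn), byte 1 = 0x23 (rA/rB) */

        uint64_t rA_rB_byte = (i10bytes >> 8) & 0xFF;   /* like i10bytes[8..16] in HCLRS */
        unsigned rA = (rA_rB_byte >> 4) & 0xF;          /* like [4..8] of that byte      */
        unsigned rB = rA_rB_byte & 0xF;                 /* like [0..4] of that byte      */
        /* equivalently: rA = (i10bytes >> 12) & 0xF, matching rA = i10bytes[12..16]; */

        printf("rA = %u, rB = %u\n", rA, rB);   /* prints rA = 2, rB = 3 */
        return 0;
    }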
- instruction stages
    - "fake"/conceptual stages with our single-cycle processor
        how our textbook divided up the components of the processor
    - real stages with our pipelined processor
        stages are separated by pipeline registers, and instructions are passed from
        one stage to the next through these (so an instruction doesn't advance to the
        next stage until a clock signal rises)
    - Fetch
        read instruction memory / split up the instruction into pieces
        compute length of instruction + address of the following instruction
        [for pipelined:] also update the PC so the next fetch can happen right away
            (but there are cases where we won't be able to do this;
             we'll talk about this after the exam)
    - Decode
        read register file
    - Execute
        use the ALU to compute anything relevant
            add/sub/xor --- the new value
            memory instructions --- the memory address we need
    - Memory
        use the data memory
    - Writeback
        set up the register file write
    - [for single-cycle only] PC Update
        change the PC
- timing: single-cycle / pipelined CPUs
    - critical path: longest path from state to state
        (pipeline register to pipeline register, register to memory write / register file write)
        determines the cycle time
        all paths happen at once --- need enough time for ALL to finish
    - single-cycle: longest path probably goes from the PC to some memory/register file
    - pipelined: longest path probably goes from a pipeline register to something,
      or from something to a pipeline register
        longest path == worst stage
        each instruction needs to go through every stage
             F  D  E  W
            -------------
             A              cycle 0
             B  A           cycle 1
             C  B  A        cycle 2
                C  B  A     cycle 3
            -------------
                   C  B     cycle 4
                      C     cycle 5
        amount of time per cycle -- critical path (slowest stage)
- pipelining -- latency and throughput
    latency = time from start to finish
        e.g. with a 200 ps cycle, A has a latency of 4 * 200 = 800 ps
        (200 ps cycle --- determined by slowest stage)
    throughput = rate of finishing instructions = rate of starting instructions
        (1 / the time between finishing instructions)
        e.g. above: 1 instruction/cycle and 200 ps/cycle ---> 1 instruction / 200 ps
- Q3 on last quiz: doubling stages --> effect on latency, throughput
    probably we finish about 1 instruction/cycle before,
    and we finish about 1 instruction/cycle after
    if the slowest stage takes HALF the time, then we double the throughput
        but to do that, we need to divide up the existing slowest stage exactly in half
        AND not have to worry about an extra pipeline register --- probably can't do this
        we probably can't divide the stage *exactly* in half,
        and we definitely need to add a pipeline register
        stage that took X ps now takes X/2 + pipeline-register-related extra
            original stage: [pipeline reg][          computation          ]
            new stages:     [pipeline reg][ compute part 1 ][pipeline reg][ compute part 2 ]
        (worked numbers in the C sketch at the end of these notes)
- single-cycle register updates
    during the cycle, we prepare all of the values we change on the inputs to
        the PC register
        the register file
        the data memory
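- to make the timing discussion above concrete, a small C sketch with made-up numbers
  (4 stages and a 200 ps cycle as in the example above, plus an assumed 20 ps
  pipeline-register delay for the doubled-stage case):

    #include <stdio.h>

    int main(void) {
        double cycle = 200.0;   /* ps, set by the slowest stage */
        int stages = 4;

        printf("latency    = %.0f ps\n", stages * cycle);        /* 4 * 200 = 800 ps */
        printf("throughput = 1 instruction / %.0f ps\n", cycle);

        /* doubling the stages: even if the slowest stage's work splits exactly in
           half (100 ps), each new stage also pays the assumed 20 ps register delay,
           so throughput improves by less than 2x and latency gets worse */
        double reg_delay = 20.0;
        double new_cycle = cycle / 2.0 + reg_delay;               /* 120 ps           */
        printf("doubled: cycle = %.0f ps, latency = %.0f ps\n",
               new_cycle, 2 * stages * new_cycle);                /* 120 ps, 960 ps   */
        return 0;
    }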