- logistics: memorizing opcodes, etc. - we do not expect you to remember opcode numbers, function code numbers, etc. (we supply a table if you need them) we expect you to know what types of operands Y86 takes if register or displacement + register to make address etc but not whether it's rA or rB - K/M/G/T we might expect you to know - contents of object files, etc. - C/C++/etc. source code ---> assembly ---> object file ----> executable complation assembling linking - after compilation into assembly: .string "Hello, World" .byte 'H', 'e', ... have names for instructions have labels to indicate functions to call, instruction to jump to, global variables to load, etc. - after assembling into object file: have bytes (machine code + constants + intiial values of globals) have relocation table entries to represent "holes" to fill in in bytes e.g. "there is a call to printf to fill in with the locatoin of printf" have symbol table entries to represent labels that might be used elsewhere e.g. "byte 1000 of the file is where the main function starts" - after linking to executable: [assuming static linking] (typically from multiple object files) have bytes to load into memory directly - C pointer arithmetic and arrays int *p; ^^^--------- how compiler choses what += does p[X] <-----> *(p + X) p = &a[0]; p += 2; *p ---> *(&a[0] + 2) --> a[2] p[1] --> *(p + 1) --> *((&a[0] + 2) + 1) --> *(&a[0] + 3) --> a[3] - bitwise operators --- bit puzzle strategy AND: keep only certain bits clear certain bits OR: set certain bits XOR flip certain bits "certain bits" --> specify with a "mask" = number with those bits set to 0/1 - mask+shift idea [bit 0 = least significant bit] extract bits 5-9 (inclusive) from x, make them bits 0-5 (iclusive) (x & [value with bits 5-9 set to 1]) >> 5 11 1110 0000 (x & 0x3E0) >> 5 0b11_1110_0000 or 0b1111100000 --- not in C - divide and conquer for each piece of the number we care about, figure out how transform shift to the locatoin we want it turn it into a bias e.g. 0 or 1 / 0 or -1 - LEA versus MOV - lea __memory locatoin__, %rax like mov __memory locatoin__, %rax, but skip memory access of move (shortcut address value into register file write input) - lea (%rax), %rbx <---> mov %rax, %rbx mov (%rax), %rbx would read from memory at address R[# of RAX] - lea 1(%rax), %rbx <---> rbx becomes (rax + 1) - lea (%rax, %rbx), %rcx <--> rcx becomes (rax + rbx) b/c mov (%rax, %rbx), %rcx would access memory at address rax + rbx - lea (%rax, %rbx, 2), %rcx <---> rcx becomes (rax + rbx * 2) - lea %rax, %rbx -- SYNTAX ERROR (I hope) - RISC v CISC - instruction set design choices can make it easier/harder to make hardware easier/harder to write assembly - RISC generally: tries to make hardware simpler to design assuming that making assembly harder to write/require more instructions is okay "let the compiler do it" - RISC: simpler hardware - simpler instructions - generally no instructions that write two values - fixed-length instructions (simpler to decode in hardware) - less instructions - separate instructions from accessing memory - more registers to hold values to make up for this - one way of specifying opreands - CISC: more complicated instructions, often designed for making particular things convenient in assembly - fuzzy line ---- very few ISAs are definitely RISC or definitely CISC - endianness of Y86 - it's little endian whenever it comes up - decoding Y86 - you need the table of opcodes, etc. - table has bytes of instructions labelled look in first byte for matching opcode table tells you which byte # of each instruction contains important values match the blanks to values in bytes of instruction 0x123 = 0x0000 0123 = 0x0000 0000 0000 0123 ^^ ^^ ^^ little endian of that, least significant byte at lowest address address 0: 0x23 address 1: 0x01 address 2: 0x00 ... - Q5 on Fall 2018 --- assembly code and value in regs 0x000: 30 f4 30 00 00 00 00 00 00 00 | irmovq $0x30, %rsp <--- RSP becomes 0x30 0x00a: 30[f0 25 00 00 00 00 00 00]00 | irmovq $0x25, %rax <--- RAX becomes 0x25 ^^ least sig part (lowest address) 0x00000025f0 0x014: 61 04 | subq %rax, %rsp <--- RSP = RSP - RAX = 0x30 - 0x25 = 0xB (11) 0x016: b0 0f | popq %rax <----RAX becomes M[0xB] RSP becomes RSP + 8 = 0x13 (19) 0x018: 00 | halt - Q5 on Fall 2016 - HCL regiser Q on quiz register xY { foo : 64 = 1; bar : 64 = 1; } x_foo = Y_foo + Y_bar; x_bar = x_foo + 1; x_bar = x_foo + 1; x_foo = Y_foo + Y_bar; cycle 0: Y_foo = 1, Y_bar = 1 x_foo = Y_foo + Y_bar --> 1 + 1 --> 2 x_bar = x_foo + 1 --> (1 + 1) + 1 --> 3 --- rising edge of the clock ---- copy x_foo to Y_foo, etc. cycle 1: Y_foo = 2, Y_bar = 3 x_foo = Y_foo + Y_bar --> 2 + 3 --> 5 x_bar = x_foo + 1 --> 5 + 1 --> 6 --- rising edge of the clock ---- copy x_foo to Y_foo, etc. cycle 2: Y_foo = 5 - stages and order things happen - "fake"/conceptual stages for the single-cycle CPU things happened in an order because of dependencies --- can't read RBX for add RBX, RBX until we read the instruction to find out that it's RBX and not RAX updates triggered by the rising edge of the clock stages were just an organizational tool that sometimes matched up to the dependency order - pipelined CPU --- stages were the actual order things happened because we saved results of prior stage in pipeline registers before starting the next stage "fetch" -- read instr memory + compute addr of next instruction + compute next PC (??? with special cases for jmps/ret?) "decode" -- read register file "execute" -- use the ALU "memory" -- use the data memory "writeback" -- write to the register file - push/pop effects on registers and ordering of operations - ALU for call - push/pop/call/ret all need to compute a new stack pointer will use the ALU to do this in our textbook's design - push: compute a new stack pointer and write to the new location - pop: compute a new stack pointer and read from the OLD location - ALU's output will be written to RSP - data memory's input will come from the old RSP (not the ALU) (from the register file more directly) - timing with pipeline cycle length + number of cycles cycle length = length of longest path from register to register or register to memory or register to register file pipelined CPU --- "slowest stage" (including pipeline register time) e.g. 200 ps fetch decode execute writeback A cycle 0 B A cycle 1 C B A cycle 2 ------- C B A cycle 3 C B cycle 4 C cycle 5 once things are started, one instruction finishes per cycle (alt: one instruction starts per cycle) plus extra 3 cycles to for first instructoin to finish (or extra three cycles for last instruction to finish) throughput = one per cycle (time for many) = 1 instruction/200 ps latency = 4 cycles for each instruction == 800 ps total time for three instructions (from start to finish) 3 + 3 -- 6 * 200 = 1200 ps - logical v arithmetic shift in C compilers we care about: x >> 4 is a logical shift if x is unsigned, arithmetic shift if x is signed arithmetic shift copies the sign bit to the shifted in bits divide by power of 2, round toward negative infinity for signed logical shift always shifts in 0s divide by power of 2 for unsigned get weird non-mathematical results if thinking of it as signed + negative