- dynamic linking
    - static linking: executable file contains all the stuff we load into memory
        con: many copies of common library routines like printf
    - dynamic linking: instead, (potentially) common library routines are loaded
      from another file when the application starts
        executable file doesn't contain a copy of printf, just a note about where
        to get a copy of printf (e.g. "read libc.so")
    - often extra indirection to limit changes to machine code at runtime:
      calling printf through a function pointer
        if we have 1000 calls to printf, we don't want to edit 1000 places in the
        machine code at load time
        alternative: those calls use a table of function pointers;
        at load time, set the entry in the table
        (a small C sketch of this idea appears after the bitwise sections below)
- object files and relocations/tables
    - bytes (machine code or data) to be copied into the executable
    - information to fix up "holes" in those bytes
        relocations: holes in the current file that need information from elsewhere
            e.g. "call to printf, which I don't know where it is"
        symbol table: names of things in this file that might be referenced by relocations
            e.g. "byte 1000 of this file is where the function main starts"
- bitwise operations
    - shifts
        << -- left shift
            arithmetic: multiply by a power of 2
            binary: add zeros to the end; if bits go off the front (most significant side), ignore them
        >> -- right shift; two types:
            arithmetic right shift: typically >> with signed numbers (incl. x86/us)
                signed divide by a power of 2, rounding towards -infinity
                    -5 >> 1 --> -3
                     5 >> 1 -->  2
                add copies of the sign bit at the beginning (most significant side);
                let bits go off the end (least significant side) + ignore them
            logical right shift: >> with unsigned numbers
                unsigned divide by a power of 2, rounding towards -infinity OR 0
                (same thing, since the value is non-negative)
                add 0s at the beginning (most sig); let bits go off the end (least sig)
    - bitwise logical operations
        bit-by-bit (on pairs of bits), do a single logic gate
        AND --- 1 only if both inputs are 1
            "keep only certain bits" --- AND with a mask where kept bits are 1
            "clear certain bits" --- AND with a mask where cleared bits are 0
            example: extract bits 5-3 (inclusive) of binary 00010111
                                     bit positions:        76543210
                    00010111
                AND 00111000   <---- mask: number with bits 5-3 (inclusive) set to 1 (hex 0x38)
                ------------
                    00010000
              then MOVE bits 5-3 into bits 0-2:
                    00010000 >> 3 --> 00000010
              overall: (x & 0x38) >> 3
        OR --- 1 if any input is 1
            "set certain bits"
        XOR --- 1 only if exactly one input was 1
            "flip certain bits"
    - pattern for bitwise problems: split into subproblems
        isolate parts of the number with AND + shifting
        (see the C sketch below)
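- two illustrative C sketches for the sections above
    - a minimal sketch of the "table of function pointers" idea behind dynamic linking;
      the names fake_loader and printf_entry are made up for illustration, and this is a
      simplified stand-in for the real loader machinery, not how it is actually implemented:

        #include <stdio.h>

        /* one slot in the hypothetical table; the dynamic loader would fill it in
           at load time with printf's address found in libc.so */
        static int (*printf_entry)(const char *, ...);

        static void fake_loader(void) {
            printf_entry = printf;   /* pretend lookup of printf in the shared library */
        }

        int main(void) {
            fake_loader();
            /* every call site goes through the table slot, so the loader patches one
               pointer instead of editing 1000 call instructions */
            printf_entry("hello via the function-pointer table\n");
            return 0;
        }

    - a small check of the shift and mask rules above (assuming the usual x86-style
      arithmetic >> for signed values; C leaves that choice implementation-defined):

        #include <stdio.h>

        int main(void) {
            int s = -5;
            unsigned u = 5;
            printf("%d\n", s >> 1);   /* arithmetic shift: -3 (rounds toward -infinity) */
            printf("%u\n", u >> 1);   /* logical shift: 2 (zeros come in at the top)    */

            unsigned x = 0x17;                 /* binary 00010111 */
            unsigned field = (x & 0x38) >> 3;  /* keep bits 5-3, then move them to bits 0-2 */
            printf("%u\n", field);             /* prints 2 (binary 010) */
            return 0;
        }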
- ISA versus microarchitecture
    - instruction set architecture (specification, like in our textbook)
        all you need to know about to write a compiler + assembler that produce correct, working programs:
            what instructions are there
            * what do the instructions do (changes to registers, memory, etc.)
            * what does their machine code look like
            what types of operands do the instructions take
            how are those operands represented in machine code
        notably NOT included: how fast is anything?
        many possible microarchitectures can implement a particular ISA
    - microarchitecture (specified by something like HCL code)
        particular hardware implementation
        decides how fast instructions are
        e.g. how is your register file designed?
        example: we've seen single-cycle + pipelined
            single-cycle: one cycle per instruction
            pipelined: multiple cycles per instruction;
                most of the time one instruction completes per cycle, but not always
- HCL register question on latest quiz
        register xY {
            foo : 64 = 1;
            bar : 64 = 1;
        }
        x_foo = Y_foo + Y_bar;
        x_bar = x_foo + 1;

    cycle 0: Y_foo = Y_bar = 1
        x_foo = 1 + 1 --> 2
        x_bar = x_foo + 1 --> (1 + 1) + 1 --> 3
    --- rising edge of the clock --- copies register inputs x_foo, x_bar to outputs Y_foo, Y_bar
    cycle 1: Y_foo = (old x_foo) 1 + 1 --> 2
             Y_bar = (old x_bar) 3
        x_foo = 2 + 3 --> 5
        x_bar = x_foo + 1 --> 5 + 1 --> 6
    --- rising edge of the clock ---
    cycle 2: Y_foo = (old x_foo) 5
             Y_bar = (old x_bar) 6
- rA/rB bit indexing in HCLRS
    since we read from memory in little endian:
        least significant byte is at (address we read from) + 0   <-- opcode/fn/cc byte
        2nd least significant byte is at (address we read from) + 1   <-- rA/rB byte
        3rd least significant byte is at (address we read from) + 2
        ...
    in HCLRS, we number bits starting at the least significant bit
        bits 0-7 (inclusive) of a number are the least significant byte
        bits 8-15 (inclusive) are the 2nd least significant byte
        ...
    0x[rA][rB] --> rA is the most significant 4 bits (bits 4 to 7 inclusive)
                   rB is the least significant 4 bits (bits 0 to 3 inclusive)
    i10bytes --- number from memory
        byte_containing_rA_and_rB = i10bytes[8..16];
        rA = byte_containing_rA_and_rB[4..8];
            --- same as ---
        rA = i10bytes[12..16];
        (see the C sketch after this section)
    the machine code table may be written in a weird order for bit numbering
    (op/fn byte first, so the bit numbers are not in decreasing order):
                     1111 11
         7654 3210   5432 1098
          op   fn     rA   rB
    could have been written differently, with bits in decreasing order
    (but trickier for hand encoding/decoding?):
         1111 11
         5432 1098   7654 3210
          rA   rB     op   fn
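- a small C sketch of the rA/rB extraction above, using shifts and masks in place of
  HCLRS bit slices; the i10bytes value 0x2360 is a made-up example (opcode byte 0x60,
  rA/rB byte 0x23), and only the low 2 of the 10 instruction bytes are shown:

    #include <stdint.h>
    #include <stdio.h>

    int main(void) {
        uint64_t i10bytes = 0x2360;             /* byte 0 = 0x60 (op/fn), byte 1 = 0x23 (rA/rB) */

        uint64_t rA_rB_byte = (i10bytes >> 8) & 0xFF;   /* like i10bytes[8..16] in HCLRS */
        unsigned rA = (rA_rB_byte >> 4) & 0xF;          /* like [4..8] of that byte      */
        unsigned rB = rA_rB_byte & 0xF;                 /* like [0..4] of that byte      */
        /* equivalently: rA = (i10bytes >> 12) & 0xF, matching rA = i10bytes[12..16]; */

        printf("rA = %u, rB = %u\n", rA, rB);   /* prints rA = 2, rB = 3 */
        return 0;
    }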
- instruction stages
    - "fake"/conceptual stages with our single-cycle processor
        how our textbook divided up the components of the processor
    - real stages with our pipelined processor
        stages are separated by pipeline registers, and instructions are passed from
        one stage to the next through these (so an instruction doesn't advance to the
        next stage until a clock signal rises)
    - Fetch
        read instruction memory / split up the instruction into pieces
        compute length of instruction + address of the following instruction
        [for pipelined:] also update the PC so the next fetch can happen right away
            (but there are cases where we won't be able to do this;
             we'll talk about this after the exam)
    - Decode
        read register file
    - Execute
        use the ALU to compute anything relevant
            add/sub/xor --- the new value
            memory instructions --- the memory address we need
    - Memory
        use the data memory
    - Writeback
        set up the register file write
    - [for single-cycle only] PC Update
        change the PC
- timing: single-cycle / pipelined CPUs
    - critical path: longest path from state to state
        (pipeline register to pipeline register, register to memory write / register file write)
        determines the cycle time
        all paths happen at once --- need enough time for ALL to finish
    - single-cycle: longest path probably goes from the PC to some memory/register file
    - pipelined: longest path probably goes from a pipeline register to something,
      or from something to a pipeline register
        longest path == worst stage
        each instruction needs to go through every stage
             F  D  E  W
            -------------
             A              cycle 0
             B  A           cycle 1
             C  B  A        cycle 2
                C  B  A     cycle 3
            -------------
                   C  B     cycle 4
                      C     cycle 5
        amount of time per cycle -- critical path (slowest stage)
- pipelining -- latency and throughput
    latency = time from start to finish
        e.g. with a 200 ps cycle, A has a latency of 4 * 200 = 800 ps
        (200 ps cycle --- determined by slowest stage)
    throughput = rate of finishing instructions = rate of starting instructions
        (1 / the time between finishing instructions)
        e.g. above: 1 instruction/cycle and 200 ps/cycle ---> 1 instruction / 200 ps
- Q3 on last quiz: doubling stages --> effect on latency, throughput
    probably we finish about 1 instruction/cycle before,
    and we finish about 1 instruction/cycle after
    if the slowest stage takes HALF the time, then we double the throughput
        but to do that, we need to divide up the existing slowest stage exactly in half
        AND not have to worry about an extra pipeline register --- probably can't do this
        we probably can't divide the stage *exactly* in half,
        and we definitely need to add a pipeline register
        stage that took X ps now takes X/2 + pipeline-register-related extra
            original stage: [pipeline reg][          computation          ]
            new stages:     [pipeline reg][ compute part 1 ][pipeline reg][ compute part 2 ]
        (worked numbers in the C sketch at the end of these notes)
- single-cycle register updates
    during the cycle, we prepare all of the values we change on the inputs to
        the PC register
        the register file
        the data memory
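- to make the timing discussion above concrete, a small C sketch with made-up numbers
  (4 stages and a 200 ps cycle as in the example above, plus an assumed 20 ps
  pipeline-register delay for the doubled-stage case):

    #include <stdio.h>

    int main(void) {
        double cycle = 200.0;   /* ps, set by the slowest stage */
        int stages = 4;

        printf("latency    = %.0f ps\n", stages * cycle);        /* 4 * 200 = 800 ps */
        printf("throughput = 1 instruction / %.0f ps\n", cycle);

        /* doubling the stages: even if the slowest stage's work splits exactly in
           half (100 ps), each new stage also pays the assumed 20 ps register delay,
           so throughput improves by less than 2x and latency gets worse */
        double reg_delay = 20.0;
        double new_cycle = cycle / 2.0 + reg_delay;               /* 120 ps           */
        printf("doubled: cycle = %.0f ps, latency = %.0f ps\n",
               new_cycle, 2 * stages * new_cycle);                /* 120 ps, 960 ps   */
        return 0;
    }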