- latency and throughput ~ calculating for SEQ v PIPE
    - latency: time from start to finish for some operation
    - throughput: rate at which operations complete
        usually operations/time; can also be written as time/operation completed (still throughput)

        time ----> (S = start operation, E = end operation)
        S 1 2 3 E
          S 1 2 3 E
            S 1 2 3 E
            ...
        --> throughput of 1 operation/time unit = 1 time unit/operation completed
        --> latency of 5 time units/operation
    - single-cycle processor: we do one instruction at a time, so
        time per instruction = time between when instructions complete
    - time per cycle = critical path: longest sequence of operations that
      needs to complete in a clock cycle
        - single-cycle: operations from getting the new PC (beginning of cycle)
          all the way to writing all the instruction's values
        - pipelined processor: operations for a pipeline stage, typically
          "between two pipeline registers" (except: the ends)
        - set cycle time = critical path length
    - subtlety with critical paths: not everything is on the critical path
        - single-cycle: some operations can be done in parallel
          for example: adding X to the PC while reading registers
        - pipelined: only the longest stage matters (and only the longest path through a stage)
    - calculating time: sum of things on the critical path
        - but remember that pipeline registers take time (if there are pipeline registers)
          "register delay" or "pipeline register delay"
- RET and Load/Use Hazards
    - in the five-stage processor we used:

        RET            F D E M W
                        ^^^^^^\
        (next instr)           F D E M W
             can't fetch yet --^
             result: 3 cycles of stalling

    - Load/Use hazard:

        MRMOVQ ..., %R9     F D E M W
             value not ready ^^^^^^\
        ADDQ %R9, %R10        x F D E M W
                              ^
                              \--- 1 cycle of waiting, even with forwarding

        happens anytime we read a value from memory and want to use it in execute;
        forwarding is probably from end of memory to end of decode
    - no-hazard examples:

        ADDQ %R8, %R9      F D E M W
                                \
        SUBQ %R9, %R10       F D E M W

        forward from end of execute to end of decode

        ADDQ %R8, %R9
        NOP
        SUBQ %R9, %R10

        forward from end of memory to end of decode
        ...
- S2018 Q5 ~~ hazards in a five-stage pipeline

    mrmovq ..., %r11    F D E M W
    addq %r11, %r8        x F D E M W    needs %r11 from mrmovq
    subq %r8, %r11            F D E ..   needs %r8 (addq), %r11 (mrmovq)
    popq %r8                    F D ..   needs %rsp

- S2018 Q13 ~~ addresses mapping to the same set
    4-way, 16KB cache, 16B blocks
    offset = location within block; 16 locations --> 4 bits (2^4 = 16)
    [set] index = number of the set; 256 sets --> 8 bits (2^8 = 256)
        16KB cache / 16B blocks --> 1K blocks
        4 blocks/set
        1K blocks / (4 blocks/set) = (1K/4) sets = 256 sets
    tag = everything else (in this case, 16-bit addresses: 16 - 8 - 4 = 4 bits)

    same set as 0x1234?
        tag / index / offset
        0x1   0x23   0x4
              ^^^^
              \---- determines which set to use
        same set:     0x1235, 0xF23F (index is 0x23)
        not same set: 0x1A34 (index 0xA3), 0x1004 (index 0x00)
- direct-mapped v set-associative
    direct-mapped ==== 1-way set associative
    N-way set associative means N places to put each block,
        i.e. can store N different blocks with the same index bits
    fully-associative ==== (# blocks in cache)-way set associative
- write strategies (write-alloc/no-alloc, etc.)
    write-allocate: when the program writes, bring the block into the cache
        pro: hopefully the program will read that value / something near it
        con: writes now become reads (for the rest of the block) + writes
    write-no-allocate: when the program writes, modify the value if it's already cached,
        otherwise send it to memory without any extra work
    write-back: when the program writes, don't send it to memory until needed (eviction)
        typically, but not always, paired with write-allocate
    write-through: when the program writes, always send it to memory immediately
- high-level strategy for cache miss counting:
    figure out how big blocks are in terms of what's being accessed
        what values map to the same block?
    figure out what values map to the same set
        rule: find the # of bytes in a way; values separated by K * (# bytes in a way)
        map to the same set
    can analyze each set independently
- post-quiz cache miss questions
    on the last post-quiz:

    int a[1024];
    ...
    int sum = 0;
    for (int j = 0; j < 32; j++)
        for (int i = 0; i < 32; i++)
            sum += a[i * 32 + j];

    // i = 0,  j = 0 --> a[0]
    // i = 16, j = 0 --> a[512]
    // i = 0,  j = 1 --> a[1]
    //     ALWAYS A MISS, because we accessed a[512] between a[0] and a[1]
    // i = 16, j = 1 --> a[513]
    //     ALWAYS A MISS, because we accessed a[1] between a[512] and a[513]
    // similar for j = 2, j = 3

    16-byte blocks --> 4 ints/block
        a[0], a[1], a[2], a[3] are in the same block, assuming blocks start
        at a[0] (which the question says they do)
        a[512], a[513], a[514], a[515] are in the same block
    DM 2048-byte cache:
        a[0] and a[512] are 512 * 4 = 2048 bytes apart,
        so a[0] and a[512] (and a[1024] and a[1024 + 512]) map to the same set
    suppose 2-way set associative (same total size):
        a[0], a[256], a[512], a[768] all map to the same set
        between accessing a[0] and a[1], we'd access a[256] and a[512],
        probably evicting a[0] (assuming LRU-like replacement)

    other quiz question:

    int a[1024];
    ...
    int sum = 0;
    for (int i = 0; i < 1024; i++)
        sum += a[i];

    access a[0], a[1], a[2], ...
        nothing in between accesses to the same block to evict them
    same as if:

    for (int i = 0; i < 32; i++)
        for (int j = 0; j < 32; j++)
            sum += a[i * 32 + j];

- S2018 Q7/Q8 ~~ cache miss counting
- F2017 Q9
- cache blocking [slides]
- function inlining
    - copy the contents of a function in place of a function call
    - compiler will do it if
        - the function is "small enough", and
        - the compiler doesn't need to look across .o files
- locality: spatial/temporal
    spatial locality: nearby values
        ~ caches take advantage of this with cache blocks being larger than one value
    temporal locality: repeated access to the same value
        ~ caches take advantage of this by adding things to the cache when they are
          accessed and trying to keep them if they've been recently accessed

    for (int i = 0; i < N; ++i)
        for (int j = 0; j < 100; ++j)
            use(A[i]);

    has more temporal locality than

    for (int i = 0; i < N; ++i)
        for (int j = 0; j < 10; ++j)
            use(A[i]);