- functional unit issue rate/etc.
    Fall 2016 5, 6
- on Exam 3, not 2
    - writing HCL code?
        - no writing, but may need to read or understand how to write
          (no memorizing signals, etc.)
- spatial and temporal locality in loops
    mostly: only innermost loop matters
    exception: not too many iterations of innermost loop
        for (int i = 0; i < N; ++i)
            for (int j = 0; j < M; ++j)
                A[i * N + j]   <-- j changes most frequently; if M is big, this is most of what matters
                B[i * 10]
                C[j * 1000]
    answer question: what sequence of values is accessed?
        if i = 0: (chosen arbitrarily)
            A[0] (j=0), A[1] (j=1), A[2], A[3], ...
            B[0], B[0], B[0], B[0], B[0], B[0], ..., then B[10] (when i = 1)
            C[0], C[1000], C[2000], ...
    temporal locality? are we accessing the same value twice close in time?
        B[0] and B[0] and B[0] ...
        if M is small: yes for C --- C[0], then M iterations later, C[0] again
    spatial locality? are we accessing closeby values close in time?
        A[0] and A[1] are close
        A[1] and A[2] are close
        ...
        B[0] and B[10] kinda close? but not very frequent if M is large (outer loop)
- LRU and identifying misses, etc.
    - cache parameters and access pattern (a small C sketch of this procedure appears at the end of these cache notes):
        (1) find the # of index and offset bits
        (2) find the index of everything being accessed in the cache
            (you can ignore the block offset for finding misses)
            if you have array[x] --> assume the array starts at some address (e.g. 0),
                and work out the index
            hopefully if you have array[x], we've told you enough to know that values
                don't overlap two blocks
        (3) look at each set [accesses with the same index]:
            go in order of accesses, and
            (a) check if it is already in the set [initially, it's not, probably]
                no --> miss, yes --> hit
            (b) if it is a miss, replace something in the set
                - if not all the blocks in the set are used, put it there
                - otherwise, follow the replacement policy
                  LRU: replace the least recently accessed block
                  (e.g. look at previous accesses to blocks in the set)
- fewest misses
    Fall 2017 Exam 2 --- 5
    - random replacement wins:
        1 miss for every 64B block accessed each iteration
            - minus some times where we get lucky
        versus: 1 miss and no times we get lucky
- counting misses with array access
    Fall 2017 Exam 2 --- 9
- cache parameter effects
    - a "catch" --- what else changes?
        given # sets, block size, associativity
            + increase assoc only --> total cache size increases
        given total size, block size, associativity
            + increase assoc only --> # sets decreases
    - with "normal" programs
        > associativity+  --- fewer conflict misses
        > cache size+     --- fewer capacity misses
        > block size+     --- fewer compulsory misses, more conflict misses*
            A[0], A[1], A[2], A[3], ...
            block size 1 --> Miss, Miss, Miss, Miss, ...
            block size 2 --> Miss, Hit (because of A[0]), Miss, Hit (because of A[2]), ...
            [* depending on what else changes; rule of thumb --- fewer sets -> more conflicts]
        > number of sets- --- more conflict misses if # blocks/set is the same
        > write-allocate+ --- fewer misses
- types of misses
    - compulsory --- still a miss with infinite cache size
    - capacity   --- still a miss with associativity = # blocks (fully assoc)
    - conflict   --- not compulsory or capacity (evicting even though there is still space elsewhere)
- write policies: write-back/through, (no-)write-allocate
    - writing to a cache?
        [write-allocate?] if it's not in the cache yet, should I put it there?
        [write-back/write-through?] if it's in the cache (now), do I update just the cache or also memory?
    write-back --> need to track what's out of date ("dirty")
        1 extra bit per block
        otherwise: need to write memory on every eviction, which would be very expensive
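The miss-counting procedure above (find offset/index bits, map each access to a set, then apply LRU within each set) can be written directly as code. Below is a minimal sketch in C, not from the course materials: the geometry (64-byte blocks, 4 sets, 2-way) and the access pattern in main are made-up examples, and cache_access is a hypothetical helper name.

    #include <stdbool.h>
    #include <stdio.h>

    #define BLOCK_SIZE 64        /* bytes per block -> 6 offset bits (example value) */
    #define NUM_SETS   4         /* -> 2 index bits (example value)                  */
    #define WAYS       2         /* blocks per set = associativity (example value)   */

    typedef struct {
        bool valid;
        unsigned long tag;
        unsigned long last_used; /* timestamp used to pick the LRU victim */
    } Block;

    static Block cache[NUM_SETS][WAYS];

    /* One access: returns true on a hit, false on a miss; fills/evicts on a miss. */
    static bool cache_access(unsigned long addr, unsigned long now) {
        unsigned long block = addr / BLOCK_SIZE;   /* step (1): strip the offset bits */
        unsigned long index = block % NUM_SETS;    /* step (2): index = which set     */
        unsigned long tag   = block / NUM_SETS;
        Block *set = cache[index];

        /* step (3a): already in the set? */
        for (int w = 0; w < WAYS; w++) {
            if (set[w].valid && set[w].tag == tag) {
                set[w].last_used = now;
                return true;                       /* hit */
            }
        }

        /* step (3b): miss --- use an empty way if there is one, else evict the LRU way */
        int victim = 0;
        for (int w = 0; w < WAYS; w++) {
            if (!set[w].valid) { victim = w; break; }
            if (set[w].last_used < set[victim].last_used) victim = w;
        }
        set[victim].valid = true;
        set[victim].tag = tag;
        set[victim].last_used = now;
        return false;                              /* miss */
    }

    int main(void) {
        /* made-up access pattern: A starts at address 0, 4-byte ints, A[0..7] read twice */
        unsigned long misses = 0, now = 0;
        for (int pass = 0; pass < 2; pass++)
            for (int x = 0; x < 8; x++)
                if (!cache_access(4UL * x, now++))
                    misses++;
        printf("misses: %lu\n", misses);           /* A[0..7] fits in one 64B block -> 1 miss */
        return 0;
    }

Working an exam problem by hand is the same computation: the addr / BLOCK_SIZE and % NUM_SETS lines are steps (1) and (2), and the two loops over the set are step (3).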
- direct v set-assoc v fully-assoc
    associativity = # blocks / set
    direct-mapped     <---> associativity = 1
    fully-associative <---> associativity = # blocks
- bubbling versus stalling
    - two things we call stalling:
        - stalling AN INSTRUCTION --- keeping the instruction from advancing in our pipeline
        - stalling A REGISTER --- keeping the register from being updated
    - stalling instruction D
        stages:             F  D  E  M  W
                            ----------------
        currently:          E  D  C  B  A
        -- normal advance:  F  E  D  C  B
        -- stalling D:      E  D  *  C  B
        to achieve this:
            stall the registers containing values for:
                E (the PC)
                D (the fetch -> decode register)
            make * not be another copy of D --> "bubble" the decode -> execute register
    "bubbling" --> insert a NOP
        bubbling a register --> make it output a nop next cycle
- stalling and bubbling for data hazards
    - most of the time, we can forward without stalling
    - otherwise:
        (1) when is the value we need computed?
        (2) when is the value we need used?
    - example: load/use hazard in our five-stage pipeline
        mrmovq (%rax), %rax
        addq %rax, %rax
        (1) near the end of the memory stage of mrmovq
        (2) near the beginning of the execute stage of addq
        if we can't forward:
            (1) at the end of writeback of mrmovq
            (2) near the beginning of decode of addq
        IMPOSSIBLE (value needed at (2) near the start of the same cycle it is computed at (1)):
            mrmovq  F D E M W     <-- (1) end of M
            addq      F D E       <-- (2) start of E, same cycle as mrmovq's M
        CORRECT BY STALLING AND FORWARDING:
            mrmovq  F D E M W     <-- (1) end of M
            addq      F D D E     <-- (2) start of E, one cycle after mrmovq's M
- data hazards and load/store/etc.
    - load creates hazards which need stalling in the five-stage processor
      because memory is after execute
    - other pipelines would have different hazards
      e.g. two decode stages or different execute stages
- control hazards
    - didn't know correct PC in fetch
    - examples: RET (need to read it from memory), Jxx (need condition codes)
    - solution 1: don't fetch new instructions until we know the PC
        - "stall" the next instruction
    - solution 2: make a guess, and correct it later (branch prediction)
        - later on, replace instructions with nops
        - also, keep any permanent changes from being made
          (e.g. don't let them update condition codes, or restore old values of
          changed registers/memory/etc.)
- aliasing: what it is / why it matters
    - two names for one value
        example:
            int A[100];
            int *x = &A[10];
            int *y = &A[20];
            x[10] is the same thing as y[0]
    - matters because it makes some optimizations we want to do INCORRECT
      (see the sketch below)
        e.g. changing loop order on:
            for (int i = 0; i < N; ++i)
                for (int j = 0; j < N; ++j)
                    A[i * N + j] = B[j * N + i];
            could be wrong if A and B are set up like x, y above
        e.g. removing redundant loads through a pointer
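To see concretely why the loop-interchange example is unsafe under aliasing, here is a minimal sketch in C (made up, not from the course; copy_rowwise, copy_colwise, and the N = 3 buffers are hypothetical names). It runs the original loop order and the interchanged order on two identical buffers where B points one element past A, and prints the positions where the results disagree.

    #include <stdio.h>

    #define N 3

    /* original order: i outer, j inner */
    static void copy_rowwise(int *A, int *B) {
        for (int i = 0; i < N; ++i)
            for (int j = 0; j < N; ++j)
                A[i * N + j] = B[j * N + i];
    }

    /* "optimized" order: j outer, i inner --- only equivalent if A and B don't overlap */
    static void copy_colwise(int *A, int *B) {
        for (int j = 0; j < N; ++j)
            for (int i = 0; i < N; ++i)
                A[i * N + j] = B[j * N + i];
    }

    static void fill(int *p, int n) {
        for (int k = 0; k < n; ++k)
            p[k] = k;
    }

    int main(void) {
        int buf1[2 * N * N], buf2[2 * N * N];
        fill(buf1, 2 * N * N);
        fill(buf2, 2 * N * N);

        /* make A and B alias, like x and y above: B starts one int past A */
        copy_rowwise(&buf1[0], &buf1[1]);
        copy_colwise(&buf2[0], &buf2[1]);

        for (int k = 0; k < N * N; ++k)
            if (buf1[k] != buf2[k])
                printf("differ at A[%d]: %d vs %d\n", k, buf1[k], buf2[k]);
        return 0;
    }

With overlapping arrays, the interchanged loop reads elements that the original order would already have overwritten (and vice versa), so the two orders produce different results; a compiler can only apply this kind of transformation if it can prove A and B don't alias, or if it guards the transformed version with a runtime overlap check.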