- precise exceptions, OOO (out-of-ordeR) - OOO: processors execute instructions out-of-order - precise exceptions: covered later if at all - cache misses questions on most recent post quiz - tag/index/offset/etc. questions - one way: memorize the formulas - picture of the cache: rows as sets columns as ways width of columns is block size - bits: log_2(number of things indexed) - you can do Algebra - example: compuitng tag without block size - tag bits = address size - (set index bits + block offset bits) - tag bits = address size - log_2(cache size / ways) = address size - log_2(# sets * block size) = address size - log_2(cache size / ways) - bubbling and squashing - how do you set the bubble/stall signals - figure out what instructions should be running next - bubble signal if next instruction after regs should be nop - stall signal if next instruction after regs should be same as current - branch predictions: - compute the result of the branch - in *that cycle*, we set bubble signals so in the next cycle: "squashing" - there are nops replacing every instruction that came after a mispredicted branch (PIPE: instruction going to decode, instruction going to execute) - in the next cycle, we fetch the correct instruction - finding #s of stalls - forwarding requirements: - value must be computed before its needed - don't make any clock cycles much longer (no feeding memory results to the ALU) - example: ret: compute the return address near the end of memory - stalling: adjust instruction so forwarding, etc. works - example: ret stall fetch until return address computed --> stall fetch until ret's writeback -{ popq %rax addq %rax, %rax }- popq %rax F D E M* W [addq %rax F D E* M W <--- not possible] addq %rax F D D E M W ^^^ <- nop in execute - no stalling *with forwarding* addd %rax, %rax F D E*M W addq %rax, %rax F D E M W - stalling *without forwarding* addq %rax F D E M W* addq %rax, %rax F D D D D E M W 1 2 3 popq %rax F D E M W* addq %rax, %rax F D D D D E M W - Spring 2017 Q4 --- returns and pipeline hazards - return reads %rsp - wait until memory FINISHES to fetch next instruction - good temporal locality/spatial locality? - temporal: is it access twice? if so, how frequently are duplicate accesses? - spatial: what is the distance between consecutive accesses? - locality quests on most recent post quiz - cache blocking (breaking up a loop) for (int i = 0; i < N; ++i) for (int j = 0; j < N; ++j) A[i * N + j] = B[j * N + i]; bad locality in B and good in A (swap loop orders: good locality in B and bad in A) -- common problem with loop orders for (int ii = 0; ii < N; i += 2) for (int j = 0; j < N; ++j) for (int i = ii; i < ii+2; i += 1) A[i * N + j] = B[j * N + i]; better locality in B (distance between consecutive B acceses is often 1) worse locality in A (distance between consectuve A accesses is sometimes N) --> overall better locality --> expansions: better choice than "2" --> repeat for j --> worse locality doesn't matter --- q: what fit sin cache? - loop unrolling for (int i = 0; i < N; ++i) foo(i); for (int i = 0; i < N; i += 2) { foo(i); foo(i+1); } executes less instructions per index (except you need to handle the "leftover") but makes the code bigger (maybe more cache misses?) - inlining foo(i) { something complicated } for (int i = 0; i < N; ++i) foo(i); --> for (int i = 0; i < N; ++i) something complicated makes code bigger, but less instructions for function call/return - aliasing two things are different names for the same thing: foo(int *px, int *py) { for (int i = 0; i < 10; ++i) { *px += *py; } } foo(&x, &y) foo(&x, &x) <-- possible call