- F2021 final, Q3b ~~ assembly and memory
    what memory changes when we run this, given %rax = 0x1000 and %rcx = 0x12345678?

             v--------------------------- "quad" -- 8 bytes
        rmmovq %rcx, 0x10(%rax)
                     ^^^^^^^^^^--- destination

    displacement(base) syntax means "access memory at displacement + base"
        0x10 + %rax = 0x10 + 0x1000 = 0x1010 is the memory address we write to
    the value we write is 0x12345678, which is an 8-byte value because it's in %rcx
        (so %rcx must store some leading 0s)
    the amount we write is 8 bytes, so we'll modify addresses 0x1010 through 0x1017 inclusive
    little-endian --> the least significant byte goes at the lowest address
        0x1010: least significant byte of 0x12345678 --> 0x78 (1's place)
        0x1011: 0x56   <---
        0x1012: 0x34
        0x1013: 0x12
        0x1014-7: 0x00 <---
    What are the values at 0x100f, 0x1011, and 0x1017 after this?
        0x100f: unchanged, 0x1011: 0x56, 0x1017: 0x00
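    [not from the exam -- a minimal C sketch to check the little-endian byte layout above on a
    real x86-64 machine; mem[] is a made-up stand-in for addresses 0x1008-0x1017:]

        #include <stdio.h>
        #include <string.h>
        #include <stdint.h>

        int main(void) {
            unsigned char mem[16] = {0};  /* stand-in for addresses 0x1008-0x1017 */
            uint64_t rcx = 0x12345678;    /* 8-byte value; the leading bytes are 0 */

            /* like rmmovq %rcx, 0x10(%rax): copy all 8 bytes to "address" 0x1010 = mem+8 */
            memcpy(mem + 8, &rcx, sizeof rcx);

            /* on a little-endian machine this prints: 78 56 34 12 00 00 00 00 */
            for (int i = 8; i < 16; i++)
                printf("%02x ", mem[i]);
            printf("\n");
            return 0;
        }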
- F2021 final, Q11b ~~ pipelining
    pipelined processor with 7 stages: F, D, E1, E2, M1, M2, W
    running:
                                      0  1  2  3  4  5  6  7  8  9  10
        subq %r9, %r11                F  D  E1 E2 M1 M2 W
        andq %r12, %r13                  F  D  E1 E2 M1 M2 W
        rmmovq %r12, 0(%r11)                F  D  E1 E2 M1 M2 W
        rmmovq %r13, 0(%r15)                   F  D  E1 E2 M1 M2 W
        ret                                       F  D  E1 E2 M1 M2 W
    how many cycles of stalling are needed?
    dependencies: [these might require us to wait, depending on how the pipeline works]
        %r11 is written by subq and read by rmmovq (1st)
            the value of %r11 is available at the end of cycle 3 = the end of rmmovq (1st)'s decode
            so we'll forward it to decode and solve the problem
        %r13 is written by andq and read by rmmovq (2nd)
            the value of %r13 is available at the end of cycle 4 = the end of rmmovq (2nd)'s decode
            so we'll forward it to decode and solve the problem
        --> 0 cycles of stalling from these dependencies
    BUT: in the assembly, %r9 (used by subq) is computed by an earlier addq:
            addq %r8, %r9
            jle label
            subq %r9, %r11
        we'll have fetched a bunch of mispredicted instructions during the jle, so the addq
        will be out of the pipeline by the time the subq reads %r9
        --> we don't actually have a hazard there
    BUT: after ret, we can't figure out what instruction to run next --
        and isn't that part of running the ret?
        on the exam, we accepted two interpretations:
        1. the stall for ret is part of the next instruction (0 cycles of stalling)
        2. the stall for ret is part of the ret, and that stall takes 5 cycles:
                                                1  2  3  4  5
            ret                              F  D  E1 E2 M1 M2 W
            instruction returned to                            F

- loading vectors and the caches
    vector loads that are contiguous (the ones we've seen in the lab, etc.) are just a big
    cache read
        e.g., reading 256 bits instead of 64 bits works the same way a 64-bit read does
    if the value we are reading is split across two cache blocks, then the processor needs to
    somehow turn this into two cache accesses
        BUT this also happens with non-vector reads (e.g., reading an 8-byte integer that
        starts 4 bytes before the end of a cache block)

- F2021 final, Q1b ~~ caching
    system with an L1 and an L2 cache; both have the same block size
        L1 cache: write-back, write-allocate, 16KB, 4-way, 64B blocks --> 6 index bits, 6 offset bits
        L2 cache: write-back, write-allocate, 512KB, 8-way, 64B blocks --> 10 index bits, 6 offset bits

                        tag              index   offset
            |-------------------------|--------|--------|   L1
            |-----------------|----------------|--------|   L2

        (same offset bits; the L2 uses more index bits, so it has fewer tag bits --
         see the sketch after this question for extracting these fields)

        if two addresses have the same index in the L1 --> they may or may not have the same index in the L2
        if two addresses have the same tag in the L2 --> they may or may not have the same tag in the L1
        if two addresses have the same tag in the L1 --> they have the same tag in the L2

    when a value is written to the L1, which of these could happen?
    1. another value with the same index bits (but different tag bits) will be written to the L2
        YES --> writing a value to the L1 will allocate a block for that value
            allocating a block for that value might require evicting a previously stored block
            that previously stored block might be dirty (b/c write-back), and if so, we'd need
            to update the next level of cache
    2. another value with the same index bits (but different tag bits) will be written to main memory
        YES --> when we write to the L2 cache, the same thing as in (1) could happen in the L2
            (the evicted value will be in the same set in the L2)
            it's possible that something which shares index bits in the L1 also shares all the
            index bits in the L2 (if the first tag bits in the L1 -- the ones that become extra
            index bits in the L2 -- are the same)
    3. that value (the one written to the L1) is also written to the L2
        NO --> write-back policy --> we're going to wait as long as possible
    4. that value is also written to main memory
        NO --> write-back policy
    5. another value with the same tag bits but different index bits is written to the L2
        NO --> we'll only evict values with the same index bits, because each set of the cache
            works independently
    6. another value with the same tag bits but different index bits is written to main memory
        NO --> we know the L2 cache might evict something; could that thing have the same tag
            bits but different index bits, looking at the L1?
            if the L1 tags are all the same, then the L2 tag bits are all the same (because
            the L2 has fewer tag bits than the L1), so this is still the same scenario as (5)
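    [not part of the question -- a minimal C sketch of how the tag/index/offset split above is
    computed; the bit counts come from the cache parameters (64B blocks --> 6 offset bits;
    64 or 1024 sets --> 6 or 10 index bits), and the address is made up:]

        #include <stdio.h>
        #include <stdint.h>

        /* split an address into tag/index/offset, given a cache with 2^index_bits
           sets and 2^offset_bits bytes per block */
        static void split(uint64_t addr, int index_bits, int offset_bits) {
            uint64_t offset = addr & ((1ULL << offset_bits) - 1);
            uint64_t index  = (addr >> offset_bits) & ((1ULL << index_bits) - 1);
            uint64_t tag    = addr >> (offset_bits + index_bits);
            printf("tag=%llx index=%llx offset=%llx\n",
                   (unsigned long long)tag, (unsigned long long)index,
                   (unsigned long long)offset);
        }

        int main(void) {
            uint64_t addr = 0x12345;  /* arbitrary example address */
            split(addr, 6, 6);        /* L1: 64 sets, 64B blocks   */
            split(addr, 10, 6);       /* L2: 1024 sets, 64B blocks */
            return 0;
        }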
[the "load vector of pointerS" instruction is going to be slower than the load a bunch from one pointer instruction] A: for (int i = 0; i < 8192; i += 1) { A[i] += B[i-1] + B[i] + B[i+1]; } to vectorize this, I'm going to try doing multiple i at a time: for (int i = 0; i < 8192; i += 4) { A[i+0] += B[i-1] + B[i+0] + B[i+1]; A[i+1] += B[i+0] + B[i+1] + B[i+2]; A[i+2] += B[i+1] + B[i+2] + B[i+3]; A[i+3] += B[i+2] + B[i+3] + B[i+4]; } and I can do each of the adds vertically with one vector-add instruction and the values I want to load in vectors to do this are: A[i+0] through A[i+3] B[i-1] through B[i+2] B[i+0] through B[i+3] B[i+1] through B[i+4] which are all contiguous ---> I can use the "load a bvunch of things from one address" (though maybe I wnat to figure out a way of loading from B fewer times becaues the loads overlap ---> but I could do that with "shuffling" instructions that move values around in vectors in registers) B: for (int i = 0; i < 8192; i += 1) { for (int j = 0; j < 8192; j += 1) { A[i*8192+j] = B[i*8192 + j] * C[j]; } } to vectorize this, let's try doing multiple j (this is my first idea because it has better locality than multiple i for A, which is what we're writing to) for (int i = 0; i < 8192; i += 1) { for (int j = 0; j < 8192; j += 4) { A[i*8192+j] = B[i*8192 + j] * C[j]; A[i*8192+j+1] = B[i*8192 + j+1] * C[j+1]; A[i*8192+j+2] = B[i*8192 + j+2] * C[j+2]; A[i*8192+j+3] = B[i*8192 + j+3] * C[j+3]; } } ---> looks like all the multiplies can be done with vector instructions well we want to load: A[i*8192+j] through A[i*8192+j+3] --> load a bunch GOOD B[i*8192+j] through B[i*8192+j+3] --> load a bunch GOOD C[j] through C[j+3] --> load a bunch GOOD C: for (int i = 0; i < 8192; i += 1) { B[i] = A[T[i]]; } for (int i = 0; i < 8192; i += 4) { B[i] = A[T[i]]; B[i+1] = A[T[i+1]]; B[i+2] = A[T[i+2]]; B[i+3] = A[T[i+3]]; } want to load: T[i] through T[i+3] --> load a bunch GOOD A[T[i]], A[T[i+1]], A[T[i+2]], A[T[i+3]] these could be A[X], A[X+54321], ... depending on T --> load a bunch is NOT GOOD --> probably we want to use the fancier load instruction D: for (int i = 0; i < 8192; i *= 3) { A[i] *= B[i]; } question was meant to be: for (int i = 1; i < 8192; i *= 3) { A[i] *= B[i]; } for (int i = 1; i < 8192; i *= 3 * 3 * 3 * 3) { A[i] *= B[i]; A[i*3] *= B[i*3]; A[i*3*3] *= B[i*3*3]; A[i*3*3*3] *= B[i*3*3*3]; } multiply we could do nicely with vector instructions, but loading/stoing is not easy: want to load A[i], A[i*3], A[i*3*3], ... --> load from one address instruction won't work ~~~~~~round 3~~~~ - bitwise operators: what do we need to know ~ bitwise AND (&), OR (|), XOR (^), complement (~) ~ left and right shift ~ arithemetic (copy sign bit) versus logic (add 0s) right shift - bit masking ~ when we want to do something with bitwise operators, usually our primary idea is "select a few bits" THEN "do something do them" to select a few bits, we usually make a _mask_ which has 1s in the selected bits and 0s otherwise (or sometimes 0s in selected and 1s otherwise) ~ we can view bitwise AND, OR, XOR as "applying" a mask ~ bitwise AND with a MASK with 1s for the selected bits "keep these bits" [and clear the non-selected bits] ~ bitwise XOR with a MASK with 1s for the selected bits "flip these bits" [and don't change the others] ~ bitwise OR with a MASK with 1s for the selected bits "set these bits (regardless of what they were before)" [and don't change the others]S --- ~ in addition to selecting bits, we often want to move bits to particular places, often we'll 1. 
~~~~~~~~~~ round 3 ~~~~~~~~~~

- bitwise operators: what do we need to know?
    ~ bitwise AND (&), OR (|), XOR (^), complement (~)
    ~ left and right shift
    ~ arithmetic (copy the sign bit) versus logical (shift in 0s) right shift

- bit masking
    ~ when we want to do something with bitwise operators, usually our primary idea is
      "select a few bits" THEN "do something to them"
      to select a few bits, we usually make a _mask_ which has 1s in the selected bits and 0s
      otherwise (or sometimes 0s in the selected bits and 1s otherwise)
    ~ we can view bitwise AND, OR, XOR as "applying" a mask
    ~ bitwise AND with a mask with 1s for the selected bits:
        "keep these bits" [and clear the non-selected bits]
    ~ bitwise XOR with a mask with 1s for the selected bits:
        "flip these bits" [and don't change the others]
    ~ bitwise OR with a mask with 1s for the selected bits:
        "set these bits (regardless of what they were before)" [and don't change the others]
    ~ in addition to selecting bits, we often want to move bits to particular places;
      often we'll:
        1. select the bits we want to keep with a mask and bitwise AND; then
        2. use bit-shifting to move those bits to a particular part of the result
      (see the sketch below)
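    [a minimal C sketch putting masks and shifts together -- the value x is made up, just to
     show the three "apply a mask" operations and the mask-then-shift extraction pattern:]

        #include <stdio.h>

        int main(void) {
            unsigned x = 0xDEADBEEFu;

            /* step 1: keep bits 8-15 with AND; step 2: shift them down to bits 0-7 */
            unsigned field = (x & 0x0000FF00u) >> 8;    /* --> 0xBE */

            unsigned flipped = x ^ 0x000000FFu;  /* flip the low 8 bits:  0xDEADBE10 */
            unsigned set     = x | 0x000000FFu;  /* set the low 8 bits:   0xDEADBEFF */
            unsigned cleared = x & ~0x000000FFu; /* clear the low 8 bits: 0xDEADBE00 */

            printf("%x %x %x %x\n", field, flipped, set, cleared);
            return 0;
        }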