- exam format
    - a combination of quiz-like questions and questions with longer answers (longer answers ~ a sentence or two)
    - exam time: anticipating 12p 10 Dec -- 12p 11 Dec
- topics covered
    - anything we covered in lecture, lab, or homeworks
- pipeline processor stages
    - from our textbook:
        fetch:     read from the instruction cache; predict the address of the next instruction
        decode:    read from the register file (compute destination register numbers)
        execute:   do arithmetic; update and/or read condition codes
        memory:    access the data memory (if necessary)
        writeback: write to the register file
    - textbook's choice --- some real processors use these stages, but it is not the only design
- pipeline timing
    - Quiz 6 Q5: "Consider a three-stage pipelined processor [that] has a cycle time of 500 picoseconds. Suppose this processor runs 5 instructions in the following order: A (completes first), B, C, D, and E (completes last). What is the time between when instruction A starts its first stage and when instruction E starts its third stage?"

        cycle:  1  2  3  4  5  6  7
                v-----------------v
        A       S1 S2 S3
        B          S1 S2 S3
        C             S1 S2 S3
        D                S1 S2 S3
        E                   S1 S2 S3

      A starts S1 at the beginning of cycle 1; E starts S3 at the beginning of cycle 7
      --> 6 cycles elapse --> 6 cycles * 500 picoseconds/cycle = 3000 picoseconds
    - Quiz 7 Q1: "Suppose we have a six-stage pipelined processor with 500 ps cycle time. Running a particular benchmark program which executes 1 billion (10 to the 9th power) instructions in a simulator, we determine that 5% of its instructions require exactly one cycle of stalling and 10% require exactly two cycles of stalling and no instructions would require more than two cycles of stalling." "Assume we attribute each stall to exactly one instruction."
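One way to double-check the Quiz 6 Q5 arithmetic above: in a stall-free pipeline, instruction i (0-indexed) enters stage s (0-indexed) at the start of cycle i + s. A minimal Python sketch of that rule (the function name is ours, not from the course):

```python
def stage_start_ps(instr_index, stage_index, cycle_ps=500):
    # In a stall-free pipeline, instruction i enters stage s at the start of
    # cycle i + s (both 0-indexed), i.e. (i + s) * cycle_ps picoseconds after
    # the first instruction enters its first stage.
    return (instr_index + stage_index) * cycle_ps

# Quiz 6 Q5: E is the 5th instruction (index 4); its third stage is index 2.
elapsed = stage_start_ps(4, 2) - stage_start_ps(0, 0)
print(elapsed)  # 3000, matching 6 cycles * 500 ps/cycle
```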
      no stalling: B finishes one cycle after A

        A   F D E M W
        B     F D E M W

      1 cycle of stalling: B finishes two cycles after A

        A   F D E M W
        B     F D D E M W

      1 billion instructions with no stalling ---> would take about 1 billion cycles (finish one instruction/cycle)
          plus 5 extra cycles for the first instruction to finish
          (the first instruction needs to go from stage 1 to stage 6 before it finishes)
      each time there's a single cycle of stalling --> 1 extra cycle
          1 * 5% * 1 billion extra cycles
      each time there's exactly two cycles of stalling --> 2 extra cycles
          2 * 10% * 1 billion extra cycles
      1.25 billion cycles total * 500 ps/cycle = 625 milliseconds
- how many cycles to resolve data hazards with stalling -- Quiz 7 Q2

        cycle:             0  1  2  3  4  5  6  7  8
        addq %rax, %rbx    F  D  E  M  W
                                       ^-- value of %rbx is written (W)
        reads %rbx                     F  D  E  M  W
                                          ^-- first time we can read %rbx w/o forwarding (D)

        cycle:             0  1  2  3  4  5  6  7  8  9  10 11 12 13 14 15 16
        addq %rax, %rbx    F  D  E  M  W
        subq %rbx, %rdx       F  D  D  D  D  E  M  W
        xorq %rbx, %rcx          F  F  F  F  D  E  M  W
        rrmovq %rdx, %rcx                  F  D  D  D  E  M  W
        addq %rbx, %rcx                       F  F  F  D  D  D  D  E  M  W

        --> 16 (the final addq's W is in cycle 16)
- Quiz 9 access pattern question at end - see 27 October 2020 slides near beginning
    method:
      - note that {array[0], array[1]} are the blocks (we know array[0] is the start of a block from the question, and that blocks are two array elements from the given block size)
      - choose an assignment of blocks to set indexes
          if {array[0], array[1]} are in set K, then {array[2], array[3]} must use set K+1 mod (# sets)
          (in this problem # sets = 2)
          we can't decide to use a different set based on which sets are empty, b/c the address determines which set we use
      - go through the accesses in order, tracking what's in each set
          in the case of a DM cache, we can only store one block per set, so we have to replace that block whenever something new is accessed for a set
    NOTE: we can consider each set separately (and add up the results at the end), which might be useful if we had a more complicated problem b/c e.g.
    nothing that uses set 0 can cause any change in set 1
- exception tables
    - when a processor detects that an exception should occur (e.g. an external event sends a signal or the program "messes up"), it needs to jump to the OS
    - the exception table is a mechanism to tell the processor where to jump to
        - has a separate place for the processor to jump to for each kind of exception
        - idea: the OS provides different code for the processor to jump to for different kinds of events
    - the exception table is an array of pointers to machine code ("places to jump")
        - the processor's design chooses in advance that entry K of the array corresponds to a particular kind of exception
    - the processor needs to have a pointer to the exception table
        - common mechanism: a register that points to the exception table's first element
- virtual pages
    - virtual addresses: the memory the program sees
    - virtual addresses need to be mapped to physical addresses so they can act like memory
    - dividing memory into pages:
        * we divide the virtual addresses into fixed-size chunks called "virtual pages"
        * the first (index 0) chunk starts at virtual address 0
        * if pages have size K, the second (index 1) chunk starts at virtual address K
        * likewise, the third (index 2) chunk starts at virtual address 2K
        --> chunks start at virtual address (page number * K)
        and we choose K to be a power of two ---> K = 1000...000 in base two
        --> so virtual pages always begin at addresses whose lower log_2(K) bits are 0
        --> call these bits the "page offset"
        something in the middle of a virtual page has non-zero lower bits somewhere
        --> if we strip off the page offset bits, we get the virtual page number
- Quiz 13 (VM) Q1: address of a PTE

    20-bit virtual addresses:

        <---------- 20 ---------->
        |    VPN    |     PO     |
        <--- 10 ----><---- 10 --->

    1024-byte pages ---> log_2(1024) = 10 page offset bits

    locate the PTE of VA 0x1000:

        VA 0x1000 = 0000 0001 00 | 00 0000 0000 (binary)
                     ^^^ VPN = 4   ^^^ page offset = 0

    VPN 4 ---> index 4 in the page table

    given: a page table
    base pointer set to physical (byte) address 0x1000, with 4-byte page table entries
    (a single-level page table structure):

        PTE for VPN 0 @ 0x1000
        PTE for VPN 1 @ 0x1004 = 0x1000 + 4 * 1
        PTE for VPN 2 @ 0x1008 = 0x1000 + 4 * 2
        ...
        PTE for VPN 4 @ 0x1000 + 4 * 4 = 0x1010

- Quiz 13 (VM) Q4 (PA result of lookup)
    42-bit virtual addresses, with a 30-bit virtual page number and a 12-bit page offset
    30-bit physical addresses, with an 18-bit physical page number and a 12-bit page offset
    three-level page tables, where 10 bits of the virtual page number are used for a lookup at each level
    Suppose that when looking up page table entries for the virtual address 0x012 3456 789A:
        the page table entry for the first level was valid and contained physical page number 0x99,
        the page table entry for the second level was valid and contained physical page number 0xA4, and
        the page table entry for the third level was valid and contained physical page number 0xC3
    WANT: the corresponding PA of the actual data
    ---> based on the last-level page table entry (end of the multi-level lookup)

        PA: |  PPN  |  PO  |
             ^^^ 0xC3 --- the one we found in the last level of the lookup
                     ^^^ 12 bits, the same as in the VA
                         (last twelve bits of the VA = 0x89A)
        --> PA = 0xC389A

- and Q5 (TLB tag bits)
    when we have a TLB hit:
        we give the TLB a VPN
        it finds a "block" in the TLB which contains just the PTE for that VPN
        and we skip the entire multi-level lookup
    the VPN needs to be split into tag/index/offset
        offset: for a TLB, always empty, because the block only has one entry (2^0 = 1 --> 0 offset bits)
            the TLB contains PTEs, not bytes of data
            each PTE is identified by a particular VPN (like in normal caches, where each byte is identified by a particular address)
        index: we can derive that there are 3 index bits
            8 ways --> 8 blocks / set
            64 blocks (= # entries) / (8 blocks / set) --> 8 sets = 2^3 sets --> 3 index bits
        tag: must be everything left over
            the VPN had 30 bits total - 3 bits for index --> 27 bits left over
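The three Quiz 13 calculations above (PTE address, PA construction, TLB index/tag split) are all small bit manipulations. A hedged Python sketch, where the helper names are ours rather than anything from the course:

```python
def pte_address(base, vpn, entry_size=4):
    # Single-level lookup: the PTE for a VPN lives at base + vpn * entry_size.
    return base + vpn * entry_size

def physical_address(ppn, va, offset_bits=12):
    # PA = last-level PPN concatenated with the VA's page-offset bits.
    return (ppn << offset_bits) | (va & ((1 << offset_bits) - 1))

def tlb_split_bits(entries, ways, vpn_bits):
    # Offset bits are always 0 (one PTE per block); index bits come from the
    # number of sets; the tag is whatever is left of the VPN.
    sets = entries // ways
    index_bits = sets.bit_length() - 1   # log2, since sets is a power of two
    return index_bits, vpn_bits - index_bits

print(hex(pte_address(0x1000, 4)))                  # 0x1010   (Quiz 13 Q1)
print(hex(physical_address(0xC3, 0x0123456789A)))   # 0xc389a  (Quiz 13 Q4)
print(tlb_split_bits(64, 8, 30))                    # (3, 27)  (Quiz 13 Q5)
```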
- TLB accesses
    when we have a TLB hit:
        we give the TLB a VPN
        the TLB splits up the VPN into index + tag
        it reads the entries in the set corresponding to that index and sees if any tags match
        it finds a "block" in the TLB which contains just the PTE for that VPN
        and we skip the entire multi-level lookup
    when we have a TLB miss:
        we give the TLB a VPN
        it finds nothing ("miss")
        we do the (potentially multi-level) lookup normally
        we tell the TLB about the result, and it stores it for next time,
        possibly evicting a PTE with the same index it has cached

- Quiz 2 Q3 -- ZF after assembly snippet

    long example(long a, long b) {
        while (a > b) {
            a = a - b;
        }
        return a;
    }

    example:
        cmpq %rsi, %rdi     *** <--- must have set the flags:
                                     computes rdi - rsi (that is, a - b, where %rdi
                                     also holds the eventual return value)
                                     --> ZF is 1 iff rdi == rsi
        jle L_done          ***
        subq %rsi, %rdi
        jmp example
    L_done:
        movq %rdi, %rax     ***
        ret                 *** <--- ZF is 1 iff a == b when we're done,
                                     same as: "the return value == b when we're done"

- Quiz 2 Q5 -- what the linker does

    .global array
    array:
        .byte 1
        .byte 2
        .byte 3
        .byte 4
    .text
    .global bar
    bar:
        cmpq $0, %rdi
        je end_bar
        movq $array, %rdi
        call print_array
    end_bar:
        movq $array, %rax
        ret

    When using the resulting object file to produce an executable (using the static (non-dynamic) linking scheme we discussed in lecture), the linker will _______. Select all that apply.
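The Quiz 2 Q3 conclusion above (ZF at `ret` is 1 exactly when the return value equals b) can be sanity-checked with a small Python model of the loop; the flag tracking here is our own sketch, not course code:

```python
def example_with_flags(a, b):
    # Mirrors the assembly: cmpq %rsi, %rdi computes a - b and sets ZF iff
    # a == b; jle exits the loop; subq updates a. The last flag-setting
    # instruction before ret is always that final cmpq.
    while a > b:
        a = a - b
    zf = (a == b)   # flags left by the final cmpq before falling through to ret
    return a, zf

for a, b in [(10, 3), (9, 3), (2, 5), (7, 7)]:
    ret, zf = example_with_flags(a, b)
    assert zf == (ret == b)   # ZF set exactly when the return value == b
```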
    Yes: "write the four bytes that are stored after the label array to the executable file"
        > we need the value of the array to end up in the executable, since it's the initial value of the global variable array
        > this is going to happen because the executable contains that value, and the linker needs to write the entire executable
          (if the linker doesn't put something in the executable file, it won't be there)
        > BUT: the linker is writing these bytes because it copies them from the object file;
          the linker does not know/need to know that they represent an array
    Yes: "find a symbol table entry for print_array in some other object file"
        > the linker needs to write the "call print_array" instruction in the executable
          the call instruction needs to refer to the *address* (not the name) of print_array
          since print_array isn't defined in the shown assembly file, it won't end up in *that* object file --> there must be another one
    No: "write the memory address (in some format) chosen for the 'call print_array' instruction somewhere in the executable"
        > this is the memory address OF the call instruction, NOT the memory address it contains
        > nothing needs to point to the call instruction itself
        > BUT something is going to point to the function (bar) that instruction is in (just not to the particular instruction in the middle)
    Yes: "write the memory address for array (in some format) in the resulting executable somewhere"
        > we need this for the movq instruction, which loads that address into %rdi

- Quiz 6 Q6: cycle time when changing CPU designs
    "Suppose a processor designer took a three-stage pipelined processor design with a cycle time of 800 ps and turned it into a six-stage processor pipelined design that implemented the same ISA. Both the three-stage and six-stage processors were designed to achieve about the best cycle times possible (despite the constraints that lead to diminishing returns as one tries to add pipeline stages to a processor).
    Which of the following would be the most likely cycle time for the six-stage processor design?"
        > start with the 3-stage, 800 ps design
        > we make it a 6-stage processor
            - we can't do this perfectly; we run into constraints that lead to diminishing returns
            - if we could do this perfectly, then we'd end up with a 6-stage, 400 ps design
        > but we can't do that perfectly
            - we could easily achieve a 6-stage, 800 ps processor by adding three stages at the end that do nothing
        > but if we can do anything at all that takes advantage of pipelining when going from 3 to 6 stages, we'll have a better cycle time than that
        best option: 500 ps
            --- this is not perfect (perfect would be very unlikely)
            --- but better than the silly design of adding stages that do nothing (why would we add stages if that were all we could do?)

- Quiz 10 Q3

    for (int i = 0; i < N; ++i) {
        for (int j = i; j < N; ++j) {
            C[j] += A[i*N+j] * B[j];
        }
    }

    estimate the number of misses for B:
        each iteration of the j loop accesses the element one after the previous element of B we accessed
        --> cache miss 1/K of the time, where K is the number of elements that fit in a block
            (1/K of the time, B[j-1] and B[j] are in different blocks)
        approx N * N/2 iterations of the j loop
            (N/2 b/c the j loop runs N-i iterations, which is N/2 on average)
    result: N*N/2/K (with the particular values of N, K from the question)
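The Quiz 10 Q3 estimate can be written out as a tiny calculation. The values of N and K below are illustrative stand-ins (the quiz's actual values aren't in these notes), where K is the number of B elements per cache block:

```python
def estimated_b_misses(N, K):
    # Total j-loop iterations: sum over i of (N - i), which is about N*N/2.
    iterations = sum(N - i for i in range(N))
    # Sequential accesses to B miss roughly once every K elements.
    return iterations / K

# e.g. N = 1000 elements, K = 8 elements/block (illustrative, not quiz values):
print(estimated_b_misses(1000, 8))   # close to the N*N/2/K estimate of 62500
```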