final exam

Answer each of the following questions. This exam is open-book and open-notes, but you may use only resources that were created before this exam was released (at noon eastern time on 10 December 2020). You may not collaborate with other students.

Please show your work for questions in the comment field were applicable so we are able to give you partial credit.

If you think a question is ambiguous or unclear, please make your best guess about what was meant and explain what you did in the comments field for the question. We are unlikely to be able to answer your inquiries during the exam time.

For the following questions, consider this assembly snippet:

    movq $0, %rax               /* line 1 */
    movl $0x12345678, %r9d      /* line 2 */
start_loop:                     /* line 3 */
    movl 0(%rbx,%rax,8), %r8d   /* line 4 */
    xorl %r9d, %r8d             /* line 5 */
    movl %r8d, 0(%rbx,%rax,8)   /* line 6 */
    addq $1, %rax               /* line 7 */
    cmpq $100, %rax             /* line 8 */
    jne start_loop              /* line 9 */
Question 1 (12 pt; mean 7.25) (see above)

Assuming:

  • %rbx represents an int* (pointer to int) named array
  • %rax represents a long named i

write C code or pseudocode using a while or for loop that is equivalent to the assembly loop:

Answer:
Key:

for (long i = 0; i < 100; ++i) array[i*2] ^= 0x12345678;

Question 2 (4 pt; mean 3.23) (see above)

Which of the following transformations of the above loop would result in an unrolled loop which writes the same values to memory but executes fewer instructions? Select all that apply.

Question 3 (4 pt; mean 3.22) (see above)

Which of the following changes would not change what values the loop writes to memory? Select all that apply.

Question 4 (3 pt; mean 2.26) (see above)

When line 9 executes for the second time, the values of SF and ZF will be:

Question 5 (4 pt; mean 3.03)

Suppose we were generating an object file for the following Y86-64 snippet with a relocation table and symbol table to permit linking the object file along with other object files to generate an executable:

example_function3:
    irmovq $100, %rsi
    call example_function1
    rrmovq %rax, %rsi
    call example_function2
    ret

Relocation table entries in the resulting object file would likely contain a symbol (or label) name and a location in the machine code. Which of the following would be likely relocation table entries in the resulting object file? Select all that apply.

For the following questions, consider a six-stage pipelined Y86-64 processor implementation that uses the following stages:

Everything but the execute stages work as in the five-stage pipelined processor we implemented (including branch prediction that predicts all branches as taken and forwarding). For the execute stages:

Question 6 (4 pt; mean 2.2) (see above)

Consider the following assembly snippet executing on the procesor described above

addq %rcx, %rdx             /* line 1 */
xorq %rcx, %rdx             /* line 2 */
mrmovq 8(%rdx), %rax        /* line 3 */
rmmovq %rbx, 16(%rax)       /* line 4 */
subq %rax, %rbx             /* line 5 */

If the addq instruction on line 1 runs its fetch stage during cycle number 0, then during what cycle number will the subq instruction on line 5 run its writeback stage?

Answer:
Key: 13
.                           0  1  2  3   4  5
addq %rcx, %rdx             F  D  E1 E2  M  W  6  7
xorq %rcx, %rdx                F  x  D   E1 E2 M  W  8  9
mrmovq 8(%rdx), %rax                 F   x  D  E1 E2 M  W  10 11 12
rmmovq %rbx, 16(%rax)                       F  x  x  D  E1 E2 M  W  13
 subq %rax, %rbx                                      F  D  E1 E2 M  W
Question 7 (4 pt; mean 2.05) (see above)

Consider the following assembly snippet executing on the procesor described above

    xorq %r8, %r9               /* line 1 */
    je after                    /* line 2 */
    mrmovq 8(%r10), %r11        /* line 3 */
after:
    rmmovq %r12, 16(%r13)       /* line 4 */
    subq %rax, %rbx             /* line 5 */

Suppose %r8 is initially 15 and %r9 is initially 20. If the the xorq instruction on line 1 runs its fetch stage during cycle number 0, then during what cycle number will the subq instruction on line 5 run its writeback stage?

Answer:
Key: 11
.                           0  1 2  3  4  5
xorq %r8, %r9               F  D E1 E2 M  W  6 
je after                       F D  E1 E2 M  W  7  8  9
mrmovq 8(%r10), %r11                   F  D  E1 E2 M  W
after:                                                   10
rmmovq %r12, 16(%r13)            F  D     F  D  E1 E2 M  W  11
subq %rax, %rbx                     F        F  D  E1 E2 M  W
Question 8 (6 pt; mean 5.01) (see above)

Consider the following assembly snippet executing on the processor described above:

addq %r8, %r9              /* line 1 */
addq %r9, %r8              /* line 2 */
nop                        /* line 3 */
nop                        /* line 4 */
nop                        /* line 5 */
mrmovq 8(%r8), %r10        /* line 6 */
subq %r9, %r10             /* line 7 */

Which of the following forwarding must occur to execute the above correctly (with minimum stalling)?

For the following questions, consider the process of adding a pcaddq REG1 instruction that would jump to valP plus the value of REG1. For example, given the assembly snippet:

pcaddq %rax
nop

if the nop instruction was stored at address 0x1000 and %rax had the value 0x24, then the next instruction executed would be 0x1024.

Question 9 (4 pt; mean 2.71) (see above)

Which of the following encoding would be possible and most appropriate for avoiding adding additional inputs to MUXes or additional MUXes to the single-cycle Y86-64 procesor design we discussed in lecture?

(In each of the encodings , values of each byte are provided with the most significant bits written first (left-most).)

Question 10 (3 pt; mean 2) (see above)

When this instruction is executing, the data memory will

Question 11 (3 pt; mean 1.57) (see above)

When this instruction is executing, one of the register file's destination register number inputs (reg_dstE and reg_dstM) will be REG_NONE (the constant 0xF) and the other will be:

Question 12 (4 pt; mean 2.44)

Suppose a five-stage pipelined processor has a cycle time of 800 ps. When running a particular benchmark program, 10% of the instructions require exactly one cycle of stalling to resolve hazards and 5% of the instructions require exactly two cycles of stalling to resolve hazards, and no instructions require any more cycles of stalling. By changing to a six-stage design, one can reduce the cycle time but 15% (instead of 10%) of the instructions in the benchmark program would require exactly one cycle of stalling (and 5% would still require exactly two cycles of stalling). For this change to make the benchmark program run at least as fast as on the five-stage processor, about what is the minimum cycle time it should have? Write your answer as a base-10 number of picoseconds, rounded to the nearest picosecond. You may assume the benchmark program has a very large number of instructions (so, for example, the amount of time spent waiting for the first instructions of the benchmark to finish is negligible).

Answer:
Key: 768

1.20 cycles per instruction with 5 cycle implies 960 ps/instruction; 1.25 cycles per instruction with 6 cycle implies 1.25x ps/instruction = 960, solve for x = 768

Consider a system with:

(For this questions KB = 1024 bytes, and B = 1 byte.)

The following questions consider the operation of reading a 4-byte little endian value from the virtual address 0x12345678 on this system. (Note that some leading zeroes of the virtual address might not be written.) For each of the questions, assume the corresponding physical (byte) address is 0xABCDD678.

Question 13 (3 pt; mean 1.63) (see above)

What is the size of virtual addresses on this system in bits?

Answer:
Key: 43

8KB page tables/8 byte PTEs = 1024 entries per page table --> 10-bit VPN parts; since 3 levels, virtual paeg numbers are 30 bits in total. Since pages are 8KB page offsets are 13 bits. Virtual addresses are composed from a virtual pgae number and page offset, so that's 30 + 13 = 43 bits

Question 14 (4 pt; mean 2.08) (see above)

If accessing the value requires a TLB lookup, what set index of the TLB will be accessed?

Answer:
Key: 2

(0x12345678 >> 13) & 0x3 (2 TLB index bits)

Question 15 (4 pt; mean 1.7) (see above)

If accessing the value requires a page table lookup and the third-level page table (that is, the last one used in the multi-level lookup) used in that lookup starts at physical (byte) address 0x400000, what would the physical address of the corresponding (third-level) page table entry be? (Write your answer as a hexadecimal number.)

Answer:
Key: 0x400d10

0x12345678 has VPN 0x91a2. The 10 least significant bits of this are 0x1A2. We multiply that times the PTE size (8 bytes) to get the offset in bytes in the page table to get 0xD10.

Question 16 (4 pt; mean 1.42) (see above)

If accessing the value requires a page table lookup and the third-level page table (that is, the last one used in the multi-level lookup) used in that lookup starts at physical (byte) address 0x400000, then what physical page number was in the second-level page table entry accessed as part of this lookup? (Write your answer as a hexadecimal number.)

Answer:
Key: 0x200

physical address 0x400000 has PPN 0x200 (= 0x400000 >> 13, the page offset size) and page offset 0x0

Question 17 (4 pt; mean 1.49) (see above)

If accessing the value results in a hit in the L1 data cache, and the data in the corresponding block in the L1 data cache is (written as hexadecimal, lowest address first):

01 23 45 67 02 34 56 78
03 34 56 78 04 56 78 9A
04 56 78 9A 05 67 89 AB
06 78 9A BC 07 89 AB CD

then what 4-byte value will be read? Assume values are read in little endian and write your answer as a hexadecimal number. (Like a hexadecimal constant in C or Python, your hexadecimal number should have the one's place written right-most.)

Answer:
Key: 0xBC9A7806

cache block ranges from physical addresses 0xABCDD660 through 0xABCDD680, and it's bytes 24-28 of it (the offset bits of the physical address are 0xABCDD678 & 0x1F == 24)

Question 18 (12 pt; mean 4.41) (see above)

Write a C function int couldHaveBeenEvicted(unsigned long p) that takes in a physical address p, and returns true if and only if the value at p could be evicted from the L2 cache as a result of an accessing the value at physical address 0xABCDD678. (Since we're asking whether whether it could have been evicted, you can assume the value at p is part of the least recently used block in the set, or any other assumptions that would make eviction more likely.) Your C function must not use standard library functions (but you can definite temporary variables, use all the C arithmetic and bitwise operators, etc.)

Answer:
Key:

assuming an inclusive L2 cache: int ref_index = (0xABCDD678 >> 5) & 0xFFF; int index = (p >> 5) & 0xFFF; int ref_tag = (0xABCDD678 >> 5) >> 12; int tag = p >> (5+12); return index == ref_index && tag != ref_tag; (for an exclusive cache, omit the tag check; we didn't expect you to know about exclusive caches, but accepted both answers)

Question 19 (4 pt; mean 1.3)

Suppose a system:

If a program's page tables are setup so that the only valid pages are the ones with virtual page number 0x00001, 0x00002, 0x00003, 0xFFDC0, 0xFF0C1, 0xFF0C2, 0xFF0C3, and 0xFFFFF (all hexadecimal page numbers), then what is the minimum of number of second-level page tables used to do this? (Count only second-level page tables; do not count the first-level page table.)

Answer:
Key: 3

second-level page table for 0x00000-0x003FF and for 0xFF000-0xFF3FF and for 0xFFC00-0xFFFF

For the following questions, consider a 64KB, 4-way set associative cache with 128-byte blocks, a random replacement policy, and write-back, write-allocate policy on a system with 64-bit physical addresses. (1KB = 1024 bytes)

Question 20 (5 pt; mean 3.62) (see above)

Suppose the following 1-byte accesses occur in this cache in the given order, and the cache is empty before these accesses occur:

  • reading address 0x00157E
  • writing address 0x00157F
  • reading address 0x001000
  • writing address 0x001001
  • writing address 0x001002
  • reading address 0x101000
  • reading address 0x201000
  • reading address 0x301000
  • reading address 0x401000

(Note that not all leading zeros of addresses are written out.)

Which of the following statements are true about the results of this access pattern?

Question 21 (6 pt; mean 1.83) (see above)

In addition to 64KB of storage for data, the cache needs storage for metadata. How much metadata storage, in bits, will it need? Write your answer as a base-10 number. Remember to show your work in the comments field.

Answer:
Key: 26624

7 offset bits, 7 index bits --> 50 tag bits; 64K/128 = 512 cache blocks * (1 valid bit + 1 dirty bit + 50 tag bits) = 26624 bits

Consider the following C function:

void foo(int *result, int *vector1, int *vector2, int N) {
    for (int i = 0; i < N; ++i) {
        int temp = vector1[i];
        for (int j = 0; j < N; ++j) {
            result[j * N + i] = temp * vector2[j];
        }
    }
}

For the following questions, assume:

Question 22 (3 pt; mean 1.79) (see above)

Assuming a 16KB, 4-way data cache with 4B blocks, would a write-allocate or write-no-allocate policy be better for the function foo above?

Since we write 4 byte values (full cache blocks) and we write to a value once and never read from it, a write-allocate policy can only hurt performance: we may bring in cache blocks for result that could evict cached parts of vector1 and vector2 we want to read later.

Question 23 (8 pt; mean 4.09) (see above)

Assuming a 32KB, 8-way data cache with 8B blocks and a write-back, write-allocate policy, write code or pseudocode for a modified version of the function foo that has improved cache performance (fewer cache misses) but computes the same values.

Answer:
Key:

change loop order or cache blocking

For the following questions, consider the following x86-64 assembly snippet:

movq $3, %rax
imulq %r8, %rax
imulq %r9, %rax
imulq %r10, %rax
addq %rax, %rcx
movq $3, %rax
imulq %r11, %rax
addq %rax, %rcx

imulq REG1, REG2 multiplies the value of REG1 and REG2 together and stores the result in REG2.

Suppose the above assembly snippet is run on an out-of-order processor with the following execution units:

Question 24 (8 pt; mean 4.14) (see above)

Compute the minimum number of cycles the processor described above would take to complete the arithmetic in the assembly snippet above. Do not include time needed for fetching instructions, executing the movq instructions, forwarding values, or reading/writing registers. Show your work in the comments field.

Answer:
Key: 14

4 times 3 cycles for the 3*r8*r9*r10=raxV1 multiplication chain (which can happen in parallel with the 3*r11 computation) and an additional cycles for adding raxV1 to rcx and an additional cycle for adding that value to 3*r11

Question 25 (8 pt; mean 4.69) (see above)

Rewrite the above assembly snippet, using only movq, two-argument imulq, and addq instructions, to reduce the time required for arithmetic on the processor described above. Your rewritten assembly snippet must produce the same result in %rcx. (We do not care if other registers values end up being the same.)

Answer:
Key:

convert ((3*r8)*r9)*r10 to (3*r8)*(r9*r10) or similar. Or use repeated addqs to multiply by 3. Or factor out the multiply by 3 to do one less multiply

Question 26 (4 pt; mean 2.47)

We discussed how virtual memory can be used to allow physical memory to be used as a cache for slower storage like hard drives and solid state disks. When this happens and a program tries to access a value in virtual memory that is not loaded into physical memory, which of the following is likely to happen as a result of this access? Select all that apply.

Question 27 (4 pt; mean 2.62)

The abstraction of process gives programs the illusion of a dedicated machine, including a dedicated processor and main memory, even though the underlying may be shared between processes. As part of implementing processes by sharing a single processor and its memory over time, the operating system will change which process is active on the processor. Which of the following is true about this procedure?

Question 28 (4 pt; mean 2.52)

Suppose a system has:

and when a miss occurs in the L1 or L2 cache, the cache must wait until the hit time completes before it accesses the next level of cache or main memory (so the total access time on a miss is the sum of the hit time and the next-level access time).

On a benchmark program the L1 cache experiences a 90% hit rate and the L2 cache experiences a 50% hit rate. What was the average memory access time experienced by the benchmark program in cycles? (Round your answer to the nearest tenth of a cycle. You may assume all the memory accesses in the benchmark program go through the L1 cache.)

Answer:
Key: 9

For the following questions, consider the following circuit:

Assume:

Question 29 (4 pt; mean 2.49) (see above)

If the registers X, Y, and Z initially store 1, then what value will X store after 2 rising edges of the clock signal?

Answer:
Key: 7
Question 30 (3 pt; mean 1.83) (see above)

Suppose the 64-bit registers have a 50 ps register delay and the adders take 500 ps to produce a stable result. What is the minimum cycle time for the circuit? Write your answer as a base-10 number of picoseconds.

Answer:
Key: 1050