final exam

Answer each of the following questions. This exam is open-book and open-notes, but you may use only resources that were created before this exam was released (at noon eastern time on 10 December 2020). You may not collaborate with other students.

Please show your work for questions in the comment field were applicable so we are able to give you partial credit.

If you think a question is ambiguous or unclear, please make your best guess about what was meant and explain what you did in the comments field for the question. We are unlikely to be able to answer your inquiries during the exam time.

For the following questions, consider this assembly snippet:

    movq $0, %rax               /* line 1 */
    movl $0x12345678, %r9d      /* line 2 */
start_loop:                     /* line 3 */
    movl 0(%rbx,%rax,8), %r8d   /* line 4 */
    xorl %r9d, %r8d             /* line 5 */
    movl %r8d, 0(%rbx,%rax,8)   /* line 6 */
    addq $1, %rax               /* line 7 */
    cmpq $100, %rax             /* line 8 */
    jne start_loop              /* line 9 */

Question 1 (12 pt; mean 7.25) (see above)

Assuming:

%rbx represents an int* (pointer to int) named array
%rax represents a long named i

write C code or pseudocode using a while or for loop that is equivalent to the assembly loop:

Answer:

Key:

for (long i = 0; i < 100; ++i) array[i*2] ^= 0x12345678;

Question 2 (4 pt; mean 3.23) (see above)

Which of the following transformations of the above loop would result in an unrolled loop which writes the same values to memory but executes fewer instructions? Select all that apply.

67%
⊤
placing a duplicate of lines 4-7 (inclusive) after line 7
14%
placing a duplicate of lines 4-6 (inclusive) after line 6 and replacing the 1 in line 7 with a 2
85%
⊤
placing a duplicate of lines 4-6 (inclusive) with 0 replaced with 8 after line 6 and replacing the 1 in line 7 with 2
16%
placing a duplicate of line 4 with 0 replaced with 8 after line 4, placing a duplicate of line 5 after line 5, placing a duplicate of line 6 with 0 replaced with 8 after line 6, and replacing the 1 in line 7 with 2

Question 3 (4 pt; mean 3.22) (see above)

Which of the following changes would not change what values the loop writes to memory? Select all that apply.

12%
replacing line 9 with jg start_loop
86%
⊤
replacing line 9 with jl start_loop
67%
⊤
replacing line 8 with two lines: movq $100, %r10; xorq %rax, %r10
80%
⊤
replacing line 8 with two lines: movq $100, %r10; subq %rax, %r10

Question 4 (3 pt; mean 2.26) (see above)

When line 9 executes for the second time, the values of SF and ZF will be:

75%
⊤
SF = 1, ZF = 0
8%

SF = 0, ZF = 1
14%

SF = 0, ZF = 0
0%

something else / it depends (explain in comments)

Question 5 (4 pt; mean 3.03)

Suppose we were generating an object file for the following Y86-64 snippet with a relocation table and symbol table to permit linking the object file along with other object files to generate an executable:

example_function3:
    irmovq $100, %rsi
    call example_function1
    rrmovq %rax, %rsi
    call example_function2
    ret

Relocation table entries in the resulting object file would likely contain a symbol (or label) name and a location in the machine code. Which of the following would be likely relocation table entries in the resulting object file? Select all that apply.

30%
an entry for example_function1 specifying the first byte of the first call instruction
75%
⊤
an entry for example_function1 specifying the second byte of the first call instruction
26%
an entry for example_function3 specifying the first byte of the irmovq instruction
17%
an entry for example_function3 specifying the third byte of the irmovq instruction

For the following questions, consider a six-stage pipelined Y86-64 processor implementation that uses the following stages:

fetch
decode
execute part 1
execute part 2
memory
writeback

Everything but the execute stages work as in the five-stage pipelined processor we implemented (including branch prediction that predicts all branches as taken and forwarding). For the execute stages:

the results of any arithmetic is not available until the execute part 2 stage. Consequently, if the execute stages compute a value that value cannot be used for another computation, memory access, etc. until the cycle after the execute part 2 stage that computes the value.
whether or not any conditional jump is taken is determined in the execute part 1 stage. If the conditional jump was mispredicted, the correct instruction is fetched udring the conditional jump's execute part 2 stage.

Question 6 (4 pt; mean 2.2) (see above)

Consider the following assembly snippet executing on the procesor described above

addq %rcx, %rdx             /* line 1 */
xorq %rcx, %rdx             /* line 2 */
mrmovq 8(%rdx), %rax        /* line 3 */
rmmovq %rbx, 16(%rax)       /* line 4 */
subq %rax, %rbx             /* line 5 */

If the addq instruction on line 1 runs its fetch stage during cycle number 0, then during what cycle number will the subq instruction on line 5 run its writeback stage?

Answer:

Key: 13

.                           0  1  2  3   4  5

addq %rcx, %rdx             F  D  E1 E2  M  W  6  7

xorq %rcx, %rdx                F  x  D   E1 E2 M  W  8  9

mrmovq 8(%rdx), %rax                 F   x  D  E1 E2 M  W  10 11 12

rmmovq %rbx, 16(%rax)                       F  x  x  D  E1 E2 M  W  13

 subq %rax, %rbx                                      F  D  E1 E2 M  W

Question 7 (4 pt; mean 2.05) (see above)

Consider the following assembly snippet executing on the procesor described above

    xorq %r8, %r9               /* line 1 */
    je after                    /* line 2 */
    mrmovq 8(%r10), %r11        /* line 3 */
after:
    rmmovq %r12, 16(%r13)       /* line 4 */
    subq %rax, %rbx             /* line 5 */

Suppose %r8 is initially 15 and %r9 is initially 20. If the the xorq instruction on line 1 runs its fetch stage during cycle number 0, then during what cycle number will the subq instruction on line 5 run its writeback stage?

Answer:

Key: 11

.                           0  1 2  3  4  5

xorq %r8, %r9               F  D E1 E2 M  W  6

je after                       F D  E1 E2 M  W  7  8  9

mrmovq 8(%r10), %r11                   F  D  E1 E2 M  W

after:                                                   10

rmmovq %r12, 16(%r13)            F  D     F  D  E1 E2 M  W  11

subq %rax, %rbx                     F        F  D  E1 E2 M  W

Question 8 (6 pt; mean 5.01) (see above)

Consider the following assembly snippet executing on the processor described above:

addq %r8, %r9              /* line 1 */
addq %r9, %r8              /* line 2 */
nop                        /* line 3 */
nop                        /* line 4 */
nop                        /* line 5 */
mrmovq 8(%r8), %r10        /* line 6 */
subq %r9, %r10             /* line 7 */

Which of the following forwarding must occur to execute the above correctly (with minimum stalling)?

13%
the value of %r8 from the addq in line 1 to the addq in line 2
86%
⊤
the value of %r9 from the addq in line 1 to the addq in line 2
14%
the value of %r9 from the addq in line 1 to the subq in line 7
68%
⊤
the value of %r8 from the addq in line 2 to the mrmovq in line 6
9%
the value of %r8 from the addq in line 2 to the subq in line 7
83%
⊤
the value of %r10 from the rmovq in line 6 to the subq in line 7

For the following questions, consider the process of adding a pcaddq REG1 instruction that would jump to valP plus the value of REG1. For example, given the assembly snippet:

pcaddq %rax
nop

if the nop instruction was stored at address 0x1000 and %rax had the value 0x24, then the next instruction executed would be 0x1024.

Question 9 (4 pt; mean 2.71) (see above)

Which of the following encoding would be possible and most appropriate for avoiding adding additional inputs to MUXes or additional MUXes to the single-cycle Y86-64 procesor design we discussed in lecture?

(In each of the encodings , values of each byte are provided with the most significant bits written first (left-most).)

5%

[4-bit icode][4 bit register nubmer]
67%
⊤
[4 bit icode][4 unused bits] (1st byte) then [4-bit register number][4 unused bits]
7%

[4 bit icode][4 bit register number] (1st byte) then 8 bytes of zeroes
10%

[4 bit icode][4 bit condition code] (1st byte) then [4 unused bits][4 unused bits] then [64-bit register value]
7%

none of the above are possible; they are all missing information that must be present in the instruction's machine code

Question 10 (3 pt; mean 2) (see above)

When this instruction is executing, the data memory will

67%
⊤
not need to do anything
24%

perform a read
6%

perform a write

Question 11 (3 pt; mean 1.57) (see above)

When this instruction is executing, one of the register file's destination register number inputs (reg_dstE and reg_dstM) will be REG_NONE (the constant 0xF) and the other will be:

52%
⊤
also REG_NONE (0xF)
7%

equal to a constant value that is not REG_NONE
8%

equal to part of the output of the instruction memory
13%

equal to the output of the ALU (or part of the output of the ALU)
5%

equal to the value of the program counter (or part of the value of the program counter)
8%

equal to the result of adding a constant to the value of the program counter
1%

none of the above (explain in comments)

Question 12 (4 pt; mean 2.44)

Suppose a five-stage pipelined processor has a cycle time of 800 ps. When running a particular benchmark program, 10% of the instructions require exactly one cycle of stalling to resolve hazards and 5% of the instructions require exactly two cycles of stalling to resolve hazards, and no instructions require any more cycles of stalling. By changing to a six-stage design, one can reduce the cycle time but 15% (instead of 10%) of the instructions in the benchmark program would require exactly one cycle of stalling (and 5% would still require exactly two cycles of stalling). For this change to make the benchmark program run at least as fast as on the five-stage processor, about what is the minimum cycle time it should have? Write your answer as a base-10 number of picoseconds, rounded to the nearest picosecond. You may assume the benchmark program has a very large number of instructions (so, for example, the amount of time spent waiting for the first instructions of the benchmark to finish is negligible).

Answer:

Key: 768

1.20 cycles per instruction with 5 cycle implies 960 ps/instruction; 1.25 cycles per instruction with 6 cycle implies 1.25x ps/instruction = 960, solve for x = 768

Consider a system with:

an 4-way set associative, 64KB L1 (level 1) data cache with 32B cache blocks that uses physical addresses
a 8-way set associative, 1MB L2 (level 2) cache shared between instructions and data with 32B cache blocks that uses physical addresses
a 12-way set associative TLB which can hold up to 48 page table entries
all caches use a replacement policy that approximates LRU
8KB pages
8 byte page table entries
a 3-level page table structure, where a page table at any level are one page in size

(For this questions KB = 1024 bytes, and B = 1 byte.)

The following questions consider the operation of reading a 4-byte little endian value from the virtual address 0x12345678 on this system. (Note that some leading zeroes of the virtual address might not be written.) For each of the questions, assume the corresponding physical (byte) address is 0xABCDD678.

Question 13 (3 pt; mean 1.63) (see above)

What is the size of virtual addresses on this system in bits?

Answer:

Key: 43

8KB page tables/8 byte PTEs = 1024 entries per page table --> 10-bit VPN parts; since 3 levels, virtual paeg numbers are 30 bits in total. Since pages are 8KB page offsets are 13 bits. Virtual addresses are composed from a virtual pgae number and page offset, so that's 30 + 13 = 43 bits

Question 14 (4 pt; mean 2.08) (see above)

If accessing the value requires a TLB lookup, what set index of the TLB will be accessed?

Answer:

Key: 2

(0x12345678 >> 13) & 0x3 (2 TLB index bits)

Question 15 (4 pt; mean 1.7) (see above)

If accessing the value requires a page table lookup and the third-level page table (that is, the last one used in the multi-level lookup) used in that lookup starts at physical (byte) address 0x400000, what would the physical address of the corresponding (third-level) page table entry be? (Write your answer as a hexadecimal number.)

Answer:

Key: 0x400d10

0x12345678 has VPN 0x91a2. The 10 least significant bits of this are 0x1A2. We multiply that times the PTE size (8 bytes) to get the offset in bytes in the page table to get 0xD10.

Question 16 (4 pt; mean 1.42) (see above)

If accessing the value requires a page table lookup and the third-level page table (that is, the last one used in the multi-level lookup) used in that lookup starts at physical (byte) address 0x400000, then what physical page number was in the second-level page table entry accessed as part of this lookup? (Write your answer as a hexadecimal number.)

Answer:

Key: 0x200

physical address 0x400000 has PPN 0x200 (= 0x400000 >> 13, the page offset size) and page offset 0x0

Question 17 (4 pt; mean 1.49) (see above)

If accessing the value results in a hit in the L1 data cache, and the data in the corresponding block in the L1 data cache is (written as hexadecimal, lowest address first):

01 23 45 67 02 34 56 78
03 34 56 78 04 56 78 9A
04 56 78 9A 05 67 89 AB
06 78 9A BC 07 89 AB CD

then what 4-byte value will be read? Assume values are read in little endian and write your answer as a hexadecimal number. (Like a hexadecimal constant in C or Python, your hexadecimal number should have the one's place written right-most.)

Answer:

Key: 0xBC9A7806

cache block ranges from physical addresses 0xABCDD660 through 0xABCDD680, and it's bytes 24-28 of it (the offset bits of the physical address are 0xABCDD678 & 0x1F == 24)

Question 18 (12 pt; mean 4.41) (see above)

Write a C function int couldHaveBeenEvicted(unsigned long p) that takes in a physical address p, and returns true if and only if the value at p could be evicted from the L2 cache as a result of an accessing the value at physical address 0xABCDD678. (Since we're asking whether whether it could have been evicted, you can assume the value at p is part of the least recently used block in the set, or any other assumptions that would make eviction more likely.) Your C function must not use standard library functions (but you can definite temporary variables, use all the C arithmetic and bitwise operators, etc.)

Answer:

Key:

assuming an inclusive L2 cache: int ref_index = (0xABCDD678 >> 5) & 0xFFF; int index = (p >> 5) & 0xFFF; int ref_tag = (0xABCDD678 >> 5) >> 12; int tag = p >> (5+12); return index == ref_index && tag != ref_tag; (for an exclusive cache, omit the tag check; we didn't expect you to know about exclusive caches, but accepted both answers)

Question 19 (4 pt; mean 1.3)

Suppose a system:

has 32-bit virtual addresses
uses 12-bit page offsets (implying 4096 byte pages)
uses two-level page tables with 1024 entries in tables at each level (so 10 bits of the VPN are used for lookups at each level)

If a program's page tables are setup so that the only valid pages are the ones with virtual page number 0x00001, 0x00002, 0x00003, 0xFFDC0, 0xFF0C1, 0xFF0C2, 0xFF0C3, and 0xFFFFF (all hexadecimal page numbers), then what is the minimum of number of second-level page tables used to do this? (Count only second-level page tables; do not count the first-level page table.)

Answer:

Key: 3

second-level page table for 0x00000-0x003FF and for 0xFF000-0xFF3FF and for 0xFFC00-0xFFFF

For the following questions, consider a 64KB, 4-way set associative cache with 128-byte blocks, a random replacement policy, and write-back, write-allocate policy on a system with 64-bit physical addresses. (1KB = 1024 bytes)

Question 20 (5 pt; mean 3.62) (see above)

Suppose the following 1-byte accesses occur in this cache in the given order, and the cache is empty before these accesses occur:

reading address 0x00157E
writing address 0x00157F
reading address 0x001000
writing address 0x001001
writing address 0x001002
reading address 0x101000
reading address 0x201000
reading address 0x301000
reading address 0x401000

(Note that not all leading zeros of addresses are written out.)

Which of the following statements are true about the results of this access pattern?

77%
⊤
there was a cache hit when writing to address 0x00157F
77%
⊤
the value of address 0x00157F may be stored in the cache afterwards
21%
the value of address 0x00157F may have been written to the next level of cache or main memory
67%
⊤
the value of address 0x001000 may be stored in the cache afterwards
62%
⊤
the value of address 0x001000 may have been written to the next level of cache or main memory

Question 21 (6 pt; mean 1.83) (see above)

In addition to 64KB of storage for data, the cache needs storage for metadata. How much metadata storage, in bits, will it need? Write your answer as a base-10 number. Remember to show your work in the comments field.

Answer:

Key: 26624

7 offset bits, 7 index bits --> 50 tag bits; 64K/128 = 512 cache blocks * (1 valid bit + 1 dirty bit + 50 tag bits) = 26624 bits

Consider the following C function:

void foo(int *result, int *vector1, int *vector2, int N) {
    for (int i = 0; i < N; ++i) {
        int temp = vector1[i];
        for (int j = 0; j < N; ++j) {
            result[j * N + i] = temp * vector2[j];
        }
    }
}

For the following questions, assume:

the code the compiler generates for this function only accesses the data cache for the reads/writes to result, vector1, and vector2 shown, and does not reorder or omit or combine any reads or writes;
result, vector1, and vector2 are pointers to non-overlapping arrays, and each of the pointers is a multiple of 16384 (that is, two to the 14th power)
ints are 4 bytes;
1KB = 1024 bytes and 1B = 1 byte;

Question 22 (3 pt; mean 1.79) (see above)

Assuming a 16KB, 4-way data cache with 4B blocks, would a write-allocate or write-no-allocate policy be better for the function foo above?

24%

write-allocate
58%
⊤
write-no-allocate
12%

they would be approximately the same

Since we write 4 byte values (full cache blocks) and we write to a value once and never read from it, a write-allocate policy can only hurt performance: we may bring in cache blocks for result that could evict cached parts of vector1 and vector2 we want to read later.

Question 23 (8 pt; mean 4.09) (see above)

Assuming a 32KB, 8-way data cache with 8B blocks and a write-back, write-allocate policy, write code or pseudocode for a modified version of the function foo that has improved cache performance (fewer cache misses) but computes the same values.

Answer:

Key:

change loop order or cache blocking

For the following questions, consider the following x86-64 assembly snippet:

movq $3, %rax
imulq %r8, %rax
imulq %r9, %rax
imulq %r10, %rax
addq %rax, %rcx
movq $3, %rax
imulq %r11, %rax
addq %rax, %rcx

imulq REG1, REG2 multiplies the value of REG1 and REG2 together and stores the result in REG2.

Suppose the above assembly snippet is run on an out-of-order processor with the following execution units:

two single-cycle ALUs that can perform additions, and
a pipelined multiplier that produces a result after four cycles (so if a pair of values to multiply is input during cycle 1, then a result will be produced in cycle 4 and available to use in other arithmetic during cycle 5)

Question 24 (8 pt; mean 4.14) (see above)

Compute the minimum number of cycles the processor described above would take to complete the arithmetic in the assembly snippet above. Do not include time needed for fetching instructions, executing the movq instructions, forwarding values, or reading/writing registers. Show your work in the comments field.

Answer:

Key: 14

4 times 3 cycles for the 3*r8*r9*r10=raxV1 multiplication chain (which can happen in parallel with the 3*r11 computation) and an additional cycles for adding raxV1 to rcx and an additional cycle for adding that value to 3*r11

Question 25 (8 pt; mean 4.69) (see above)

Rewrite the above assembly snippet, using only movq, two-argument imulq, and addq instructions, to reduce the time required for arithmetic on the processor described above. Your rewritten assembly snippet must produce the same result in %rcx. (We do not care if other registers values end up being the same.)

Answer:

Key:

convert ((3*r8)*r9)*r10 to (3*r8)*(r9*r10) or similar. Or use repeated addqs to multiply by 3. Or factor out the multiply by 3 to do one less multiply

Question 26 (4 pt; mean 2.47)

We discussed how virtual memory can be used to allow physical memory to be used as a cache for slower storage like hard drives and solid state disks. When this happens and a program tries to access a value in virtual memory that is not loaded into physical memory, which of the following is likely to happen as a result of this access? Select all that apply.

66%
a system call will occur
74%
⊤
some code that is part of the operating system will run
27%
after the value is loaded into physical memory, the program will restart execution from the beginning
66%
⊤
the valid bit in the page table entry corresponding to the virtual address will change from 0 to 1

Question 27 (4 pt; mean 2.62)

The abstraction of process gives programs the illusion of a dedicated machine, including a dedicated processor and main memory, even though the underlying may be shared between processes. As part of implementing processes by sharing a single processor and its memory over time, the operating system will change which process is active on the processor. Which of the following is true about this procedure?

72%
⊤
the operating system can cause a process running on the processor to stop running even if the code the process is running doesn't call the operating system
78%
⊤
the code in the operating system that switches which process is active is likely to be run in kernel mode
43%
on a processor with many extra physical registers (like most out-of-order processors), the operating system will save the registers of the previous process using these extra registers as it switches to the next process
45%
if the two processes both use the same memory addresses in their machine code (e.g. for two different global variables), then, to prevent the programs from interfering with each other, the operating system will need to unload one process from memory (e.g. by copying some of its data to disk) and load the other as part of switching from one process to another.

Question 28 (4 pt; mean 2.52)

Suppose a system has:

a 16KB 4-way set-associative L1 cache with 64-byte blocks and a 3-cycle hit time
a 1MB 8-way set-associative L2 cache with 64-byte blocks and a 10-cycle hit time
main memory with a 100-cycle access time

and when a miss occurs in the L1 or L2 cache, the cache must wait until the hit time completes before it accesses the next level of cache or main memory (so the total access time on a miss is the sum of the hit time and the next-level access time).

On a benchmark program the L1 cache experiences a 90% hit rate and the L2 cache experiences a 50% hit rate. What was the average memory access time experienced by the benchmark program in cycles? (Round your answer to the nearest tenth of a cycle. You may assume all the memory accesses in the benchmark program go through the L1 cache.)

Answer:

Key: 9

For the following questions, consider the following circuit:

Assume:

all registers store 64-bit integer values
all registers store a new value upon a rising edge of a common clock signal
all adders perform 64-bit additions (and do not rely on any clock signal to determine when to do so)
clock cycles are set long enough to avoid problems from values changing

Question 29 (4 pt; mean 2.49) (see above)

If the registers X, Y, and Z initially store 1, then what value will X store after 2 rising edges of the clock signal?

Answer:

Key: 7

Question 30 (3 pt; mean 1.83) (see above)

Suppose the 64-bit registers have a 50 ps register delay and the adders take 500 ps to produce a stable result. What is the minimum cycle time for the circuit? Write your answer as a base-10 number of picoseconds.

Answer:

Key: 1050