$\qquad$

## CS 3330 Exam 2 Fall 2019

Name: $\qquad$

## Computing ID:

$\qquad$
Letters go in the boxes unless otherwise specified (e.g., for C 8 write "C" not " 8 ").
Write Letters clearly: if we are unsure of what you wrote you will get a zero on that problem.
Bubble and Pledge the exam or you will lose points.
Assume unless otherwise specified:

- little-endian 64 -bit architecture
- \%rsp points to the most recently pushed value, not to the next unused stack address.
- questions are single-selection unless identified as select-all

Variable Weight: point values per question are marked in square brackets.
Mark clarifications: If you need to clarify an answer, do so, and also add a $\star$ to the top right corner of your answer box.
$\qquad$

Question 1 [ $\mathbf{2} \mathbf{~ p t}]$ : What is the total amount of data (in bytes) that can be stored in a two way set associative cache with 16 sets and 256 byte blocks?

Question 2 [2 pt]: Which assembly instructions can be issued (sent to execution units) out-of-order before the mrmovq instruction below completes?

```
1| mrmovq 0(%r10), %r8
2| addq %r10,%r11
3| xorq %r8, %r9
4| subq %r9,%r10
5| xorq %r11,%r12
```

Answer:

Write the numbers of the lines that get executed. (For example, if you think lines 2 and 3 can be executed with the mrmovq, then write " 2 and 3 ".)

## Information for questions 3-3

Consider the following C code:

```
unsigned char array[512 * 512];
/* ... */
int sum = 0;
for (int x = 0; x < 5; ++x) {
    for (int i = 0; i < 128 + 64; ++i) {
        sum += array[i];
    }
}
```

Assume that:

- only array is kept in memory (all other variables are stored in registers)
- array is stored at an address that is a multiple of $4096\left(2^{12}\right)$
- that the compiler does not remove or reorder any accesses to array
- unsigned chars are 1 byte
- the cache is empty upon entry to the outer for loop above

Question 3 [ $\mathbf{2} \mathbf{~ p t}]$ : (see above) If the loops above are run on a 128 B directmapped cache with 1-byte cache blocks, how many data cache misses will occur? (You may write your answer as unsimplified arithmetic expression.)
Answer:
$\qquad$

## Information for questions 4-5

Consider the following access pattern where W represents writes and R represents reads:

| W | $0 \times \mathrm{ABC}$ |
| :--- | :--- |
| R | $0 \times B A D$ |
| W | $0 \times B A C$ |
| R | $0 \times 00 \mathrm{D}$ |
| W | $0 \times B A E$ |

Assume that the access pattern is repeated 4 times on processor with a 48 byte fully associative cache with 16 byte blocks. Initially the cache is empty. The cache is using a write-allocate, writeback policy and LRU replacement policy.

Question 4 [2 pt]: (see above) How many total cache misses will occur (during those 4 repeats)?

Question 5 [2 pt]: (see above) After all 4 repeats of the access pattern above occur, the cache may contain some blocks which must be written to memory if they are evicted anytime later. Write the tags of each of these blocks. (If there are no such blocks, write "none".)


Answer:

Question 6 [ 2 pt$]$ : Which of the following is/are true? Place a $\checkmark$ in each box corresponding to a correct answer and leave other boxes blank.

A increasing the cache size generally decreases the miss rate
B increasing the associativity of the cache while keeping the cache size the same decreases capacity misses.
C increasing the cache size generally decreases the hit time
D increasing the associativity of the cache while keeping the cache size the same generally increases the miss rate
$\qquad$

## Information for questions 7-8

Consider the following assembly code

```
mrmovq 8(%rsp), %r10
subq %r10, %r12
rmmovq %r12, (%rsp)
xorq %r12, %r11
addq %r10,%rsp
```

Question $7[\mathbf{2 ~ p t}]$ : (see above) Assume the above code is run on the fivestage pipelined processor described in lecture using only stalling to resolve data hazards. How many stalls are needed to correctly execute the above code? Count each cycle in which a new instruction isn't fetched as a single stall. (Write your answer as an integer.)


Question 8 [ $\mathbf{2 ~ p t ] : ~ ( s e e ~ a b o v e ) ~ A s s u m e ~ t h e ~ a b o v e ~ c o d e ~ i s ~ r u n ~ o n ~ t h e ~ f i v e - ~}$ stage pipelined processor described in lecture which uses forwarding to avoid stalling when executing the above code. Write the minimum set of registers that would must be forwarded. Example answer: \%r13,\%r14. If no forwarding is helpful, write none.

Answer:

Question 9 [ 3.0 pt$]$ : Consider the following C function:

```
void outerProduct(int n, int *a, int *b, int *result) {
        for (int i = 0; i < n; ++i) {
            for (int j = 0; j < n; ++j) {
                result[i * n + j] += a[i] * b[j];
                }
    }
}
```

Which of the following is/are true about optimizing this function? Place a $\checkmark$ in each box corresponding to a correct answer and leave other boxes blank.
A on some processors, the function could benefit from using multiple accumulators for the additions to the elements of the result array
B unrolling the inner loop in this function would likely increase its size in the executable
C a compiler cannot store a[i] in a register between iterations of the innermost loop without adding checks that certain pointers refer to different areas of memory
D a compiler cannot unroll the loops without adding checks that certain pointers refer to different areas of memory
E $\square$ the locality of the function would be improved by swapping the loop over i and the loop
over $\mathbf{j}$
F a compiler cannot reorder the loops without adding checks that certain pointers refer to different areas of memory
$\qquad$

Question 10 [ $\mathbf{2 ~ p t}]$ : Consider the code below running on a system with a 256B fully-associative data cache with 16-byte blocks and an LRU replacement policy. Assume matrix is a very large array of 4-byte integers, and is aligned (so the first byte of matrix[0] is the first byte of a cache block). Assume only accesses to matrix use the data cache and that cache is initially empty.

```
int n = 160;
for (i = 0; i < n; i += 4){
    for (j = 0; j < n; j += 4){
        for (iB = i; iB <= i + 3; iB++){
            for (jB = j; jB <= j + 3; jB++){
                matrix[jB * n + iB] *= i*j;
            }
        }
    }
}
```

How many cache misses occur when the code is run?

Information for questions 11-12
Consider the following assembly snippet. Initially, \%r11 contains $0 \times 1$; \%r12 contains $0 \times 1$; and \%rsp contains $0 \times 13$.

Assuming a 5 stage pipelined processor like the one described in lecture, with forwarding and a branch prediction scheme that assumes all branches are taken.

```
    subq %r11, %r12
    jne bar
    call foo
    jmp end
bar: add %rsp,%r11
    jmp end
foo: add %r11, %r11
ret
```

end: nop

Question 11 [ $\mathbf{2} \mathbf{~ p t ] : ~ ( s e e ~ a b o v e ) ~ I f ~ t h e ~ s u b q ~ i s ~ f e t c h e d ~ d u r i n g ~ c y c l e ~} 0$, then during what cycle is the nop fetched?

Question 12 [ $\mathbf{2 p t}$ ]: (see above) What value is stored for register \%r11 in the register file when the branch misprediction is discovered for the jne instruction?

| Answer: |
| :--- |
| Answer: |
|  |

$\qquad$

## Information for questions 13-14

Suppose a seven-stage pipelined processor with forwarding and branch prediction has the following pipeline stages:

- fetch 1
- fetch 2
- decode
- execute
- memory
- writeback

Question 13 [ $\mathbf{2} \mathbf{~ p t}]$ : (see above) Suppose this processor experiences a branch misprediction. It detects and handles that misprediction during the conditional jump instruction's decode stage. In the following cycle, as a result of correcting this misprediction, which of the following pipeline registers should output values corresponding to a nop? Place a $\checkmark$ in each box corresponding to a correct answer and leave other boxes blank.

A


Question 14 [ $\mathbf{2} \mathbf{~ p t ] : ~ ( s e e ~ a b o v e ) ~ S u p p o s e ~ t h i s ~ p r o c e s s o r ~ i s ~ e x e c u t i n g ~ a n ~ i n s t r u c t i o n ~ i n ~ t h e ~}$ memory stage and discovers that it must stall due to a data cache miss, keeping that instruction in the memory stage. As a result of this stall, which of the following pipeline registers should output values corresponding to a nop in the next cycle? Place a $\checkmark$ in each box corresponding to a correct answer and leave other boxes blank.

| A | $\square$ the ones between fetch 1 and fetch 2 |
| :--- | :--- |
| B | $\square$ between fetch 2 and decode |
| C | $\square$ between decode and execute |
| D | $\square$ between execute and memory |
| E | $\square$ between memory and writeback |

$\qquad$

Question 15 [ $\mathbf{2 ~ p t}]$ : Which of the following is/are differences between a processor that starts multiple instructions per cycle and executes them out of order and an in-order pipelined processor (like the one we built in the homeworks)?

A the number of data dependency increases
B the register file is likely to have more read and write ports
C resolving branch mispredictions is likely to require more cycles
D the number of hazards increases


Question 16 [ $\mathbf{2 p t}$ ]: Consider the code the below. How many version of the of the register \%r10 will be generated for these instructions by an out-of-order processor that resolve hazards using register renaming?

```
mrmovq (%rsp), %r10
addq %r10,%r10
addq %r10,%r11
addq %r12, %r10
rmmovq %r11, (%r10)
```

Answer:

Question 17 [ $\mathbf{2} \mathbf{~ p t}]$ : Assume that the code below is run on a 5 stage pipeline, with forwarding.

```
xorq %r10,%r9
subq %r10,%r11
andq %r11, %r11
```

It is possible to solve the data dependences above using forwarding and no stalling. To do this, which of the following forwarding operations would be useful? Place a $\checkmark$ in each box corresponding to a correct answer and leave other boxes blank.
A $\quad$ \%r11 from the end of the xorq's execute stage to the end of the andq's decode stage
B $\quad$ \%r10 from the end of the xorq's memory stage to the end of the subq's decode stage
C $\quad$ \%r11 from the end of the xorq's memory stage to the end of the andq's decode stage
D $\quad$ \%r10 from the end of the xorq's execute stage to the end of the subq's decode stage

Question 18 [ $\mathbf{2} \mathbf{~ p t}]$ : Consider a 4 -way set associative 32 KB cache with 64 byte cache blocks, an LRU replacement policy, a write-back policy, and a write-allocate policy. On a system with 64 -bit addresses, how many bits of an address is used for the tag?


Question 19 [ $\mathbf{2 p t}$ ]: Consider a processor with an 8 cycle non-pipelined divider and 3 cycle pipelined multiplier. After values are produced by an execution unit (like the divider or multiplier), they can be used as inputs to an execution unit in the following cycle. What is the minimum number of cycles required to execute $(a \times(b \times c)) \div(d \times e)$ using these execution units?
Answer:
$\qquad$

Question 20 [ $\mathbf{2} \mathbf{~ p t ] : ~ S u p p o s e ~ w e ~ h a v e ~ L 1 ~ c a c h e ~ w i t h ~ a ~} 2$ cycle hit time and $90 \%$ hit rate, and an L2 cache with a $80 \%$ hit rate (assume all L2 accesses come via this L1 cache), and a main memory with a 100 cycle access time. What hit time, in cycles, does the L2 cache need into have to result in an overall average memory access time of 5 cycles for the L1 cache?

Recall the formula for average memory access time: hit time + miss penalty
Answer: $\times$ miss rate.

## Pledge:

On my honor as a student, I have neither given nor received aid on this exam.

Your signature here

