### Exam Review 2

## ROB: head/tail

rename map (for next rename)

| РС | log. reg | prev.<br>phys. | store? | except? | ready? | log<br>RC            | ) X0                 |
|----|----------|----------------|--------|---------|--------|----------------------|----------------------|
| Α  | R3       | Х3             | no     | none    | yes    | <b>←</b> old tail R1 |                      |
| В  | R1       | X1             | no     | none    | yes    | ←tail R2             |                      |
| С  | R1       | Х6             | no     | none    | yes    | R3                   |                      |
| D  | R4       | X4             | no     | none    | yes    | R4                   | X12                  |
| E  |          |                | yes    | none    | yes    |                      |                      |
| F  |          |                | no     | none    | yes    |                      |                      |
| Α  | R3       | X5             | no     | none    | yes    |                      | C 11                 |
| В  | R1       | X7             | no     | fault   | yes    |                      | free list:           |
| С  | R1       | X10            | no     | none    | no     | ]                    | X11, <mark>X3</mark> |
| D  | R4       | X8             | no     | none    | yes    | <b>←</b> head        |                      |
|    |          |                |        |         |        | <b>←</b> next entry  |                      |

exercise: result of processing rest?

## **Questions?**

#### vector instructions

```
register types: scalar, vector, predicate/mask, length
made-up syntax follows:
@MaskRegister V0 \leftarrow V1 + V2,
@MaskRegister VADD V0, V1, V2
for (int i = 0;
     i < MIN(VectorLengthRegister,</pre>
             MaxVectorLength);
     i += 1) {
    if (MaskRegister[i]) {
        V0[i] = V1[i] + V2[i];
    }
```

#### vector exercise

```
void vector_add_one(int *x, int length) {
    for (int i = 0; i < length; ++i) {
        x[i] += 1;
    }
}</pre>
```

exercise: write as a vector machine program with 64-element vectors

vector length register or predicate (mask) registers

### vector exercise answer

End:

```
void vector_add_one(int *x, int length) {
    for (int i = 0; i < length; ++i) {</pre>
         x[i] += 1;
// R1 contains X, R2 contains length
       VL \leftarrow R2 MOD 64
Loop: IF R2 <= 0, goto End
       V1 ← MEMORY[R1]
       V1 \leftarrow V1 + 1
       MEMORY[R1] \leftarrow V1
       R2 \leftarrow R2 - VL
       VL ← 64
       goto Loop
```

## relaxed memory models ex 1

reasons for reorderings?

#### relaxed reasons

#### optimizations to think about:

```
executing loads/stores out-of-order (if addresses don't
conflict)
combining two loads for same address ("load
forwarding")
combining load + store for same address ("store
```

not waiting for invalidations to be acknowledged (esp.

non-bus network)

forwarding")

### relaxed memory models ex 2

What can happen?

examples of possible sequential orders? (there are 8)

examples of non-sequential orders?

what could happen to cause other orders?

## possible sequential orders

|                                                                                |                                                                            | R1 | R2 | R3 | R4 |
|--------------------------------------------------------------------------------|----------------------------------------------------------------------------|----|----|----|----|
|                                                                                |                                                                            | 0  | 0  | 1  | 0  |
| X = Y = 0                                                                      |                                                                            | 0  | 0  | 1  | 1  |
| CPU1:                                                                          | CPU2:                                                                      | 0  | 0  | 2  | 0  |
| $\begin{array}{ccc} R1 \; \leftarrow \; Y \\ X \; \leftarrow \; 1 \end{array}$ | $\begin{array}{ccc} R4 & \leftarrow & X \\ X & \leftarrow & 2 \end{array}$ | 0  | 0  | 2  | 1  |
| $R2 \leftarrow Y$                                                              | $\begin{array}{ccc} X & \downarrow & 2 \\ Y & \leftarrow & 2 \end{array}$  | 0  | 2  | 1  | 0  |
| R3 $\leftarrow$ X                                                              |                                                                            | 0  | 2  | 2  | 0  |
|                                                                                |                                                                            | 0  | 2  | 2  | 1  |
|                                                                                |                                                                            | 2  | 2  | 1  | 0  |

### non-seq orders

```
X = Y = 0
 CPU1: CPU2:
 R1 \leftarrow Y \qquad R4 \leftarrow X
 X \leftarrow 1 \qquad X \leftarrow 2
 R2 \leftarrow Y \qquad Y \leftarrow 2
 R3 \leftarrow X
R2 = 2 and R3 = 1 and R4 = 1
     example cause: store forwarding (use stored value in X)
     example cause: load forwarding (reuse first load)
R1 = 2 and R3 = 2
     example cause: reordered stores in CPU2
     example cause: CPU2 doesn't wait for CPU1 invalidate
```

what is a transaction?

limitations?

when is performance good/bad?

what is a transaction?

atomic — as if uninterrupted by other things
limitations?

when is performance good/bad?

```
what is a transaction?
   atomic — as if uninterrupted by other things
limitations?
   I/O
   amount of space to store "transaction log"
when is performance good/bad?
```

```
what is a transaction?
     atomic — as if uninterrupted by other things
limitations?
     1/0
     amount of space to store "transaction log"
when is performance good/bad?
     livelock — transcations abort each other over and over?
     possibly more "wasted work" if contention (e.g. short
```

overhead to manipulate transaction log if lots of items?

transaction aborts long one)

fairness?

12





Cache has virtual indexes?





Cache has virtual indexes?

Solution #1: Disallow overlap

## Physically Tagged, Virtually Indexed



Figure 2. A typical cache and TLB design.

## Plausible splits

```
page #/tag set index offset tag only
```



nome // /+om --- !--- !--



Cache has virtual indexes?

Solution #1: Disallow overlap

Solution #2: Translate first

#### **Translate First**



### **Virtual Caches**

no translation for entire cache lookup including tag checking

exist, but more complicated

need to handle aliasing
multiple virtual addresses for one physical

example ways:

OS must prevent/manage aliasing physical L2 tracks virtual to physical mappping in L1

### 000 tradeoffs

### gem5 pipeline



# OOO tradeoffs (1)

dependencies plus latency limits performance diminishing returns from additional computational resources

latencies that can be especially long:

cache/memory accesses branch resolution

speculation helps "cheat" on dependencies

branch prediction

memory reordering (+ check if addresses conflict later)

# OOO tradeoffs (2)

limits on number of instructions "in flight"

```
number of physical registers
size of queues (instruction, load/store)
size of reorder buffer
# active cache misses
```

# OOO tradeoffs (3)

miscellaneous issues:

right types of functional units for programs? wasted work from frequent "exceptions"? might include, e.g., memory ordering error

### 000 tradeoff exercise

what programs will be most affected by a smaller/larger:

reorder buffer
instruction queue
number of floating point adders
number of physical registers
number of instructions
fetched/decoded/renamed/issued/committed per cycle



fetch instruction **bundles**parallel pipelines, shared registers
specialized pipelines

#### Longer instruction word pipeline

|       | Read Regs | Simple ALU    | _             | Write Back |  |
|-------|-----------|---------------|---------------|------------|--|
| Fetch | Read Regs | Address ALU   | Memory        | Write Back |  |
|       | Read Regs | Int/Mul ALU 1 | Int/Mul ALU 2 | Write Back |  |

#### VLIW vs 000

VLIW is like OOO but...

instructions are scheduled at compile-time, not run-time

eliminates OOO scheduling logic/queues

compiler does dependency detection including dealing with functional unit latency

possibly eliminates reorder buffer

### **VLIW** problems

requires smart compiler can't reschedule based on memory latency, etc. assembly/machine code tied to particular HW design

### **VLIW** exercise

int \*foo; int \*bar;

```
for (int i = 0; i < 1000; ++i) {</pre>
     *foo = *foo * *bar;
     foo += 1;
     bar += 1;
ouline what assembly for a VLIW processor with:
     bundles of two instructions.
     1: load/store (address is reg+offset) or add/subtract
     2: compare-and-branch or multiply or add/subtract
     all instructions take two cycles to produce usable result
     all instructions take registers or constants
     adds can load a constant
```

### **VLIW** exercise: slow answer

```
# RO: FOO; R1: BAR; R2: I
# R3: F00 temp1, R4: BAR temp1
R2 \leftarrow 0
        . NOP
Loop:
                 . IF R1 < 1000 GOTO End
NOP
R3 \leftarrow M[R0+0] . R2 \leftarrow R2 + 1 // foo . ++i
R4 \leftarrow M[R1+0] . R1 \leftarrow R1 + 4 // bar . ++bar
                    R0 \leftarrow R0 + 4 // . ++foo
NOP
NOP
                    R3 \leftarrow R3 \times R4 // ... \times
                           // wait for 	imes
NOP
                    NOP
                               // foo .
M[R1-4] \leftarrow R3 . NOP
                 . GOTO Loop
NOP
End:
```

### **VLIW** exercise: faster answer?

needed nops due to instruction delays/lack of work alternative:

unroll loop several times

move loads/stores between iterations of the loop eliminate branch at beginning

#### final notes

a bunch of multiple choice (because I could write it) have room until 7:15PM — will give 2 hours office hours Friday 10am–12pm / Piazza super last minute questions? office hours Monday 1pm–3pm