## CS 6354: Homework 1 Post-Mortem / MIPS R10000

26 September 2016

#### To read more...

Today's paper: Yeager, "The MIPS R10000 Superscalar Microprocessor"

Also discussed:

Homework 1 on caches

Supplementary readings:

Kanter, "Intel's Haswell CPU Microarchitecture"

### MIPS R10000: Weird names

instruction queue  $\approx$  (shared) reservation station

active list  $\approx$  reorder buffer

neither stores values; the values are actually in the register file

## MIPS R10000: Stages



# MIPS R10000: Register Renaming/Queues



## **MIPS R10000: Register Renaming**

explicit register map data structure
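A minimal sketch of what an explicit register map might look like, assuming 32 logical and 64 physical registers and a FIFO free list. The names `RenameState`, `rename_insn`, and `release_reg` are hypothetical, and this is a simplified model rather than the R10000's actual hardware:

```c
#include <assert.h>
#include <stdint.h>

enum { NUM_LOGICAL = 32, NUM_PHYSICAL = 64 };  /* assumed sizes */

typedef struct {
    uint8_t map[NUM_LOGICAL];         /* logical -> physical register */
    uint8_t free_list[NUM_PHYSICAL];  /* circular FIFO of unused physical regs */
    int     free_head, free_count;
} RenameState;

/* Physical register numbers assigned to one renamed instruction. */
typedef struct { uint8_t src1, src2, dest, old_dest; } Renamed;

void rename_init(RenameState *rs) {
    for (int i = 0; i < NUM_LOGICAL; ++i)
        rs->map[i] = (uint8_t)i;              /* identity mapping at reset */
    rs->free_head = 0;
    rs->free_count = NUM_PHYSICAL - NUM_LOGICAL;
    for (int i = 0; i < rs->free_count; ++i)
        rs->free_list[i] = (uint8_t)(NUM_LOGICAL + i);
}

/* Rename "dest <- src1 op src2": look up sources, allocate a new dest. */
Renamed rename_insn(RenameState *rs, int dest, int src1, int src2) {
    assert(rs->free_count > 0);   /* real hardware would stall decode instead */
    Renamed r;
    r.src1 = rs->map[src1];       /* sources read the current mapping */
    r.src2 = rs->map[src2];
    r.old_dest = rs->map[dest];   /* remembered so it can be freed or restored */
    r.dest = rs->free_list[rs->free_head];
    rs->free_head = (rs->free_head + 1) % NUM_PHYSICAL;
    rs->free_count--;
    rs->map[dest] = r.dest;
    return r;
}

/* Called once the old mapping can no longer be needed. */
void release_reg(RenameState *rs, uint8_t phys) {
    int tail = (rs->free_head + rs->free_count) % NUM_PHYSICAL;
    rs->free_list[tail] = phys;
    rs->free_count++;
}
```

In this model the `old_dest` field is what an active-list entry would hold: it is released when the instruction graduates, or written back into `map` if the instruction is squashed.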

#### **MIPS R10000: Instruction Queue**



## MIPS R10000: Instruction Queue v. Reservation Station

- shared register file
- queue only tracks register numbers
- metadata:
  - branch mask for branch mispredicts
  - ready bits: local copy of busy bits
  - pointer to active list (ROB)
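The queue metadata listed above can be sketched as a small struct with wakeup, select, and squash operations. This is an illustrative model only: the 16-entry size, field widths, two-operand limit, and the first-ready selection policy are assumptions, and all names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

enum { QUEUE_SIZE = 16 };  /* assumed queue size */

typedef struct {
    bool    valid;
    uint8_t src[2];       /* physical register numbers of operands (no values) */
    bool    ready[2];     /* local copy of the busy bits for src[0], src[1] */
    uint8_t dest;         /* physical destination register */
    uint8_t branch_mask;  /* one bit per unresolved branch this entry follows */
    uint8_t active_idx;   /* index of this instruction in the active list */
} QueueEntry;

/* A functional unit produced physical register `phys`: wake up waiters. */
void iq_wakeup(QueueEntry q[QUEUE_SIZE], uint8_t phys) {
    for (int i = 0; i < QUEUE_SIZE; ++i)
        for (int s = 0; s < 2; ++s)
            if (q[i].valid && q[i].src[s] == phys)
                q[i].ready[s] = true;
}

/* Return the index of the first entry whose operands are all ready, or -1. */
int iq_select(const QueueEntry q[QUEUE_SIZE]) {
    for (int i = 0; i < QUEUE_SIZE; ++i)
        if (q[i].valid && q[i].ready[0] && q[i].ready[1])
            return i;
    return -1;
}

/* The branch owning `branch_bit` mispredicted: discard dependent entries.
   Returns how many entries were discarded. */
int iq_squash(QueueEntry q[QUEUE_SIZE], uint8_t branch_bit) {
    assert(branch_bit != 0);
    int squashed = 0;
    for (int i = 0; i < QUEUE_SIZE; ++i)
        if (q[i].valid && (q[i].branch_mask & branch_bit)) {
            q[i].valid = false;
            ++squashed;
        }
    return squashed;
}
```

Because entries hold only register numbers, a selected instruction reads its operand values from the shared register file after it issues, which is the main contrast with a value-holding reservation station.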

### **MIPS R10000: Functional Units**

Integer instructions:

| Unit       | Latency (cycles) | Repeat rate (cycles) | Instruction                              |
|------------|------------------|----------------------|------------------------------------------|
| Either ALU | 1                | 1                    | Add, subtract, logical, move Hi/Lo, trap |
| ALU 1      | 1                | 1                    | Integer branches                         |
| ALU 1      | 1                | 1                    | Shift                                    |
| ALU 1      | 1                | 1                    | Conditional move                         |
| ALU 2      | 5/6              | 6                    | 32-bit multiply                          |
|            | 9/10             | 10                   | 64-bit multiply (to Hi/Lo registers)     |
| ALU 2      | 34/35            | 35                   | 32-bit divide                            |
|            | 66/67            | 67                   | 64-bit divide                            |
| Load/store | 2                | 1                    | Load integer                             |
|            | -                | 1                    | Store integer                            |

Table 2. Latency and repeat rates for floating-point instructions.

| Unit        | Latency (cycles) | Repeat rate (cycles) | Instruction                |
|-------------|------------------|----------------------|----------------------------|
| Add         | 2                | 1                    | Add, subtract, compare     |
| Multiply    | 2                | 1                    | Multiply                   |
| Divide      | 12               | 14                   | 32-bit divide              |
|             | 19               | 21                   | 64-bit divide              |
| Square root | 18               | 20                   | 32-bit square root         |
|             | 33               | 35                   | 64-bit square root         |
| Load/store  | 3                | 1                    | Load floating-point value  |
|             | -                | 1                    | Store floating-point value |
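To make the difference between latency and repeat rate concrete, here is a rough cycle-count model for `n` operations on a single unit. It assumes nothing else limits issue (no cache misses, no queue stalls), and the function names are hypothetical:

```c
#include <assert.h>

/* n independent operations: a new one starts every `repeat` cycles,
   and the last one finishes `latency` cycles after it starts. */
int cycles_independent(int n, int latency, int repeat) {
    assert(n >= 0);
    return n == 0 ? 0 : (n - 1) * repeat + latency;
}

/* n operations where each needs the previous result: each start waits
   for both the previous result and the unit becoming free again. */
int cycles_dependent(int n, int latency, int repeat) {
    assert(n >= 0);
    int step = latency > repeat ? latency : repeat;
    return n == 0 ? 0 : (n - 1) * step + latency;
}
```

With the floating-point add numbers from the table (latency 2, repeat rate 1), four independent adds take 5 cycles in this model while a chain of four dependent adds takes 8. For 32-bit floating-point divide (latency 12, repeat rate 14), both cases take 54 cycles because the unit is not pipelined.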

### Moving load/stores around

| program order | desired (fast) order |
|---------------|----------------------|
| store X       | load Z               |
| store Y       | store X              |
| load Z        | store Y              |

what if X == Z or Y == Z?

## **MIPS R10000: Memory requests**

- 16 entry address queue
- kept in program order
- tracks dependencies (overlapping memory accesses)
- special-case for two accesses to same cache set
- match cache accesses against all loads
- load to store forwarding
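The dependency check behind moving a load ahead of earlier stores can be sketched as follows. This is a simplified model with hypothetical names, not the R10000's actual address-queue logic: it assumes the queue is an array in program order, forwards only on an exact address and size match, and conservatively waits on any store whose address is not yet known.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef enum { LOAD_MAY_ISSUE, LOAD_MUST_WAIT, LOAD_FORWARD } LoadDecision;

typedef struct {
    bool     is_store;
    bool     addr_known;  /* has the address been computed yet? */
    uint64_t addr;
    uint8_t  size;        /* access size in bytes */
} MemOp;

static bool overlaps(const MemOp *a, const MemOp *b) {
    return a->addr < b->addr + b->size && b->addr < a->addr + a->size;
}

/* queue[0..load_idx) are earlier in program order than the load.
   On LOAD_FORWARD, *forward_from is the index of the supplying store. */
LoadDecision check_load(const MemOp queue[], int load_idx, int *forward_from) {
    const MemOp *load = &queue[load_idx];
    assert(!load->is_store && load->addr_known);
    for (int i = load_idx - 1; i >= 0; --i) {      /* youngest store first */
        const MemOp *st = &queue[i];
        if (!st->is_store)
            continue;
        if (!st->addr_known)
            return LOAD_MUST_WAIT;                 /* might alias: be safe */
        if (overlaps(st, load)) {
            if (st->addr == load->addr && st->size == load->size) {
                *forward_from = i;                 /* take the store's data */
                return LOAD_FORWARD;
            }
            return LOAD_MUST_WAIT;                 /* partial overlap */
        }
    }
    return LOAD_MAY_ISSUE;                         /* no conflict: go early */
}
```

For the store X, store Y, load Z example, the load may issue first when Z differs from both X and Y; when Z equals Y (or X), the value comes from that store instead of the cache.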

### **MIPS R10000: Synchronization**

execute memory accesses in order

... in case other processors are listening

treat like exception if other processors are listening

## LL/SC atomic increment

    retry:
        ll   $t0, value      # $t0 <- value
        addi $t0, $t0, 1     # $t0 <- $t0 + 1
        sc   $t0, value      # value <- $t0 if memory unchanged
                             # $t0 <- 1 if stored, 0 otherwise
        beqz $t0, retry      # if sc unsuccessful, goto retry
        nop                  # (delay slot)
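The same retry pattern can be written portably with C11 atomics. This is a sketch using compare-and-swap rather than LL/SC itself (the two are related but not identical); the function name `atomic_increment` is hypothetical:

```c
#include <assert.h>
#include <stdatomic.h>

/* Atomically add 1 to *value and return the new value. */
int atomic_increment(atomic_int *value) {
    assert(value != 0);
    int old = atomic_load(value);                 /* like ll: read current */
    /* Like sc: store old + 1 only if *value still equals old.
       On failure, `old` is refreshed with the current value and we retry. */
    while (!atomic_compare_exchange_weak(value, &old, old + 1)) {
        /* retry */
    }
    return old + 1;
}
```

The weak form is allowed to fail spuriously, which mirrors how a store-conditional can fail even when no other processor wrote the location, so it always belongs inside a loop.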

## MIPS R10000: Weird Tricks

predecoding in instruction cache — opcode preprocessed

instruction cache specialized for unaligned accesses within a block

multibanked data cache — half the sets in one cache, half in another

### core storage (approx sizes)

```
data: approx 8 Kbit
    register files: 8192 bits

metadata: approx 4 Kbit
    register map tables: 390 bits
    free list: 192 bits
    active list: 672 bits
    busy bits: 128 bits
    instruction queues: 1600 bits
    address queue: 1232 bits
```

## SGI's workload

graphics

lots of floating point

big images

#### evolution of modern processors

|                   | MIPS R10000 (1996)      | Intel Haswell (2013)   |
|-------------------|-------------------------|------------------------|
| fetch/cycle       | 4 instructions          | 5 instructions         |
| reorder buffer    | 32 entry                | 192 entry              |
| instruction queue | 16 int + 16 FP + 16 mem | 60 unified             |
| execute/cycle     | 2 int + 2 FP            | 2 int/FP + 2 int       |
| memory/cycle      | 1 load or store         | 2 load + 1 store       |
| operand width     | 32 bit                  | 32 bit to 256 bit      |
| L1 cache          | 32K I, 32K D            | 32K I, 32K D           |
| L2 cache          | off-chip                | 256K                   |
| L3 cache          | none                    | 1+ MB                  |
| L1 TLB            | 64 entry                | 64 entry D, 64 entry I |
| L2 TLB            | none                    | 1024 entry             |
| predecoding       | in I-cache              | micro-op cache         |
| branch pred.      | local, 512 entry        | ???                    |
| cores/package     | 1                       | 2-18                   |


#### **Micro-ops**

complex instruction encodings don't allow pre-decode trick

complex instructions can't go to a single functional unit

trick: split into micro-ops

extra decoding step

Intel Haswell: cache for micro-ops
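As a toy illustration of splitting, consider an instruction that adds a register to a memory location. The micro-op format, the `TMP_REG` internal register, and the three-way split below are all hypothetical; a real decoder (Haswell's included) uses its own encoding and may split differently.

```c
#include <assert.h>

typedef enum { UOP_LOAD, UOP_ADD, UOP_STORE } UopKind;

/* Register numbers only; -1 means "unused". For loads and stores,
   src1 holds the address register. */
typedef struct { UopKind kind; int dest; int src1; int src2; } Uop;

enum { TMP_REG = 100 };  /* hypothetical internal temporary register */

/* Expand "mem[addr_reg] += src_reg" into micro-ops that each fit one
   functional unit. Returns the number of micro-ops written to out. */
int split_add_to_memory(int addr_reg, int src_reg, Uop out[3]) {
    assert(out != 0);
    out[0] = (Uop){ UOP_LOAD,  TMP_REG, addr_reg, -1 };       /* tmp <- mem[addr]  */
    out[1] = (Uop){ UOP_ADD,   TMP_REG, TMP_REG,  src_reg };  /* tmp <- tmp + src  */
    out[2] = (Uop){ UOP_STORE, -1,      addr_reg, TMP_REG };  /* mem[addr] <- tmp  */
    return 3;
}
```

Each resulting micro-op looks like a simple RISC-style operation, so it can be renamed, queued, and issued to a single functional unit just as an R10000 instruction would be.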

#### Homework 1: Rubric

140 points total

For each thing benchmarked:

- 5 points: benchmark description, including how to read results
- 5 points: raw results and code are included, match description, interpreted plausibly

10 points: tested system described, report parts clear

# Homework 1: General Concerns: Benchmarking Discipline

how consistent are your measurements?

is that really an increase?

be honest
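One crude way to ask "is that really an increase?" is to repeat each measurement several times and check whether the two sets of timings overlap at all. The sketch below uses hypothetical names and a deliberately conservative test; it is not a substitute for a proper statistical comparison.

```c
#include <assert.h>
#include <stdbool.h>

typedef struct { double min, max; } Range;

/* Smallest and largest of n repeated measurements. */
Range range_of(const double *samples, int n) {
    assert(n > 0);
    Range r = { samples[0], samples[0] };
    for (int i = 1; i < n; ++i) {
        if (samples[i] < r.min) r.min = samples[i];
        if (samples[i] > r.max) r.max = samples[i];
    }
    return r;
}

/* True only if every run of b was slower than every run of a. */
bool clearly_slower(const double *a, int na, const double *b, int nb) {
    return range_of(b, nb).min > range_of(a, na).max;
}
```

If the ranges overlap, the difference may just be run-to-run noise, and the honest report is that no increase was demonstrated.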

# Homework 1: General Concerns: Latency v. Bandwidth

bandwidths much better than latencies (everywhere)

memory system relies on overlapping many memory accesses

measuring sizes? better to measure latency

better to avoid prefetching — e.g. random access pattern, pointer chasing
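A pointer-chasing benchmark can be sketched as below. It builds a random permutation that forms a single cycle (Sattolo's algorithm), so every load depends on the previous one and the access pattern is hard to prefetch. The function names are hypothetical, and `rand()` is used only for brevity.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Fill next[0..n) so that following next[] from any index visits all n
   elements exactly once before returning (one cycle of length n). */
void build_chain(size_t *next, size_t n, unsigned seed) {
    assert(n > 0);
    for (size_t i = 0; i < n; ++i)
        next[i] = i;
    srand(seed);
    for (size_t i = n - 1; i > 0; --i) {
        /* Sattolo: swap with a strictly earlier index, 0 <= j < i.
           Not uniform when i exceeds RAND_MAX, but still a single cycle. */
        size_t j = (size_t)rand() % i;
        size_t tmp = next[i];
        next[i] = next[j];
        next[j] = tmp;
    }
}

/* Follow the chain for `steps` dependent loads; returns the final index
   so the compiler cannot discard the loop. */
size_t chase(const size_t *next, size_t start, size_t steps) {
    size_t p = start;
    for (size_t s = 0; s < steps; ++s)
        p = next[p];
    return p;
}
```

To estimate latency, time a call to `chase` with a large step count and divide by the number of steps; repeating this for increasing `n` should show jumps as the array outgrows each cache level.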

#### Next Time: SMT

multiple threads on one core

later: multiple processors/cores

#### **Definition: Thread**

stream of program execution

own registers

own program counter (current instruction pointer)

may or may not share memory

appears to execute at same time as other threads

## Multithreading

```
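#include <pthread.h>   // POSIX threads, used here as one concrete thread API

#define N 1024
int array[N];
long sum1, sum2, sum;

void *thread_one_func(void *arg) {      // sums the first half
    for (int i = 0; i < N / 2; ++i)
        sum1 += array[i];
    return NULL;
}
void *thread_two_func(void *arg) {      // sums the second half
    for (int i = N / 2; i < N; ++i)
        sum2 += array[i];
    return NULL;
}
void compute_sum(void) {
    pthread_t thread_one, thread_two;
    pthread_create(&thread_one, NULL, thread_one_func, NULL);
    pthread_create(&thread_two, NULL, thread_two_func, NULL);
    pthread_join(thread_one, NULL);     // wait for both halves to finish
    pthread_join(thread_two, NULL);
    sum = sum1 + sum2;
}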
```

#### **Different Parallelism**

- instruction-level parallelism
  - sequential sequence of instructions, not actually sequential
  - transparent to programmer
- next up: thread-level parallelism
  - multiple sequential sequences of instructions
  - run in parallel (apparently)
  - exposed to programmer
- later: vectorization
  - sequential sequence of instructions that each does multiple copies of the same thing
  - exposed to programmer

# Flynn's Taxonomy

|               | Single instruction | Multiple instruction |
|---------------|--------------------|----------------------|
| Single data   | serial             | ???                  |
| Multiple data | vectors            | threads              |