CS 6354: Tomasulo

21 September 2016

## To read more...

This day's paper:

Tomasulo, "An Efficient Algorithm for Exploiting Multiple Arithmetic Units"

Supplementary readings:

Hennessy and Patterson, Computer Architecture: A Quantitative Approach, section 3.4-5

Shin and Lipatsi, Modern Processor Design, section 5.2

# **Intel Skylake**



## **Scheduling**

How can we reorder instructions?

Without changing the answer

## Recall: Data hazards

Instructions had wrong data

... because they weren't executed one-at-a-time

Example: reading old value of register

#### Recall: Read-after-Write

$$r1 \leftarrow r2 + r3$$
 $r4 \leftarrow r1 - r5$ 

 1 | IF
 | IF

 2 | ID: read r2, r3 | IF
 | IF

 3 | EX: temp1  $\leftarrow$  r2 + r3 | ID: read r1, r5

 4 | MEM | EX: temp2  $\leftarrow$  r1 - r5

 5 | WB: r1  $\leftarrow$  temp | MEM | WB: r4  $\leftarrow$  temp2

## Write-after-Write

$$r1 \leftarrow r2 + r3$$
; (1)  
...  
 $r1 \leftarrow r6 + r7$ ; (2)  
 $r4 \leftarrow r2 + r1$ ; (3)

| time | $ r1 \leftarrow r2 + r3 $ | r1 ← r6 + r7 | $ r4 \leftarrow r2 + r1 $ |
|------|---------------------------|--------------|---------------------------|
| 1    |                           | read r6, r7  |                           |
| 2    | read r2, r3               | compute      |                           |
| 3    | compute                   | write r1     |                           |
| 4    | write r1                  | *****        | destant of                |
| 5    |                           |              | desired value             |
| 6    | value                     | read         | read r1, r2               |
| 7    |                           |              | compute                   |
|      |                           | •            |                           |

## Write-after-Read

$$r1 \leftarrow r2 + r3$$
; (1)  
 $r3 \leftarrow r4 + r5$ ; (2)

| time | $r1 \leftarrow r2 + r3$ | r3 ← r4 + r5 |
|------|-------------------------|--------------|
| 1    |                         | read r4, r5  |
| 2    |                         | compute      |
| 3    |                         | write r3     |
| 4    | read r2, r3             |              |
| 5    | compute                 |              |
| 6    | write r1                |              |

5

# **Types of Data Hazards**

Read-after-Write (RAW)

also called: true dependence

Write-after-Write (WAW)

also called: output dependence

Write-after-Read (WAR)

also called: anti-dependence

## a problem with names

write-after-write

$$r1 \leftarrow r2 + r3$$
; (1)  
 $r1 \times \leftarrow r6 + r7$ ; (2)  
 $r4 \leftarrow r2 + r1 \times$ ; (3)

write-after-read

$$r1 \leftarrow r2 + r3$$
; (1)  
 $r3 \times \leftarrow r4 + r5$ ; (2)

no problem if we used a different name each write

8

# register renaming

original code with renaming

$$r1 \leftarrow r2 + r3 \text{ new1} \leftarrow r2 + r3$$
 ; (1)  
 $r7 \leftarrow r1 + r3 \text{ new2} \leftarrow \text{new1} + r3$  ; (2)  
 $r1 \leftarrow r6 + r7 \text{ new3} \leftarrow r6 + r7$  ; (3)  
 $r4 \leftarrow r2 + r1 \text{ new4} \leftarrow r2 + \text{new3}$  ; (4)  
 $r2 \leftarrow r4 + r5 \text{ new5} \leftarrow r4 + r5$  ; (5)

| new  | old  | from | up to |
|------|------|------|-------|
| name | name |      |       |
| new1 | r1   | (1)  | (2)   |
| new2 | r7   | (2)  |       |
| new3 | r1   | (3)  | _     |
| new4 | r4   | (4)  |       |
| new5 | r2   | (5)  |       |

# scheduling with renaming

different architectual (external) and internal register names

new internal name on each write

1(

# register renaming state

| original code | with renaming              |
|---------------|----------------------------|
| r1 ← r2 + r3  | x09 ← x02 + x03            |
| r7 ← r1 + r3  | $x10 \leftarrow x09 + x03$ |
| r1 ← r6 + r7  | $x11 \leftarrow x06 + x10$ |
| r4 ← r2 + r1  | $x12 \leftarrow x02 + x11$ |
| r2 ← r4 + r5  | x13 ← x12 + x05            |

| external name | internal name          |
|---------------|------------------------|
| r1            | <del>×01</del> ×09 ×11 |
| r2            | <del>×02</del> ×13     |
| r3            | ×03                    |
| r4            | ×04 ×12                |
| r5            | ×05                    |
| r6            | ×06                    |
| r7            | ×07 ×10                |
| r8            | x08                    |

## **Diversion: SSA**

compiler technique: static single-assignment (SSA) form

eewrite code as code with immutable variables only makes optimization easier

if you know it — this will seem familiar

12

# scheduling with renaming

| #   | (renamed) instructions     | run on | done? |
|-----|----------------------------|--------|-------|
| (1) | x05 ← Mem[x03]             | Load   | yes   |
| (2) | x06 ← x01 + x02            | Add1   | yes   |
| (3) | x07 ← x01 × x02            | Mult   | yes   |
| (4) | x08 ← x05 × x04            | Mult   | yes   |
| (5) | x09 ← x05 + x04            | Add2   | yes   |
| (6) | $x10 \leftarrow x07 + x06$ | Add1   | yes   |

| int. | ready? |
|------|--------|
| name | ready! |
| ×01  | yes    |
| ×02  | yes    |
| ×03  | yes    |
| ×04  | yes    |
| ×05  | yes    |
|      |        |

Might have second adder, but x5 is not ready.

| ×08 | yes |
|-----|-----|
| ×09 | yes |
| ×10 | yes |

# handling variable times

scheduling is reactive

Load took longer? Doesn't matter.

Don't try to start things until ready.

13

# Running out of register names?

recycle names with no operations, external name still out of names? don't issue more instructions

## reservation stations vs registers

Tomasulo paper doesn't seem to have extra registers

But has reservation stations

... with tags

these are extra registers and their names

16

18

# pieces in Tomasulo



# scheduling with reservation buffers

| #   | (renamed) instructions    | run on | done? |
|-----|---------------------------|--------|-------|
| (1) | $x05 \leftarrow Mem[x03]$ | Load   | yes   |
| (2) | x06 ← x01 + x02           | Add1   | yes   |
| (3) | x07 ← x01 × x02           | Mult   | yes   |
| (4) | x08 ← x05 × x04           | Mult   |       |
| (5) | x09 ← x05 + x04           | Add2   |       |
| (6) | x10 ← x07 + x06           | Add1   | yes   |

dispatching transmits register values

|                 | Add1                 | Add2  | Mult     | Load |
|-----------------|----------------------|-------|----------|------|
| source 1 tag    | ×01×07               | ×05   | ×01×05   | ×03  |
| source 1 ready? | yesnoyes             | noyes | yesnoyes | yes  |
| source 2 tag    | ×02×06               | ×04   | ×02×04   |      |
| source 2 ready? | <mark>yes</mark> yes | yes   | yes      |      |
| sink tag        | ×06×10               | ×09   | ×07×08   | ×05  |

#### common data bus

results are broadcast here

tag ≈ internal register name

reservation stations listen for operands

register file listens for register values

keeps register file from being bottleneck

fancy buses: mutliple value+tags per clock cycle

## issuing instructions

assign tags for operands instruction will execute when operands are ready handles variable length operations (e.g. loads)

20

integrating with reorder buffer



# integrating with reorder buffer (2)

reorder buffer just another thing listening on bus

Hennessy & Patterson Figure 3.11

22

# multiple entries in reservation stations

instead of dispathcing one instruction, issue a list reservation station starts whichever one gets operands first

#### variations on reservation stations

Intel P6: shared reservation station for all types of operations

MIPS R10000 (next Monday's paper): read from shared register file (with renaming)

24

26

# Intel P6 execution unit datapaths



#### summary

register renaming to avoid data hazards otherwise even write-after-write, write-after-read a problem

shared bus to communicate results
register file, reservation buffers listen on bus
can dispatch to buffer before value ready

2