ooo

beyond pipelining: multiple issue

start more than one instruction/cycle
multiple parallel pipelines; many-input/output register file
hazard handling much more complex

beyond pipelining: out-of-order

find later instructions to do instead of stalling
lists of available instructions in pipeline registers
- take any instruction with available values
provide illusion that work is still done in order
- much more complicated hazard handling logic

interlude: real CPUs

modern CPUs:
execute multiple instructions at once
execute instructions out of order — whenever values available

out-of-order and hazards

out-of-order execution makes hazards harder to handle

problems for forwarding:
- value in last stage may not be most up-to-date
- older value may be written back before newer value?
problems for branch prediction:
- mispredicted instructions may complete execution before squashing
which instructions to dispatch?
- how to quickly find instructions that are ready?

read-after-write examples (1)

normal pipeline: two options for %r8?
choose the one from earliest stage
because it’s from the most recent instruction

out-of-order execution:
%r8 from earliest stage might be from delayed instruction
can’t use same forwarding logic

register version tracking

goal: track different versions of registers
out-of-order execution: may compute versions at different times
only forward the correct version
strategy for doing this: preprocess instructions to represent version info
makes forwarding, etc. lookup easier

rewriting hazard examples (1)

`addq %r10, %r8`	`addq %r10_v1, %r8_v1 → %r8_v2`
`addq %r11, %r8`	`addq %r11_v1, %r8_v2 → %r8_v3`
`addq %r12, %r8`	`addq %r12_v1, %r8_v3 → %r8_v4`

read different version than the one written
- represent with three-argument pseudo-instructions
forwarding a value? must match version exactly
for now: version numbers
later: something simpler to implement
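
The rewriting in the example above can be sketched in a few lines of Python. This is an illustration only: the function name `add_versions` and the `(op, src, dst)` tuple form for two-operand instructions are assumptions made for the sketch, not something the slides define.

```python
def add_versions(instructions):
    """Rewrite two-operand (op, src, dst) instructions into three-argument
    pseudo-instructions with explicit register versions.

    Every register starts at version 1; each write creates the next version.
    """
    version = {}  # register name -> current version number

    def current(reg):
        return f"{reg}_v{version.setdefault(reg, 1)}"

    rewritten = []
    for op, src, dst in instructions:
        # Read the current versions of both inputs (dst is also an input).
        src_in = current(src)
        dst_in = current(dst)
        # The write creates a new version of the destination.
        version[dst] += 1
        rewritten.append(f"{op} {src_in}, {dst_in} -> {current(dst)}")
    return rewritten


program = [("addq", "%r10", "%r8"),
           ("addq", "%r11", "%r8"),
           ("addq", "%r12", "%r8")]
for line in add_versions(program):
    print(line)
```

Each instruction reads the version written by the instruction before it, so a forwarded value can be matched by exact version number.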

write-after-write example

many instructions producing different versions of %r8
most recently run instruction may be writing outdated version

multiple instructions that haven’t started
could need different versions of %r8

keeping multiple versions

for write-after-write problem: need to keep copies of multiple versions
- both the new version and the old version needed by delayed instructions
for read-after-write problem: need to distinguish different versions
solution: have lots of extra registers
… and assign each version a new ‘real’ register
called register renaming

register renaming

rename architectural registers to physical registers
different physical register for each version of architectural
track which physical registers are ready
compare physical register numbers to do forwarding
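
A minimal sketch of the four points above, assuming a simple free list of physical registers and two-operand instructions whose destination is also an input. The class name `Renamer`, its methods, and the register counts are hypothetical; real hardware does this with lookup tables, not dictionaries.

```python
class Renamer:
    """Toy register renamer: maps architectural to physical registers."""

    def __init__(self, arch_regs, num_physical):
        # Initially architectural register i lives in physical register i.
        self.map = {reg: i for i, reg in enumerate(arch_regs)}
        self.free = list(range(len(arch_regs), num_physical))
        # Physical registers holding initial values are ready.
        self.ready = {p: True for p in self.map.values()}

    def rename(self, op, src, dst):
        """Rename a two-operand instruction where dst is also an input."""
        src_phys = self.map[src]
        old_dst_phys = self.map[dst]
        if not self.free:
            raise RuntimeError("no free physical register: must stall")
        new_dst_phys = self.free.pop(0)
        self.map[dst] = new_dst_phys      # later readers see the new version
        self.ready[new_dst_phys] = False  # value not computed yet
        return (op, src_phys, old_dst_phys, new_dst_phys)

    def writeback(self, phys):
        """Mark a physical register as holding its computed value."""
        self.ready[phys] = True


renamer = Renamer(["%r8", "%r10", "%r11", "%r12"], num_physical=8)
first = renamer.rename("addq", "%r10", "%r8")
second = renamer.rename("addq", "%r11", "%r8")
print(first)
print(second)
```

The second instruction's input register number equals the first instruction's output register number, which is the comparison that forwarding logic would make.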

an OOO pipeline

an OOO pipeline (branch prediction)

branch prediction needs to happen before instructions decoded
done with cache-like tables of information about recent branches

an OOO pipeline (register renaming)

“dispatch” instructions = add to instruction queue
also requires preparing reorder buffer (for handling squashing)

an OOO pipeline (instruction queue)

instruction queue holds pending renamed instructions
combined with register-ready info to issue instructions
(issue = start executing)
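
How the queue and the register-ready information combine can be sketched as follows. The oldest-first selection policy, the function name `select_ready`, and the tuple layout are assumptions for illustration; the slides do not specify a selection policy.

```python
def select_ready(queue, ready, max_issue):
    """Pick up to max_issue instructions whose source registers are all ready.

    queue: list of (name, sources, dest) using physical register numbers,
           oldest first.  ready: set of physical registers that have values.
    Issued instructions are removed from the queue and returned.
    """
    issued = []
    for instr in list(queue):          # iterate over a copy while removing
        if len(issued) == max_issue:
            break
        name, sources, dest = instr
        if all(s in ready for s in sources):
            issued.append(instr)
            queue.remove(instr)
    return issued


queue = [("add1", (1, 0), 4),   # inputs ready
         ("add2", (2, 4), 5),   # waits for register 4 from add1
         ("mul",  (3, 1), 6)]   # inputs ready
ready = {0, 1, 2, 3}
issued = select_ready(queue, ready, max_issue=2)
print([name for name, _, _ in issued])
```

Here `mul` issues ahead of the older `add2`, which stays in the queue until physical register 4 becomes ready.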

an OOO pipeline (starting instruction)

after selecting from instruction queue,
read from large register file and handle forwarding

typically read 6+ registers at a time
extra data paths for forwarding (not drawn)

an OOO pipeline (execution units)

many execution units actually do math or cache load/store
some may have multiple pipeline stages
some may take variable time (data cache, integer divide, …)

an OOO pipeline (writeback)

writeback to physical registers
register file typically supports writing 3+ registers at a time

an OOO pipeline (commit)

new commit (sometimes called retire) stage finalizes instructions
figures out when physical registers can be reused

also tracks bookkeeping information if we need to undo instruction
(because of branch misprediction/segfault/etc.)
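
One common way to do this bookkeeping is sketched below. It is an assumption, not taken from the slides: each reorder buffer entry remembers the physical register that previously held the destination's value, and that register is freed when the entry commits. The class and field names are hypothetical.

```python
from collections import deque


class ReorderBuffer:
    """Toy reorder buffer: instructions finish in any order, commit in order."""

    def __init__(self):
        self.entries = deque()  # oldest instruction first
        self.free_list = []     # physical registers that can be reused

    def add(self, name, old_phys):
        """Record a renamed instruction; old_phys held dst's previous version."""
        entry = {"name": name, "old_phys": old_phys, "done": False}
        self.entries.append(entry)
        return entry

    def commit(self):
        """Commit finished instructions in program order.

        Stops at the first unfinished instruction, so nothing younger than
        it becomes final.
        """
        committed = []
        while self.entries and self.entries[0]["done"]:
            entry = self.entries.popleft()
            # All older instructions have committed, so nobody still needs
            # the previous version of the destination register.
            self.free_list.append(entry["old_phys"])
            committed.append(entry["name"])
        return committed


rob = ReorderBuffer()
add1 = rob.add("add1", old_phys=0)
add2 = rob.add("add2", old_phys=4)
mul = rob.add("mul", old_phys=3)
mul["done"] = True            # finishes first, out of order
add2["done"] = True
first_try = rob.commit()      # add1 is not done, so nothing commits
add1["done"] = True
second_try = rob.commit()
print(first_try, second_try, rob.free_list)
```

Undoing a mispredicted or faulting instruction would use the same entries in reverse, which this sketch does not attempt.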

an OOO pipeline diagram

an OOO pipeline (register renaming)

register renaming

rename architectural registers to physical registers
- architectural = part of instruction set architecture
different name for each version of architectural register

register renaming state

register renaming example (1)

register renaming example (2)

register renaming exercise

an OOO pipeline (starting instruction)

instruction queue and dispatch

instruction queue and dispatch ex

an OOO pipeline (execution units)

execution units AKA functional units (1a)

where actual work of instruction is done
e.g. the actual ALU, or data cache
sometimes pipelined:
- (here: 1 op/cycle; 3 cycle latency)

exercise: how long to compute \(A\times (B\times (C\times D))\)?

\(3\times 3\) cycles + any time to forward values

execution units AKA functional units (1b)

exercise: how long to compute \((A\times B)\times (C\times D)\)?

assuming just one ALU:

	1	2	3	4	5	6	7
\(A\times B\)	1	2	3
\(C\times D\)		1	2	3
final					1	2	3

exercise part 2: how much would second ALU help?
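
The timing tables in these exercises can be reproduced with a small cycle-by-cycle simulation. This is a sketch under stated assumptions: every operation uses the same kind of pipelined unit (one new operation per cycle per unit, 3-cycle latency), forwarding takes no extra time, and the dependency graph has no cycles. The function name `schedule` and the operation names are made up for the example.

```python
def schedule(ops, num_units=1, latency=3):
    """Greedy schedule of operations on pipelined functional units.

    ops: dict mapping operation name -> list of names it depends on,
         in program order.
    Returns dict name -> (first_cycle, last_cycle).
    """
    times = {}
    cycle = 1
    while len(times) < len(ops):
        started = 0
        for name, deps in ops.items():
            if name in times or started == num_units:
                continue
            # Ready once every input finished in an earlier cycle.
            if all(d in times and times[d][1] < cycle for d in deps):
                times[name] = (cycle, cycle + latency - 1)
                started += 1
        cycle += 1
    return times


# (A*B)*(C*D): the two inner products are independent.
balanced = {"AB": [], "CD": [], "final": ["AB", "CD"]}
# A*(B*(C*D)): every multiply waits for the previous one.
chain = {"CD": [], "BCD": ["CD"], "ABCD": ["BCD"]}

for label, ops in (("balanced", balanced), ("chain", chain)):
    print(label, schedule(ops))
```

With one unit, the balanced form finishes in cycle 7 and the chained form in cycle 9. Calling `schedule(balanced, num_units=2)` is one way to explore part 2 of the exercise.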

execution units AKA functional units (2)

where actual work of instruction is done
e.g. the actual ALU, or data cache
sometimes unpipelined:

instruction queue and dispatch (multicycle)

register renaming: missing pieces

what about ‘‘hidden’’ inputs like %rsp, condition codes?
one solution: translate to instructions with additional register parameters
- making %rsp explicit parameter
- turning hidden condition codes into operands!
bonus: can also translate complex instructions to simpler ones
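
A rough sketch of that translation, covering only three instructions. The output format `(op, inputs, outputs)`, the `store` operation, and the register name `%cc` for the condition codes are all invented for this illustration; real processors use their own internal micro-operation formats.

```python
def make_explicit(instr):
    """Translate one instruction into simpler ones with no hidden operands.

    Output tuples are (op, inputs, outputs).  "%cc" is a made-up name for
    the condition codes, treated as an ordinary renameable register.
    """
    op, *args = instr
    if op == "pushq":
        (src,) = args
        # %rsp becomes an explicit input and output of the subtraction,
        # and an explicit address input of the store.
        return [("subq", ("$8", "%rsp"), ("%rsp",)),
                ("store", (src, "%rsp"), ())]
    if op == "cmpq":
        a, b = args
        return [("cmpq", (a, b), ("%cc",))]      # writes condition codes
    if op == "jle":
        (target,) = args
        return [("jle", ("%cc", target), ())]    # reads condition codes
    raise ValueError(f"unhandled instruction: {op}")


for instr in [("pushq", "%rax"), ("cmpq", "%r8", "%r9"), ("jle", "done")]:
    print(instr, "->", make_explicit(instr))
```

Once `%rsp` and `%cc` appear as ordinary operands, they can be renamed and tracked for readiness like any other register.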

data flow visualization

OOO limitations

can’t always find instructions to run
- plenty of instructions, but all depend on unfinished ones
- programmer can adjust program to help this
need to track all uncommitted instructions
- can only go so far ahead
- e.g. Intel Skylake: 224-entry reorder buffer, 168 physical registers
branch misprediction has a big cost (relative to pipelined)
- e.g. Intel Skylake: up to approx. 16 cycles (vs. 2 for simple pipelined CPU)

some performance examples

data flow model and limits

Backup slides

data flow model and limits (1)

data flow visualization

Intel Skylake OOO design

2015 Intel design — codename ‘Skylake’
94-entry instruction queue-equivalent
168 physical integer registers
168 physical floating point registers
4 ALU functional units
- but some can handle more/different types of operations than others
2 load functional units
- but pipelined: supports multiple pending cache misses in parallel
1 store functional unit
224-entry reorder buffer
- determines how far ahead branch mispredictions, etc. can happen

indirect branch prediction

jmp *%rax or jmp *(%rax, %rcx, 8) or call *%rax or …
- example: implementing switch statement
- example: implementing polymorphic method call
want to predict target address
BTB could store one possibility
really want to take advantage of context
- example: guess whether we’ll call Rectangle.GetArea or Circle.GetArea
prediction idea: instead of tables containing taken/not taken…
… have table containing predicted target
- one implementation: hashtable keyed by recent branches taken
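
The hashtable idea can be sketched as below. This is an assumption-heavy toy: the key is the branch address plus the targets of the last few indirect branches, whereas real predictors hash a longer history into a fixed-size table. The class name, the addresses, and the history length are all hypothetical.

```python
class IndirectPredictor:
    """Toy indirect-branch target predictor keyed by recent branch history."""

    def __init__(self, history_length=2):
        self.history_length = history_length
        self.history = ()   # targets of the most recent indirect branches
        self.table = {}     # (branch address, history) -> predicted target

    def predict(self, branch_addr):
        """Return the predicted target, or None if this context is new."""
        return self.table.get((branch_addr, self.history))

    def update(self, branch_addr, actual_target):
        """Record the real target for this context, then extend the history."""
        self.table[(branch_addr, self.history)] = actual_target
        self.history = (self.history + (actual_target,))[-self.history_length:]


CALL_SITE = 0x400                    # hypothetical address of the call
RECTANGLE, CIRCLE = 0x1000, 0x2000   # hypothetical GetArea addresses

predictor = IndirectPredictor(history_length=2)
correct = 0
for actual in [RECTANGLE, CIRCLE] * 10:   # call site alternates between types
    if predictor.predict(CALL_SITE) == actual:
        correct += 1
    predictor.update(CALL_SITE, actual)
print(correct, "of 20 predicted correctly")
```

After a short warm-up the alternating pattern is predicted every time, whereas a table that stores only the last target for the call site would mispredict every call in this pattern.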

reorder buffer: on rename

reorder buffer: on commit

reorder buffer: commit mispredict (one way)

better? alternatives

can take snapshots of register map on each branch
- don’t need to reconstruct the table
- (but how to efficiently store them)
can reconstruct register map before we commit the branch instruction
- need to let reorder buffer be accessed even more?
can track more/different information in reorder buffer
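
The snapshot alternative can be sketched as follows, ignoring the storage-cost question raised above by simply copying the whole map. The class name, method names, and branch identifiers are hypothetical.

```python
class RenameMapWithSnapshots:
    """Toy register map that is copied at every branch for fast recovery."""

    def __init__(self, initial_map):
        self.map = dict(initial_map)   # architectural -> physical
        self.snapshots = {}            # branch id -> map copy, oldest first

    def on_branch(self, branch_id):
        """Save the map as it is when the branch is renamed."""
        self.snapshots[branch_id] = dict(self.map)

    def rename_dest(self, arch_reg, new_phys):
        self.map[arch_reg] = new_phys

    def on_mispredict(self, branch_id):
        """Restore the map saved at the branch; drop it and younger snapshots."""
        ids = list(self.snapshots)     # dicts keep insertion order
        older = ids[:ids.index(branch_id)]
        self.map = self.snapshots[branch_id]
        self.snapshots = {b: self.snapshots[b] for b in older}

    def on_branch_commit(self, branch_id):
        """A correctly predicted branch no longer needs its snapshot."""
        self.snapshots.pop(branch_id, None)


regs = RenameMapWithSnapshots({"%r8": 0, "%r10": 1})
regs.rename_dest("%r8", 4)
regs.on_branch("b1")
regs.rename_dest("%r8", 5)      # speculative, after b1
regs.on_branch("b2")
regs.rename_dest("%r10", 6)     # speculative, after b2
regs.on_mispredict("b1")
print(regs.map)
```

Recovery is a single copy, with no walk through the reorder buffer, at the price of one saved map per in-flight branch.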

exceptions and OOO (one strategy)

the open-source BROOM pipeline

Figure from Celio et al., ‘‘BROOM: An Open Source Out-Of-Order Processor With Resilient Low-Voltage Operation in 28-nm CMOS’’

data flow model and limits

better data-flow

// two independent chains of additions (assumes N is even)
int sum1 = 0, sum2 = 0;
for (int i = 0; i < N; i += 2) {
    sum1 += array[i];
    sum2 += array[i+1];
}
sum = sum1 + sum2;
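
A back-of-the-envelope estimate of why two accumulators help. It assumes each addition has latency \(\ell\) cycles, that loads and loop overhead overlap fully with the additions, and that there are enough functional units; the symbols \(T\) and \(\ell\) are introduced here for illustration only.

```latex
T_{\text{one accumulator}} \approx N \cdot \ell
\qquad
T_{\text{two accumulators}} \approx \frac{N}{2} \cdot \ell + \ell
```

With a single accumulator every addition waits for the previous one. With two accumulators each chain has only \(N/2\) dependent additions, plus one final addition to combine them.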

data flow as loop optimization
