ooo

beyond pipelining: multiple issue

  • start more than one instruction/cycle
  • multiple parallel pipelines; many-input/output register file
  • hazard handling much more complex

beyond pipelining: out-of-order

  • find later instructions to do instead of stalling

  • lists of available instructions in pipeline registers

    • take any instruction with available values
  • provide illusion that work is still done in order

    • much more complicated hazard handling logic

interlude: real CPUs

  • modern CPUs:
    • execute multiple instructions at once
    • execute instructions out of order, whenever values are available

out-of-order and hazards

  • out-of-order execution makes hazards harder to handle

  • problems for forwarding:
    • value in last stage may not be most up-to-date
    • older value may be written back before newer value?
  • problems for branch prediction:
    • mispredicted instructions may complete execution before squashing
  • which instructions to dispatch?
    • how to quickly find instructions that are ready?

read-after-write examples (1)


normal pipeline: two options for %r8?
choose the one from earliest stage
because it’s from the most recent instruction

out-of-order execution:
%r8 from earliest stage might be from delayed instruction
can’t use same forwarding logic

register version tracking

  • goal: track different versions of registers
  • out-of-order execution: may compute versions at different times
  • only forward the correct version
  • strategy for doing this: preprocess instructions to represent version info
  • makes forwarding, etc. lookup easier

rewriting hazard examples (1)

original instruction    rewritten with versions
addq %r10, %r8          addq %r10v1, %r8v1 \(\rightarrow\) %r8v2
addq %r11, %r8          addq %r11v1, %r8v2 \(\rightarrow\) %r8v3
addq %r12, %r8          addq %r12v1, %r8v3 \(\rightarrow\) %r8v4

  • read different version than the one written

    • represent with three-argument pseudo-instructions
  • forwarding a value? must match version exactly


  • for now: version numbers

  • later: something simpler to implement
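The rewriting above can be sketched in a few lines. This is only an illustration, assuming two-operand instructions of the form `op src, dst` where `dst` is both read and written; the function name `add_versions` and the tuple layout are made up for this sketch.

```python
def add_versions(instructions):
    """Rewrite (op, src, dst) into (op, src_version, dst_version_read, dst_version_written).

    Assumes every instruction reads src and dst and then writes dst.
    """
    version = {}  # register name -> current version number (starts at 1)

    def current(reg):
        return f"{reg}v{version.setdefault(reg, 1)}"

    rewritten = []
    for op, src, dst in instructions:
        src_read = current(src)
        dst_read = current(dst)
        version[dst] += 1          # each write creates a new version
        dst_written = current(dst)
        rewritten.append((op, src_read, dst_read, dst_written))
    return rewritten


program = [("addq", "%r10", "%r8"),
           ("addq", "%r11", "%r8"),
           ("addq", "%r12", "%r8")]
versioned = add_versions(program)
```

Each instruction now names exactly which version it reads, so a forwarded value can be matched by comparing version names.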

write-after-write example

many instructions producing different versions of %r8
most recently run instruction may be writing outdated version

multiple instructions that haven’t started
could need different versions of %r8

keeping multiple versions

  • for write-after-write problem: need to keep copies of multiple versions

    • both the new version and the old version needed by delayed instructions
  • for read-after-write problem: need to distinguish different versions

  • solution: have lots of extra registers

  • … and assign each version a new ‘real’ register


  • called register renaming

register renaming

  • rename architectural registers to physical registers
  • different physical register for each version of architectural
  • track which physical registers are ready
  • compare physical register numbers to do forwarding
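A minimal sketch of the renaming bookkeeping, under some assumptions: physical registers are named `%x0`, `%x1`, … (hypothetical names), free registers are handed out from a simple queue, and instructions have the two-operand form `op src, dst`. Real hardware does this with tables and dedicated logic, not a class like this.

```python
from collections import deque


class Renamer:
    def __init__(self, arch_regs, num_physical):
        # Assumption: initially architectural register i lives in physical register i.
        self.map = {reg: f"%x{i}" for i, reg in enumerate(arch_regs)}
        self.free = deque(f"%x{i}" for i in range(len(arch_regs), num_physical))
        self.ready = set(self.map.values())  # physical registers whose values exist

    def rename(self, op, src, dst):
        """Return (op, src_phys, old_dst_phys, new_dst_phys)."""
        src_phys = self.map[src]
        old_dst_phys = self.map[dst]
        if not self.free:
            raise RuntimeError("no free physical register: renaming must stall")
        new_dst_phys = self.free.popleft()   # new version gets a new register
        self.map[dst] = new_dst_phys         # later readers of dst see the new one
        return (op, src_phys, old_dst_phys, new_dst_phys)

    def mark_ready(self, phys):
        # Called when an instruction writes its result.
        self.ready.add(phys)


renamer = Renamer(["%r8", "%r10", "%r11"], num_physical=6)
first = renamer.rename("addq", "%r10", "%r8")
second = renamer.rename("addq", "%r11", "%r8")
```

The second instruction reads the physical register the first one writes, so comparing physical register numbers is enough to decide what to forward.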

an OOO pipeline

an OOO pipeline (branch prediction)

branch prediction needs to happen before instructions decoded
done with cache-like tables of information about recent branches

an OOO pipeline (register renaming)

register renaming after decoding
where mapping from architectural to physical names kept

“dispatch” instructions = add to instruction queue
also requires preparing reorder buffer (for handling squashing)

an OOO pipeline (instruction queue)

instruction queue holds pending renamed instructions
combined with register-ready info to issue instructions
(issue = start executing)
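One way to picture the issue step: scan the queue and pick instructions whose source physical registers are all ready. The sketch below assumes a hypothetical `(name, sources, dest)` layout and an oldest-first selection policy; actual selection logic is a hardware design choice.

```python
def select_ready(queue, ready_regs, max_issue):
    """Split queue (in program order) into (issued, remaining).

    An instruction can issue when every source physical register is ready
    and fewer than max_issue instructions were already picked this cycle.
    """
    issued, remaining = [], []
    for instr in queue:
        name, sources, dest = instr
        if len(issued) < max_issue and all(s in ready_regs for s in sources):
            issued.append(instr)
        else:
            remaining.append(instr)
    return issued, remaining


queue = [("A", ["%x1", "%x2"], "%x5"),
         ("B", ["%x5"], "%x6"),       # needs A's result, not ready yet
         ("C", ["%x3"], "%x7")]
issued, remaining = select_ready(queue, {"%x1", "%x2", "%x3"}, max_issue=2)
```

Here C issues ahead of B even though it comes later in program order, which is the out-of-order part.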

an OOO pipeline (starting instruction)

after selecting from instruction queue,
read from large register file and handle forwarding

typically read 6+ registers at a time
extra data paths for forwarding (not drawn)

an OOO pipeline (execution units)

many execution units actually do math or cache load/store
some may have multiple pipeline stages
some may take variable time (data cache, integer divide, …)

an OOO pipeline (writeback)

writeback to physical registers
register file typically supports writing 3+ registers at a time

an OOO pipeline (commit)

new commit (sometimes called retire) stage finalizes instructions
figures out when physical registers can be reused

also tracks bookkeeping information if we need to undo instruction
(because of branch misprediction/segfault/etc.)
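A rough sketch of in-order commit, assuming one common scheme: each reorder buffer entry remembers the physical register that previously held its destination, and that old register becomes reusable once the entry commits. The class and field names are illustrative only.

```python
from collections import deque


class ReorderBuffer:
    def __init__(self):
        self.entries = deque()  # oldest instruction first

    def add(self, name, old_phys):
        """Record an instruction at rename time; old_phys is the register it replaces."""
        entry = {"name": name, "old_phys": old_phys, "done": False}
        self.entries.append(entry)
        return entry

    def commit(self, free_list):
        """Commit finished instructions from the oldest end only."""
        committed = []
        while self.entries and self.entries[0]["done"]:
            entry = self.entries.popleft()
            # All older instructions have committed, so nobody still reads old_phys.
            free_list.append(entry["old_phys"])
            committed.append(entry["name"])
        return committed


rob = ReorderBuffer()
free_list = []
a = rob.add("A", "%x0")
b = rob.add("B", "%x3")
b["done"] = True                        # B finished executing before A
early = rob.commit(free_list)           # nothing commits: A is still pending
a["done"] = True
late = rob.commit(free_list)
```

Because nothing commits past an unfinished instruction, everything still in the buffer can be undone after a misprediction or fault.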

an OOO pipeline diagram

an OOO pipeline (register renaming)

register renaming

  • rename architectural registers to physical registers

    • architectural = part of instruction set architecture
  • different name for each version of architectural register

register renaming state

register renaming example (1)

register renaming example (2)

register renaming exercise

an OOO pipeline (starting instruction)

instruction queue and dispatch

instruction queue and dispatch ex

an OOO pipeline (execution units)

execution units AKA functional units (1a)

  • where actual work of instruction is done

  • e.g. the actual ALU, or data cache

  • sometimes pipelined:

    • (here: 1 op/cycle; 3 cycle latency)
  • exercise: how long to compute \(A\times (B\times (C\times D))\)?

\(3\times 3\) cycles + any time to forward values

execution units AKA functional units (1b)

  • exercise: how long to compute \((A\times B)\times (C\times D)\)?

assuming just one ALU:

cycle           1  2  3  4  5  6  7
\(A\times B\)   1  2  3
\(C\times D\)      1  2  3
final                       1  2  3
  • exercise part 2: how much would second ALU help?
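The timing in these exercises can be checked with a small model. It assumes each ALU accepts one new operation per cycle, results appear after a 3-cycle latency, forwarding takes no extra time, and operations are placed greedily in program order; `schedule` is a name invented for this sketch.

```python
def schedule(ops, num_alus=1, latency=3):
    """ops: list of (name, [names it depends on]), dependencies listed first.

    Returns name -> (start_cycle, finish_cycle).
    """
    times = {}
    started = {}  # cycle -> number of operations started in that cycle
    for name, deps in ops:
        # Earliest start: the cycle after the last input finishes.
        cycle = 1 + max((times[d][1] for d in deps), default=0)
        while started.get(cycle, 0) >= num_alus:
            cycle += 1  # every ALU already accepted an operation this cycle
        started[cycle] = started.get(cycle, 0) + 1
        times[name] = (cycle, cycle + latency - 1)
    return times


chain = [("CD", []), ("BCD", ["CD"]), ("ABCD", ["BCD"])]      # A*(B*(C*D))
balanced = [("AB", []), ("CD", []), ("final", ["AB", "CD"])]  # (A*B)*(C*D)

chain_cycles = schedule(chain)["ABCD"][1]
one_alu_cycles = schedule(balanced, num_alus=1)["final"][1]
two_alu_cycles = schedule(balanced, num_alus=2)["final"][1]
```

Under these assumptions the dependent chain takes 9 cycles, the balanced form takes 7 cycles on one pipelined ALU, and a second ALU brings that to 6.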

execution units AKA functional units (2)

  • where actual work of instruction is done
  • e.g. the actual ALU, or data cache
  • sometimes unpipelined:

instruction queue and dispatch (multicycle)

register renaming: missing pieces

  • what about ‘‘hidden’’ inputs like %rsp, condition codes?

  • one solution: translate to instructions with additional register parameters

    • making %rsp explicit parameter
    • turning hidden condition codes into operands!
  • bonus: can also translate complex instructions to simpler ones
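As a sketch of that translation: each instruction becomes one or more simpler operations with every input and output listed, so the renamer can treat %rsp and the condition codes like any other register. The operation names `sub8` and `store` and the pseudo-register `%cc` are hypothetical, chosen only for this example.

```python
def make_explicit(instr):
    """Translate (op, *operands) into a list of (op, inputs, outputs)."""
    op, *args = instr
    if op == "pushq":
        (src,) = args
        return [
            ("sub8", ["%rsp"], ["%rsp"]),     # %rsp is now an explicit operand
            ("store", [src, "%rsp"], []),     # write src to memory at new %rsp
        ]
    if op in ("addq", "subq"):
        src, dst = args
        return [(op, [src, dst], [dst, "%cc"])]   # condition codes are an output
    if op == "jle":
        # Branch target omitted here: it is a label, not a register.
        return [("jle", ["%cc"], [])]             # condition codes are an input
    raise ValueError(f"unhandled instruction: {op}")


push_ops = make_explicit(("pushq", "%rax"))
add_ops = make_explicit(("addq", "%r10", "%r8"))
```

After this step a later `jle` simply depends on whichever renamed version of `%cc` the preceding arithmetic instruction produced.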

data flow visualization

OOO limitations

  • can’t always find instructions to run

    • plenty of instructions, but all depend on unfinished ones
    • programmer can adjust program to help this
  • need to track all uncommitted instructions

    • can only go so far ahead
    • e.g. Intel Skylake: 224-entry reorder buffer, 168 physical registers
  • branch misprediction has a big cost (relative to pipelined)

    • e.g. Intel Skylake: up to approx. 16 cycles (vs. 2 for a simple pipelined CPU)

some performance examples

data flow model and limits

Backup slides

data flow model and limits (1)

data flow visualization

Intel Skylake OOO design

  • 2015 Intel design — codename ‘Skylake’

  • 94-entry instruction queue-equivalent

  • 168 physical integer registers

  • 168 physical floating point registers

  • 4 ALU functional units

    • but some can handle more/different types of operations than others
  • 2 load functional units

    • but pipelined: supports multiple pending cache misses in parallel
  • 1 store functional unit

  • 224-entry reorder buffer

    • determines how far ahead branch mispredictions, etc. can happen

indirect branch prediction

  • jmp *%rax or jmp *(%rax, %rcx, 8) or call *%rax or …
    • example: implementing switch statement
    • example: implementing polymorphic method call
  • want to predict target address
  • BTB could store one possibility
  • really want to take advantage of context
    • example: guess whether we’ll call Rectangle.GetArea or Circle.GetArea
  • prediction idea: instead of tables containing taken/not taken…
  • … have table containing predicted target
    • one implementation: hashtable keyed by recent branches taken

reorder buffer: on rename

reorder buffer: on commit

reorder buffer: commit mispredict (one way)

better? alternatives

  • can take snapshots of register map on each branch

    • don’t need to reconstruct the table
    • (but how to efficiently store them)
  • can reconstruct register map before we commit the branch instruction

    • need to let reorder buffer be accessed even more?
  • can track more/different information in reorder buffer
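The snapshot alternative can be sketched as follows, assuming the register map is small enough to copy whole and that each in-flight branch has some identifier (`branch_id` here is hypothetical). Discarding snapshots of younger branches is left out to keep the sketch short.

```python
class RenameMap:
    def __init__(self, initial):
        self.map = dict(initial)   # architectural -> physical register
        self.snapshots = {}        # branch_id -> saved copy of the map

    def snapshot(self, branch_id):
        """Save the map when a branch is renamed."""
        self.snapshots[branch_id] = dict(self.map)

    def restore(self, branch_id):
        """On misprediction, go back to the map as of that branch."""
        self.map = self.snapshots.pop(branch_id)


rename_map = RenameMap({"%r8": "%x0", "%r9": "%x1"})
rename_map.snapshot("branch1")
rename_map.map["%r8"] = "%x5"      # renamed by a wrongly predicted instruction
rename_map.restore("branch1")
```

Restoring is a single copy, at the cost of storing one full map per unresolved branch.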

exceptions and OOO (one strategy)

the open-source BROOM pipeline

Figure from Celio et al., ‘‘BROOM: An Open Source Out-Of-Order Processor With Resilient Low-Voltage Operation in 28-nm CMOS’’

data flow model and limits

better data-flow

int sum1 = 0, sum2 = 0;
for (int i = 0; i < N; i += 2) {
    sum1 += array[i];
    sum2 += array[i+1];
}
sum = sum1 + sum2;
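Why splitting the sum helps can be seen by counting the longest chain of additions that each depend on the previous one. The helper below is a back-of-the-envelope sketch: it ignores loads and loop overhead and assumes the partial sums are combined one after another at the end.

```python
def longest_add_chain(n, accumulators):
    """Longest chain of dependent additions when summing n elements."""
    per_accumulator = -(-n // accumulators)   # ceil(n / accumulators)
    combine = accumulators - 1                # final sum1 + sum2 + ...
    return per_accumulator + combine


one_accumulator = longest_add_chain(1000, 1)
two_accumulators = longest_add_chain(1000, 2)
```

With 1000 elements, one accumulator gives a chain of 1000 dependent additions, while two give 501, so an out-of-order processor with spare adders can overlap roughly half the work.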

data flow as loop optimization
