find later instructions to do instead of stalling
lists of available instructions in pipeline registers
provide illusion that work is still done in order
normal pipeline: two options for %r8?
choose the one from earliest stage
because it’s from the most recent instruction
out-of-order execution:
%r8 from earliest stage might be from delayed instruction
can’t use same forwarding logic
| original instruction | with register versions |
|---|---|
| addq %r10, %r8 | addq %r10v1, %r8v1 \(\rightarrow\) %r8v2 |
| addq %r11, %r8 | addq %r11v1, %r8v2 \(\rightarrow\) %r8v3 |
| addq %r12, %r8 | addq %r12v1, %r8v3 \(\rightarrow\) %r8v4 |
read different version than the one written
forwarding a value? must match version exactly
for now: version numbers
later: something simpler to implement
many instructions producing different versions of %r8
most recently run instruction may be writing outdated version
multiple instructions that haven’t started
could need different versions of %r8
for write-after-write problem: need to keep copies of multiple versions
for read-after-write problem: need to distinguish different versions
solution: have lots of extra registers
… and assign each version a new ‘real’ register
called register renaming
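A minimal C sketch of the renaming idea, applied to the three addq instructions above. The table sizes, the in-order "free list" counter, and the names `rename_state`, `renamed_op`, `rename_init`, and `rename_add` are all assumptions made for illustration, not taken from any particular processor.

```c
#include <assert.h>

#define NUM_ARCH_REGS 16   /* x86-64 integer registers */
#define NUM_PHYS_REGS 64   /* hypothetical physical register file size */

/* Map from architectural register number to the physical register
   that holds its newest version. */
struct rename_state {
    int map[NUM_ARCH_REGS];
    int next_free;          /* simplistic free list: hand out registers in order */
};

/* A renamed two-operand instruction such as addq %src, %dst. */
struct renamed_op {
    int src_phys;       /* physical register read for the source */
    int dst_old_phys;   /* physical register read for the old destination value */
    int dst_new_phys;   /* fresh physical register that receives the result */
};

void rename_init(struct rename_state *s) {
    for (int i = 0; i < NUM_ARCH_REGS; i++)
        s->map[i] = i;      /* architectural register i starts in physical register i */
    s->next_free = NUM_ARCH_REGS;
}

/* Rename "addq %src, %dst": look up the current versions of both operands,
   then give the destination a new physical register (a new version). */
struct renamed_op rename_add(struct rename_state *s, int src, int dst) {
    struct renamed_op op;
    assert(s->next_free < NUM_PHYS_REGS);   /* a real design would stall instead */
    op.src_phys = s->map[src];
    op.dst_old_phys = s->map[dst];
    op.dst_new_phys = s->next_free++;
    s->map[dst] = op.dst_new_phys;
    return op;
}
```

Renaming addq %r10, %r8 and then addq %r11, %r8 with this sketch makes the second instruction read exactly the physical register the first one writes, which is the "must match version exactly" requirement.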
branch prediction needs to happen before instructions decoded
done with cache-like tables of information about recent branches
register renaming after decoding
where mapping from architectural to physical names kept
“dispatch” instructions = add to instruction queue
also requires preparing reorder buffer (for handling squashing)
instruction queue holds pending renamed instructions
combined with register-ready info to issue instructions
(issue = start executing)
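One way to picture the issue step is as a scan over the queue for an instruction whose source registers are all ready. The sketch below assumes an oldest-first policy, a tiny 8-entry queue, and hypothetical names (`queued_op`, `select_ready`); real designs use parallel hardware rather than a loop.

```c
#include <assert.h>
#include <stdbool.h>

#define NUM_PHYS_REGS 64   /* hypothetical physical register count */
#define QUEUE_SIZE 8       /* hypothetical instruction queue size */

struct queued_op {
    bool valid;        /* slot holds a renamed instruction that has not issued */
    int src1, src2;    /* physical source registers */
    int dst;           /* physical destination register */
};

/* Return the index of the first queued instruction whose sources are
   both ready, or -1 if nothing can issue this cycle. */
int select_ready(const struct queued_op queue[QUEUE_SIZE],
                 const bool ready[NUM_PHYS_REGS]) {
    for (int i = 0; i < QUEUE_SIZE; i++) {
        if (!queue[i].valid)
            continue;
        assert(queue[i].src1 >= 0 && queue[i].src1 < NUM_PHYS_REGS);
        assert(queue[i].src2 >= 0 && queue[i].src2 < NUM_PHYS_REGS);
        if (ready[queue[i].src1] && ready[queue[i].src2])
            return i;
    }
    return -1;
}
```

An instruction earlier in the queue that is still waiting on a source is simply skipped, which is how a later instruction gets to run instead of stalling.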
after selecting from instruction queue,
read from large register file and handle forwarding
typically read 6+ registers at a time
extra data paths for forwarding (not drawn)
many execution units actually do math or cache load/store
some may have multiple pipeline stages
some may take variable time (data cache, integer divide, …)
writeback to physical registers
register file typically supports writing 3+ registers at a time
new commit (sometimes called retire) stage finalizes instructions
figures out when physical registers can be reused
also tracks bookkeeping information if we need to undo instruction
(because of branch misprediction/segfault/etc.)
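A rough sketch of in-order commit, assuming the reorder buffer is a small circular buffer and that each entry remembers which physical register held the previous version of its destination. The names `rob`, `rob_entry`, and `rob_commit` are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

#define ROB_SIZE 8   /* hypothetical reorder buffer size */

struct rob_entry {
    bool done;      /* instruction has finished executing */
    int old_phys;   /* physical register holding the previous version of its destination */
};

/* Circular buffer of uncommitted instructions, oldest at head. */
struct rob {
    struct rob_entry entries[ROB_SIZE];
    int head;
    int count;
};

/* Commit finished instructions from the head, in program order.
   Stops at the first unfinished instruction even if younger ones are done.
   Physical registers that can be reused are written to freed[];
   the return value is how many were freed. */
int rob_commit(struct rob *r, int freed[ROB_SIZE]) {
    int n = 0;
    assert(r->count >= 0 && r->count <= ROB_SIZE);
    while (r->count > 0 && r->entries[r->head].done) {
        freed[n++] = r->entries[r->head].old_phys;
        r->head = (r->head + 1) % ROB_SIZE;
        r->count--;
    }
    return n;
}
```

The previous version can be freed at commit because every older instruction that might read it has already committed; until then, the entry stays in the buffer so the instruction can still be undone.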
rename architectural registers to physical registers
different name for each version of architectural register
where actual work of instruction is done
e.g. the actual ALU, or data cache
sometimes pipelined:
\(3\times 3\) cycles + any time to forward values
assuming just one ALU:
| | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
|---|---|---|---|---|---|---|---|
| \(A\times B\) | 1 | 2 | 3 | | | | |
| \(C\times D\) | | 1 | 2 | 3 | | | |
| final | | | | | 1 | 2 | 3 |
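The cycle counts can be checked with a little arithmetic. The sketch below assumes a single pipelined unit with 3-cycle latency that can start one operation per cycle, zero forwarding delay, and that the 3 × 3 figure refers to three operations run strictly one after another; the function names are made up for this example.

```c
#include <assert.h>

#define LATENCY 3   /* cycles an operation spends in the unit */

/* n dependent operations: each must wait for the previous result,
   so pipelining does not help. */
int chain_finish_cycle(int n) {
    assert(n >= 0);
    return n * LATENCY;
}

/* (A*B)*(C*D) on one pipelined unit: the two independent multiplies
   start in back-to-back cycles; the final multiply starts once the
   later of the two has finished. */
int tree_finish_cycle(void) {
    int ab_start = 1;
    int cd_start = ab_start + 1;            /* next issue slot on the same unit */
    int cd_done = cd_start + LATENCY - 1;   /* last cycle C*D occupies */
    int final_start = cd_done + 1;
    return final_start + LATENCY - 1;
}
```

Under these assumptions three chained operations finish in cycle 9, while overlapping the two independent multiplies finishes in cycle 7.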
what about ‘‘hidden’’ inputs like %rsp, condition codes?
one solution: translate to instructions with additional register parameters
bonus: can also translate complex instructions to simpler ones
can’t always find instructions to run
need to track all uncommitted instructions
branch misprediction has a big cost (relative to pipelined)
2015 Intel design — codename ‘Skylake’
94-entry instruction queue-equivalent
168 physical integer registers
168 physical floating point registers
4 ALU functional units
2 load functional units
1 store functional unit
224-entry reorder buffer
jmp *%rax or jmp *(%rax, %rcx, 8) or call *%rax or …
can take snapshots of register map on each branch
can reconstruct register map before we commit the branch instruction
can track more/different information in reorder buffer
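The first option, snapshotting the register map on each branch, might look roughly like this. The limit on unresolved branches and the names `reg_map`, `checkpoints`, `take_snapshot`, and `restore_snapshot` are assumptions for illustration only.

```c
#include <assert.h>

#define NUM_ARCH_REGS 16
#define MAX_BRANCHES 4   /* hypothetical limit on unresolved branches */

struct reg_map {
    int phys[NUM_ARCH_REGS];   /* architectural -> physical register */
};

struct checkpoints {
    struct reg_map saved[MAX_BRANCHES];
    int count;                 /* number of snapshots currently held */
};

/* Save a copy of the map when a branch is renamed; returns the snapshot id. */
int take_snapshot(struct checkpoints *c, const struct reg_map *current) {
    assert(c->count < MAX_BRANCHES);   /* a real design would stall instead */
    c->saved[c->count] = *current;
    return c->count++;
}

/* On a misprediction, restore the map as it was at the branch and drop
   this snapshot along with those of all younger branches. */
void restore_snapshot(struct checkpoints *c, struct reg_map *current, int id) {
    assert(id >= 0 && id < c->count);
    *current = c->saved[id];
    c->count = id;
}
```

The trade-off in this sketch is storage: every unresolved branch costs a full copy of the map, which is why the other two options rebuild the map from reorder buffer information instead.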
Figure from Celio et al., ‘‘BROOM: An Open Source Out-Of-Order Processor With Resilient Low-Voltage Operation in 28-nm CMOS’’
int sum1 = 0, sum2 = 0;
/* two independent accumulators: the two adds in an iteration do not depend on each other */
for (int i = 0; i < N; i += 2) {
    sum1 += array[i];
    sum2 += array[i+1];
}
sum = sum1 + sum2;
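The fragment above assumes N is even. A self-contained variant that also handles an odd element count could look like the following; the function name `sum_two_accumulators` and the `size_t` length parameter are choices made for this sketch.

```c
#include <assert.h>
#include <stddef.h>

/* Sum an array using two independent accumulators, so that consecutive
   additions do not form a single dependency chain. */
int sum_two_accumulators(const int *array, size_t n) {
    assert(array != NULL || n == 0);
    int sum1 = 0, sum2 = 0;
    size_t i = 0;
    for (; i + 1 < n; i += 2) {
        sum1 += array[i];
        sum2 += array[i + 1];
    }
    if (i < n)              /* leftover element when n is odd */
        sum1 += array[i];
    return sum1 + sum2;
}
```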