Note:
make test-pipehw2in versions ofhclrs.tarbefore 21 March 2018 12:15pm were pickier than we intended to be with respect to load-use hazards.(It assumed that you would stall whenever a loaded value is used in the following instruction (like the textbook does), even if the value is not needed until after execute (e.g.
mrmovq (%rax), %rbxfollowed byrmmovq %rbx, (%rcx)) and you could implement another forwarding path instead.)Although we believe that this will affect very few students, we recommend obtaining a new version of
hclrs.tar.
Your task
- Combine your solution from the previous HW and the previous lab into a new file called
pipehw2.hclto create a five-stage pipelined processor with forwarding and branch prediction as described in the textbook that implements:nophaltirmovqrrmovqOPqcmovXXrmmovqmrmovq
We will provide an example lab solution by the end of the day following the lab.
- Add the remaining instructions:
jXXpushqpopqcallret
-
Test your combined simulator with
make test-pipehw2 - Submit your solution to archimedes
Hints/Approach
General Approach
You may approach this however you wish, but I suggest the following flow:
- Combine your
pipehw1.hclandpipelab2.hcland test the combination. All of the tests that either source file passed previously ought to still pass the combination. - Add
jXXwith speculative execution and branch misprediction recovery. Predict that all branches are taken. Test. - Add
pushqand test. - Add
calland test. - Add
popqand test. - Add
retwith handling for the return hazard, and test.
Implementing jXX
| 1 | 2 | 3 | 4 | 5 | 6 | 7 | |||
|---|---|---|---|---|---|---|---|---|---|
jXX |
F | D | E | (next PC available) | M | W | |||
wrong1 |
F | D | (bubble needed) | ||||||
wrong2 |
F | (bubble needed) | |||||||
right1 |
F | D | E | M | W |
-
Replace
pcin thefF(orxForpPor whatever else you called it) register bank wtihpredPC, which will store a predicted PC value instead of the actual PC value.To speculatively use this prediction, we can set
pcto the predicted PC (pc = F_predPC). -
Your processor should predict that all
jXXs are taken (the new PC isvalC). -
We will detect that predictions are wrong near the end of the
jXX’s execute stage (when we check the condition codes). We will fetch the correct instruction during the fetch stage in the next cycle (whenjXXis in the memory stage). - When we react to a misprediction, we need to:
- Squash the mispredicted instructions (which are about to enter the decode and execute stages).
This requires setting the
bubble_Xsignals in the cycle before the corrected instruction is fetched. - Fetch the corrected instruction next cycle (e.g. with a MUX in front of the
pcsignal).
- Squash the mispredicted instructions (which are about to enter the decode and execute stages).
This requires setting the
-
You can fetch the corrected instruction with a MUX front of the
pcsignal:pc = [ mispredicted : oldValP ; ... 1: F_predPC ; ];You may need to pass the
conditionsMetsignal or something equivalent through a pipeline register to be able to tell when a misprediction happened at the appropriate time. -
You will need access to the
valPfrom thejXXinstruction. To do so, you will probably need to pass it through pipeline registers. -
Make sure you correctly handle interactions between
jXXandhalt. Consider code like:jne foo halt foo: rmmovq %rax, (%rax) rmmovq %rax, (%rax) rmmovq %rax, (%rax)When the
haltis executed,F_predPCmay contain the address of anrmmovqinstead ofhalt, so simply settingstall_Fmay not result in fetching anhaltnext cycle.Some solutions to this problem may involve using an alternate technique to prevent the PC from changing, like changing the
pc = [...]MUX.
pushq and popq
-
rmmovqormrmovqexcept you useREG_RSPnotrBand ±8 notvalC. Also has a writeback component forREG_RSP. -
popqupdates two registers, so it will need bothreg_dstEandreg_dstM. -
popqreads from the old%rspwhilepushqwrites to the new%rsp.
call and ret
|1|2|3|4| |5|6|7|8|
------|-|-|-|-|-----------|-|-|-|-|-
`ret` |F|D|E|M|(next PC available)|W| | | |
`???` | |F|F|F|(next PC needed) |F|D|E|M|W
-
call 0x1234ispush valP; jXX 0x1234. Combining the logic of push and unconditional jump should be sufficient. -
retis jump-to-the-read-value-when-popping. It always encounters the “ret-hazard”:You’ll have to stall the fetch stage as long as a
retis in decode, execute, or memory and forward the value fromW_valMto thepc.
Testing your code
-
You can run the command
make test-pipehw2to run your processor on almost all the files iny86/, comparing its output to references supplied intestdata/pipe-reference. The list of tested files is intestdata/pipehw2.txt. For the filespop-forward2.yo,pop-forward3.yo,pop-forward4.yo,load-store.yo, you should have the same values, but you may take fewer cycles. If you have an version ofhclrs.tarfrom before 21 March 2018, it may not take this case into account. -
For each input file in
y86/, there is a trace from our reference implementation intestdata/pipe-traces. -
Your code should have the same semantics as
tools/yis: set the same registers and memory. You can use this to see if your processor does the correct thing on any input files, including files you come up with yourself. -
We will check the number of cycles your processor takes. As a general rule, your pipelined processor will need
- 1 cycle per instruction executed
- 4 extra cycles because we have a five-stage pipeline; even
halttakes 5 cycles now. - +1 more cycle for each load-use hazard (i.e., read from memory in one cycle, use as src next cycle)
- +2 more cycles for each conditional jump the code should not take (the misprediction penalty)
- +3 more cycles for each
retexecuted
Specific Test Cases
jXX
y86/j-cc.yo
irmovq $1, %rsi
irmovq $2, %rdi
irmovq $4, %rbp
irmovq $-32, %rax
irmovq $64, %rdx
subq %rdx,%rax
je target
nop
halt
target:
addq %rsi,%rdx
nop
nop
nop
halt
takes 15 cycles and leaves
| RAX: ffffffffffffffa0 RCX: 0 RDX: 40 |
| RBX: 0 RSP: 0 RBP: 4 |
| RSI: 1 RDI: 2 R8: 0 |
A full trace is available in testdata/pipe-traces/j-cc.txt
y86/jxx.yo
irmovq $3, %rax
irmovq $-1, %rbx
a:
jmp b
c:
jge a
halt
b:
addq %rbx, %rax
jmp c
takes 25 cycles and leaves
| RAX: ffffffffffffffff RCX: 0 RDX: 0 |
| RBX: ffffffffffffffff RSP: 0 RBP: 0 |
A full trace is available in testdata/pipe-traces/jxx.txt (distributed with hclrs.tar)
pushq
y86/push.yo
irmovq $3, %rax
irmovq $256, %rsp
pushq %rax
takes 8 cycles and leaves
| RAX: 3 RCX: 0 RDX: 0 |
| RBX: 0 RSP: f8 RBP: 0 |
| used memory: _0 _1 _2 _3 _4 _5 _6 _7 _8 _9 _a _b _c _d _e _f |
| 0x000000f_: 03 00 00 00 00 00 00 00 |
A full trace is available as testdata/pipe-traces/push.txt
popq
y86/pop.yo
irmovq $4, %rsp
popq %rax
takes 7 cycles and leaves
| RAX: fb0000000000000 RCX: 0 RDX: 0 |
| RBX: 0 RSP: c RBP: 0 |
A full trace is available as testdata/pipe-traces/pop.txt
call
y86/call.yo
irmovq $256, %rsp
call a
addq %rsp, %rsp
a:
halt
takes 7 cycles and leaves
| RBX: 0 RSP: f8 RBP: 0 |
| used memory: _0 _1 _2 _3 _4 _5 _6 _7 _8 _9 _a _b _c _d _e _f |
| 0x000000f_: 13 00 00 00 00 00 00 00 |
A full trace is available as testdata/pipe-traces/call.txt
ret
y86/ret.yo
irmovq $256, %rsp
irmovq a, %rbx
rmmovq %rbx, (%rsp)
ret
halt
a:
irmovq $258, %rax
halt
takes 13 cycles and leaves
| RAX: 102 RCX: 0 RDX: 0 |
| RBX: 20 RSP: 108 RBP: 0 |
| used memory: _0 _1 _2 _3 _4 _5 _6 _7 _8 _9 _a _b _c _d _e _f |
| 0x0000010_: 20 00 00 00 00 00 00 00 |
A full trace is available as testdata/pipe-traces/ret.txt
(It is okay to diagree with this trace about what instruction is fetched and ignored while waiting for the ret, but you should take the same number of cycles and produce the same final results.)