Contents
Your task
- Combine your solution from the previous HW and the previous lab into a new file called
pipehw2.hclto create a five-stage pipelined processor with forwarding and branch prediction as described in the textbook that implements:nophaltirmovqrrmovqOPqcmovXXrmmovqmrmovq
We will provide an example lab solution on Collab (under the resources tab) sometime after the lab is due.
- Add the remaining instructions:
jXXpushqpopqcallret
-
Test your combined simulator with
make test-pipehw2 - Submit your solution to the submission site
Hints/Approach
General Approach
You may approach this however you wish, but I suggest the following flow:
- Combine your
pipehw1.hclandpipelab2.hcland test the combination. All of the tests that either source file passed previously ought to still pass the combination. - Add
jXXwith speculative execution and branch misprediction recovery. Predict that all branches are taken. Test. - Add
pushqand test. - Add
calland test. - Add
popqand test. - Add
retwith handling for the return hazard, and test.
Implementing jXX
| 1 | 2 | 3 | 4 | 5 | 6 | 7 | |||
|---|---|---|---|---|---|---|---|---|---|
jXX |
F | D | E | (next PC available) | M | W | |||
wrong1 |
F | D | (bubble needed) | ||||||
wrong2 |
F | (bubble needed) | |||||||
right1 |
F | D | E | M | W |
-
Replace
pcin thefF(orxForpPor whatever else you called it) register bank wtihpredPC, which will store a predicted PC value instead of the actual PC value.To speculatively use this prediction, we can set
pcto the predicted PC (pc = F_predPC). -
Your processor should predict that all
jXXs are taken (the new PC isvalC). -
We will detect that predictions are wrong near the end of the
jXX’s execute stage (when we check the condition codes). We will fetch the correct instruction during the fetch stage in the next cycle (whenjXXis in the memory stage). - When we react to a misprediction, we need to:
- Squash the mispredicted instructions (which are about to enter the decode and execute stages).
You can do this by setting the
bubble_Xsignals in the cycle before the corrected instruction is fetched in order to reset theX_*pipeline registers to their default values in the next cycle. - Fetch the corrected instruction next cycle (e.g. with a MUX in front of the
pcsignal).
- Squash the mispredicted instructions (which are about to enter the decode and execute stages).
You can do this by setting the
-
You can fetch the corrected instruction with a MUX front of the
pcsignal:pc = [ mispredicted : oldValP ; ... 1: F_predPC ; ];You may need to pass the
conditionsMetsignal or something equivalent through a pipeline register to be able to tell when a misprediction happened at the appropriate time. -
You will need access to the
valPfrom thejXXinstruction. To do so, you will probably need to pass it through pipeline registers. -
Make sure you correctly handle interactions between
jXXandhalt. Consider code like:jne foo halt foo: rmmovq %rax, (%rax) rmmovq %rax, (%rax) rmmovq %rax, (%rax)When the
haltis executed,F_predPCmay contain the address of anrmmovqinstead ofhalt: the halt instruction would be fetched because of themispredictedcase on thepcMUX we suggseted above, rather than being feteched becauseF_predPCpointed to it.This means that setting
stall_Fmay not be enough to fetch ahaltnext cycle: if we setstall_F, then during the next cycle,F_predPCwill be the same (not halt), but, if using a MUX forpclike we suggest above, the value ofmispredictedandoldValPin that MUX will be different.Some solutions to this problem may involve using an technique other than setting
stall_Fto prevent the PC from changing, like adding a MUX or cases to a MUX forf_predPCand/or other MUXes involved in computing the PC.For example, one might use code like
f_predPC = [ keep_same_instruction_we_fetched : pc ; ... ];to keep the same instruction in the fetch stage when
keep_same_instruction_we_fetchedis set to true rather than settingstall_F. - If instead of squashing the mispredicted instructions when they are about to enter the decode and execute stages (like suggested above), you squash them when they are about to enter the execute and memory stages, you will have to worry about preventing the conditions codes from being changed by one of the mispredicted instructions.
pushq and popq
-
rmmovqormrmovqexcept you useREG_RSPnotrBand ±8 notvalC. Also has a writeback component forREG_RSP. -
popqupdates two registers, so it will need bothreg_dstEandreg_dstM. -
popqreads from the old%rspwhilepushqwrites to the new%rsp.
call and ret
|1|2|3|4| |5|6|7|8|
------|-|-|-|-|-----------|-|-|-|-|-
`ret` |F|D|E|M|(next PC available)|W| | | |
`???` | |F|F|F|(next PC needed) |F|D|E|M|W
-
call 0x1234ispush valP; jXX 0x1234. Combining the logic of push and unconditional jump should be sufficient. -
retis jump-to-the-read-value-when-popping. It always encounters the “ret-hazard”:You should stop fetching new instructions and insert a pipeline “bubble” as long as a
retis in decode, execute, or memory and forward the value fromW_valMto thepc.
Testing your code
-
You can run the command
make test-pipehw2to run your processor on almost all the files iny86/, comparing its output to references supplied intestdata/pipe-reference. The list of tested files is intestdata/pipehw2.txt. For the filespop-forward2.yo,pop-forward3.yo,pop-forward4.yo,load-store.yo, you should have the same values, but you may take fewer cycles. -
If
make test-pipehw2produces a lot of output, you can do something likemake test-pipehw2 >test-output.txt 2>&1to have that output written to the filetest-output.txtinstead of the terminal. -
For each input file in
y86/, there is a trace from our reference implementation intestdata/pipe-traces. -
Your code should have the same semantics as
tools/yis: set the same registers and memory. You can use this to see if your processor does the correct thing on any input files, including files you come up with yourself. -
We will check the number of cycles your processor takes. As a general rule, your pipelined processor will need
- 1 cycle per instruction executed
- 4 extra cycles because we have a five-stage pipeline; even
halttakes 5 cycles now. - +1 more cycle for each load-use hazard (i.e., read from memory in one cycle, use with ALU next cycle)
- +2 more cycles for each conditional jump the code should not take (the misprediction penalty)
- +3 more cycles for each
retexecuted
Specific Test Cases
jXX
y86/j-cc.yo
irmovq $1, %rsi
irmovq $2, %rdi
irmovq $4, %rbp
irmovq $-32, %rax
irmovq $64, %rdx
subq %rdx,%rax
je target
nop
halt
target:
addq %rsi,%rdx
nop
nop
nop
halt
takes 15 cycles and leaves
| RAX: ffffffffffffffa0 RCX: 0 RDX: 40 |
| RBX: 0 RSP: 0 RBP: 4 |
| RSI: 1 RDI: 2 R8: 0 |
A full trace is available in testdata/pipe-traces/j-cc.txt
y86/jxx.yo
irmovq $3, %rax
irmovq $-1, %rbx
a:
jmp b
c:
jge a
halt
b:
addq %rbx, %rax
jmp c
takes 25 cycles and leaves
| RAX: ffffffffffffffff RCX: 0 RDX: 0 |
| RBX: ffffffffffffffff RSP: 0 RBP: 0 |
A full trace is available in testdata/pipe-traces/jxx.txt (distributed with hclrs.tar)
pushq
y86/push.yo
irmovq $3, %rax
irmovq $256, %rsp
pushq %rax
takes 8 cycles and leaves
| RAX: 3 RCX: 0 RDX: 0 |
| RBX: 0 RSP: f8 RBP: 0 |
| used memory: _0 _1 _2 _3 _4 _5 _6 _7 _8 _9 _a _b _c _d _e _f |
| 0x000000f_: 03 00 00 00 00 00 00 00 |
A full trace is available as testdata/pipe-traces/push.txt
popq
y86/pop.yo
irmovq $4, %rsp
popq %rax
takes 7 cycles and leaves
| RAX: fb0000000000000 RCX: 0 RDX: 0 |
| RBX: 0 RSP: c RBP: 0 |
A full trace is available as testdata/pipe-traces/pop.txt
call
y86/call.yo
irmovq $256, %rsp
call a
addq %rsp, %rsp
a:
halt
takes 7 cycles and leaves
| RBX: 0 RSP: f8 RBP: 0 |
| used memory: _0 _1 _2 _3 _4 _5 _6 _7 _8 _9 _a _b _c _d _e _f |
| 0x000000f_: 13 00 00 00 00 00 00 00 |
A full trace is available as testdata/pipe-traces/call.txt
ret
y86/ret.yo
irmovq $256, %rsp
irmovq a, %rbx
rmmovq %rbx, (%rsp)
ret
halt
a:
irmovq $258, %rax
halt
takes 13 cycles and leaves
| RAX: 102 RCX: 0 RDX: 0 |
| RBX: 20 RSP: 108 RBP: 0 |
| used memory: _0 _1 _2 _3 _4 _5 _6 _7 _8 _9 _a _b _c _d _e _f |
| 0x0000010_: 20 00 00 00 00 00 00 00 |
A full trace is available as testdata/pipe-traces/ret.txt
(It is okay to diagree with this trace about what instruction is fetched and ignored while waiting for the ret, but you should take the same number of cycles and produce the same final results.)