This page is for a prior offering of CS 3330. It is not up-to-date.
In the previous HW you pipelined nop, halt, irmovq, rrmovq, OPq and cmovXX. In lab you pipelined rmmovq and mrmovq (if you didn’t finish, we will post an example solution sometime during the day after the lab). To finish your pipelined simulator, you need to combine those two and then add jXX, pushq, popq, call, and ret.
You may approach this however you wish, but I suggest the following flow:
1. Combine pipehw1.hcl and pipelab2.hcl and test the combination. Every test that either source file passed ought to still pass with the combination.
2. Add jXX with speculative execution and branch misprediction recovery. Predict that all branches are taken. Test.
3. Add pushq and test.
4. Add call and test.
5. Add popq and test.
6. Add ret with handling for the return hazard, and test.
jXX
This one is messy because we have branch prediction, speculative execution, and recovery from misprediction through stage bubbling. Let's look at it through a set of questions:
- What do we predict? That all jXX are taken (i.e., that the new PC is valC for all jXX).
- Where do we store the prediction? In a predPC register inside the xF register bank (or pP or whatever else you called it).
- How do we act on the prediction? By setting the pc to the predicted PC (pc = F_predPC).
- When is the prediction wrong? If the condition codes evaluate to false.
- When do we find out? At the end of the jXX's execute stage (which is when we check the condition codes).
- Where can we see that result? End of execute = beginning of memory, so we can look in the M_... register bank outputs.
- What do we do to recover? We need to fetch the correct address (jXX's valP) and bubble any stages that we should not have run.
- Which stages do we bubble? Since jXX is in memory, we keep that one (and writeback, which holds a pre-jXX instruction); since we are fixing fetch we keep that one too; so we bubble just the decode and execute stages.
- How do we fetch the correct address? By using a mux to pick pc, with the cases mispredicted: oldValP and 1: F_predPC.
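The pc mux described above can be modeled in a few lines of Python. This is only a sketch of the selection logic, not HCL you can paste into your file: the names M_cond_ok and M_oldValP are made up here for "the condition result stored in the eM register bank" and "the jXX's valP carried down to memory", and bubbling is not modeled at all.

```python
JXX = 7  # Y86-64 icode shared by all jXX instructions


def mispredicted(M_icode: int, M_cond_ok: bool) -> bool:
    """True when a jXX is in memory and its condition check came out false."""
    return M_icode == JXX and not M_cond_ok


def select_pc(M_icode: int, M_cond_ok: bool, M_oldValP: int, F_predPC: int) -> int:
    """Model of the pc mux: the misprediction case first, the default case last."""
    if mispredicted(M_icode, M_cond_ok):
        return M_oldValP  # fall-through address saved by the jXX (hypothetical name)
    return F_predPC       # otherwise keep following the prediction
```

For example, select_pc(JXX, False, 0x1d, 0x40) picks the fall-through address 0x1d, while select_pc(JXX, True, 0x1d, 0x49) keeps following the predicted PC.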
| | 1 | 2 | 3 | | 4 | 5 | 6 | 7 | 8 |
|---|---|---|---|---|---|---|---|---|---|
| jXX | F | D | E | (available) | M | W | | | |
| wrong1 | | F | D | (needed) | | | | | |
| wrong2 | | | F | (needed) | | | | | |
| right1 | | | | | F | D | E | M | W |
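To summarize the jXX changes:
- Store a predicted PC in the xF register bank: valP for a non-jump, valC for a jump.
- Set pc to the predicted PC unless a misprediction has been detected.
- A misprediction is detected when (1) there is a jXX in Memory and (2) the result of checking the condition codes stored in the eM register bank is 0.
- After a misprediction, set pc to the jXX's valP.
pushq and popq
Like an rmmovq or mrmovq, except you use REG_RSP instead of rB and ±8 instead of valC. Each also has a writeback component for REG_RSP.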
Note that popq updates two registers, so it will need both reg_dstE and reg_dstM.
Note that popq reads from the old %rsp while pushq writes to the new %rsp.
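The two notes above are easy to get backwards, so here is a small Python model of what pushq and popq should do to registers and memory. It is a sketch of the architectural behavior only (no pipeline stages); the dict-of-registers layout and the helper names are assumptions made for this example. The checks use the values from the push.yo and pop.yo examples on this page, with the pop.yo memory image written out by hand from the Y86-64 encodings of its two instructions.

```python
MASK = (1 << 64) - 1  # registers are 64 bits wide


def pushq(regs: dict, mem: bytearray, rA: str) -> None:
    """pushq rA: store rA at the NEW %rsp, then write the new %rsp back."""
    value = regs[rA]                      # read rA before %rsp changes
    new_rsp = (regs["rsp"] - 8) & MASK
    mem[new_rsp:new_rsp + 8] = value.to_bytes(8, "little")
    regs["rsp"] = new_rsp                 # the reg_dstE = REG_RSP write


def popq(regs: dict, mem: bytearray, rA: str) -> None:
    """popq rA: load from the OLD %rsp, and update both %rsp and rA."""
    old_rsp = regs["rsp"]
    value = int.from_bytes(mem[old_rsp:old_rsp + 8], "little")
    regs["rsp"] = (old_rsp + 8) & MASK    # the reg_dstE = REG_RSP write
    regs[rA] = value                      # the reg_dstM = rA write


# push.yo: %rax = 3, %rsp = 256
regs = {"rax": 3, "rsp": 256}
mem = bytearray(512)
pushq(regs, mem, "rax")

# pop.yo: irmovq $4, %rsp ; popq %rax  (bytes hand-assembled for this sketch)
pop_regs = {"rax": 0, "rsp": 4}
pop_mem = bytearray(64)
pop_mem[0:12] = bytes.fromhex("30f40400000000000000b00f")
popq(pop_regs, pop_mem, "rax")
```

After these calls, regs["rsp"] is 0xf8 with the value 3 stored at 0xf8, and pop_regs holds %rsp = 0xc and %rax = 0xfb0000000000000, which is what the reference output for those two files shows.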
call and ret
call 0x1234 is push valP; jXX 0x1234. Combining the logic of push and unconditional jump should be sufficient.
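As a rough illustration of "push valP, then jump", the following Python sketch models call at the architectural level. The function name and the register dict are assumptions for this example; the return address 0x13 and the final %rsp of 0xf8 come from the call.yo example on this page, and the target address 0x15 is my own calculation from the instruction lengths in that listing.

```python
MASK = (1 << 64) - 1  # registers are 64 bits wide


def call(regs: dict, mem: bytearray, valC: int, valP: int) -> int:
    """call valC: push the return address valP, then continue at valC."""
    new_rsp = (regs["rsp"] - 8) & MASK
    mem[new_rsp:new_rsp + 8] = valP.to_bytes(8, "little")  # the push half
    regs["rsp"] = new_rsp
    return valC                                            # the jump half


# call.yo: %rsp = 256, the call sits at 0xa, so valP = 0x13
call_regs = {"rsp": 256}
call_mem = bytearray(512)
next_pc = call(call_regs, call_mem, valC=0x15, valP=0x13)
```

Because the jump is unconditional, there is no prediction to get wrong here.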
ret is jump-to-the-read-value-when-popping. It always encounters the ret-hazard:
| | 1 | 2 | 3 | 4 | | 5 | 6 | 7 | 8 | 9 |
|---|---|---|---|---|---|---|---|---|---|---|
| ret | F | D | E | M | (available) | W | | | | |
| ??? | | F | F | F | (needed) | F | D | E | M | W |
You’ll have to stall the fetch stage as long as a ret is in decode, execute, or memory and forward the value from W_valM to the pc.
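Here is a small Python model of that stall condition, in case it helps to see it spelled out. It is a sketch under a few assumptions: the stage icodes are passed in as plain integers, a NOP stands in for whatever bubble moves into decode while fetch is stalled, and the function names are invented for this example.

```python
RET, NOP = 9, 1  # Y86-64 icodes


def stall_fetch(D_icode: int, E_icode: int, M_icode: int) -> bool:
    """Fetch must wait while a ret is in decode, execute, or memory."""
    return RET in (D_icode, E_icode, M_icode)


def select_pc(W_icode: int, W_valM: int, F_predPC: int) -> int:
    """Once the ret reaches writeback, its popped value becomes the pc."""
    return W_valM if W_icode == RET else F_predPC


def count_stalled_fetches() -> int:
    """Walk a lone ret down the pipeline and count the wasted fetch cycles."""
    D, E, M, W = RET, NOP, NOP, NOP  # state in the cycle after ret was fetched
    stalled = 0
    while stall_fetch(D, E, M):
        stalled += 1
        D, E, M, W = NOP, D, E, M    # a bubble enters decode; ret moves on
    return stalled
```

Walking one ret through this model gives three stalled fetches, which matches the three repeated F entries in the diagram above.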
You can run the command make test-pipehw2 to run your processor on almost all the files in y86/, comparing its output to references supplied in testdata/pipe-reference. For each of these files, there is a trace from our reference implementation in testdata/pipe-traces.
Your code should have the same semantics as tools/yis: it should set the same registers and memory.
As a general rule, your pipelined processor will need:
- 1 cycle per instruction executed
- +4 extra cycles because we have a five-stage pipeline; even halt takes 5 cycles now
- +1 more cycle for each load-use hazard (i.e., read from memory in one cycle, use as src in the next cycle)
- +2 more cycles for each conditional jump the code should not take (the misprediction penalty)
- +3 more cycles for each ret executed
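That rule of thumb can be written as a one-line formula, which is handy for sanity-checking a cycle count before digging through a trace. The function below is just that arithmetic; its name and parameters are made up for this sketch, and the instruction counts in the examples are my own tally of the jxx.yo, call.yo, and ret.yo listings on this page (including the final halt).

```python
def expected_cycles(instructions: int, load_use: int = 0,
                    not_taken_jumps: int = 0, rets: int = 0) -> int:
    """Cycle estimate for the five-stage pipeline described above."""
    fill = 4  # extra cycles for the last instruction to drain the pipeline
    return instructions + fill + 1 * load_use + 2 * not_taken_jumps + 3 * rets


# jxx.yo: 19 instructions executed, one conditional jump that is not taken
jxx_cycles = expected_cycles(19, not_taken_jumps=1)
# call.yo: irmovq, call, halt
call_cycles = expected_cycles(3)
# ret.yo: 6 instructions executed, one ret
ret_cycles = expected_cycles(6, rets=1)
```

These come out to 25, 7, and 13 cycles, the same numbers the reference output reports for those three files.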
jXX
y86/jxx.yo
irmovq $3, %rax
irmovq $-1, %rbx
a:
jmp b
c:
jge a
halt
b:
addq %rbx, %rax
jmp c
takes 25 cycles and leaves
| RAX: ffffffffffffffff RCX: 0 RDX: 0 |
| RBX: ffffffffffffffff RSP: 0 RBP: 0 |
A full trace is available as pipe-jxx.txt
pushq
y86/push.yo
irmovq $3, %rax
irmovq $256, %rsp
pushq %rax
takes 8 cycles and leaves
| RAX: 3 RCX: 0 RDX: 0 |
| RBX: 0 RSP: f8 RBP: 0 |
| used memory: _0 _1 _2 _3 _4 _5 _6 _7 _8 _9 _a _b _c _d _e _f |
| 0x000000f_: 03 00 00 00 00 00 00 00 |
A full trace is available as pipe-push.txt
popq
y86/pop.yo
irmovq $4, %rsp
popq %rax
takes 7 cycles and leaves
| RAX: fb0000000000000 RCX: 0 RDX: 0 |
| RBX: 0 RSP: c RBP: 0 |
A full trace is available as pipe-pop.txt
call
y86/call.yo
irmovq $256, %rsp
call a
addq %rsp, %rsp
a:
halt
takes 7 cycles and leaves
| RBX: 0 RSP: f8 RBP: 0 |
| used memory: _0 _1 _2 _3 _4 _5 _6 _7 _8 _9 _a _b _c _d _e _f |
| 0x000000f_: 13 00 00 00 00 00 00 00 |
A full trace is available as pipe-call.txt
ret
y86/ret.yo
irmovq $256, %rsp
irmovq a, %rbx
rmmovq %rbx, (%rsp)
ret
halt
a:
irmovq $258, %rax
halt
takes 13 cycles and leaves
| RAX: 102 RCX: 0 RDX: 0 |
| RBX: 20 RSP: 108 RBP: 0 |
| used memory: _0 _1 _2 _3 _4 _5 _6 _7 _8 _9 _a _b _c _d _e _f |
| 0x0000010_: 20 00 00 00 00 00 00 00 |
A full trace is available as pipe-ret.txt.
(It is okay to disagree with this trace about which instruction is fetched and ignored while waiting for the ret, but you should take the same number of cycles and produce the same final results.)
You should pass all the tests run by make test-pipehw2, with the possible exception of poptest.yo. (This test has been removed from the lists of tests in the most recent version of hclrs.tar.)
In general, the same tests that should have worked on your single-cycle processor in seqhw should produce the correct results on your pipelined processor.
Our general advice for debugging this assignment:
-d output;
Submit pipehw2.hcl on the submission page.