CS 3330: HCL2D part 8: Final homework

This page does not represent the most current semester of this course; it is present merely as an archive.

In the previous HW you pipelined nop, halt, irmovq, and rrmovq. In lab you pipelined rmmovq and mrmovq (if you didn’t finish, we have an example solution). To finish your pipelined simulator, you need to combine those two and then add OPq, jXX, cmovXX, pushq, popq, call, and ret.

0.1 Approach

You may approach this however you wish, but I suggest the following flow:

Combine your pipehw1.hcl and pipelab2.hcl and test the combination.
Add OPq and condition codes and test.
Add cmovXX and test.
Add jXX with speculative execution and branch misprediction recovery. Predict that all branches are taken. Test.
Add pushq and test.
Add call and test.
Add popq and test.
Add ret with handling for the return hazard, and test.

0.1.1 Combine and Test

Every test that either source file passed should still pass with the combined file.

0.1.2 `OPq` and condition codes

Put the condition codes in their own register bank; you don’t want to bubble them if you bubble another register bank as part of stalling a stage.

Also check the condition codes in execute (based on the ifun there) and store the result of that comparison in the eM register bank.
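
As a rough illustration of the comparison done in execute, here is a small Python model (not HCL) of how ifun selects a test on the condition codes. The function name `condition_met` and the flag parameter names are made up for this sketch; the ifun numbering and flag formulas are the standard Y86-64 ones.

```python
# Y86-64 function codes shared by jXX and cmovXX.
ALWAYS, LE, L, E, NE, GE, G = range(7)

def condition_met(ifun: int, zf: bool, sf: bool, of: bool) -> bool:
    """Return True when the condition selected by ifun holds for the flags.

    Hypothetical helper for illustration; in the simulator this would be a
    mux in the execute stage whose result is stored in the eM register bank.
    """
    less = sf != of  # signed "less than" is SF xor OF
    table = {
        ALWAYS: True,
        LE: less or zf,
        L: less,
        E: zf,
        NE: not zf,
        GE: not less,
        G: (not less) and (not zf),
    }
    return table[ifun]
```

For example, after `andq %rax, %rax` with %rax holding 1, all three flags are clear, so `condition_met(G, False, False, False)` is true and `condition_met(L, False, False, False)` is false.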

0.1.3 `cmovXX` and condition codes

To implement cmovXX, all you need to do is change reg_dstE to be either the value you’d normally use for it or REG_NONE, depending on whether the condition codes satisfy the instruction’s condition. The easiest way to do that is probably by adding a mux in the execute stage, something like

e_dstE = [
    /* this is a cmovXX and the condition codes are not satisfied */ : REG_NONE;
    1 : E_dstE;
];
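
The same selection can be modeled in Python to check your understanding before writing the HCL. This is only a sketch: `select_dst_e` and `cond_ok` are names invented here, and it assumes the usual Y86-64 encodings (icode 2 for rrmovq/cmovXX, 0xF for no register).

```python
REG_NONE = 0xF  # Y86-64 encoding for "no register"
CMOVXX = 2      # icode shared by rrmovq and cmovXX

def select_dst_e(icode: int, cond_ok: bool, E_dstE: int) -> int:
    """Model of the execute-stage mux: suppress the write for a failed cmovXX."""
    if icode == CMOVXX and not cond_ok:
        return REG_NONE  # condition not satisfied: write nothing
    return E_dstE        # every other case keeps the destination from decode
```

Because rrmovq is the unconditional member of the same icode, its condition always holds and its destination passes through unchanged.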

0.1.4 `jXX`

This one is messy because we have branch prediction, speculative execution, and recovery from misprediction through stage bubbling. Let’s look at it through a set of questions:

What are we predicting?: That all jXX are taken (i.e., that the new PC is valC for all jXX).
How do we store that prediction?: In a predPC register inside the xF register bank (or pP or whatever else you called it).
How do we speculatively execute based on that prediction?: By setting the pc to the predicted PC (pc = F_predPC).
When would our prediction be wrong?: If the condition codes evaluate to false.
When would we learn that our prediction was wrong?: At the end of the jXX’s execute stage (which is when we check the condition codes).
We need to know at the beginning of fetch; where can we look?: The end of execute is the beginning of memory, so we can look in the M_... register bank outputs.
How should we react to misprediction?: We need to fetch the correct address (jXX’s valP) and bubble any stages that we should not have run.
Which stages get bubbled?: Since jXX is in memory, we keep that one (and writeback, which is a pre-jXX instruction); since we are fixing fetch we keep that one too; so we bubble just the decode and execute stages.
How do we fetch the correct address?: By using a mux to pick pc.
What are the mux cases?: mispredicted: oldValP; and 1: F_predPC.

	1	2	3		4	5	6	7	8
`jXX`	F	D	E	(available)	M	W
`wrong1`		F	D	(needed)
`wrong2`			F	(needed)
`right1`					F	D	E	M	W

Add a predicted PC to the xF register bank.
Predict all branches as taken: valP for a non-jump, valC for a jump.
Set the pc to the predicted PC unless we discover that we predicted a jump incorrectly.
- We predicted a jump wrong if (1) there is a jXX in memory and (2) the result of checking the condition codes stored in the eM register bank is 0.
- If we mispredicted, we need to fetch the old valP (the jXX’s valP) instead.
- We also need to bubble the mispredicted stages (decode and execute).
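
The recovery logic described above can be sketched as a Python model of one cycle’s control decisions. The names `fetch_control`, `M_cnd`, and `M_valP` are assumptions made for this sketch (use whatever you called the stored condition result and the jXX’s valP in your eM register bank); 7 is the standard Y86-64 icode for jXX.

```python
JXX = 7  # Y86-64 icode for jXX

def fetch_control(M_icode: int, M_cnd: bool, M_valP: int, F_predPC: int) -> dict:
    """Pick the fetch address and decide which stages to bubble this cycle."""
    # A misprediction is a jXX in memory whose condition turned out false.
    mispredicted = M_icode == JXX and not M_cnd
    return {
        # Fall back to the instruction after the jXX when the guess was wrong.
        "pc": M_valP if mispredicted else F_predPC,
        # The two instructions fetched after the jXX must not take effect.
        "bubble_decode": mispredicted,
        "bubble_execute": mispredicted,
    }
```

With a not-taken jXX in memory, `fetch_control(JXX, False, 0x1d, 0x14)` selects 0x1d and requests both bubbles; in every other case the predicted PC is used and nothing is bubbled.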

0.1.5 `pushq` and `popq`

These are like rmmovq and mrmovq, except that you use REG_RSP instead of rB and ±8 instead of valC. Both also have a writeback component for REG_RSP.

Note that popq updates two registers, so it will need both reg_dstE and reg_dstM.

Note that popq reads from the old %rsp while pushq writes to the new %rsp.
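
To make the old-versus-new %rsp distinction concrete, here is a minimal Python model of the architectural effect of each instruction. It is a simplification, not a pipeline model: memory is a dict of 8-byte values keyed by address, and the function names and the `regs`/`mem` layout are invented for this sketch.

```python
def pushq(regs: dict, mem: dict, src: str) -> None:
    """Store regs[src] at the NEW %rsp (old %rsp - 8), then update %rsp."""
    value = regs[src]          # read the source before %rsp changes
    new_rsp = regs["rsp"] - 8
    mem[new_rsp] = value       # memory write uses the new stack pointer
    regs["rsp"] = new_rsp      # writeback of %rsp

def popq(regs: dict, mem: dict, dst: str) -> None:
    """Load from the OLD %rsp, then write both %rsp and the destination."""
    old_rsp = regs["rsp"]
    value = mem.get(old_rsp, 0)  # memory read uses the old stack pointer
    regs["rsp"] = old_rsp + 8    # first register write (reg_dstE)
    regs[dst] = value            # second register write (reg_dstM)
```

Starting from `regs = {"rax": 3, "rsp": 256}`, calling `pushq(regs, mem, "rax")` leaves %rsp at 0xf8 with 3 stored there, which matches the pushq example later on this page.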

0.1.6 `call` and `ret`

call 0x1234 is equivalent to push valP; jmp 0x1234. Combining the logic of push and unconditional jump should be sufficient.

ret is jump-to-the-read-value-when-popping. It always encounters the ret-hazard:

	1	2	3	4		5	6	7	8	9
`ret`	F	D	E	M	(available)	W
`???`		F	F	F	(needed)	F	D	E	M	W

You’ll have to stall the fetch stage as long as a ret is in decode, execute, or memory and forward the value from W_valM to the pc.
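
A Python sketch of just the two decisions named above (when to stall fetch, and where the pc comes from) might look like the following. The function name `ret_control` and its argument list are assumptions for illustration; 9 is the standard Y86-64 icode for ret, and the sketch does not cover any other signals your design may need.

```python
RET = 9  # Y86-64 icode for ret

def ret_control(D_icode: int, E_icode: int, M_icode: int,
                W_icode: int, W_valM: int, F_predPC: int):
    """Return (stall_fetch, pc) for the current cycle."""
    # Keep fetch waiting while the return address is still on its way.
    stall_fetch = RET in (D_icode, E_icode, M_icode)
    # Once ret reaches writeback, the value it read from memory is the new pc.
    pc = W_valM if W_icode == RET else F_predPC
    return stall_fetch, pc
```

In the table above, cycles 2 through 4 are the ones where `stall_fetch` is true, and cycle 5 is the one where the pc is taken from W_valM.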

The number of cycles a program takes is the sum of the following:

1 cycle per instruction executed
4 extra cycles because we have a five-stage pipeline; even halt takes 5 cycles now.
+1 more cycle for each load-use hazard (i.e., read from memory in one cycle, use as src next cycle)
+2 more cycles for each conditional jump the code should not take (the misprediction penalty)
+3 more cycles for each ret executed
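
The accounting above can be written as a one-line formula, which is handy for sanity-checking your simulator’s cycle counts. This is a sketch with a made-up function name; it assumes the instruction count includes the halt that ends the program (the stated counts in the examples suggest that listings without an explicit halt still end by executing one).

```python
def expected_cycles(instructions: int, load_use: int = 0,
                    mispredicted_jumps: int = 0, rets: int = 0) -> int:
    """Expected cycle count for the five-stage pipeline described above."""
    pipeline_fill = 4  # extra cycles for the last instruction to drain
    return (instructions + pipeline_fill
            + 1 * load_use            # one stall cycle per load-use hazard
            + 2 * mispredicted_jumps  # two squashed fetches per misprediction
            + 3 * rets)               # three stalled fetches per ret
```

For instance, a program that executes 6 instructions including one ret gives `expected_cycles(6, rets=1)`, which is 13.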

0.2.3 Examples

0.2.3.1 `OPq`

y86/opq.yo

irmovq $7, %rdx
irmovq $3, %rcx
addq %rcx, %rbx
subq %rdx, %rcx
andq %rdx, %rbx
xorq %rcx, %rdx
andq %rdx, %rsi

takes 12 cycles and leaves

| RAX:                0   RCX: fffffffffffffffc   RDX: fffffffffffffffb |
| RBX:                3   RSP:                0   RBP:                0 |

A full trace is available as pipe-opq.txt

0.2.3.2 `cmovXX`

y86/cmovXX.yo

irmovq $2766, %rbx
irmovq    $1, %rax
andq    %rax, %rax
cmovg   %rbx, %rcx
cmovne  %rbx, %rdx
irmovq   $-1, %rax
andq    %rax, %rax
cmovl   %rbx, %rsp
cmovle  %rbx, %rbp
xorq    %rax, %rax
cmove   %rbx, %rsi
cmovge  %rbx, %rdi
irmovq $2989, %rbx
irmovq    $1, %rax
andq    %rax, %rax
cmovl   %rbx, %rcx
cmove   %rbx, %rdx
irmovq   $-1, %rax
andq    %rax, %rax
cmovge  %rbx, %rsp
cmovg   %rbx, %rbp
xorq    %rax, %rax
cmovl   %rbx, %rsi
cmovne  %rbx, %rdi
irmovq    $0, %rbx

takes 30 cycles and leaves 0xace in %rcx, %rdx, %rsp, %rbp, %rsi, and %rdi.

A full trace is available as pipe-cmovXX.txt

0.2.3.3 `jXX`

y86/jxx.yo

    irmovq $3, %rax
    irmovq $-1, %rbx
a:
    jmp b
c:
    jge a
    halt
b:
    addq %rbx, %rax
    jmp c

takes 25 cycles and leaves

| RAX: ffffffffffffffff   RCX:                0   RDX:                0 |
| RBX: ffffffffffffffff   RSP:                0   RBP:                0 |

A full trace is available as pipe-jxx.txt

0.2.3.4 `pushq`

y86/push.yo

irmovq $3, %rax
irmovq $256, %rsp
pushq %rax

takes 8 cycles and leaves

| RAX:                3   RCX:                0   RDX:                0 |
| RBX:                0   RSP:               f8   RBP:                0 |

| used memory:   _0 _1 _2 _3  _4 _5 _6 _7   _8 _9 _a _b  _c _d _e _f    |
|  0x000000f_:                              03 00 00 00  00 00 00 00    |

A full trace is available as pipe-push.txt

0.2.3.5 `popq`

y86/pop.yo

irmovq $4, %rsp
popq %rax

takes 7 cycles and leaves

| RAX:  fb0000000000000   RCX:                0   RDX:                0 |
| RBX:                0   RSP:                c   RBP:                0 |

A full trace is available as pipe-pop.txt

0.2.3.6 `call`

y86/call.yo

    irmovq $256, %rsp
    call a
    addq %rsp, %rsp
a:
    halt

takes 7 cycles and leaves

| RBX:                0   RSP:               f8   RBP:                0 |

| used memory:   _0 _1 _2 _3  _4 _5 _6 _7   _8 _9 _a _b  _c _d _e _f    |
|  0x000000f_:                              13 00 00 00  00 00 00 00    |

A full trace is available as pipe-call.txt

0.2.3.7 `ret`

y86/ret.yo

    irmovq $256, %rsp
    irmovq a, %rbx
    rmmovq %rbx, (%rsp)
    ret
    halt
a:
    irmovq $258, %rax
    halt

takes 13 cycles and leaves

| RAX:              102   RCX:                0   RDX:                0 |
| RBX:               20   RSP:              108   RBP:                0 |

| used memory:   _0 _1 _2 _3  _4 _5 _6 _7   _8 _9 _a _b  _c _d _e _f    |
|  0x0000010_:   20 00 00 00  00 00 00 00                               |

A full trace is available as pipe-ret.txt

0.3 Submit

Submit pipehw2.hcl on the submission page.