CS 3330: Lab 5: Y86 Lab 3

This page does not represent the most current semester of this course; it is present merely as an archive.

In this lab we'll add some basic pipelining to a subset of the Y86 instruction set. In particular, we'll deal only with nop, halt, irmovl, and rrmovl and add a pipeline register between decode and writeback. Download lab5_base.hcl to get a copy of the simulator with only those instructions implemented.

Approach

To add pipelining,

Identify where in your code the pipeline register should go
Identify which wires cross that point and put them in a pipeline register
Replace wires with register inputs and outputs
Check whether any stage needs to stall or have data forwarded to it

We'll explore this idea by adding a pipeline register after decode (the "E" register bank in the textbook).

register E {
    # todo: fill in the details here
}

Put it up at the top of the file (we'll need it to be defined before we first use any e_... wire).

What wires cross the E register bank?

Let's consider each wire in our HCL file. In the reference lab4 solution those are:

pc is output by our code before the pipeline register
i6bytes is returned by the instruction memory before the pipeline register and used before it too
icode is computed before the pipeline register but used on both sides
need_regs is computed and used before the pipeline register
need_immediate is computed and used before the pipeline register
rA is computed and used before the pipeline register
rB is computed and used before the pipeline register
valC is computed before the pipeline register but used after it
valP is computed and used before the pipeline register
srcA is output by our code before the pipeline register
rvalA is returned by the register file before the pipeline register but used after it
dstE is computed before the pipeline register but used after it
wvalE is output by our code after the pipeline register
Stat is output by our code after the pipeline register, but also used to stall the P register bank before the pipeline register
p_pc and stall_P are interfacing with the P register, which is before the pipeline register

Thus, we'll need copies of icode, valC, rvalA, dstE, and Stat in the E pipeline register.
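The classification above can also be checked mechanically. The following Python sketch (a model for reasoning, not HCL) records for each signal the side of E on which it gets its value and the sides on which it is needed. The table is one reading of the list above, and the names SIGNALS and crossing are made up for this illustration.

```python
# For each signal: (side where its value is determined, sides where it is needed).
# This table is one reading of the list in the text, not output from any tool.
SIGNALS = {
    "pc":             ("before", {"before"}),
    "i6bytes":        ("before", {"before"}),
    "icode":          ("before", {"before", "after"}),
    "need_regs":      ("before", {"before"}),
    "need_immediate": ("before", {"before"}),
    "rA":             ("before", {"before"}),
    "rB":             ("before", {"before"}),
    "valC":           ("before", {"after"}),
    "valP":           ("before", {"before"}),
    "srcA":           ("before", {"before"}),
    "rvalA":          ("before", {"after"}),
    "dstE":           ("before", {"after"}),
    "wvalE":          ("after",  {"after"}),
    # Stat is determined before E (as e_Stat) and output after it.
    "Stat":           ("before", {"before", "after"}),
    "p_pc":           ("before", {"before"}),
    "stall_P":        ("before", {"before"}),
}

def crossing(signals):
    """Return the signals that are set before E but needed after it."""
    return sorted(name for name, (made, used) in signals.items()
                  if made == "before" and "after" in used)

print(crossing(SIGNALS))  # ['Stat', 'dstE', 'icode', 'rvalA', 'valC']
```

Only signals that are set on one side and needed on the other end up in the register bank; everything else stays an ordinary wire.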

Always choose default values for the pipeline register that act as a nop. E takes one cycle to receive values from the earlier stages, so during the first cycle the later stages see only the defaults, and those must do nothing. In particular, that means putting icode:4 = NOP;, dstE:4 = REG_NONE;, and Stat:3 = STAT_AOK; in register E.

Replace wires with register inputs/outputs

Recall that whatever signal we put into x_thing will come out of X_thing on the next cycle. Thus, any signal that needs to cross register bank E will need to use e_... on the pre-E side and E_... on the post-E side.
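As a rough illustration of that timing, here is a small Python model of a single register field. It is a sketch only: the class name PipelineRegister and its attributes inp and out are invented for this example and stand in for x_thing and X_thing; the stall flag mirrors the idea behind stall_P.

```python
class PipelineRegister:
    """One field of a register bank: write `inp` now, read it from `out` next cycle."""

    def __init__(self, default):
        self.inp = default  # plays the role of x_thing
        self.out = default  # plays the role of X_thing

    def tick(self, stall=False):
        # At the end of a cycle the input is copied to the output,
        # unless the register is stalled, in which case it keeps its value.
        if not stall:
            self.out = self.inp

icode = PipelineRegister("NOP")
icode.inp = "IRMOVL"         # set during this cycle, before the register
print(icode.out)             # still NOP during the same cycle
icode.tick()
print(icode.out)             # IRMOVL on the next cycle
```

The default passed to the constructor is what the later stages see on the very first cycle, which is why it should be a value that does nothing.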

Go through each signal and, if it crosses E, replace every use before E with e_... and every use after E with E_....

For example, consider icode:

Remove the wire declaration for icode, since E now declares it (as icode:4).
In fetch and decode, replace all occurrences of icode with e_icode
In execute, memory, and writeback replace all occurrences of icode with E_icode

Do the same thing with valC.

The signals rvalA, dstE, and Stat have to be treated specially because they are inputs to or outputs from the simulator's built-in components (such as the register file), so we cannot simply rename them. Thus, rvalA (an output of the register file, produced during decode) will need to be saved into e_rvalA during decode and read as E_rvalA afterwards, as in

# in decode:
e_rvalA = rvalA;
# in execute and later phases, use E_rvalA instead of rvalA

Similarly, dstE will need to be computed as e_dstE before the pipeline register, with dstE = E_dstE placed in writeback to get that value back out. Stat is an output like dstE and will need the same treatment (set e_Stat before the pipeline register and Stat = E_Stat afterward).

At this point, the rrmovl.yo we made in lab2

irmovl $5678, %eax
irmovl $34, %ecx
rrmovl %eax, %edx
rrmovl %ecx, %eax

should take 6 (not 5) cycles to set three registers:

| EAX:       22   ECX:       22   EDX:     162e   EBX:        0 |

and should leave the pc at address 0x11. It should also take slightly fewer cycles overall than the 355 used by lab5_base.hcl as a result of increased throughput; if it does not, don't worry, since we aren't focusing on speed right now.

Look for stalls

Consider

irmovl $1, %eax
rrmovl %eax, %ebx

In a pipeline diagram (given that we have no execute or memory phases), these will look like

Instr     cycle 1   cycle 2   cycle 3
irmovl    FD        W
rrmovl              FD        W

Note that the immediate value is not written to the register file until the end of cycle 2, but the next instruction tries to read that register during cycle 2. This creates a data hazard (this lab's analogue of a load-use dependency): without a fix, rrmovl reads the old value of %eax.
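To see the problem concretely, the following Python sketch models a two-stage pipeline (fetch/decode, then writeback) with no stalling or forwarding. It is a simplified model written for this illustration, not the course simulator: instructions are tuples rather than encoded bytes, registers are named by strings, and the function run and the PROGRAM list are hypothetical names.

```python
REG_NONE = None  # stand-in for "no register" in this model

# Each instruction: (icode, srcA, dstE, valC)
PROGRAM = [
    ("irmovl", REG_NONE, "eax", 1),
    ("rrmovl", "eax", "ebx", 0),
    ("halt", REG_NONE, REG_NONE, 0),
]

def run(program):
    """Run a two-stage (FD, W) pipeline with no stalling or forwarding."""
    regs = {"eax": 0, "ecx": 0, "edx": 0, "ebx": 0}
    # Register bank E starts out holding a nop.
    E = {"icode": "nop", "valC": 0, "rvalA": 0, "dstE": REG_NONE}
    pc = 0
    cycles = 0
    while True:
        cycles += 1
        # Fetch/decode: read the register file as it stands during this cycle.
        icode, srcA, dstE, valC = program[pc]
        rvalA = regs[srcA] if srcA is not REG_NONE else 0
        e = {"icode": icode, "valC": valC, "rvalA": rvalA, "dstE": dstE}
        # Writeback: works on what E holds during this cycle.
        wvalE = E["valC"] if E["icode"] == "irmovl" else E["rvalA"]
        # End of cycle: the register file is written and E takes its inputs.
        if E["dstE"] is not REG_NONE:
            regs[E["dstE"]] = wvalE
        if E["icode"] == "halt":
            return cycles, regs
        E = e
        if icode != "halt":
            pc += 1

cycles, regs = run(PROGRAM)
print(cycles, regs["eax"], regs["ebx"])  # 4 1 0
```

In this model the program finishes in 4 cycles with %eax holding 1 but %ebx still 0, because rrmovl read %eax one cycle before irmovl wrote it.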

We can resolve this dependency in two ways. We can either stall, or we can forward data. Let's try both solutions for our edification.

Stall

Copy your work so far into lab5_stall.hcl

We want to stall the decode phase if it needs to read something a later phase intends to write. The register in front of decode is P, so we'll be setting stall_P. In addition to the logic we already have for that, we stall if the writeback phase's dstE is (1) not REG_NONE and (2) the same as the decode phase's srcA.
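The stall condition can be written out as a small predicate. This is a Python sketch of the logic only, not HCL for the simulator: the parameter other_stall is a hypothetical placeholder for whatever stall logic the file already has, E_dstE stands for the destination held by the writeback phase, and the value 0xF for REG_NONE is an assumption based on the usual Y86 register encoding.

```python
REG_NONE = 0xF  # assumed encoding of "no register"; the simulator's constant may differ

def stall_P(other_stall, E_dstE, srcA):
    """Return True when the P register should hold its value this cycle."""
    # Hazard: writeback is about to write the register decode wants to read.
    hazard = E_dstE != REG_NONE and E_dstE == srcA
    return other_stall or hazard

EAX, EBX = 0, 3  # Y86 register numbers
print(stall_P(False, EAX, EAX))            # True: rrmovl %eax,... right after a write to %eax
print(stall_P(False, EAX, EBX))            # False: different registers
print(stall_P(False, REG_NONE, REG_NONE))  # False: nothing is being written
```

Checking for REG_NONE first matters: an instruction that reads no register also has srcA equal to REG_NONE, and that should never count as a match.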

If correctly implemented,

irmovl $1, %eax
rrmovl %eax, %ebx

should take 5 cycles to put a 1 in both eax and ebx, while

irmovl $5678, %eax
irmovl $34, %ecx
rrmovl %eax, %edx
rrmovl %ecx, %eax

should still take 6 cycles and result in

| EAX:       22   ECX:       22   EDX:     162e   EBX:        0 |

like it did before.

Forward

Copy your pre-stall work into lab5_forward.hcl

We want to grab the value that is being prepared for writing to the register file before it actually gets written, if it is destined for the register we are trying to read. Thus, e_rvalA will be rvalA unless the writeback phase's dstE is (1) not REG_NONE and (2) the same as the decode phase's srcA; in that case, we'll forward wvalE into e_rvalA instead.
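The forwarding choice is a two-way selection, sketched here in Python rather than HCL. The function name forward_rvalA is made up for this example, E_dstE stands for the destination held by the writeback phase, and 0xF for REG_NONE is an assumption based on the usual Y86 register encoding.

```python
REG_NONE = 0xF  # assumed encoding of "no register"; the simulator's constant may differ

def forward_rvalA(rvalA, srcA, E_dstE, wvalE):
    """Pick the value to store into e_rvalA during decode."""
    if E_dstE != REG_NONE and E_dstE == srcA:
        # Writeback is about to write the register we are reading:
        # take the value on its way to the register file.
        return wvalE
    # Otherwise the register file already holds the right value.
    return rvalA

EAX, ECX = 0, 1  # Y86 register numbers
print(forward_rvalA(rvalA=0, srcA=EAX, E_dstE=EAX, wvalE=1))  # 1, forwarded
print(forward_rvalA(rvalA=7, srcA=ECX, E_dstE=EAX, wvalE=1))  # 7, from the register file
```

Unlike the stall version, nothing waits here, which is why the dependent pair of instructions needs one cycle fewer.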

If correctly implemented,

irmovl $1, %eax
rrmovl %eax, %ebx

should take 4 cycles to put a 1 in both eax and ebx, while

irmovl $5678, %eax
irmovl $34, %ecx
rrmovl %eax, %edx
rrmovl %ecx, %eax

should still take 6 cycles and result in

| EAX:       22   ECX:       22   EDX:     162e   EBX:        0 |

like it did before.

Submit

Submit two files, one named lab5_stall.hcl and one named lab5_forward.hcl on the submission page. You'll have to upload them one at a time…

If you didn't have time to finish everything, still submit both files (it's OK if they are incomplete; we are looking for effort more than correctness).

For your edification

If you want to understand pipelines more, I'd encourage you to add another pipeline register between Fetch and Decode. Don't submit that three-stage-pipeline file, though.