Your task

Submit your solutions to Problem Set-2 as a scanned PDF document on Gradescope. Problem Set-2 may be found in Collab under the resources folder. Alternatively, you may also get a physical copy of the assignment from me during lecture or office hours (Tu/Th 12:30pm at Rice 312).
Combine your solution from the previous HW and the previous lab into a new file called pipehw2.hcl to create a five-stage pipelined processor with forwarding and branch prediction as described in the textbook that implements:
- nop
- halt
- irmovq
- rrmovq
- OPq
- cmovXX
- rmmovq
- mrmovq
We will provide an example lab solution
Add the jXX instruction (and make it predict all jumps as taken).
Test your combined simulator with make test-pipehw2
Submit your solution to kytos

Hints/Approach

General Approach

You may approach this however you wish, but I suggest the following flow:

Combine your pipehw1.hcl and pipelab2.hcl and test the combination. All of the tests that either source file passed previously ought to still pass the combination.
Add jXX with speculative execution and branch misprediction recovery. Predict that all branches are taken. Test.

Implementing `jXX`

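| | 1 | 2 | 3 | (end of cycle 3) | 4 | 5 | 6 | 7 | 8 |
|---|---|---|---|---|---|---|---|---|---|
| `jXX` | F | D | E | (next PC available) | M | W | | | |
| `wrong1` | | F | D | (bubble needed) | | | | | |
| `wrong2` | | | F | (bubble needed) | | | | | |
| `right1` | | | | | F | D | E | M | W |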
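
The timing in the diagram above can be reproduced with a small Python model. This is only an illustrative sketch, not part of the assignment's HCL: the function `timeline` and its parameters are made up for this example, and it assumes each instruction advances one stage per cycle unless it has been replaced by a bubble.

```python
STAGES = "FDEMW"  # fetch, decode, execute, memory, writeback

def timeline(fetch_cycle, squashed_at=None, total_cycles=8):
    """Return the stage occupied in cycles 1..total_cycles ('-' if none).

    squashed_at is the first cycle in which the instruction has been
    replaced by a bubble (hypothetical parameter for this sketch).
    """
    row = []
    for cycle in range(1, total_cycles + 1):
        idx = cycle - fetch_cycle
        alive = squashed_at is None or cycle < squashed_at
        if 0 <= idx < len(STAGES) and alive:
            row.append(STAGES[idx])
        else:
            row.append("-")
    return row

# A mispredicted jXX fetched in cycle 1: two wrong-path instructions are
# fetched in cycles 2 and 3, then bubbled; the right path starts in cycle 4.
rows = {
    "jXX":    timeline(1),
    "wrong1": timeline(2, squashed_at=4),
    "wrong2": timeline(3, squashed_at=4),
    "right1": timeline(4),
}
for name, row in rows.items():
    print(f"{name:7s}", " ".join(row))
```

Note that `right1` is fetched in the same cycle in which the `jXX` is in its memory stage, which is where the two-cycle misprediction penalty comes from.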

Replace pc in the fF (or xF or pP or whatever else you called it) register bank with predPC, which will store a predicted PC value instead of the actual PC value.

To speculatively use this prediction, we can set pc to the predicted PC (pc = F_predPC).
Your processor should predict that all jXXs are taken (the new PC is valC).
We will detect that predictions are wrong near the end of the jXX’s execute stage (when we check the condition codes). We will fetch the correct instruction during the fetch stage in the next cycle (when jXX is in the memory stage).
When we react to a misprediction, we need to:
- Squash the mispredicted instructions (which are about to enter the decode and execute stages). This can be done by setting the bubble_X signals in the cycle before the corrected instruction is fetched. (Setting the bubble_X signal will make the X_* pipeline registers output their default values in the next cycle instead of using their input values.)
- Fetch the corrected instruction next cycle (e.g. with a MUX in front of the pc signal).
You can fetch the corrected instruction with a MUX in front of the pc signal:
```
pc = [
    mispredicted : oldValP ;
    ...
    1: F_predPC ;
];
```
You may need to pass the conditionsMet signal or something equivalent through a pipeline register to be able to tell when a misprediction happened at the appropriate time.
You will need access to the valP from the jXX instruction. To do so, you will probably need to pass it through pipeline registers.
Make sure you correctly handle interactions between jXX and halt. Consider code like:
```
     jne foo
     halt
foo: rmmovq %rax, (%rax)
     rmmovq %rax, (%rax)
     rmmovq %rax, (%rax)
```
When the halt is executed, F_predPC may contain the address of an rmmovq instead of halt, so simply setting stall_F may not be enough to fetch a halt next cycle.

Some solutions to this problem may involve using a technique other than setting stall_F to prevent the PC from changing, like adding a case to the pc = [...] MUX.
If, instead of squashing the mispredicted instructions when they are about to enter the decode and execute stages (as suggested above), you squash them when they are about to enter the execute and memory stages, you will have to worry about preventing the condition codes from being changed by one of the mispredicted instructions.
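
The prediction and recovery decisions described above can be summarized in a small Python model. This is a sketch of the logic only, not HCL and not a reference solution: the names `e_cond_met`, `M_cond_met`, and `M_valP` are hypothetical stand-ins for whatever signals and pipeline registers you choose, and it assumes you squash the instructions about to enter decode and execute.

```python
JXX = 7  # Y86-64 icode for the jXX family

def predict_pc(icode, valC, valP):
    """Predict taken: a jXX is assumed to go to valC; otherwise fall through."""
    return valC if icode == JXX else valP

def squash_signals(E_icode, e_cond_met):
    """While a jXX is in execute and its condition fails, bubble the two
    younger (wrong-path) instructions before they enter decode and execute."""
    mispredicted = (E_icode == JXX) and not e_cond_met
    return {"bubble_D": mispredicted, "bubble_E": mispredicted}

def select_pc(M_icode, M_cond_met, M_valP, F_predPC):
    """One cycle later (jXX now in memory), fetch from the fall-through
    address that was carried along with the jXX; otherwise use the prediction."""
    if M_icode == JXX and not M_cond_met:
        return M_valP
    return F_predPC

# Example with made-up addresses: a jXX at 0x32 with target 0x40 and valP 0x3b.
print(hex(predict_pc(JXX, 0x40, 0x3b)))        # predicted target
print(squash_signals(JXX, e_cond_met=False))   # both bubbles set
print(hex(select_pc(JXX, False, 0x3b, 0x42)))  # corrected fetch address
```

The split into two functions mirrors the timing: the bubbles are decided while the `jXX` is in execute, while the corrected PC is selected in the following cycle, which is why the condition result and valP have to travel through a pipeline register.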

Testing your code

You can run the command make test-pipehw2 to run your processor on almost all the files in y86/, comparing its output to references supplied in testdata/pipe-reference. The list of tested files is in testdata/pipehw2.txt. For the files pop-forward2.yo, pop-forward3.yo, pop-forward4.yo, load-store.yo, you should have the same values, but you may take fewer cycles.
For each input file in y86/, there is a trace from our reference implementation in testdata/pipe-traces.
Your code should have the same semantics as tools/yis: set the same registers and memory. You can use this to see if your processor does the correct thing on any input files, including files you come up with yourself.
We will check the number of cycles your processor takes. As a general rule, your pipelined processor will need
- 1 cycle per instruction executed
- 4 extra cycles because we have a five-stage pipeline; even halt takes 5 cycles now.
- +1 more cycle for each load-use hazard (i.e., read from memory in one cycle, use with ALU next cycle)
- +2 more cycles for each conditional jump the code should not take (the misprediction penalty)
- +3 more cycles for each ret executed
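
The rule of thumb above is simple arithmetic, so it can be written as a short Python helper for checking your own cycle counts. The function name and parameter names are made up for this sketch, and it assumes the penalties listed above are the only sources of extra cycles.

```python
def expected_cycles(instructions, load_use_hazards=0, mispredicted_jumps=0, rets=0):
    """Estimate total cycles for a five-stage pipeline using the rule of thumb."""
    return (instructions            # 1 cycle per instruction executed
            + 4                     # pipeline fill: five stages
            + 1 * load_use_hazards  # one stall cycle per load-use hazard
            + 2 * mispredicted_jumps  # two squashed instructions per misprediction
            + 3 * rets)             # three extra cycles per ret

print(expected_cycles(1))                         # a lone halt: 5
print(expected_cycles(9, mispredicted_jumps=1))   # 9 instructions, one not-taken jump: 15
```

Count only the instructions that actually execute, including the final halt, and count a conditional jump as mispredicted only when it is not taken.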

Specific Test Cases

`jXX`

y86/j-cc.yo

              irmovq $1, %rsi
              irmovq $2, %rdi
              irmovq $4, %rbp
              irmovq $-32, %rax
              irmovq $64, %rdx
              subq %rdx,%rax
              je target
              nop
              halt
target:
              addq %rsi,%rdx
              nop
              nop
              nop
              halt

    
  

takes 15 cycles and leaves

| RAX: ffffffffffffffa0   RCX:                0   RDX:               40 |
| RBX:                0   RSP:                0   RBP:                4 |
| RSI:                1   RDI:                2   R8:                 0 |

A full trace is available in testdata/pipe-traces/j-cc.txt

y86/jxx.yo

    irmovq $3, %rax
    irmovq $-1, %rbx
a:
    jmp b
c:
    jge a
    halt
b:
    addq %rbx, %rax
    jmp c

    
  

takes 25 cycles and leaves

| RAX: ffffffffffffffff   RCX:                0   RDX:                0 |
| RBX: ffffffffffffffff   RSP:                0   RBP:                0 |

A full trace is available in testdata/pipe-traces/jxx.txt (distributed with hclrs.tar)
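
To see where the 25 cycles come from, the program above can be stepped through by hand or with a few lines of Python. This sketch models only the architectural behavior of this one program (it is not a pipeline simulator), and it assumes that `jge` is taken exactly when the result of the preceding `addq` is non-negative, which holds here because no signed overflow occurs.

```python
MASK64 = (1 << 64) - 1

def simulate_jxx():
    rax, rbx = 3, -1
    executed = 2              # the two irmovq instructions
    mispredicted = 0
    while True:
        executed += 1         # a: jmp b (unconditional, so predict-taken is right)
        rax += rbx            # b: addq %rbx, %rax
        executed += 1
        executed += 1         #    jmp c
        executed += 1         # c: jge a
        if rax >= 0:
            continue          # taken, as predicted: no penalty
        mispredicted += 1     # not taken: two wrong-path instructions squashed
        break
    executed += 1             # halt
    cycles = executed + 4 + 2 * mispredicted
    return rax & MASK64, executed, mispredicted, cycles

rax, executed, mispredicted, cycles = simulate_jxx()
print(hex(rax), executed, mispredicted, cycles)
```

The loop body runs four times (with `%rax` going 2, 1, 0, -1), giving 19 executed instructions and a single misprediction on the final `jge`, so the total is 19 + 4 + 2 = 25 cycles.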
