pipeline

simple CPU

running instructions

Human pipeline: laundry

Waste (1)

Waste (2)

Latency — Time for One

Throughput — Rate of Many

adding stages (one way)

running some instructions

why registers?

  • example: fetch/decode
  • need to store current instruction somewhere … while fetching next one

exercise: throughput/latency (1)

  • suppose cycle time is 500 ps
  • exercise: latency of one instruction?
    A. 100 ps B. 500 ps C. 2000 ps D. 2500 ps E. something else
  • exercise: throughput overall?
    A. 1 instr/100 ps B. 1 instr/500 ps C. 1 instr/2000 ps D. 1 instr/2500 ps
    E. something else

exercise: throughput/latency (2)

  • double number of pipeline stages (to 10) + decrease cycle time from 500 ps to 250 ps — throughput?
    A. 1 instr/100 ps B. 1 instr/250 ps C. 1 instr/1000 ps D. 1 instr/5000 ps
    E. something else
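
The arithmetic behind these exercises can be sketched in a few lines, assuming the 5-stage pipeline used elsewhere in these slides and no stalls (the function name and model here are illustrative, not from the slides):

```python
# Idealized pipeline model: latency is stages x cycle time; with the
# pipeline full, one instruction completes every cycle.
def pipeline_metrics(n_stages, cycle_ps):
    latency_ps = n_stages * cycle_ps   # one instruction traverses every stage
    throughput = 1 / cycle_ps          # instructions completed per picosecond
    return latency_ps, throughput

five_stage = pipeline_metrics(5, 500)    # (2500 ps, 1/500 instr/ps)
ten_stage  = pipeline_metrics(10, 250)   # (2500 ps, 1/250 instr/ps)
```

Note that doubling the stage count while halving the cycle time leaves the latency of one instruction unchanged; only throughput improves.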

diminishing returns: register delays

diminishing returns: uneven split

  • Can we split up some logic (e.g. adder) arbitrarily?
  • Probably not…
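
One way to see why uneven splits hurt: the cycle time is set by the slowest stage plus pipeline-register overhead. A rough sketch with made-up delays (the numbers and register delay below are illustrative, not real hardware):

```python
# Cycle time = delay of the slowest stage + pipeline-register overhead.
REGISTER_DELAY_PS = 20   # illustrative register setup/propagation cost

def cycle_time_ps(stage_delays_ps):
    return max(stage_delays_ps) + REGISTER_DELAY_PS

even   = cycle_time_ps([100, 100, 100, 100, 100])   # 120 ps
uneven = cycle_time_ps([60, 60, 200, 90, 90])       # 220 ps: slow stage dominates
```

An unsplittable 200 ps chunk of logic caps the clock no matter how finely the other stages are divided.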

a data hazard

addq %r8, %r9   // R8 + R9 -> R9
addq %r9, %r8   // R9 + R8 -> R8
addq ...
addq ...

data hazard

addq %r8, %r9  // (1)
addq %r9, %r8  // (2)
step#  pipeline implementation  ISA specification
1      read r8, r9 for (1)      read r8, r9 for (1)
2      read r9, r8 for (2)      write r9 for (1)
3      write r9 for (1)         read r9, r8 for (2)
4      write r8 for (2)         write r8 for (2)
  • pipeline reads older value
  • instead of value ISA says was just written
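
The timing mismatch above can be modeled in a few lines. This is a toy sketch (register values, cycle numbering, and names are illustrative): reads happen in decode (cycle i+1 for instruction i), writes land in writeback (cycle i+4), and there is no forwarding.

```python
# Toy no-forwarding pipeline: reads in decode, writes in writeback.
regs = {'r8': 800, 'r9': 900}
pending = []   # (writeback_cycle, dest, value)

def run_no_forwarding(prog):
    # prog: list of (src1, src2, dst); each instruction does dst = src1 + src2
    for i, (a, b, dst) in enumerate(prog):
        decode_cycle = i + 1
        # apply only writebacks that finished before this instruction's decode
        for entry in [p for p in pending if p[0] <= decode_cycle]:
            regs[entry[1]] = entry[2]
            pending.remove(entry)
        pending.append((i + 4, dst, regs[a] + regs[b]))
    for _, dst, value in pending:      # drain remaining writebacks
        regs[dst] = value

run_no_forwarding([('r8', 'r9', 'r9'),    # addq %r8, %r9
                   ('r9', 'r8', 'r8')])   # addq %r9, %r8
# pipeline result: r8 == 1700 (second addq read stale r9 = 900);
# the ISA says r8 should be 1700 + 800 = 2500
```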

data hazard compiler solution

addq %r8, %r9
nop
nop
addq %r9, %r8
  • one solution: change the ISA

    • all addqs take effect three instructions later
      (assuming can read register value while it is being written back)
  • make it compiler’s job

  • problem: recompile every time the processor changes?

stalling/nop pipeline diagram (1)

stalling/nop pipeline diagram (2)

data hazard hardware solution

addq %r8, %r9
// hardware inserts: nop
// hardware inserts: nop
addq %r9, %r8
  • what if the hardware adds the nops itself?

  • called stalling

  • extra logic:

    • sometimes don’t change PC
    • sometimes put do-nothing values in pipeline registers

control hazard

0x00: cmpq %r8, %r9
0x08: je   0xFFFF
0x10: addq %r10, %r11

jXX: stalling?

cmpq %r8, %r9 
jne LABEL // not taken 
xorq %r10, %r11 
movq %r11, 0(%r12) 
...

making guesses

        cmpq %r8, %r9
        jne LABEL
        xorq %r10, %r11
        movq %r11, 0(%r12)
        ...

LABEL:  addq %r8, %r9
        imul %r13, %r14
        ...
  • speculate (guess): jne won’t go to LABEL
  • right: 2 cycles faster!; wrong: undo the guess before it's too late

jXX: speculating right (1)

        cmpq %r8, %r9
        jne LABEL
        xorq %r10, %r11
        movq %r11, 0(%r12)
        ...

LABEL:  addq %r8, %r9
        imul %r13, %r14
        ...

jXX: speculating wrong

‘‘squashed’’ instructions

  • on misprediction need to undo partially executed instructions
  • mostly: remove from pipeline registers
  • more complicated pipelines: replace written values in cache/registers/etc.

opportunity

// initially %r8 = 800,
//           %r9 = 900, etc.
0x0: addq %r8, %r9 
0x2: addq %r9, %r8
...

exploiting the opportunity

opportunity 2

// initially %r8 = 800,
//           %r9 = 900, etc.
0x0: addq %r8, %r9 
0x2: nop
0x3: addq %r9, %r8
...

exploiting the opportunity

exercise: forwarding paths

  • in subq, %r8 is _____ addq.

  • in xorq, %r9 is _____ addq.

  • in andq, %r9 is _____ addq.

  • in andq, %r9 is _____ xorq.

    • assume regfile cannot read value while being written
    • A: not forwarded from
    • B-D: forwarded to decode from {execute,memory,writeback} stage of

unsolved problem

  • value from memory stage too late for forwarding alone

  • combine stalling and forwarding to resolve hazard

  • assumption in diagram: hazard detected in subq’s decode stage

    • (since easier than detecting it in fetch stage)
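
A sketch of the decision the hazard logic makes, assuming a 5-stage pipeline where ALU results forward from execute but a load's value only exists at the end of the memory stage (stage numbering and names here are illustrative):

```python
# Decide how many stall cycles a consumer in decode needs.
def stalls_needed(producer_is_load, distance):
    # distance: how many instructions later the consumer appears (1 = next)
    if producer_is_load and distance == 1:
        return 1   # load result not ready until memory: stall once, then forward
    return 0       # ALU results forward from execute with no stall

stalls_needed(True, 1)   # load-use: 1 stall plus forwarding
stalls_needed(False, 1)  # ALU producer: forwarding alone suffices
```

With one instruction between the load and its use, the memory-stage value can be forwarded in time and no stall is needed.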

solvable problem

why can’t we…

clock cycle needs to be long enough
to go through data cache AND
to go through math circuits!
(which we were trying to avoid by putting them in separate stages)

throughput exercise

exercise

  • suppose 5-stage pipeline
    • using forwarding, stalling like we discussed
  • 1 ns cycle time
  • 5% are branches
    • 60% correct predictions
  • 1% are uses of value just after loading it from memory
  • assume negligible cache misses
  • estimated throughput?

exercise solution

  • 97%: 1 cycle (until next useful instruction fetched)
  • 40% of 5% = 2%: 3 cycles
    • 2 “wasted” fetches for misprediction
  • 1%: 2 cycles
    • 1 “wasted” fetch for data hazard stall
  • average cycles/instruction = \(1 \times 97\% + 3 \times 2\% + 2 \times 1\% = 1.05\)
  • 1 billion cycles / second \(\div\) 1.05 cycles/instruction = 0.952 billion instr/sec
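
The same arithmetic as a quick check (illustrative Python, using the fractions given in the exercise):

```python
# Average CPI from the mix of instruction costs, then throughput at 1 ns/cycle.
frac_branch_miss = 0.05 * 0.40   # 5% branches, 40% of those mispredicted
frac_load_use    = 0.01          # load immediately followed by a use
frac_normal      = 1 - frac_branch_miss - frac_load_use

cpi = 1 * frac_normal + 3 * frac_branch_miss + 2 * frac_load_use
# cpi ~= 1.05
throughput = 1e9 / cpi           # 1 ns cycle -> 1e9 cycles/second
# ~0.952e9 instructions per second
```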

hazards versus dependencies

  • dependency — X needs result of instruction Y?

    • has potential for being messed up by pipeline
    • (since part of X may run before Y finishes)
  • hazard — would it work incorrectly in some pipeline?

    • before extra work is done to ‘‘resolve’’ hazards
    • multiple kinds: so far, data hazards, control hazards

ex.: dependencies and hazards (1)

pipeline with different hazards

  • example: 4-stage pipeline: fetch/decode/execute+memory/writeback
                     // 4 stage  // 5 stage
    addq %rax, %r8   //          // W
    subq %rax, %r9   // W        // M
    xorq %rax, %r10  // EM       // E
    andq %r8,  %r11  // D        // D
    
  • (assuming register file does not read while writing)
  • addq/andq is a hazard with the 5-stage pipeline
  • addq/andq is not a hazard with the 4-stage pipeline
  • more hazards with more pipeline stages
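
The distance check behind that example can be sketched as follows, assuming reads happen in decode (the second stage), writes happen in the last stage, and the register file cannot read a value while it is being written (names and cycle numbering are illustrative):

```python
# Hazard check: reader's decode must come strictly after the writer's
# writeback, or the stale value is read.
def is_hazard(n_stages, distance):
    # distance: how many instructions after the writer the reader appears
    read_cycle  = distance + 2   # reader's decode, relative to writer's fetch
    write_cycle = n_stages       # writer's writeback, relative to its fetch
    return read_cycle <= write_cycle

is_hazard(5, 3)   # addq/andq in the 5-stage pipeline: hazard
is_hazard(4, 3)   # addq/andq in the 4-stage pipeline: no hazard
```

Longer pipelines make `write_cycle` later, so more distances qualify as hazards.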

exercise: different pipeline

  • split execute into two stages: F/D/E1/E2/M/W
  • result only available near end of second execute stage
  • where do forwarding and stalls occur?

exercise: different pipeline

  • split execute into two stages: F/D/E1/E2/M/W

Backup slides

exercise: control hazard timing+forwarding?

  • with F/D/E/M/W: what is fetched when? what is forwarded?

[solution]: control hazard timing+forwarding?

  • with F/D/E/M/W: what is fetched when? what is forwarded?

exercise: with different pipeline

  • with F/D/E1/E2/M/W

[solution]: with different pipeline

  • with F/D/E1/E2/M/W

exercise: forwarding paths (2)

  • in subq, %r8 is _____ addq.

  • in subq, %r9 is _____ addq.

  • in andq, %r9 is _____ subq.

  • in andq, %r9 is _____ addq.

    • A: not forwarded from
    • B-D: forwarded to decode from {execute,memory,writeback} stage of

exercise: predict+forward (1)

  • if jle is correctly predicted:

    • in andq, %r9 is ______ addq.

    • in andq, %r8 is ______ subq.

    • A: not forwarded from [assume read while writing requires forwarding]

    • B-D: forwarded to decode from {execute,memory,writeback} stage of

exercise: predict+forward (2)

  • if jle is mispredicted + resolved after jle’s execute:

    • in andq, %r9 is _____ addq.

    • in andq, %r9 is _____ subq.

    • A: not forwarded from [assume read while writing requires forwarding]

    • B-D: forwarded to decode from {execute,memory,writeback} stage of

diminishing returns: register delays

importance of prediction

  • 5-stage pipeline, predict not-taken versus always stall

* — ignoring data hazards

prediction and OOO

  • deeper pipeline — much higher misprediction penalty