pipeline

simple CPU

running instructions

Human pipeline: laundry

Waste (1)

Waste (2)

Latency — Time for One

Throughput — Rate of Many

adding stages (one way)

running some instructions

why registers?

  • example: fetch/decode
  • need to store current instruction somewhere … while fetching next one

exercise: throughput/latency (1)

  • suppose cycle time is 500 ps
  • exercise: latency of one instruction?
    A. 100 ps B. 500 ps C. 2000 ps D. 2500 ps E. something else
  • exercise: throughput overall?
    A. 1 instr/100 ps B. 1 instr/500 ps C. 1 instr/2000 ps D. 1 instr/2500 ps
    E. something else

exercise: throughput/latency (2)

  • double number of pipeline stages (to 10) + decrease cycle time from 500 ps to 250 ps — throughput?
    A. 1 instr/100 ps B. 1 instr/250 ps C. 1 instr/1000 ps D. 1 instr/5000 ps
    E. something else
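
The arithmetic behind these exercises can be sketched in a few lines, assuming the 5-stage pipeline used elsewhere in these slides and no stalls (the function name and model here are illustrative, not from the slides):

```python
# Idealized pipeline model: latency is stages x cycle time; with the
# pipeline full, one instruction completes every cycle.
def pipeline_metrics(n_stages, cycle_ps):
    latency_ps = n_stages * cycle_ps   # one instruction traverses every stage
    throughput = 1 / cycle_ps          # instructions completed per picosecond
    return latency_ps, throughput

five_stage = pipeline_metrics(5, 500)    # (2500 ps, 1/500 instr/ps)
ten_stage  = pipeline_metrics(10, 250)   # (2500 ps, 1/250 instr/ps)
```

Note that doubling the stage count while halving the cycle time leaves the latency of one instruction unchanged; only throughput improves.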

diminishing returns: register delays

diminishing returns: uneven split

  • Can we split up some logic (e.g. adder) arbitrarily?
  • Probably not…
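
One way to see why uneven splits hurt: the cycle time is set by the slowest stage plus pipeline-register overhead. A rough sketch with made-up delays (the numbers and register delay below are illustrative, not real hardware):

```python
# Cycle time = delay of the slowest stage + pipeline-register overhead.
REGISTER_DELAY_PS = 20   # illustrative register setup/propagation cost

def cycle_time_ps(stage_delays_ps):
    return max(stage_delays_ps) + REGISTER_DELAY_PS

even   = cycle_time_ps([100, 100, 100, 100, 100])   # 120 ps
uneven = cycle_time_ps([60, 60, 200, 90, 90])       # 220 ps: slow stage dominates
```

An unsplittable 200 ps chunk of logic caps the clock no matter how finely the other stages are divided.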

a data hazard

addq %r8, %r9   // R8 + R9 -> R9
addq %r9, %r8   // R9 + R8 -> R8
addq ...
addq ...

data hazard

addq %r8, %r9  // (1)
addq %r9, %r8  // (2)
step#  pipeline implementation  ISA specification
1      read r8, r9 for (1)      read r8, r9 for (1)
2      read r9, r8 for (2)      write r9 for (1)
3      write r9 for (1)         read r9, r8 for (2)
4      write r8 for (2)         write r8 for (2)
  • pipeline reads older value
  • instead of value ISA says was just written
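
The timing mismatch above can be modeled in a few lines. This is a toy sketch (register values, cycle numbering, and names are illustrative): reads happen in decode (cycle i+1 for instruction i), writes land in writeback (cycle i+4), and there is no forwarding.

```python
# Toy no-forwarding pipeline: reads in decode, writes in writeback.
regs = {'r8': 800, 'r9': 900}
pending = []   # (writeback_cycle, dest, value)

def run_no_forwarding(prog):
    # prog: list of (src1, src2, dst); each instruction does dst = src1 + src2
    for i, (a, b, dst) in enumerate(prog):
        decode_cycle = i + 1
        # apply only writebacks that finished before this instruction's decode
        for entry in [p for p in pending if p[0] <= decode_cycle]:
            regs[entry[1]] = entry[2]
            pending.remove(entry)
        pending.append((i + 4, dst, regs[a] + regs[b]))
    for _, dst, value in pending:      # drain remaining writebacks
        regs[dst] = value

run_no_forwarding([('r8', 'r9', 'r9'),    # addq %r8, %r9
                   ('r9', 'r8', 'r8')])   # addq %r9, %r8
# pipeline result: r8 == 1700 (second addq read stale r9 = 900);
# the ISA says r8 should be 1700 + 800 = 2500
```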

data hazard compiler solution

addq %r8, %r9
nop
nop
addq %r9, %r8
  • one solution: change the ISA

    • all addqs take effect three instructions later
      (assuming can read register value while it is being written back)
  • make it compiler’s job

  • problem: recompile every time the processor changes?

stalling/nop pipeline diagram (1)

stalling/nop pipeline diagram (2)

data hazard hardware solution

addq %r8, %r9
// hardware inserts: nop
// hardware inserts: nop
addq %r9, %r8
  • what if the hardware adds the nops itself?

  • called stalling

  • extra logic:

    • sometimes don’t change PC
    • sometimes put do-nothing values in pipeline registers

control hazard

0x00: cmpq %r8, %r9
0x08: je   0xFFFF
0x10: addq %r10, %r11

jXX: stalling?

cmpq %r8, %r9 
jne LABEL // not taken 
xorq %r10, %r11 
movq %r11, 0(%r12) 
...

making guesses

        cmpq %r8, %r9
        jne LABEL
        xorq %r10, %r11
        movq %r11, 0(%r12)
        ...

LABEL:  addq %r8, %r9
        imul %r13, %r14
        ...
  • speculate (guess): jne won’t go to LABEL
  • right: 2 cycles faster!; wrong: undo the guess before it's too late

jXX: speculating right (1)

        cmpq %r8, %r9
        jne LABEL
        xorq %r10, %r11
        movq %r11, 0(%r12)
        ...

LABEL:  addq %r8, %r9
        imul %r13, %r14
        ...

jXX: speculating wrong

‘‘squashed’’ instructions

  • on misprediction need to undo partially executed instructions
  • mostly: remove from pipeline registers
  • more complicated pipelines: replace written values in cache/registers/etc.

opportunity

// initially %r8 = 800,
//           %r9 = 900, etc.
0x0: addq %r8, %r9 
0x2: addq %r9, %r8
...

exploiting the opportunity

opportunity 2

// initially %r8 = 800,
//           %r9 = 900, etc.
0x0: addq %r8, %r9 
0x2: nop
0x3: addq %r9, %r8
...

exploiting the opportunity

exercise: forwarding paths

  • in subq, %r8 is _____ addq.

  • in xorq, %r9 is _____ addq.

  • in andq, %r9 is _____ addq.

  • in andq, %r9 is _____ xorq.

    • assume regfile cannot read value while being written
    • A: not forwarded from
    • B-D: forwarded to decode from {execute,memory,writeback} stage of

unsolved problem

  • value from memory stage too late for forwarding alone

  • combine stalling and forwarding to resolve hazard

  • assumption in diagram: hazard detected in subq’s decode stage

    • (since easier than detecting it in fetch stage)
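
A sketch of the decision the hazard logic makes, assuming a 5-stage pipeline where ALU results forward from execute but a load's value only exists at the end of the memory stage (stage numbering and names here are illustrative):

```python
# Decide how many stall cycles a consumer in decode needs.
def stalls_needed(producer_is_load, distance):
    # distance: how many instructions later the consumer appears (1 = next)
    if producer_is_load and distance == 1:
        return 1   # load result not ready until memory: stall once, then forward
    return 0       # ALU results forward from execute with no stall

stalls_needed(True, 1)   # load-use: 1 stall plus forwarding
stalls_needed(False, 1)  # ALU producer: forwarding alone suffices
```

With one instruction between the load and its use, the memory-stage value can be forwarded in time and no stall is needed.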

solvable problem

why can’t we…

clock cycle needs to be long enough
to go through data cache AND
to go through math circuits!
(which we were trying to avoid by putting them in separate stages)

throughput exercise

exercise

  • suppose 5-stage pipeline
    • using forwarding, stalling like we discussed
  • 1 ns cycle time
  • 5% are branches
    • 60% correct predictions
  • 1% are uses of value just after loading it from memory
  • assume negligible cache misses
  • estimated throughput?

exercise solution

  • 97%: 1 cycle (until next useful instruction fetched)
  • 40% of 5% = 2%: 3 cycles
    • 2 “wasted” fetches for misprediction
  • 1%: 2 cycles
    • 1 “wasted” fetch for data hazard stall
  • average cycles/instruction = \(1 \times 97\% + 3 \times 2\% + 2 \times 1\% = 1.05\)
  • 1 billion cycles / second \(\div\) 1.05 cycles/instruction = 0.952 billion instr/sec
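
The same arithmetic as a quick check (illustrative Python, using the fractions given in the exercise):

```python
# Average CPI from the mix of instruction costs, then throughput at 1 ns/cycle.
frac_branch_miss = 0.05 * 0.40   # 5% branches, 40% of those mispredicted
frac_load_use    = 0.01          # load immediately followed by a use
frac_normal      = 1 - frac_branch_miss - frac_load_use

cpi = 1 * frac_normal + 3 * frac_branch_miss + 2 * frac_load_use
# cpi ~= 1.05
throughput = 1e9 / cpi           # 1 ns cycle -> 1e9 cycles/second
# ~0.952e9 instructions per second
```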

hazards versus dependencies

  • dependency — X needs result of instruction Y?

    • has potential for being messed up by pipeline
    • (since part of X may run before Y finishes)
  • hazard — would it work incorrectly in some pipeline?

    • before extra work is done to ‘‘resolve’’ hazards
    • multiple kinds: so far, data hazards, control hazards

ex.: dependencies and hazards (1)

pipeline with different hazards

  • example: 4-stage pipeline: fetch/decode/execute+memory/writeback
                     // 4 stage  // 5 stage
    addq %rax, %r8   //          // W
    subq %rax, %r9   // W        // M
    xorq %rax, %r10  // EM       // E
    andq %r8,  %r11  // D        // D
    
  • (assuming register file does not read while writing)
  • addq/andq is a hazard with the 5-stage pipeline
  • addq/andq is not a hazard with the 4-stage pipeline
  • more hazards with more pipeline stages
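
The distance check behind that example can be sketched as follows, assuming reads happen in decode (the second stage), writes happen in the last stage, and the register file cannot read a value while it is being written (names and cycle numbering are illustrative):

```python
# Hazard check: reader's decode must come strictly after the writer's
# writeback, or the stale value is read.
def is_hazard(n_stages, distance):
    # distance: how many instructions after the writer the reader appears
    read_cycle  = distance + 2   # reader's decode, relative to writer's fetch
    write_cycle = n_stages       # writer's writeback, relative to its fetch
    return read_cycle <= write_cycle

is_hazard(5, 3)   # addq/andq in the 5-stage pipeline: hazard
is_hazard(4, 3)   # addq/andq in the 4-stage pipeline: no hazard
```

Longer pipelines make `write_cycle` later, so more distances qualify as hazards.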

exercise: different pipeline

  • split execute into two stages: F/D/E1/E2/M/W
  • result only available near end of second execute stage
  • where do forwarding and stalls occur?

exercise: different pipeline

  • split execute into two stages: F/D/E1/E2/M/W

Backup slides

exercise: control hazard timing+forwarding?

  • with F/D/E/M/W: what is fetched when? what is forwarded?

[solution]: control hazard timing+forwarding?

  • with F/D/E/M/W: what is fetched when? what is forwarded?

exercise: with different pipeline

  • with F/D/E1/E2/M/W

[solution]: with different pipeline

  • with F/D/E1/E2/M/W

exercise: forwarding paths (2)

  • in subq, %r8 is _____ addq.

  • in subq, %r9 is _____ addq.

  • in andq, %r9 is _____ subq.

  • in andq, %r9 is _____ addq.

    • A: not forwarded from
    • B-D: forwarded to decode from {execute,memory,writeback} stage of

exercise: predict+forward (1)

  • if jle is correctly predicted:

    • in andq, %r9 is ______ addq.

    • in andq, %r8 is ______ subq.

    • A: not forwarded from [assume read while writing requires forwarding]

    • B-D: forwarded to decode from {execute,memory,writeback} stage of

exercise: predict+forward (2)

  • if jle is mispredicted + resolved after jle’s execute:

    • in andq, %r9 is _____ addq.

    • in andq, %r9 is _____ subq.

    • A: not forwarded from [assume read while writing requires forwarding]

    • B-D: forwarded to decode from {execute,memory,writeback} stage of

diminishing returns: register delays

importance of prediction

  • 5-stage pipeline, predict not-taken versus always stall

* — ignoring data hazards

prediction and OOO

  • deeper pipeline — much higher misprediction penalty