## last time

hazards in pipelines

hazard = extra work needed to make instruction run correctly in pipeline

data hazard = ...reading value with pending update

control hazard = ...can't compute next instruction to fetch

stalling (pause instructions until ready) to resolve hazards

forwarding: take pending value from later in pipeline; MUX to select versus value from register; compare register numbers to see if forwarding needed; can combine with stalling

branch prediction: guess what jump will do; if wrong, undo guess when actual outcome known
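The forwarding recap above (compare register numbers, MUX between a pending value and the register file) can be sketched in a few lines. This is only an illustration, not the hardware design from lecture: `select_operand` and the `pending` list are hypothetical names, and the MUX is modeled as a loop that picks the newest pending write to the same register.

```python
def select_operand(src_reg, regfile_value, pending):
    """Model the forwarding MUX for one source operand.

    pending: (dest_reg, value) pairs for instructions later in the pipeline,
    ordered newest first (execute, then memory, then writeback).
    """
    for dest_reg, value in pending:
        # compare register numbers to see if forwarding is needed
        if dest_reg == src_reg:
            return value          # take the pending value
    return regfile_value          # no match: use the value from the register file


# %r9 has pending updates in two stages; the newest one (listed first) wins
print(select_operand("r9", 10, [("r9", 42), ("r9", 7)]))
# %r8 has no pending update, so the register file value is used
print(select_operand("r8", 10, [("r9", 42)]))
```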

## anonymous feedback (1)

#### pipeline assignment — deadline move?

I know we didn't completely cover branch prediction

think assignment text is enough + no quiz due next Tuesday

past experience: assignment is quicker to do than typical assignment

#### pipeline assignment — how partial credit?

rubric categories checking for things like:

one instruction per stage per cycle

instructions pass through stages in order + never skip stages

identifies when misprediction occurs

instruction X fetched after instruction Y

correctly identifies data hazard requiring stalling

## on upcoming quiz

next quiz due Tuesday after Thanksgiving

will release tomorrow (so you can start early if you want)

```
cmpq %r8, %r9
jne LABEL              // not taken
xorq %r10, %r11
movq %r11, 0(%r12)
```

| cycle #            | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 |
|--------------------|---|---|---|---|---|---|---|---|---|
| cmpq %r8, %r9      | F | D | E | M | W |   |   |   |   |
| jne LABEL          |   | F | D | E | M | W |   |   |   |
| (do nothing)       |   |   | F | D | E | M | W |   |   |
| (do nothing)       |   |   |   | F | D | E | M | W |   |
| xorq %r10, %r11    |   |   |   |   | F | D | E | M | W |
| movq %r11, 0(%r12) |   |   |   |   |   | F | D | E | M |

cmpq's execute (cycle 2): compare sets flags

jne's execute (cycle 3): compute if jump goes to LABEL

xorq's fetch (cycle 4): use computed result

## making guesses

```
        cmpq %r8, %r9
        jne LABEL
        xorq %r10, %r11
        movq %r11, 0(%r12)
        ...
LABEL:  addq %r8, %r9
        imul %r13, %r14
```

speculate (guess): jne won't go to LABEL

right: 2 cycles faster!; wrong: undo guess before too late

## jXX: speculating right (1)

```
        cmpq %r8, %r9
        jne LABEL
        xorq %r10, %r11
        movq %r11, 0(%r12)
        ...
LABEL:  addq %r8, %r9
        imul %r13, %r14
```

| cycle #            | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 |
|--------------------|---|---|---|---|---|---|---|---|---|
| cmpq %r8, %r9      | F | D | E | M | W |   |   |   |   |
| jne LABEL          |   | F | D | E | M | W |   |   |   |
| xorq %r10, %r11    |   |   | F | D | E | M | W |   |   |
| movq %r11, 0(%r12) |   |   |   | F | D | E | M | W |   |

## jXX: speculating wrong

| cycle #              | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 |
|----------------------|---|---|---|---|---|---|---|---|---|
| cmpq %r8, %r9        | F | D | E | M | W |   |   |   |   |
| jne LABEL            |   | F | D | E | M | W |   |   |   |
| xorq %r10, %r11      |   |   | F | D |   |   |   |   |   |
| (inserted nop)       |   |   |   |   | E | M | W |   |   |
| movq %r11, 0(%r12)   |   |   |   | F |   |   |   |   |   |
| (inserted nop)       |   |   |   |   | D | E | M | W |   |
| LABEL: addq %r8, %r9 |   |   |   |   | F | D | E | M | W |
| imul %r13, %r14      |   |   |   |   |   | F | D | E | M |

xorq and movq: instruction "squashed" (replaced by inserted nop)

## "squashed" instructions

on misprediction need to undo partially executed instructions

mostly: remove from pipeline registers

more complicated pipelines: replace written values in cache/registers/etc.

## performance

#### hypothetical instruction mix

| kind          | portion | cycles (predict not-taken) | cycles (stall) |
|---------------|---------|----------------------------|----------------|
| taken jXX     | 3%      | 3                          | 3              |
| non-taken jXX | 5%      | 1                          | 3              |
| others        | 92%     | 1*                         | 1*             |

predict:
$$3 \times .03 + 1 \times .05 + 1 \times .92 = 1.06 \text{ cycles/instr.}$$

stall:
$$3 \times .03 + 3 \times .05 + 1 \times .92 = 1.16 \text{ cycles/instr.}$$

$$1.16 \div 1.06 \approx 1.09\text{x faster}$$
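The arithmetic above can be checked with a short script. This is a sketch under two assumptions: the second cycles column is the stall-until-resolved case, and the "1*" entries count as exactly 1 cycle. `MIX` and `cycles_per_instruction` are illustrative names.

```python
# (portion of instructions, cycles with predict-not-taken, cycles with stalling)
MIX = {
    "taken jXX": (0.03, 3, 3),
    "non-taken jXX": (0.05, 1, 3),
    "others": (0.92, 1, 1),  # the slide's "1*" is treated as exactly 1 here
}


def cycles_per_instruction(mix, column):
    """Weighted average of one cycles column over the instruction mix."""
    return sum(row[0] * row[column] for row in mix.values())


predict = cycles_per_instruction(MIX, 1)
stall = cycles_per_instruction(MIX, 2)
print(f"predict: {predict:.2f}  stall: {stall:.2f}  speedup: {stall / predict:.2f}x")
```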

#### exercise: predict+forward (1)

| cycle #            | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 |
|--------------------|---|---|---|---|---|---|---|---|---|
| addq %r8, %r9      | F | D | E | M | W |   |   |   |   |
| subq %r7, %r8      |   | F | D | E | M | W |   |   |   |
| jle foo (taken)    |   |   | F | D | E | M | W |   |   |
| ...                |   |   |   |   |   |   |   |   |   |
| foo: andq %r9, %r8 |   |   |   |   |   |   |   |   |   |

if jle is *correctly predicted*:

in andq, %r9 is \_\_\_\_\_ addq.

in andq, %r8 is \_\_\_\_\_ subq.

A: not forwarded from [assume read while writing requires forwarding]

B-D: forwarded to decode from {execute,memory,writeback} stage of



#### exercise: predict+forward (2)

| cycle #            | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 |
|--------------------|---|---|---|---|---|---|---|---|---|
| addq %r8, %r9      | F | D | E | M | W |   |   |   |   |
| subq %r7, %r8      |   | F | D | E | M | W |   |   |   |
| jle foo (taken)    |   |   | F | D | E | M | W |   |   |
| ...                |   |   |   |   |   |   |   |   |   |
| foo: andq %r9, %r8 |   |   |   |   |   |   |   |   |   |

if jle is *mispredicted* + resolved after jle's execute:

in andq, %r9 is \_\_\_\_\_ addq.

in andq, %r8 is \_\_\_\_\_ subq.

A: not forwarded from [assume read while writing requires forwarding]

B-D: forwarded to decode from {execute,memory,writeback} stage of



## other pipelines?

showed fetch / decode / execute / memory / writeback

very common early pipeline design

not only option!

## hazards versus dependencies

dependency: X needs result of instruction Y?

has potential for being messed up by pipeline (since part of X may run before Y finishes)

hazard: will it not work in some pipeline?

before extra work is done to "resolve" hazards

multiple kinds: so far, *data hazards*

## ex.: dependencies and hazards (1)

| opcode | source | destination |
|--------|--------|-------------|
| addq   | %rax   | %rbx        |
| subq   | %rax   | %rcx        |
| movq   | \$100  | %rcx        |
| addq   | %rcx   | %r10        |
| addq   | %rbx   | %r10        |
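One way to find the dependencies in the five-instruction example is to track the most recent writer of each register. The sketch below does that by hand-listing each instruction's reads and writes (an `addq`/`subq` reads both operands and writes the second; `movq $100, %rcx` only writes). `PROGRAM` and `raw_dependencies` are illustrative names, and only "X reads what Y wrote" dependencies are reported; whether each one is a hazard depends on the pipeline.

```python
# (instruction text, registers read, registers written)
PROGRAM = [
    ("addq %rax, %rbx", {"rax", "rbx"}, {"rbx"}),
    ("subq %rax, %rcx", {"rax", "rcx"}, {"rcx"}),
    ("movq $100, %rcx", set(), {"rcx"}),
    ("addq %rcx, %r10", {"rcx", "r10"}, {"r10"}),
    ("addq %rbx, %r10", {"rbx", "r10"}, {"r10"}),
]


def raw_dependencies(program):
    """Return (producer index, consumer index, register) triples."""
    last_writer = {}
    deps = []
    for i, (_text, reads, writes) in enumerate(program):
        for reg in sorted(reads):
            if reg in last_writer:
                deps.append((last_writer[reg], i, reg))
        for reg in writes:
            last_writer[reg] = i
    return deps


for producer, consumer, reg in raw_dependencies(PROGRAM):
    print(f"{PROGRAM[consumer][0]!r} needs %{reg} from {PROGRAM[producer][0]!r} "
          f"({consumer - producer} instruction(s) earlier)")
```

The distance printed at the end matters: a dependency on an instruction far enough back may not be a hazard at all in a given pipeline.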





## pipeline with different hazards

example: 4-stage pipeline: fetch/decode/execute+memory/writeback

| instruction     | 4 stage | 5 stage |
|-----------------|---------|---------|
| addq %rax, %r8  |         | W       |
| subq %rax, %r9  | W       | M       |
| xorq %rax, %r10 | EM      | E       |
| andq %r8, %r11  | D       | D       |

(stage each instruction is in during the cycle when andq is in decode)

addq/andq is hazard with 5-stage pipeline

addq/andq is **not** a hazard with 4-stage pipeline

more hazards with more pipeline stages
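The 4-stage versus 5-stage comparison above comes down to asking which stage an older instruction occupies while a younger one is in decode. A small sketch, assuming one instruction enters the pipeline per cycle and nothing stalls (`stage_of_older` and the stage lists are illustrative names):

```python
FIVE_STAGE = ["F", "D", "E", "M", "W"]
FOUR_STAGE = ["F", "D", "EM", "W"]


def stage_of_older(stages, distance, reader_stage="D"):
    """Stage of the instruction `distance` slots earlier while the later
    instruction is in `reader_stage`; None if it has already left the pipeline."""
    index = stages.index(reader_stage) + distance
    return stages[index] if index < len(stages) else None


# addq is 3 instructions before andq in the example
print(stage_of_older(FIVE_STAGE, 3))  # still in writeback: a hazard
print(stage_of_older(FOUR_STAGE, 3))  # already finished: not a hazard
```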

split execute into two stages: F/D/E1/E2/M/W

result only available near end of second execute stage

where do forwarding and stalls occur?

| cycle #              | 0 | 1 | 2  | 3  | 4 | 5 | 6 | 7 | 8 |
|----------------------|---|---|----|----|---|---|---|---|---|
| (1) addq %rcx, %r9   | F | D | E1 | E2 | M | W |   |   |   |
| (2) addq %r9, %rbx   |   |   |    |    |   |   |   |   |   |
| (3) addq %rax, %r9   |   |   |    |    |   |   |   |   |   |
| (4) movq %r9, (%rbx) |   |   |    |    |   |   |   |   |   |
| (5) movq %rcx, %r9   |   |   |    |    |   |   |   |   |   |

split execute into two stages: F/D/E1/E2/M/W

| cycle #                          | 0 | 1 | 2  | 3  | 4  | 5  | 6  | 7  | 8  | 9 | 10 |
|----------------------------------|---|---|----|----|----|----|----|----|----|---|----|
| (1) addq %rcx, %r9               | F | D | E1 | E2 | M  | W  |    |    |    |   |    |
| (2) addq %r9, %rbx (no stall)    |   | F | D  | E1 | E2 | M  | W  |    |    |   |    |
| (2) addq %r9, %rbx (stalled)     |   | F | D  | D  | E1 | E2 | M  | W  |    |   |    |
| (3) addq %rax, %r9 (no stall)    |   |   | F  | D  | E1 | E2 | M  | W  |    |   |    |
| (3) addq %rax, %r9 (delayed)     |   |   | F  | F  | D  | E1 | E2 | M  | W  |   |    |
| (4) movq %r9, (%rbx) (no stall)  |   |   |    | F  | D  | E1 | E2 | M  | W  |   |    |
| (4) movq %r9, (%rbx) (delayed)   |   |   |    |    | F  | D  | E1 | E2 | M  | W |    |
| (5) movq %rcx, %r9               |   |   |    |    |    | F  | D  | E1 | E2 | M | W  |

(2) without a stall: %r9 not available yet, can't forward here, so try stalling in addq's decode...

after stalling once, now we can forward
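The stalled schedule above can be reproduced with a small timing model. This is a sketch under stated assumptions, not the lecture's processor: a result is forwarded at the end of its producer's E2 cycle, instructions wait in decode until every source will be ready in the stage that needs it, and the data operand of the store `movq %r9, (%rbx)` is assumed to be needed only in M (which matches the table showing no extra stall for it). `PROGRAM`, `OFFSET`, and `schedule` are illustrative names.

```python
# cycles from an instruction's last decode cycle to each later stage
OFFSET = {"E1": 1, "E2": 2, "M": 3, "W": 4}

# (text, [(source register, stage that first needs the value)], destination)
PROGRAM = [
    ("addq %rcx, %r9", [("rcx", "E1"), ("r9", "E1")], "r9"),
    ("addq %r9, %rbx", [("r9", "E1"), ("rbx", "E1")], "rbx"),
    ("addq %rax, %r9", [("rax", "E1"), ("r9", "E1")], "r9"),
    ("movq %r9, (%rbx)", [("r9", "M"), ("rbx", "E1")], None),  # store data assumed needed in M
    ("movq %rcx, %r9", [("rcx", "E1")], "r9"),
]


def schedule(program):
    """Return (first fetch cycle, list of stages, one per cycle) per instruction."""
    e2_cycle = {}                    # register -> cycle its latest producer is in E2
    prev_first_d = prev_last_d = 0   # makes the first instruction fetch in cycle 0
    rows = []
    for _text, sources, dest in program:
        fetch = prev_first_d         # fetch once the previous instruction enters decode
        first_d = prev_last_d + 1    # decode once the previous instruction leaves it
        last_d = first_d
        for reg, stage in sources:
            if reg in e2_cycle:
                # the consuming stage must come after the producer's E2 cycle
                last_d = max(last_d, e2_cycle[reg] + 1 - OFFSET[stage])
        stages = ["F"] * (first_d - fetch) + ["D"] * (last_d - first_d + 1)
        stages += ["E1", "E2", "M", "W"]
        rows.append((fetch, stages))
        if dest is not None:
            e2_cycle[dest] = last_d + OFFSET["E2"]
        prev_first_d, prev_last_d = first_d, last_d
    return rows


for (text, _, _), (fetch, stages) in zip(PROGRAM, schedule(PROGRAM)):
    print(f"{text:18}", "   " * fetch + " ".join(f"{s:2}" for s in stages))
```

Changing the store's data operand to be needed in `"E1"` instead of `"M"` adds a stall for that instruction, which is one way to see how much the answer depends on where each value is consumed.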

## static branch prediction

```
forward (target > PC) not taken; backward taken
```

intuition: loops:

```
LOOP: ...
      je LOOP
```

```
LOOP: ...
      jne SKIP_LOOP
      ...
      jmp LOOP
SKIP_LOOP:
```
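The forward-not-taken, backward-taken rule is simple enough to write down directly. The sketch below uses made-up addresses and a made-up branch history; `predict_taken` and `count_mispredictions` are illustrative names.

```python
def predict_taken(pc, target):
    """Forward (target > pc) predicts not taken; backward predicts taken."""
    return target <= pc


def count_mispredictions(pc, target, outcomes):
    """outcomes: actual taken/not-taken results for one branch, in order."""
    guess = predict_taken(pc, target)  # a static guess never changes
    return sum(1 for taken in outcomes if taken != guess)


# hypothetical loop-closing branch at 0x40 jumping back to 0x10:
# taken three times, then falls through once when the loop exits
print(count_mispredictions(0x40, 0x10, [True, True, True, False]))
```

For a backward loop branch like this one, the only misprediction is the final fall-through.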

## exercise: static prediction

```
.global foo
foo:
    xor %eax, %eax          // eax <- 0
foo_loop_top:
    test $0x1, %edi
    je foo_loop_bottom      // if ((edi & 1) == 0) goto foo_loop_bottom
    add %edi, %eax
foo_loop_bottom:
    dec %edi                // edi = edi - 1
    jg foo_loop_top         // if (edi > 0) goto foo_loop_top
    ret
```

suppose %edi = 3 (initially)

and using forward-not-taken, backwards-taken strategy: how many mispredictions for je? for jg?

## backup slides





how to (in hardware) connect A and B?

one wire carrying binary signals?










how to (in hardware) connect A and B?



example: cable Internet

(how is topic for ECE class)







how to connect?










## shared bus, really?

common for parts of internals of computers (topic later)

model for wifi radio "channel" kinda similar to shared wire

how the early versions of Ethernet worked

"vampire taps" physically attached to shared cable

## shared bus, messages for who?



messages need a 'header' to tell who it's to/from

everyone needs to filter out messages that aren't theirs
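A minimal sketch of that filtering: every machine on the shared bus sees every message and keeps only those whose header names it as the destination. The `Message` fields and `deliver` are hypothetical, not the Ethernet frame layout.

```python
from dataclasses import dataclass


@dataclass
class Message:
    to: str        # header: destination address
    sender: str    # header: source address
    payload: str


def deliver(bus_traffic, my_address):
    """Keep only the payloads addressed to this machine."""
    return [m.payload for m in bus_traffic if m.to == my_address]


traffic = [
    Message(to="B", sender="A", payload="hello B"),
    Message(to="C", sender="A", payload="hello C"),
    Message(to="B", sender="C", payload="hi again"),
]
print(deliver(traffic, "B"))
```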

Figure 6-1: Data Link Layer Frame Format

Figure from Digital, Intel, and Xerox, "The Ethernet: A Local Area Network: Data Link Layer and Physical Layer Specification", Version 2.0 (1982)

## taking turns on shared bus?

token ring: one machine has a 'token' = can send; send special message to pass token to another machine

free-for-all: collision detection + retry; detect if you're transmitting when someone else is; wait (usually randomized amount of time) and retry

coordinating machine transmits timeslots; part of common cellphone design (TDMA: time division multiple access)

make bus support multiple transmitters? requires understanding how interference works; another part of common cell phone design
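For the free-for-all approach, the randomized wait can be sketched as below. The slides only say the wait is usually randomized; doubling the window after each collision (as Ethernet-style binary exponential backoff does) and the cap of 10 doublings are assumptions added here, and `retry_delay` is an illustrative name.

```python
import random


def retry_delay(attempt, rng=random):
    """Number of time slots to wait after the attempt-th collision in a row.

    The window doubles with each collision (capped), so machines that keep
    colliding spread their retries over a wider range.
    """
    return rng.randrange(2 ** min(attempt, 10))


for attempt in range(1, 5):
    print(f"collision {attempt}: wait {retry_delay(attempt)} slot(s)")
```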



### what does the hub do?

simple version:

imitate shared bus: copy messages to everyone else; something to handle two messages sent at once

less simple:

read "header" on message + send to destination only; requires some way to figure out destinations; queue of messages waiting to be sent
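The "less simple" version can be sketched as follows. How destinations are figured out is not specified above; this sketch assumes the device learns them from the sender address of messages it has already seen, and falls back to copying to everyone else when the destination is unknown. `Switch` and its fields are hypothetical names.

```python
from collections import deque


class Switch:
    def __init__(self, num_ports):
        self.port_for = {}                                  # address -> port it was seen on
        self.queues = [deque() for _ in range(num_ports)]   # messages waiting to be sent

    def receive(self, in_port, src, dst, payload):
        self.port_for[src] = in_port                        # assumed: learn from sender
        if dst in self.port_for:
            out_ports = [self.port_for[dst]]                # send to destination only
        else:
            # unknown destination: imitate the shared bus
            out_ports = [p for p in range(len(self.queues)) if p != in_port]
        for port in out_ports:
            self.queues[port].append((src, dst, payload))


switch = Switch(3)
switch.receive(0, "A", "B", "hi")    # B not seen yet: copied to ports 1 and 2
switch.receive(1, "B", "A", "yo")    # A was seen on port 0: sent there only
print([list(q) for q in switch.queues])
```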



### more complicated designs

hierarchies

networks of networks "internetworks"

so far still have single points of failure



#### individual computers are networks

individual computers are (kinda) networks of ... processors, memories, I/O devices

so what topology (layout) do those networks have?

#### the "bus"



## example: 80386 signal pins

| name      | purpose                | category |
|-----------|------------------------|----------|
| CLK2      | clock for bus          | timing   |
| W/R#      | write or read?         | metadata |
| D/C#      | data or control?       | metadata |
| M/IO#     | memory or I/O?         | metadata |
| INTR      | interrupt request      | metadata |
|           | other metadata signals | metadata |
| BE0#-BE3# | (4) byte enable        | address  |
| A2-A31    | (30) address bits      | address  |
| D0-D31    | (32) data signals      | data     |

## example: AMD EPYC (1 socket)



Fig. 21. Single-socket AMD EPYC<sup>TM</sup> system (SP3).

Figure from Burd et al,

## example: Intel Skylake-SP



#### extra trips to CPU






#### DMA

"place data at 0xABCD"











