bpred

branch prediction

  • guess where jumps, etc. go
  • very important for out-of-order processor performance
    • takes many cycles before misprediction detected
    • takes some time to undo misprediction
    • lots of instructions per cycle “missed” on misprediction
  • modern laptop/desktop CPUs devote a lot of space to branch prediction

static branch prediction

  • forward (target > PC) not taken; backward taken
  • intuition: loops:
LOOP: ...
      ...
      je LOOP

LOOP: ...
      jne SKIP_LOOP
      ...
      jmp LOOP
SKIP_LOOP:
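This heuristic is trivial to implement in hardware; a C sketch (the addresses and the ≤ tie-break for target == PC are illustrative choices):

```c
#include <stdint.h>
#include <stdbool.h>

/* Static forward-not-taken / backward-taken heuristic:
   predict taken only when the branch target is at a lower (or equal)
   address than the branch itself -- a likely loop back-edge. */
bool static_predict_taken(uint64_t branch_pc, uint64_t target_pc) {
    return target_pc <= branch_pc;   /* backward => taken; forward => not taken */
}
```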

exercise: static prediction

.global foo
foo:
    xor %eax, %eax // eax <- 0
foo_loop_top:
    test $0x1, %edi
    je foo_loop_bottom  // if (edi & 1 == 0) goto foo_loop_bottom
    add %edi, %eax 
foo_loop_bottom:
    dec %edi            // edi = edi - 1
    jg foo_loop_top     // if (edi > 0) goto foo_loop_top
    ret
  • suppose %edi = 3 (initially)
  • and using forward-not-taken, backwards-taken strategy:
  • how many mispredictions for je? for jg?

predict: repeat last

example

collisions?

  • two branches could have same hashed PC

  • nothing in table tells us about this

    • versus direct-mapped cache: had tag bits to tell

  • is it worth it?
  • adding tag bits makes table much larger and/or slower
  • but does anything go wrong when there’s a collision?

collision results

  • possibility 1: both branches usually taken

    • no actual conflict — prediction is better(!)
  • possibility 2: both branches usually not taken

    • no actual conflict — prediction is better(!)
  • possibility 3: one branch taken, one not taken

    • performance probably worse

1-bit predictor and loops

  • loops have jump at beginning/end
    • static prediction got these 100%
  • 1-bit prediction: predicts first and last iteration wrong
    • first wrong: last time loop ran, did not continue, but should
    • last wrong: last time loop ran, did continue, but should not
  • everything else correct
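A 1-bit predictor is just a table of single bits indexed by a hash of the branch PC; a minimal sketch (table size and hash are illustrative choices, not from the slides):

```c
#include <stdint.h>
#include <stdbool.h>

#define TABLE_SIZE 1024   /* number of 1-bit entries (arbitrary for illustration) */

static bool last_outcome[TABLE_SIZE];  /* one bit per hashed PC; zero-initialized */

/* index the table with low PC bits -- a simple, common hash;
   note two branches can collide on the same entry */
static unsigned bp_index(uint64_t pc) { return (pc >> 2) % TABLE_SIZE; }

/* predict: repeat whatever this (hashed) branch did last time */
bool predict_1bit(uint64_t pc) { return last_outcome[bp_index(pc)]; }

/* after the branch resolves, record the actual outcome */
void update_1bit(uint64_t pc, bool taken) { last_outcome[bp_index(pc)] = taken; }
```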

exercise (pt 1)

  • use 1-bit predictor on this loop

    • executed in outer loop (not shown) many, many times
  • what is the conditional jump misprediction rate for i % 3 == 0? for i == 50? overall?

int i = 0;
while (true) {
  if (i % 3 == 0)
    goto next; 
  ...
next:
  i += 1;
  if (i == 50)
    break; 
}
i | branch | prediction | outcome | correct?
--|--------|------------|---------|---------
0 | mod 3  | ???        | T       | ???
1 | == 50  | ???        | F       | ???
1 | mod 3  | T          | F       | ✗
2 | == 50  | F          | F       | ✓

exercise soln (1)

beyond 1-bit predictor

  • devote more space to storing history

  • main goal: rare exceptions don’t immediately change prediction


  • example: branch taken 99% of the time

  • 1-bit predictor: wrong about 2% of the time

    • 1% when branch not taken
    • 1% of taken branches right after branch not taken
  • new predictor: wrong about 1% of the time

    • 1% when branch not taken

2-bit saturating counter

example

generalizing saturating counters

  • 2-bit counter: ignore one exception to taken/not taken
  • 3-bit counter: ignore more exceptions
  • 000 ↔ 001 ↔ 010 ↔ 011 ↔ 100 ↔ 101 ↔ 110 ↔ 111
  • 000-011: not taken
  • 100-111: taken
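The 2-bit state machine can be sketched directly in C; the same threshold generalizes to n bits (predict taken when the counter is in the top half of its range):

```c
#include <stdbool.h>

/* 2-bit saturating counter: 0,1 predict not taken; 2,3 predict taken.
   A single wrong-direction outcome only nudges the counter by one,
   so one rare exception doesn't flip the prediction. */
typedef unsigned char counter2;

bool predict_2bit(counter2 c) { return c >= 2; }

counter2 update_2bit(counter2 c, bool taken) {
    if (taken)  return c < 3 ? c + 1 : 3;  /* count up, saturate at 3 */
    else        return c > 0 ? c - 1 : 0;  /* count down, saturate at 0 */
}
```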

exercise

  • use 2-bit predictor on this loop

    • executed in outer loop (not shown) many, many times
  • what is the conditional branch misprediction rate?

int i = 0;
while (true) {
  if (i % 3 == 0) goto next;
  ...
next:
  i += 1;
  if (i == 50) break;
}

exercise soln (1)

briefly: other prediction ideas

  • branches often have cyclic behavior
    • example: for (int i = 0; i < 5; ++i)
    • example: if (first or last item) {...}
    • prediction idea: try to find/count fixed loops
    • prediction idea: identify patterns like TTNNNTTNNNT → T
  • branches are correlated
    • example: flag checked in multiple places
    • example: special case for first iteration of loop
    • prediction idea: identify patterns across multiple branches

predicting ret: ministack of return addresses

  • predicting ret — ministack in processor registers
    • push on ministack on call; pop on ret
  • ministack overflows? discard oldest, mispredict it later

4-entry return address stack

  • on call: increment index, save return address in that slot
  • on ret: read prediction from index, decrement index
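A sketch of the two bullets above in C, with wraparound implementing the "discard oldest, mispredict it later" policy (the entry count matches the slide; the index convention is an illustrative choice):

```c
#include <stdint.h>

/* 4-entry return address ministack; on overflow the oldest entry is
   silently overwritten, so the corresponding ret mispredicts later. */
#define RAS_ENTRIES 4

static uint64_t ras[RAS_ENTRIES];
static unsigned ras_index = 0;   /* slot of the most recent entry */

void ras_on_call(uint64_t return_addr) {
    ras_index = (ras_index + 1) % RAS_ENTRIES;   /* increment index (wraps) */
    ras[ras_index] = return_addr;                /* save return address there */
}

uint64_t ras_on_ret(void) {
    uint64_t prediction = ras[ras_index];        /* read prediction from index */
    ras_index = (ras_index + RAS_ENTRIES - 1) % RAS_ENTRIES; /* decrement (wraps) */
    return prediction;
}
```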

branch prediction before fetch/decode

branch target buffer

  • suppose we have OOO processor that fetches 4 instructions/cycle
  • given jmp LABEL is 1st instruction we fetch
    • want to fetch code at LABEL ideally in same cycle to avoid slowdown
    • if not same cycle, in next cycle
    • likely we won’t decode LABEL for several cycles
  • if it’s jle LABEL instead…
    • also want to predict if it will go to LABEL just as fast
    • may not figure out it’s a jle for several cycles

  • solution: special predictor for what instruction will be

BTB: cache for branch targets

indirect branch prediction

  • jmp *%rax or jmp *(%rax, %rcx, 8) or call *%rax or …
    • example: implementing switch statement
    • example: implementing polymorphic method call
  • want to predict target address
  • BTB could store one possibility
  • really want to take advantage of context
    • example: guess whether we’ll call Rectangle.GetArea or Circle.GetArea
  • prediction idea: instead of tables containing taken/not taken…
  • … have table containing predicted target
    • one implementation: hashtable keyed by recent branches taken

branch patterns

i = 4;
do {
    ...
    i -= 1;
} while (i != 0);

  • typical pattern for jump to top of do-while above:
  • (T = taken, N = not taken)
  • goal: take advantage of recent pattern to make predictions
  • just saw ‘NTTTNT’? predict T next
  • ‘TNTTTN’? predict T; ‘TTNTTT’? predict N next
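One common way to turn "recent pattern → prediction" into hardware is a per-branch history shift register that indexes a table of 2-bit counters, one counter per pattern. A single-branch sketch (history length and the 2-bit learning rule are illustrative assumptions):

```c
#include <stdbool.h>

#define HISTORY_BITS 6
#define PATTERNS (1 << HISTORY_BITS)

static unsigned history;                 /* last 6 outcomes of this branch, 1 = taken */
static unsigned char counters[PATTERNS]; /* 2-bit counters, one per observed pattern */

/* predict by looking up what usually followed the current pattern */
bool predict_local(void) {
    return counters[history] >= 2;       /* top half of counter range = taken */
}

/* after the branch resolves: train the counter, then shift in the outcome */
void update_local(bool taken) {
    unsigned char *c = &counters[history];
    if (taken)  { if (*c < 3) (*c)++; }
    else        { if (*c > 0) (*c)--; }
    history = ((history << 1) | (taken ? 1u : 0u)) & (PATTERNS - 1);
}
```

After training on the do-while pattern above (TTTN repeating), each of the four recurring 6-bit histories maps to its own counter, so the predictor learns to predict N exactly when three Ts follow the last N.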

local pattern predictor (incomplete)

recent pattern to prediction?

  • easy cases:

  • just saw TTTTTT: predict T

  • just saw NNNNNN: predict N

  • just saw TNTNTN: predict T


  • hard cases:

  • TTNTTTT

    • predict T? loop with many iterations (NTTTTTTTNTTTTTTTNTTTTTT…)
    • predict T? if statement mostly taken (TTTTNTTNTTTTTTTTTTNTTTT…)
    • predict N? loop with 5 iterations (NTTTTNTTTTNTTTTNTTTTNTT…)
  • (many more)

history of history

local patterns and collisions (1)

i = 10000;
do {
    p = malloc(...);
    if (p == NULL) goto error;  // BRANCH 1
    ...
} while (i-- != 0); // BRANCH 2

  • what if branch 1 and branch 2 hash to same table entry?
  • pattern: TNTNTNTNTNTNTNTNT…
  • actually no problem to predict!

local patterns and collisions (2)

i = 10000;
do {
    if (i % 2 == 0) goto skip; // BRANCH 1
    ...
    p = malloc(...);
    if (p == NULL) goto error;  // BRANCH 2
skip: ...
} while (i-- != 0); // BRANCH 3

  • what if branch 1 and branch 2 and branch 3 hash to same table entry?
  • pattern: TTNNTTNNTTNNTTNNTT
  • also no problem to predict!

local patterns and collisions (3)

i = 10000;
do {
    if (A) goto one;           // BRANCH 1
    ...
one:
    if (B) goto two;           // BRANCH 2
    ...
two:
    if (A or B) goto three;    // BRANCH 3
    ...
    if (A and B) goto three;   // BRANCH 4
    ...
three:
    ... // changes A, B
} while (i-- != 0); 

  • what if branch 1-4 hash to same table entry?
  • actually better for predicting branches 3 and 4: the shared entry records the recent outcomes of branches 1 and 2 (i.e. A and B), which is exactly the information those branches depend on

global history predictor: idea

  • one predictor idea: ignore the PC
  • just record taken/not-taken pattern for all branches
  • lookup in big table like for local patterns

global history predictor (1)

correlating predictor

  • global history and local info good together
  • one idea: combine history register + PC (“gshare”)
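A gshare sketch, assuming the common formulation (XOR the global history register with branch PC bits to index one shared table of 2-bit counters; sizes are illustrative):

```c
#include <stdint.h>
#include <stdbool.h>

#define GSHARE_BITS 12
#define GSHARE_SIZE (1 << GSHARE_BITS)

static unsigned ghist;                     /* recent taken/not-taken of ALL branches */
static unsigned char gtable[GSHARE_SIZE];  /* shared 2-bit counters */

/* XOR mixes in the PC, so the same global history can map to
   different table entries for different branches */
static unsigned gshare_index(uint64_t pc) {
    return ((unsigned)(pc >> 2) ^ ghist) & (GSHARE_SIZE - 1);
}

bool gshare_predict(uint64_t pc) { return gtable[gshare_index(pc)] >= 2; }

void gshare_update(uint64_t pc, bool taken) {
    unsigned char *c = &gtable[gshare_index(pc)];
    if (taken) { if (*c < 3) (*c)++; } else { if (*c > 0) (*c)--; }
    ghist = ((ghist << 1) | (taken ? 1u : 0u)) & (GSHARE_SIZE - 1);
}
```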

mixing predictors

  • different predictors good at different times
  • one idea: have two predictors, + predictor to predict which is right
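The chooser ("meta") predictor can itself be a 2-bit counter per hashed branch, trained only when the two component predictions disagree. A sketch of that design choice (one common scheme, not the only one):

```c
#include <stdbool.h>

/* chooser counter: 0,1 favor predictor A; 2,3 favor predictor B */
bool choose(unsigned char chooser, bool pred_a, bool pred_b) {
    return chooser >= 2 ? pred_b : pred_a;
}

/* train only on disagreement -- when both agree there is no signal
   about which predictor is better */
unsigned char train_chooser(unsigned char chooser, bool pred_a, bool pred_b,
                            bool actual) {
    if (pred_a == pred_b) return chooser;
    if (pred_b == actual) return chooser < 3 ? chooser + 1 : 3;  /* B was right */
    else                  return chooser > 0 ? chooser - 1 : 0;  /* A was right */
}
```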

loop count predictors (1)

for (int i = 0; i < 64; ++i)
    ...
  • can we predict this perfectly with predictors we’ve seen?
  • yes — local or global history with 64 entries
  • but this is very important — more efficient way?

loop count predictors (2)

  • loop count predictor idea: look for NNNNNNT+repeat (or TTTTTTN+repeat)

  • track for each possible loop branch:

    • how many repeated Ns (or Ts) so far
    • how many repeated Ns (or Ts) last time before one T (or N)
    • something to indicate this pattern is useful?
  • known to be used on Intel
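Following the bullets above, a minimal per-branch loop-count entry might track just two counts; this sketch ignores the "is this pattern useful?" confidence bit and handles only the TTTT…N direction:

```c
#include <stdbool.h>

typedef struct {
    unsigned current_count;  /* consecutive Ts in the current run */
    unsigned trip_count;     /* Ts before the N last time (0 = not yet learned) */
} loop_entry;

/* predict not-taken exactly when the current run matches the learned length */
bool loop_predict(const loop_entry *e) {
    return !(e->trip_count != 0 && e->current_count == e->trip_count);
}

void loop_update(loop_entry *e, bool taken) {
    if (taken) {
        e->current_count++;
    } else {
        e->trip_count = e->current_count;  /* remember this run's length */
        e->current_count = 0;
    }
}
```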

benchmark results

  • from 1993 paper
  • (not representative of modern workloads?)
  • rate for conditional branches on benchmark
  • variable table sizes

2-bit ctr + local history

from McFarling, “Combining Branch Predictors” (1993)

2-bit (bimodal) + local + global hist

from McFarling, “Combining Branch Predictors” (1993)

global + hash(global+PC) (gshare/gselect)

from McFarling, “Combining Branch Predictors” (1993)

real BP?

reverse engineering Haswell BPs

  • branch target buffer

    • 4-way, 4096 entries
    • ignores bottom 4 bits of PC?
    • hashes PC to index by shifting + XOR
    • seems to store 32 bit offset from PC (not all 48+ bits of virtual addr)
  • indirect branch predictor

    • like the global history + PC predictor we showed, but…
    • uses history of recent branch addresses instead of taken/not taken
    • keeps some info about last 29 branches
  • what about conditional branches??? loops???

    • couldn’t find a reasonable source

backup slides
