sync-reorder

the correctness problem

two threads?
introduces non-determinism
which one runs first?
allows for ‘‘race condition’’ bugs
… to be avoided with synchronization constructs

a threaded server?

Deposit(accountNumber, amount) {
    account = GetAccount(accountId);
    account->balance += amount;
    SaveAccountUpdates(account);
}

maybe GetAccount/SaveAccountUpdates can be slow?
- read/write disk sometimes? contact another server sometimes?
maybe lots of requests to process?
- maybe real logic has more checks than Deposit()
- …
all reasons to handle multiple requests at once
\(\rightarrow\) many threads all running the server loop

the lost write

account->balance += amount; (in two threads; same account)

Thread A		Thread B
`mov account->balance, %rax`
`add amount, %rax`

		`mov account->balance, %rax`
		`add amount, %rax`

`mov %rax, account->balance` “lost write”

		`mov %rax, account->balance` “winner” of race

thinking about race conditions (1)

what are the possible values of \(x\)? (initially \(x = y = 0\))

Thread A Thread B

\(x \leftarrow 1\) \(y \leftarrow 2\)

must be 1. Thread B can’t do anything

thinking about race conditions (2)

possible values of \(x\)? (initially \(x = y = 0\))

Thread A Thread B

\(x \leftarrow y + 1\) \(y \leftarrow 2\)

\(y \leftarrow y \times 2\)

if A goes first, then B: \(1\)
if B goes first, then A: \(5\)
if B line one, then A, then B line two: \(3\)

thinking about race conditions (3)

what are the possible values of \(x\)?
(initially \(x = y = 0\))

Thread A Thread B

\(x \leftarrow 1\) \(x \leftarrow 2\)

1 or 2

… but why not 3?
- B: x bit 0 \(\leftarrow 0\)
- A: x bit 0 \(\leftarrow 1\)
- A: x bit 1 \(\leftarrow 0\)
- B: x bit 1 \(\leftarrow 1\)

thinking about race conditions (2, reprise)

possible values of \(x\)? (initially \(x = y = 0\))

Thread A

Thread B

\(x \leftarrow y + 1\)

\(y \leftarrow 2\)

\(y \leftarrow y \times 2\)
why not 7?
- B (start): \(y \leftarrow 2 = 0010_{\text{TWO}}\); then y bit 3 \(\leftarrow\) 0; y bit 2 \(\leftarrow\) 1; then
- A: x \(\leftarrow 110_{\text{TWO}} + 1 = 7\); then
- B (finish): y bit 1 \(\leftarrow\) 0; y bit 0 \(\leftarrow\) 0

atomic operation

atomic operation = operation that runs to completion or not at all
we will use these to let threads work together
most machines: loading/storing (aligned) words is atomic
- so can’t get \(3\) from \(x \leftarrow 1\) and \(x \leftarrow 2\) running in parallel
- aligned \(\approx\) address of word is multiple of word size (typically done by compilers)
but some instructions are not atomic; examples:
- x86: integer add constant to memory location
- many CPUs: loading/storing values that cross cache blocks
  - e.g. if cache blocks 0x40 bytes, load/store 4 byte from addr. 0x3E is not atomic

lost adds (program)

.global update_loop
update_loop:
    addl $1, the_value // the_value (global variable) += 1
    dec %rdi           // argument 1 -= 1
    jg update_loop     // if argument 1 >= 0 repeat
    ret

int the_value;
extern void *update_loop(void *);
int main(void) {
    the_value = 0;
    pthread_t A, B;
    pthread_create(&A, NULL, update_loop, (void*) 1000000);
    pthread_create(&B, NULL, update_loop, (void*) 1000000);
    pthread_join(A, NULL); pthread_join(B, NULL);
    // expected result: 1000000 + 1000000 = 2000000
    printf("the_value = %d\n", the_value);
}

lost adds (results)

histogram of actual measurements showing results clustered in around 1 million; with most slightly above that and a few below that. The histogram has no visible results above around 1.3 million.

but how?

probably not possible on single core
- exceptions can’t occur in the middle of add instruction
… but ‘add to memory’ implemented with multiple steps
- still needs to load, add, store internally
- can be interleaved with what other cores do

(and actually it’s more complicated than that — we’ll talk later)

so, what is actually atomic

for now we’ll assume: load/stores of ‘words’
- (64-bit machine = 64-bits words)

in general: processor designer will tell you
their job to design caches, etc. to work as documented

compilers move loads/stores (1)

void WaitForReady() {
  do {} while (!ready);
}

WaitForReady:
  movl ready, %eax         // eax <- other_ready

  // value only loaded before loop
.L2:
  testl %eax, %eax
  je .L2                   // while (eax == 0) repeat
  ...

C standard says: can assume no other thread
So, it’s fine to load ready in advance and reuse that

compilers move loads/stores (2)

void WaitForReady() {
  is_waiting = 1;
  do {} while (!ready);
  is_waiting = 0;
}

WaitForOther:
  movl ready, %eax
.L2:
  testl %eax, %eax
  je .L2                // while (eax == 0) repeat
  movl $0, is_waiting
  ret

C standard says: can assume no other thread
So, can do assignments is_waiting anytime before return
And is_waiting = 1; is_waiting = 0 same as is_waiting = 0

fixing compiler reordering?

isn’t there a way to tell compiler not to do these optimizations?
yes, but that is still not enough!
processors sometimes do this kind of reordering too (between cores)

pthreads and reordering

many pthreads functions prevent reordering
- everything before function call actually happens before
includes preventing some optimizations
- e.g. keeping global variable in register for too long
pthread_create, pthread_join, other tools we’ll talk about …
- basically: if pthreads is waiting for/starting something, no weird ordering
implementation part 1: prevent compiler reordering
implementation part 2: use special instructions
- example: x86 mfence instruction

Backup slides

a simple race

thread_A:
    movl $1, x   /* x <- 1 */
    movl y, %eax /* return y */
    ret

thread_B:
    movl $1, y   /* y <- 1 */
    movl x, %eax /* return x */
    ret

x = y = 0;
pthread_create(&A, NULL, thread_A, NULL);
pthread_create(&B, NULL, thread_B, NULL);
pthread_join(A, &A_result); pthread_join(B, &B_result);
printf("A:%d B:%d\n", (int) A_result, (int) B_result);

if loads/stores atomic, then possible results:
- A:1 B:1 — both moves into x and y, then both moves into eax execute
- A:0 B:1 — thread A executes before thread B
- A:1 B:0 — thread B executes before thread A

a simple race: results

thread_A:
    movl $1, x   /* x <- 1 */
    movl y, %eax /* return y */
    ret

thread_B:
    movl $1, y   /* y <- 1 */
    movl x, %eax /* return x */
    ret

x = y = 0;
pthread_create(&A, NULL, thread_A, NULL);
pthread_create(&B, NULL, thread_B, NULL);
pthread_join(A, &A_result); pthread_join(B, &B_result);
printf("A:%d B:%d\n", (int) A_result, (int) B_result);

my desktop, 100M trials:

frequency	result
\(99\,823\,739\)	A:0 B:1	(‘A executes before B’)
\(171\,161\)	A:1 B:0	(‘B executes before A’)
\(4\,706\)	A:1 B:1	(‘execute moves into x+y first’)
\(394\)	A:0 B:0	???

why reorder here?

thread_A:
    movl $1, x   /* x <- 1 */
    movl y, %eax /* return y */
    ret

thread_B:
    movl $1, y   /* y <- 1 */
    movl x, %eax /* return x */
    ret

thread A: faster to load y right now!
… rather than wait for write of x to finish

why load/store reordering?

fast processor designs can execute instructions out of order
goal: do something instead of waiting for slow memory accesses, etc.
more on this later in the semester

GCC: preventing reordering example (1)

void WaitForReady() {
    int one = 1;
    __atomic_store(&is_waiting, &one, __ATOMIC_SEQ_CST);
    do {
    } while (!__atomic_load_n(&ready, __ATOMIC_SEQ_CST));
    ...
}

WaitForReady:
  movl $1, is_waiting
  mfence
.L2:
  movl ready, %eax
  testl %eax, %eax
  jz .L2
  ...

GCC: preventing reordering example (2)

void WaitForReady() {
    is_waiting = 1;
    do {
        __atomic_thread_fence(__ATOMIC_SEQ_CST);
    } while (!ready);
    ...
}

WaitForReady:
  movl $1, is_waiting // is_waiting <- 1
.L3:
  mfence  // make sure store is visible to other cores before loading
          // on x86: not needed on second+ iteration of loop
  cmpl $0, ready // if (ready == 0) repeat fence
  jz .L3
  ...

C++: preventing reordering

to help implementing things like pthread_mutex_lock
C++ 2011 standard: atomic header, std::atomic class
prevent CPU reordering and prevent compiler reordering
also provide other tools for implementing locks (more later)
could also hand-write assembly code
- compiler can’t know what assembly code is doing

C++: preventing reordering example

#include <atomic>
void WaitForReady() {
    is_waiting = 1;
    do {
        std::atomic_thread_fence(std::memory_order_seq_cst);
    } while (!ready);
}

WaitForReady:
  movl $1, is_waiting // is_waiting <- 1
.L2:
  mfence  // make sure store visible on/from other cores
  cmpl $0, ready // if (ready == 0) repeat fence
  jz .L2
  ...

C++ atomics: no reordering

std::atomic<int> is_waiting, ready;
void WaitForReaady() {
    is_waiting.store(1);
    do {
    } while (ready.load());
}

WaitForReady:
  movl $1, is_waiting
  mfence
.L2:
  movl ready, %eax
  testl %eax, %eax
  jz .L2
  ...

GCC: built-in atomic functions

used to implement std::atomic, etc.
predate std::atomic
builtin functions starting with __sync and __atomic

aside: some x86 reordering rules

each core sees its own loads/stores in order
- (if a core stores something, it can always load it back)
stores from other cores appear in a consistent order
- (but a core might observe its own stores too early)
causality:
if a core reads X=a and (after reading X=a) writes Y=b,
then a core that reads Y=b cannot later read X=older value than a Source: Intel 64 and IA-32 Software Developer’s Manual, Volume 3A, Chapter 8

how do you do anything with this?

difficult to reason about what modern CPU’s reordering rules do
typically: don’t depend on details, instead:
special instructions with stronger (and simpler) ordering rules
- often same instructions that help with implementing locks in other ways
special instructions that restrict ordering of instructions around them (‘‘fences’’)
- loads/stores can’t cross the fence

mfence

x86 instruction mfence
make sure all loads/stores in progress finish
… and make sure no loads/stores were started early
fairly expensive
- Intel ‘Skylake’: order 33 cycles + time waiting for pending stores/loads

aside: this instruction did not exist in the original x86
so xv6 uses something older that’s equivalent

Thread A	Thread B
\(x \leftarrow y + 1\)	\(y \leftarrow 2\)
	\(y \leftarrow y \times 2\)