Deposit(accountId, amount) {
    account = GetAccount(accountId);
    account->balance += amount;
    SaveAccountUpdates(account);
}
maybe GetAccount/SaveAccountUpdates can be slow?
maybe lots of requests to process?
all reasons to handle multiple requests at once
\(\rightarrow\) many threads all running the server loop
account->balance += amount; (in two threads; same account)
| Thread A | Thread B | |
|---|---|---|
| mov account->balance, %rax | | |
| add amount, %rax | | |
| | mov account->balance, %rax | |
| | add amount, %rax | |
| mov %rax, account->balance | | “lost write” |
| | mov %rax, account->balance | “winner” of race |
| Thread A | Thread B |
|---|---|
| \(x \leftarrow 1\) | \(y \leftarrow 2\) |

| Thread A | Thread B |
|---|---|
| \(x \leftarrow y + 1\) | \(y \leftarrow 2\) |
| | \(y \leftarrow y \times 2\) |

| Thread A | Thread B |
|---|---|
| \(x \leftarrow 1\) | \(x \leftarrow 2\) |
… but why not 3?
possible values of \(x\)? (initially \(x = y = 0\))
| Thread A | Thread B |
|---|---|
| \(x \leftarrow y + 1\) | \(y \leftarrow 2\) |
| | \(y \leftarrow y \times 2\) |
why not 7?
atomic operation = operation that runs to completion or not at all
we will use these to let threads work together
most machines: loading/storing (aligned) words is atomic
but some instructions are not atomic; examples:
x86: integer add constant to memory location
many CPUs: loading/storing values that cross cache blocks
e.g. if cache blocks are 0x40 bytes, a load/store of 4 bytes at addr. 0x3E is not atomic
#include <pthread.h>
#include <stdio.h>
int the_value;
extern void *update_loop(void *);
int main(void) {
    the_value = 0;
    pthread_t A, B;
    pthread_create(&A, NULL, update_loop, (void*) 1000000);
    pthread_create(&B, NULL, update_loop, (void*) 1000000);
    pthread_join(A, NULL); pthread_join(B, NULL);
    // expected result: 1000000 + 1000000 = 2000000
    printf("the_value = %d\n", the_value);
}
probably not possible on single core
add instruction… but ‘add to memory’ implemented with multiple steps
(and actually it’s more complicated than that — we’ll talk later)
for now we’ll assume: loads/stores of ‘words’ are atomic
C standard says: compiler can assume no other thread accesses these variables at the same time
so, it’s fine to load ready once in advance and reuse that value
so, it can do the assignments to is_waiting anytime before return
and is_waiting = 1; is_waiting = 0 is the same as just is_waiting = 0
mfence instruction
if loads/stores atomic, then possible results:
| frequency | result | |
|---|---|---|
| \(99\,823\,739\) | A:0 B:1 | (‘A executes before B’) |
| \(171\,161\) | A:1 B:0 | (‘B executes before A’) |
| \(4\,706\) | A:1 B:1 | (‘execute moves into x+y first’) |
| \(394\) | A:0 B:0 | ??? |
void WaitForReady() {
int one = 1;
__atomic_store(&is_waiting, &one, __ATOMIC_SEQ_CST);
do {
} while (!__atomic_load_n(&ready, __ATOMIC_SEQ_CST));
...
}
WaitForReady:
movl $1, is_waiting
mfence
.L2:
movl ready, %eax
testl %eax, %eax
jz .L2
...
void WaitForReady() {
is_waiting = 1;
do {
__atomic_thread_fence(__ATOMIC_SEQ_CST);
} while (!ready);
...
}
WaitForReady:
movl $1, is_waiting // is_waiting <- 1
.L3:
mfence // make sure store is visible to other cores before loading
// on x86: not needed on second+ iteration of loop
cmpl $0, ready // if (ready == 0) repeat fence
jz .L3
...
to help implement things like pthread_mutex_lock
C++ 2011 standard: atomic header, std::atomic class
prevent CPU reordering and prevent compiler reordering
also provide other tools for implementing locks (more later)
could also hand-write assembly code
#include <atomic>
void WaitForReady() {
is_waiting = 1;
do {
std::atomic_thread_fence(std::memory_order_seq_cst);
} while (!ready);
}
WaitForReady:
movl $1, is_waiting // is_waiting <- 1
.L2:
mfence // make sure store visible on/from other cores
cmpl $0, ready // if (ready == 0) repeat fence
jz .L2
...
#include <atomic>
std::atomic<int> is_waiting, ready;
void WaitForReady() {
    is_waiting.store(1);
    do {
    } while (!ready.load());   // keep spinning until ready is nonzero
}
WaitForReady:
movl $1, is_waiting
mfence
.L2:
movl ready, %eax
testl %eax, %eax
jz .L2
...
__sync and __atomic
each core sees its own loads/stores in order
stores from other cores appear in a consistent order
causality:
if a core reads X=a and (after reading X=a) writes Y=b,
then a core that reads Y=b cannot later read X=older value than a
Source: Intel 64 and IA-32 Software Developer’s Manual, Volume 3A, Chapter 8
difficult to reason about what modern CPUs’ reordering rules do
typically: don’t depend on details, instead:
special instructions with stronger (and simpler) ordering rules
special instructions that restrict ordering of instructions around them (‘fences’)
x86 instruction mfence
make sure all loads/stores in progress finish
… and make sure no loads/stores were started early
fairly expensive