sync-lock-impl

too much milk

  • roommates Alice and Bob want to keep fridge stocked with milk:
    time  Alice                             Bob
    3:00  look in fridge. no milk
    3:05  leave for store
    3:10  arrive at store                   look in fridge. no milk
    3:15  buy milk                          leave for store
    3:20  return home, put milk in fridge   arrive at store
    3:25                                    buy milk
    3:30                                    return home, put milk in fridge

    how can Alice and Bob coordinate better?

too much milk ‘‘solution’’ 1 (algorithm)

  • leave a note: ‘‘I am buying milk’’

    • place before buying, remove after buying
    • don’t try buying if there’s a note
  • \(\approx\) setting/checking a variable (e.g. ‘‘note = 1’’)

    • with atomic load/store of variable
if (no milk) {
    if (no note) {
        leave note;
        buy milk;
        remove note;
    }
}
  • exercise: why doesn’t this work?

too much milk ‘‘solution’’ 1 (timeline)

Alice                         Bob
if (no milk) {
    if (no note) {
                              if (no milk) {
                                  if (no note) {
        leave note;
        buy milk;
        remove note;
                                      leave note;
                                      buy milk;
                                      remove note;
    }
}
                                  }
                              }
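
The interleaving above can be replayed deterministically. The following is a minimal single-threaded sketch, not part of the original slides: the names `fridge`, `roommate`, `run_step`, `replay_bad_interleaving` and `replay_serial` are made up for illustration. Each roommate runs ‘‘solution’’ 1 one step at a time so that the caller chooses the order of the steps.

```c
#include <stdbool.h>

/* Shared state of the simulated household (hypothetical names). */
struct fridge {
    int milk;   /* cartons of milk in the fridge */
    bool note;  /* true while an "I am buying milk" note is posted */
};

/* The steps of "solution" 1, so the caller controls the interleaving. */
enum step { CHECK_MILK, CHECK_NOTE, LEAVE_NOTE, BUY_MILK, REMOVE_NOTE, DONE };

struct roommate { enum step next; };

static void run_step(struct roommate *r, struct fridge *f) {
    switch (r->next) {
    case CHECK_MILK:  r->next = (f->milk == 0) ? CHECK_NOTE : DONE; break;
    case CHECK_NOTE:  r->next = (!f->note) ? LEAVE_NOTE : DONE;     break;
    case LEAVE_NOTE:  f->note = true;  r->next = BUY_MILK;          break;
    case BUY_MILK:    f->milk += 1;    r->next = REMOVE_NOTE;       break;
    case REMOVE_NOTE: f->note = false; r->next = DONE;              break;
    case DONE:        break;
    }
}

/* Both roommates finish both checks before either leaves a note.
   Returns the number of cartons that end up in the fridge. */
int replay_bad_interleaving(void) {
    struct fridge f = { 0, false };
    struct roommate alice = { CHECK_MILK }, bob = { CHECK_MILK };
    run_step(&alice, &f); run_step(&alice, &f);  /* Alice: no milk, no note */
    run_step(&bob, &f);   run_step(&bob, &f);    /* Bob:   no milk, no note */
    while (alice.next != DONE) run_step(&alice, &f);
    while (bob.next != DONE)   run_step(&bob, &f);
    return f.milk;
}

/* For comparison: Alice runs to completion before Bob starts. */
int replay_serial(void) {
    struct fridge f = { 0, false };
    struct roommate alice = { CHECK_MILK }, bob = { CHECK_MILK };
    while (alice.next != DONE) run_step(&alice, &f);
    while (bob.next != DONE)   run_step(&bob, &f);
    return f.milk;
}
```

In this sketch the bad interleaving ends with two cartons and the serial order with one: the check of the note and the leaving of the note are separate steps, so both roommates can pass the check before either note exists.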

too much milk ‘‘solution’’ 2 (algorithm)

  • intuition: leave note when buying or checking if need to buy
leave note;
if (no milk) {
    if (no note) {
        buy milk;
    }
}
remove note;

too much milk: ‘‘solution’’ 2 (timeline)

‘‘solution’’ 3: algorithm

  • intuition: label notes so Alice knows which is hers (and vice-versa)

    • computer equivalent: separate noteFromAlice and noteFromBob variables

too much milk: ‘‘solution’’ 3 (timeline)

too much milk: is it possible

  • is there a solution with writing/reading notes?

    • \(\approx\) loading/storing from shared memory

  • yes, but it’s not very elegant

too much milk: solution 4 (algorithm)

  • exercise (hard): prove (in)correctness
  • exercise (hard): extend to three people

Peterson’s algorithm

  • general version of solution
  • see, e.g., Wikipedia
  • we’ll use special hardware support instead
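
For reference, the two-thread form of Peterson's algorithm might be sketched in C11 as below. This is an illustrative sketch rather than the slides' own code: the names `interested`, `turn`, `peterson_lock` and `peterson_unlock` are assumptions, and the default (sequentially consistent) atomic operations are used because the algorithm relies on loads and stores not being reordered.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Shared state for two threads with ids 0 and 1 (hypothetical names). */
static atomic_bool interested[2];  /* interested[i]: thread i wants the lock */
static atomic_int  turn;           /* which thread gets priority on a tie */

void peterson_lock(int self) {
    int other = 1 - self;
    atomic_store(&interested[self], true);  /* "leave my note" */
    atomic_store(&turn, other);             /* give the other thread priority */
    /* Wait while the other thread wants the lock and has priority. */
    while (atomic_load(&interested[other]) && atomic_load(&turn) == other) {
        /* spin */
    }
}

void peterson_unlock(int self) {
    atomic_store(&interested[self], false); /* "remove my note" */
}
```

Thread 0 would call `peterson_lock(0)` / `peterson_unlock(0)` around its critical section and thread 1 would use id 1. The sketch only covers two threads.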

x86-64 spinlock with xchg

  • lock variable in shared memory: the_lock
  • if 1: someone has the lock; if 0: lock is free to take
acquire:
    movl $1, %eax             // %eax <- 1
    lock xchg %eax, the_lock  // swap %eax and the_lock
                                    // sets the_lock to 1 (taken)
                                    // sets %eax to prior val. of the_lock
    test %eax, %eax           // if the_lock wasn't 0 before:
    jne acquire               //   try again
    ret

release:
    mfence                    // for memory order reasons
    movl $0, the_lock         // then, set the_lock to 0 (not taken)
    ret
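
The same lock can be written portably with C11 atomics. The sketch below is an assumption-laden equivalent, not the slides' code: `struct spinlock`, `spin_try_acquire`, `spin_acquire` and `spin_release` are illustrative names. On x86-64, compilers typically turn the atomic exchange into an `xchg` like the one above.

```c
#include <stdatomic.h>
#include <stdbool.h>

struct spinlock { atomic_int locked; };  /* 1: taken, 0: free */

/* One attempt: store 1 and look at the previous value.
   Returns true if the lock was free, meaning we now hold it. */
bool spin_try_acquire(struct spinlock *lk) {
    return atomic_exchange_explicit(&lk->locked, 1,
                                    memory_order_acquire) == 0;
}

void spin_acquire(struct spinlock *lk) {
    while (!spin_try_acquire(lk)) {
        /* lock was already taken: try again */
    }
}

void spin_release(struct spinlock *lk) {
    /* release ordering: writes made while holding the lock become
       visible before the lock is seen as free */
    atomic_store_explicit(&lk->locked, 0, memory_order_release);
}
```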

exercise: spin wait

  • consider implementing ‘waiting’ functionality of pthread_join
  • thread calls ThreadFinish() when done
  • complete code below:
finished: .quad 0
ThreadFinish:
    _________________________
    ret
ThreadWaitForFinish:
    _________________________
    lock xchg %eax, finished
    cmp $0, %eax
    ____ ThreadWaitForFinish
    ret
A. mfence; mov $1, finished    C. mov $0, %eax    E. je
B. mov $1, finished; mfence    D. mov $1, %eax    F. jne

exercise: spin wait

finished: .quad 0
ThreadFinish:
    __________A______________
    ret
ThreadWaitForFinish:            /* or without using a writing instruction: */
    _________B______________    mov finished, %eax
    lock xchg %eax, finished    mfence
    cmp $0, %eax                cmp $0, %eax
    __C_ ThreadWaitForFinish    je ThreadWaitForFinish
    ret                         ret
A. mfence; mov $1, finished    C. mov $0, %eax    E. je
B. mov $1, finished; mfence    D. mov $1, %eax    F. jne
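
The read-only variant of the wait loop has a direct C11 counterpart. The sketch below is illustrative only: the function names are made up, and release/acquire ordering stands in for the fence used in the assembly.

```c
#include <stdatomic.h>

static atomic_int finished;  /* 0 until the thread is done */

void thread_finish(void) {
    /* release: everything the thread wrote earlier becomes visible
       to the waiter before the flag does */
    atomic_store_explicit(&finished, 1, memory_order_release);
}

void thread_wait_for_finish(void) {
    /* read-only spin: no exchange needed, an atomic load is enough */
    while (atomic_load_explicit(&finished, memory_order_acquire) == 0) {
        /* not finished yet: check again */
    }
}
```

Unlike the `xchg` version, this loop never writes to `finished`, so several threads could wait on the same flag.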

spinlock problems

  • lock abstraction is not powerful enough

    • lock/unlock operations don’t handle ‘‘wait for event’’
    • common thing we want to do with threads
    • solution: other synchronization abstractions
  • spinlocks waste CPU time more than needed

    • want to run another thread instead of infinite loop
    • solution: lock implementation integrated with scheduler
  • spinlocks can send a lot of messages on the shared bus

    • solution: more efficient atomic operations to implement locks

problem: busy waits

  while(xchg(&lk->locked, 1) != 0)
    ; 
  • what if it’s going to be a while?
  • waiting for process that’s waiting for I/O?
  • really would like to do something else with CPU instead…

mutexes: intelligent waiting

  • want: locks that wait better

    • example: POSIX mutexes
  • instead of running infinite loop, give away CPU


  • lock = go to sleep, add self to list

    • sleep = scheduler runs something else
  • unlock = wake up sleeping thread
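
As a usage illustration of the POSIX mutexes mentioned above, the sketch below protects a shared counter. The counter and the function name `increment_counter` are made up for the example; only the `pthread_mutex_*` calls come from POSIX.

```c
#include <pthread.h>

static pthread_mutex_t counter_lock = PTHREAD_MUTEX_INITIALIZER;
static long counter;

/* Safe to call from many threads at once. */
long increment_counter(void) {
    pthread_mutex_lock(&counter_lock);    /* may sleep instead of spinning */
    long value = ++counter;               /* critical section */
    pthread_mutex_unlock(&counter_lock);  /* lets a waiting thread proceed */
    return value;
}
```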

better lock implementation idea

  • shared list of waiters
  • spinlock protects list of waiters from concurrent modification
  • lock = use spinlock to add self to list, then wait without spinlock
  • unlock = use spinlock to remove item from list

one possible implementation

struct Mutex { 
    SpinLock guard_spinlock;
    bool lock_taken = false;
    WaitQueue wait_queue;
};
LockMutex(Mutex *m) {
  LockSpinlock(&m->guard_spinlock);
  if (m->lock_taken) {
    put current thread on m->wait_queue
    mark current thread as waiting
    /* xv6: myproc()->state = SLEEPING; */
    UnlockSpinlock(&m->guard_spinlock);
    run scheduler (context switch)
  } else {
    m->lock_taken = true;
    UnlockSpinlock(&m->guard_spinlock);
  }
}
UnlockMutex(Mutex *m) {
  LockSpinlock(&m->guard_spinlock);
  if (m->wait_queue not empty) {
    remove a thread from m->wait_queue 
    mark thread as no longer waiting
    /* xv6: p->state = RUNNABLE; (p: the thread just removed, not myproc()) */
  } else {
     m->lock_taken = false;
  }
  UnlockSpinlock(&m->guard_spinlock);
}

notes on the implementation above:

  • guard_spinlock: spinlock protecting lock_taken and wait_queue

    • only held for very short amount of time (compared to mutex itself)
  • lock_taken: tracks whether any thread has locked and not unlocked

  • wait_queue: list of threads that discovered lock is taken and are waiting for it to be free

    • these threads are not runnable
  • UnlockMutex when a thread is waiting: instead of setting lock_taken to false, choose thread to hand off lock to

  • subtlety: UnlockMutex may run on another core after LockMutex releases the spinlock but before it runs the scheduler

    • need to make sure scheduler on the other core doesn’t switch to thread while it is still running (would ‘clone’ thread/mess up registers)

mutex and scheduler subtlety

core 0 (thread A)            core 1 (thread B)
start LockMutex
acquire spinlock
discover lock taken
enqueue thread A
thread A set not runnable
release spinlock             start UnlockMutex
                             thread A set runnable
                             finish UnlockMutex
                             run scheduler
                             scheduler switches to A
                             … with old version of registers
thread A runs scheduler
… finally saving registers
  • Linux soln.: track ‘thread running’ separately from ‘thread runnable’
  • xv6 soln.: hold scheduler lock until thread A saves registers

mutex efficiency

  • ‘normal’ mutex uncontended case:

    • lock: acquire + release spinlock, see lock is free
    • unlock: acquire + release spinlock, see queue is empty

  • not much slower than spinlock

implementing locks: single core

  • intuition: context switch only happens on interrupt

    • timer expiration, I/O, etc. causes OS to run
  • solution: disable them

    • reenable on unlock
  • x86 instructions:

    • cli — disable interrupts
    • sti — enable interrupts

naive interrupt enable/disable (1)

Lock() {
    disable interrupts;
}
Unlock() {
    enable interrupts;
}
  • problem: user can hang the system:
        Lock(some_lock);
        while (true) {}

  • problem: can’t do I/O within lock
        Lock(some_lock);
        read from disk
            /* waits forever for (disabled) interrupt
               from disk IO finishing */
    

naive interrupt enable/disable (2)

Lock() {
    disable interrupts;
}
Unlock() {
    enable interrupts;
}
  • problem: nested locks
        Lock(milk_lock);
        if (no milk) {
            Lock(store_lock);
            buy milk
            Unlock(store_lock);
            /* interrupts enabled here?? */
        }
        Unlock(milk_lock);
    

ping-ponging

  • test-and-set problem: cache block ‘‘ping-pongs’’ between caches

    • each waiting processor reserves block to modify
    • could maybe wait until it determines modification needed — but not typical implementation
  • each transfer of block sends messages on bus

  • … so bus can’t be used for real work

    • like what the processor with the lock is doing

test-and-test-and-set (pseudo-C)

acquire(int *the_lock) {
    do {
        while (ATOMIC-READ(the_lock) != 0) { /* still locked: try again */ }
    } while (ATOMIC-TEST-AND-SET(the_lock) == ALREADY_SET);
}

test-and-test-and-set (assembly)

acquire:
    cmpl $0, the_lock        // test the lock non-atomically
            // unlike lock xchg --- keeps lock in Shared state!
    jne acquire              // try again (still locked)
    // lock possibly free
    // but another processor might lock
    // before we get a chance to
    // ... so try with atomic swap:
    movl $1, %eax             // %eax <- 1
    lock xchg %eax, the_lock  // swap %eax and the_lock
           // sets the_lock to 1
           // sets %eax to prior value of the_lock
    test %eax, %eax           // if the_lock wasn't 0 (someone else got it first):
    jne acquire               //   try again
    ret
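
The same test-and-test-and-set loop can be sketched with C11 atomics. The function names `ttas_acquire` and `ttas_release` are illustrative, and the memory orderings are one reasonable choice rather than the only one.

```c
#include <stdatomic.h>

/* the_lock: 1 means taken, 0 means free */
void ttas_acquire(atomic_int *the_lock) {
    for (;;) {
        /* "test": read-only spin, so the cache block can stay shared
           between the waiting processors */
        while (atomic_load_explicit(the_lock, memory_order_relaxed) != 0) {
            /* still locked */
        }
        /* "test-and-set": lock looked free, but another processor
           might take it first, so check the old value */
        if (atomic_exchange_explicit(the_lock, 1,
                                     memory_order_acquire) == 0) {
            return;  /* we got the lock */
        }
    }
}

void ttas_release(atomic_int *the_lock) {
    atomic_store_explicit(the_lock, 0, memory_order_release);
}
```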

less ping-ponging

couldn’t the read-modify-write instruction

  • notice that the value of the lock isn’t changing…
  • and keep it in the shared state
  • maybe — but extra step in ‘‘common’’ case
    (swapping different values)

more room for improvement?

  • can still have a lot of attempts to modify locks after unlocked

  • there are other spinlock designs that avoid this

    • ticket locks
    • MCS locks

misc Linux lock stuff

Linux futexes

  • futex = fast userspace mutex
  • goal: implement waiting like ‘proper’ mutexes, but…
  • don’t enter kernel mode most of the time
  • challenge: can’t acquire lock to call scheduler from user mode

futex operations

futex(&lock_value, FUTEX_WAIT, expected_value, ...);
  • check if lock_value is expected_value

    • if not — return immediately
    • otherwise, sleep until futex(…, FUTEX_WAKE, …) is called
futex(&lock_value, FUTEX_WAKE, num_processes);
  • wake up to num_processes threads that called FUTEX_WAIT
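
glibc has traditionally not provided a `futex()` wrapper function, so user code usually goes through `syscall()`. The sketch below shows one way the two operations might be wrapped on Linux. The helper names `futex_wait` and `futex_wake` are made up, and the unused timeout and second-address arguments are passed as NULL/0.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <linux/futex.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Sleep only if *addr still equals expected.
   Returns 0 when woken by FUTEX_WAKE; returns -1 with errno == EAGAIN
   if the value had already changed (other errors such as EINTR are
   also possible). */
long futex_wait(uint32_t *addr, uint32_t expected) {
    return syscall(SYS_futex, addr, FUTEX_WAIT, expected, NULL, NULL, 0);
}

/* Wake up to count threads sleeping on addr.
   Returns the number of threads actually woken. */
long futex_wake(uint32_t *addr, int count) {
    return syscall(SYS_futex, addr, FUTEX_WAKE, count, NULL, NULL, 0);
}
```

The value check and the decision to sleep happen atomically inside the kernel, which is what lets a caller avoid sleeping after the lock has already changed state.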

mutexes with futexes

int lock_value; // UNLOCKED or LOCKED_NO_WAITERS or LOCKED_WAITERS
Lock() {
retry:
    if (CompareAndSwap(&lock_value, UNLOCKED, LOCKED_NO_WAITERS) == SET) {
        /* acquired lock */
        return;
    } else if (CompareAndSwap(&lock_value, LOCKED_NO_WAITERS, LOCKED_WAITERS) == SET) {
        futex(&lock_value, FUTEX_WAIT, LOCKED_WAITERS, ...);
    }
    goto retry;
}
Unlock() {
    if (CompareAndSwap(&lock_value, LOCKED_NO_WAITERS, UNLOCKED) == SET) {
        return;
    } else {
        lock_value = UNLOCKED;
        futex(&lock_value, FUTEX_WAKE, 1, ...);
    }
}
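
A runnable Linux sketch of the same three-state idea follows. It is a variant, not a transcription of the code above: it follows the commonly used pattern in which a thread that may have slept always takes the lock in the LOCKED_WAITERS state, so that other sleeping threads are not forgotten by a later unlock. The names `struct futex_mutex`, `futex_op`, `futex_mutex_lock` and `futex_mutex_unlock` are illustrative, and the code assumes `atomic_int` is a plain 32-bit int, as on Linux.

```c
#define _GNU_SOURCE
#include <linux/futex.h>
#include <stdatomic.h>
#include <stddef.h>
#include <sys/syscall.h>
#include <unistd.h>

enum { UNLOCKED = 0, LOCKED_NO_WAITERS = 1, LOCKED_WAITERS = 2 };

struct futex_mutex { atomic_int value; };

static long futex_op(atomic_int *addr, int op, int val) {
    return syscall(SYS_futex, addr, op, val, NULL, NULL, 0);
}

void futex_mutex_lock(struct futex_mutex *m) {
    int expected = UNLOCKED;
    /* fast path: uncontended, no system call */
    if (atomic_compare_exchange_strong(&m->value, &expected,
                                       LOCKED_NO_WAITERS))
        return;
    /* slow path: mark the lock as having waiters; if it was not free,
       sleep until the value changes, then try again */
    while (atomic_exchange(&m->value, LOCKED_WAITERS) != UNLOCKED)
        futex_op(&m->value, FUTEX_WAIT, LOCKED_WAITERS);
}

void futex_mutex_unlock(struct futex_mutex *m) {
    /* only enter the kernel if some thread may be sleeping */
    if (atomic_exchange(&m->value, UNLOCKED) == LOCKED_WAITERS)
        futex_op(&m->value, FUTEX_WAKE, 1);
}
```

In the uncontended case neither function makes a system call, which is the point of the design; the cost is an occasional unnecessary FUTEX_WAKE when the waiter state was set conservatively.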

implementing futex_wait

  • hashtable: address \(\rightarrow\) queue of waiting threads


  • use hashtable to look-up queue

  • lock queue

  • check value hasn’t changed

    • if it has changed: abort, releasing lock
  • add thread to queue

  • set thread as WAITING (not runnable)

  • unlock queue

  • call scheduler


  • when woken up: queue was used (by FUTEX_WAKE) to find thread and set it RUNNABLE

fairer spinlocks

  • so far: everything built on spinlocks

    • mutexes, condition variables — built with spinlocks
  • spinlocks are pretty ‘unfair’

    • where fair = get lock if waiting longest
  • last CPU that held spinlock more likely to get it again

    • already has the lock in its cache…
  • but there are many other ways to implement spinlocks…

ticket spinlocks

unsigned int serving_number;
unsigned int next_number;

Lock() {
    // "take a number"
    unsigned int my_number = atomic_read_and_increment(&next_number);
    // wait until "now serving" that number
    while (atomic_read(&serving_number) != my_number) {
        /* do nothing */
    }
    // MISSING: code to prevent reordering reads/writes
}

Unlock() {
    // serve next number
    serving_number += 1;
    // MISSING: code to prevent reordering reads/writes
}
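
One way to fill in the missing ordering is with C11 acquire/release atomics. The sketch below is illustrative: `struct ticket_lock`, `ticket_acquire` and `ticket_release` are made-up names, and returning the ticket number is only there to make the behavior easy to observe.

```c
#include <stdatomic.h>

struct ticket_lock {
    atomic_uint next_number;     /* next ticket to hand out */
    atomic_uint serving_number;  /* ticket currently allowed in */
};

/* Returns the ticket number that was taken. */
unsigned ticket_acquire(struct ticket_lock *l) {
    /* "take a number" */
    unsigned my_number =
        atomic_fetch_add_explicit(&l->next_number, 1, memory_order_relaxed);
    /* wait until "now serving" that number; acquire ordering keeps the
       critical section's reads/writes from moving before this point */
    while (atomic_load_explicit(&l->serving_number,
                                memory_order_acquire) != my_number) {
        /* do nothing */
    }
    return my_number;
}

void ticket_release(struct ticket_lock *l) {
    /* only the lock holder writes serving_number, so load + store is
       enough; release ordering publishes the critical section's writes */
    unsigned current =
        atomic_load_explicit(&l->serving_number, memory_order_relaxed);
    atomic_store_explicit(&l->serving_number, current + 1,
                          memory_order_release);
}
```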

ticket spinlocks and cache contention

  • still have contention to write next_number

  • … but no retrying writes!

    • should limit ‘ping-ponging’?
  • threads loop performing a read repeatedly while waiting

    • value will be broadcast to all processors
    • ‘free’ if using a bus
    • not-so-free if another way of connecting CPUs

beyond ticket spinlocks

  • Linux kernel used to use ticket spinlocks

  • now uses variant of MCS spinlocks — locks have linked-list queue!

    • careful use of atomic operations to modify queue
  • still try

  • goal: even less contention

    • unlocking value doesn’t require broadcasting to all CPUs
    • each processor waits on its own cache block
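
To make the queue idea concrete, here is a sketch of a classic MCS lock in C11. It is the textbook form, not the variant Linux actually uses, and all names (`struct mcs_node`, `struct mcs_lock`, `mcs_acquire`, `mcs_release`) are illustrative. Each waiter supplies its own node and spins only on that node's `locked` flag.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

struct mcs_node {
    _Atomic(struct mcs_node *) next;  /* node queued after this one */
    atomic_bool locked;               /* true while this waiter must spin */
};

struct mcs_lock {
    _Atomic(struct mcs_node *) tail;  /* last node in queue, NULL if free */
};

void mcs_acquire(struct mcs_lock *l, struct mcs_node *me) {
    atomic_store(&me->next, NULL);
    atomic_store(&me->locked, true);
    /* atomically append ourselves to the queue */
    struct mcs_node *prev = atomic_exchange(&l->tail, me);
    if (prev != NULL) {
        atomic_store(&prev->next, me);      /* link behind predecessor */
        while (atomic_load(&me->locked)) {  /* spin on our own node only */
        }
    }
}

void mcs_release(struct mcs_lock *l, struct mcs_node *me) {
    struct mcs_node *succ = atomic_load(&me->next);
    if (succ == NULL) {
        /* no visible successor: try to mark the lock free */
        struct mcs_node *expected = me;
        if (atomic_compare_exchange_strong(&l->tail, &expected, NULL))
            return;
        /* someone is appending right now: wait for the link */
        while ((succ = atomic_load(&me->next)) == NULL) {
        }
    }
    atomic_store(&succ->locked, false);     /* hand the lock over */
}
```

Releasing writes only to the successor's node, which is why unlocking does not need to reach every waiting processor.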

Backup slides

xv6 interrupt disabling (1)

...
acquire(struct spinlock *lk) {
  pushcli(); // disable interrupts to avoid deadlock
  ... /* this part basically just for multicore */
}
release(struct spinlock *lk)
{
  ... /* this part basically just for multicore */
  popcli();
}

xv6 push/popcli

  • pushcli / popcli — need to be in pairs

  • pushcli — disable interrupts if not already

  • popcli — enable interrupts if corresponding pushcli disabled them

    • don’t enable them if they were already disabled
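
The pairing rule can be tried out in user space with a simulated interrupt flag. The sketch below is only a model: the real `cli`/`sti` instructions are privileged, the variable and function names are made up, and `disable_depth` / `enabled_before` play the roles of xv6's per-CPU `ncli` / `intena`.

```c
#include <assert.h>
#include <stdbool.h>

/* Simulated per-CPU state (hypothetical names). */
static bool interrupts_enabled = true;  /* stands in for the IF flag */
static int  disable_depth;              /* like xv6's ncli */
static bool enabled_before;             /* like xv6's intena */

void push_disable(void) {
    bool was_enabled = interrupts_enabled;
    interrupts_enabled = false;         /* "cli" */
    if (disable_depth == 0)
        enabled_before = was_enabled;   /* remember state at outermost push */
    disable_depth += 1;
}

void pop_disable(void) {
    assert(!interrupts_enabled);        /* must still be disabled here */
    assert(disable_depth > 0);          /* pops must match pushes */
    disable_depth -= 1;
    if (disable_depth == 0 && enabled_before)
        interrupts_enabled = true;      /* "sti" only at the outermost pop */
}

bool interrupts_are_enabled(void) {
    return interrupts_enabled;
}
```

With nested locks (push, push, pop, pop), the simulated flag stays disabled after the inner pop and is restored only by the outer one, which is the behavior the naive enable/disable version lacks.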

xv6 interrupt disabling: detail (3)

pushcli(void)
{
  int eflags;
  eflags = readeflags();
  cli();
  if (mycpu()->ncli == 0)
    mycpu()->intena = eflags & FL_IF;
  mycpu()->ncli += 1;
}

popcli(void)
{
  if(readeflags()&FL_IF)
    panic("popcli - interruptible");
  if(--mycpu()->ncli < 0)
    panic("popcli");
  if(mycpu()->ncli == 0 && mycpu()->intena)
    sti();
}

xv6 spinlock: acquire

void
acquire(struct spinlock *lk)
{
  pushcli(); // disable interrupts to avoid deadlock.
  ...
  // The xchg is atomic.
  while(xchg(&lk->locked, 1) != 0)
    ; 

  // Tell the C compiler and the processor to not move loads or stores
  // past this point, to ensure that the critical section's memory
  // references happen after the lock is acquired.
  __sync_synchronize();
  ...
}

xv6 spinlock: release

void
release(struct spinlock *lk)
{
  ...
  // Tell the C compiler and the processor to not move loads or stores
  // past this point, to ensure that all the stores in the critical
  // section are visible to other cores before the lock is released.
  // Both the C compiler and the hardware may re-order loads and
  // stores; __sync_synchronize() tells them both not to.
  __sync_synchronize();

  // Release the lock, equivalent to lk->locked = 0.
  // This code can't use a C assignment, since it might
  // not be atomic. A real OS would use C atomics here.
  asm volatile("movl $0, %0" : "+m" (lk->locked) : );

  popcli();
}

xv6 spinlock: debugging stuff

void acquire(struct spinlock *lk) {
  ...
  if(holding(lk))
    panic("acquire");
  ...
  // Record info about lock acquisition for debugging.
  lk->cpu = mycpu();
  getcallerpcs(&lk, lk->pcs);
}
void release(struct spinlock *lk) {
  if(!holding(lk))
    panic("release");

  lk->pcs[0] = 0;
  lk->cpu = 0;
  ...
}