
The Lock You Didn't Know You Were Taking

mutex.lock() Put Your Thread to Sleep. Here's What Happened Between Consciousness and Oblivion.

Reading time: ~15 minutes


You wrote mutex.lock(). The thread that was running your code is now gone — parked in a kernel wait queue, not consuming CPU, invisible to top. What happened between your function call and unconsciousness?

I spent a week in 2019 staring at a deadlock in a Rust service that only reproduced under load, on Tuesdays, when the monitoring goroutine in the sidecar happened to hold a file lock at the exact wrong moment. The fix was two lines. The understanding took a week. That week changed how I think about every synchronization primitive I reach for.

Here's the machinery underneath all of them.


The Foundation: Compare-and-Swap

Before we touch mutexes or wait queues, we need the CPU instruction that makes all of this possible.

Compare-and-swap (CAS) is a single atomic operation. On x86, it's one instruction: CMPXCHG. On ARM, it's a load-exclusive/store-exclusive pair — LDXR/STXR — that together implement the same effect. Either way, the semantics are: read a memory location, compare it to an expected value, and if it matches, write a new value — all as one indivisible operation that no other core can interleave with.

In C (GCC builtins):

// This is what __sync_bool_compare_and_swap compiles to:
// LOCK CMPXCHG [mem], new_val
// The LOCK prefix makes the read-modify-write atomic across cores.
bool acquired = __sync_bool_compare_and_swap(&lock_word, 0, 1);

In Rust:

use std::sync::atomic::{AtomicU32, Ordering};

let lock = AtomicU32::new(0);
// Attempts to atomically set 0 → 1.
// Returns Ok(0) on success, Err(current) on failure.
let result = lock.compare_exchange(0, 1, Ordering::Acquire, Ordering::Relaxed);

In Go:

import "sync/atomic"

var lock int32 = 0
// Atomically: if lock == 0, set lock = 1, return true.
swapped := atomic.CompareAndSwapInt32(&lock, 0, 1)

That LOCK prefix on x86 is doing serious work. It coordinates with the cache coherency protocol (MESI) to ensure that only one core can modify that cache line at a time. Before Intel's Pentium Pro (1995), the LOCK prefix literally locked the entire memory bus 😱 — every other processor stalled. Modern CPUs are smarter: they lock only the relevant cache line. But it's still not free. Under heavy contention, cores bounce that cache line back and forth in a pattern called cache line ping-pong, and throughput collapses.

CAS is the atom. Everything else — mutexes, semaphores, channels, lock-free queues — is molecules built from it.

CAS operation — Core 0 succeeds, Core 1 fails and retries
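To make "molecules from the atom" concrete, here's a sketch of how a compound atomic operation — atomically doubling a counter, something no CPU offers as a single instruction — is built from a CAS retry loop. Rust here, but the pattern is identical in any of the three languages above; `atomic_double` is a made-up illustrative function, not a std API.

```rust
use std::sync::atomic::{AtomicU64, Ordering};

// Atomically double a counter using a CAS retry loop. This is the same
// shape std uses internally for fetch-style read-modify-write operations.
fn atomic_double(counter: &AtomicU64) -> u64 {
    let mut current = counter.load(Ordering::Relaxed);
    loop {
        let new = current * 2;
        match counter.compare_exchange_weak(current, new, Ordering::AcqRel, Ordering::Relaxed) {
            Ok(_) => return new,
            // Another core changed the value between our load and our CAS:
            // take the value it observed and retry.
            Err(observed) => current = observed,
        }
    }
}

fn main() {
    let c = AtomicU64::new(21);
    println!("{}", atomic_double(&c)); // 42
}
```

The retry loop is the whole trick: you compute the new value optimistically, and the CAS tells you whether the world changed under you.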


The Futex: Linux's Clever Trick

Here's the problem with mutexes. If you implemented one purely in the kernel — every lock and unlock is a syscall — you'd pay syscall overhead on every acquisition, even when there's no contention. And most mutex acquisitions are uncontended. Your thread grabs the lock, does its work, releases it, and nobody else was waiting.

The futex (Fast Userspace muTeX) is the solution Linux introduced in 2002. The insight: keep the lock word in userspace memory. The fast path — no contention — never enters the kernel at all. You do a CAS in userspace, it succeeds, you're done. One atomic instruction, no syscall, no context switch.

The kernel only gets involved when there's actual contention.

Here's the two-path design:

Fast path (no contention): Your thread does CAS(&lock_word, 0, 1). It succeeds. You hold the lock. Cost: one atomic instruction, maybe 20 nanoseconds.

Slow path (contention): Your thread does CAS(&lock_word, 0, 1). It fails — someone else holds the lock. Now you call futex(FUTEX_WAIT, &lock_word, 1). This is a syscall. The kernel checks that the lock word is still 1 (to avoid a race where it was released between your CAS and your syscall), then parks your thread in a wait queue keyed by the address of the lock word. Your thread is now suspended. The scheduler removes it from the run queue. It consumes zero CPU.

When the lock holder releases: it sets the lock word back to 0, then calls futex(FUTEX_WAKE, &lock_word, 1) to wake one waiter. The kernel pulls a thread off the wait queue and marks it runnable.

Futex fast path vs slow path — userspace CAS vs kernel wait queue

The elegance here is real. In the common case — no contention — a mutex acquisition is faster than a function call in many languages. You're doing one CAS and a branch. The kernel doesn't know and doesn't care.

I've profiled this on my own machines — take the exact numbers with a grain of salt, but the ratio holds everywhere I've measured it. On a 2023 AMD Zen 4 box, an uncontended pthread_mutex_lock took around 18ns in my rough measurements. A contended one that actually sleeps and wakes cost 3-10 microseconds — roughly three orders of magnitude more. That ratio is why the futex design matters so much.


Mutexes: The Full Picture

What pthread_mutex_lock actually does in glibc (as of 2.38):

  1. Try CAS on the lock word: 0 → 1. If it works, you're done.
  2. If the lock word is already 1, set it to 2 (meaning "locked with waiters"). Adaptive mutexes (PTHREAD_MUTEX_ADAPTIVE_NP) first try a short spin — glibc spins for a tunable number of iterations, a heuristic that has shifted across versions but is in the ballpark of 100 on a modern multicore box — issuing pause instructions to be polite to the core's pipeline. The default mutex type skips the spin and goes straight to the futex.
  3. If spinning doesn't acquire the lock, call futex(FUTEX_WAIT, &lock_word, 2). The thread sleeps.
  4. When woken, loop back and try CAS again. You might not get the lock — another running thread might have barged in and grabbed it first, and FUTEX_WAIT can also return early for internal reasons. The wakeup is a hint, not a handoff, which is why lock acquisition always happens in a loop.

On unlock: set lock word to 0. If the old value was 2 (meaning someone was waiting), call futex(FUTEX_WAKE, 1).

The "adaptive spin" in step 2 is a pragmatic compromise. If the lock holder is running on another core right now, it might release the lock in a few hundred nanoseconds. Sleeping and waking would cost microseconds. A brief spin saves you the context switch. But if you're on a single-core machine, spinning is pure waste — the lock holder can't run until you stop. Glibc checks for this.
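Here's the whole state machine as a runnable sketch. std doesn't expose futex(2), so FUTEX_WAIT/FUTEX_WAKE are stood in by a Condvar-backed parking lot — `FutexishLock` is my made-up name, and only the three-state protocol on the lock word mirrors the real glibc design; don't mistake the parking details for glibc's.

```rust
use std::sync::atomic::{AtomicU32, AtomicU64, Ordering};
use std::sync::{Arc, Condvar, Mutex};
use std::thread;

struct FutexishLock {
    word: AtomicU32,    // 0 = unlocked, 1 = locked, 2 = locked with waiters
    parking: Mutex<()>, // stand-in for the kernel's wait-queue lock
    wakeup: Condvar,    // stand-in for FUTEX_WAIT / FUTEX_WAKE
}

impl FutexishLock {
    fn new() -> Self {
        Self { word: AtomicU32::new(0), parking: Mutex::new(()), wakeup: Condvar::new() }
    }

    fn lock(&self) {
        // Fast path: CAS 0 -> 1. One atomic op, no "kernel" involvement.
        if self.word.compare_exchange(0, 1, Ordering::Acquire, Ordering::Relaxed).is_ok() {
            return;
        }
        // Slow path: advertise a waiter (state 2), then sleep. The wakeup
        // is only a hint, so we loop until our swap observes 0.
        while self.word.swap(2, Ordering::Acquire) != 0 {
            let guard = self.parking.lock().unwrap();
            // FUTEX_WAIT re-checks the word before sleeping, closing the
            // race where the lock was released just before the "syscall".
            if self.word.load(Ordering::Relaxed) == 2 {
                drop(self.wakeup.wait(guard).unwrap());
            }
        }
    }

    fn unlock(&self) {
        // If the old value was 2, someone may be parked: wake one waiter.
        if self.word.swap(0, Ordering::Release) == 2 {
            let _queue = self.parking.lock().unwrap();
            self.wakeup.notify_one();
        }
    }
}

fn main() {
    let lock = Arc::new(FutexishLock::new());
    let counter = Arc::new(AtomicU64::new(0));
    let handles: Vec<_> = (0..4)
        .map(|_| {
            let (l, c) = (Arc::clone(&lock), Arc::clone(&counter));
            thread::spawn(move || {
                for _ in 0..10_000 {
                    l.lock();
                    // A deliberately non-atomic read-modify-write: only the
                    // lock's mutual exclusion keeps it from losing updates.
                    let v = c.load(Ordering::Relaxed);
                    c.store(v + 1, Ordering::Relaxed);
                    l.unlock();
                }
            })
        })
        .collect();
    for h in handles { h.join().unwrap(); }
    println!("{}", counter.load(Ordering::Relaxed)); // 40000
}
```

Note that unlock after a slow-path acquisition always wakes one waiter even if nobody is left — the same harmless over-notification the real three-state design accepts in exchange for a cheap unlock.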


Spinlocks: When Sleeping Is Too Expensive

A spinlock is a mutex without the sleeping part. If the CAS fails, you loop and try again. And again. And again. Burning CPU the entire time.

In C:

// Simplified spinlock — real implementations use backoff
while (!__sync_bool_compare_and_swap(&lock, 0, 1)) {
    __builtin_ia32_pause();  // saves power, avoids memory order violations
}
// critical section
__sync_lock_release(&lock);

In Rust:

// Don't actually write this — use std::sync::Mutex or parking_lot
while lock.compare_exchange_weak(0, 1, Ordering::Acquire, Ordering::Relaxed).is_err() {
    std::hint::spin_loop();  // emits PAUSE on x86
}

This sounds terrible. Why would anyone want this?

The Linux kernel uses spinlocks everywhere. The reason: the kernel can't always sleep. Interrupt handlers, for instance, must never block — there's no thread to suspend, you're running in interrupt context. If you need to protect data that an interrupt handler accesses, you spinlock. The critical section had better be short — microseconds, not milliseconds.

In userspace, spinlocks are almost never the right choice. You're burning a CPU core doing nothing useful, and unlike the kernel, you don't control the scheduler. If your thread gets preempted while holding a spinlock, every other thread waiting for that lock spins for the entire rest of your timeslice. Brutal.

The parking_lot crate in Rust (and Go's sync.Mutex internally) uses a hybrid approach: spin briefly, then fall back to sleeping. Best of both worlds, most of the time.
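The hybrid shape can be sketched in a few lines. This illustrates the idea, not parking_lot's or Go's actual implementation: the spin count of 100 is an arbitrary illustrative value, and `yield_now` is a crude stand-in for the futex wait a real implementation would use.

```rust
use std::sync::atomic::{AtomicU32, Ordering};
use std::thread;

// Lock word: 0 = free, 1 = held.
fn acquire(lock: &AtomicU32) {
    // Phase 1: optimistic spin — if the holder releases within a few
    // hundred nanoseconds, we avoid a context switch entirely.
    for _ in 0..100 {
        if lock.compare_exchange_weak(0, 1, Ordering::Acquire, Ordering::Relaxed).is_ok() {
            return;
        }
        std::hint::spin_loop(); // PAUSE on x86
    }
    // Phase 2: stop burning the core. A real implementation parks on a
    // futex here; yielding the timeslice is the crude stand-in.
    while lock.compare_exchange_weak(0, 1, Ordering::Acquire, Ordering::Relaxed).is_err() {
        thread::yield_now();
    }
}

fn release(lock: &AtomicU32) {
    lock.store(0, Ordering::Release);
}

fn main() {
    let lock = AtomicU32::new(0);
    acquire(&lock);
    // critical section
    release(&lock);
    println!("{}", lock.load(Ordering::Relaxed)); // 0
}
```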


Read-Write Locks: The Starvation Trap

The pitch is simple: readers don't modify data, so multiple readers can hold the lock simultaneously. Only writers need exclusive access. For read-heavy workloads, this should be a massive win.

And it is. Until it isn't.

The classic problem: writer starvation. If readers keep arriving, a waiting writer never gets in. The read lock is always held by someone. The writer waits forever.

Most implementations solve this by blocking new readers once a writer is waiting. POSIX pthread_rwlock has the PTHREAD_RWLOCK_PREFER_WRITER_NONRECURSIVE_NP attribute (beautiful name, isn't it). In practice, I've found that read-write locks buy you less than you'd expect. The bookkeeping for tracking multiple readers adds overhead, and unless your read-to-write ratio is 10:1 or higher and your critical sections are non-trivial, a plain mutex is faster.

I benchmarked this once for a specific caching layer on a specific box and my RwLock version was slower than Mutex until the read ratio climbed into the high 90s. The cache line bouncing from updating the reader count ate the parallelism gains. Don't take the number as a rule — your crossover point will depend on the critical section size, the core count, and the memory topology — but the shape of the result is real: RwLock is not automatically faster than Mutex just because your workload is read-heavy.
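For reference, the basic shape with Rust's std::sync::RwLock — a tiny example can show the API but not the contention behavior, so treat it as a syntax illustration only:

```rust
use std::sync::RwLock;
use std::thread;

fn main() {
    let table = RwLock::new(vec![1, 2, 3]);

    // Multiple reader threads can hold the lock simultaneously —
    // read() does not block read().
    thread::scope(|s| {
        for _ in 0..2 {
            s.spawn(|| {
                let r = table.read().unwrap();
                assert_eq!(r.len(), 3);
            });
        }
    });

    // A writer needs exclusive access: write() blocks until all
    // read guards are dropped.
    table.write().unwrap().push(4);
    println!("{}", table.read().unwrap().len()); // 4
}
```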


Condition Variables: The Spurious Wakeup Problem

A mutex lets you wait for a lock. A condition variable lets you wait for an arbitrary condition while holding a lock.

The pattern is always the same:

In Python:

with condition:
    while not ready:        # MUST be a while loop, not if
        condition.wait()    # releases lock, sleeps, re-acquires on wake
    # condition is true AND we hold the lock

In Rust:

let mut guard = mutex.lock().unwrap();
while !*guard {
    guard = cond.wait(guard).unwrap(); // releases + re-acquires
}
// guard is true AND we hold the lock

In Go:

cond.L.Lock()
for !ready {
    cond.Wait() // releases lock, sleeps, re-acquires on wake
}
// ready is true AND we hold the lock
cond.L.Unlock()

Why the while loop instead of if? Spurious wakeups. The POSIX spec explicitly permits them — your thread can be woken even though nobody called signal() or broadcast(). On Linux, this happens because the futex implementation can wake threads for internal reasons (signal delivery, for instance — see post 05 on signals). Relying on if instead of while is a bug that only manifests under load, in production, at 3am.

Under the hood, pthread_cond_wait atomically releases the mutex and calls futex(FUTEX_WAIT) on an internal sequence counter. pthread_cond_signal increments the counter and calls futex(FUTEX_WAKE, 1). The atomicity of the release-and-sleep is critical — without it, you'd have a window where the condition could change after you release the lock but before you sleep, and you'd miss the notification. This is called the lost wakeup problem, and it's the reason condition variables exist as a distinct primitive rather than being a mutex plus a flag.
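Here's the Rust fragment above as a complete program — one thread flips the flag under the mutex and signals; the other waits in the mandatory loop, immune to spurious wakeups because it rechecks the condition every time:

```rust
use std::sync::{Arc, Condvar, Mutex};
use std::thread;

fn main() {
    let pair = Arc::new((Mutex::new(false), Condvar::new()));
    let pair2 = Arc::clone(&pair);

    let waiter = thread::spawn(move || {
        let (lock, cvar) = &*pair2;
        let mut ready = lock.lock().unwrap();
        while !*ready {
            // Atomically releases the mutex and sleeps; re-acquires on wake.
            ready = cvar.wait(ready).unwrap();
        }
        42 // condition is true AND we hold the lock
    });

    {
        let (lock, cvar) = &*pair;
        *lock.lock().unwrap() = true; // change state under the mutex...
        cvar.notify_one();            // ...then signal
    }
    println!("{}", waiter.join().unwrap()); // 42
}
```

Changing the state while holding the mutex is what prevents the lost wakeup: the waiter can't be between its check and its sleep when the flag flips, because the flip needs the lock the waiter is holding.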


Semaphores, Barriers, and the Rest of the Family

Counting semaphores are a generalization of mutexes. Instead of 0-or-1, the counter goes from 0 to N. Each wait() decrements it; each post() increments it. When the counter hits 0, the next wait() blocks. They're useful for resource pools — "I have 5 database connections, hand them out and block when they're gone."
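Rust's std has no semaphore type, but the counting semaphore just described falls out of a mutex plus a condvar in a dozen lines — this `Semaphore` is the textbook construction, a sketch rather than a production implementation:

```rust
use std::sync::{Condvar, Mutex};

struct Semaphore {
    count: Mutex<usize>, // remaining permits
    available: Condvar,
}

impl Semaphore {
    fn new(n: usize) -> Self {
        Self { count: Mutex::new(n), available: Condvar::new() }
    }

    fn wait(&self) {
        let mut count = self.count.lock().unwrap();
        while *count == 0 { // while, not if — spurious wakeups
            count = self.available.wait(count).unwrap();
        }
        *count -= 1;
    }

    fn post(&self) {
        *self.count.lock().unwrap() += 1;
        self.available.notify_one();
    }
}

fn main() {
    let pool = Semaphore::new(2); // e.g. 2 database connections
    pool.wait();                  // take one
    pool.wait();                  // take the other
    pool.post();                  // return one
    pool.wait();                  // take it again — doesn't block
    println!("{}", *pool.count.lock().unwrap()); // 0 permits left
}
```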

I'm glossing over the dining philosophers. You've seen it. Five forks, five philosophers, circular wait, deadlock. It's in every OS textbook. If you haven't seen it, Tanenbaum's explanation is the canonical one.

Barriers are the simplest primitive conceptually: N threads arrive, they all block, and when the Nth thread arrives, everyone wakes up and proceeds. Think of it as a synchronization checkpoint. GPU programming uses this constantly — you need all shader invocations to finish phase 1 before any can start phase 2.


Memory Ordering: The Part Everyone Gets Wrong

Here's where it gets genuinely hard. Mutexes aren't only about preventing two threads from entering the critical section simultaneously. They're about visibility.

Modern CPUs execute instructions out of order — that innocent-looking list of imperative commands can be executed by the hardware in whatever order preserves single-threaded results. Stores can be reordered, buffered in store buffers, made visible to other cores at different times. The speculation and pipeline machinery from post 11 means that what you wrote in source code is not what the CPU executes. Without synchronization, thread A's writes might be invisible to thread B even after thread A has "finished."

// Thread A
data = 42;
ready = true;

// Thread B
while (!ready) { /* spin */ }
printf("%d\n", data);  // might print 0!

On x86, this specific example usually works because x86 has a relatively strong memory model (Total Store Order). On ARM or RISC-V, it's broken. The store to data can be reordered after the store to ready from thread B's perspective.

A mutex acquire is a memory barrier (or fence). It guarantees that all writes performed by the previous lock holder are visible to the new lock holder. This is what Ordering::Acquire and Ordering::Release mean in Rust's atomic API — they're the load and store halves of a memory barrier. Release says "no memory operation before this store may be reordered after it" — everything the thread did is published along with the store. Acquire says "no memory operation after this load may be reordered before it" — so once you observe the Release store, you also see everything published before it.
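The broken two-thread example above, repaired with explicit orderings — a small Rust sketch where the Release store to the flag publishes the preceding store to the data, and the Acquire load on the other side is guaranteed to see it:

```rust
use std::sync::atomic::{AtomicBool, AtomicU32, Ordering};
use std::thread;

static DATA: AtomicU32 = AtomicU32::new(0);
static READY: AtomicBool = AtomicBool::new(false);

fn main() {
    let writer = thread::spawn(|| {
        DATA.store(42, Ordering::Relaxed);
        // Release: everything before this store is published with it.
        READY.store(true, Ordering::Release);
    });
    let reader = thread::spawn(|| {
        // Acquire: once we observe the flag, we see what preceded it.
        while !READY.load(Ordering::Acquire) {
            std::hint::spin_loop();
        }
        DATA.load(Ordering::Relaxed) // guaranteed to be 42, on every ISA
    });
    writer.join().unwrap();
    println!("{}", reader.join().unwrap()); // 42
}
```

On x86 the generated code barely changes (TSO gives you most of this for free); on ARM the Release/Acquire pair is what inserts the barriers that make it correct.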

volatile in C and C++ does not provide memory ordering guarantees. It prevents the compiler from optimizing away loads and stores, but it says nothing about hardware reordering. If you've been using volatile for thread synchronization, you have bugs. Java's volatile is different — it actually provides acquire/release semantics — but C/C++ volatile is for memory-mapped I/O, not threads.

Memory ordering — without barriers vs with acquire/release


Lock-Free Data Structures: Harder Than You Think

If CAS is the foundation, can we build data structures without locks at all? Yes. Should you? Almost certainly not.

The simplest lock-free structure is a stack. Push: read the current head, create a new node pointing to it, CAS the head pointer. If the CAS fails, someone else pushed first — retry.
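The push/pop CAS loops look like this — a Treiber-stack sketch in Rust. Note what it dodges: popped nodes are deliberately leaked, because safe memory reclamation is exactly the hard part, and freeing here would invite use-after-free under concurrency. (A concurrent user would also need an `unsafe impl Sync`; the demo below is single-threaded.)

```rust
use std::ptr;
use std::sync::atomic::{AtomicPtr, Ordering};

struct Node {
    value: u64,
    next: *mut Node,
}

struct Stack {
    head: AtomicPtr<Node>,
}

impl Stack {
    fn new() -> Self {
        Self { head: AtomicPtr::new(ptr::null_mut()) }
    }

    fn push(&self, value: u64) {
        let node = Box::into_raw(Box::new(Node { value, next: ptr::null_mut() }));
        loop {
            let head = self.head.load(Ordering::Relaxed);
            unsafe { (*node).next = head };
            // If someone pushed or popped since our load, the CAS fails: retry.
            if self.head.compare_exchange(head, node, Ordering::Release, Ordering::Relaxed).is_ok() {
                return;
            }
        }
    }

    fn pop(&self) -> Option<u64> {
        loop {
            let head = self.head.load(Ordering::Acquire);
            if head.is_null() {
                return None;
            }
            let next = unsafe { (*head).next };
            if self.head.compare_exchange(head, next, Ordering::Acquire, Ordering::Relaxed).is_ok() {
                // Leak `head` instead of freeing it — see the note above.
                return Some(unsafe { (*head).value });
            }
        }
    }
}

fn main() {
    let s = Stack::new();
    s.push(1);
    s.push(2);
    s.push(3);
    println!("{:?} {:?} {:?} {:?}", s.pop(), s.pop(), s.pop(), s.pop());
    // Some(3) Some(2) Some(1) None
}
```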

The problem is the ABA problem. Thread A reads head = node X. Thread A gets preempted. Thread B pops X, pops Y, pushes X back (maybe from a free list). Thread A wakes up, does CAS — head is X, just like it expected. CAS succeeds. But the stack has changed underneath. A is now pointing to a corrupted structure.

Solutions exist (tagged pointers, hazard pointers, epoch-based reclamation), but they're all complex enough that getting them right is a research-paper-level exercise. The crossbeam crate in Rust does this correctly. Writing your own is almost always a mistake.

I tried once, a few years back, for a lock-free work-stealing queue. Long story, but it didn't work and it made me appreciate crossbeam all the more.


Priority Inversion: A Bug on Mars

In 1997, the Mars Pathfinder lander started resetting itself repeatedly on the Martian surface. The engineering team at JPL eventually traced it to priority inversion.

The low-priority meteorological task (ASI/MET) was mid-operation on a mutex inside VxWorks's IPC mechanism — actually in the middle of releasing it — when several medium-priority tasks preempted it. The high-priority bus distribution task needed that same mutex and blocked. The medium-priority tasks didn't need the mutex at all, but they outranked the meteorological task, so it couldn't run to finish releasing the lock. The high-priority task starved. The bus scheduler detected the missed deadline and reset the system.

The fix was priority inheritance: when a high-priority task blocks on a mutex held by a low-priority task, the low-priority task temporarily inherits the higher priority. Linux's PTHREAD_PRIO_INHERIT mutex attribute does this. The rt_mutex in the Linux kernel implements it. It's the kind of mechanism you never think about until a lander 170 million kilometers away starts rebooting.


Deadlock: Four Conditions, One Reality

Deadlock requires all four of these simultaneously:

  1. Mutual exclusion — the resource can't be shared
  2. Hold and wait — you hold one lock while waiting for another
  3. No preemption — locks can't be forcibly taken away
  4. Circular wait — A waits for B, B waits for A

Break any one condition and deadlock becomes impossible. In practice, the most common prevention strategy is lock ordering — always acquire locks in a consistent global order. If every thread acquires lock A before lock B, circular wait can't happen. Simple to state, miserable to enforce across a large codebase.
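One common way to get a global order without a global registry is to order by address: always take the mutex at the lower address first, whichever direction the operation goes. A sketch — `transfer` is a hypothetical helper, and address ordering is one strategy among several (lock levels, static lock hierarchies):

```rust
use std::sync::{Arc, Mutex};
use std::thread;

// Break "circular wait": acquire in address order, remembering whether
// we swapped so the transfer still goes the right way.
fn transfer(from: &Mutex<i64>, to: &Mutex<i64>, amount: i64) {
    let swapped = (from as *const Mutex<i64>) > (to as *const Mutex<i64>);
    let (first, second) = if swapped { (to, from) } else { (from, to) };
    let mut g1 = first.lock().unwrap();
    let mut g2 = second.lock().unwrap();
    let (from_g, to_g) = if swapped { (&mut *g2, &mut *g1) } else { (&mut *g1, &mut *g2) };
    *from_g -= amount;
    *to_g += amount;
}

fn main() {
    let a = Arc::new(Mutex::new(1_000i64));
    let b = Arc::new(Mutex::new(1_000i64));
    // Opposite-direction transfers: naive from-then-to locking can
    // deadlock here; address ordering makes the cycle impossible.
    let (a2, b2) = (Arc::clone(&a), Arc::clone(&b));
    let t1 = thread::spawn(move || {
        for _ in 0..10_000 { transfer(&a2, &b2, 1) }
    });
    let (a3, b3) = (Arc::clone(&a), Arc::clone(&b));
    let t2 = thread::spawn(move || {
        for _ in 0..10_000 { transfer(&b3, &a3, 1) }
    });
    t1.join().unwrap();
    t2.join().unwrap();
    println!("{} {}", *a.lock().unwrap(), *b.lock().unwrap()); // 1000 1000
}
```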

Deadlock cycle and resolution via lock ordering

Databases take a different approach: detection rather than prevention. PostgreSQL lets transactions wait for locks; when a wait drags on past deadlock_timeout (one second by default), it runs a cycle-detection pass over the wait-for graph, and if it finds a cycle it kills one transaction and lets the others proceed. This is why you sometimes see "deadlock detected" errors in Postgres — it's not a bug, it's the database saving you from a bug.

The most insidious deadlocks I've seen aren't the obvious A-then-B-then-A cycles. They're the ones involving implicit locks: a database connection pool lock, a logging mutex, a DNS resolution lock inside an HTTP client. You didn't know you were taking three locks. The code doesn't show three locks. But under the covers, in libraries you didn't write, three locks exist and one day they'll cycle.


When to Use What

Quick framing note: this isn't a "which lock should I use" list. The question is what shape of solution does the problem actually need? Half the items below aren't locks at all. Reaching for Mutex by reflex is the same mistake as reaching for a thread when you need a channel, or reaching for any primitive at all when the real fix is to partition the data so nothing is shared. Pick the shape first, then the primitive.

I'll be opinionated here because I've seen too many codebases reach for the wrong shape:

Mutex: your default. Use it unless you have a specific reason not to. On modern Linux, an uncontended mutex acquisition is on the order of 18 nanoseconds. You are not going to beat that with cleverness.

RwLock: only when your read-to-write ratio is very high (>10:1) and critical sections are non-trivial. Benchmark before committing — I've been burned.

Spinlock: almost never in userspace. If you're in the kernel, or writing a hypervisor, or you have a critical section measured in nanoseconds on a dedicated core — fine. Otherwise, no.

Channels (Go channels, Rust mpsc, Python queue.Queue): when you can restructure the problem as message passing instead of shared state. Almost always what I try first. Harder to deadlock, easier to reason about. Making this your default forces you to think about data ownership and flow before reaching for locks. When it's not your default, expect your code to be littered with Arc<Mutex<T>>.

Lock-free: when you've profiled and mutex contention is the bottleneck, you've read the papers, and you're using a battle-tested library. Not when you think it'll be faster. It usually isn't.

Nothing: the best synchronization is no synchronization. If you can partition your data so each thread owns its piece, do that.
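The last two items combine naturally: partition the work so each thread owns its piece, and let a channel carry the results. A minimal Rust sketch with std's mpsc — no locks anywhere:

```rust
use std::sync::mpsc;
use std::thread;

fn main() {
    let (tx, rx) = mpsc::channel();
    // Each worker owns its slice of the work; nothing is shared.
    for chunk in [1u64..=50, 51..=100] {
        let tx = tx.clone();
        thread::spawn(move || tx.send(chunk.sum::<u64>()).unwrap());
    }
    drop(tx); // close the channel so rx.iter() terminates
    let total: u64 = rx.iter().sum();
    println!("{total}"); // 5050
}
```

The ownership transfer is the point: each partial sum is moved through the channel, so there's no moment when two threads can touch the same data.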

Decision flowchart — choosing the right synchronization primitive




I'm writing a book about what makes developers irreplaceable in the age of AI. Join the early access list →


Naz Quadri once debugged a deadlock that only reproduced on Tuesdays and still doesn't fully trust mutexes. He blogs at nazquadri.dev. Rabbit holes all the way down 🐇🕳️.