
The Theorem That Decides Who Gets Lied To

CAP, or Why Distributed Systems Must Break Promises

Reading time: ~17 minutes


You have two database replicas. A network cable gets cut between them. A write comes in.

Do you accept it — and risk the two replicas disagreeing about reality? Or do you reject it — and tell a perfectly valid client that their perfectly valid request can't be served?

You can't do both. It isn't a design choice. It's a mathematical proof. Eric Brewer stood at a podium at the ACM Symposium on Principles of Distributed Computing in July 2000 and presented it as a conjecture. Two years later, Seth Gilbert and Nancy Lynch at MIT proved it as a theorem. And developers have been getting it wrong ever since.


The Three Letters

Before I can explain why everyone (yours truly included, once) misunderstands CAP, I need to define what C, A, and P actually mean. Not the hand-wavy conference-talk version. The formal version, because the hand-wavy version is where all the confusion starts. Strap in.

Consistency means every read returns the most recent write. Not eventually. Not probably. If I write balance = 500 to node A, and you immediately read from node B, you get 500. This is linearizability — the system behaves as if there's a single copy of the data, even though there are multiple replicas. It's the strongest possible guarantee. If you've ever worked with a single Postgres instance, you had this for free. The moment you added a replica, you started paying for it.
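That balance example can be sketched in a few lines of Python. This is a toy in-memory model (two dicts standing in for replicas, a list standing in for the replication pipeline), not any real database's behavior:

```python
class Cluster:
    """Toy two-node cluster with asynchronous replication.

    Writes are acknowledged by node A immediately; node B only sees
    them after replicate() runs, modeling background replication lag.
    """
    def __init__(self):
        self.a = {}
        self.b = {}
        self.backlog = []   # writes accepted by A but not yet shipped to B

    def write(self, key, value):
        self.a[key] = value              # A acknowledges right away
        self.backlog.append((key, value))

    def read_b(self, key):
        return self.b.get(key)           # B answers from whatever it has

    def replicate(self):
        for key, value in self.backlog:  # background catch-up
            self.b[key] = value
        self.backlog.clear()

cluster = Cluster()
cluster.write("balance", 500)
print(cluster.read_b("balance"))   # None: the stale read linearizability forbids
cluster.replicate()
print(cluster.read_b("balance"))   # 500: the replicas have converged
```

The window between those two reads is exactly what linearizability outlaws, and everything that follows is about what it costs to close it.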

Availability means every request to a non-failing node gets a response. Not a timeout. Not an error code. A response with actual data. If a node is up, it answers. This is a much stronger guarantee than most people realize — it doesn't mean "the system is mostly up" or "99.99% of requests succeed." It means every request to every non-failing node gets a non-error response.

Partition tolerance means the system continues to function when the network between nodes drops messages. A partition isn't a node dying — it's the network between nodes going silent. Node A is up. Node B is up. They just can't hear each other. Maybe a switch failed. Maybe a cable was cut. Maybe a misconfigured firewall started dropping packets at 3 AM on a Saturday, which is when these things always happen.

The CAP dilemma during a network partition — node A must choose between accepting an inconsistent write or rejecting a valid request


"Pick Two" Is Wrong

This is where every Medium article and job interview goes off the rails.

The standard explanation goes like this: you get three properties — Consistency, Availability, Partition tolerance — pick two. So you can have CP, AP, or CA. Three categories, nice and clean, fits on a slide. Sigh, maybe we should blame slides.

It's wrong. Or rather, it's misleading in a way that makes the whole theorem useless.

Here's the thing Brewer himself later clarified, in a 2012 article titled "CAP Twelve Years Later" that not enough people read: you don't get to opt out of partitions. The network will partition. Not might. Will. If your system runs on more than one machine connected by a network, packets will be lost, switches will fail, cables will be unplugged by a contractor who was told the cable was yellow and this one is kind of yellow.

The CA option — consistent and available, but not partition-tolerant — is a single-machine database. It's Postgres on one box. The moment you go distributed, you're partition-tolerant by necessity, because "not partition-tolerant" means "dies when the network hiccups," and that's not a distributed system, that's a single point of failure with extra steps.

The real choice is: during a partition, do you sacrifice consistency or availability?

When the network is healthy, you can have all three. C, A, and P, coexisting peacefully. The theorem only forces a choice when things break. And things break.


CP: Refuse to Lie

A CP system, during a partition, chooses consistency over availability. If a node can't confirm that its data is up-to-date — if it can't reach a quorum, can't verify that it has the latest write — it refuses to answer. Your request gets rejected. The system says "I'd rather give you nothing than give you stale data."
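Here's a minimal sketch of that refusal logic. The quorum arithmetic is the standard majority rule; the class name and error message are invented for illustration and are not ZooKeeper's or etcd's actual API:

```python
class QuorumNode:
    """Toy CP node: it serves a request only when it can reach a strict
    majority of the cluster (counting itself); otherwise it refuses."""
    def __init__(self, cluster_size, reachable_peers):
        self.cluster_size = cluster_size
        self.reachable = reachable_peers

    def handle(self, request):
        # itself plus reachable peers must exceed half the cluster
        if len(self.reachable) + 1 <= self.cluster_size // 2:
            raise RuntimeError("no quorum: refusing rather than serve stale data")
        return f"served: {request}"

# Healthy side of a 5-node cluster: 4 reachable peers, quorum holds.
healthy = QuorumNode(cluster_size=5, reachable_peers=["b", "c", "d", "e"])
print(healthy.handle("read x"))        # served: read x

# Minority side of a partition: 1 reachable peer, 2 of 5 is no quorum.
minority = QuorumNode(cluster_size=5, reachable_peers=["b"])
try:
    minority.handle("read x")
except RuntimeError as err:
    print(err)                         # the CP answer: an error, not stale data
```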

ZooKeeper does this. It's a coordination service, not a database, and the distinction matters. ZooKeeper uses a protocol called ZAB (ZooKeeper Atomic Broadcast) — a leader-based atomic broadcast protocol designed specifically for primary-backup replication. It predates Raft by several years and solves a narrower problem than general-purpose consensus, but the leader-election and log-replication shape rhymes with Paxos and Raft if you squint. If a ZooKeeper node can't reach the leader, it stops serving reads. Your application gets an error. The data it has might be fine, but "might" isn't good enough when you're coordinating distributed locks or leader elections.

etcd does the same thing, using Raft for consensus. If your Kubernetes cluster partitions and a minority of etcd nodes end up on one side, that side becomes read-only — or fully unavailable, depending on the configuration. Every Kubernetes outage caused by etcd losing quorum is CAP in action.

HBase in strict consistency mode will refuse reads if it can't reach the region server that owns the data. You get an error. The row exists. The data is there. But the system won't give it to you because it can't prove it's current.

The cost is real. CP systems have lower availability by design. During a partition, some fraction of your users get errors instead of data. But the data they do get is always right.


AP: Serve Now, Reconcile Later

An AP system makes the opposite bet. During a partition, every node keeps serving requests. Writes go through. Reads return data. The problem is that the data might be wrong — or rather, it might be an older version, or a version that's about to be overwritten by a conflicting write on the other side of the partition.
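That behavior is easy to model. A toy Python sketch, using per-key version numbers as a stand-in for real conflict resolution:

```python
class APNode:
    """Toy AP node: it always accepts writes, tagging each with a version."""
    def __init__(self):
        self.store = {}   # key -> (version, value)

    def write(self, key, value, version):
        self.store[key] = (version, value)

    def merge(self, other):
        """On heal: keep the higher-versioned value for each key."""
        for key, (ver, val) in other.store.items():
            if key not in self.store or ver > self.store[key][0]:
                self.store[key] = (ver, val)

# During the partition, both sides keep accepting writes, and diverge.
left, right = APNode(), APNode()
left.write("profile", "old-photo", version=1)
right.write("profile", "new-photo", version=2)
print(left.store["profile"])    # (1, 'old-photo'): stale, but served

# The partition heals; both sides merge and converge on one value.
left.merge(right)
right.merge(left)
assert left.store == right.store
print(left.store["profile"])    # (2, 'new-photo'): version 1 is discarded
```

Note what happened to version 1: it was served during the partition and silently dropped afterward. That's the bet AP systems make on your behalf.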

Cassandra is the textbook example. Every node can accept reads and writes independently. During a partition, both sides keep working. When the partition heals, Cassandra reconciles the divergent data. The default conflict resolution is last-write-wins — whichever write has the later timestamp survives. This is fine for a shopping cart. It's terrifying for a bank account.

DynamoDB operates similarly — it descends from Amazon's original Dynamo design, built for availability above all else, which is exactly what you want from the database backing Amazon's checkout flow. An unavailable checkout page is a lost sale. A slightly stale product count can be fixed later.

DNS is the oldest and most successful AP system most developers interact with daily. When you update a DNS record, it doesn't propagate instantly. Different resolvers have different cached values with different TTLs. For some window of time — minutes, hours, sometimes days if someone misconfigured their TTL — different clients get different answers for the same hostname. If you read post 07 on DNS, you saw this in action: the resolver, the cache, the authoritative server can all disagree, and the system keeps working because it prioritizes availability over global consistency.

CP vs AP behavior during a partition — ZooKeeper refuses the request while Cassandra serves stale data and reconciles later


When There's No Partition: PACELC

Here's a question CAP doesn't answer: what do you do when the network is fine?

CAP is a theorem about failure. It tells you what breaks during a partition. But most of the time, the network is healthy. Most of the time, all your nodes can talk to each other. And even then, you still have a tradeoff. Just a different one.

Daniel Abadi proposed the PACELC framework, published in 2012, to fill this gap. The name is an if/else: Partition? Choose A or C. Else, choose Latency or Consistency.

Developers, please for the love of all that's holy STOP. NAMING. THINGS. BADLY! This should clearly have been named WAGER - Wait or Acknowledge, Guarantee or Eventual Replication.

Even when the network is healthy, strong consistency costs round trips. If I write to node A and you immediately read from node B, there are two ways to make sure you get my write:

  1. Synchronous replication. Node A doesn't acknowledge my write until it's been replicated to B (and possibly C). Every write waits for network round trips to complete. Consistent, but slow.

  2. Asynchronous replication. Node A acknowledges immediately and replicates in the background. Fast, but you might read stale data from B for a few milliseconds.

That's the ELC tradeoff. No partition, no failure, just physics. Light takes time to travel through fiber. Consensus protocols take round trips. If your replicas are in different data centers — us-east-1 and eu-west-1, say — a synchronous write has to cross the Atlantic and back before your application gets an acknowledgment. That's 80-120ms of added latency on every write. For a real-time bidding system or a game server, that's not acceptable. For a banking ledger, it's mandatory.
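To make the arithmetic concrete, here's a back-of-envelope model. The 100ms round trip and 1ms local write are assumed figures picked from the ranges above, not measurements:

```python
CROSS_REGION_RTT_MS = 100   # assumed us-east-1 <-> eu-west-1 round trip
LOCAL_WRITE_MS = 1          # assumed local write cost

def sync_write_ms(n_writes):
    """Synchronous replication: every ack waits on the cross-region trip."""
    return n_writes * (LOCAL_WRITE_MS + CROSS_REGION_RTT_MS)

def async_write_ms(n_writes):
    """Asynchronous replication: acknowledge locally, ship in the background."""
    return n_writes * LOCAL_WRITE_MS

print(sync_write_ms(100))    # 10100 ms of client-visible latency
print(async_write_ms(100))   # 100 ms, at the price of possibly-stale replica reads
```

A hundred consistent cross-region writes cost over ten seconds of cumulative client-visible latency. That's the "L" in ELC, and no amount of clever engineering removes it; only better hardware (Spanner) or a weaker guarantee does.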

DynamoDB is PA/EL — during a partition, it chooses availability; when things are normal, it chooses latency. Writes are fast, reads are eventually consistent by default. You can request strongly consistent reads, but they consume twice the read capacity units — the real cost of asking DynamoDB to stop lying to you, measured in your AWS bill.

Spanner — Google's globally distributed database — is PC/EC. During a partition, consistency wins. When things are normal, consistency still wins. Spanner uses atomic clocks and GPS receivers (the TrueTime API) to keep its nodes synchronized closely enough that it can offer external consistency — Google's term for what most textbooks call strict serializability: linearizable for single operations, serializable for multi-key transactions, with a real-time ordering guarantee across the whole system. For single reads, this is effectively equivalent to linearizability. For multi-key transactions, it's a meaningful step up. And it comes with bounded latency — not zero, but deterministic. Google built custom hardware to make the physics work. Most of us don't have that option. I cover Spanner in detail in post 34.


Eventual Consistency: The Fine Print

When an AP system says "we'll reconcile later," the devil is in "later" and "reconcile." Eventual consistency means that if no new writes arrive, all replicas will eventually converge to the same value. That's the formal definition. It says nothing about how long "eventually" takes, and nothing about what happens to reads during the convergence window.

Three strategies for handling conflicts when replicas disagree:

Last-write-wins (LWW). The write with the most recent timestamp survives. Dead simple. Also subtly broken — clock skew between machines means "most recent" is a lie. Node A's clock runs 50ms fast. Node B's clock is accurate. A writes at T=1000 (real time 950), B writes at T=980 (real time 980). B's write was actually later, but A's timestamp is higher. A wins. B's data is silently discarded. LWW is the conflict resolution strategy of "I don't want to think about this," and it shows.
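The bug fits in a dozen lines. Using the exact numbers from the scenario above:

```python
def lww_merge(writes):
    """Last-write-wins: keep the write with the highest timestamp."""
    return max(writes, key=lambda w: w["ts"])

# Node A's clock runs 50ms fast; node B's clock is accurate.
write_a = {"node": "A", "ts": 1000, "real_time": 950}
write_b = {"node": "B", "ts": 980,  "real_time": 980}

winner = lww_merge([write_a, write_b])
print(winner["node"])   # A

# B's write actually happened later in real time, yet A's skewed
# timestamp wins, and B's data is silently discarded.
assert write_b["real_time"] > write_a["real_time"]
assert winner is write_a
```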

Vector clocks. Instead of a single timestamp, each node maintains a vector of logical clocks — one per node. When replicas diverge, vector clocks can detect that two writes are concurrent (neither causally preceded the other) rather than one being "later." The system can then surface the conflict to the application or the user: "These two versions both exist. You decide." Amazon's original Dynamo paper (2007) used this approach. It's more correct than LWW. It's also more complex, and the vectors grow with the number of nodes.
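The core of a vector clock is the comparison. A minimal sketch of that comparison as a standalone function (not the Dynamo paper's actual implementation):

```python
def compare(vc1, vc2):
    """Compare two vector clocks: 'before', 'after', 'equal', or 'concurrent'."""
    nodes = set(vc1) | set(vc2)
    le = all(vc1.get(n, 0) <= vc2.get(n, 0) for n in nodes)
    ge = all(vc1.get(n, 0) >= vc2.get(n, 0) for n in nodes)
    if le and ge:
        return "equal"
    if le:
        return "before"       # vc1 causally precedes vc2
    if ge:
        return "after"
    return "concurrent"       # neither preceded the other: a real conflict

# One write extends another: causally ordered, nothing to resolve.
print(compare({"A": 1}, {"A": 1, "B": 1}))           # before

# Writes from either side of a partition: concurrent, surface to the app.
print(compare({"A": 2, "B": 1}, {"A": 1, "B": 2}))   # concurrent
```

The "concurrent" branch is the whole point: it's the case LWW cannot even express, because a single timestamp forces a total order onto writes that have no causal relationship.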

CRDTs (Conflict-free Replicated Data Types). Data structures mathematically designed so that concurrent modifications always converge to the same result, regardless of the order they're applied. A G-Counter (grow-only counter), for example, gives each node its own counter. The "value" is the sum of all node counters. Two nodes incrementing concurrently is fine — the sum is always correct. No conflict, no resolution needed.

CRDTs are elegant but constrained. Not every data structure has a CRDT equivalent. You can build a CRDT shopping cart (add-wins set). You cannot build a CRDT bank account (because debiting below zero requires global coordination, which is exactly what CRDTs avoid).
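A G-Counter is short enough to write out in full. A minimal Python sketch:

```python
class GCounter:
    """Grow-only counter CRDT: one counter per node, value is the sum,
    merge takes the per-node maximum, so merges commute and converge."""
    def __init__(self, node_id):
        self.node_id = node_id
        self.counts = {}

    def increment(self, n=1):
        self.counts[self.node_id] = self.counts.get(self.node_id, 0) + n

    def value(self):
        return sum(self.counts.values())

    def merge(self, other):
        for node, count in other.counts.items():
            self.counts[node] = max(self.counts.get(node, 0), count)

# Two nodes increment concurrently during a partition.
a, b = GCounter("a"), GCounter("b")
a.increment(3)
b.increment(2)

# Merge in either order after the heal: same answer, nothing to resolve.
a.merge(b)
b.merge(a)
print(a.value(), b.value())   # 5 5
```

The max-merge is why it's grow-only: a decrement would need the merge to distinguish "never saw that count" from "saw it and undid it," which max can't do. Hence the bank-account problem.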

Conflict resolution strategies — last-write-wins, vector clocks, and CRDTs compared


The Spectrum, Not the Binary

I've been talking about consistency as if it's a toggle — you have it or you don't. It's not. It's a spectrum, and the level you pick has direct consequences for latency, availability, and complexity.

Linearizability is the strongest guarantee. The system behaves as if all operations happen in a single, total order, and each operation appears to take effect at some point between its invocation and response. This is what CAP's "C" means. It's what you get from a single Postgres node, and it's what Spanner pays for with atomic clocks.

Sequential consistency is slightly weaker. Operations from each individual client appear in order, but there's no guarantee about the real-time ordering between different clients. If I write A then B, everyone sees A before B. But if I write A and you write X at the same time, some replicas might see A then X and others see X then A. Both orderings are valid, as long as every replica picks the same one eventually. Practical difference: linearizability requires a global clock or consensus. Sequential consistency doesn't.

Causal consistency is weaker still. The idea comes from Leslie Lamport's 1978 paper "Time, Clocks, and the Ordering of Events in a Distributed System" — one of the most cited papers in distributed systems and the source of the "happens-before" relation. The insight: in a distributed system, you can't trust wall clocks, but you can trust causality. If event B sent a message that triggered event A, then B unambiguously happened before A, regardless of what any clock says.

If you've taken a discrete math course, you've already met this idea — Lamport's "happens-before" is a strict partial order in the precise mathematical sense. It's irreflexive (an event doesn't happen before itself), asymmetric (if A happens-before B, then B can't happen-before A), and transitive (A → B → C implies A → C). What makes it partial rather than total is that some pairs of events are incomparable: neither A → B nor B → A holds. In a totally ordered set, every pair has a relationship. In a partial order, some pairs just don't. Lamport called those incomparable events concurrent — not because they happened at the same wall-clock time, but because in the mathematical structure of the system, there's no causal chain connecting them in either direction.

This is the whole difference between linearizability and causal consistency. Linearizability demands a total order — every pair of operations must be comparable, you must be able to say which one came first. Causal consistency only demands a partial order — respect the pairs that are causally related, and the rest can be observed in any order without breaking the contract. Causal consistency uses that relation as the ordering guarantee: if operation A causally depends on operation B, everyone sees B before A. Operations that are causally independent can be seen in any order. This is often good enough. If you post a comment in reply to a photo, everyone should see the photo before the comment. But two unrelated posts can appear in different orders on different clients and nobody notices.

Eventual consistency is the floor. Given enough time with no new writes, all replicas converge. No ordering guarantees during convergence. This is what you get from DNS, from Cassandra with consistency=ONE, from every browser's local cache. It's the cheapest option and, for many workloads, it's genuinely sufficient.

The gap between "I need linearizability" and "eventual consistency is fine" is where architectural decisions live. I've watched teams spend months building linearizable systems for data that could have been eventually consistent. I've also watched teams lose money — actual, quantifiable dollars — because they used eventual consistency for data that needed linearizability. The question isn't "which one is better." The question is "what's the cost of being wrong, and for how long?"


Jepsen: Trust But Verify

Kyle Kingsbury started the Jepsen project because he got tired of databases claiming consistency guarantees they couldn't keep.

The methodology is ruthless: take a distributed database, set up a cluster, inject network partitions and clock skew and node failures, run concurrent operations, and then check whether the database's actual behavior matched its advertised consistency level. Did it really provide linearizability? Did it really handle partitions the way the docs said?

The results are legendary. And depressing.

MongoDB, in versions prior to 3.4, could lose acknowledged writes during a replica set election. The write returned success. The data vanished. RethinkDB had stale reads during partitions when it claimed not to. Aerospike, VoltDB, Galera Cluster — the list of databases that failed their own consistency claims is long. Kingsbury published the results publicly, with detailed write-ups and reproducible test suites. Some vendors fixed their bugs. Some changed their documentation. Some got mad.

The Jepsen reports are the most useful resource in distributed systems, and I'm not exaggerating. Before using any database in a system where consistency matters, check whether Jepsen has tested it. If it has, read the report. If it hasn't, ask yourself whether you trust the vendor's consistency claims more than you trust a test harness purpose-built to disprove them. You probably shouldn't.


Why Your Shopping Cart and Your Bank Account Work Differently

Your shopping cart is AP. You add an item on your phone and add a different item on your laptop. Both writes succeed. When the system reconciles, your cart has both items. If there's a conflict — you added the same item twice on different devices — the resolution is simple: keep both, you probably wanted two. The cost of inconsistency is low (a duplicate item in a cart is annoying, not catastrophic), and the cost of unavailability is extremely high. Amazon doesn't publish the exact per-minute figure, but every minute the checkout flow is down is a minute of real revenue walking out the door — and the Dynamo paper is explicit that availability was the non-negotiable constraint that drove the entire design.
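The "keep both" merge is one line if you model a cart as a multiset. A toy sketch using Python's Counter; real carts also need tombstones so that removing an item survives the merge:

```python
from collections import Counter

def merge_carts(cart_a, cart_b):
    """'Keep both' reconciliation: sum the quantities from both devices."""
    return cart_a + cart_b

phone = Counter({"charger": 1, "headphones": 1})
laptop = Counter({"charger": 1, "keyboard": 1})

merged = merge_carts(phone, laptop)
print(merged["charger"])   # 2: added on both devices, so you get two
```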

Your bank account is CP. If two ATMs process withdrawals simultaneously and the network between them is down, one of them must refuse. The cost of inconsistency is high (you withdraw more than your balance, go overdrawn, incur fees, and it takes time to resolve) and the cost of unavailability is low (you walk to a different ATM or try again in five minutes; annoying, but nobody loses money).

This is the whole point. CAP isn't an academic curiosity. It's the reason your bank's app occasionally shows "service unavailable" while your social media feed never does. Different data, different costs, different choices.

DNS, as we saw in post 07, chose AP decades ago and it was the right call. A slightly stale DNS record is vastly preferable to DNS being unavailable — the entire internet depends on it. The TTL-based propagation model is eventual consistency with a time-bounded convergence window, and it's worked since 1983.

TCP, as we explored in post 08, solves a related but different problem: reliable delivery over an unreliable network. It doesn't have to choose between C and A because it's a point-to-point protocol, not a replicated system. But the same underlying reality — networks are unreliable, messages get lost, you have to decide what to do about it — drives both TCP's design and CAP's constraints.

Real-world systems positioned on the CAP consistency-availability spectrum with PACELC classifications


The Real Takeaway

CAP doesn't say distributed systems are broken. It says distributed systems must make promises they can't always keep, and the grown-up decision is choosing which promise to break when things go wrong.

Every time you hear someone say a system is "CP" or "AP," ask: during what kind of partition? At what consistency level? With what reconciliation strategy? The two-letter label is a starting point, not an answer. The real engineering is in the 47 configuration options between those two letters.

Brewer framed it as a conjecture. Gilbert and Lynch proved it as a theorem. Kingsbury tests it as a practice. And the rest of us live with it every time a network cable gets cut, a switch reboots, or someone plugs a vacuum cleaner into the wrong power strip.




I'm writing a book about what makes developers irreplaceable in the age of AI. Join the early access list →


Naz Quadri once spent a weekend debugging a "network issue" that turned out to be a vacuum cleaner on a shared power strip. He blogs at nazquadri.dev. Rabbit holes all the way down 🐇🕳️.