
The Algorithm That Won by Being Obvious

Raft Consensus: How Distributed Systems Actually Agree

Reading time: ~15 minutes


In 2014, Diego Ongaro and John Ousterhout published a paper with an unusual goal: design a consensus algorithm that people could actually understand. The existing standard was Paxos, Leslie Lamport's legendary protocol developed in the late 1980s and published in 1998. Paxos worked. Paxos was proven correct. Paxos was also nearly impossible to implement from the paper alone — Lamport's original used a fictional Greek parliament metaphor that confused more readers than it enlightened. Google's Chubby team, who built a production Paxos implementation, wrote in their OSDI 2006 paper that there were "significant gaps between the description of the Paxos algorithm and the needs of a real-world system" — a polite way of saying the algorithm worked on paper and broke in production. That observation is from engineers who build planet-scale systems for a living.

Ongaro's PhD thesis opens with a user study. Students who learned Raft scored higher than students who learned Paxos on every comprehension question. Not marginally higher. Substantially higher. And when asked which algorithm they felt more confident implementing — after studying both — 33 of 43 chose Raft.

Raft wasn't designed to be faster than Paxos. It wasn't designed to handle more failure modes. It was designed to be understandable. And it succeeded so completely that it's now the consensus algorithm running inside most of the distributed infrastructure you depend on. etcd. CockroachDB. TiKV. Consul. Vault. Nomad. If your production system has a cluster, there's a good chance Raft is the reason it stays consistent.

I spent a long weekend implementing a toy Raft in Rust after reading the paper, and the thing that struck me was how much I'd overcomplicated consensus in my head. Years of hearing "distributed consensus is hard" had built up this fog. Raft cuts through it by decomposing the problem into three sub-problems that are each tractable on their own: leader election, log replication, and safety. That decomposition is the whole insight.


The Problem: Getting N Machines to Agree

Here's the setup. You have N servers — typically 3, 5, or 7. They need to agree on the same sequence of operations, in the same order, even when some of them crash. This is the replicated state machine problem: each server maintains a log of commands, and if all servers execute the same commands in the same order, they'll arrive at the same state. The log is the source of truth. Consensus is the protocol that keeps the logs identical.
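The determinism that makes this work is easy to show in code. Here's a toy key-value state machine of my own invention (nothing from the paper — the `Command` and `replay` names are mine): replaying the same log always produces the same state, which is what lets Raft reduce "agree on state" to "agree on the log".

```rust
use std::collections::HashMap;

// Hypothetical command type for a toy key-value state machine.
enum Command {
    Set(&'static str, i64),
    Delete(&'static str),
}

// Each replica holds a state machine and applies log entries in order.
#[derive(Default, Debug, PartialEq)]
struct StateMachine {
    data: HashMap<String, i64>,
}

impl StateMachine {
    fn apply(&mut self, cmd: &Command) {
        match cmd {
            Command::Set(k, v) => {
                self.data.insert(k.to_string(), *v);
            }
            Command::Delete(k) => {
                self.data.remove(*k);
            }
        }
    }
}

// Replaying the same log is deterministic: same commands, same order,
// same final state, on every replica.
fn replay(log: &[Command]) -> StateMachine {
    let mut sm = StateMachine::default();
    for cmd in log {
        sm.apply(cmd);
    }
    sm
}

fn main() {
    let log = vec![
        Command::Set("x", 1),
        Command::Set("y", 2),
        Command::Delete("x"),
    ];
    // Two replicas, same log, same order -> identical state.
    assert_eq!(replay(&log), replay(&log));
    println!("replicas agree: {:?}", replay(&log).data);
}
```

Everything interesting in Raft is about keeping those logs identical; applying them is the easy part.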

Why not use a database? Because the database itself needs consensus internally. etcd is a key-value store that uses Raft to replicate its data. CockroachDB uses Raft to replicate each range of its keyspace. TiKV, the storage layer under TiDB, uses Raft. The database doesn't solve the problem — it contains the problem.

Replicated state machine architecture — same log, same order, same state

And why not a single server? Because it dies. One server is a single point of failure. Three servers give you fault tolerance: the system continues operating as long as a majority (2 of 3, 3 of 5) are alive. The "as long as a majority" part is load-bearing. It's what makes the math work. A 5-node cluster survives 2 failures. A 3-node cluster survives 1. You always need ⌊N/2⌋ + 1 nodes alive, which is why clusters are odd-numbered — a 4-node cluster and a 3-node cluster both tolerate exactly 1 failure, so the 4th node is wasted money.
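The quorum arithmetic is two one-liners (integer division does the flooring for us):

```rust
// Smallest majority of a cluster: floor(N/2) + 1.
fn quorum(cluster_size: usize) -> usize {
    cluster_size / 2 + 1
}

// Failures tolerated = nodes you can lose while a quorum survives.
fn failures_tolerated(cluster_size: usize) -> usize {
    cluster_size - quorum(cluster_size)
}

fn main() {
    for n in [1, 3, 4, 5, 7] {
        println!(
            "{n} nodes: quorum = {}, tolerates {} failure(s)",
            quorum(n),
            failures_tolerated(n)
        );
    }
    // The 4th node buys no extra fault tolerance over 3.
    assert_eq!(failures_tolerated(3), failures_tolerated(4));
}
```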


Leader Election: Where the Understanding Clicks

Raft leader election — timer expiry, vote requests, majority, heartbeats

Every Raft node is in one of three states: follower, candidate, or leader. On startup, everyone is a follower. Followers are passive — they respond to requests from the leader but don't initiate anything.

The leader sends periodic heartbeats. As long as a follower receives heartbeats, it stays quiet. But each follower has an election timeout — a randomised timer, typically between 150ms and 300ms. If the timeout fires without receiving a heartbeat, the follower assumes the leader is dead and starts an election.

That randomisation is everything. I cannot stress this enough.
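A sketch of the timeout draw, using a tiny xorshift step so the example stays std-only (a real node would use a proper RNG crate; the constants match the 150-300ms range mentioned above):

```rust
// Minimal PRNG: Marsaglia's xorshift64. Seed must be nonzero.
struct XorShift64(u64);

impl XorShift64 {
    fn next(&mut self) -> u64 {
        self.0 ^= self.0 << 13;
        self.0 ^= self.0 >> 7;
        self.0 ^= self.0 << 17;
        self.0
    }
}

const ELECTION_TIMEOUT_MIN_MS: u64 = 150;
const ELECTION_TIMEOUT_MAX_MS: u64 = 300;

// Each election round draws a fresh timeout from [150, 300) ms.
fn election_timeout_ms(rng: &mut XorShift64) -> u64 {
    let span = ELECTION_TIMEOUT_MAX_MS - ELECTION_TIMEOUT_MIN_MS;
    ELECTION_TIMEOUT_MIN_MS + rng.next() % span
}

fn main() {
    let mut rng = XorShift64(0x9E3779B97F4A7C15); // arbitrary nonzero seed
    let draws: Vec<u64> = (0..5).map(|_| election_timeout_ms(&mut rng)).collect();
    println!("timeouts for the next five rounds: {draws:?}");
    // Every draw stays inside the configured window...
    assert!(draws.iter().all(|&t| (150..300).contains(&t)));
    // ...and the draws differ, which is what breaks split-vote ties.
    assert!(draws.windows(2).any(|w| w[0] != w[1]));
}
```

The point is that each node, each round, gets an independent draw. Two nodes colliding once is unlucky; colliding round after round is vanishingly unlikely.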

Here's the election, step by step:

  1. The follower increments its current term number. Terms are Raft's logical clock — monotonically increasing integers that partition time into periods of leadership. Term 1, term 2, term 3.
  2. It transitions to the candidate state.
  3. It votes for itself.
  4. It sends RequestVote RPCs to every other node in parallel.

Each node votes for at most one candidate per term. First come, first served. If a candidate receives votes from a majority of the cluster, it becomes the leader for that term and immediately starts sending heartbeats to establish authority.
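The candidate and voter sides of those steps fit in a short sketch (the names are mine, not the paper's, and the log up-to-date check is omitted here — it belongs to the safety section):

```rust
#[derive(Debug, PartialEq)]
enum Role {
    Follower,
    Candidate,
    Leader,
}

struct Node {
    id: u64,
    term: u64,
    role: Role,
    voted_for: Option<u64>,
}

impl Node {
    // Steps 1-3: bump the term, become a candidate, vote for yourself.
    fn start_election(&mut self) {
        self.term += 1;
        self.role = Role::Candidate;
        self.voted_for = Some(self.id);
    }

    // Voter side of RequestVote: at most one vote per term, first come
    // first served. A higher term always forces a step-down first.
    fn handle_request_vote(&mut self, candidate_term: u64, candidate_id: u64) -> bool {
        if candidate_term > self.term {
            self.term = candidate_term;
            self.role = Role::Follower;
            self.voted_for = None;
        }
        if candidate_term == self.term && self.voted_for.is_none() {
            self.voted_for = Some(candidate_id);
            return true;
        }
        false
    }
}

fn main() {
    let mut s1 = Node { id: 1, term: 4, role: Role::Follower, voted_for: None };
    let mut s2 = Node { id: 2, term: 4, role: Role::Follower, voted_for: None };
    let mut s3 = Node { id: 3, term: 4, role: Role::Follower, voted_for: None };

    s1.start_election(); // S1 times out and runs for term 5
    let votes = 1 // its own vote
        + s2.handle_request_vote(s1.term, s1.id) as u64
        + s3.handle_request_vote(s1.term, s1.id) as u64;
    if votes > 3 / 2 {
        s1.role = Role::Leader;
    }
    assert_eq!(s1.role, Role::Leader);
    // S2 already voted in term 5, so a rival candidate gets nothing.
    assert!(!s2.handle_request_vote(5, 3));
    println!("S1 elected with {votes}/3 votes in term {}", s1.term);
}
```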

Leader election timeline — S1 wins the election in Term 5

The randomised timeout is what prevents split votes. If two nodes timeout simultaneously and both become candidates in the same term, they might split the vote — S1 votes for itself and gets S2's vote (2 votes), while S4 votes for itself and gets S5's vote (2 votes). S3 hasn't voted yet, but neither candidate has a majority of 3. Neither becomes leader.

When this happens, both candidates time out again (with new random intervals), increment to the next term, and try again. Because the timeouts are random, the probability that they collide again is low — and it drops fast across rounds, because each round is an independent random draw. In practice, elections resolve in one round almost every time. The Raft paper (Section 9.3) reports election completion times in the tens of milliseconds with timeouts in the 150-300ms range; the randomisation works extremely well even with a small spread.

Compare this to the nightmare of simultaneous candidacy in Paxos, which requires a complex system of proposal numbers and promises and prepares. Raft says: randomise the timeout, let the fastest node win, move on.

When the leader dies? Same thing:

Raft leader failure and re-election — crash, timeout, automatic recovery

Followers stop receiving heartbeats, one of them times out, starts an election, wins, becomes the new leader. The cluster is unavailable during the election — typically a few hundred milliseconds. For most systems, that's imperceptible. I once killed and restarted the leader node of an etcd cluster in a test environment. The clients saw a blip of maybe 200ms. By the time I switched terminal tabs, the new leader was already serving reads.

A critical detail: any node that receives an RPC with a term number higher than its own immediately steps down to follower and updates its term. This means stale leaders can't persist. If a leader gets partitioned from the cluster, the remaining nodes elect a new leader at a higher term. When the old leader reconnects, it sees the higher term and steps down. No split-brain. No two leaders serving writes in the same term.


Log Replication: The Steady State

Raft log replication — client write, append, majority ACK, commit

Once a leader is elected, it handles all client requests. This is the normal operation of the cluster — the part that's happening 99.9% of the time.

A client sends a command to the leader. The leader appends it to its local log as a new entry, tagged with the current term number and a log index. Then it sends AppendEntries RPCs to every follower. Each follower appends the entry to its own log and responds.

When the leader has received acknowledgment from a majority of the cluster (including itself), the entry is committed. The leader applies the committed entry to its state machine, returns the result to the client, and notifies followers of the new commit index in subsequent heartbeats. Followers apply committed entries to their own state machines.
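The leader's bookkeeping for this reduces to a one-liner over its per-server replication state. A sketch (the `match_index` name follows the paper's state, but the function is mine; this deliberately ignores the current-term restriction that the commit-rule section covers):

```rust
// Given each server's highest replicated index (the leader's own log
// included), the committed index is the largest index present on a
// majority: sort descending and take the entry at position n/2, which
// is held by n/2 + 1 servers.
fn committed_index(match_index: &[u64]) -> u64 {
    let mut sorted = match_index.to_vec();
    sorted.sort_unstable_by(|a, b| b.cmp(a)); // descending
    sorted[sorted.len() / 2]
}

fn main() {
    // 5 servers: leader at index 9, one follower at 9, one at 8,
    // two lagging at 4. Index 8 is on three servers: committed.
    let match_index = [9, 9, 8, 4, 4];
    println!("committed through index {}", committed_index(&match_index));
    assert_eq!(committed_index(&match_index), 8);
}
```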

Log replication — majority achieved, entry committed

The majority rule is what makes this durable. Once a majority has the entry, no future leader can be elected without having that entry (I'll explain why in a moment). Even if the current leader crashes right after committing, the entry survives.

What about followers that fall behind? The leader tracks the next log index to send to each follower. If a follower's log diverges — maybe it was a leader in a previous term and has some uncommitted entries the current leader doesn't — the leader walks backward through the follower's log until it finds a matching entry, then overwrites everything after it. The leader's log is authoritative. Full stop.

Heartbeats and AppendEntries RPCs are the same mechanism. An empty AppendEntries is a heartbeat. A non-empty one carries log entries. The leader sends them at a regular interval (the heartbeat interval, much shorter than the election timeout). This is a nice simplification — one RPC type handles both liveness and replication.
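The follower side of that consistency check and overwrite can be sketched with a log of bare term numbers (entry `i`'s term at position `i - 1`, 1-based indices as in the paper; the function shape is mine, not the paper's RPC signature):

```rust
// Follower side of AppendEntries. Returns false on a consistency-check
// miss, which tells the leader to retry with a smaller prev_index --
// the "walk backward" described above. An empty `entries` slice is a
// heartbeat: same code path, nothing appended.
fn append_entries(
    log: &mut Vec<u64>, // term of each entry
    prev_index: usize,  // index of the entry just before `entries`; 0 = log start
    prev_term: u64,
    entries: &[u64],
) -> bool {
    if prev_index > 0 {
        match log.get(prev_index - 1) {
            Some(&t) if t == prev_term => {}
            _ => return false, // no matching entry: reject, leader backs off
        }
    }
    // Consistency check passed: anything after prev_index is overwritten.
    // The leader's log is authoritative.
    log.truncate(prev_index);
    log.extend_from_slice(entries);
    true
}

fn main() {
    // Follower kept an uncommitted term-2 entry from a deposed leader.
    let mut follower = vec![1, 1, 2];
    // The term-3 leader's log is [1, 1, 3]; its heartbeat conflicts...
    assert!(!append_entries(&mut follower, 3, 3, &[]));
    // ...so it backs off one entry and overwrites the divergence.
    assert!(append_entries(&mut follower, 2, 1, &[3]));
    assert_eq!(follower, vec![1, 1, 3]);
    println!("follower log repaired: {follower:?}");
}
```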


Safety: Why Stale Candidates Can't Win

Here's the subtlety that makes Raft correct, not merely functional.

When a candidate requests a vote, it includes the term and index of the last entry in its log. A voter compares this to its own log. If the voter's log is more up-to-date — meaning it has a higher term in its last entry, or the same term but more entries — it refuses the vote.

This is the election restriction, and it guarantees that any elected leader has all committed entries. Think about it: a committed entry exists on a majority of servers. A candidate needs votes from a majority of servers. Those two majorities must overlap in at least one server. That overlapping server has the committed entry. And because of the election restriction, it won't vote for any candidate whose log doesn't contain that entry.
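The voter's comparison is small enough to write out exactly as the paragraph describes it: last terms first, then log length as the tie-breaker (function name mine):

```rust
// "At least as up-to-date": the candidate's last entry has a higher
// term, or the same term and an index at least as large.
fn candidate_is_up_to_date(
    voter_last_term: u64,
    voter_last_index: u64,
    candidate_last_term: u64,
    candidate_last_index: u64,
) -> bool {
    candidate_last_term > voter_last_term
        || (candidate_last_term == voter_last_term
            && candidate_last_index >= voter_last_index)
}

fn main() {
    // A voter holding a committed term-3 entry at index 5 refuses a
    // candidate whose log ends at term 2, no matter how long it is.
    assert!(!candidate_is_up_to_date(3, 5, 2, 100));
    // Equal logs are "at least as up-to-date": the vote is granted.
    assert!(candidate_is_up_to_date(3, 5, 3, 5));
    println!("election restriction holds");
}
```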

Committed entries can never be lost. That's not a feature — it's a theorem. The Raft paper proves it formally, and this election restriction is the key mechanism.


The Commit Rule: A Subtle Bug, Tough to Catch

This is the part that separates people who've read the paper from people who've skimmed it.

A leader can only mark an entry as committed if it's from the current term. Not a previous term. Even if a majority of nodes have replicated an entry from term 2, the leader of term 4 cannot commit that entry by counting replicas. It has to wait until a new entry from term 4 is committed — which implicitly commits all preceding entries.

This sounds like a strange restriction. It prevents a specific, devastating bug that Ongaro illustrates in Figure 8 of the paper. The scenario goes roughly like this:

A leader in term 2 replicates an entry to some followers but crashes before committing it. A new leader in term 3 is elected but also crashes quickly. Another leader in term 4 sees the term-2 entry on a majority of nodes and might be tempted to call it committed. But a different node — one that didn't have the term-2 entry — could still be elected leader in term 5 (if it has a term-3 entry that makes its log "more up-to-date"). That new leader would overwrite the term-2 entry. If we'd already called it committed and acknowledged it to clients, we'd have violated consistency.

The Figure 8 scenario — dangerous overwrite vs safe commit path

The fix is clean: the leader only directly commits entries from its own term. Once a current-term entry is committed on a majority, everything before it in the log is implicitly committed too. This closes the window.
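Here's the guarded commit advance as a sketch (my own framing of the rule, with logs again represented as term numbers): scan candidate indices, count replicas, and require the entry's term to equal the leader's current term before moving the commit index.

```rust
// Advance the commit index, but only through entries from the current
// term. Indices are 1-based; log_terms[i - 1] is entry i's term.
fn advance_commit(
    log_terms: &[u64],   // term of each entry in the leader's log
    match_index: &[u64], // highest replicated index per server (leader included)
    current_term: u64,
    mut commit: u64,
) -> u64 {
    let n = match_index.len();
    for idx in (commit + 1)..=(log_terms.len() as u64) {
        let replicas = match_index.iter().filter(|&&m| m >= idx).count();
        // The term check is the Figure 8 fix: a majority alone is not enough.
        if replicas > n / 2 && log_terms[idx as usize - 1] == current_term {
            commit = idx; // everything before idx commits implicitly too
        }
    }
    commit
}

fn main() {
    // Term-4 leader of a 5-node cluster. Entry 2 (from term 2) now sits
    // on a majority, but counting replicas alone must not commit it.
    assert_eq!(advance_commit(&[1, 2], &[2, 2, 2, 0, 0], 4, 0), 0);
    // Once a term-4 entry reaches a majority, entries 1-2 commit with it.
    assert_eq!(advance_commit(&[1, 2, 4], &[3, 3, 3, 0, 0], 4, 0), 3);
    println!("Figure 8 window closed");
}
```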

I find this the most satisfying part of the paper. It's the kind of bug that would take months to manifest in production, surface as intermittent data corruption, and drive an engineering team to the edge of sanity. Ongaro identified it at the protocol level and designed it away.


Membership Changes: Necessary, Not Exciting

Adding or removing nodes from a running cluster is harder than it sounds. The danger is that during a transition from 3 nodes to 5 nodes, you might have two disjoint majorities — 2 of 3 from the old config and 3 of 5 from the new config — that could elect separate leaders simultaneously.

Raft's original solution was joint consensus: a two-phase approach where the cluster first transitions to a combined configuration (old AND new), requiring majorities from both groups, then transitions to the new configuration alone. It works. It's also complex to implement correctly.

In practice, most implementations use a simpler approach that Ongaro described later: single-server changes. You add or remove one node at a time. With single-server changes, the old and new configurations always overlap in a majority, so two disjoint majorities are impossible. etcd uses this approach. It's slower for large membership changes but dramatically simpler to get right.

I'm not going to pretend this section excites me. Membership changes are important infrastructure plumbing. They matter in production. They don't produce aha moments. Moving on.


Snapshots: The Log Can't Grow Forever

Every committed entry sits in the log. After a year of operation, that log could contain millions of entries. Replaying them all on a restarting node would take forever.

Snapshots solve this. Periodically, a node takes a snapshot of its current state machine state (the result of applying all committed entries up to index N), writes it to disk, and discards all log entries up to N. The log is now short again. If a follower falls so far behind that the leader has already discarded the entries it needs, the leader sends the entire snapshot via an InstallSnapshot RPC. The follower loads the snapshot, and normal replication resumes from there.
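The bookkeeping behind "discard all log entries up to N" looks roughly like this (a minimal sketch of my own; real implementations also persist the snapshot's last included term for the consistency check):

```rust
// A log whose prefix has been folded into a snapshot. Logical entry
// indices keep counting from 1 forever; only the tail is stored.
struct Log {
    snapshot_last_index: u64, // entries at or before this live only in the snapshot
    entry_terms: Vec<u64>,    // terms of the remaining tail
}

impl Log {
    // Physical position of logical entry `i` within `entry_terms`.
    fn offset(&self, i: u64) -> usize {
        (i - self.snapshot_last_index - 1) as usize
    }

    // After snapshotting state up to `upto`, drop that prefix of the log.
    fn compact(&mut self, upto: u64) {
        let keep_from = self.offset(upto) + 1;
        self.entry_terms.drain(..keep_from);
        self.snapshot_last_index = upto;
    }
}

fn main() {
    let mut log = Log {
        snapshot_last_index: 0,
        entry_terms: vec![1, 1, 2, 2, 2, 3, 3, 3, 3, 3], // entries 1..=10
    };
    log.compact(7); // snapshot now covers entries 1..=7
    assert_eq!(log.snapshot_last_index, 7);
    assert_eq!(log.entry_terms, vec![3, 3, 3]); // entries 8..=10 remain
    println!("log truncated to {} live entries", log.entry_terms.len());
}
```

A follower asking for an entry at or below `snapshot_last_index` is exactly the case where the leader must fall back to InstallSnapshot.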

This is one of those practical concerns that academic papers sometimes hand-wave. Ongaro didn't. The snapshot mechanism is fully specified in the paper, including what happens when a snapshot transfer is interrupted and how to avoid blocking the state machine during snapshot creation. Production-readiness was a design goal, not an afterthought.


Raft in the Wild

The list of systems using Raft is staggering.

etcd — the key-value store underneath Kubernetes. Every Kubernetes cluster contains an etcd cluster, and etcd's replication protocol is Raft. When you kubectl apply, the desired state goes through Raft consensus before it's persisted. Every Kubernetes cluster on the planet is backed by a Raft cluster — and when that Raft cluster loses quorum, the entire control plane goes read-only. Let that sink in.

CockroachDB — uses Raft to replicate individual ranges of its keyspace. A single CockroachDB cluster can have thousands of Raft groups running simultaneously, one per range. They call this Multi-Raft and it required significant engineering to make performant, but the core protocol is straight from the paper.

TiKV — the distributed key-value store under TiDB. Same pattern: Raft per region, Multi-Raft coordination.

HashiCorp's stack — Consul (service mesh), Vault (secrets management), and Nomad (workload orchestration) all use Raft for their internal state replication. HashiCorp's Go implementation of Raft is one of the most battle-tested open-source implementations.

The common thread: every one of these projects chose Raft over Paxos. Not because Raft is theoretically superior — it isn't, really. Because their engineers could read the paper, understand the protocol, and implement it correctly.


Raft vs Paxos: Same Destination, Different Maps

Raft and Multi-Paxos are similar in safety properties and performance. Both require a majority for commits, both have O(N) message complexity per consensus round, both handle crash faults (not Byzantine faults — if you need servers that might actively lie, you need BFT protocols, not Raft). The substantive differences are in how each algorithm decomposes the problem, how membership changes work, and how leader election interacts with log handling. Raft makes the leader load-bearing in a way Multi-Paxos doesn't, and that simplification is what gives the rest of the algorithm its shape.

The decisive difference is in human factors. Paxos is described in terms of proposers, acceptors, and learners, with a prepare phase and an accept phase, and for years the relationship between the single-decree protocol and the multi-decree protocol used in practice was left as an exercise for the implementer. Chubby, Google's lock service, took years to build despite being designed by experts. The paper describing their experience is a litany of subtle bugs and misunderstandings.

Raft has a leader. The leader accepts writes, replicates them, and commits them. If the leader dies, a new one is elected. That's it. The entire protocol fits in a student's head after one reading. Paxos requires multiple readings, supplementary papers, and probably a whiteboard session with someone who's already implemented it.

Lamport is a genius — he won the Turing Award, and Paxos is provably correct. But Ongaro and Ousterhout understood something equally important: a correct protocol that nobody can implement correctly is not useful. They didn't reinvent consensus. They re-presented it in a shape engineers could actually carry in their heads. In engineering, packaging is the work.

If you want to see the protocol in action, go to thesecretlivesofdata.com/raft/. It's an interactive visualisation that walks through leader election, log replication, and network partitions step by step. I've sent it to every engineer I've mentored who asks about consensus. It takes 15 minutes and replaces a semester of reading.

Raft vs Paxos — same correctness guarantees, different cognitive load


The Bigger Picture

Raft handles crash faults — servers that stop responding. It does not handle servers that lie, send corrupted data, or act maliciously. That's the Byzantine fault model, and it requires fundamentally different (and more expensive) protocols. If you're building a permissioned cluster of servers you control, Raft is the right tool. If you're building a blockchain where participants might cheat, you need BFT.

Raft also operates within the CAP theorem framework. It's a CP system — it chooses consistency over availability during network partitions. If a minority partition can't reach the majority, those nodes stop serving writes. They don't return stale data or accept writes that might conflict. This is the right tradeoff for systems like etcd and Consul, where inconsistency would be catastrophic.

And if you've read post 19 on synchronization primitives, the parallel is direct. A mutex coordinates threads within a single machine. Raft coordinates servers across a network. The problem is the same — mutual exclusion, ordering, agreement — but the failure model is different. Threads don't crash independently (usually). Servers do. The network between them drops packets, reorders messages, and partitions without warning. That's why distributed consensus is its own field, and why Raft's clarity matters so much. When the failure modes are this complex, you need a protocol simple enough to reason about under pressure.

Ongaro set out to make consensus understandable. As someone who struggled with Paxos but got Raft in a single reading, I can confidently say he pulled it off. And that success changed what infrastructure teams build, because once engineers could trust their own understanding of the protocol, more of them were willing to actually ship it.




I'm writing a book about what makes developers irreplaceable in the age of AI. Join the early access list →


Naz Quadri once implemented a toy Raft and then spent four hours debugging an election timeout of zero milliseconds. He blogs at nazquadri.dev. Rabbit holes all the way down 🐇🕳️.