The Database That Cheated Time
Google Spanner, TrueTime, and the Atomic Clocks in the Basement
Reading time: ~15 minutes
You want to transfer money from a bank account in New York to one in London. The debit has to happen before the credit, the credit can't happen if the debit fails, and no observer anywhere on the planet should ever see a state where the money exists in both accounts or neither. This is a distributed transaction with external consistency — and for decades, the accepted wisdom was that you couldn't have it at global scale.
The CAP theorem said you had to choose. Dynamo chose availability. Most systems chose to keep their data in one region and pretend the problem didn't exist. Google chose a third option.
They installed atomic clocks in their data centres.
The Problem Nobody Could Solve
Here's the core difficulty with globally distributed databases: time.
If your database lives in one machine, ordering transactions is trivial. Transaction A starts, acquires a lock, does its work, commits. Transaction B waits. The single machine's clock provides the ordering. Done.
Spread that database across two continents and the whole thing falls apart. The machine in New York says it's 14:00:00.000. The machine in London says it's 14:00:00.003. Which one is right? Neither. Both. It depends on when each machine last synced with an NTP server, how much their local oscillator has drifted since then, and how much network jitter was in the last sync. NTP gives you accuracy in the tens of milliseconds on a good day. On a bad day, clocks can be hundreds of milliseconds apart.
Tens of milliseconds of uncertainty doesn't sound like much. But a busy distributed database can process on the order of a thousand transactions per millisecond across its fleet. If your timestamp is off by 10ms, you can have 10,000 transactions whose order is ambiguous. Transaction B thinks it happened after Transaction A, but their timestamps say otherwise. Linearizability — the property that the system behaves as if all operations happen in a single, globally agreed order — becomes impossible to guarantee without some form of coordination.
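A toy example makes the ambiguity concrete: two nodes whose clocks disagree by 10ms can stamp causally ordered events in the reverse order. All numbers here are invented for illustration:

```python
# Hypothetical sketch: two nodes whose clocks disagree by 10 ms
# can timestamp causally ordered events in the wrong order.

SKEW_MS = 10  # node B's clock runs 10 ms behind node A's

def node_a_clock(true_ms: int) -> int:
    return true_ms            # node A happens to be correct

def node_b_clock(true_ms: int) -> int:
    return true_ms - SKEW_MS  # node B lags by the skew

# Transaction 1 commits on node A at true time 100 ms.
# Transaction 2 starts on node B at true time 105 ms -- strictly later.
t1 = node_a_clock(100)  # stamped 100
t2 = node_b_clock(105)  # stamped 95

# The timestamps invert the real-time order:
assert t2 < t1
```

An external observer saw Transaction 1 finish before Transaction 2 began, yet the database's timestamps claim the opposite.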
The standard solutions all involve sacrifice. You can use a single timestamp oracle — a designated machine that hands out monotonically increasing timestamps — but now that machine is a bottleneck and a single point of failure. You can use Lamport clocks or vector clocks, but those give you causal ordering, not real-time ordering. You can give up strong consistency entirely and go eventual, like Dynamo. Or you can keep everything in one data centre and accept that "globally distributed" isn't for you.
Google needed serializable transactions across data centres on different continents. Their ad system processed billions of dollars. The ad serving database couldn't afford to be eventually consistent — showing the same ad twice or billing incorrectly because of a stale read isn't a minor inconvenience, it's a financial error at scale.
They needed a new idea. The idea they had was absurd.
TrueTime: The API That Returns an Interval
In 2012, Google published the Spanner paper. The most important thing in it isn't the database. It's a time API called TrueTime.
Every other time API on the planet works the same way: you call now() and get back a timestamp. A single point. "It is 14:00:00.003." TrueTime does something different. You call TT.now() and get back an interval: [earliest, latest]. Not "it is 14:00:00.003" but "it is somewhere between 14:00:00.001 and 14:00:00.005."
That interval is the uncertainty bound. TrueTime is telling you: "I don't know exactly what time it is, and I'm not going to pretend I do. But I guarantee — guarantee — that the true time is within this window."
The width of that interval, which Google calls epsilon (ε), is typically between 1 and 7 milliseconds. It depends on how recently the local clock synchronized and how much drift has occurred since. Right after a sync, epsilon is small — maybe 1ms. As time passes without a sync, the local oscillator drifts, and epsilon grows. Then the next sync happens and epsilon shrinks again. It's a sawtooth wave, oscillating between roughly 1ms and 7ms.
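To make the interval semantics concrete, here's a minimal Python sketch of a TrueTime-style API. The class names, the drift rate, and the base uncertainty are illustrative assumptions, not Google's actual values:

```python
import time
from dataclasses import dataclass

@dataclass
class TTInterval:
    earliest: float  # true time is guaranteed to be >= earliest
    latest: float    # true time is guaranteed to be <= latest

class TrueTime:
    """Hypothetical sketch of a TrueTime-style API: now() returns an
    interval whose width (epsilon) grows with drift since the last sync."""

    DRIFT_RATE = 200e-6   # assumed worst-case drift: 200 microseconds/second
    BASE_EPSILON = 0.001  # assumed ~1 ms uncertainty right after a sync

    def __init__(self):
        self._last_sync = time.time()  # pretend we just synced

    def now(self) -> TTInterval:
        t = time.time()
        # The sawtooth: base uncertainty plus drift accumulated since sync.
        epsilon = self.BASE_EPSILON + self.DRIFT_RATE * (t - self._last_sync)
        return TTInterval(t - epsilon, t + epsilon)

    def after(self, ts: float) -> bool:
        # True only when ts is definitely in the past.
        return self.now().earliest > ts

    def before(self, ts: float) -> bool:
        # True only when ts is definitely in the future.
        return self.now().latest < ts
```

The two predicates are the whole point: after(ts) never returns true while ts could still be the current time, no matter how wrong the local clock is within its bound.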

How do you get 1-7ms uncertainty when general-internet NTP typically delivers 10-100ms (and even a well-tuned datacenter NTP setup is more like 1-10ms)? You don't use NTP. Each Google data centre has a set of time masters. Some are equipped with GPS receivers. Others have atomic clocks — specifically, chip-scale caesium or rubidium oscillators. The GPS masters synchronize to GPS time (which itself comes from atomic clocks on satellites). The atomic clock masters free-run, providing a cross-check against GPS failures or spoofing. Every machine in the data centre polls multiple time masters and applies a variant of Marzullo's algorithm to detect and reject liars before synchronising the local clock — the Spanner paper cites Marzullo by name for exactly this step.
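Marzullo's algorithm itself is small enough to sketch. Given the intervals reported by several time masters, it finds the smallest interval consistent with the largest number of sources, which is what lets a machine outvote a lying master. The function name and structure here are my own illustration:

```python
def marzullo(intervals):
    """Sketch of Marzullo's algorithm: given (lo, hi) time intervals from
    several masters, return the smallest interval consistent with the
    largest number of sources, plus that source count."""
    # Turn each endpoint into an event: -1 opens an interval, +1 closes it.
    events = []
    for lo, hi in intervals:
        events.append((lo, -1))
        events.append((hi, +1))
    events.sort()  # opens sort before closes at the same point

    best_count = count = 0
    best = None
    for i, (point, kind) in enumerate(events):
        count -= kind  # entering an interval raises the overlap count
        if count > best_count:
            best_count = count
            best = (point, events[i + 1][0])  # overlap holds until next endpoint
    return best, best_count
```

With three honest masters agreeing on roughly 10-13 and one liar claiming 500-600, the result is the 11-12 overlap supported by three sources; the liar is simply outvoted.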
I want to be precise about this. Google put atomic clocks — instruments that measure the resonance frequency of caesium or rubidium atoms to keep time accurate to nanoseconds per day — inside their data centres so that a database could know what time it was to within a few milliseconds. That's a sentence that sounds like a joke and is completely real.
The Trick: Commit Wait
Having a small uncertainty window is nice, but it's useless without a way to turn it into transaction ordering. Here's where the insight lives.
The rule is simple. When a transaction commits, Spanner:
- Picks a commit timestamp s within the current TrueTime interval.
- Waits until TT.after(s) returns true — meaning the uncertainty window has fully passed.
- Only then makes the transaction's effects visible.
This wait is called commit wait, and it's typically the width of epsilon — 1 to 7 milliseconds.
Why does this work? Because after the wait, the commit timestamp s is guaranteed to be in the past. Not "probably" in the past. Guaranteed. Any future transaction, on any machine in any data centre on the planet, will get a timestamp strictly greater than s. There's no ambiguity, no overlap, no race condition. The physical passage of time — the actual rotation of the Earth, the actual ticking of caesium atoms — becomes the synchronization mechanism.
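The rule fits in a few lines of code. This is a self-contained sketch with a stand-in for TrueTime using a fixed, assumed epsilon of 5ms; all names are hypothetical:

```python
import time

EPSILON = 0.005  # assumed uncertainty bound, ~5 ms

def tt_now():
    """Stand-in for TrueTime: an (earliest, latest) interval."""
    t = time.time()
    return (t - EPSILON, t + EPSILON)

def tt_after(ts):
    """True once ts is guaranteed to be in the past."""
    earliest, _ = tt_now()
    return earliest > ts

def commit(apply_effects):
    # 1. Pick s within the current uncertainty interval (the latest
    #    bound, so s can never be earlier than the true time).
    _, s = tt_now()
    # 2. Commit wait: block until the whole uncertainty window has passed.
    while not tt_after(s):
        time.sleep(0.0001)
    # 3. Only now make the transaction's effects visible. Any later
    #    transaction, anywhere, will pick a timestamp strictly > s.
    apply_effects()
    return s

s1 = commit(lambda: None)
s2 = commit(lambda: None)
assert s1 < s2  # real-time commit order matches timestamp order
```

The busy-wait loop is the entire coordination mechanism: no messages, no locks across machines, just a few milliseconds of patience.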
Think about what this replaces. Traditional distributed transactions use two-phase commit (2PC) with a coordinator that everyone talks to. That requires network round trips between continents — 100-300ms of latency for cross-continental coordination. Spanner's commit wait costs 1-7ms. The atomic clocks replaced network coordination with waiting. And waiting 7ms is a lot cheaper than round-tripping to another continent.

The formal property Spanner achieves is called external consistency. It's stronger than serializability. It means: if transaction T1 commits before transaction T2 starts (in real, wall-clock time, as observed by any external observer), then T1's commit timestamp is less than T2's commit timestamp. The database's internal ordering matches the external reality. The universe's clock and the database's clock agree.
I've worked with distributed systems for years, and this is the part that stopped me cold when I first read the paper. They didn't solve the clock synchronization problem. They bounded it, using hardware, and then they waited out the bound. The uncertainty is still there. They don't pretend clocks are perfectly synchronized. They accept the uncertainty and turn it into a design parameter: the cost of a commit is "wait epsilon milliseconds." Make epsilon smaller (buy better clocks), and commits get faster. It's a hardware-software co-design in the most literal sense.
There's a principle hiding in here that I think about a lot: if you can't solve the problem, change the problem. "Synchronize clocks across a global fleet to within zero error" is impossible — relativity and network latency say so. "Synchronize clocks within a known, small, bounded error" is engineerable. Spanner didn't out-physics physics. They redefined the problem from "make epsilon zero" to "make epsilon small enough that waiting it out is acceptable." Once the problem is reshaped, hardware and software meet in the middle and the impossible becomes a tunable parameter. I've found that the original problem statement is often the trap. The escape is rewriting the question.
The Full Stack: Paxos, Directories, and SQL
TrueTime is the foundation, but Spanner is a complete database, not just a clock trick.
Paxos Groups
Spanner's data is divided into splits (shards), and each split is replicated across data centres using Paxos — not Raft, which didn't exist yet when Spanner was built. Each split forms a Paxos group with replicas in typically 3-5 data centres. One replica is the leader: it handles writes and read-write transactions, while snapshot reads can be served by any replica that is sufficiently up to date. The leader uses TrueTime to assign commit timestamps, and Paxos ensures the replicas agree on the order of operations.
When Raft was published in 2014, two years after the Spanner paper, many systems adopted it for its clarity. Spanner stuck with Paxos. Google has more experience with Paxos than possibly anyone on Earth — their Chubby lock service was running in production internally for years before they published the OSDI 2006 paper describing it, making it one of the first large-scale production Paxos implementations anywhere. They've had nearly two decades to work out the implementation subtleties that make Paxos notoriously difficult for everyone else.
From Key-Value to SQL
Spanner started as a key-value store with strong consistency guarantees. That was already impressive. But then something happened that is, in retrospect, inevitable: the team that built Google's ad system needed a relational database.
F1 was the name of Google's ad backend database. Before Spanner, F1 ran on MySQL with manual sharding. This was roughly as pleasant as it sounds. Schema changes required careful coordination across hundreds of shards. Queries that spanned shards were an exercise in suffering. The F1 team looked at Spanner's consistency guarantees and said: "We want this, but with SQL."
The Spanner team added SQL support. Not a SQL-like query language — actual SQL, with schemas, secondary indexes, interleaved tables, and a query optimizer. The resulting system was a globally distributed, externally consistent, SQL-speaking relational database.
I want to emphasize how unusual that combination is. "Globally distributed" usually means "give up transactions." "SQL" usually means "single machine or maybe a primary-replica setup." "Externally consistent" usually means "slow." Spanner is all three simultaneously, and it runs Google's ad money. The F1 paper, published in 2013, describes the migration from MySQL to Spanner and it reads like a war story with a happy ending.

Why You Probably Don't Need This
Spanner solves a problem that most companies do not have. If your data fits in one region — and it almost certainly does — Postgres with read replicas is a proven, well-understood solution that will serve you for years. Maybe decades. My experience with database architecture is that roughly 95% of the teams who evaluate globally distributed databases would be perfectly served by a single-region Postgres instance with a good backup strategy.
The companies that genuinely need Spanner-grade global consistency form a short list: Google, large banks processing cross-border transactions, multinational financial exchanges, global gaming platforms with real-money economies, and maybe a dozen others. These are organizations where a user in Tokyo and a user in Frankfurt must see the same data right now, and where "eventually" means lawsuits or monetary loss.
Cloud Spanner, the managed service Google launched in 2017, charges accordingly. It's not cheap. You're paying for atomic clocks and GPS receivers in Google's data centres, replicated Paxos groups across continents, and the engineering of a team that's been working on this problem for over fifteen years. If you need it, it's worth every cent. If you don't need it, you're buying a Formula 1 car to drive to the grocery store.
The Offspring: CockroachDB, YugabyteDB, TiDB
The Spanner paper didn't stay inside Google. It inspired a generation of databases that wanted the same properties without requiring you to run your own atomic clock infrastructure.
CockroachDB is the most direct descendant. Spencer Kimball and Peter Mattis — both better known in another life as the original authors of GIMP, both ex-Googlers who worked on Colossus, the second-generation successor to GFS — teamed up with Ben Darnell (ex-Google Reader, then Tornado/FriendFeed) and started CockroachDB in 2014 with an explicit goal: build Spanner for everyone. They made two major substitutions. Raft replaced Paxos for consensus — same correctness, better understandability. And NTP replaced TrueTime for clock synchronization.
That second substitution is the interesting one. CockroachDB can't put atomic clocks in your data centre. So it uses NTP, which means uncertainty bounds of tens of milliseconds in a data centre, potentially hundreds across the internet, instead of 1-7ms. Commit wait over the full NTP uncertainty would be catastrophically slow, so CockroachDB doesn't wait. Instead it uses hybrid logical clocks with a configurable maximum clock offset: a transaction that reads a value timestamped within its uncertainty window restarts at a higher timestamp, and a node whose clock drifts more than the maximum offset from the cluster shuts itself down rather than risk incorrect ordering. It's a pragmatic compromise — not as strong as TrueTime's hardware guarantee, but usable on commodity infrastructure.
YugabyteDB takes a similar approach, using Raft and a hybrid logical clock that combines physical timestamps with logical counters. It targets Spanner-compatible semantics with a focus on PostgreSQL compatibility — you can often point existing Postgres applications at Yugabyte and have them work.
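A hybrid logical clock is simple enough to sketch. This follows the general HLC construction from the literature (a physical timestamp plus a logical counter that breaks ties and preserves causality); it illustrates the idea, not YugabyteDB's or CockroachDB's actual implementation:

```python
import time

class HybridLogicalClock:
    """Sketch of a hybrid logical clock: timestamps are (wall, logical)
    pairs that always increase and never order a receive before its send,
    even when physical clocks disagree."""

    def __init__(self):
        self.wall = 0     # highest physical time observed so far
        self.logical = 0  # tie-breaker when physical time stalls

    def now(self):
        """Timestamp a local or send event."""
        pt = int(time.time() * 1_000_000)  # physical time, microseconds
        if pt > self.wall:
            self.wall, self.logical = pt, 0
        else:
            self.logical += 1  # physical clock hasn't advanced; count up
        return (self.wall, self.logical)

    def update(self, remote):
        """Timestamp a receive event, merging a remote (wall, logical)."""
        pt = int(time.time() * 1_000_000)
        rw, rl = remote
        if pt > self.wall and pt > rw:
            self.wall, self.logical = pt, 0
        elif rw > self.wall:
            self.wall, self.logical = rw, rl + 1
        elif rw == self.wall:
            self.logical = max(self.logical, rl) + 1
        else:
            self.logical += 1
        return (self.wall, self.logical)
```

Compared lexicographically, these timestamps give you causal ordering for free; what they cannot give you is TrueTime's guarantee about real-time order between unrelated transactions.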
TiDB chose a different path. Instead of trying to approximate TrueTime, it uses a centralized timestamp oracle — a single service that hands out monotonically increasing timestamps. This avoids clock uncertainty entirely but reintroduces a bottleneck. The TiDB team mitigates this with batching and regional TSO instances, but it's a fundamentally different tradeoff than Spanner's hardware-based approach. TiDB uses Raft for replication (via TiKV) and offers MySQL compatibility.
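The oracle approach is the simplest of the three to sketch, which is part of its appeal. This toy version shows the core idea plus the batching mitigation; the names and API are invented for illustration:

```python
import threading

class TimestampOracle:
    """Sketch of a TiDB-style centralized timestamp oracle (TSO):
    one service hands out strictly increasing timestamps, so there is
    no clock uncertainty and no commit wait -- but every transaction
    must reach this single service."""

    def __init__(self):
        self._lock = threading.Lock()
        self._next = 1

    def get_timestamps(self, n: int = 1) -> range:
        """Batching: one round trip fetches n timestamps at once,
        which is how a real TSO amortizes the bottleneck."""
        with self._lock:
            start = self._next
            self._next += n
            return range(start, start + n)
```

Correctness is trivial here; the engineering problem is entirely about availability and latency, which is why TiDB layers batching and regional instances on top.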
Each of these is a different answer to the same question: how do you get Spanner's properties without Spanner's infrastructure?

The Jepsen Factor
No discussion of distributed database claims is complete without mentioning Kyle Kingsbury's Jepsen project. Cloud Spanner was tested in 2017, and the result was unusual: Kingsbury found it did what it claimed. The external consistency guarantees held under network partitions, clock skew, and node failures. His write-up is uncharacteristically positive for a Jepsen report.
CockroachDB has also been Jepsen-tested multiple times. Early versions had issues with clock skew handling. Later versions tightened the guarantees. The Jepsen reports for CockroachDB are required reading if you're evaluating it — they show exactly where the NTP-based approach has weaker guarantees than TrueTime.
The Paper That Changed the Conversation
Before Spanner, the distributed systems community had largely accepted the CAP theorem as a law of nature that forced you to choose between consistency and availability. Dynamo's 2007 paper had made eventual consistency fashionable. The NoSQL movement was in full swing, and "give up consistency" was the default recommendation for anything that needed to scale.
Spanner showed that the choice isn't binary. The CAP theorem is real — you can't have consistency and availability during a network partition. But Spanner doesn't violate CAP. It's a CP system. During a partition, it chooses consistency: the minority partition stops serving writes. The insight is that if you make partitions rare enough (redundant network links, multiple paths between data centres) and make your consistency mechanism fast enough (TrueTime's 1-7ms commit wait instead of cross-continent coordination), the practical cost of choosing consistency becomes negligible.
The CAP theorem is a constraint, not a prison. Google demonstrated that with enough engineering — and enough atomic clocks — you can operate so close to the boundary that the constraint barely matters in practice. You still can't violate CAP. But you can make it irrelevant for all but the most catastrophic failure scenarios.
That's the lasting lesson of Spanner. Not the specific architecture, which is Google-scale and Google-budget. The idea: that the tradeoffs we accept in distributed systems are often choices disguised as laws. NTP gives you 100ms of uncertainty? That's not physics, that's infrastructure. Invest in better clocks and the uncertainty shrinks. Cross-continent coordination takes 200ms? That's not fundamental, that's because you're coordinating over the network. Replace network coordination with local waiting and the cost drops by an order of magnitude.
The constraints are real. The specific numbers are engineering decisions.
Why It Matters Even If You Never Use It
If you've followed this series through synchronization primitives, the CAP theorem, and Raft, Spanner ties several threads together.
Synchronization primitives coordinate threads on a single machine using shared memory and atomic operations — hardware mechanisms that provide ordering guarantees at nanosecond scale. TrueTime is the same idea, scaled up. Instead of atomic compare-and-swap on a cache line, it's atomic clocks providing time bounds at millisecond scale. The principle is identical: use hardware to establish ordering that software alone cannot guarantee.
The CAP theorem tells you what's impossible. Spanner shows you how close to the boundary you can get. Knowing both — the theoretical limit and the practical engineering — is what separates someone who designs systems from someone who copies architectures off blog posts.
And Raft — the algorithm CockroachDB chose over Paxos — exists because Ongaro wanted consensus to be understandable. Spanner uses Paxos because Google started before Raft existed and had already paid the implementation cost. Both are correct. The choice between them is human, not mathematical.
The next time someone tells you "you have to give up consistency for scale," remember that Google put atomic clocks in their data centres and waited seven milliseconds. The tradeoffs are real. The resignation is optional.
Further Reading
- Post 29: CAP Theorem — The theorem Spanner lives within, not the theorem Spanner violates.
- Post 31: Raft Consensus — The algorithm CockroachDB chose over Paxos, and why.
- Post 19: Synchronization Primitives — The single-machine version of the ordering problem TrueTime solves at global scale.
- Post 15: RAM — Hardware-level timing: the same nanosecond precision that matters for cache lines matters for atomic clocks.
- Corbett et al., "Spanner: Google's Globally-Distributed Database" (2012) — The original paper. Read section 3 on TrueTime first — it's the key insight.
- Shute et al., "F1: A Distributed SQL Database That Scales" (2013) — The paper about Google's ad system migrating to Spanner. The war story that proved Spanner worked in production.
- CockroachDB Design Doc: "Living Without Atomic Clocks" — CockroachDB's explanation of how they approximate TrueTime with NTP. Honest about the tradeoffs.
- Jepsen: Cloud Spanner (2017) — Kyle Kingsbury's analysis. Spoiler: Spanner passes.
- Brewer, "Spanner, TrueTime and the CAP Theorem" (2017) — Eric Brewer (yes, the CAP theorem Brewer) explaining how Spanner relates to his own theorem.
Naz Quadri once seriously considered whether his homelab budget could stretch to a rubidium oscillator before his wife intervened. He blogs at nazquadri.dev. Rabbit holes all the way down 🐇🕳️.