
What Actually Happens When You Query a Database

Pages, Trees, and the Ghost Versions of Your Rows

Reading time: ~17 minutes


You wrote SELECT * FROM users WHERE email = '[email protected]'. PostgreSQL returned the row in 0.3 milliseconds. There are 4 million rows in that table. The database didn't scan all 4 million. It didn't even scan 1% of them. It walked a tree, followed three pointers, and landed on your row.

How?

I spent the first few years of my career treating the database like a vending machine. Put SQL in, get rows out. The internal mechanics were somebody else's problem. Then we restructured the main trading database: different tables, same conceptual data. Query performance changed with it. A lot, and not in a good way.

The database is not magic. It's a carefully engineered system of pages, trees, and transaction bookkeeping, built on top of the same filesystem and page cache we covered in post 04. Everything it does is shaped by one constraint: disk I/O is slow, and the goal is to do as little of it as possible.


The Page: The Atom of Storage

A database never reads a single row from disk. It reads a page.

PostgreSQL uses 8KB pages. MySQL's InnoDB uses 16KB. SQLite uses 4KB by default. The number varies, but the principle doesn't: the page is the minimum unit of I/O. Ask for one row, get the whole page it lives in. Ask for a row on a different page, that's another I/O operation. Two rows on the same page? One read gets both.

This is not the database being lazy. It's physics. The storage hardware underneath — whether spinning disk or NAND flash — transfers data in blocks. The filesystem reads in blocks. The kernel page cache (post 04) manages memory in 4KB pages. The database aligns its own pages to this reality.

PostgreSQL 8KB page layout — header, item pointers growing down, free space, tuple data growing up

A PostgreSQL page has a fixed structure: a 24-byte header at the top with the log sequence number (LSN) of the last WAL record that touched the page, a checksum, and the free space pointers. Then an array of item pointers growing downward; each one is [offset, length, flags] for a tuple. Free space sits in the middle. Tuple data is packed from the bottom of the page upward, each tuple with its own header. The very bottom is reserved for "special space": index-specific data when the page belongs to an index instead of a heap.

The item pointer indirection is clever. When the database needs to reference a specific row, it uses a tuple ID (TID): a page number plus an offset index into the item pointer array. The item pointer then says where the actual bytes live within the page. This means the database can rearrange tuple data within a page — compacting free space, for instance — without invalidating every external reference to those rows. The TID stays the same; only the item pointer's target changes.
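A toy sketch of that indirection, nothing like real PostgreSQL code: a page whose item pointer array resolves stable item indexes to byte offsets, so tuple data can be compacted without invalidating a single TID.

```python
class Page:
    """Toy heap page: stable item indexes, movable tuple bytes."""

    def __init__(self):
        self.line_pointers = []   # item index -> (offset, length) or None
        self.data = bytearray()   # tuple bytes

    def add_tuple(self, payload: bytes) -> int:
        """Append a tuple; the returned item index is the stable half of a TID."""
        self.line_pointers.append((len(self.data), len(payload)))
        self.data += payload
        return len(self.line_pointers) - 1

    def get_tuple(self, item: int) -> bytes:
        off, length = self.line_pointers[item]
        return bytes(self.data[off:off + length])

    def delete_tuple(self, item: int):
        self.line_pointers[item] = None   # leaves a hole in self.data

    def compact(self):
        """Squeeze out holes. Item indexes (and therefore TIDs) never change;
        only the offsets stored inside the line pointers move."""
        new_data = bytearray()
        for i, lp in enumerate(self.line_pointers):
            if lp is None:
                continue
            off, length = lp
            self.line_pointers[i] = (len(new_data), length)
            new_data += self.data[off:off + length]
        self.data = new_data

page = Page()
alice = page.add_tuple(b"alice")
bob = page.add_tuple(b"bob")
carol = page.add_tuple(b"carol")
page.delete_tuple(bob)
page.compact()
print(page.get_tuple(carol))   # b'carol' — same item index, new offset
```

External references survive the compaction untouched, which is exactly why the extra hop through the item pointer array pays for itself.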

Heap Files: The Pile of Pages

A table in PostgreSQL is a heap file. That's not a data structures heap — it's the other kind. A pile. Rows go wherever there's space. No sorting. No ordering guarantees. You INSERT a row, and PostgreSQL finds a page with enough free space and shoves it in. (This is a PostgreSQL-specific design choice. InnoDB, which I'll get to later, organises its tables as B-trees keyed by primary key, so rows are physically sorted and there's no separate heap at all. Same problem, different answer.)

The free space map (FSM) tracks which pages have room. The visibility map (VM) tracks which pages contain only tuples visible to all current transactions. These are auxiliary files that live alongside the main heap file on the filesystem — the same filesystem layer we looked at in post 17.

This means a full table scan reads every page in the heap, sequentially. For 4 million rows at maybe 5 rows per page, that's 800,000 pages. At 8KB each, about 6.5 gigabytes of I/O. Even on fast NVMe, that's not 0.3 milliseconds.
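A quick sanity check of that arithmetic (5 rows per page is the assumption from the text; 800,000 pages of 8,192 bytes comes to roughly 6.5 GB):

```python
rows = 4_000_000
rows_per_page = 5        # assumption from the text: fairly wide rows
page_size = 8192         # PostgreSQL's 8KB page

pages = rows // rows_per_page
print(pages)                       # 800000 pages in the heap
print(pages * page_size / 1e9)     # ≈ 6.55 GB of sequential I/O for a full scan
```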

What a Row Actually Looks Like

Each tuple in the heap has a header before the actual column data:

Tuple Header (23 bytes minimum):
  t_xmin      (4 bytes) — transaction ID that created this row
  t_xmax      (4 bytes) — transaction ID that deleted/updated this row
  t_cid       (4 bytes) — command ID within the transaction
  t_ctid      (6 bytes) — current TID (self, or pointer to newer version)
  t_infomask  (2 bytes) — status flags (committed, aborted, has nulls, etc.)
  t_infomask2 (2 bytes) — more flags + number of attributes
  t_hoff      (1 byte)  — offset to the actual data

  [null bitmap — 1 bit per column, present if this tuple contains any NULLs]
  [padding for alignment]
  [column data — fixed-width fields inline, variable-width fields with length prefix]

Those transaction IDs — t_xmin and t_xmax — are the heart of MVCC, and we'll come back to them. For now, notice that every single row carries its own birth certificate and death certificate. That's overhead. A table with ten million rows has ten million copies of this bookkeeping. It's the price of concurrency without locks.
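You can check those field widths with Python's struct module. This is a sketch of the flat field list above, not the real C struct, but the sizes add up the same way:

```python
import struct

# '<' disables Python's own alignment padding so the widths sum exactly.
HEAP_TUPLE_HEADER = struct.Struct(
    "<"
    "I"    # t_xmin      (4) — creating transaction
    "I"    # t_xmax      (4) — deleting/updating transaction
    "I"    # t_cid       (4) — command id within the transaction
    "I H"  # t_ctid      (6) — (block number, item offset)
    "H"    # t_infomask  (2) — status flags
    "H"    # t_infomask2 (2) — more flags + attribute count
    "B"    # t_hoff      (1) — offset to the column data
)

print(HEAP_TUPLE_HEADER.size)   # 23 — the "23 bytes minimum" from above
```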


The Buffer Pool: A Database Within a Database

The database doesn't trust the operating system's page cache. This should set off alarms.

A general rule I've come to trust: Linux kernel primitives are almost always the best general-purpose solution to a problem. Not because the kernel team is infallible, but because they're under constant arbitrage pressure. If a kernel primitive is leaving performance on the table for a common workload, someone files a patch, the patch lands, and the gap closes. The kernel has been refining read(), write(), the page cache, and the scheduler for thirty years across every workload imaginable. If your application can beat the kernel at general I/O, you have a bug, or a Linux contribution, not an optimisation. The arbitrage will be closed, now or soon.

The exceptions are the cases that prove the rule. They're the workloads where you know something the kernel can't know. Database servers are one of them. The kernel's page cache uses LRU because LRU is a reasonable default across every program on the system. It can't know that a B-tree root page is accessed on every single index lookup and should never be evicted. It can't know that a sequential scan will touch pages once and never again, and that promoting them past your hot index pages is exactly the wrong thing to do. The database knows. The database has workload-specific knowledge that the kernel cannot have because the kernel has to be general-purpose.
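A toy illustration of that workload knowledge in action. Normal reads go through an LRU cache; reads flagged as part of a big sequential scan are confined to a small ring so they can never evict hot pages. (PostgreSQL really does give large scans a small ring buffer instead of letting them churn shared_buffers, though its actual eviction algorithm is clock-sweep, not LRU; this class is just a sketch of the idea.)

```python
from collections import OrderedDict

class BufferPool:
    """Toy scan-resistant cache: LRU for hot pages, a tiny ring for scans."""

    def __init__(self, capacity, ring_size=2):
        self.lru, self.capacity = OrderedDict(), capacity
        self.ring, self.ring_size = OrderedDict(), ring_size

    def read(self, page_id, fetch, sequential=False):
        if page_id in self.lru:
            self.lru.move_to_end(page_id)      # cache hit: promote
            return self.lru[page_id]
        if page_id in self.ring:
            return self.ring[page_id]
        page = fetch(page_id)                  # miss: go to "disk"
        cache, limit = ((self.ring, self.ring_size) if sequential
                        else (self.lru, self.capacity))
        cache[page_id] = page
        if len(cache) > limit:
            cache.popitem(last=False)          # evict only within that cache
        return page

pool = BufferPool(capacity=3)
for hot in (1, 2, 3):                          # B-tree root + internal pages
    pool.read(hot, lambda p: f"page-{p}")
for p in range(100, 200):                      # a 100-page sequential scan
    pool.read(p, lambda p: f"page-{p}", sequential=True)
print(sorted(pool.lru))                        # [1, 2, 3] — hot pages survived
```

A plain LRU would have flushed the three hot pages a hundred times over. The kernel can't make this distinction; the database can.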

So PostgreSQL builds its own page cache. shared_buffers is a chunk of RAM (typically 25% of system memory) dedicated to caching pages. MySQL has its InnoDB buffer pool. When PostgreSQL needs a page, it checks its own buffer pool first. Only on a miss does it go to the filesystem, and even then the OS page cache (post 04) might serve it from RAM before any disk I/O happens.

Other databases take this further. InnoDB is typically run with O_DIRECT on Linux (recent MySQL releases make it the default flush method), writing straight to disk and managing all caching itself (the flag we discussed in post 04), to avoid the double-cache problem where the same page sits in both the buffer pool and the OS page cache, wasting RAM that could hold more useful data. PostgreSQL deliberately does not go this route. It relies on the OS page cache as a second layer and treats its own shared_buffers as the hot cache — a design choice the Postgres team has defended for years, and one of the reasons tuning shared_buffers past about 25% of system RAM stops helping (you're just duplicating pages the kernel already has).

If you find yourself wanting to bypass kernel primitives, ask yourself: do I have workload knowledge the kernel can't have? If yes, you're probably justified. If no, you're probably wrong.


B-Trees: The Data Structure That Makes Databases Work

Scanning 800,000 pages is O(n). We need O(log n). That's what a B-tree index gives you.

I'm going to spend more time on B-trees than anything else in this post, because understanding them changed how I think about databases. Every other index type is a footnote compared to this.

The Shape

A B-tree is a balanced tree where every node is one page. The root node is a single page at the top. Internal nodes contain keys and pointers to child pages. Leaf nodes contain keys and pointers to heap tuples (the actual row locations).

The critical insight: a binary tree node holds one key and has two children. A B-tree node holds hundreds of keys and has hundreds of children. As many as will fit in one page.

A page is 8KB. A key might be an 8-byte bigint plus a 6-byte TID pointer plus some overhead — call it 20 bytes per entry. In theory that's around 400 entries per page. In practice PostgreSQL's B-tree pages also carry a header, an item pointer array, and variable-width encoding overhead, so real fan-out for integer keys lands closer to 200–350 per internal node. The exact number doesn't matter; what matters is that it's hundreds, not two. Take the 400 number as the rounded-up textbook version and the tree shape follows:
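The arithmetic, as a sketch (400 is the rounded textbook fan-out from above):

```python
def btree_levels(rows: int, fanout: int) -> int:
    """Page reads from root to leaf for a tree indexing `rows` entries."""
    levels, reach = 1, fanout        # a single leaf page reaches `fanout` rows
    while reach < rows:
        reach *= fanout              # each extra level multiplies the reach
        levels += 1
    return levels

print(btree_levels(64_000_000, 400))  # 3 — 400^3 = 64M leaf entries
print(btree_levels(4_000_000, 400))   # 3 — still just three levels
print(btree_levels(4_000_000, 2))     # 22 — a binary tree for the same rows
```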

Three levels. 64 million rows indexed. Three page reads to find any one of them. For 4 million rows? Still three levels. Three I/O operations. And the root page is always cached in the buffer pool, so really it's two I/O operations.

A binary tree indexing 4 million rows would need log2(4,000,000) = 22 levels. Twenty-two I/O operations versus two or three. That's not an optimization. That's a different universe.

B-tree index structure with three levels fanning out to heap pages

How a Lookup Works

SELECT * FROM users WHERE email = '[email protected]' with a B-tree index on email:

  1. Read the root page. Binary search within the page to find which child pointer covers the range containing [email protected]. This is an in-memory comparison — the page is already in the buffer pool.
  2. Follow the child pointer. Read that internal page. Binary search again. Find the next child pointer.
  3. Follow it to a leaf page. Binary search the leaf entries. Find the key [email protected] with TID (42, 7) — page 42, item 7.
  4. Read heap page 42. Follow item pointer 7 to the tuple data. Return the row.

Four page accesses. Root is cached, so three actual reads. On NVMe with a warm buffer pool, most of those are memory lookups, not disk reads. That's your 0.3 milliseconds.

The binary search within each page is pure CPU work — comparing keys in RAM. The expensive part is fetching the page in the first place. The B-tree minimizes page fetches by packing hundreds of keys into each node. The data structure is optimized for disk, not RAM. A binary tree is optimal for RAM (one comparison per node, minimal work). A B-tree is optimal for disk (one I/O per level, minimal I/O). Different hardware, different optimal structure.
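The same walk as a toy sketch. Each dict stands in for one page, so every hop in `lookup` would be one 8KB page access in the real thing; the emails, keys, and TIDs are all made up:

```python
from bisect import bisect_right

# Two-level toy B-tree: one root "page" and two leaf "pages".
leaf_a = {"leaf": True, "entries": [("[email protected]", (42, 7)),
                                    ("[email protected]", (88, 2))]}
leaf_b = {"leaf": True, "entries": [("[email protected]", (13, 5))]}
root = {"leaf": False, "keys": ["m"], "children": [leaf_a, leaf_b]}

def lookup(page, key, pages_read):
    pages_read.append(page)                  # one page access per hop
    if page["leaf"]:
        for k, tid in page["entries"]:       # a real leaf binary-searches too
            if k == key:
                return tid
        return None
    i = bisect_right(page["keys"], key)      # in-page binary search, pure CPU
    return lookup(page["children"][i], key, pages_read)

reads = []
print(lookup(root, "[email protected]", reads))  # (42, 7): heap page 42, item 7
print(len(reads))                               # 2 page accesses (root + leaf)
```

The returned TID is exactly what step 4 hands to the heap: read page 42, follow item pointer 7.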

Clustered vs Non-Clustered: The Architectural Fork

InnoDB and PostgreSQL made fundamentally different choices here.

InnoDB stores the actual row data inside the B-tree leaf nodes. The primary key index IS the table. This is a clustered index. There's no separate heap file. When you traverse the B-tree to a leaf, the data is right there. No extra hop.

Secondary indexes in InnoDB store the primary key value in their leaf nodes, not a physical row pointer. Looking up a row by secondary index means: traverse the secondary B-tree to find the primary key, then traverse the primary B-tree to find the row. Two tree traversals. This is why InnoDB primary key choice matters so much — a fat primary key bloats every secondary index.

PostgreSQL stores row data in the heap, and every index (including the "primary" key index) points to heap TIDs. There's no clustered index. The heap is a pile. Every index lookup ends with a heap fetch.

The trade-off: InnoDB is faster for primary key lookups (data is right in the leaf) but secondary index lookups cost more (two traversals). PostgreSQL treats all indexes equally but always pays the heap fetch cost.
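A toy contrast, with plain dicts standing in for the B-trees (all names and data invented):

```python
# InnoDB-style: the primary-key B-tree IS the table; secondary indexes
# store primary key values in their leaves.
innodb_primary = {7: {"id": 7, "email": "[email protected]", "name": "Alice"}}
innodb_sec_email = {"[email protected]": 7}          # email -> primary key

def innodb_by_email(email):
    pk = innodb_sec_email[email]    # traversal 1: secondary index -> PK
    return innodb_primary[pk]       # traversal 2: primary index -> row

# PostgreSQL-style: every index maps to a heap TID; the heap holds the rows.
heap = {(42, 7): {"id": 7, "email": "[email protected]", "name": "Alice"}}
pg_email_idx = {"[email protected]": (42, 7)}        # email -> TID

def pg_by_email(email):
    return heap[pg_email_idx[email]]   # one traversal + one heap fetch

print(innodb_by_email("[email protected]")["name"])  # Alice
print(pg_by_email("[email protected]")["name"])      # Alice
```

Same answer either way; the difference is which structures you walk to get there, and therefore which lookups you've made cheap.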

I have opinions about which is better, and they depend entirely on the workload. For OLTP with heavy primary-key access, InnoDB wins. For analytical workloads with many different indexes, PostgreSQL's uniformity is cleaner.

Covering Indexes: Skipping the Heap Entirely

If the index contains every column the query needs, the database never touches the heap. This is an index-only scan.

-- If you have an index on (email, name):
CREATE INDEX idx_users_email_name ON users (email, name);

-- This query can be answered from the index alone:
SELECT name FROM users WHERE email = '[email protected]';
-- Index has both email (for lookup) and name (for output). No heap access.

PostgreSQL calls this an "Index Only Scan" in EXPLAIN output. The visibility map tells it whether all tuples on the heap page are visible — if so, it doesn't even need to check the heap for MVCC purposes.


Other Index Types (The Abridged Version)

B-trees handle equality and range queries. Not everything is equality and range.

Hash indexes give O(1) lookup for equality. No range support — you can't ask a hash table for "all emails between A and M." PostgreSQL didn't even make hash indexes crash-safe until version 10 (2017). Before that, they weren't written to WAL. I've never used one in production. B-trees are fast enough for equality and also handle ranges.

GIN (Generalized Inverted Index) is for containment queries. Full-text search, JSONB @> operators, array overlap. It's an inverted index: given a value, which rows contain it? Think of a book index — you look up "B-tree" and get page numbers 47, 112, 238. GIN does that for every token in a text column or every key in a JSONB document. The data structure is, ironically, a B-tree of B-trees.

GiST handles spatial and geometric data. PostGIS uses it. If you need "find all restaurants within 5km," GiST is your index. R-trees under the hood. I won't pretend I use this often.

BRIN (Block Range Index) is the most elegant. It stores the minimum and maximum value for each range of consecutive heap pages. If the data is physically sorted (timestamps in an append-only log table, for instance), the BRIN index is tiny — a few pages covering millions of rows. The trade-off: it only helps if the data's physical order matches the query. If timestamps are scattered randomly across pages, BRIN is useless.
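A toy BRIN to make the trade-off concrete: store (min, max) per range of consecutive pages, then prune every range whose bounds can't contain the predicate value. The page contents are invented; real BRIN summaries cover 128 heap pages per range by default.

```python
def build_brin(pages, pages_per_range=2):
    """One (start_page, min, max) summary per range of consecutive pages."""
    summaries = []
    for i in range(0, len(pages), pages_per_range):
        values = [v for page in pages[i:i + pages_per_range] for v in page]
        summaries.append((i, min(values), max(values)))
    return summaries

def candidate_pages(summaries, value, pages_per_range=2):
    """Pages that *might* contain `value`; everything else is pruned."""
    out = []
    for start, lo, hi in summaries:
        if lo <= value <= hi:
            out.extend(range(start, start + pages_per_range))
    return out

# Physically sorted data (append-only timestamps): pruning works.
sorted_pages = [[1, 2], [3, 4], [5, 6], [7, 8]]
print(candidate_pages(build_brin(sorted_pages), 6))    # [2, 3] — half pruned

# Same values scattered: every range's min/max spans everything.
shuffled_pages = [[1, 8], [3, 6], [5, 2], [7, 4]]
print(candidate_pages(build_brin(shuffled_pages), 6))  # [0, 1, 2, 3] — useless
```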

Comparison of four index types: B-tree, Hash, GIN, and BRIN


MVCC: How Reads Don't Block Writes

If a read locks the row, writes wait. If a write locks the row, reads wait. Both are terrible for throughput. This is the fundamental problem MVCC solves.

Remember those t_xmin and t_xmax fields in every tuple header? Here's where they earn their bytes (or is it bits? 🤷).

PostgreSQL's Approach

Every transaction gets a sequential ID. When you INSERT a row, t_xmin is set to your transaction ID. When you DELETE or UPDATE a row, t_xmax is set to your transaction ID. (An UPDATE is a DELETE of the old version plus an INSERT of the new version — two tuples.)

A reading transaction takes a snapshot: "I can see rows where t_xmin committed before my snapshot, and t_xmax is either unset, belongs to a transaction that aborted, or committed after my snapshot." This is a visibility check, and it runs for every tuple the query touches.

No read locks. No write locks blocking reads. Writers create new tuple versions. Readers consult their snapshot to decide which version is visible. The same row can have multiple physical copies in the heap — one per concurrent modification — and each transaction sees the version that's correct for its snapshot.
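A sketch of that visibility rule, with a snapshot modeled as "the set of transaction IDs that had committed when the snapshot was taken". Real snapshots are xid ranges plus an in-progress list, but the logic has the same shape:

```python
def visible(tup, committed_at_snapshot):
    """Can a snapshot see this tuple version?"""
    if tup["xmin"] not in committed_at_snapshot:
        return False   # creator hadn't committed yet: row doesn't exist for us
    if tup["xmax"] is not None and tup["xmax"] in committed_at_snapshot:
        return False   # deleter had committed: row was already gone for us
    return True

# Transaction 100 ran an UPDATE: it stamped xmax=100 on the old version and
# wrote a new version with xmin=100. Both versions coexist in the heap.
old_version = {"xmin": 50, "xmax": 100, "name": "Alice"}
new_version = {"xmin": 100, "xmax": None, "name": "Alicia"}

before = {50}        # snapshot taken before transaction 100 committed
after = {50, 100}    # snapshot taken after it committed

print(visible(old_version, before), visible(new_version, before))  # True False
print(visible(old_version, after), visible(new_version, after))    # False True
```

Two physical tuples, two snapshots, each seeing exactly one version. No locks anywhere in the read path.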

The cost: dead tuples. When you UPDATE a row, the old version doesn't disappear. It's marked with t_xmax but the bytes are still there, taking up space in the page. Multiply this by millions of updates and your table bloats. Pages fill up with invisible corpses.

VACUUM is the cleanup crew. It scans the heap, identifies tuples that are no longer visible to any active transaction, and marks that space as reusable. AUTOVACUUM runs this in the background. When it falls behind — large tables, high update rates, long-running transactions holding old snapshots open — bloat accumulates. I've seen tables where 60% of the pages were dead tuples. The fix is usually "figure out why autovacuum isn't keeping up," which is a blog post of its own.

InnoDB's Approach

InnoDB doesn't keep old versions in the main data pages. Instead, it writes undo logs — before-images of modified rows stored in a separate undo tablespace. When a reader needs an older version, InnoDB follows the undo chain backward until it finds the version that was current at the reader's snapshot.

Same concept, different plumbing. PostgreSQL's approach is simpler (everything is in the heap) but requires VACUUM. InnoDB's approach avoids heap bloat but adds complexity in the undo log management and the chain-walking for old snapshots.
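The chain-walking can be sketched in a few lines. The data page holds only the newest version; older before-images hang off it in an undo chain, and a reader walks backward until it reaches a version its snapshot can see. Field names here are invented:

```python
def read_version(row, committed_at_snapshot):
    """Walk the undo chain until we hit a version our snapshot can see."""
    while row is not None and row["xid"] not in committed_at_snapshot:
        row = row["undo"]              # follow the chain to the before-image
    return row

v1 = {"xid": 10, "balance": 100, "undo": None}   # original row version
v2 = {"xid": 20, "balance": 80, "undo": v1}      # current version in the page

print(read_version(v2, {10, 20})["balance"])  # 80  — fresh snapshot, no walking
print(read_version(v2, {10})["balance"])      # 100 — old snapshot walks to v1
```

This is also why long-running transactions hurt InnoDB differently than PostgreSQL: instead of heap bloat, you get long undo chains that every old-snapshot read has to walk.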


The Query Planner: Why It Ignores Your Index

You added an index. The query is still slow. EXPLAIN ANALYZE shows a sequential scan. What happened?

The query planner is a cost-based optimizer. It estimates the cost of different execution strategies and picks the cheapest one. "Cost" roughly means "number of page reads," weighted by whether they're sequential (cheap, read-ahead friendly) or random (expensive, one page at a time).

Query planner decision flowchart: index scan vs sequential scan based on selectivity

An index scan reads the index pages (sequential, fast) then jumps to heap pages in random order (random, slow). A sequential scan reads the entire heap start to finish (sequential, fast, but reads everything).

If your WHERE clause matches 80% of the table, the index scan would do: a few thousand index leaf page reads plus up to 3.2 million random heap page reads, one per matching row. The sequential scan does: 800,000 sequential heap page reads. Sequential I/O is roughly 10-100x faster than random I/O per page. The planner does the math and chooses the sequential scan.

The crossover point varies, but a rough rule: if the query returns more than about 10-15% of the table, the planner probably prefers a sequential scan. Your index exists but isn't useful for that query.
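You can see the crossover fall out of a toy cost model. The constants mirror PostgreSQL's defaults (seq_page_cost=1.0, random_page_cost=4.0) and the table shape is this post's running example; the exact crossover depends entirely on those knobs, which is why it varies in practice.

```python
SEQ_PAGE_COST, RANDOM_PAGE_COST = 1.0, 4.0         # PostgreSQL's defaults
HEAP_PAGES, ROWS, ENTRIES_PER_LEAF = 800_000, 4_000_000, 400

def seq_scan_cost():
    return HEAP_PAGES * SEQ_PAGE_COST              # read everything, in order

def index_scan_cost(selectivity):
    matching = ROWS * selectivity
    leaf_pages = matching / ENTRIES_PER_LEAF       # index leaves to read
    heap_fetches = matching                        # worst case: one random page per row
    return leaf_pages * SEQ_PAGE_COST + heap_fetches * RANDOM_PAGE_COST

for sel in (0.0001, 0.05, 0.8):
    winner = "index" if index_scan_cost(sel) < seq_scan_cost() else "seq"
    print(f"selectivity {sel}: {winner} scan wins")
```

With these constants the toy model flips to a sequential scan around 5% selectivity; the real planner's estimate also folds in correlation, caching, and CPU costs, which pushes the practical crossover into that 10-15% territory.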

The planner makes these decisions using statistics — histograms of column values, null fractions, distinct value counts. The ANALYZE command (or AUTOVACUUM with analyze) samples the table and updates these stats. Stale statistics lead to bad plans. I once tracked a production slowdown to a table that had grown from 100K to 10M rows without a statistics refresh. The planner thought it was querying a small table and chose a nested loop join. It was not a small table.

-- See what the planner is doing and how much work it actually did
EXPLAIN (ANALYZE, BUFFERS) SELECT * FROM users WHERE email = '[email protected]';

-- Output tells you:
--   "Index Scan using idx_users_email" — which strategy
--   "actual time=0.015..0.016 rows=1" — wall clock time, actual rows
--   "Buffers: shared hit=4" — 4 pages read from buffer pool (no disk I/O)
--   "Planning Time: 0.082 ms" — cost of choosing the plan
--   "Execution Time: 0.031 ms" — cost of running it

That Buffers: shared hit=4 is the proof. Four pages. Root, one internal node, one leaf, one heap page. The entire lookup touched 32KB of data out of a 6.4GB table.


WAL: Why Your Data Survives a Crash

The database modified a page in memory. Power goes out. Is the data gone?

No, because of write-ahead logging. Before modifying any data page, the database writes a description of the change to a sequential log file — the WAL (Write-Ahead Log). Only after the WAL record is safely on disk (fsync'd) does the transaction count as committed.

The modified data page sits in the buffer pool as a "dirty page." Eventually — at a checkpoint — the database flushes dirty pages to their actual locations in the heap files. If the system crashes before the checkpoint, the WAL records are replayed on startup to reconstruct the changes.

Why not write directly to the data files? Because data file writes are random — page 42, then page 90,871, then page 3. Random I/O. WAL writes are sequential — append to the end of the log. Sequential I/O on any storage device is dramatically faster than random I/O. The WAL turns random writes into sequential writes now, with the random writes batched and deferred to checkpoint time.

This is the same principle behind log-structured merge trees (LSM), the kernel's own journaling in ext4/xfs (post 17), and every system that has ever had to make writes durable on unreliable storage. Sequential first, random later, in batches.
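The whole cycle fits in a toy sketch: append change records sequentially, fsync before reporting commit, and rebuild page contents from the log after a "crash". This is nothing like PostgreSQL's actual WAL record format; it only mirrors the ordering guarantees.

```python
import json
import os
import tempfile

class MiniWAL:
    def __init__(self, path):
        self.path = path

    def log_change(self, page_no, key, value):
        record = json.dumps({"page": page_no, "key": key, "val": value})
        with open(self.path, "a") as f:
            f.write(record + "\n")    # sequential append, never a seek
            f.flush()
            os.fsync(f.fileno())      # durable before we report "committed"

    def replay(self):
        """Crash recovery: rebuild page contents by replaying every record."""
        pages = {}
        with open(self.path) as f:
            for line in f:
                rec = json.loads(line)
                pages.setdefault(rec["page"], {})[rec["key"]] = rec["val"]
        return pages

wal = MiniWAL(os.path.join(tempfile.mkdtemp(), "wal.log"))
wal.log_change(42, "email", "[email protected]")   # the commit is durable here
pages = {42: {"email": "[email protected]"}}       # dirty page in the buffer pool

pages = {}                                         # power cut: RAM gone, log survives
pages = wal.replay()
print(pages[42]["email"])                          # [email protected]
```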

Write-ahead logging flow: sequential WAL writes then batched checkpoint flushes


The Practical Bit

A few things I wish someone had told me earlier.

EXPLAIN (ANALYZE, BUFFERS) is your X-ray. Not EXPLAIN. Not EXPLAIN ANALYZE. The BUFFERS option shows actual page reads — shared hits (buffer pool), shared reads (disk), and temp reads (spilling to disk). This is where you see what the database is actually doing, not what it plans to do.

Check your dead tuples. pg_stat_user_tables has n_dead_tup and last_autovacuum. If dead tuples are climbing and autovacuum hasn't run recently, you have a bloat problem coming.

SELECT relname, n_live_tup, n_dead_tup,
       round(n_dead_tup::numeric / greatest(n_live_tup, 1) * 100, 1) as dead_pct,
       last_autovacuum
FROM pg_stat_user_tables
WHERE n_dead_tup > 10000
ORDER BY n_dead_tup DESC;

"Just add an index" has a cost. Every index is a separate data structure that must be updated on every INSERT, UPDATE, and DELETE. A table with ten indexes does ten B-tree insertions per row insert. That's write amplification. And each index takes disk space, buffer pool space, and vacuum time. I've seen tables with 15 indexes where half were never used. pg_stat_user_indexes tells you which indexes have zero scans.

The N+1 problem, from the database's perspective. Your ORM loads a user, then loops and loads each of their 100 orders individually. That's 101 queries, 101 parse-plan-execute cycles, and roughly 101 separate index traversals. A single JOIN does it in one query with one plan and reads the order index pages sequentially. The ORM hides this because it hides everything — the page structure, the buffer pool, the planner, all of it. I'm not saying don't use ORMs. I'm saying know what they're hiding.

Also ... don't use an ORM 😉.




Naz Quadri has mass-VACUUMed production databases at 3am and would prefer not to do it again. He blogs at nazquadri.dev. Rabbit holes all the way down 🐇🕳️.