What Is RAM, Actually?
From Leaking Capacitors to Cache Lines
Reading time: ~17 minutes
You called malloc(). The kernel gave you an address. You stored a number there, read it back, and moved on with your life.
But what is that address? It's not a location on a chip — not in any way you'd recognize. It's an index into a grid of billions of capacitors, each one holding a single bit as an electrical charge that is, right now, draining away. The data you stored is evaporating. A circuit you've never thought about is racing to put it back before it disappears. And it does this for every single bit, millions of times per second, whether you're using the memory or not.
This is what "random access memory" actually is. Let me show you the machinery.
The Smallest Unit of Memory You Own
Every bit in your RAM stick lives in a 1T1C cell: one transistor, one capacitor. That's it. Two components per bit. Your 16GB stick has roughly 137 billion of these cells.
The capacitor stores the bit. A charged capacitor is a 1. A discharged capacitor is a 0. The transistor is a gate — it connects the capacitor to the outside world when activated, and isolates it when not.

These cells are arranged in a two-dimensional grid. Rows and columns, like a spreadsheet. Every cell in a row shares a word line — the wire that activates all the transistors in that row simultaneously. Every cell in a column shares a bit line — the wire that carries the data out.
How a Read Works
Reading a bit from DRAM is one of those processes that sounds simple and absolutely isn't.
The memory controller activates the word line for the target row. Every transistor in that row turns on. Every capacitor in that row connects to its bit line. The charge stored in each capacitor is almost nothing — about 30 femtofarads of capacitance holding less than a volt. To put "femto" in perspective: a femtofarad is 10⁻¹⁵ farads. A typical AA battery stores roughly 5,000 coulombs of charge. A DRAM cell stores about 0.000000000000030 coulombs. The static charge you build up shuffling across a carpet (a microcoulomb or so) is tens of millions of times larger than the charge holding one of your bits. That's how small this is. And yet the sense amplifier has to reliably distinguish it from zero.
A sense amplifier at the end of each bit line detects this voltage difference. It's comparing the bit line voltage to a reference voltage, and the difference might be as small as 50 millivolts. The sense amp swings this to a full logic level — rail-high for a 1, rail-low for a 0. That's your bit.
Here's the part that should bother you: reading is destructive. Schrödinger would have loved DRAM 😹 — you can't observe the bit without destroying it.
When the capacitor shares its charge with the bit line, the capacitor drains. The bit you just read is gone. The sense amplifier detected it, but the original charge is dissipated. Every single DRAM read erases the data it reads.
That's why every read is followed by a write-back. The sense amplifier, having determined the value, drives the bit line back to full voltage and recharges the capacitor. Every read is secretly a read-then-write. Your innocent int x = array[0] triggers a destructive read and a restoration write at the hardware level.
The Refresh Treadmill
Even when nobody is reading your data, it's disappearing.
Capacitors leak. The charge drains through the transistor's junction, through the dielectric, through physics being physics. A DRAM cell loses its charge in milliseconds. Left alone, every bit in your system would decay to garbage.
The memory controller handles this with refresh cycles. It systematically walks through every row and reads it — which, as you now know, drains the capacitors and writes the values back. The JEDEC spec for DDR4 requires every row to be refreshed within 64 milliseconds. For a chip with 65,536 rows, that's one row refresh every 976 nanoseconds.
Your program has no idea this is happening. You never asked for it. You never see it. But roughly 5-10% of your memory bandwidth is consumed by refresh operations that exist solely because the hardware is fighting thermodynamics to keep your data alive.
That's why it's called dynamic RAM: the bits only exist because something keeps putting them back.
SRAM: The Expensive Alternative
If DRAM is a leaky bucket that needs constant refilling, SRAM (Static RAM) is a proper container.
An SRAM cell uses six transistors per bit instead of one transistor and one capacitor. Four transistors form a cross-coupled pair of inverters — a latch — that holds a stable 0 or 1 as long as power is applied. Two more transistors act as access gates.
No capacitor. No leaking. No refresh cycles. The data stays put.
The tradeoff is size. Six transistors per bit vs. two components per bit. An SRAM cell takes several times the area of a DRAM cell on the same process node (commonly cited estimates range from roughly 6x upward). It's also more expensive per bit and draws more standby power.
SRAM is what your CPU caches are made of. DRAM is what your sticks are made of. The entire cache hierarchy exists because we can afford small amounts of fast SRAM but need large amounts of cheap DRAM. That tension between speed and density shapes every modern computer.
The DDR Lineage
The DRAM in your machine is DDR SDRAM — Double Data Rate Synchronous Dynamic RAM. "Double data rate" means it transfers data on both the rising and falling edges of the clock signal, effectively doubling throughput without doubling the clock speed.
Each generation doubles the prefetch width and drops the voltage. DDR1 (2000, 2.5V, 2n prefetch) and DDR2 (2004, 1.8V, 4n prefetch) are ancient history — if either of those is still in production somewhere, someone should send flowers. DDR3 (2007, 1.5V, 8n prefetch) lasted nearly a decade and powered everything from the Mac Pro trash can to the Xbox One (the PlayStation 4, notably, went with GDDR5 instead). If you built a computer between 2010 and 2016, this is what you used, and honestly it was fine.
DDR4 (2014, 1.2V, 8n prefetch with bank groups) is what most desktops and servers are still running in 2026. It added bank groups to reduce conflict penalties but kept the same prefetch width as DDR3 — the improvement was structural, not brute-force.
DDR5 (2021, 1.1V, 3200-6400+ MT/s, 16n prefetch) is the first generation that changed something fundamental.
First, on-die ECC. Every DDR5 chip has error correction built into the die itself. It can detect and correct single-bit errors within each internal transfer — not to be confused with system-level ECC (more on that shortly). You get error correction whether your motherboard supports ECC or not.
Second, dual channels per DIMM. A DDR4 DIMM has one 64-bit channel. A DDR5 DIMM has two independent 32-bit channels. Same total width, but two independent channels mean two independent transactions can be in flight simultaneously. This cuts bank conflict stalls and improves utilization, especially for multi-threaded workloads.
ECC: When Bit Flips Matter
A single bit flip in the wrong place can ruin your day.
In 2003, Belgium's electronic voting system recorded an extra 4,096 votes for a candidate — later attributed to a single bit flip in RAM (the 13th bit flipping from 0 to 1 adds exactly 2¹² = 4,096). Cosmic rays — high-energy particles from space — hit silicon and generate enough charge to flip a stored bit. It's rare per cell, but when you have 137 billion cells, the math gets uncomfortable. Google's 2009 field study found correctable-error rates far higher than lab estimates had suggested, heavily concentrated in a minority of modules: roughly 8% of DIMMs saw at least one error per year. Every first Tuesday in November, I quietly hope the cosmic rays take a day off from US voting machines; we struggle enough with societal ECC as it is.
ECC RAM adds an extra chip per channel that stores parity/syndrome bits. The most common scheme uses SECDED — Single Error Correction, Double Error Detection. It can correct any single-bit error and detect (but not correct) any two-bit error. This is why every server you've ever touched runs ECC memory. The cost premium is around 10-20%, and nobody running production workloads considers it optional.
Rowhammer makes this worse. Discovered in 2014, rowhammer exploits the physical proximity of DRAM rows. Rapidly activating the same row — "hammering" it — causes electrical interference that flips bits in adjacent rows. It's a physical attack that crosses process boundaries. Attackers have used it to escalate privileges, escape VMs, and compromise entire systems through nothing but carefully timed memory access patterns. DDR5's on-die ECC mitigates single-bit rowhammer flips, but multi-bit attacks remain an active area of research.
VRAM: Memory for a Different Kind of Processor
The RAM inside your graphics card is a different beast. It's optimized for a fundamentally different access pattern.
Your CPU wants low latency — it needs one specific value now. A GPU wants high throughput — it needs a million values soon. This difference drives the entire GDDR and HBM design.

GDDR (Graphics DDR) shares DNA with regular DDR but makes different tradeoffs. GDDR6X, used in NVIDIA's RTX 4000-series, runs at higher data rates (up to 21 Gbps per pin) and uses a wider bus (256 or 384 bits vs. DDR's 64 bits). The per-pin latency is worse than DDR, but the raw bandwidth is staggering — over 1 TB/s on a high-end GPU.
HBM (High Bandwidth Memory) takes a radically different approach. Instead of discrete chips on a PCB, HBM stacks multiple DRAM dies vertically — 8 or 12 layers tall — and connects them to the processor via an interposer, a silicon bridge that sits beneath both the memory stacks and the processor die. Each HBM stack has a 1024-bit bus. An HBM3-equipped GPU might have six stacks, delivering over 3 TB/s of aggregate bandwidth.
This is why AI accelerators use HBM — training and inference on large language models are memory-bandwidth-bound, not compute-bound. The math units can process data faster than conventional memory can feed them. HBM exists to close that gap.
The NVIDIA H100 makes the cost of this choice concrete. It comes in two variants:
| | H100 SXM | H100 PCIe |
|---|---|---|
| Memory type | HBM3 | HBM2e |
| Capacity | 80 GB | 80 GB |
| Bandwidth | 3.35 TB/s | ~2 TB/s |
| Price (2025) | ~$30,000-40,000 | ~$25,000-30,000 |
Same GPU die. Same 80GB. The SXM version costs $5,000-10,000 more, and a huge chunk of that premium is the newer HBM3 and the SXM board design that feeds it. The H200 pushed further — HBM3e, 141GB, 4.8 TB/s — and costs even more. When people ask why AI infrastructure is so expensive, a significant part of the answer is: stacking DRAM dies twelve layers tall on a silicon interposer and wiring each stack with a 1024-bit bus is not cheap.
When people say a GPU has "80GB of HBM3," they're describing five stacks of vertically-interconnected DRAM dies, each with a bus wider than any DDR channel, all sitting millimeters from the compute die on a shared silicon interposer. That's not a memory stick you slot in. It's a piece of semiconductor engineering that was fabricated as part of the GPU package.
How the CPU Talks to RAM
In the old days — before 2003 — the memory controller lived on the motherboard's northbridge chip. Every memory access had to travel from the CPU, across the front-side bus, through the northbridge, and out to the DIMMs. It was slow and the bus was a bottleneck.
AMD moved the memory controller onto the CPU die with the Athlon 64 in 2003. Intel followed with Nehalem in 2008. Today, every modern CPU has an integrated memory controller (IMC). The CPU talks directly to your RAM sticks with no intermediary chip.
The Addressing Hierarchy
A memory address isn't a flat index. It gets decomposed into a hierarchy of physical selectors.
Channels: Most desktop CPUs have 2 memory channels (DDR4/DDR5). Servers have 4, 6, or 8. Each channel is an independent path to a set of DIMMs.
Ranks: Each DIMM can have 1, 2, or 4 ranks. A rank is a set of chips that respond together to fill the full data width of the channel (64 bits for DDR4; 32 bits per sub-channel for DDR5).
Banks: Each rank is divided into banks — in DDR5, typically 8 bank groups of 4 banks each, 32 banks in all. Banks can be accessed independently, allowing parallel operations.
Rows and columns: Within a bank, data is stored in a 2D array. The controller first opens a row (loading it into the bank's row buffer), then reads specific columns from that row.
What Those Timing Numbers Mean
Your RAM sticks have numbers printed on them like "CL16-18-18-36". These are latency timings, measured in clock cycles.
CAS Latency (CL): The number of clock cycles between the column address command and data appearing on the bus. CL16 at 3200 MT/s means 10 nanoseconds (16 cycles / 1600 MHz actual clock). This is the most-quoted number but only tells part of the story.
tRCD (RAS to CAS Delay): How long after opening a row you can issue a column read. If the row you need isn't already open, you pay this penalty first.
tRP (Row Precharge): How long it takes to close the current row before opening a new one.
tRAS (Row Active Time): Minimum time a row must stay open before it can be precharged.
The worst case — you need data from a row that isn't open, and a different row is — costs you tRP + tRCD + CL. That's why the "random" in Random Access Memory is misleading. A row hit (data is in the already-open row) takes CL cycles. A row miss (need to precharge, open a new row, then read) takes tRP + tRCD + CL. In that scenario, "random" access is 3x slower than sequential access to the same row.
This is why memory access patterns matter. This is why the memory controller reorders requests to maximize row hits. This is why prefetching exists.
The Cache Hierarchy
You now have the full picture of DRAM: slow (50-80ns), cheap, dense, and leaky. The CPU core runs at ~4-5 GHz — one clock cycle takes roughly 0.2 nanoseconds. An L1 cache hit takes about 1 nanosecond. A DRAM access takes 50-80 nanoseconds, roughly 200-400 CPU clock cycles of thumb-twiddling.
This gap — the memory wall — has been growing since the 1980s. CPU clock speeds improved roughly 1000x between 1985 and 2005. DRAM latency improved about 10x in the same period. The solution is a hierarchy of progressively larger, slower, cheaper memories that hide the latency of the level below.
All caches are built from SRAM. The six-transistor cells that don't leak, don't need refresh, and switch in under a nanosecond. The price you pay is density — which is why caches are measured in kilobytes and megabytes while main memory is measured in gigabytes.

L1: The Core's Private Scratchpad
Each CPU core has its own L1 cache, split into two halves: L1i (instructions) and L1d (data). Typical size: 32-64 KB each. Access time: roughly 1 nanosecond, or about 4-5 clock cycles on a modern core.
32 KB sounds absurd for a working set. It is. But the L1 isn't meant to hold your working set — it's meant to hold the data and instructions the core needs right now. Its hit rate on typical code is 95%+ because programs exhibit temporal locality (you use something, you'll use it again soon) and spatial locality (you use something, you'll use its neighbor soon).
L2: The Per-Core Buffer
Each core also has a private L2 cache. Typical size: 256 KB to 1 MB. Access time: roughly 3-4 nanoseconds. It's 4-8x slower than L1 but 4-16x larger.
L2 catches the misses from L1. When L1 doesn't have what the core needs, L2 usually does. Combined L1+L2 hit rates typically exceed 97% for well-behaved code.
L3: The Shared Pool
The L3 cache is shared across all cores. On a modern desktop CPU it ranges from 8 MB (low-end) to 64 MB (AMD's X3D chips with stacked cache). On server processors, L3 can be 128 MB or more. Access time: roughly 10-12 nanoseconds.
L3's job is to catch misses from L2 and, critically, to be the rendezvous point for data shared between cores. If core 0 writes a value that core 4 needs, L3 is where they coordinate.
Cache Lines: The 64-Byte Atom
The CPU never fetches a single byte from cache or memory. The smallest unit of transfer is a cache line — 64 bytes on modern x86 and most ARM processors (Apple's M-series is a notable exception at 128 bytes).
When you read array[0], the CPU fetches the entire 64-byte block containing that address. If array holds 4-byte integers, you just got array[0] through array[15] for free. This is spatial locality in action — and it's why iterating an array sequentially is fast. Each cache line fill pays for the next 15 accesses.
It's also why your struct layout matters. If your struct is 72 bytes, every access touches two cache lines. If it's 64 bytes, it fits perfectly in one. If you have an array of structs where you only ever read one field, all the other fields are wasting cache space. That's the argument for struct-of-arrays vs. array-of-structs in performance-critical code.

Write-Back vs. Write-Through
When the CPU writes to a cached location, it has two strategies.
Write-through: Write to the cache and to the next level simultaneously. Simple, always consistent, but slow — every write incurs the latency of the slower level.
Write-back: Write only to the cache. Mark the line as "dirty." Write it to the next level later, when the line is evicted. Faster for the common case (multiple writes to the same line before eviction), but more complex — you need to track which lines are dirty.
Modern CPUs use write-back at every level. The performance difference is enormous. A write-through L1 would bottleneck on L2 latency for every store instruction. Write-back means the core can fire off dozens of writes to L1 at full speed, and the dirty lines trickle down to L2 and L3 in the background.
Cache Coherency: The MESI Protocol
The moment you have multiple cores with private caches, you have a consistency problem. If core 0 and core 4 both cache the same memory line, and core 0 writes to it, core 4's copy is stale.
The MESI protocol (and its variant MOESI, used by AMD) solves this by assigning each cache line one of four states:
- Modified: This cache has the only valid copy, and it's been written to. Main memory is stale.
- Exclusive: This cache has the only copy, and it matches main memory. Can be written without notifying anyone.
- Shared: Multiple caches hold this line. All copies match main memory. Must notify others before writing.
- Invalid: This line is not valid. Must be fetched from elsewhere.
When core 0 writes to a Shared line, it broadcasts an invalidation to all other cores. They mark their copies Invalid. Core 0's copy becomes Modified. This takes ~40-100 nanoseconds depending on the topology — it has to cross the interconnect, hit the other core's cache controller, and wait for acknowledgment.
This is the mechanism behind false sharing, one of the most insidious performance bugs in concurrent programming. Two threads write to different variables, but those variables happen to sit in the same 64-byte cache line. The hardware sees writes to the same line from different cores and starts the invalidation ping-pong. Neither thread is doing anything wrong logically, but physically they're fighting over the same cache line. I've seen false sharing cause a 10x slowdown on workloads that looked perfectly parallel.
The fix is usually alignment padding — force the two variables onto different cache lines. Most languages have annotations for this (alignas(64) in C++, #[repr(align(64))] in Rust, CacheLinePad patterns in Go).
Prefetching: The CPU Guesses Your Future
Modern CPUs don't wait for cache misses. They try to predict what you'll access next and fetch it before you ask.
The hardware prefetcher monitors your access patterns. Sequential access is the easiest case — if you read cache lines N, N+1, N+2, the prefetcher starts loading N+3, N+4 before you get there. Stride patterns (every 4th element, every 8th) are also detected. Random access patterns defeat the prefetcher entirely.
This is another reason sequential memory access is fast. You're not only getting 16 array elements per cache line — the prefetcher is loading the next cache line while you're still processing the current one. The combination means a sequential array scan can approach the theoretical bandwidth of the memory subsystem, while random access pays the full latency penalty on every access.
The Numbers That Matter
Here's the latency hierarchy that shapes every performance decision:
| Level | Typical Size | Latency | CPU Cycles (~4 GHz) |
|---|---|---|---|
| L1 cache | 32-64 KB | ~1 ns | ~4 |
| L2 cache | 256 KB - 1 MB | ~4 ns | ~16 |
| L3 cache | 8-64 MB | ~12 ns | ~48 |
| DRAM | 16-128 GB | ~50-80 ns | ~200-320 |
Every step down the hierarchy is roughly 4x slower and 10-100x larger. This isn't a coincidence — it's the design constraint: speed trades against size and cost at every level. An L2 as fast as L1 would have to be as small and as expensive per bit as L1. If DRAM were as fast as SRAM, we wouldn't need caches at all.
The memory wall is the term for the growing gap between CPU speed and memory speed. In 1985, a CPU cycle and a memory access took roughly the same time. By 2005, the CPU was 1000x faster but memory was only 10x faster. Caches exist entirely to hide this 100x disparity. Every cache hit at L1 means the CPU avoided a 50-80 nanosecond stall — an eternity at 4 GHz.
This is why data structures matter more than algorithms for many real-world workloads. A linked list traversal — pointer-chasing through random heap locations — defeats prefetching, defeats spatial locality, and hits DRAM latency on nearly every node. A flat array scan of the same data, even with a worse algorithmic complexity, can be faster because every access is an L1 hit.
I'm not saying throw away your algorithms textbook. I'm saying the cost model it assumed — that all memory accesses cost the same — hasn't been true since the 1990s.
The Full Stack of a Single Address
malloc() gave you an address. That address maps, through the page table and memory controller, to a specific channel, rank, bank, row, and column — an index into a grid of capacitors that are leaking right now. The memory controller refreshes every row every 64 milliseconds. A sense amplifier destroyed and rebuilt your data the last time anyone read it. Six layers of caching — L1i, L1d, L2, L3, TLB, prefetch buffers — exist to hide the fact that those capacitors take 300+ CPU cycles to respond.
Every struct you lay out, every array you iterate, every concurrent data structure you design — the cache hierarchy is the invisible judge of your performance. The CPU hasn't been the bottleneck for decades. Memory has.
And now you know what "memory" actually is: a grid of leaking capacitors, maintained by a refresh circuit running a race against physics, fronted by a hierarchy of SRAM caches playing an elaborate prediction game about what you'll need next.
The next time a profile shows cycles vanishing into memory stalls, you'll know where to look.
Further Reading
- What Every Programmer Should Know About Memory — Ulrich Drepper (2007) — the definitive deep dive, still relevant
- JEDEC DDR5 SDRAM Standard — the official spec
- Flipping Bits in Memory Without Accessing Them — Kim et al. (2014) — the original rowhammer paper
- DRAM Errors in the Wild: A Large-Scale Field Study — Schroeder et al. (2009) — Google's study on real-world DRAM error rates
- Gallery of Processor Cache Effects — Igor Ostrovsky — excellent visual demonstrations of cache behavior
Naz Quadri has mass produced more cache misses than he'd like to admit, mostly by iterating linked lists. He blogs at nazquadri.dev. Rabbit holes all the way down 🐇🕳️.