
The Parallel Lanes Nobody Uses

SIMD and the Eight-Lane Highway You've Been Driving Solo

Reading time: ~13 minutes


You ran ripgrep across a 2GB log file and it finished in half a second. grep would have taken ten. You multiplied a NumPy array by 2 — arr * 2 — and it finished before the function-call overhead had time to register.

Here's what actually happened: your CPU has 256-bit registers that can process 8 floats simultaneously. Those tools used all eight lanes of an eight-lane highway. Your Python for-loop uses one.

This is what your CPU can actually do.


The Fundamental Idea

SIMD stands for Single Instruction, Multiple Data. It's not a clever trick. It's a first-class feature of every CPU you've used in the last twenty years.

The idea is direct. A normal CPU instruction operates on one value:

ADD rax, rbx      # add one 64-bit integer to one other 64-bit integer

A SIMD instruction operates on a packed vector of values in a single clock:

VADDPS ymm0, ymm1, ymm2   # add eight 32-bit floats at once

Eight additions. One instruction. One cycle.

The register ymm0 is 256 bits wide. You pack 8 floats (each 32 bits) into it and treat the whole thing as a single operand. The arithmetic unit is physically wider — eight adders in parallel — and the instruction wires them all to fire simultaneously.

Scalar vs SIMD — 8 instructions vs 1 instruction

This is not a metaphor. It's silicon.


How We Got Here: The Register Zoo

The story of SIMD is a story of Intel and AMD racing to add bigger and bigger registers while pretending backward compatibility wasn't getting worse.

MMX (1996) — Intel introduced the first SIMD extension in the Pentium MMX. Eight 64-bit registers (mm0–mm7) for integer operations. The catch: those registers were aliased to the mantissa fields of the x87 ST(0)–ST(7) floating-point registers. Switching between MMX and x87 FP required executing EMMS to reset the x87 tag word first. (I'm simplifying the aliasing here — the full story involves how x87 tracks "empty" register slots.) Programmers used it. Suffered for it. Moved on.

SSE (1999) — Streaming SIMD Extensions. Eight new 128-bit registers (xmm0–xmm7), finally independent of the FPU stack. Supported 4 single-precision floats or integer variants. Used heavily for 3D graphics and audio in the early 2000s.

SSE2 (2001) — Added double-precision floats and 128-bit integer operations. x86-64 made SSE2 mandatory, so in 64-bit mode you can assume it exists. This is the baseline.

SSE3, SSSE3, SSE4.1, SSE4.2 (2004–2007) — A string of incremental additions. String comparison instructions, dot products, population counts. Useful but baroque. The naming got embarrassing.

AVX (2011) — Intel widened the registers to 256 bits (ymm0–ymm15). Now you could do 8 floats or 4 doubles at once. The ymm registers are actually the full-width versions of the xmm registers — xmm0 is the lower 128 bits of ymm0.

AVX2 (2013) — Extended AVX to integer operations and added gather instructions (load scattered values from memory into a vector register). Available on Intel Haswell and later, AMD Ryzen. This is the register set most production code targets today.

AVX-512 (2017) — 512-bit registers (zmm0–zmm31). 16 floats or 8 doubles at once. Intel pushed this hard in server chips; it's common in the data center. Desktop support is inconsistent — Intel disabled AVX-512 on Alder Lake desktop SKUs specifically because AVX-512 instructions are power-hungry enough to trigger thermal throttling, and Alder Lake's big/little core design made the behavior unpredictable. AMD added AVX-512 starting with Zen 4. The instruction set is 300+ pages of documentation.

SIMD register width evolution — MMX to AVX-512

The registers kept doubling. The theoretical throughput kept doubling. Most application code never noticed.


Why the Compiler Sometimes Does This For You

Modern compilers — GCC, Clang, MSVC, and rustc (which uses LLVM) — can auto-vectorize loops. This is when the compiler looks at your scalar loop and emits SIMD instructions for it without you asking.

This works well when the loop body is simple and branch-free, the data is contiguous, and the iteration count doesn't depend on the loop's own computation.

A simple sum-of-squares is a textbook case the compiler handles automatically:

pub fn sum_squares(a: &[f32]) -> f32 {
    a.iter().map(|x| x * x).sum()
}

Compile with --release targeting AVX2 and... the multiply vectorizes (vmulps) but the sum stays scalar (vaddss). Wait, what?

Floating-point addition isn't associative — (a + b) + c can give a different result from a + (b + c) due to rounding. The compiler won't reorder your additions without permission, which means it can't pack 8 sums into a single vaddps. Switch to integers and the story changes:

pub fn sum_squares_i32(a: &[i32]) -> i32 {
    a.iter().map(|x| x * x).sum()
}

Now you get vpmulld and vpaddd on ymm registers — 8 integers at once, fully vectorized. Integer addition is associative, so LLVM can reorder freely. See both versions side by side on Compiler Explorer →

This is the kind of thing that makes auto-vectorization both powerful and frustrating. The compiler is doing the right thing — it won't change your program's semantics — but it means the "just write clean code and the compiler will vectorize it" advice has a large asterisk on it.
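There's a middle path that doesn't require fast-math flags: split the accumulator yourself. If you maintain eight independent partial sums, you — not the compiler — have chosen the new summation order, and LLVM is free to keep all eight in a single ymm register. A sketch (the function name is mine):

```rust
/// Sum of squares with a manually split accumulator. The eight
/// independent partial sums give LLVM permission to vectorize the
/// reduction: one vaddps per chunk instead of eight serial vaddss.
pub fn sum_squares_lanes(a: &[f32]) -> f32 {
    let mut acc = [0.0f32; 8];
    let chunks = a.chunks_exact(8);
    let tail = chunks.remainder();
    for chunk in chunks {
        for lane in 0..8 {
            // eight independent sums -> candidates for one ymm register
            acc[lane] += chunk[lane] * chunk[lane];
        }
    }
    // fold the lanes, then handle the scalar tail
    let mut total: f32 = acc.iter().sum();
    for &x in tail {
        total += x * x;
    }
    total
}
```

Note the result can differ from the naive left-to-right sum in the last bits, because the summation order changed — but now that's an explicit decision in the source, not a semantics change smuggled in by the optimizer.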

This breaks down further the moment things get complicated. Add a branch inside the loop: the compiler has to use masked operations or give up. Use a data structure it can't prove is contiguous: it has to generate both a vectorized path and a scalar fallback, with a runtime check. Access non-contiguous memory: it has to use gather instructions, which are slower than you'd hope. Add any function call it can't inline: it bails entirely.
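The branch case often has a cheap fix: express the condition as arithmetic. LLVM can sometimes if-convert a simple branch on its own, but writing it as a min/max/select leaves nothing for the vectorizer to prove. A small sketch (function names are mine):

```rust
/// If-laden version: the vectorizer must first prove this branch can be
/// safely flattened into a masked select before emitting SIMD.
fn relu_branchy(v: &mut [f32]) {
    for x in v.iter_mut() {
        if *x < 0.0 {
            *x = 0.0;
        }
    }
}

/// Same result, branch-free: `max` maps directly to a packed vector max
/// (vmaxps with AVX), so there is nothing left to prove.
fn relu_branchless(v: &mut [f32]) {
    for x in v.iter_mut() {
        *x = x.max(0.0);
    }
}
```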

Rust's ownership model actually helps here — slices guarantee contiguous memory and the borrow checker proves non-aliasing at compile time. That's information the auto-vectorizer can use. In C, the compiler has to assume two float* arguments might alias unless you annotate with restrict.

The compiler's auto-vectorizer is optimistic but conservative. You can inspect the emitted SIMD with cargo rustc --release -- --emit asm, or use Compiler Explorer to see exactly what LLVM generated. Read that output. It's educational in a way that is sometimes painful.


Intrinsics: Taking the Wheel

When auto-vectorization isn't enough, you can write SIMD code directly using intrinsics — functions in Rust's std::arch module that map one-to-one to specific CPU instructions.

This is not assembly. You're still writing Rust. You're just telling the compiler exactly which instruction to emit. The ISA-specific code lives inside unsafe blocks, making it explicit where you're stepping outside the compiler's guarantees:

#[cfg(target_arch = "x86_64")]
use std::arch::x86_64::*;

/// Add two float slices element-wise using AVX.
/// Handles lengths that aren't a multiple of 8 with a scalar tail.
#[target_feature(enable = "avx")]
unsafe fn add_arrays(a: &[f32], b: &[f32], out: &mut [f32]) {
    let n = a.len().min(b.len()).min(out.len());
    let mut i = 0;
    while i + 8 <= n {
        let va = _mm256_loadu_ps(a.as_ptr().add(i));   // load 8 floats
        let vb = _mm256_loadu_ps(b.as_ptr().add(i));   // load 8 floats
        let vc = _mm256_add_ps(va, vb);                // add all 8
        _mm256_storeu_ps(out.as_mut_ptr().add(i), vc);  // store 8 floats
        i += 8;
    }
    // scalar tail for remainder (if n % 8 != 0)
    for j in i..n {
        out[j] = a[j] + b[j];
    }
}

The __m256 type is a 256-bit vector. _mm256_loadu_ps loads 8 unaligned single-precision floats. _mm256_add_ps adds them. One call, one instruction. The #[target_feature(enable = "avx")] attribute tells the compiler this function requires AVX — calling it on hardware without AVX is undefined behavior, which is why the function is unsafe.

Intrinsics code is not fun to write. The naming convention (_mm256_loadu_ps vs _mm256_load_ps vs _mm512_loadu_ps) requires memorizing a taxonomy. The Intel Intrinsics Guide is the reference — it lists every intrinsic, the instruction it maps to, the latency, and the throughput. You'll spend time there.

The upside over C: Rust's type system catches width mismatches at compile time. If you accidentally pass an __m128 where an __m256 is expected, that's a type error, not a silent runtime bug. The unsafe boundary also makes it easy to audit — every line that touches raw SIMD is visually contained.

For a higher-level alternative, Rust's portable SIMD API (std::simd) provides type-safe, architecture-independent vector types like f32x8. It's available on nightly and progressing toward stable. When it lands, it will be the preferred way to write explicit SIMD without unsafe or platform-specific intrinsics.

Most application programmers don't write intrinsics. But the programmers who write the libraries you depend on — numpy, simdjson, ripgrep — absolutely do.


Where SIMD Actually Lives

Finding a byte in a buffer. You do it constantly, you never think about it, and it's the single operation where SIMD makes the most visceral difference. A naive loop checks one byte at a time. SIMD checks 32 with a single _mm256_cmpeq_epi8 — compare 32 bytes simultaneously, get a 32-bit mask of which positions matched.
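The inner loop of that idea fits in a few lines of intrinsics. This is a simplified sketch (the function name is mine; production implementations add pointer alignment handling and unrolling):

```rust
#[cfg(target_arch = "x86_64")]
use std::arch::x86_64::*;

/// Find the first occurrence of `needle` in `haystack`, 32 bytes at a time.
/// Calling this on hardware without AVX2 is UB, hence `unsafe`.
#[target_feature(enable = "avx2")]
unsafe fn find_byte(haystack: &[u8], needle: u8) -> Option<usize> {
    let target = _mm256_set1_epi8(needle as i8); // broadcast needle to 32 lanes
    let mut i = 0;
    while i + 32 <= haystack.len() {
        let chunk = _mm256_loadu_si256(haystack.as_ptr().add(i) as *const __m256i);
        let eq = _mm256_cmpeq_epi8(chunk, target); // 0xFF in lanes that match
        let mask = _mm256_movemask_epi8(eq) as u32; // one bit per lane
        if mask != 0 {
            // lowest set bit = first matching position in this chunk
            return Some(i + mask.trailing_zeros() as usize);
        }
        i += 32;
    }
    // scalar tail for the last < 32 bytes
    haystack[i..].iter().position(|&b| b == needle).map(|p| i + p)
}
```

One compare, one movemask, one trailing_zeros — and 32 bytes are dispatched per iteration.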

memchr — the fundamental byte-search operation — is implemented with SIMD at every level: glibc's C implementation, and Rust's memchr crate (which we'll get to in a moment). The function you call every day is already vectorized.

ripgrep is fast partly because of SIMD-accelerated memchr. The memchr crate by Andrew Gallant implements memchr, memmem, and substring search using AVX2 (and AVX-512 where available). The core idea for substring search is Teddy — an algorithm that uses SIMD to find candidate positions in bulk, then verifies them. When ripgrep is blazing through a 2GB log file, it's pushing 32 bytes at a time through vectorized comparisons. This is why it outperforms grep by 5–10x on many workloads. It's not magic. It's lanes.

That's also why string search benchmarks look bizarre to anyone who hasn't seen SIMD before. A loop that calls find in a hot path and a SIMD-accelerated version can differ by 8x with identical O() complexity. The algorithm doesn't tell you the constant factor.

JSON Parsing

In 2019 Geoff Langdale and Daniel Lemire published "Parsing Gigabytes of JSON per Second", the paper behind simdjson, showing that JSON parsing is largely a SIMD problem. The bottleneck in parsing isn't the logic — it's scanning through bytes looking for structural characters ({, }, [, ], :, ,, ").

simdjson processes 64 bytes at a time using AVX-512 (or 32 with AVX2). It classifies every byte simultaneously — is this a structural character? A whitespace? A quote? — using bitwise SIMD operations to produce bitmasks. Then it uses those bitmasks to drive parsing without a byte-at-a-time loop.
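A toy version of that classification step, in the style of the earlier examples (simdjson's real classifier uses nibble-based table lookups via shuffle instructions rather than one compare per character, but the bitmask output is the same idea; the function name is mine):

```rust
#[cfg(target_arch = "x86_64")]
use std::arch::x86_64::*;

/// Classify 32 bytes at once: returns (quote_mask, structural_mask),
/// where bit i of each mask describes block[i].
#[target_feature(enable = "avx2")]
unsafe fn classify_block(block: &[u8; 32]) -> (u32, u32) {
    let v = _mm256_loadu_si256(block.as_ptr() as *const __m256i);
    // quote positions
    let quotes = _mm256_cmpeq_epi8(v, _mm256_set1_epi8(b'"' as i8));
    // OR together one compare per structural character
    let mut structural = _mm256_setzero_si256();
    for &c in &[b'{', b'}', b'[', b']', b':', b','] {
        let eq = _mm256_cmpeq_epi8(v, _mm256_set1_epi8(c as i8));
        structural = _mm256_or_si256(structural, eq);
    }
    (
        _mm256_movemask_epi8(quotes) as u32,     // bit i => block[i] == '"'
        _mm256_movemask_epi8(structural) as u32, // bit i => structural char
    )
}
```

Downstream parsing then iterates over set bits in these masks — never over individual bytes.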

The result: simdjson parses JSON at 2–3 GB/s on a modern CPU. The fastest pure-scalar parser does maybe 300–500 MB/s. The 6x difference is entirely SIMD.

That's why simdjson exists. That's why it's in MongoDB, ClickHouse, and dozens of other systems that care about throughput.

Image Processing

Every pixel is independent. Every channel is independent. This is SIMD's dream workload — no data dependencies, no branches, just arithmetic on contiguous arrays of bytes. SSE2 processes 16 pixels at once with saturating addition (u8x16::saturating_add in portable SIMD). OpenCV, libjpeg-turbo, libpng — they all have SIMD paths for their hot loops. When Photoshop applies a filter to a 24-megapixel image in under a second, this is why.

ML Inference

This is the one that matters most right now.

Neural network inference is fundamentally matrix multiplication: take a weight matrix, multiply by an input vector, pass through an activation function. Repeat. The core operation — multiply-accumulate on large matrices — is exactly what SIMD was built for.

AVX2's fused multiply-add (_mm256_fmadd_ps via std::arch, or f32x8::mul_add in portable SIMD) does a*b + c on 8 floats in one instruction. For a naive matrix multiply loop, this is an 8x multiplier before you've thought about anything else. Add tiling for cache efficiency and you're in the range of what high-performance BLAS libraries actually do.
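The core multiply-accumulate loop is short enough to show. A minimal sketch (function name is mine; real BLAS kernels use several accumulators to hide FMA latency, plus cache tiling):

```rust
#[cfg(target_arch = "x86_64")]
use std::arch::x86_64::*;

/// Dot product with AVX2 FMA: 8 multiply-accumulates per instruction.
#[target_feature(enable = "avx2")]
#[target_feature(enable = "fma")]
unsafe fn dot(a: &[f32], b: &[f32]) -> f32 {
    let n = a.len().min(b.len());
    let mut acc = _mm256_setzero_ps();
    let mut i = 0;
    while i + 8 <= n {
        let va = _mm256_loadu_ps(a.as_ptr().add(i));
        let vb = _mm256_loadu_ps(b.as_ptr().add(i));
        acc = _mm256_fmadd_ps(va, vb, acc); // acc += va * vb, all 8 lanes
        i += 8;
    }
    // horizontal sum of the 8 lanes, then the scalar tail
    let mut lanes = [0.0f32; 8];
    _mm256_storeu_ps(lanes.as_mut_ptr(), acc);
    let mut total: f32 = lanes.iter().sum();
    for j in i..n {
        total += a[j] * b[j];
    }
    total
}
```

A matrix multiply is this loop run over every row-column pair; everything else in a fast GEMM is about keeping the data feeding these FMAs from cache.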

AVX-512 with VNNI (Vector Neural Network Instructions, 2019) goes further — it adds instructions specifically for quantized integer dot products used in 8-bit inference. A single vpdpbusd instruction (exposed as _mm512_dpbusd_epi32 in intrinsics) performs 64 byte-level multiply-accumulates — four per 32-bit lane across 16 lanes. llama.cpp, the library that lets you run large language models on consumer hardware, has hand-written AVX2 and AVX-512 kernels for its matrix multiplication. When you run a local model on your laptop, those kernels are running in tight loops for every token you generate.


The Mindset Shift

Here's the insight that changes how you write code even if you never touch an intrinsic.

SIMD forces you to think in batches, not items.

Scalar code says: "for each element, do this." SIMD code says: "take 8 elements, do this to all of them at once, advance 8." The data structure implications are real.

Arrays of Structures vs Structures of Arrays

Consider a particle system. You might model it like this:

struct Particle {
    x: f32, y: f32, z: f32,       // position
    vx: f32, vy: f32, vz: f32,    // velocity
    mass: f32,
}
let particles: Vec<Particle> = Vec::with_capacity(1_000_000);

This is AoS — Array of Structures. Each particle's data is packed together. Intuitive. Natural.

The goal: update all x positions — x += vx * dt — for every particle.

The problem: consecutive x values are 28 bytes apart — the struct is seven f32s — and each x sits 12 bytes from its vx. When you load memory around 8 x values, you also pull in y, z, vx, vy, vz, mass — data you don't need. Your cache lines are full of noise. Your SIMD registers require a scatter-gather to populate.

The SIMD-friendly layout is SoA — Structure of Arrays:

struct Particles {
    x:  Vec<f32>,
    y:  Vec<f32>,
    z:  Vec<f32>,
    vx: Vec<f32>,
    // ...
}

With SoA, all x values are contiguous. Loading &particles.x[i..i+8] gives 8 consecutive x values, ready to go. Loading &particles.vx[i..i+8] gives the matching 8 vx values. One fused multiply-add updates 8 particles. No scatter-gather. No cache waste.

This is not a micro-optimization. The difference in a physics simulation inner loop can be 4–8x. The code is otherwise identical.
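The SoA inner loop, sketched (struct and function names are mine; with contiguous slices, no aliasing, and no branches, LLVM vectorizes this as written):

```rust
/// Structure-of-arrays layout: each field is its own contiguous array.
struct ParticlesSoA {
    x: Vec<f32>,
    vx: Vec<f32>,
}

/// Update all x positions. The zip over two contiguous slices compiles to
/// packed SIMD: load 8 x, load 8 vx, multiply-add, store (with AVX enabled).
fn step_x(p: &mut ParticlesSoA, dt: f32) {
    for (x, vx) in p.x.iter_mut().zip(&p.vx) {
        *x += vx * dt;
    }
}
```

Note there's no SIMD-specific code here at all — the layout did the work.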

That's why the SoA-vs-AoS choice matters — two layouts with identical asymptotic behavior and identical logical content. One is auto-vectorizable. One isn't. The difference is 8x. Nobody mentioned this in algorithms class.

This also explains why entity-component systems (ECS) — used in game engines like Unity DOTS and Bevy — look structurally odd until you see SIMD. ECS stores component data in contiguous arrays per component type, not per entity. That's SoA. The performance difference for physics and animation simulations is why the pattern exists.

AoS vs SoA — scattered access vs contiguous SIMD loads

Alignment

SIMD instructions have opinions about memory alignment. Aligned loads — _mm256_load_ps — require the address to be 32-byte aligned (the address mod 32 == 0). Unaligned loads — _mm256_loadu_ps — work on any address, but may be slower on older hardware.

On modern CPUs (Intel Skylake and later, AMD Zen 2 and later), unaligned loads are as fast as aligned loads — as long as they don't cross a 64-byte cache-line boundary. In practice, use _mm256_loadu_ps in your code and align your buffers where convenient, so loads rarely straddle that boundary.

In Rust, you control alignment with #[repr(align(32))]:

#[repr(C, align(32))]
struct AlignedBlock {
    data: [f32; 8],
}

This is the equivalent of C's __attribute__((aligned(32))) or alignas(32). It means: "I plan to load this with SIMD and I want the first element to be register-friendly."
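A throwaway check to convince yourself the attribute works (the helper name is mine):

```rust
/// 32-byte-aligned block, suitable for aligned AVX loads.
#[repr(C, align(32))]
struct AlignedBlock {
    data: [f32; 8],
}

/// The attribute guarantees the alignment wherever the struct lives:
/// on the stack, in a Box, or inside a Vec<AlignedBlock>.
fn is_32_byte_aligned(p: *const f32) -> bool {
    (p as usize) % 32 == 0
}
```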


You Don't Need to Write Intrinsics

The practical message is not "go rewrite your code in intrinsics." It's shorter:

Write in a way the compiler can vectorize. Keep your hot loops simple and branch-free. Lay your data out contiguously in the access order you need it. Prefer SoA over AoS in performance-critical code. Reach for libraries (numpy, simdjson, BLAS, any vectorized BLAS-backed ML framework) before reaching for intrinsics.

That's why numpy is fast and a Python for-loop isn't. numpy's inner loops are SIMD-vectorized C. When you call arr * 2, numpy dispatches to a vectorized multiply kernel operating on the entire array in chunks of 8 or 16 elements. Your Python for-loop multiplies one element per bytecode interpretation cycle.

Understand that when two seemingly equivalent implementations have an 8x performance difference, this is frequently why. Not cache (though that's related). Not branch prediction (though that matters too). The data layout didn't allow the CPU to use seven of its eight lanes.

If you do need explicit SIMD, Rust gives you options before you reach for raw intrinsics: nightly's portable std::simd (f32x8 and friends, no per-ISA code), stable wrapper crates like wide, and #[target_feature]-gated std::arch intrinsics only as the last resort.

For C++ codebases, highway (Google's portable SIMD abstraction) serves a similar role. Don't write raw _mm256_* calls unless you've exhausted the higher-level options — though in Rust, at least the type system will catch width mismatches at compile time instead of letting you discover them at midnight.


What the CPU Looks Like Now

One instruction:
  ADD rax, rbx
  → adds two 64-bit integers
  → uses 64 bits of register space

One SIMD instruction:
  VADDPS ymm0, ymm1, ymm2
  → adds eight 32-bit floats
  → uses 256 bits of register space
  → eight physical adders firing simultaneously

Your loop over 8 million floats:
  Scalar:  8,000,000 add instructions
  AVX2:    1,000,000 add instructions (8x fewer)
  AVX-512: 500,000 add instructions (16x fewer)

The lanes are there. They've been there since 1999, getting wider every few years. Every calculation you've ever run in a Python loop touched one lane of a machine that had eight available.


Further Reading


I'm writing a book about what makes developers irreplaceable in the age of AI. Join the early access list →


Naz Quadri once hand-wrote AVX2 intrinsics for a function the Rust compiler had already vectorised better. He blogs at nazquadri.dev. Rabbit holes all the way down 🐇🕳️.