
The Dumbest Smart Tool You Use Every Day

Git Internals: Content-Addressable Storage, DAGs, and the Safety Net You Forgot You Had

Reading time: ~19 minutes


In 2005, the Linux kernel community lost access to BitKeeper, the proprietary version control system they'd been using. Linus Torvalds needed a replacement. He wrote one in two weeks.

Two weeks. Not a prototype. Not a proof of concept. A working distributed version control system that could handle the Linux kernel — millions of lines of code, thousands of contributors, a patch flow that would choke any centralised system. He announced it on April 6, 2005. By April 29, git was hosting the Linux kernel. The man knocked it out as a side project because he was annoyed. 👏👏👏 🫡

Twenty years later, every developer on the planet uses it daily, and almost nobody knows how it actually works.

You typed git commit -m "fix bug". What happened? Not "a snapshot was created." What actually happened — on disk, in the file system. Right now, in your project directory, there's a folder called .git that contains the entire history of everything you've ever committed, every branch you've ever created, and every mistake you've ever made. That folder is a database. A surprisingly elegant one. And no one looks inside — why would you?

I thought I understood git for years. I did add, commit, push. I used worktrees. I rebased without flinching. I even had opinions about merge strategies. Then I built a tool for parallel, distributed multi-repo management — orchestrating git operations across dozens of repositories simultaneously, handling concurrent fetches, worktree lifecycle, ref manipulation — and I realised I'd only ever known the tip. The iceberg underneath was vast, and I'd paid it no attention this whole time.

The thing that finally made it click was opening .git/ and exploring what was inside. Once you see the data model, every command stops being a magic spell and starts being a filesystem operation. The fear goes away. Not the respect — git can still ruin your day. But the fear.


The Object Store: Everything Is a Hash

Open a terminal in any git repository and look:

ls .git/objects/

You'll see directories named with two-character hex prefixes -- 0a, 1f, 3b, and so on -- plus two special directories: info and pack. Every file in those hex-prefixed directories is a git object, and every git object is identified by a SHA-1 hash of its contents. (Git 2.29 added experimental SHA-256 support in 2020, and a long-running transition is underway — but every repo you're likely to touch today is still SHA-1, so that's what I'll use throughout.)

That's the core insight. Git is a content-addressable filesystem. You give it content, it hashes the content, and the hash becomes the address. The same content always produces the same hash. Two identical files in different commits produce one object, not two. Git deduplicates by default, not as an optimization -- it's the fundamental storage model.

There are exactly four types of objects.

Blobs

A blob is file contents. Not a file -- blobs don't know their own name. No filename, no permissions, no timestamps. Git prepends a header -- blob <size>\0 -- to the raw bytes, then computes the SHA-1 of that combined input and compresses it with zlib. The hash includes the header, not just the raw content. That's it.

# See it yourself
echo "hello" | git hash-object --stdin
# ce013625030ba8dba906f756967f9e9ca394464a

That hash is deterministic. Run it on any machine, any operating system, any year. Same input, same hash. Always.
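You can reproduce it without git at all. echo "hello" emits six bytes ("hello" plus a newline), so the input git actually hashes is the header, a NUL byte, then the content (shown here with sha1sum; on macOS, shasum works the same way):

```shell
# Recompute git's blob hash by hand: header, NUL byte, raw content.
# "hello\n" is 6 bytes, so the hashed input is the literal bytes "blob 6\0hello\n".
printf 'blob 6\0hello\n' | sha1sum
# ce013625030ba8dba906f756967f9e9ca394464a  -
```

Same hash as git hash-object, because it is the same computation.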

Trees

A tree is a directory listing. Each entry in a tree contains a file mode (permissions), an object type (blob or tree), a SHA-1 hash pointing to the object, and a filename. This is where filenames live -- not in blobs, but in the tree that references them. A tree can point to other trees (subdirectories) or blobs (files).

# Inspect a tree — first get the tree hash from a commit, then look inside it
git cat-file -p HEAD          # shows the commit, including its "tree" line
git cat-file -p <tree-hash>   # shows the directory listing for that tree

# Or the shorthand (ugly but useful): HEAD^{tree} means
# "the tree object that HEAD's commit points to"
git cat-file -p HEAD^{tree}
# 100644 blob 3b18e512dba79e4c8300dd08aeb37f8e728b8dad    .gitignore
# 040000 tree a1b2c3d4e5f6...                              src/

Commits

A commit points to exactly one tree (the root tree of your project at that moment), zero or more parent commits (zero for the initial commit, one for a normal commit, two for a merge), an author, a committer, a timestamp, and a message. That's the entire commit object. It's text. You can read it:

git cat-file -p HEAD
# tree b7c2a3d4f5e6981726354aabbccddeeff0011223
# parent 8a73b2f...
# author Naz Quadri <[email protected]> 1711600000 +0000
# committer Naz Quadri <[email protected]> 1711600000 +0000
#
# fix bug

The commit doesn't contain a diff. It doesn't contain a list of changed files. It contains a pointer to a complete snapshot of every file in the project. Git computes diffs on the fly by comparing two trees. The storage model is snapshots, not deltas — the opposite of what most people assume (i.e., me, for years).

This is the bit that broke my mental model the first time I actually looked. I'd written git hooks, built CI pipelines, argued about rebase vs merge — and I had the storage model backwards the whole time. I thought commits were diffs threaded together, because that's what git log -p shows you. But git log -p is a rendering. Under the hood, each commit is a full snapshot, and the diff is computed on demand by walking the two trees and finding the differences. Storage is cheap; recomputation is cheap; the simple model wins.
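You can verify both halves of that claim in a throwaway repo. A sketch (paths, messages, and names are illustrative):

```shell
# Scratch repo demonstrating snapshots + dedup
git init -q snapshot-demo && cd snapshot-demo
git config user.name "Demo" && git config user.email "[email protected]"

echo "unchanged" > a.txt
echo "v1"        > b.txt
git add . && git commit -qm "first"

echo "v2" > b.txt
git commit -qam "second"

# a.txt's blob is stored exactly once; both commits' trees point to the same hash
git rev-parse HEAD:a.txt HEAD~1:a.txt

# The diff doesn't exist on disk -- it's computed right now, by comparing two trees
git diff HEAD~1 HEAD -- b.txt
```

Two identical hashes for a.txt, one blob in the store, and a diff that only exists for as long as it takes to print.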

Tags

An annotated tag is an object that points to a commit (or any other object), plus a tagger, a date, and a message. Lightweight tags aren't objects at all -- they're just refs, which I'll get to in a moment.

The git object model: blobs, trees, commits, and tags linked by SHA-1 hashes

These four objects -- blobs, trees, commits, tags -- are the entire data model. Every other concept in git (branches, HEAD, the staging area, the reflog) is built on top of this. When people say "git is simple under the hood," this is what they mean. Four object types. Content-addressable storage. A directed acyclic graph of commits.


Refs: Branches Are Files

Here's where it gets good.

cat .git/refs/heads/main
# 8a73b2f1e4d5c6a7b8c9d0e1f2a3b4c5d6e7f8a9

That's it. That's a branch. A branch in git is a file containing a 40-character SHA-1 hash. The file lives at .git/refs/heads/<branch-name>. When you "create a branch," git creates a 41-byte file (40 hex characters plus a newline). When you commit on a branch, git overwrites that file with the new commit's hash.

There is no branch object. There is no branch data structure. There is a file containing a hash.
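You can watch it happen -- assuming the default loose-ref storage (newer git can use the reftable backend instead, where these files don't exist):

```shell
git branch demo-branch                  # "create a branch"...
cat .git/refs/heads/demo-branch         # ...is writing one hash to one file
wc -c < .git/refs/heads/demo-branch     # 41: 40 hex characters plus a newline
```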

HEAD works the same way, with one twist:

cat .git/HEAD
# ref: refs/heads/main

HEAD is a file containing a symbolic reference to another ref. It tells git "the current branch is main." When you commit, git follows HEAD to find the branch file, then updates the branch file with the new commit hash.
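The plumbing pair for reading that indirection (assuming you're on main):

```shell
git symbolic-ref HEAD    # which branch HEAD names: refs/heads/main
git rev-parse HEAD       # the commit hash that ref ultimately resolves to
```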

What about detached HEAD? I'll come back to that.


The Index: What git add Actually Does

Between your working directory and the object database sits a file called .git/index. This is the staging area -- and it's a binary file, not a directory of staged copies.

When you run git add somefile.py, git does three things:

  1. Compresses the file contents with zlib
  2. Writes a blob object to .git/objects/
  3. Updates .git/index to record: this filename, this blob hash, these permissions, this timestamp

The index is a sorted list of entries, one per tracked file, each recording the file's path, its blob hash, and metadata like timestamps and permissions. It's binary for performance -- git reads it into memory on almost every operation, and parsing text would be too slow for repositories with hundreds of thousands of files.

git status works by doing three comparisons: HEAD tree vs. index (staged changes), index vs. working directory (unstaged changes), and files in the working directory not in the index (untracked files). That's why git status can be slow on enormous repos -- it's doing a three-way diff every time.

git commit takes the current state of the index, builds a tree object from it, creates a commit object pointing to that tree, and updates the current branch ref. The working directory isn't involved at all. Only what's in the index gets committed. This is why you can stage part of a file -- the index tracks blob hashes, not file references.
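That pipeline is visible in the plumbing. A hedged sketch -- somefile.py and the branch name main are stand-ins -- of roughly what git commit assembles from the index (minus hooks and a few niceties):

```shell
git add somefile.py                         # blob written, index entry updated
git ls-files --stage                        # inspect the index: mode, hash, stage, path
tree=$(git write-tree)                      # build a tree object from the index
commit=$(git commit-tree "$tree" -p HEAD -m "fix bug")
git update-ref refs/heads/main "$commit"    # move the branch pointer forward
```

Nothing in those five commands reads the working directory except git add.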

The three areas of git: working directory, staging area, and object database


How Merge Actually Works

When you run git merge feature, git doesn't just "combine changes." It runs a three-way merge algorithm, and understanding the three participants makes merge conflicts stop feeling random.

The three inputs are:

  1. The merge base -- the most recent common ancestor of the two branches being merged. Git finds this by walking the commit graph backward from both branch tips until the paths converge. The command git merge-base main feature shows you this commit.
  2. Ours -- the current HEAD (the branch you're on).
  3. Theirs -- the branch you're merging in.

For each file, git compares all three versions. The logic:

  1. If only ours changed the file relative to the base, take ours.
  2. If only theirs changed it, take theirs.
  3. If both made the same change, take either -- they agree.
  4. If both changed the same region differently, mark a conflict and hand it to the human.

The three-way merge is why git can auto-resolve most merges. Two-way merge (just comparing the two branch tips) would have no way to know what changed -- it would just see two different files. The merge base provides the "what did the file used to look like" reference point.
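Git exposes the per-file three-way merge as a standalone command, so you can watch it resolve. A sketch with throwaway files -- ours edits the first line, theirs the last, and the base tells git both edits are safe to keep:

```shell
printf 'one\ntwo\nthree\nfour\nfive\n' > base
printf 'ONE\ntwo\nthree\nfour\nfive\n' > ours     # we changed line 1
printf 'one\ntwo\nthree\nfour\nFIVE\n' > theirs   # they changed line 5
git merge-file -p ours base theirs                # prints: ONE two three four FIVE
```

Without the base file, git would have no way to tell an edit from an un-edit.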

I used to think merge conflicts were git's fault. They're not. They're what happens when two humans edit the same lines independently. Git is doing the best anyone could do with that input.

Fast-Forward Merges

If the current branch is a direct ancestor of the branch being merged, there's no merge to do. Git just moves the branch pointer forward. No merge commit. No three-way merge. Just updating a ref file. This is a fast-forward merge, and it's why git merge sometimes creates a merge commit and sometimes doesn't.
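You can ask the ancestry question before merging -- feature is a stand-in branch name:

```shell
# Would merging `feature` fast-forward? True iff HEAD is an ancestor of it
git merge-base --is-ancestor HEAD feature && echo "fast-forward"
git merge --ff-only feature   # refuses to run if a real merge would be needed
```

--ff-only is a useful guard in scripts: it either moves a pointer or fails loudly, never silently creating a merge commit.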


How Rebase Works: Cherry-Pick in a Loop

Rebase is conceptually simple and practically terrifying. Here's the algorithm:

  1. Find the merge base of the current branch and the target branch
  2. Save all commits from the merge base to the current HEAD as patches
  3. Reset the current branch to the target branch tip
  4. Replay each saved commit on top, one at a time

"Replay each commit" means git creates a new commit with the same diff, the same message, and the same author -- but a different parent, and therefore a different SHA-1 hash. The old commits still exist in the object store (they're immutable), but nothing points to them anymore. The branch ref now points to the new chain.

This is what "rewriting history" means. The commits on your feature branch after a rebase have different hashes than before. If anyone else had a copy of the old hashes -- because you pushed to a shared branch -- their history and yours have diverged. This is why git push --force after a rebase can ruin someone's day. I know this because I was that someone.

Use git push --force-with-lease instead of --force. Plain --force overwrites the remote ref unconditionally — if a teammate pushed a commit while you were rebasing, it's gone. --force-with-lease checks that the remote ref still points where you last saw it. If someone else pushed in the meantime, the push is rejected and nobody's work gets destroyed. Same result when you're the only one on the branch, much safer when you're not.

Rebase is cherry-pick in a loop. git cherry-pick abc123 takes one commit, computes its diff, and applies it as a new commit on the current branch. Rebase does that for every commit in the range. If any individual cherry-pick has a conflict, rebase stops and waits for you to resolve it, then git rebase --continue moves to the next one.
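That loop, as a rough shell sketch -- real rebase additionally preserves committer metadata, handles empty commits, and runs hooks; main is the target branch:

```shell
branch=$(git symbolic-ref --short HEAD)           # remember which branch we're on
base=$(git merge-base HEAD main)                  # step 1: find the merge base
commits=$(git rev-list --reverse "$base"..HEAD)   # step 2: the commits to replay, oldest first
git checkout -q --detach main                     # step 3: start from the target tip
for c in $commits; do
  git cherry-pick "$c"                            # step 4: replay one commit at a time
done
git branch -f "$branch" HEAD                      # point the branch at the new chain
git checkout -q "$branch"
```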


Detached HEAD Demystified

Git's most terrifying warning is "You are in detached HEAD state." Robespierre¹ would have approved of the phrasing. The reality is far less revolutionary — HEAD is a file, and right now it contains a commit hash instead of a branch name. That's it. Nobody dies.

Remember that .git/HEAD file?

# Normal: HEAD points to a branch
cat .git/HEAD
# ref: refs/heads/main

# Detached: HEAD points directly to a commit
cat .git/HEAD
# 8a73b2f1e4d5c6a7b8c9d0e1f2a3b4c5d6e7f8a9

That's all detached HEAD is. HEAD contains a SHA instead of a symbolic ref. You're not "on" any branch. If you make commits in this state, they'll work fine -- each new commit updates HEAD to point to the new commit's SHA. But no branch ref is being updated. When you checkout a branch again, those commits become unreachable. Nothing in the ref graph points to them.
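When you still have the hash in hand, recovery is one command. A sketch of the whole round trip (file and branch names illustrative):

```shell
git checkout -q --detach                 # detach at the current commit
echo experiment >> notes.txt
git add notes.txt && git commit -qm "risky experiment"
sha=$(git rev-parse HEAD)                # the only handle on this commit now
git checkout -q main                     # commit is unreachable from every ref...
git branch experiment "$sha"             # ...until a ref points at it again
```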

This is exactly what your CI system does. actions/checkout in GitHub Actions checks out a specific commit SHA, not a branch. The build runs in detached HEAD state because CI cares about "build this exact commit," not "build whatever main points to right now." If a push happens to main between the checkout and the build finishing, detached HEAD guarantees you're still building the commit that triggered the pipeline.

How do you recover commits made in detached HEAD? The reflog.


The Reflog: Your 90-Day Safety Net

Every time HEAD moves -- every commit, every checkout, every rebase, every reset -- git logs the movement in .git/logs/HEAD. This is the reflog.

git reflog
# 8a73b2f HEAD@{0}: commit: fix bug
# 3c91d4e HEAD@{1}: checkout: moving from feature to main
# f7a2b8c HEAD@{2}: commit: add feature
# 1e4d5c6 HEAD@{3}: rebase (finish): onto main

Every entry records what HEAD pointed to, when, and why it moved. Entries are kept for 90 days by default (gc.reflogExpire) -- or 30 days for entries whose commits are no longer reachable from the ref's current value (gc.reflogExpireUnreachable). Even if you rebase, reset, or force-push -- the old commit hashes are in the reflog. As long as a hash is in the reflog, the commit exists in the object store and can be recovered.

# Recover a commit you "lost" to a rebase
git reflog
# Find the hash from before the rebase
git checkout -b recovery-branch abc123

I've used this exactly once in production. I'd rebased a feature branch, resolved conflicts incorrectly, and pushed. The correct version of the code existed only in reflog entries on my local machine. Fifteen minutes of git reflog and git cherry-pick saved a day of rewriting. The hard part wasn't the git commands — it was typing them accurately while dripping fear sweat from my brow onto the keyboard. The reflog is the reason git is hard to permanently lose data with. You have to actively try.

The reflog timeline: orphaned commits remain recoverable for 90 days


Git Worktrees: Multiple Working Copies, One Repository

Most people don't know this feature exists. I didn't until 2023.

git worktree lets you check out multiple branches simultaneously, each in its own directory, all sharing the same .git object store. No cloning. No copying. One repository, multiple working trees.

git worktree add ../hotfix-branch hotfix/urgent-fix
# Creates ../hotfix-branch/ with the hotfix branch checked out
# Your current directory stays on your current branch

The new worktree has its own index, its own HEAD, and its own working directory -- but the objects, refs, and reflog are shared. You can't check out the same branch in two worktrees (git prevents it to avoid index corruption), but you can have main in one directory and feature in another and switch between them by switching terminal tabs instead of running git checkout.
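The day-to-day housekeeping, continuing the ../hotfix-branch example:

```shell
git worktree list                    # every checkout sharing this object store
git worktree remove ../hotfix-branch
git worktree prune                   # clean up records for manually deleted trees
```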

I use this constantly now. Code review in one worktree while continuing development in another. Running the test suite against main in one terminal while writing new tests in another. No stashing. No context-switching commits.

The friction is the setup. Every new repo, you're running git clone, then git worktree add, managing paths, remembering which directory is the "main" one. The worktree-first layout is clean once you have it, but bootstrapping it by hand every time is a pain — enough of a pain that I never stuck with worktrees until I automated the setup. I wrote a shell helper called gwt that flips the model: it clones the repo as a bare repository into a hidden .bare/ directory, writes a .git pointer to it, and checks out your main branch as the first worktree. From there, every branch is just git worktree add -b feature-x feature-x main. The project directory becomes a flat list of branch directories, all sharing one object store. No duplicated objects, no cloning, no drift, and critically — no fiddly setup to forget. I've been using this layout for everything since 2023 and I'm not going back.
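For the curious, the layout gwt produces can be reproduced by hand -- the URL and branch name below are placeholders; gwt just automates these steps:

```shell
mkdir myproject && cd myproject
git clone --bare [email protected]:me/myproject.git .bare
echo "gitdir: ./.bare" > .git                # make the directory act like a repo
# bare clones don't configure remote-tracking refs by default -- fix that:
git config remote.origin.fetch '+refs/heads/*:refs/remotes/origin/*'
git fetch -q origin
git worktree add main main                   # first worktree: the main branch
```

From here, each new branch is a new sibling directory, all backed by the one object store in .bare/.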


How Your CI Really Uses Git

GitHub Actions' actions/checkout doesn't run git clone at all. By default it initialises an empty repository and does a shallow fetch -- the equivalent of fetching with --depth=1 -- retrieving exactly one commit, the one that triggered the pipeline, and none of the history. The .git/objects directory contains only the blobs, trees, and single commit needed for that snapshot.

Why? Speed. A repository with 50,000 commits and years of history might have a .git folder measured in gigabytes. CI doesn't need that. It needs the source code at one point in time. Shallow clone gives it that in seconds.

This has consequences. git log in a shallow clone shows one commit. git blame fails. git diff HEAD~5 fails because HEAD~5 doesn't exist. If your CI pipeline needs history -- for changelog generation, for git describe, for semantic versioning -- you have to explicitly configure fetch-depth: 0 in your workflow to get a full clone. Every CI job I've debugged that involved "why can't git see the previous tag" had this as the root cause.
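You can also repair a shallow clone in place rather than re-running the whole job. A sketch:

```shell
# Is this a shallow clone? (git 2.15+)
git rev-parse --is-shallow-repository
# If true, fetch the missing history on demand instead of re-cloning:
git fetch --unshallow
```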

The checkout also uses --detach, putting the build in detached HEAD state. And it fetches using the HTTPS protocol with a token, not SSH -- which is why your local SSH-based git remote URLs don't affect CI at all.


The Packfile: How .git Stays Small

If git stores a complete snapshot for every commit, why isn't .git enormous?

Two reasons. First, deduplication: identical files across commits are stored as a single blob. If you have 1,000 commits and only changed 3 files, the other files are just tree entries pointing to the same blobs.

Second: packfiles. When you run git gc (or when git decides to run it automatically), it takes all those individual object files and packs them into a single file: .git/objects/pack/pack-<hash>.pack, with a companion .idx index file.

Inside the packfile, git uses delta compression. It finds objects that are similar (often different versions of the same file across commits), stores one version in full, and stores the others as deltas -- just the differences. This is not the same as storing diffs between commits. Git chooses delta bases heuristically, sometimes using an older version of a file as the base and the newer version as the delta, sometimes the reverse. The algorithm optimizes for compression ratio, not chronological order.

The result is that a repository's packfile is often dramatically smaller than the sum of its snapshots would suggest. The Linux kernel repository has millions of objects spanning decades of history, and its .git directory sits around a few gigabytes (about 4.5 GB as of late 2024, growing steadily). Without delta compression, it would be orders of magnitude larger.
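You can watch delta compression happen with git verify-pack: entries that carry trailing depth and base-object columns are stored as deltas against another object (the two comment lines below are schematic, not literal output):

```shell
git gc --quiet                                     # pack loose objects into a packfile
git verify-pack -v .git/objects/pack/pack-*.idx | head -4
# <sha> blob  <size> <size-in-pack> <offset>                      stored in full
# <sha> blob  <size> <size-in-pack> <offset> <depth> <base-sha>   stored as a delta
```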

# See your packfile stats (-H prints human-readable sizes;
# without it, sizes are raw kilobytes and the arithmetic is yours)
git count-objects -vH
# count: 0
# size: 0 bytes
# in-pack: 48203
# packs: 1
# size-pack: 12.4 MiB

When the Packfile Eats Your Disk

Delta compression is brilliant until someone commits something that shouldn't be there. I learned this the hard way.

One of my repos started getting slow. Clones took forever. CI was timing out. du -sh .git returned 3.2 GB for a project with maybe 50MB of source code. Something was very wrong.

To find the largest objects in a repository, there's a plumbing one-liner:

git rev-list --objects --all | git cat-file --batch-check='%(objecttype) %(objectname) %(objectsize) %(rest)' | sort -rnk3 | head -5

The culprit: a 2.8 GB pickle file. A data scientist had committed a serialised model weeks earlier, realised the mistake, deleted it in the next commit, and moved on. But "deleted" in git means "removed from the current tree." The blob is still in the object store, because the commit that introduced it is still reachable from every branch -- and that commit's tree still points at the blob. The 2.8 GB is in the packfile forever.

Removing it was a nightmare. You can't just delete a blob — every commit hash downstream depends on the tree that contained it. Change the tree, and the commit hash changes. Change the commit hash, and every child commit's parent hash changes. It cascades all the way to HEAD. Every commit after the pickle was introduced had to be rewritten with a new hash.

I used git filter-repo (the modern replacement for git filter-branch, which is too slow and too easy to get wrong):

git filter-repo --path model.pkl --invert-paths

One command. It rewrote every commit, removed the blob, repacked the repo. The .git directory dropped from 3.2 GB to 47 MB. But every developer on the team had to re-clone — their local repos still had the old hashes, and git pull can't reconcile a rewritten history. Force-push, Slack message, apologies all round.

The lesson: .gitignore your data files BEFORE the first commit. git rm doesn't remove history. And if you understand the object model — blobs, trees, commits, SHA-1 chains — you understand exactly why this is so painful and why there's no shortcut.


git gc: The Garbage Collector

git gc does three things:

  1. Packs loose objects into packfiles (delta compression, deduplication)
  2. Prunes unreachable objects -- objects not reachable from any ref or reflog entry, and older than the grace period (default: 2 weeks via gc.pruneExpire)
  3. Packs refs -- consolidates individual ref files from .git/refs/ into .git/packed-refs for faster lookup
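All three effects are observable from the shell:

```shell
git gc --quiet
git count-objects -v        # loose object count drops to ~0 after packing
head -3 .git/packed-refs    # refs consolidated into a single file
git prune --dry-run         # objects that would be pruned right now (usually nothing)
```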

Git runs gc --auto periodically when the number of loose objects exceeds a threshold (default: 6,700, configurable via gc.auto). You'll occasionally see "Auto packing the repository in background for optimum performance" after a commit or fetch -- that's gc --auto.

The two-week grace period for pruning is important. When you rebase and "lose" commits, those commit objects become unreachable from any ref. But the reflog still references them for up to 90 days. And even after the reflog entry expires, gc waits another two weeks before deleting the object. Git gives you multiple safety nets before anything is truly gone.

Git gc: loose objects packed and compressed, unreachable objects pruned


The DAG: Why Git History Is a Graph, Not a Line

Every commit (except the root) points to one or more parents. This creates a directed acyclic graph -- directed because edges go from child to parent, acyclic because you can't create a cycle (a commit can't be its own ancestor). Branches are just named pointers into this graph. Merges create commits with two parents. Rebases create new linear chains that replace old ones.

When you run git log, git walks this graph from HEAD backward, following parent pointers. git log --graph --oneline --all draws the ASCII art representation that makes the branching structure visible. I have this aliased to git lg and I use it roughly every four minutes.

The DAG is the reason git can answer questions like "what's the merge base?" or "has this branch been merged into main?" efficiently. These are graph traversal problems, and the commit graph is built to support them.

Understanding the DAG is the difference between using git and understanding git. Commands stop being incantations. git rebase is "copy these nodes to a new position in the graph." git merge is "create a node with two parents." git reset --hard is "move this pointer to a different node." git cherry-pick is "copy one node."
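Those graph questions all have direct plumbing forms -- feature is a stand-in branch name:

```shell
git merge-base main feature             # nearest common ancestor of the two tips
git merge-base --is-ancestor feature main && echo "feature is merged"
git rev-list --count HEAD               # how many nodes are reachable from HEAD
git log --graph --oneline --all         # draw the graph
```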

The multi-repo orchestration tool I mentioned at the outset — concurrent fetches, worktree lifecycle, ref manipulation across dozens of repositories in parallel. That tool was the goal. This post is the side effect. Every section here started as rough notes I scribbled while debugging something my tool wasn't doing the way I expected. I never set out to learn git internals. I set out to build something, and the internals were in the way. The best learning I've ever done worked exactly like that.


Further Reading


  1. Maximilien Robespierre — the man who detached quite a few heads during the French Revolution. He was eventually detached from his own. 


I'm writing a book about what makes developers irreplaceable in the age of AI. Join the early access list →


Naz Quadri once force-pushed a rebase to a shared branch at 11pm and has been mass-producing reflog entries ever since. He blogs at nazquadri.dev. Rabbit holes all the way down 🐇🕳️.