But try to understand
Try to understand
Try try try to understand
Git’s a magic command.

Heart  💕

Once upon a time, I believed git was storing diffs somewhere. But then I learned I was wrong.

It’s challenging to wield git’s clunky interface when you have a broken mental model of its internals. Learning more about what’s happening inside git transformed me into a more effective git user.

In this post, I’ll attempt to explain all the deep details of git diff to my past self.

📍 Git add makes blobs

We can add a file to a repo using git add. But behind the porcelain, git’s busy compressing and storing this file deep in its bowels. Git calls the result of this process a “blob.”

Git stores blobs (among other things) inside the .git/objects directory.

$ git init
Initialized empty Git repository in /tmp/bar/.git/
$ echo "Hi, I'm blob" > foo
$ git add foo
$ tree .git/objects/
.git/objects/
└── 26
    └── 45aab142ef6b135a700d037e75cd9f1f1c94dc

But what’s in a blob? And why is this blob stored as ./26/45aab142ef6b135a700d037e75cd9f1f1c94dc?

🗃️ Git stores things by their hash

Why did git add foo store the contents of foo as 2645aab142ef6b135a700d037e75cd9f1f1c94dc?

Git mapped our file to a number via a hash function.

A hash function maps data to a unique number (mostly)—whenever the data changes, the hash function’s output changes dramatically.
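To see "changes dramatically" for ourselves, here is a minimal sketch using Python's hashlib. The helper names sha1_hex and differing_bits are made up for this example, and this is plain SHA1 over the raw bytes, so the digests it prints are not git object ids.

```python
import hashlib

def sha1_hex(data: bytes) -> str:
    """Return the SHA-1 digest of data as 40 hex characters."""
    return hashlib.sha1(data).hexdigest()

def differing_bits(hex_a: str, hex_b: str) -> int:
    """Count the bit positions where two equal-length hex digests differ."""
    return bin(int(hex_a, 16) ^ int(hex_b, 16)).count("1")

a = sha1_hex(b"Hi, I'm blob\n")
b = sha1_hex(b"Hi, I'm blob!\n")  # same text plus one extra character

print(a)
print(b)
print(differing_bits(a, b), "of 160 bits differ")
```

One added character should flip a large share of the 160 output bits, which is why nearly identical files end up with completely unrelated names.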

SHA1 is the hash function git uses by default. When we git add foo, git applies SHA1 to the contents of foo (Hi, I'm blob\n), prefixed with a small header giving the object type and size, and that spits out 2645aab142ef6b135a700d037e75cd9f1f1c94dc.

Blobs are all about content. The filename “foo” doesn’t matter at all! We could have named the file “🌈”—git still would have stored it in the same place. If the file contents are EXACTLY the same, then the hash will be exactly the same.
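Here is a rough sketch of how that name could be computed by hand. Strictly speaking, git hashes a short header ("blob", a space, the size in bytes, and a NUL byte) followed by the file contents. The function names git_blob_id and loose_object_path are invented for this example. Notice that the filename is not an input anywhere.

```python
import hashlib

def git_blob_id(content: bytes) -> str:
    """Compute the object id git assigns to a blob with this content."""
    header = b"blob %d\x00" % len(content)  # type, space, size in bytes, NUL
    return hashlib.sha1(header + content).hexdigest()

def loose_object_path(object_id: str) -> str:
    """The first two hex digits name the directory, the other 38 name the file."""
    return f".git/objects/{object_id[:2]}/{object_id[2:]}"

blob_id = git_blob_id(b"Hi, I'm blob\n")
print(blob_id)
print(loose_object_path(blob_id))
```

If the transcript above is accurate, this should print the same 2645aab1… id and the ./26/45aab1… path that tree showed us. That also answers the directory question: the first two hex digits become the folder name.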

🌱 Git commit creates commits and trees

You already know git commit creates a commit, but what is a commit?

A commit is a type of object. Git uses the word “object” to mean: a commit, a folder or directory (tree), a file (blob), or a tag. Git stores objects in its object database—everything inside the .git/objects directory.
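As a sketch of what sits inside one of those files: a loose object is the header plus the body, compressed with zlib. The two helpers below (encode_loose_object and decode_loose_object, names invented here) round-trip that format. The compressed bytes may not match git's byte for byte, since compression settings can differ, but they inflate to the same thing.

```python
import zlib
from typing import Tuple

def encode_loose_object(kind: bytes, body: bytes) -> bytes:
    """Build the bytes of a loose object file: zlib(header + body)."""
    header = kind + b" " + str(len(body)).encode() + b"\x00"
    return zlib.compress(header + body)

def decode_loose_object(raw: bytes) -> Tuple[bytes, bytes]:
    """Inflate a loose object and split it into (kind, body)."""
    data = zlib.decompress(raw)
    header, _, body = data.partition(b"\x00")  # header ends at the first NUL
    kind, _, size = header.partition(b" ")
    if int(size) != len(body):
        raise ValueError("size in header does not match body length")
    return kind, body

raw = encode_loose_object(b"blob", b"Hi, I'm blob\n")
print(decode_loose_object(raw))
```

To poke at a real repository, you could read a file under .git/objects in binary mode and hand its bytes to decode_loose_object.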

$ git commit -m 'Initial Commit'
[main (root-commit) 0644991] Initial Commit
1 file changed, 1 insertion(+)
create mode 100644 foo
$ tree .git/objects/
.git/objects/
├── 06
│   └── 449913ac0e43b73bfbd3141f5643a4db6d47f8
├── 26
│   └── 45aab142ef6b135a700d037e75cd9f1f1c94dc
└── 41
    └── 81320a57137264d436b2ef861c31f430256bf4

After our commit, the object database has three objects: 06449913, 2645aab1, and 4181320a.

We’ve already established that one of these three objects is our blob (2645aab1), so let’s see if we can suss out the others.

✨ The magic command

The magic command to learn about any object is git cat-file -p. We can use that command to find out more about our mystery objects:

$ git cat-file -p 06449913ac0e43b73bfbd3141f5643a4db6d47f8
tree 4181320a57137264d436b2ef861c31f430256bf4
author Tyler Cipriani <tcipriani@wikimedia.org> 1652310544 -0600
committer Tyler Cipriani <tcipriani@wikimedia.org> 1652310544 -0600

Initial Commit

This object (06449913) appears to be our commit. A commit is metadata compressed and stored inside git’s object database.
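The commit body is plain text: header lines, a blank line, then the message. Here is a small sketch of pulling it apart. parse_commit is a name invented for this example, and it ignores multi-line headers (such as signatures), so treat it as a simplification.

```python
from typing import Dict, List, Tuple

def parse_commit(body: str) -> Tuple[Dict[str, List[str]], str]:
    """Split a commit body into header fields and the commit message."""
    head, _, message = body.partition("\n\n")  # first blank line ends the headers
    fields: Dict[str, List[str]] = {}
    for line in head.split("\n"):
        key, _, value = line.partition(" ")
        fields.setdefault(key, []).append(value)  # a key such as "parent" may repeat
    return fields, message

# The text git cat-file -p printed for our commit above.
COMMIT_BODY = (
    "tree 4181320a57137264d436b2ef861c31f430256bf4\n"
    "author Tyler Cipriani <tcipriani@wikimedia.org> 1652310544 -0600\n"
    "committer Tyler Cipriani <tcipriani@wikimedia.org> 1652310544 -0600\n"
    "\n"
    "Initial Commit\n"
)

fields, message = parse_commit(COMMIT_BODY)
print(fields["tree"][0])
print(message, end="")
```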

Some of the metadata is obvious, but then there’s a tree. And that tree points to our other mystery object, 4181320a. Let’s see what we can learn about our last remaining mystery object using our magic command:

$ git cat-file -p 4181320a57137264d436b2ef861c31f430256bf4
100644 blob 2645aab142ef6b135a700d037e75cd9f1f1c94dc    foo

So a tree is an object that stores a directory listing of objects by their SHA1s. And a commit is an object that points at a tree by recording the tree’s SHA1!

Commits point to trees, and trees point to blobs and other trees. Neat!
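What cat-file prints for a tree is a prettified view. On disk, each entry is the mode, a space, the name, a NUL byte, and then the 20 raw bytes of the SHA1 (not the hex text). A sketch of building one, with invented helper names tree_body and git_object_id:

```python
import hashlib
from typing import List, Tuple

def tree_body(entries: List[Tuple[str, str, str]]) -> bytes:
    """Serialize (mode, name, hex_id) entries the way a tree object stores them."""
    def sort_key(entry: Tuple[str, str, str]) -> bytes:
        mode, name, _ = entry
        # Git orders directories (mode 40000) as if their name ended in "/".
        return name.encode() + (b"/" if mode == "40000" else b"")

    out = b""
    for mode, name, hex_id in sorted(entries, key=sort_key):
        out += mode.encode() + b" " + name.encode() + b"\x00" + bytes.fromhex(hex_id)
    return out

def git_object_id(kind: bytes, body: bytes) -> str:
    """SHA-1 over "<kind> <size>NUL" followed by the body."""
    header = kind + b" " + str(len(body)).encode() + b"\x00"
    return hashlib.sha1(header + body).hexdigest()

body = tree_body([("100644", "foo", "2645aab142ef6b135a700d037e75cd9f1f1c94dc")])
print(len(body), "bytes")
print(git_object_id(b"tree", body))
```

If the transcript above is accurate, the id printed here should be the 4181320a… tree our commit points at.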

📈 Git’s dependency graph

So if we graphed the state of dependencies in our object database, we’d get something like this:

[Figure: Simple git repo’s object dependency graph]

The commit incorporates our tree, which includes our blob—everything depends on our blob!

So if we change even a single bit inside a single file, git will notice: everything is traceable from the commit down to the bit level. We get this for free by hashing objects and including those hashes in other objects.

This is the whole concept of a Merkle Directed Acyclic Graph (Merkle DAG)!
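A tiny sketch of the "git will notice" part: re-hash an object and compare the result with the name it is stored under. The names object_id and is_intact are invented for this example. Because a tree records its blobs' ids and a commit records its tree's id, the same check works at every level of the graph.

```python
import hashlib

def object_id(kind: bytes, body: bytes) -> str:
    """SHA-1 over "<kind> <size>NUL" followed by the body."""
    header = kind + b" " + str(len(body)).encode() + b"\x00"
    return hashlib.sha1(header + body).hexdigest()

def is_intact(expected_id: str, kind: bytes, body: bytes) -> bool:
    """Re-hash an object and compare against the id it is stored under."""
    return object_id(kind, body) == expected_id

original = b"Hi, I'm blob\n"
blob_id = object_id(b"blob", original)

# Flip the lowest bit of the first byte: 'H' (0x48) becomes 'I' (0x49).
tampered = bytes([original[0] ^ 0x01]) + original[1:]

print(is_intact(blob_id, b"blob", original))  # True
print(is_intact(blob_id, b"blob", tampered))  # False
```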

🍔 So, where’s the diff?

When we type git diff, git presents us with a diff. We know there are blobs and trees and commits, so where’s the diff!?

Git doesn’t store diffs anywhere at all! It derives diffs from what’s stored in the object database.

$ echo "I'm ALSO blob" > baz
$ git add baz
$ git commit -m 'Add baz'
$ tree .git/objects/
.git/objects/
├── 06
│   └── 449913ac0e43b73bfbd3141f5643a4db6d47f8
├── 26
│   └── 45aab142ef6b135a700d037e75cd9f1f1c94dc
├── 41
│   └── 81320a57137264d436b2ef861c31f430256bf4
├── 95
│   └── 42599fac463c434456c0a16b13e346787f25da
├── 9b
│   └── 2716e4540c11e8d590e906dd8fa5a75904810a
└── e6
    └── 5a7344c46cebe61d052de6e30d33636e1cd0b4

We made a new commit, and now we have three new objects. We added a new file (blob), which made our directory different (tree), and we committed it (commit).

Our graph now looks like this:

[Figure: Simple git repo’s updated object dependency graph]

You might be surprised by a few things in the graph:

  • Our new commit stores its parent commit as metadata
  • Our new tree points to our old blob, and our NEW blob
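Both surprises fall out of content addressing. Here is a toy model, assuming a plain dict stands in for .git/objects. The names store, put, put_tree, and put_commit are invented, and the author line is a placeholder, so the commit ids will not match the ones in this post.

```python
import hashlib
from typing import Dict, Tuple

store: Dict[str, Tuple[str, bytes]] = {}  # object id -> (kind, body)

def put(kind: str, body: bytes) -> str:
    """Store an object under the SHA-1 of its header plus body; return the id."""
    header = f"{kind} {len(body)}".encode() + b"\x00"
    object_id = hashlib.sha1(header + body).hexdigest()
    store[object_id] = (kind, body)  # identical content lands on the same key
    return object_id

def put_tree(files: Dict[str, bytes]) -> str:
    """Store each file as a blob, then a tree listing them by name."""
    body = b""
    for name in sorted(files):
        blob_id = put("blob", files[name])
        body += b"100644 " + name.encode() + b"\x00" + bytes.fromhex(blob_id)
    return put("tree", body)

def put_commit(tree_id: str, message: str, parent: str = "") -> str:
    """Store a commit that names its tree and, if given, its parent."""
    lines = [f"tree {tree_id}"]
    if parent:
        lines.append(f"parent {parent}")
    # Placeholder identity and timestamp, not the post's real metadata.
    lines.append("author Example <example@example.com> 0 +0000")
    lines.append("committer Example <example@example.com> 0 +0000")
    body = "\n".join(lines) + "\n\n" + message + "\n"
    return put("commit", body.encode())

first = put_commit(put_tree({"foo": b"Hi, I'm blob\n"}), "Initial Commit")
objects_after_first = len(store)

second = put_commit(
    put_tree({"foo": b"Hi, I'm blob\n", "baz": b"I'm ALSO blob\n"}),
    "Add baz",
    parent=first,
)
print(objects_after_first, "->", len(store))  # 3 -> 6
```

The second snapshot stores foo again, but it hashes to the key that already exists, so only the new blob, the new tree, and the new commit get added.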

So what happens now when we try git diff?

$ git diff 064499..e65a73
diff --git a/baz b/baz
new file mode 100644
index 0000000..9b2716e
--- /dev/null
+++ b/baz
@@ -0,0 +1 @@
+I'm ALSO blob

Git compares the two commits, finds their trees, sees a new blob in the second commit, and shows you the diff of /dev/null and baz.
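That derivation can be sketched in a few lines. This is not git's diff machinery, only an illustration using Python's difflib, with each snapshot modeled as a hypothetical {path: content} dict instead of real tree objects. The name diff_trees is invented for this example.

```python
import difflib
from typing import Dict, List

def diff_trees(old: Dict[str, bytes], new: Dict[str, bytes]) -> List[str]:
    """Compare two snapshots ({path: content}) and build unified diffs on the fly."""
    out: List[str] = []
    for path in sorted(set(old) | set(new)):
        before, after = old.get(path), new.get(path)
        if before == after:
            continue  # unchanged entry (git can tell from matching blob ids alone)
        a_lines = (before or b"").decode().splitlines(keepends=True)
        b_lines = (after or b"").decode().splitlines(keepends=True)
        out += difflib.unified_diff(
            a_lines,
            b_lines,
            fromfile="/dev/null" if before is None else f"a/{path}",
            tofile="/dev/null" if after is None else f"b/{path}",
        )
    return out

first = {"foo": b"Hi, I'm blob\n"}
second = {"foo": b"Hi, I'm blob\n", "baz": b"I'm ALSO blob\n"}
print("".join(diff_trees(first, second)), end="")
```

Nothing here was stored ahead of time. The diff is computed from the two snapshots at the moment we ask for it, and foo is skipped because it is identical on both sides.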

No diffs. Just Merkle DAGs. And now you know.


Thanks to Joe Swanson for providing excellent early feedback on this post. And thanks to Kostah Harlan for reading an early draft of this post and making it less terrible. <3