The Directed Acyclic Graph (DAG) is a concept I run into over-and-over again; which is, perhaps, somewhat ironic.
A DAG is a representation of vertices (nodes) that are connected by directional edges (arcs—i.e., lines) in such a way that there are no cycles (e.g., starting at Node A
, you shouldn’t be able to return to Node A
).
DAGs have lots of uses in computer science problems and in discrete mathematics. You’ll find DAGs in build-systems, network problems, and, importantly (for this blog post, if not generally) in Git.
One way to think of a DAG is as a set of dependencies—each node may have a dependency on one or more other nodes. That is, in order to get to Node B
you must route through Node A
, so Node B
depends on Node A
:
The visualization of dependencies in a JSON object is (SURPRISE!) different from the input format needed to visualize a DAG using the D3.js Force layout. To change the above object into Force’s expected input, I created a little helper function:
var forceFormat = function(dag) {
var orderedNodes = [],
nodes = [],
links = [],
usesPack = false;
// Basically a dumb Object.keys
for (node in dag) {
if ( !dag.hasOwnProperty( node ) ) continue;
orderedNodes.push(node);
}
orderedNodes.forEach(function(node) {
var sources = dag[node];
if (!sources) return;
sources.forEach(function(source) {
var source = orderedNodes.indexOf(source);
// If the source isn't in the Git DAG, it's in a packfile
if (source < 0) {
if (usesPack) return;
source = orderedNodes.length;
usesPack = true;
}
links.push({
'source': source,
'target': orderedNodes.indexOf(node)
});
});
nodes.push({'name': node});
});
// Add pack file to end of list
if (usesPack) nodes.push({'name': 'PACK'});
return { 'nodes': nodes, 'links': links };
};
var forceInput = forceFormat(dag);
forceFormat
outputs a JSON object that can be used as input for the Force layout.
{
"links": [
{
"source": "Node A",
"target": "Node B"
}
],
"nodes": [
{ "name": "Node A" },
{ "name": "Node B" }
]
}
I can pass this resulting JSON object off to a function that I created after a long time staring at one of mbostock’s many amazing examples to create a D3 Force graph of verticies and edges:
// http://bl.ocks.org/mbostock/1138500
var makeGraph = function(target, graphData) {
var target = d3.select(target),
bounds = target.node().getBoundingClientRect(),
fill = d3.scale.category20(),
radius = 25;
var svg = target.append('svg')
.attr('width', bounds.width)
.attr('height', bounds.height);
// Arrow marker for end-of-line arrow
svg.append('defs').append('marker')
.attr('id', 'arrowhead')
.attr('refX', 17.5)
.attr('refY', 2)
.attr('markerWidth', 8)
.attr('markerHeight', 4)
.attr('orient', 'auto')
.attr('fill', '#ccc')
.append('path')
.attr('d', 'M 0,0 V 4 L6,2 Z');
var link = svg.selectAll('line')
.data(graphData.links)
.enter()
.append('line')
.attr('class', 'link')
.attr('marker-end', 'url(#arrowhead)');
// Create a group for each node
var node = svg.selectAll('g')
.data(graphData.nodes)
.enter()
.append('g');
// Color the node based on node's git-type (otherwise, hot pink!)
node.append('circle')
.attr('r', radius)
.attr('class', 'node')
.attr('fill', function(d) {
var blue = '#1BA1E2',
red = 'tomato',
green = '#5BB75B',
pink = '#FE57A1';
if (d.name.endsWith('.b')) { return red; }
if (d.name.endsWith('.t')) { return blue; }
if (d.name.endsWith('.c')) { return green; }
return pink;
});
node.append('text')
.attr('y', radius * 1.5)
.attr('text-anchor', 'middle')
.attr('fill', '#555')
.text(function(d) {
if (d.name.length > 10) {
return d.name.substring(0, 8) + '...';
}
return d.name;
});
// If the node has a type: tag it
node.append('text')
.attr('text-anchor', 'middle')
.attr('y', 4)
.attr('fill', 'white')
.attr('class', 'bold-text')
.text(function(d) {
if (d.name.endsWith('.b')) { return 'BLOB'; }
if (d.name.endsWith('.t')) { return 'TREE'; }
if (d.name.endsWith('.c')) { return 'COMMIT'; }
return '';
});
var charge = 700 * graphData.nodes.length;
var force = d3.layout.force()
.size([bounds.width, bounds.height])
.nodes(graphData.nodes)
.links(graphData.links)
.linkDistance(150)
.charge(-(charge))
.gravity(1)
.on('tick', tick);
// No fancy animation, tick amount varies based on number of nodes
force.start();
for (var i = 0; i < graphData.nodes.length * 100; ++i) force.tick();
force.stop();
function tick(e) {
// Push sources up and targets down to form a weak tree.
var k = -12 * e.alpha;
link
.each(function(d) { d.source.y -= k, d.target.y += k; })
.attr('x2', function(d) { return d.source.x; })
.attr('y2', function(d) { return d.source.y; })
.attr('x1', function(d) { return d.target.x; })
.attr('y1', function(d) { return d.target.y; });
node
.attr('transform', function(d) {
return 'translate(' + d.x + ',' + d.y + ')';
});
}
};
makeGraph('.merkle-1', forceInput);
You’d be forgiven for thinking that is a line.
This directional line is a DAG—albeit a simple one. Node B
depends on Node A
and that is the whole graph. If you want to get to Node B
then you have to start at Node A
. Depending on your problem-space, Node B
could be many things: A place in Königsberg, a target in a Makefile (or a Rakefile), or (brace yourself) a Git object.
Git Object Anatomy ¶
In order to understand how Git is a DAG, you need to understand Git “objects”:
mkdir merkle
$ cd merkle
$ echo 'This is the beginning' > README
$ git init
$ git add .
$ git -m 'Initial Commit'
$ find .git/objects/ -type f
$ .git/objects/1b/9f426a8407ffee551ad2993c5d7d3780296353
.git/objects/09/8e6de29daf4e55f83406b49f5768df9bc7d624
.git/objects/1a/06ce381ac14f7a5baa1670691c2ff8a73aa6da
What are Git objects? Because they look like nonsense:
cat .git/objects/1b/9f426a8407ffee551ad2993c5d7d3780296353
$ xKOR02,V¢T¤̼¼̼t.
After a little digging through the Pro Git book, Git objects are a little less non-sensicle. Git objects are simply zlib
compressed, formatted messages:
python2 -c 'import sys,zlib; \
$ print zlib.decompress(sys.stdin.read());' \
< .git/objects/1a/06ce381ac14f7a5baa1670691c2ff8a73aa6da
commit 195tree 098e6de29daf4e55f83406b49f5768df9bc7d624
author Tyler Cipriani <tcipriani@wikimedia.org> 1458604120 -0700
committer Tyler Cipriani <tcipriani@wikimedia.org> 1458604120 -0700
Initial Commit
Parts of that message are obvious: author
and committer
obviously come from my .gitconfig
. There is a Unix epoch timestamp with a timezone offset. commit
is the type of object. 195
is the byte-length of the remainder of the message.
There are a few parts of that message that aren’t immediately obvious. What is tree 098e6de29daf4e55f83406b49f5768df9bc7d624
? And why would we store this message in .git/objects/1a/06ce381ac14f7a5baa1670691c2ff8a73aa6da
and not .git/objects/commit-message
? Is a merkle what I think it is? The answer to all of these questions and many more is the same: Cryptographic Hash Functions.
Hash Functions ¶
A cryptographic hash function is a function that when given an input of any length it creates a fixed-length output. Furthermore (and more importantly), the fixed-length output should be unique to a given input; any change in input will likely cause a big change in the output. Git uses a cryptographic hash function called Secure Hash Algorithm 1 (SHA-1).
You can play with the SHA-1 function on the command line:
echo 'message' | sha1sum
$ 1133e3acf0a4cbb9d8b3bfd3f227731b8cd2650b -
echo 'message' | sha1sum
$ 1133e3acf0a4cbb9d8b3bfd3f227731b8cd2650b -
echo 'message1' | sha1sum
$ c133514a60a4641b83b365d3dc7b715dc954e010 -
Note the big change in the output of sha1sum
from a tiny change in input. This is what cryptographic hash functions do.
Hash that DAG! ¶
Now that we have some idea of what is inside a commit object, let’s reverse-engineer the commit object from the HEAD
of our merkle
repo:
python2 -c 'import sys,zlib; \
$ print zlib.decompress(sys.stdin.read());' \
< .git/objects/1a/06ce381ac14f7a5baa1670691c2ff8a73aa6da | \
od -c
0000000 c o m m i t 1 9 5 \0 t r e e
0000020 0 9 8 e 6 d e 2 9 d a f 4 e 5 5
0000040 f 8 3 4 0 6 b 4 9 f 5 7 6 8 d f
0000060 9 b c 7 d 6 2 4 \n a u t h o r
0000100 T y l e r C i p r i a n i <
0000120 t c i p r i a n i @ w i k i m e
0000140 d i a . o r g > 1 4 5 8 6 0 4
0000160 1 2 0 - 0 7 0 0 \n c o m m i t
0000200 t e r T y l e r C i p r i a
0000220 n i < t c i p r i a n i @ w i
0000240 k i m e d i a . o r g > 1 4 5
0000260 8 6 0 4 1 2 0 - 0 7 0 0 \n \n I
0000300 n i t i a l C o m m i t \n \n
0000317
printf 'tree 098e6de29daf4e55f83406b49f5768df9bc7d62k4\n' >> commit-msg
$ printf 'author Tyler Cipriani <tcipriani@wikimedia.org> 1458604120 -0700\n' >> commit-msg
$ printf 'committer Tyler Cipriani <tcipriani@wikimedia.org> 1458604120 -0700\n' >> commit-msg
$ printf '\nInitial Commit\n' >> commit-msg $
sha1sum <(cat \
$ <(printf "commit ") \
<(wc -c < commit-msg | tr -d '\n') \
<(printf '%b' '\0') commit-msg)
1a06ce381ac14f7a5baa1670691c2ff8a73aa6da /dev/fd/63
Hmm… that seems familiar
export COMMIT_HASH=$(sha1sum <(cat <(printf "commit ") <(wc -c < commit-msg | tr -d '\n') <(printf '%b' '\0') commit-msg) | cut -d' ' -f1)
$ find ".git/objects/${COMMIT_HASH:0:2}" -type f -name "${COMMIT_HASH:(-38)}"
$ .git/objects/1a/06ce381ac14f7a5baa1670691c2ff8a73aa6da
The commit object is a zlib-compressed, formatted message that is stored in a file named after the SHA-1 hash of the file’s un-zlib
compressed contents.
(/me wipes brow)
Let’s use git-cat-file
to see if we can explore the tree 098e6de29daf4e55f83406b49f5768df9bc7d62k4
-part of the commit message object:
cat .git/HEAD
$ ref: refs/heads/master
cat .git/refs/heads/master
$ 1a06ce381ac14f7a5baa1670691c2ff8a73aa6da
git cat-file -p 1a06ce381ac14f7a5baa1670691c2ff8a73aa6da
$ tree 098e6de29daf4e55f83406b49f5768df9bc7d624
author Tyler Cipriani <tcipriani@wikimedia.org> 1458604120 -0700
committer Tyler Cipriani <tcipriani@wikimedia.org> 1458604120 -0700
git cat-file -p 098e6de29daf4e55f83406b49f5768df9bc7d624
$ 100644 blob 1b9f426a8407ffee551ad2993c5d7d3780296353 README
git cat-file -p 1b9f426a8407ffee551ad2993c5d7d3780296353
$ This is the beginning
Hey that’s the text I put into README
!
So .git/HEAD
refers to .git/refs/heads/master
, calling git-cat-file
on the object found inside that file shows that it’s the commit object we recreated. The commit object points to 098e6de29daf4e55f83406b49f5768df9bc7d624
, which is a tree object with the contents: 100644 blob 1b9f426a8407ffee551ad2993c5d7d3780296353 README
The blob
object 1b9f426a8407ffee551ad2993c5d7d3780296353
is the contents of README
! So it seems each commit
object points to a tree
object that points to other objects.
What was I talking about? Merkle DAGs? D3.js? ¶
Let’s see if we can paste together what Git is doing at a low-level when we make a new commit:
- Take the contents of
README
, hash the contents using SHA-1, and store as ablob
object in.git/objects
. - Create a directory listing of the git working directory, listing each file, with its directory permissions and its hash value. Hash this directory listing and store as a
tree
in.git/objects
. - Take the commit message, along with info from
.gitconfig
and the hash of the top-level tree. Hash this information and store it as acommit
object in.git/objects
.
It seems that there may be a chain of dependencies:
var gitDag = {
// blob (add .b for blob)
'1b9f426a8407ffee551ad2993c5d7d3780296353.b': [],
// tree (.t == tree) is a hash that includes the hash from blob
'098e6de29daf4e55f83406b49f5768df9bc7d624.t': ['1b9f426a8407ffee551ad2993c5d7d3780296353.b'],
// commit (.c == commit) is a hash that includes the hash from tree
'1a06ce381ac14f7a5baa1670691c2ff8a73aa6da.c': ['098e6de29daf4e55f83406b49f5768df9bc7d624.t'],
};
makeGraph('.merkle-2', forceFormat(gitDag));
You’d be forgiven for thinking that is a line.
What’s really happening is that there is a commit
object (1a06ce38
) that depends on a tree
object (098e6de2
) that depends on a blob
(1b9f426a
).
Since it’s running each of these objects through a hash function and each of them contains a reference up the chain of dependencies, a minor change to either the blob
or the tree
will create a drastically different commit
object.
Applying a cryptographic hash function on top of a graph was Ralph Merkle’s big idea. This scheme makes magic possible. Transferring verifiable and trusted information through an untrusted medium is toatz for realz possible with Ralph’s little scheme.
The idea is that if you have the root-node hash, that is, the cryptographic hash of the node that depends on all other nodes (the commit
object in Git), and you obtained that root-node hash from a trusted source, you can trust all sub-nodes that stem from that root node if the hash of all those sub-root-nodes matches the root-node hash!
This is the mechanism by which things like Git, IPFS, Bitcoin, and BitTorrent are made possible: changing any one node in the graph changes all nodes that depend on that node all the way to the root-node (the commit
in Git).
Tales from the Merkle Graph ¶
I wrote a simple NodeJS script that creates a graph that is suitable for input into the JavaScript that I’ve already written that will create a D3.js force graph with whatever it finds in .git/objects
.
#!/usr/bin/env nodejs
/* makeDag - creates a JSON dependency graph from .git/objects */
var glob = require('glob'),
fs = require('fs'),
zlib = require('zlib');
var types = ['tree', 'commit', 'blob'],
treeRegex = {
// 100644 README\0[20 byte sha1]
regex: /[0-9]+\s[^\0]+\0((.|\n){20})/gm,
fn: function(sha) {
var buf = new Buffer(sha[1], 'binary');
return buf.toString('hex') + '.b';
}
},
commitRegex = {
// tree 098e6de29daf4e55f83406b49f5768df9bc7d624
regex: /(tree|parent)\s([a-f0-9]{40})/gm,
fn: function(sha) {
if (sha[1] === 'tree') {
return sha[2] + '.t';
}
return sha[2] + '.c';
}
},
total = 0,
final = {};
// determine file type, parse out SHA1s
var handleObjects = function(objData, name) {
types.forEach(function(type) {
var re, regex, match, key;
if (!objData.startsWith(type)) { return; }
key = name + '.' + type[0];
final[key] = [];
if (type === 'tree') { objType = treeRegex; }
if (type === 'commit') { objType = commitRegex; }
if (type === 'blob') { return; }
// Remove the object-type and size from file
objData = objData.split('\0');
objData.shift();
objData = objData.join('\0');
// Recursive regex match remainder
while ((match = objType.regex.exec(objData)) !== null) {
final[key].push(objType.fn(match));
}
});
// Don't output until you've got it all
if (Object.keys(final).length !== total) {
return;
}
// Output what ya got.
console.log(final);
};
// Readable object names not file names
var getName = function(file) {
var fileParts = file.split('/'),
len = fileParts.length;
return fileParts[len - 2] + fileParts[len - 1];
}
// Inflate the deflated git object file
var handleFile = function(file, out) {
var name = getName(file);
fs.readFile(file, function(e, data) {
zlib.inflate(data, function(e, data) {
if (e) { console.log(file, e); return; }
handleObjects(data.toString('binary'), name);
});
});
};
// Sort through the gitobjects directory
var handleFiles = function(files) {
files.forEach(function(file) {
fs.stat(file, function(e, f) {
if (e) { return; }
if (f.isFile()) {
// Don't worry about pack files for now
if (file.indexOf('pack') > -1) { return; }
total++;
handleFile(file);
}
});
});
};
(function() {
glob('.git/objects/**/*', function(e, f) {
if (e) { throw e; }
handleFiles(f);
});
})();
Merkle graph transformations are often difficult to describe, but easy to see. Using this last piece of code to create and view graphs for several repositories has been illuminating. The graph visualization both illuminates and challenges my understanding of Git in ways I didn’t anticipate.
The Tale of Commit Message Bike-shedding ¶
When you change your commit message, what happens to the graph? What depends on a commit? Where is the context for a commit?
git commit --amend -m 'This is the commit message now'
$ master 585448a] This is the commit message now
[Date: Mon Mar 21 16:48:40 2016 -0700
1 file changed, 1 insertion(+)
create mode 100644 README
find .git/objects -type f
$ .git/objects/1b/9f426a8407ffee551ad2993c5d7d3780296353
.git/objects/09/8e6de29daf4e55f83406b49f5768df9bc7d624
.git/objects/1a/06ce381ac14f7a5baa1670691c2ff8a73aa6da
.git/objects/da/94af3a21ac7e0c875bbbe6162aa1d26d699c73
Now the DAG is a bit different:
var gitDag = { '098e6de29daf4e55f83406b49f5768df9bc7d624.t': [ '1b9f426a8407ffee551ad2993c5d7d3780296353.b' ],
'1a06ce381ac14f7a5baa1670691c2ff8a73aa6da.c': [ '098e6de29daf4e55f83406b49f5768df9bc7d624.t' ],
'1b9f426a8407ffee551ad2993c5d7d3780296353.b': [],
'da94af3a21ac7e0c875bbbe6162aa1d26d699c73.c': [ '098e6de29daf4e55f83406b49f5768df9bc7d624.t' ] }
makeGraph('.merkle-3', forceFormat(gitDag));
Here we see that there are now two commit
objects (1a06ce38
and da94af3a
) that both depend on a single tree
object (098e6de2
) that depends on a single blob
(1b9f426a
).
One of these commit objects will never be seen with git log
.
The Orphan Blob That Dared to Dream ¶
TIL: Git creates blob
objects as soon as a file is added to the staging area.
echo 'staged' > staged
$ find .git/objects -type f
$ .git/objects/1b/9f426a8407ffee551ad2993c5d7d3780296353
.git/objects/09/8e6de29daf4e55f83406b49f5768df9bc7d624
.git/objects/1a/06ce381ac14f7a5baa1670691c2ff8a73aa6da
.git/objects/da/94af3a21ac7e0c875bbbe6162aa1d26d699c73
Notice that nothing depends on this object just yet. It’s a lonely orphan blob
.
git add staged
$ find .git/objects -type f
$ .git/objects/1b/9f426a8407ffee551ad2993c5d7d3780296353
.git/objects/09/8e6de29daf4e55f83406b49f5768df9bc7d624
.git/objects/1a/06ce381ac14f7a5baa1670691c2ff8a73aa6da
.git/objects/da/94af3a21ac7e0c875bbbe6162aa1d26d699c73
.git/objects/19/d9cc8584ac2c7dcf57d2680375e80f099dc481
makeDag
$ { '098e6de29daf4e55f83406b49f5768df9bc7d624.t': [ '1b9f426a8407ffee551ad2993c5d7d3780296353.b' ],
'19d9cc8584ac2c7dcf57d2680375e80f099dc481.b': [],
'1a06ce381ac14f7a5baa1670691c2ff8a73aa6da.c': [ '098e6de29daf4e55f83406b49f5768df9bc7d624.t' ],
'da94af3a21ac7e0c875bbbe6162aa1d26d699c73.c': [ '098e6de29daf4e55f83406b49f5768df9bc7d624.t' ],
'1b9f426a8407ffee551ad2993c5d7d3780296353.b': [] }
Even unstaging and deleting the file doesn’t remove the object. Orphan objects in git are only garbage collected as part of git gc --prune
.
When this object is committed to the repo, it creates a whole new layer of the graph:
git commit -m 'Add staged file'
$ master 4f407b3] Add staged file
[1 file changed, 1 insertion(+)
create mode 100644 staged
makeDag
$ { '098e6de29daf4e55f83406b49f5768df9bc7d624.t': [ '1b9f426a8407ffee551ad2993c5d7d3780296353.b' ],
'19d9cc8584ac2c7dcf57d2680375e80f099dc481.b': [],
'1a06ce381ac14f7a5baa1670691c2ff8a73aa6da.c': [ '098e6de29daf4e55f83406b49f5768df9bc7d624.t' ],
'1b9f426a8407ffee551ad2993c5d7d3780296353.b': [],
'4f407b396e6ecbb65de6cf192259c18ecd4d1e9b.c':
'7ce38101e91de29ee0fee3aa9940cc81159e0f8d.t',
[ 'da94af3a21ac7e0c875bbbe6162aa1d26d699c73.c' ],
'7ce38101e91de29ee0fee3aa9940cc81159e0f8d.t':
'1b9f426a8407ffee551ad2993c5d7d3780296353.b',
[ '19d9cc8584ac2c7dcf57d2680375e80f099dc481.b' ],
'da94af3a21ac7e0c875bbbe6162aa1d26d699c73.c': [ '098e6de29daf4e55f83406b49f5768df9bc7d624.t' ] }
So we’ve created a new commit (4f407b39
) that is the parent of a different commit (da94af3a
) and a new tree (7ce38101
) that contains our old README
blob (1b9f426a
) and our new, previously orphaned, blob (19d9cc85
).
The Tale of Powerful Software ¶
I’ve always enjoyed the idea that software (and computer science more generally) is nothing but an abstraction to manage complexity. Good software— powerful software—like Git—is a software that manages an incredible amount of complexity and hides it completely from the user.
In recognition of this idea, I’ll leave you with the graph of my local copy of clippy—a small command line tool I created that is like man(1)
except it shows Clippy at the end of the man
output (yes, it’s dumb).
This should give you an idea of the complexity that is abstracted by the Git merkle graph: this repo contains 5 commits!
Having enough knowledge on how to implement that kind of output in programming is really necessary. Thus, it’s a good thing that there are a lot of blogs like this one who shares information and techniques that a person is needed to have in order for them to also do the same output that they have done.
This article was really helpful, thanks! I’m trying to understand the significance of DAGs when it comes to networked storage such as IPFS. I was a bit surprised that git doesn’t store files in blocks to reduce duplication but I suppose it was originally meant for source code rather than binary data.
Cheers, Durand
PS: it would be helpful to have Google/Facebook login for your comments because I had to dig up my yahoo account just to comment here!
On Wed 2016-08-31 12:09:38 AM, Durand D Souza wrote: > it would be helpful to have Google/Facebook login for your comments because I had to dig up my yahoo account just to comment here!
heh, yeah, OpenIDs are, unfortunately, not super easy to come by anymore.
There is also an email-based login system available for ikiwiki that I should look into setting up. I wonder if I can use an external SMTP provider? I really don’t want to maintain an email server
hrm…I’ll look into this.
Comment by thcipriani — Thu 2016-09-01 03:43:31 PM
This should be done now! I’ve enabled email login for comments as well.