{ "version": "https://jsonfeed.org/version/1", "title": "Tyler Cipriani: pages tagged vcs", "home_page_url": "https://tylercipriani.com/tags/vcs/", "feed_url": "https://tylercipriani.com/tags/vcs/index.json", "items": [ { "id": "https://tylercipriani.com/blog/2016/09/07/delete-local-git-commits/", "title": "Delete Local Git Commits", "url": "https://tylercipriani.com/blog/2016/09/07/delete-local-git-commits/", "author": { "name": "Tyler Cipriani" }, "tags": [ "computing", "git", "vcs" ], "date_published": "2016-09-07T17:48:08Z", "date_modified": "2017-07-01T00:49:09Z", "content_html": "
Usually when I need to delete a commit from a local git repo I use:
\n\nThis starts an interactive rebase session for all commits in @{u}..HEAD
. Once I’m in the git rebase editor I put a drop
on the line that has the commit I want to delete and save the file. Magic. Commit gone.
drop c572357 [LOCAL] Change to be removed\n\n# Rebase dbfcb3a..c572357 onto dbfcb3a (1 command(s))\n#\n# Commands:\n# p, pick = use commit\n# r, reword = use commit, but edit the commit message\n# e, edit = use commit, but stop for amending\n# s, squash = use commit, but meld into previous commit\n# f, fixup = like "squash", but discard this commit's log message\n# x, exec = run command (the rest of the line) using shell\n# d, drop = remove commit\n#\n# These lines can be re-ordered; they are executed from top to bottom.\n#\n# If you remove a line here THAT COMMIT WILL BE LOST.\n#\n# However, if you remove everything, the rebase will be aborted.\n#\n# Note that empty commits are commented out
This is fine if you’re there to drive an editor for an interactive rebase.
\nThe problem with interactive rebase is you can’t easily script it.
\nAt work I help to maintain our “beta cluster” which is a bunch of virtual machines that is supposed to mirror the setup of production. The extent to which production is actually mirrored is debatable; however, it is true that we use the same puppet repository as production. Further, this puppet repository resides on a virtual machine within the cluster that acts as the beta cluster’s puppetmaster.
\nOften, to test a patch for production puppet, folks will apply patches on top of beta’s local copy of the production puppet repo.
\nroot@puppetmaster# cd /var/lib/git/operations/puppet\nroot@puppetmaster# git apply --check --3way /home/thcipriani/0001-some.patch\nroot@puppetmaster# git am --3way /home/thcipriani/0001-some.patch\nroot@puppetmaster# git status\nOn branch production\nYour branch is ahead of 'origin/production' by 6 commits.\n (use "git push" to publish your local commits)\n\nnothing to commit, working directory clean
The problem here is that a lot of these commits are old and busted.
\nroot@puppetmaster# git log -6 --format='%h %ai'\nad4f419 2016-08-25 20:46:23 +0530\nd3e9ab8 2016-09-05 15:49:48 +0100\n59b7377 2016-08-30 16:21:39 -0700\n5d2609e 2016-08-17 14:17:02 +0100\n194ee1c 2016-07-19 13:19:54 -0600\n057c9f7 2016-01-07 17:11:12 -0800
It’d be nice to have a script that periodically cleans up these patches based on some criteria. This script would need to be able to remove, say, 59b7377
while leaving all the others intact, and it would have to do so without any access to git rebase -i
or any user input.
The way to do this is with git rebase --onto
.
Selective quoting of man git-rebase
SYNOPSIS\n git rebase [--onto <newbase>] [<upstream>]\n\nDESCRIPTION\n All changes made by commits in the current branch but that are not\n in <upstream> are saved to a temporary area. This is the same set\n of commits that would be shown by git log <upstream>..HEAD\n\n The current branch is reset to <upstream>, or <newbase> if the\n --onto option was supplied.\n\n The commits that were previously saved into the temporary area are\n then reapplied to the current branch, one by one, in order.
The commits we want are all the commits surrounding 59b7377
without 59b7377
.
root@puppetmaster# git log --format='%h %ai' 59b7377..HEAD\nad4f419 2016-08-25 20:46:23 +0530\nd3e9ab8 2016-09-05 15:49:48 +0100\nroot@puppetmaster# git log --format='%h %ai' @{u}..59b7377^\n5d2609e 2016-08-17 14:17:02 +0100\n194ee1c 2016-07-19 13:19:54 -0600\n057c9f7 2016-01-07 17:11:12 -0800
So we really want to take the commits 59b7377..HEAD
and rebase them onto the commit before 59b7377
(which is 59b7377^
)
root@puppetmaster# git rebase --onto 59b7377^ 59b7377\nroot@puppetmaster# git log -6 --format='%h %ai'\nad4f419 2016-08-25 20:46:23 +0530\nd3e9ab8 2016-09-05 15:49:48 +0100\n5d2609e 2016-08-17 14:17:02 +0100\n194ee1c 2016-07-19 13:19:54 -0600\n057c9f7 2016-01-07 17:11:12 -0800\nd5e1340 2016-09-07 23:45:25 +0200
Tada! Commit gone.
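Here is a rough sketch of how the periodic cleanup script might look in Node. The "criteria" (drop anything authored before a cutoff date) is my assumption, and the function names `parseLog`, `staleCommits`, `dropArgs`, and `dropCommit` are made up for illustration. It reads the same `git log --format='%h %ai'` output shown above and builds the same `git rebase --onto` invocation.

```javascript
'use strict';
const { execFileSync } = require('child_process');

// Parse lines of `git log --format='%h %ai'` into { sha, date } records.
function parseLog(output) {
  return output.trim().split('\n').filter(Boolean).map(function (line) {
    const parts = line.trim().split(/\s+/);
    const sha = parts[0], day = parts[1], time = parts[2], tz = parts[3];
    // '2016-08-30 16:21:39 -0700' -> '2016-08-30T16:21:39-07:00'
    const iso = day + 'T' + time + tz.slice(0, 3) + ':' + tz.slice(3);
    return { sha: sha, date: new Date(iso) };
  });
}

// Assumed criteria: a commit is stale if it was authored before `cutoff`.
// The result keeps git log's order (newest first).
function staleCommits(output, cutoff) {
  return parseLog(output)
    .filter(function (c) { return c.date < cutoff; })
    .map(function (c) { return c.sha; });
}

// Arguments for the non-interactive equivalent of `drop <sha>`.
function dropArgs(sha) {
  return ['rebase', '--onto', sha + '^', sha];
}

// Runs the rebase in `repoDir`. Defined but deliberately not called here.
function dropCommit(repoDir, sha) {
  return execFileSync('git', dropArgs(sha), { cwd: repoDir, encoding: 'utf8' });
}

console.log(dropArgs('59b7377').join(' ')); // rebase --onto 59b7377^ 59b7377
```

If you drop more than one commit, work from newest to oldest: every rebase rewrites the hashes of the commits that sit on top of the dropped one, so hashes collected up front are only still valid for commits older than the one you just removed.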
\nI like to post my photos on the internet. I used to post all of my photos on Flickr, but that site has been getting worse and worse and worse. More recently, I’ve been using a static photo gallery generator I hacked together that I (perhaps unfortunately) named hiraeth.
Hiraeth is, to put it mildly, missing some features. There are a few reasons that I opted to create hiraeth rather than use something that was already built:
\n~/Pictures
directory is organized (I can find stuff) and I don’t want to mess all that up to generate a crappy website out of my photos.
~/Pictures
, which creates…unique challenges :)
Hiraeth is invoked like: publish [edited-photo-dir] [output-dir]
. Hiraeth looks for a file named _metadata.yaml
inside the directory of edited photos and uses that to map photo files to photo descriptions and add titles and whatnot to the page. It makes a few different sized thumbnails of each photo, grabs the exif info, and generates some html.
Hiraeth was designed to look and behave like a static version of Flickr circa 2007. There are still features to add, but there is a base that works in place at least.
\nI manage my ~/Pictures
directory using git-annex (which I’ve wanted to write something about for a long time). This is mostly amazing and great. Git-annex has a lot of cool features. For instance, in git-annex once you’ve copied files to a remote, it will allow you to “drop
” a file locally to save space. You can still get the file back from the remote any time you rootin’ tootin’ feel like, so nbd. Occasionally, when I’m running out of space on one machine or another, I’ll drop
a bunch of photos.
The ability to drop a bunch of photos means that hiraeth needs to be able to get photo metadata from a picture without having the file actually be on disk.
\nWe use gerrit at work and I genuinely like it.
\n<rant>
The web-UI is one of the worst interfaces I’ve ever used. The web interface is an unfortunate mix of late-90s, designed-by-engineers, impossibly-option-filled interface mashed together in an unholy union with a fancy-schmancy new-fangled javascripty single-page application. It’s basically a mix of two interface paradigms I hate, yet rarely see in concert: back-button breakage + no design aesthetic whatsoever. </rant>
HOWEVER, the workflow gerrit enforces, the git features it uses, and the beautiful repository history that results make gerrit a really nice code review system.
\nGerrit is the first system I’ve seen use git-notes.
\nGerrit has a cool feature where it keeps all of the patch review in git-notes:
\ntyler@taskmaster:mediawiki-core$ git fetch origin refs/notes/*:refs/notes/*\nremote: Counting objects: 176401, done\nremote: Finding sources: 100% (147886/147886)\nremote: Getting sizes: 100% (1723/1723)\nremote: Compressing objects: 100% (116810/116810)\nremote: Total 147886 (delta 120436), reused 147854 (delta 120434)\nReceiving objects: 100% (147886/147886), 14.91 MiB | 3.01 MiB/s, done.\nResolving deltas: 100% (120449/120449), done.\nFrom ssh://gerrit.wikimedia.org:29418/mediawiki/core\n * [new ref] refs/notes/commits -> refs/notes/commits\n * [new ref] refs/notes/review -> refs/notes/review\ntyler@taskmaster:mediawiki-core$ ls -l .git/refs/notes\ntotal 8\n-rw-r--r-- 1 tyler tyler 41 Aug 28 16:44 commits\n-rw-r--r-- 1 tyler tyler 41 Aug 28 16:44 review\ntyler@taskmaster:mediawiki-core$ git log --show-notes=review --author='Tyler Cipriani'\n commit ab131d4be475bf87b0f0a86fa356a2b1a188a673\n Author: Tyler Cipriani <tcipriani@wikimedia.org>\n Date: Tue Mar 22 09:08:52 2016 -0700\n \n Revert "Add link to anon's user page; remove "Not logged in""\n \n This reverts change I049d0671a7050.\n \n This change was reverted in the wmf/1.27.0-wmf.17. Since there is no\n clear consensus, revert in master before branching wmf/1.27.0-wmf.18.\n Bug: T121793\n Change-Id: I2dc0f2562c908d4e419d34e80a64065843778f3d\n \n Notes (review):\n Verified+2: jenkins-bot\n Code-Review+2: Legoktm <legoktm.wikipedia@gmail.com>\n Submitted-by: jenkins-bot\n Submitted-at: Tue, 22 Mar 2016 18:08:27 +0000\n Reviewed-on: https://gerrit.wikimedia.org/r/278923\n Project: mediawiki/core\n Branch: refs/heads/master
This is super cool. You can have, effectively, an offline backup of lots of information you’d usually have to brave the gerrit web-ui to find. Plus, you don’t have to have this information in your local repo taking up space, it’s only there if you fetch it down.
\nThere is another project from Google that uses git-notes for review called git-appraise.
\nThis is the stated use of git-notes in the docs: store extra information about a commit, without changing the SHA1 of the commit by modifying its contents.
\nIt is, however, noteworthy that you can store a note that points to any object in your repository and not just commit objects.
\nAfter some minor testing it seems that I can store all the EXIF info I need about my images in git-notes without actually having those images on disk; i.e., I can have git-annex drop the actual files and just have broken symlinks that point to where the files live in annex.
\nI wrote a small bash script to play with some of these ideas.
\ntyler@taskmaster:Pictures$ git photo show fish.jpg\n+ git notes --ref=pictures show d4a9c57715ce63a228577900d1abc027\nerror: No note found for object d4a9c57715ce63a228577900d1abc0273396e8ef.\ntyler@taskmaster:Pictures$ git photo add fish.jpg\n+ git notes --ref=pictures add -m 'FileName: fish.jpg\nFileTypeExtension: jpg\nMake: NIKON CORPORATION\nModel: NIKON D610\nLensID: AF-S Zoom-Nikkor 24-70mm f/2.8G ED\nFocalLength: 62.0 mm\nFNumber: 2.8\nISO: 3200' d4a9c57715ce63a228577900d1abc027\n+ set +x\ntyler@taskmaster:Pictures$ git photo show fish.jpg\n+ git notes --ref=pictures show d4a9c57715ce63a228577900d1abc027\nFileName: fish.jpg\nFileTypeExtension: jpg\nMake: NIKON CORPORATION\nModel: NIKON D610\nLensID: AF-S Zoom-Nikkor 24-70mm f/2.8G ED\nFocalLength: 62.0 mm\nFNumber: 2.8\nISO: 3200\n+ set +x
Now it seems like it should be possible to git push origin refs/notes/pictures
, fetch them on the other side, and modify hiraeth to read EXIF from notes when the symlink target doesn’t exist.
We’ll see how any of that goes in practice :
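A rough idea of what the hiraeth side might look like, sketched in Node. I have not settled how hiraeth will actually do this, so `parseNote` and `isDropped` are hypothetical names; the only things taken from above are the `Key: Value` layout of the note and the fact that a dropped annex file leaves a broken symlink behind.

```javascript
'use strict';
const fs = require('fs');

// Turn the 'Key: Value' lines of a note into a plain object.
function parseNote(text) {
  const info = {};
  text.split('\n').forEach(function (line) {
    const at = line.indexOf(': ');
    if (at < 0) { return; } // skip blank or malformed lines
    info[line.slice(0, at)] = line.slice(at + 2);
  });
  return info;
}

// fs.existsSync follows symlinks, so a broken annex symlink reports false.
// (It also reports false for a path that simply does not exist.)
function isDropped(file) {
  return !fs.existsSync(file);
}

const note = 'FileName: fish.jpg\nLensID: AF-S Zoom-Nikkor 24-70mm f/2.8G ED\nISO: 3200\n';
console.log(parseNote(note).LensID); // AF-S Zoom-Nikkor 24-70mm f/2.8G ED
```

The note text itself would still come from `git notes --ref=pictures show <object>`, as in the transcript above; this only covers deciding when to fall back and how to read the result.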
\n
The Directed Acyclic Graph (DAG) is a concept I run into over and over again, which is, perhaps, somewhat ironic.
\nA DAG is a representation of vertices (nodes) that are connected by directional edges (arcs—i.e., lines) in such a way that there are no cycles (e.g., starting at Node A
, you shouldn’t be able to return to Node A
).
DAGs have lots of uses in computer science problems and in discrete mathematics. You’ll find DAGs in build-systems, network problems, and, importantly (for this blog post, if not generally) in Git.
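The "no cycles" rule is easy to check in code. Below is a minimal sketch, assuming the graph is written as an object that maps each node to the list of nodes it depends on; the function name `isAcyclic` is mine. It is a plain depth-first search that fails as soon as it walks back into a node that is still on the current path.

```javascript
'use strict';

// `dag` maps each node name to the list of nodes it depends on.
function isAcyclic(dag) {
  const state = new Map(); // missing = unvisited, 1 = on the current path, 2 = finished

  function visit(node) {
    if (state.get(node) === 2) { return true; }
    if (state.get(node) === 1) { return false; } // back on the current path: a cycle
    state.set(node, 1);
    const deps = dag[node] || [];
    for (let i = 0; i < deps.length; i++) {
      if (!visit(deps[i])) { return false; }
    }
    state.set(node, 2);
    return true;
  }

  return Object.keys(dag).every(function (node) { return visit(node); });
}

console.log(isAcyclic({ 'Node A': [], 'Node B': ['Node A'] }));         // true
console.log(isAcyclic({ 'Node A': ['Node B'], 'Node B': ['Node A'] })); // false
```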
\nOne way to think of a DAG is as a set of dependencies—each node may have a dependency on one or more other nodes. That is, in order to get to Node B
you must route through Node A
, so Node B
depends on Node A
:
The visualization of dependencies in a JSON object is (SURPRISE!) different from the input format needed to visualize a DAG using the D3.js Force layout. To change the above object into Force’s expected input, I created a little helper function:
\nvar forceFormat = function(dag) {\n var orderedNodes = [],\n nodes = [],\n links = [],\n usesPack = false;\n\n // Basically a dumb Object.keys\n for (var node in dag) {\n if ( !dag.hasOwnProperty( node ) ) continue;\n orderedNodes.push(node);\n }\n\n orderedNodes.forEach(function(node) {\n var sources = dag[node];\n\n if (!sources) return;\n\n sources.forEach(function(sourceName) {\n // Force layout links refer to nodes by their index in the nodes array\n var source = orderedNodes.indexOf(sourceName);\n\n // If the source isn't in the Git DAG, it's in a packfile\n if (source < 0) {\n if (usesPack) return;\n source = orderedNodes.length;\n usesPack = true;\n }\n\n links.push({\n 'source': source,\n 'target': orderedNodes.indexOf(node)\n });\n });\n nodes.push({'name': node});\n });\n\n // Add pack file to end of list\n if (usesPack) nodes.push({'name': 'PACK'});\n\n return { 'nodes': nodes, 'links': links };\n};\n\nvar forceInput = forceFormat(dag);
forceFormat
outputs a JSON object that can be used as input for the Force layout.
{\n "links": [\n {\n "source": "Node A",\n "target": "Node B"\n }\n ],\n "nodes": [\n { "name": "Node A" },\n { "name": "Node B" }\n ]\n}
I can pass this resulting JSON object off to a function that I created after a long time staring at one of mbostock’s many amazing examples to create a D3 Force graph of vertices and edges:
\n// http://bl.ocks.org/mbostock/1138500\nvar makeGraph = function(target, graphData) {\n var target = d3.select(target),\n bounds = target.node().getBoundingClientRect(),\n fill = d3.scale.category20(),\n radius = 25;\n\n var svg = target.append('svg')\n .attr('width', bounds.width)\n .attr('height', bounds.height);\n\n // Arrow marker for end-of-line arrow\n svg.append('defs').append('marker')\n .attr('id', 'arrowhead')\n .attr('refX', 17.5)\n .attr('refY', 2)\n .attr('markerWidth', 8)\n .attr('markerHeight', 4)\n .attr('orient', 'auto')\n .attr('fill', '#ccc')\n .append('path')\n .attr('d', 'M 0,0 V 4 L6,2 Z');\n\n var link = svg.selectAll('line')\n .data(graphData.links)\n .enter()\n .append('line')\n .attr('class', 'link')\n .attr('marker-end', 'url(#arrowhead)');\n\n // Create a group for each node\n var node = svg.selectAll('g')\n .data(graphData.nodes)\n .enter()\n .append('g');\n\n // Color the node based on node's git-type (otherwise, hot pink!)\n node.append('circle')\n .attr('r', radius)\n .attr('class', 'node')\n .attr('fill', function(d) {\n var blue = '#1BA1E2',\n red = 'tomato',\n green = '#5BB75B',\n pink = '#FE57A1';\n\n if (d.name.endsWith('.b')) { return red; }\n if (d.name.endsWith('.t')) { return blue; }\n if (d.name.endsWith('.c')) { return green; }\n return pink;\n });\n\n node.append('text')\n .attr('y', radius * 1.5)\n .attr('text-anchor', 'middle')\n .attr('fill', '#555')\n .text(function(d) {\n if (d.name.length > 10) {\n return d.name.substring(0, 8) + '...';\n }\n\n return d.name;\n });\n\n // If the node has a type: tag it\n node.append('text')\n .attr('text-anchor', 'middle')\n .attr('y', 4)\n .attr('fill', 'white')\n .attr('class', 'bold-text')\n .text(function(d) {\n if (d.name.endsWith('.b')) { return 'BLOB'; }\n if (d.name.endsWith('.t')) { return 'TREE'; }\n if (d.name.endsWith('.c')) { return 'COMMIT'; }\n return '';\n });\n\n var charge = 700 * graphData.nodes.length;\n\n var force = d3.layout.force()\n .size([bounds.width, 
bounds.height])\n .nodes(graphData.nodes)\n .links(graphData.links)\n .linkDistance(150)\n .charge(-(charge))\n .gravity(1)\n .on('tick', tick);\n\n // No fancy animation, tick amount varies based on number of nodes\n force.start();\n for (var i = 0; i < graphData.nodes.length * 100; ++i) force.tick();\n force.stop();\n\n function tick(e) {\n // Push sources up and targets down to form a weak tree.\n var k = -12 * e.alpha;\n\n link\n .each(function(d) { d.source.y -= k, d.target.y += k; })\n .attr('x2', function(d) { return d.source.x; })\n .attr('y2', function(d) { return d.source.y; })\n .attr('x1', function(d) { return d.target.x; })\n .attr('y1', function(d) { return d.target.y; });\n\n node\n .attr('transform', function(d) {\n return 'translate(' + d.x + ',' + d.y + ')';\n });\n }\n};\nmakeGraph('.merkle-1', forceInput);
You’d be forgiven for thinking that is a line.
\nThis directional line is a DAG—albeit a simple one. Node B
depends on Node A
and that is the whole graph. If you want to get to Node B
then you have to start at Node A
. Depending on your problem-space, Node B
could be many things: A place in Königsberg, a target in a Makefile (or a Rakefile), or (brace yourself) a Git object.
In order to understand how Git is a DAG, you need to understand Git “objects”:
\n$ mkdir merkle\n$ cd merkle\n$ echo 'This is the beginning' > README\n$ git init\n$ git add .\n$ git commit -m 'Initial Commit'\n$ find .git/objects/ -type f\n.git/objects/1b/9f426a8407ffee551ad2993c5d7d3780296353\n.git/objects/09/8e6de29daf4e55f83406b49f5768df9bc7d624\n.git/objects/1a/06ce381ac14f7a5baa1670691c2ff8a73aa6da
What are Git objects? Because they look like nonsense:
\n\nAfter a little digging through the Pro Git book, Git objects are a little less nonsensical. Git objects are simply zlib
compressed, formatted messages:
$ python2 -c 'import sys,zlib; \\\n print zlib.decompress(sys.stdin.read());' \\\n < .git/objects/1a/06ce381ac14f7a5baa1670691c2ff8a73aa6da\ncommit 195tree 098e6de29daf4e55f83406b49f5768df9bc7d624\nauthor Tyler Cipriani <tcipriani@wikimedia.org> 1458604120 -0700\ncommitter Tyler Cipriani <tcipriani@wikimedia.org> 1458604120 -0700\n\nInitial Commit
Parts of that message are obvious: author
and committer
obviously come from my .gitconfig
. There is a Unix epoch timestamp with a timezone offset. commit
is the type of object. 195
is the byte-length of the remainder of the message.
There are a few parts of that message that aren’t immediately obvious. What is tree 098e6de29daf4e55f83406b49f5768df9bc7d624
? And why would we store this message in .git/objects/1a/06ce381ac14f7a5baa1670691c2ff8a73aa6da
and not .git/objects/commit-message
? Is a merkle what I think it is? The answer to all of these questions and many more is the same: Cryptographic Hash Functions.
A cryptographic hash function takes an input of any length and produces a fixed-length output. Furthermore (and more importantly), the output should be effectively unique to a given input: it should be infeasible to find two inputs that produce the same output, and any change in the input should cause a big change in the output. Git uses a cryptographic hash function called Secure Hash Algorithm 1 (SHA-1).
\nYou can play with the SHA-1 function on the command line:
\n$ echo 'message' | sha1sum\n1133e3acf0a4cbb9d8b3bfd3f227731b8cd2650b -\n$ echo 'message' | sha1sum\n1133e3acf0a4cbb9d8b3bfd3f227731b8cd2650b -\n$ echo 'message1' | sha1sum\nc133514a60a4641b83b365d3dc7b715dc954e010 -
Note the big change in the output of sha1sum
from a tiny change in input. This is what cryptographic hash functions do.
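The same experiment works from Node, which comes in handy later since the graph scripts are JavaScript anyway. This is just a sketch using Node's built-in `crypto` module; the helper name `sha1` is mine. Remember that `echo` appends a newline, so the strings hashed here include one.

```javascript
'use strict';
const crypto = require('crypto');

// Hex-encoded SHA-1 digest of a string or Buffer.
function sha1(input) {
  return crypto.createHash('sha1').update(input).digest('hex');
}

const a = sha1('message\n');  // what `echo 'message' | sha1sum` hashes
const b = sha1('message1\n');

console.log(a.length, b.length);      // 40 40 (always 160 bits, shown as 40 hex characters)
console.log(a === sha1('message\n')); // true: same input, same digest
console.log(a === b);                 // false: one extra character, different digest
```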
Now that we have some idea of what is inside a commit object, let’s reverse-engineer the commit object from the HEAD
of our merkle
repo:
$ python2 -c 'import sys,zlib; \\\nprint zlib.decompress(sys.stdin.read());' \\\n< .git/objects/1a/06ce381ac14f7a5baa1670691c2ff8a73aa6da | \\\nod -c\n 0000000 c o m m i t 1 9 5 \\0 t r e e \n 0000020 0 9 8 e 6 d e 2 9 d a f 4 e 5 5\n 0000040 f 8 3 4 0 6 b 4 9 f 5 7 6 8 d f\n 0000060 9 b c 7 d 6 2 4 \\n a u t h o r \n 0000100 T y l e r C i p r i a n i <\n 0000120 t c i p r i a n i @ w i k i m e\n 0000140 d i a . o r g > 1 4 5 8 6 0 4\n 0000160 1 2 0 - 0 7 0 0 \\n c o m m i t\n 0000200 t e r T y l e r C i p r i a\n 0000220 n i < t c i p r i a n i @ w i\n 0000240 k i m e d i a . o r g > 1 4 5\n 0000260 8 6 0 4 1 2 0 - 0 7 0 0 \\n \\n I\n 0000300 n i t i a l C o m m i t \\n \\n\n 0000317
$ printf 'tree 098e6de29daf4e55f83406b49f5768df9bc7d624\\n' >> commit-msg\n$ printf 'author Tyler Cipriani <tcipriani@wikimedia.org> 1458604120 -0700\\n' >> commit-msg\n$ printf 'committer Tyler Cipriani <tcipriani@wikimedia.org> 1458604120 -0700\\n' >> commit-msg\n$ printf '\\nInitial Commit\\n' >> commit-msg
$ sha1sum <(cat \\\n <(printf "commit ") \\\n <(wc -c < commit-msg | tr -d '\\n') \\\n <(printf '%b' '\\0') commit-msg)\n1a06ce381ac14f7a5baa1670691c2ff8a73aa6da /dev/fd/63
Hmm… that seems familiar
\n$ export COMMIT_HASH=$(sha1sum <(cat <(printf "commit ") <(wc -c < commit-msg | tr -d '\\n') <(printf '%b' '\\0') commit-msg) | cut -d' ' -f1)\n$ find ".git/objects/${COMMIT_HASH:0:2}" -type f -name "${COMMIT_HASH:(-38)}"\n.git/objects/1a/06ce381ac14f7a5baa1670691c2ff8a73aa6da
The commit object is a zlib-compressed, formatted message that is stored in a file named after the SHA-1 hash of the file’s un-zlib
compressed contents.
(/me wipes brow)
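For the record, the whole recipe fits in a few lines of Node. This is a sketch with helper names I made up (`gitObject`, `objectId`, `objectPath`, `compress`), and it uses a blob instead of a commit because blobs have no author or timestamp to get wrong. The `test content` value is the example from the Pro Git book.

```javascript
'use strict';
const crypto = require('crypto');
const zlib = require('zlib');

// Build the '<type> <byte-length>\0<body>' message described above.
function gitObject(type, body) {
  const data = Buffer.from(body);
  const header = Buffer.from(type + ' ' + data.length + '\0');
  return Buffer.concat([header, data]);
}

// The object id is the SHA-1 of the uncompressed message.
function objectId(type, body) {
  return crypto.createHash('sha1').update(gitObject(type, body)).digest('hex');
}

// Where a loose object with this id lives, relative to the repo root.
function objectPath(id) {
  return '.git/objects/' + id.slice(0, 2) + '/' + id.slice(2);
}

// What lands on disk is the zlib-deflated message.
function compress(type, body) {
  return zlib.deflateSync(gitObject(type, body));
}

const id = objectId('blob', 'test content\n');
console.log(id);             // d670460b4b4aece5915caf5c68d12f560a9fe3e4
console.log(objectPath(id)); // .git/objects/d6/70460b4b4aece5915caf5c68d12f560a9fe3e4
```

The compressed bytes may not match Git's byte for byte (that depends on the zlib settings), but the id never looks at them: it is computed before compression.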
\nLet’s use git-cat-file
to see if we can explore the tree 098e6de29daf4e55f83406b49f5768df9bc7d624
-part of the commit message object:
$ git cat-file -p 1a06ce381ac14f7a5baa1670691c2ff8a73aa6da\ntree 098e6de29daf4e55f83406b49f5768df9bc7d624\nauthor Tyler Cipriani <tcipriani@wikimedia.org> 1458604120 -0700\ncommitter Tyler Cipriani <tcipriani@wikimedia.org> 1458604120 -0700
$ git cat-file -p 098e6de29daf4e55f83406b49f5768df9bc7d624\n100644 blob 1b9f426a8407ffee551ad2993c5d7d3780296353 README
Hey that’s the text I put into README
!
So .git/HEAD
refers to .git/refs/heads/master
, calling git-cat-file
on the object found inside that file shows that it’s the commit object we recreated. The commit object points to 098e6de29daf4e55f83406b49f5768df9bc7d624
, which is a tree object with the contents: 100644 blob 1b9f426a8407ffee551ad2993c5d7d3780296353 README
The blob
object 1b9f426a8407ffee551ad2993c5d7d3780296353
is the contents of README
! So it seems each commit
object points to a tree
object that points to other objects.
Let’s see if we can paste together what Git is doing at a low-level when we make a new commit:
\nREADME
, hash the contents using SHA-1, and store as a blob
object in .git/objects
.
tree
in .git/objects
.
.gitconfig
and the hash of the top-level tree. Hash this information and store it as a commit
object in .git/objects
.
It seems that there may be a chain of dependencies:
\nvar gitDag = {\n // blob (add .b for blob)\n '1b9f426a8407ffee551ad2993c5d7d3780296353.b': [],\n // tree (.t == tree) is a hash that includes the hash from blob\n '098e6de29daf4e55f83406b49f5768df9bc7d624.t': ['1b9f426a8407ffee551ad2993c5d7d3780296353.b'],\n // commit (.c == commit) is a hash that includes the hash from tree\n '1a06ce381ac14f7a5baa1670691c2ff8a73aa6da.c': ['098e6de29daf4e55f83406b49f5768df9bc7d624.t'],\n};\n\nmakeGraph('.merkle-2', forceFormat(gitDag));
You’d be forgiven for thinking that is a line.
\nWhat’s really happening is that there is a commit
object (1a06ce38
) that depends on a tree
object (098e6de2
) that depends on a blob
(1b9f426a
).
Since it’s running each of these objects through a hash function and each of them contains a reference up the chain of dependencies, a minor change to either the blob
or the tree
will create a drastically different commit
object.
Applying a cryptographic hash function on top of a graph was Ralph Merkle’s big idea. This scheme makes magic possible. Transferring verifiable and trusted information through an untrusted medium is toatz for realz possible with Ralph’s little scheme.
\nThe idea is that if you have the root-node hash, that is, the cryptographic hash of the node that depends on all other nodes (the commit
object in Git), and you obtained that root-node hash from a trusted source, you can trust all sub-nodes that stem from that root node if rehashing those sub-nodes, all the way up the graph, reproduces the root-node hash!
This is the mechanism by which things like Git, IPFS, Bitcoin, and BitTorrent are made possible: changing any one node in the graph changes all nodes that depend on that node all the way to the root-node (the commit
in Git).
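A toy version makes the chaining concrete. To be clear, this is not Git's real object format (real trees store hashes as raw bytes and commits carry author data); it is a sketch that keeps only the part that matters here, each level hashing the hash of the level below it. The names `build` and `verify` are mine.

```javascript
'use strict';
const crypto = require('crypto');

function sha1(text) {
  return crypto.createHash('sha1').update(text).digest('hex');
}

// Toy three-level graph: the commit names the tree's hash,
// and the tree names the blob's hash.
function build(fileContents) {
  const blob = sha1('blob ' + fileContents);
  const tree = sha1('tree README ' + blob);
  const commit = sha1('commit tree ' + tree);
  return { blob: blob, tree: tree, commit: commit };
}

// Trust the contents only if rehashing them reproduces the trusted root hash.
function verify(trustedRoot, fileContents) {
  return build(fileContents).commit === trustedRoot;
}

const root = build('This is the beginning\n').commit;
console.log(verify(root, 'This is the beginning\n'));  // true
console.log(verify(root, 'This is the beginning!\n')); // false: one byte changed in the leaf
```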
I wrote a simple NodeJS script that creates a graph that is suitable for input into the JavaScript that I’ve already written that will create a D3.js force graph with whatever it finds in .git/objects
.
#!/usr/bin/env nodejs\n/* makeDag - creates a JSON dependency graph from .git/objects */\n\nvar glob = require('glob'),\n fs = require('fs'),\n zlib = require('zlib');\n\nvar types = ['tree', 'commit', 'blob'],\n treeRegex = {\n // 100644 README\\0[20 byte sha1]\n regex: /[0-9]+\\s[^\\0]+\\0((.|\\n){20})/gm,\n fn: function(sha) {\n var buf = new Buffer(sha[1], 'binary');\n return buf.toString('hex') + '.b';\n }\n },\n commitRegex = {\n // tree 098e6de29daf4e55f83406b49f5768df9bc7d624\n regex: /(tree|parent)\\s([a-f0-9]{40})/gm,\n fn: function(sha) {\n if (sha[1] === 'tree') {\n return sha[2] + '.t';\n }\n return sha[2] + '.c';\n }\n },\n total = 0,\n final = {};\n\n// determine file type, parse out SHA1s\nvar handleObjects = function(objData, name) {\n types.forEach(function(type) {\n var re, regex, match, key;\n\n if (!objData.startsWith(type)) { return; }\n\n key = name + '.' + type[0];\n final[key] = [];\n if (type === 'tree') { objType = treeRegex; }\n if (type === 'commit') { objType = commitRegex; }\n if (type === 'blob') { return; }\n\n // Remove the object-type and size from file\n objData = objData.split('\\0');\n objData.shift();\n objData = objData.join('\\0');\n\n // Recursive regex match remainder\n while ((match = objType.regex.exec(objData)) !== null) {\n final[key].push(objType.fn(match));\n }\n });\n\n // Don't output until you've got it all\n if (Object.keys(final).length !== total) {\n return;\n }\n\n // Output what ya got.\n console.log(final);\n};\n\n// Readable object names not file names\nvar getName = function(file) {\n var fileParts = file.split('/'),\n len = fileParts.length;\n return fileParts[len - 2] + fileParts[len - 1];\n}\n\n// Inflate the deflated git object file\nvar handleFile = function(file, out) {\n var name = getName(file);\n\n fs.readFile(file, function(e, data) {\n zlib.inflate(data, function(e, data) {\n if (e) { console.log(file, e); return; }\n handleObjects(data.toString('binary'), name);\n });\n });\n};\n\n// Sort 
through the gitobjects directory\nvar handleFiles = function(files) {\n files.forEach(function(file) {\n fs.stat(file, function(e, f) {\n if (e) { return; }\n if (f.isFile()) {\n // Don't worry about pack files for now\n if (file.indexOf('pack') > -1) { return; }\n total++;\n handleFile(file);\n }\n });\n\n });\n};\n\n(function() {\n glob('.git/objects/**/*', function(e, f) {\n if (e) { throw e; }\n handleFiles(f);\n });\n})();
Merkle graph transformations are often difficult to describe, but easy to see. Using this last piece of code to create and view graphs for several repositories has been illuminating. The graph visualization both illuminates and challenges my understanding of Git in ways I didn’t anticipate.
\nWhen you change your commit message, what happens to the graph? What depends on a commit? Where is the context for a commit?
\n$ git commit --amend -m 'This is the commit message now'\n[master 585448a] This is the commit message now\n Date: Mon Mar 21 16:48:40 2016 -0700\n 1 file changed, 1 insertion(+)\n create mode 100644 README\n$ find .git/objects -type f\n.git/objects/1b/9f426a8407ffee551ad2993c5d7d3780296353\n.git/objects/09/8e6de29daf4e55f83406b49f5768df9bc7d624\n.git/objects/1a/06ce381ac14f7a5baa1670691c2ff8a73aa6da\n.git/objects/da/94af3a21ac7e0c875bbbe6162aa1d26d699c73
Now the DAG is a bit different:
\nvar gitDag = { '098e6de29daf4e55f83406b49f5768df9bc7d624.t': [ '1b9f426a8407ffee551ad2993c5d7d3780296353.b' ],\n '1a06ce381ac14f7a5baa1670691c2ff8a73aa6da.c': [ '098e6de29daf4e55f83406b49f5768df9bc7d624.t' ],\n '1b9f426a8407ffee551ad2993c5d7d3780296353.b': [],\n 'da94af3a21ac7e0c875bbbe6162aa1d26d699c73.c': [ '098e6de29daf4e55f83406b49f5768df9bc7d624.t' ] }\n\nmakeGraph('.merkle-3', forceFormat(gitDag));
Here we see that there are now two commit
objects (1a06ce38
and da94af3a
) that both depend on a single tree
object (098e6de2
) that depends on a single blob
(1b9f426a
).
One of these commit objects will never be seen with git log
.
TIL: Git creates blob
objects as soon as a file is added to the staging area.
$ echo 'staged' > staged\n$ find .git/objects -type f\n.git/objects/1b/9f426a8407ffee551ad2993c5d7d3780296353\n.git/objects/09/8e6de29daf4e55f83406b49f5768df9bc7d624\n.git/objects/1a/06ce381ac14f7a5baa1670691c2ff8a73aa6da\n.git/objects/da/94af3a21ac7e0c875bbbe6162aa1d26d699c73
Notice that nothing depends on this object just yet. It’s a lonely orphan blob
.
$ git add staged\n$ find .git/objects -type f\n.git/objects/1b/9f426a8407ffee551ad2993c5d7d3780296353\n.git/objects/09/8e6de29daf4e55f83406b49f5768df9bc7d624\n.git/objects/1a/06ce381ac14f7a5baa1670691c2ff8a73aa6da\n.git/objects/da/94af3a21ac7e0c875bbbe6162aa1d26d699c73\n.git/objects/19/d9cc8584ac2c7dcf57d2680375e80f099dc481\n$ makeDag\n{ '098e6de29daf4e55f83406b49f5768df9bc7d624.t': [ '1b9f426a8407ffee551ad2993c5d7d3780296353.b' ],\n '19d9cc8584ac2c7dcf57d2680375e80f099dc481.b': [],\n '1a06ce381ac14f7a5baa1670691c2ff8a73aa6da.c': [ '098e6de29daf4e55f83406b49f5768df9bc7d624.t' ],\n 'da94af3a21ac7e0c875bbbe6162aa1d26d699c73.c': [ '098e6de29daf4e55f83406b49f5768df9bc7d624.t' ],\n '1b9f426a8407ffee551ad2993c5d7d3780296353.b': [] }
Even unstaging and deleting the file doesn’t remove the object. Orphan objects in git are only garbage collected as part of git gc --prune
.
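Finding the orphans in one of these graphs is a small reachability walk. The sketch below assumes the same object-to-dependencies format that makeDag prints (with hashes shortened for readability) and treats only the branch tip as a root; real Git also counts the index and the reflogs, which is part of why these objects are not collected right away. The function name `unreachable` is mine.

```javascript
'use strict';

// `dag` maps each object to the objects it depends on;
// `roots` are the objects that refs point at.
function unreachable(dag, roots) {
  const seen = new Set();
  const stack = roots.slice();
  while (stack.length > 0) {
    const node = stack.pop();
    if (seen.has(node)) { continue; }
    seen.add(node);
    (dag[node] || []).forEach(function (dep) { stack.push(dep); });
  }
  return Object.keys(dag)
    .filter(function (node) { return !seen.has(node); })
    .sort();
}

// The graph after `git add staged`; master points at da94af3a.
const dag = {
  '098e6de2.t': ['1b9f426a.b'],
  '19d9cc85.b': [],
  '1a06ce38.c': ['098e6de2.t'],
  '1b9f426a.b': [],
  'da94af3a.c': ['098e6de2.t']
};

console.log(unreachable(dag, ['da94af3a.c'])); // [ '19d9cc85.b', '1a06ce38.c' ]
```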
When this object is committed to the repo, it creates a whole new layer of the graph:
\n$ git commit -m 'Add staged file'\n[master 4f407b3] Add staged file\n 1 file changed, 1 insertion(+)\n create mode 100644 staged\n$ makeDag\n{ '098e6de29daf4e55f83406b49f5768df9bc7d624.t': [ '1b9f426a8407ffee551ad2993c5d7d3780296353.b' ],\n '19d9cc8584ac2c7dcf57d2680375e80f099dc481.b': [],\n '1a06ce381ac14f7a5baa1670691c2ff8a73aa6da.c': [ '098e6de29daf4e55f83406b49f5768df9bc7d624.t' ],\n '1b9f426a8407ffee551ad2993c5d7d3780296353.b': [],\n '4f407b396e6ecbb65de6cf192259c18ecd4d1e9b.c': \n [ '7ce38101e91de29ee0fee3aa9940cc81159e0f8d.t',\n 'da94af3a21ac7e0c875bbbe6162aa1d26d699c73.c' ],\n '7ce38101e91de29ee0fee3aa9940cc81159e0f8d.t': \n [ '1b9f426a8407ffee551ad2993c5d7d3780296353.b',\n '19d9cc8584ac2c7dcf57d2680375e80f099dc481.b' ],\n 'da94af3a21ac7e0c875bbbe6162aa1d26d699c73.c': [ '098e6de29daf4e55f83406b49f5768df9bc7d624.t' ] }
So we’ve created a new commit (4f407b39
) whose parent is a different commit (da94af3a
) and a new tree (7ce38101
) that contains our old README
blob (1b9f426a
) and our new, previously orphaned, blob (19d9cc85
).
I’ve always enjoyed the idea that software (and computer science more generally) is nothing but an abstraction to manage complexity. Good software, powerful software like Git, is software that manages an incredible amount of complexity and hides it completely from the user.
\nIn recognition of this idea, I’ll leave you with the graph of my local copy of clippy—a small command line tool I created that is like man(1)
except it shows Clippy at the end of the man
output (yes, it’s dumb).
This should give you an idea of the complexity that is abstracted by the Git merkle graph: this repo contains 5 commits!
\nBetter git it in your soul*.
\nGit has a lot of great tutorials for getting started. There are also a number of great articles on how to use git and github for your workflow.
\nWhat I haven’t seen is an article on how to integrate git with your current site without storing any code on github. I’m writing this blog to create a quick reference for how to get up and running using git on your existing site.
\nI’m making the assumption that you have the following:
I’m using Ubuntu 12.04 locally, but I’d assume that most of this won’t be too different on a different distribution or on a Mac—but I’m probably totally wrong about that ☺
\nRSync your site to your local development environment
\nIn order to begin to develop locally (and break the old cowboy-coding habits that you’ve undoubtedly developed over the years) you need a local copy of your site.
\ncd
to the directory in which you will be storing these files (i.e. cd /srv/www/tylercipriani.com/public_html
)
htdocs
or public_html
from your webserver into this local directory:
The command breaks down like this:
\na
means “Archive”: keeps permissions, mtimes, etc. the same
v
means “Verbose”: increases verbosity of the command
e
means “RSH”: allows you to use remote shell (same as RSH=command)
:/path/to/htdocs/
is the path to your htdocs folder. The trailing /
is significant: it means copy the content of the htdocs directory rather than the directory by name
.
is the current directory
Initialize git in local development environment.
\nThis step will create a new git repository on your local machine and add all the code that you’ve rsynced in the previous step to that repo.
\ncd
to the directory to which you previously rsynced your site and initialize a git repository by running git init
git add .
Commit all your newly added files to the repo by running your first commit: git commit -m 'First Commit'
Setup a bare repo on your web server
\nYou need a bare repo out on your webserver that will act as a mirror to your local development environment.
\nssh into your webserver and make a new directory; I usually make it above the webroot (i.e. htdocs
)
Once inside the new directory initialize a bare repository by using the --bare
flag:
Now we can define a new post-receive hook that will be triggered whenever an update is pushed to this new bare repository. The post-receive hook can be any type of script you want; the script below is written in bash. cd
into the hooks
directory (in a bare repository it sits at the top level rather than under .git) and create a file called “post-receive”. Copy the code below into the file:
Make sure that this code is executable by running chmod +x hooks/post-receive
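Since the hook can be any type of script, here is a rough sketch of one written for Node rather than bash. It is an illustration, not the script from this post: the `WEBROOT` path and the names `checkoutCommand` and `deploy` are assumptions. The idea is the common one of pointing `GIT_WORK_TREE` at the webroot and forcing a checkout, so the bare repo holds the history while the live files land in the site directory.

```javascript
'use strict';
const { execFileSync } = require('child_process');

// Hypothetical path: wherever the live site is served from.
const WEBROOT = '/home/user/htdocs';

// Describe the git invocation that checks the pushed code out into the webroot.
function checkoutCommand(webroot, env) {
  return {
    file: 'git',
    args: ['checkout', '-f'],
    options: {
      env: Object.assign({}, env, { GIT_WORK_TREE: webroot }),
      stdio: 'inherit'
    }
  };
}

// Run the checkout. Hooks in a bare repo run with the repo as the working directory.
function deploy() {
  const cmd = checkoutCommand(WEBROOT, process.env);
  execFileSync(cmd.file, cmd.args, cmd.options);
}

// Uncomment when this file is installed as the post-receive hook:
// deploy();
```

To use it as the hook, the installed file would also need `#!/usr/bin/env node` as its first line, along with the executable bit.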
Push to your new repo, you beautiful command-line ninja, you!
\nBack on your local machine, in the webroot of your local development environment, add your bare webserver repo as your remote
and push your git repo up to your server. The post-receive hook will take care of the rest!
$ git remote add web ssh://user@tylercipriani.com/home/user/tylercipriani.com.git\n$ git push -u web master
By using the -u
flag you’re setting the upstream, which means you can just run git pull
without further arguments to fetch from web and merge its master branch.