I’ve often pondered this great and terrible beast –
rsync. Its spiny, nearly impenetrable command-line
interface. Its majestic and wonderful efficiency. The depths of its man
page, and the heights of its use-cases.
Leaving aside the confusing implications of trailing slashes,
rsync is amazing. The Wikimedia deployment tooling – scap (which at
this point has been iterated on for over a decade) – still makes heavy
use of rsync. At $DAYJOB - 3,
rsync is used to manage a library of hundreds of thousands
of flac, mp3, and ogg files. It’s
hard to argue with rsync: the small amount of network traffic
it generates is really hard to beat with any other
program.
But what’s it doing?
rsync is fast. rsync is ubiquitous.
rsync uses few client resources and little network IO.
OK…Why?
I started reading about the rsync algorithm when a fellow I work alongside began
espousing the relative superiority of zsync for the case of our
deployment server. Currently scap has a built-in (and quite
fancy) fan-out system so as not to put too high a load on a single
server; however, zsync flips the rsync
algorithm on its head, running it on the
client rather than the server. What exactly is rsync doing
that makes the load on the server so high?
The Meat
For the purposes of explanation, let’s say you ran the command:
rsync α β.
The rsync algorithm boils down to 5 steps –
- Split file β into chunks of length n.
- Calculate a weak (adler32) and a strong (md4 in the original tech report; this post’s toy code uses md5) checksum for each chunk of file β.
- Send those checksums to the rsync server (where file α is).
- Find all the chunks of length n in α that are already in β by comparing checksums.
- Create a list of instructions to recreate α from β.
Easy.
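To make step 5 concrete, here’s a hypothetical miniature example
with a block size of 4 bytes (rather than the 4096 the code below
uses) and made-up file contents:

# Made-up contents; n = 4
#   β = b'aaaabbbbcccc'   chunks -> 0: b'aaaa', 1: b'bbbb', 2: b'cccc'
#   α = b'aaaaXbbbbcccc'  (one new byte after the first chunk)
# The instruction list to recreate α from β would be:
instructions = [0, b'X', 1, 2]
# Integers reference chunks the receiver already has in β; raw bytes
# (here just b'X') are the only literal data that crosses the network.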
Do it then
I actually think it would have been easier for me to understand a bad
python implementation of the rsync algorithm than to read
a tech report on rsync. So with that in mind, here’s a bad
python implementation of the rsync algorithm.
Pythonic!
First it might be helpful to define my block size and create a couple of helper functions to compute the weak and strong checksums.
import collections
import hashlib
import zlib
BLOCK_SIZE = 4096
# Helper functions
# ----------------
def md5_chunk(chunk):
    """
    Returns md5 checksum for chunk
    """
    m = hashlib.md5()
    m.update(chunk)
    return m.hexdigest()
def adler32_chunk(chunk):
    """
    Returns adler32 checksum for chunk
    """
    return zlib.adler32(chunk)

I’ll also need a function that computes the per-block checksums of a file.
The checksums_file function will read
BLOCK_SIZE bytes at a time through to the end of the file, calculate
both the adler32 checksum and the md5 checksum
for each chunk, and then put those checksums in a data structure.
I’d like a nice interface beyond primitives for both the signatures
and the list of checksums – I’ll create two objects, Signature
and Chunks, to make that interface. Chunks is
basically a list of Signatures with a few other methods for
fanciness.
# Checksum objects
# ----------------
Signature = collections.namedtuple('Signature', 'md5 adler32')
class Chunks(object):
    """
    Data structure that holds rolling checksums for file B
    """
    def __init__(self):
        self.chunks = []
        self.chunk_sigs = {}

    def append(self, sig):
        self.chunks.append(sig)
        self.chunk_sigs.setdefault(sig.adler32, {})
        self.chunk_sigs[sig.adler32][sig.md5] = len(self.chunks) - 1

    def get_chunk(self, chunk):
        # Cheap adler32 lookup first; only compute md5 on a weak hit
        md5s = self.chunk_sigs.get(adler32_chunk(chunk))
        if md5s:
            return md5s.get(md5_chunk(chunk))
        return None

    def __getitem__(self, idx):
        return self.chunks[idx]

    def __len__(self):
        return len(self.chunks)
# Build Chunks from a file
# ------------------------
def checksums_file(fn):
    """
    Returns object with checksums of file
    """
    chunks = Chunks()
    # Binary mode: checksums are computed over raw bytes
    with open(fn, 'rb') as f:
        while True:
            chunk = f.read(BLOCK_SIZE)
            if not chunk:
                break
            chunks.append(
                Signature(
                    adler32=adler32_chunk(chunk),
                    md5=md5_chunk(chunk)
                )
            )
    return chunks

Now I need a couple of methods to complete the algorithm – one that
will find the BLOCK_SIZE chunks of file α that are already in file β,
and one that will produce instructions that can be used to assemble
the new and improved β from the β we’ve already got.
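Before writing those, a quick sanity check of the pieces so far – the
fixture path here is made up for illustration:

# Hypothetical usage: 'fixtures/bar.txt' stands in for file β
chunks = checksums_file('fixtures/bar.txt')
print(len(chunks))  # number of BLOCK_SIZE chunks in β

# Looking up raw bytes returns that chunk's index in β, or None
with open('fixtures/bar.txt', 'rb') as f:
    first = f.read(BLOCK_SIZE)
print(chunks.get_chunk(first))                       # -> 0
print(chunks.get_chunk(b'not a chunk of the file'))  # -> None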
The _get_block_list function will return a list of chunk
indices and raw bytes. The chunk indices are indices of chunks already
present in file β (we know this from the checksums_file
function); the raw bytes are data in α that may not be in β. If
a chunk is found in α that is not in β, then the first byte of that
chunk is appended to the output list, and checksums are calculated for
the BLOCK_SIZE window starting one byte later.
This is why network IO for rsync is so efficient – the
only raw data that is sent is the information missing from the remote.
This is also why rsync causes higher load on the server
than the client – it’s not just checksumming files, it’s checksumming,
comparing, and building a diff. And it’s doing that process for every
machine to which it is attempting to sync.
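One more aside before the code: this toy implementation recomputes
adler32 from scratch every time the window slides by a byte, which is
a big part of why it’s “bad.” The real algorithm’s weak checksum is
chosen so it can be rolled: the sums for bytes i+1 through i+n come in
constant time from the sums for bytes i through i+n-1. Here’s a
minimal sketch of that idea following the tech report’s adler32-style
formula (weak_sums, roll, and MOD are my names, not rsync’s):

MOD = 65521  # adler32's modulus (largest prime below 2**16)

def weak_sums(block):
    """Compute (a, b) sums for a window of bytes, per the tech report."""
    n = len(block)
    a = sum(block) % MOD
    b = sum((n - i) * byte for i, byte in enumerate(block)) % MOD
    return a, b

def roll(a, b, out_byte, in_byte, n):
    """Slide the window one byte right in O(1): drop out_byte, add in_byte."""
    a = (a - out_byte + in_byte) % MOD
    b = (b - n * out_byte + a) % MOD
    return a, b

# Rolling from data[0:4] to data[1:5] matches computing from scratch
data = b'hello world'
a, b = weak_sums(data[0:4])
assert roll(a, b, data[0], data[4], 4) == weak_sums(data[1:5])

That constant-time slide is what makes it affordable to test a
checksum at every byte offset of α.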
def _get_block_list(file_one, file_two):
    """
    The good stuff.

    1. create checksums for each chunk of file_two
    2. for each chunk in file_one, determine if chunk is already in file_two
       a. If so:
          i. append the index of that chunk
          ii. move the read head by the size of a chunk
       b. If not:
          i. append the next byte
          ii. move the read head by 1 byte
    3. start over at 2 until you're out of file to read
    """
    checksums = checksums_file(file_two)
    blocks = []
    offset = 0
    with open(file_one, 'rb') as f:
        while True:
            chunk = f.read(BLOCK_SIZE)
            if not chunk:
                break
            chunk_number = checksums.get_chunk(chunk)
            if chunk_number is not None:
                # Whole chunk already exists in file_two: record its index
                blocks.append(chunk_number)
                offset += BLOCK_SIZE
            else:
                # No match: emit one literal byte and slide the window by 1
                blocks.append(chunk[:1])  # slice, not index, to keep bytes
                offset += 1
                f.seek(offset)
    return blocks

The poorly named file function (but it’s in the
rsync.py module, so rsync.file is good…right?
No? OK.) takes the list of chunk indices and raw bytes from
_get_block_list, finds the chunks in β referenced by the
index, combines those chunks with the raw bytes from α, and returns
bytes identical to file α – it just took a weird route to get
there :)
def file(file_one, file_two):
    """
    Essentially this returns file one, but in a fancy way :)

    The output from _get_block_list is a list of either chunk indexes or
    raw bytes.

    If it's a chunk index, then read that chunk from the file and append it to
    output. If it's not a chunk index, then it's actual data and should just be
    appended to output directly.
    """
    output = b''
    with open(file_two, 'rb') as ft:
        for block in _get_block_list(file_one, file_two):
            if isinstance(block, int):
                # A chunk we already have: copy it out of file_two
                ft.seek(block * BLOCK_SIZE)
                output += ft.read(BLOCK_SIZE)
            else:
                # Literal bytes that had to come from file_one
                output += block
    return output

Creating a python file that imports this script as a module and
invokes file is all you need to actually run it. I wrote a
bunch of tests to help me write the script. The core of the test file
was simply:
import rsync
if __name__ == '__main__':
    rsync.file('fixtures/foo.txt', 'fixtures/bar.txt')

And at the end of our python-ing, we came to see the
rsync beast in a new light – not a beast at all, just a
misunderstood algorithm with a lot of command line flags. The End.