I’ve often pondered this great and terrible beast – rsync. Its spiny, nearly impenetrable command-line interface. Its majestic and wonderful efficiency. The depths of its man page, and the heights of its use-cases.
Leaving aside the confusing implications of trailing slashes, rsync is amazing. The Wikimedia deployment tooling – scap (which at this point has been iterated on for over a decade) – still makes heavy use of rsync. At $DAYJOB - 3, rsync is used to manage a library of hundreds of thousands of flac, mp3, and ogg files. It’s hard to argue with rsync. The amount of network traffic generated via rsync is really hard to beat with any other program.
But what’s it doing? rsync is fast. rsync is ubiquitous. rsync uses few client resources, and little network IO. OK… Why?
I started reading about the rsync algorithm when a fellow I work alongside began espousing the relative superiority of zsync for the case of our deployment server. Currently scap has a built-in (and quite fancy) fan-out system so as not to put too high a load on any one server; however, zsync flips the rsync algorithm on its head, running the rsync algorithm on the client rather than the server. What exactly is rsync doing that makes the load on the server so high?
The Meat
For the purposes of explanation, let’s say you ran the command: rsync α β.
The rsync algorithm boils down to 5 steps –

1. Split file β into chunks of length n.
2. Calculate a weak (adler32) and strong (md4 – my implementation below substitutes md5) checksum for each chunk of file β.
3. Send those checksums to the rsync server (where file α is).
4. Find all the chunks of length n in α that are in β by comparing checksums.
5. Create a list of instructions to recreate α from β.
Easy.
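The one clever bit is in step 2: the weak checksum is picked because it can be rolled – updated in constant time as a window slides one byte at a time through the file – which is what lets the server cheaply test every offset of α against β’s chunk list. Here’s a minimal sketch of that rolling update (roll_adler32 is a hypothetical helper of my own, using the adler32-style split into two running sums; my implementation below cheats and just recomputes checksums from scratch):

def roll_adler32(a, b, out_byte, in_byte, n):
    """
    Slide an adler32-style window of length n forward by one byte:
    out_byte falls off the front, in_byte joins the back.
    a is 1 + the sum of the window's bytes; b is the running sum of a.
    """
    a = (a - out_byte + in_byte) % 65521
    b = (b - n * out_byte + a - 1) % 65521
    return a, b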
Do it then
I actually think it would have been easier for me to understand a bad python implementation of the rsync algorithm than to read a tech report on rsync. So with that in mind, here’s a bad python implementation of the rsync algorithm.
Pythonic!
First it might be helpful to define my block size, and create a couple of helper functions to create the rolling checksums.
import collections
import hashlib
import zlib

BLOCK_SIZE = 4096


# Helper functions
# ----------------
def md5_chunk(chunk):
    """
    Returns md5 checksum for chunk
    """
    m = hashlib.md5()
    m.update(chunk)
    return m.hexdigest()


def adler32_chunk(chunk):
    """
    Returns adler32 checksum for chunk
    """
    return zlib.adler32(chunk)
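As a quick sanity check of those helpers, here’s what they return in a Python 2 REPL (the same Python the rest of this post assumes):

>>> adler32_chunk('hello world')
436929629
>>> md5_chunk('hello world')
'5eb63bbbe01eeed093cb22bb8f5acdc3'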
I’ll also need a function that creates a rolling checksum of a file. The checksums_file function will read BLOCK_SIZE bytes at a time through to the end of the file, calculate both the adler32 checksum and the md5 checksum for each chunk, and then put those chunks in a data structure. I’d like a nice interface beyond primitives for both the signatures and the list of checksums – I’ll create 2 objects, Signature and Chunks, to make that interface. Chunks is basically a list of Signatures with a few other methods for fanciness.
# Checksum objects
# ----------------
Signature = collections.namedtuple('Signature', 'md5 adler32')


class Chunks(object):
    """
    Data structure that holds rolling checksums for file B
    """
    def __init__(self):
        self.chunks = []
        self.chunk_sigs = {}

    def append(self, sig):
        self.chunks.append(sig)
        self.chunk_sigs.setdefault(sig.adler32, {})
        self.chunk_sigs[sig.adler32][sig.md5] = len(self.chunks) - 1

    def get_chunk(self, chunk):
        adler32 = self.chunk_sigs.get(adler32_chunk(chunk))
        if adler32:
            return adler32.get(md5_chunk(chunk))
        return None

    def __getitem__(self, idx):
        return self.chunks[idx]

    def __len__(self):
        return len(self.chunks)
# Build Chunks from a file
# ------------------------
def checksums_file(fn):
    """
    Returns object with checksums of file
    """
    chunks = Chunks()
    with open(fn) as f:
        while True:
            chunk = f.read(BLOCK_SIZE)
            if not chunk:
                break
            chunks.append(
                Signature(
                    adler32=adler32_chunk(chunk),
                    md5=md5_chunk(chunk)
                )
            )
    return chunks
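To make that interface concrete, here’s a tiny hypothetical usage of Chunks directly (the chunk contents are invented):

# Index one chunk, then look it up again (Python 2 strings).
chunks = Chunks()
chunks.append(Signature(adler32=adler32_chunk('hello'), md5=md5_chunk('hello')))

assert len(chunks) == 1
assert chunks.get_chunk('hello') == 0     # known chunk -> its index
assert chunks.get_chunk('howdy') is None  # unknown chunk -> None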
Now I need a couple of methods to complete the algorithm – one that will find the BLOCK_SIZE chunks in file β that are in file α, and one that will produce instructions that can be used to assemble the new and improved β from the β we’ve already got.
The _get_block_list function will return a list of chunk indices and bytes. The chunk indices are indices of chunks already present in file β (we know these from the checksums_file function); the bytes are raw bytes that are in α but may not be in β. If a chunk is found in α that is not in β, then the first byte of that chunk is appended to the output list and a checksum is calculated for the next BLOCK_SIZE chunk.
This is why network IO for rsync is so efficient – the only raw data that is sent is the information missing from the remote. This is also why rsync causes higher load on the server than the client – it’s not just checksumming files, it’s checksumming, comparing, and building a diff. And it’s doing that process for every machine to which it is attempting to sync.
def _get_block_list(file_one, file_two):
    """
    The good stuff.

    1. create rolling checksums for file_two
    2. for each chunk in file_one, determine if chunk is already in file_two
        a. If so:
            i. return the index of that chunk
            ii. move the read head by the size of a chunk
        b. If not:
            i. return the next byte
            ii. move the read head by 1 byte
    3. start over at 2 until you're out of file to read
    """
    checksums = checksums_file(file_two)
    blocks = []
    offset = 0
    with open(file_one) as f:
        while True:
            chunk = f.read(BLOCK_SIZE)
            if not chunk:
                break
            chunk_number = checksums.get_chunk(chunk)
            if chunk_number is not None:
                offset += BLOCK_SIZE
                blocks.append(chunk_number)
                continue
            else:
                offset += 1
                blocks.append(chunk[0])
                f.seek(offset)
                continue
    return blocks
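To make that output concrete, here’s a hand-traced run with a tiny block size (the filenames and contents are invented for illustration):

# Suppose BLOCK_SIZE = 4, and:
#
#   file_one (alpha) contains: 'the cat in the hat'
#   file_two (beta)  contains: 'the dog in the hat'
#
# beta's chunks are  0: 'the '  1: 'dog '  2: 'in t'  3: 'he h'  4: 'at'
#
# _get_block_list(file_one, file_two) should return:
#
#   [0, 'c', 'a', 't', ' ', 2, 3, 4]
#
# i.e. copy beta's chunk 0, splice in the raw bytes 'cat ', then copy
# beta's chunks 2, 3, and 4 -- only four raw bytes need to cross the network.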
The poorly named file function (but it’s in the rsync.py module, so rsync.file is good… right? No? OK.) takes the list of chunk indices and raw bytes from _get_block_list, finds the chunks in β referenced by the index, combines those chunks with the raw bytes from α, and returns a string that is the same as file α – it just took a weird route to get there :)
def file(file_one, file_two):
    """
    Essentially this returns file one, but in a fancy way :)

    The output from _get_block_list is a list of either chunk indexes or data
    as strings.

    If it's a chunk index, then read that chunk from the file and append it to
    output. If it's not a chunk index, then it's actual data and should just
    be appended to output directly.
    """
    output = ''
    with open(file_two) as ft:
        for block in _get_block_list(file_one, file_two):
            if isinstance(block, int):
                ft.seek(block * BLOCK_SIZE)
                output += ft.read(BLOCK_SIZE)
            else:
                output += block
    return output
Creating a python file that imports this script as a module and invokes file is all you need to actually run it. I wrote a bunch of tests to help me write the script. The core of the test file was simply:

import rsync

if __name__ == '__main__':
    rsync.file('fixtures/foo.txt', 'fixtures/bar.txt')
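If you’d like the round trip verified end to end, a self-contained sketch like this works too (the fixture contents here are invented):

import os

import rsync

if __name__ == '__main__':
    if not os.path.isdir('fixtures'):
        os.makedirs('fixtures')

    # Two similar-but-different files to sync between
    with open('fixtures/foo.txt', 'w') as f:
        f.write('all work and no play makes jack a dull boy\n' * 500)
    with open('fixtures/bar.txt', 'w') as f:
        f.write('all work and no play makes jill a dull girl\n' * 500)

    # Rebuild foo.txt from bar.txt's chunks plus the raw-byte diff
    result = rsync.file('fixtures/foo.txt', 'fixtures/bar.txt')
    assert result == open('fixtures/foo.txt').read()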
And at the end of our python-ing, we came to see the rsync beast in a new light – not a beast at all, just a misunderstood algorithm with a lot of command line flags. The End.