I’ve often pondered this great and terrible beast –
rsync. Its spiny, nearly impenetrable command-line
interface. Its majestic and wonderful efficiency. The depths of its man
page, and the heights of its use-cases.
Leaving aside the confusing implications of trailing slashes,
rsync is amazing. The Wikimedia deployment tooling – scap (which at
this point has been iterated on for over a decade) – still makes heavy
use of rsync. At $DAYJOB - 3,
rsync is used to manage a library of hundreds of thousands
of flac, mp3, and ogg files. It’s
hard to argue with rsync: the small amount of network traffic
it generates is really hard to beat with any other
program.
But what’s it doing?
rsync is fast. rsync is ubiquitous.
rsync uses few client resources and little network IO.
OK…Why?
I started reading about the rsync algorithm when a fellow I work alongside began
espousing the relative superiority of zsync for the case of our
deployment server. Currently scap has a built-in (and quite
fancy) fan-out system so as not to put too high a load on a single
server; however, zsync flips the rsync
algorithm on its head, running it on the
client rather than the server. What exactly is rsync doing
that makes the load on the server so high?
The Meat
For the purposes of explanation, let’s say you ran the command:
rsync α β.
The rsync algorithm boils down to 5 steps –
- Split file β into chunks of length n.
- Calculate a weak (adler32) and a strong (md4 in the original tech report; this post’s toy code uses md5) checksum for each chunk of file β.
- Send those checksums to the rsync server (where file α is).
- Find all the chunks of length n in α that are already in β by comparing checksums.
- Create a list of instructions to recreate α from β.
Easy.
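To make step 5 concrete, here’s a hypothetical miniature example
with a block size of 4 bytes (rather than the 4096 the code below
uses) and made-up file contents:

# Made-up contents; n = 4
#   β = b'aaaabbbbcccc'   chunks -> 0: b'aaaa', 1: b'bbbb', 2: b'cccc'
#   α = b'aaaaXbbbbcccc'  (one new byte after the first chunk)
# The instruction list to recreate α from β would be:
instructions = [0, b'X', 1, 2]
# Integers reference chunks the receiver already has in β; raw bytes
# (here just b'X') are the only literal data that crosses the network.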
Do it then
I actually think it would have been easier for me to understand a bad
python implementation of the rsync algorithm than to read
a tech report on rsync. So with that in mind, here’s a bad
python implementation of the rsync algorithm.
Pythonic!
First it might be helpful to define my block size and create a couple of helper functions to compute the weak and strong checksums.
import collections
import hashlib
import zlib
BLOCK_SIZE = 4096
# Helper functions
# ----------------
def md5_chunk(chunk):
    """
    Returns md5 checksum for chunk
    """
    m = hashlib.md5()
    m.update(chunk)
    return m.hexdigest()
def adler32_chunk(chunk):
    """
    Returns adler32 checksum for chunk
    """
    return zlib.adler32(chunk)

I’ll also need a function that computes the per-block checksums of a file.
The checksums_file function will read
BLOCK_SIZE bytes at a time through to the end of the file, calculate
both the adler32 checksum and the md5 checksum
for each chunk, and then put those checksums in a data structure.
I’d like a nice interface beyond primitives for both the signatures
and the list of checksums – I’ll create two objects, Signature
and Chunks, to make that interface. Chunks is
basically a list of Signatures with a few other methods for
fanciness.
# Checksum objects
# ----------------
Signature = collections.namedtuple('Signature', 'md5 adler32')
class Chunks(object):
    """
    Data structure that holds rolling checksums for file B
    """
    def __init__(self):
        self.chunks = []
        self.chunk_sigs = {}

    def append(self, sig):
        self.chunks.append(sig)
        self.chunk_sigs.setdefault(sig.adler32, {})
        self.chunk_sigs[sig.adler32][sig.md5] = len(self.chunks) - 1

    def get_chunk(self, chunk):
        # Cheap adler32 lookup first; only compute md5 on a weak hit
        md5s = self.chunk_sigs.get(adler32_chunk(chunk))
        if md5s:
            return md5s.get(md5_chunk(chunk))
        return None

    def __getitem__(self, idx):
        return self.chunks[idx]

    def __len__(self):
        return len(self.chunks)
# Build Chunks from a file
# ------------------------
def checksums_file(fn):
    """
    Returns object with checksums of file
    """
    chunks = Chunks()
    # Binary mode: checksums are computed over raw bytes
    with open(fn, 'rb') as f:
        while True:
            chunk = f.read(BLOCK_SIZE)
            if not chunk:
                break
            chunks.append(
                Signature(
                    adler32=adler32_chunk(chunk),
                    md5=md5_chunk(chunk)
                )
            )
    return chunks

Now I need a couple of methods to complete the algorithm – one that
will find the BLOCK_SIZE chunks of file α that are already in file β,
and one that will produce instructions that can be used to assemble
the new and improved β from the β we’ve already got.
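Before writing those, a quick sanity check of the pieces so far – the
fixture path here is made up for illustration:

# Hypothetical usage: 'fixtures/bar.txt' stands in for file β
chunks = checksums_file('fixtures/bar.txt')
print(len(chunks))  # number of BLOCK_SIZE chunks in β

# Looking up raw bytes returns that chunk's index in β, or None
with open('fixtures/bar.txt', 'rb') as f:
    first = f.read(BLOCK_SIZE)
print(chunks.get_chunk(first))                       # -> 0
print(chunks.get_chunk(b'not a chunk of the file'))  # -> None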
The _get_block_list function will return a list of chunk
indices and raw bytes. The chunk indices are indices of chunks already
present in file β (we know this from the checksums_file
function); the raw bytes are data in α that may not be in β. If
a chunk is found in α that is not in β, then the first byte of that
chunk is appended to the output list, and checksums are calculated for
the BLOCK_SIZE window starting one byte later.
This is why network IO for rsync is so efficient – the
only raw data that is sent is the information missing from the remote.
This is also why rsync causes higher load on the server
than the client – it’s not just checksumming files, it’s checksumming,
comparing, and building a diff. And it’s doing that process for every
machine to which it is attempting to sync.
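One more aside before the code: this toy implementation recomputes
adler32 from scratch every time the window slides by a byte, which is
a big part of why it’s “bad.” The real algorithm’s weak checksum is
chosen so it can be rolled: the sums for bytes i+1 through i+n come in
constant time from the sums for bytes i through i+n-1. Here’s a
minimal sketch of that idea following the tech report’s adler32-style
formula (weak_sums, roll, and MOD are my names, not rsync’s):

MOD = 65521  # adler32's modulus (largest prime below 2**16)

def weak_sums(block):
    """Compute (a, b) sums for a window of bytes, per the tech report."""
    n = len(block)
    a = sum(block) % MOD
    b = sum((n - i) * byte for i, byte in enumerate(block)) % MOD
    return a, b

def roll(a, b, out_byte, in_byte, n):
    """Slide the window one byte right in O(1): drop out_byte, add in_byte."""
    a = (a - out_byte + in_byte) % MOD
    b = (b - n * out_byte + a) % MOD
    return a, b

# Rolling from data[0:4] to data[1:5] matches computing from scratch
data = b'hello world'
a, b = weak_sums(data[0:4])
assert roll(a, b, data[0], data[4], 4) == weak_sums(data[1:5])

That constant-time slide is what makes it affordable to test a
checksum at every byte offset of α.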
def _get_block_list(file_one, file_two):
    """
    The good stuff.

    1. create checksums for each chunk of file_two
    2. for each chunk in file_one, determine if chunk is already in file_two
       a. If so:
          i. append the index of that chunk
          ii. move the read head by the size of a chunk
       b. If not:
          i. append the next byte
          ii. move the read head by 1 byte
    3. start over at 2 until you're out of file to read
    """
    checksums = checksums_file(file_two)
    blocks = []
    offset = 0
    with open(file_one, 'rb') as f:
        while True:
            chunk = f.read(BLOCK_SIZE)
            if not chunk:
                break
            chunk_number = checksums.get_chunk(chunk)
            if chunk_number is not None:
                # Whole chunk already exists in file_two: record its index
                blocks.append(chunk_number)
                offset += BLOCK_SIZE
            else:
                # No match: emit one literal byte and slide the window by 1
                blocks.append(chunk[:1])  # slice, not index, to keep bytes
                offset += 1
                f.seek(offset)
    return blocks

The poorly named file function (but it’s in the
rsync.py module, so rsync.file is good…right?
No? OK.) takes the list of chunk indices and raw bytes from
_get_block_list, finds the chunks in β referenced by the
index, combines those chunks with the raw bytes from α, and returns
bytes identical to file α – it just took a weird route to get
there :)
def file(file_one, file_two):
    """
    Essentially this returns file one, but in a fancy way :)

    The output from _get_block_list is a list of either chunk indexes or
    raw bytes.

    If it's a chunk index, then read that chunk from the file and append it to
    output. If it's not a chunk index, then it's actual data and should just be
    appended to output directly.
    """
    output = b''
    with open(file_two, 'rb') as ft:
        for block in _get_block_list(file_one, file_two):
            if isinstance(block, int):
                # A chunk we already have: copy it out of file_two
                ft.seek(block * BLOCK_SIZE)
                output += ft.read(BLOCK_SIZE)
            else:
                # Literal bytes that had to come from file_one
                output += block
    return output

Creating a python file that imports this script as a module and
invokes file is all you need to actually run it. I wrote a
bunch of tests to help me write the script. The core of the test file
was simply:
import rsync
if __name__ == '__main__':
    rsync.file('fixtures/foo.txt', 'fixtures/bar.txt')

And at the end of our python-ing, we came to see the
rsync beast in a new light – not a beast at all, just a
misunderstood algorithm with a lot of command line flags. The End.