Backup Is Hard; Let's Go Shopping
log in to bookmark this presentaton
Abstract
This is the story of a solution to a huge problem: fast, secure online backup. A single client generates a hundred gigabytes, millions of data chunks, and thousands of file system snapshots. To appreciate the problem's scale, consider that a Python array holding content hashes for 1,000,000 files consumes 100 MB of memory. File hashes are only a portion of the required per-file metadata, and that's only one for snapshot of thousands.
We'll tour the hard parts of this system with no apology for their difficulty. The httplib/asyncore hybrid monster that served millions of parallel requests, transparently retrying on failures and timeouts, with only 300 lines of python. The RESTful API – fully respecting hypertext, with every request safely repeatable, even POSTs, and not a single hard-coded URL in the client. The encryption scheme that leaked nothing – not even modification times – but could quickly diff local file systems against the server. And, that one time that a client accidentally requested a 4.76 megabyte URL in production.