diff --git a/README.md b/README.md index be60045..08a79f2 100644 --- a/README.md +++ b/README.md @@ -38,22 +38,98 @@ version of the files available in the target directory.

The remote diffing algorithm is based on CDC. In our tests, it is up to 30x -faster than the one used in rsync (1500 MB/s vs 50 MB/s). +faster than the one used in `rsync` (1500 MB/s vs 50 MB/s). -The following chart shows a comparison of `cdc_rsync` and Linux rsync running +The following chart shows a comparison of `cdc_rsync` and Linux `rsync` running under Cygwin on Windows. The test data consists of 58 development builds of some game provided to us for evaluation purposes. The builds are 40-45 GB large. For this experiment, we uploaded the first build, then synced the second build with each of the two tools and measured the time. For example, syncing -from build 1 to build 2 took 210 seconds with the Linux rsync, but only 75 +from build 1 to build 2 took 210 seconds with the Cygwin `rsync`, but only 75 seconds with `cdc_rsync`. The three outliers are probably feature drops from another development branch, where the delta was much higher. Overall, -`cdc_rsync` syncs files about **3 times faster** than Linux rsync. +`cdc_rsync` syncs files about **3 times faster** than Cygwin `rsync`.

Comparison of cdc_rsync and Linux rsync running in Cygwin

+We also ran the experiment with the native Linux `rsync`, i.e. syncing Linux to +Linux, to rule out issues with Cygwin. Linux `rsync` performed on average 35% +worse than Cygwin `rsync`, which can be attributed to CPU differences. We did +not include it in the figure because of this, but you can find it +[here](docs/cdc_rsync_vs_cygwin_rsync_vs_linux_rsync.png). + +### How does it work and why is it faster? + +The standard Linux `rsync` splits a file into fixed-size chunks of typically +several KB. + +

+ Linux rsync uses fixed size chunks +

+ +If the file is modified in the middle, e.g. by inserting `xxxx` after `567`, +this usually means that the modified chunks as well as +all subsequent chunks change. + +

+ Fixed size chunks after inserting data +
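
The effect of an insertion on fixed-size chunks can be reproduced with a minimal sketch. The 8-byte chunk size, the sample data, and the helper name `fixed_size_chunks` are assumptions chosen for illustration; they are not taken from `rsync`, which uses chunks of several KB.

```python
def fixed_size_chunks(data, size):
    """Split data into consecutive chunks of `size` bytes (the last may be shorter)."""
    return [data[i:i + size] for i in range(0, len(data), size)]


CHUNK_SIZE = 8  # assumption for this toy example only

old = b"0123456789abcdefghijklmnopqrstuvwxyz"
new = old.replace(b"567", b"567xxxx")  # insert `xxxx` after `567`

old_chunks = fixed_size_chunks(old, CHUNK_SIZE)
new_chunks = fixed_size_chunks(new, CHUNK_SIZE)

# Chunks of the new file that also exist in the old file.
known = set(old_chunks)
unchanged = [c for c in new_chunks if c in known]

print("old:", old_chunks)
print("new:", new_chunks)
print("unchanged:", unchanged)
```

Only the chunk in front of the insertion survives. Every later chunk is shifted by four bytes, so its content no longer lines up with any old chunk.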

+ +The standard `rsync` algorithm hashes the chunks of the remote "old" file +and sends the hashes to the local device. The local device then figures out +which parts of the "new" file match known chunks. + +

+ Syncing a file with the standard Linux rsync +
+ Standard rsync algorithm +

+ +This is a simplification. The actual algorithm is more complicated and uses +two hashes, a weak rolling hash and a strong hash; see +[here](https://rsync.samba.org/tech_report/) for a great overview. What makes +`rsync` relatively slow is the "no match" situation where the rolling hash does +not match any remote hash, and the algorithm has to roll the hash forward and +perform a hash map lookup for each byte. `rsync` goes to +[great lengths](https://github.com/librsync/librsync/blob/master/src/hashtable.h) +to optimize lookups. + +`cdc_rsync` does not use fixed-size chunks, but instead variable-size, +content-defined chunks. That means chunk boundaries are determined by the +*local content* of the file, in practice a 64-byte sliding window. For more +details, see +[the FastCDC paper](https://www.usenix.org/conference/atc16/technical-sessions/presentation/xia) +or take a look at [our implementation](fastcdc/fastcdc.h). + +

+ cdc_rsync uses variable, content-defined size chunks +

+ +If the file is modified in the middle, only the modified +chunks, but not subsequent chunks +change (unless they are less than 64 bytes away from the modifications). + +

+ Content-defined chunks after inserting data +
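
To make the locality property concrete, here is a simplified sketch of content-defined chunking with a gear-style rolling hash. It is not the code in `fastcdc/fastcdc.h`: the random lookup table, the boundary mask, the test data, and the function name `cdc_chunks` are all assumptions for illustration, and the sketch never resets the hash at chunk boundaries and enforces no minimum or maximum chunk size.

```python
import random

MASK64 = (1 << 64) - 1
# Hypothetical boundary rule: a chunk ends where the top 6 bits of the hash
# are zero, which gives chunks of roughly 64 bytes on random data.
BOUNDARY_MASK = 0x3F << 58

# Hypothetical lookup table with one random 64-bit value per byte value.
_table_rng = random.Random(42)
GEAR = [_table_rng.getrandbits(64) for _ in range(256)]


def cdc_chunks(data):
    """Split data at positions chosen by a rolling hash of the preceding bytes."""
    chunks, start, h = [], 0, 0
    for i, byte in enumerate(data):
        # Left-shift, table lookup and add, truncated to 64 bits. After 64
        # shifts a byte no longer influences h, so h covers a 64-byte window.
        h = ((h << 1) + GEAR[byte]) & MASK64
        if (h & BOUNDARY_MASK) == 0:
            chunks.append(data[start:i + 1])
            start = i + 1
    if start < len(data):
        chunks.append(data[start:])
    return chunks


data_rng = random.Random(7)
old = bytes(data_rng.getrandbits(8) for _ in range(4096))
ins, patch = 2000, b"xxxx"
new = old[:ins] + patch + old[ins:]  # insert 4 bytes in the middle

old_chunks = cdc_chunks(old)
new_chunks = cdc_chunks(new)
old_set = set(old_chunks)
changed = [c for c in new_chunks if c not in old_set]
print(f"{len(new_chunks)} chunks in the new file, {len(changed)} not found in the old file")
```

Because a boundary depends only on the last 64 bytes, every chunk that ends before the insertion or starts at least 64 bytes after it is byte-identical to a chunk of the old file.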

+ +Computing the chunk boundaries is cheap and involves only a left-shift, a memory +lookup, an `add` and an `and` operation for each input byte. This is cheaper +than the hash map lookup for the standard `rsync` algorithm. + +Because of this, the `cdc_rsync` algorithm is faster than the standard +`rsync`. It is also simpler. Since chunk boundaries move along with insertions +or deletions, the task of matching local and remote hashes is a trivial set +difference operation. It does not involve a per-byte hash map lookup. + +

+ Syncing a file with cdc_rsync +
+ cdc_rsync algorithm +
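
The set difference step described above can be sketched as follows. This is an assumption-laden illustration, not the `cdc_rsync` wire protocol: the function names `digest`, `plan_sync` and `rebuild`, the choice of SHA-256, and the in-memory dictionaries standing in for the remote side are all hypothetical.

```python
import hashlib


def digest(chunk):
    """Strong hash of a chunk; SHA-256 is an arbitrary choice for this sketch."""
    return hashlib.sha256(chunk).hexdigest()


def plan_sync(local_chunks, remote_hashes):
    """Return the chunks the remote side is missing and the recipe for the new file."""
    recipe = [digest(c) for c in local_chunks]
    missing = set(recipe) - set(remote_hashes)  # the set difference
    to_send = {h: c for h, c in zip(recipe, local_chunks) if h in missing}
    return to_send, recipe


def rebuild(recipe, remote_store, received):
    """Reassemble the new file from chunks the remote already had plus received ones."""
    store = {**remote_store, **received}
    return b"".join(store[h] for h in recipe)


# Chunk lists as a content-defined chunker might produce them: only the chunk
# that contains the inserted `xxxx` differs between the old and the new file.
remote_chunks = [b"0123", b"4567", b"89ab", b"cdef"]
local_chunks = [b"0123", b"4567xxxx", b"89ab", b"cdef"]

remote_store = {digest(c): c for c in remote_chunks}
to_send, recipe = plan_sync(local_chunks, remote_store.keys())
result = rebuild(recipe, remote_store, to_send)

print("chunks to upload:", list(to_send.values()))
```

No per-byte lookup is needed here: each chunk is hashed once, and the comparison is a single set operation on the hashes.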

+ ## CDC Stream `cdc_stream` is a tool to stream files and directories from a Windows machine to a diff --git a/docs/cdc_rsync_animation.gif b/docs/cdc_rsync_animation.gif new file mode 100644 index 0000000..46052ac Binary files /dev/null and b/docs/cdc_rsync_animation.gif differ diff --git a/docs/cdc_rsync_vs_cygwin_rsync_vs_linux_rsync.png b/docs/cdc_rsync_vs_cygwin_rsync_vs_linux_rsync.png new file mode 100644 index 0000000..602264c Binary files /dev/null and b/docs/cdc_rsync_vs_cygwin_rsync_vs_linux_rsync.png differ diff --git a/docs/fixed_size_chunks.png b/docs/fixed_size_chunks.png new file mode 100644 index 0000000..f568340 Binary files /dev/null and b/docs/fixed_size_chunks.png differ diff --git a/docs/fixed_size_chunks_inserted.png b/docs/fixed_size_chunks_inserted.png new file mode 100644 index 0000000..4a422a8 Binary files /dev/null and b/docs/fixed_size_chunks_inserted.png differ diff --git a/docs/linux_rsync_animation.gif b/docs/linux_rsync_animation.gif new file mode 100644 index 0000000..f3c6db4 Binary files /dev/null and b/docs/linux_rsync_animation.gif differ diff --git a/docs/variable_size_chunks.png b/docs/variable_size_chunks.png new file mode 100644 index 0000000..1f74a87 Binary files /dev/null and b/docs/variable_size_chunks.png differ diff --git a/docs/variable_size_chunks_inserted.png b/docs/variable_size_chunks_inserted.png new file mode 100644 index 0000000..604c1a5 Binary files /dev/null and b/docs/variable_size_chunks_inserted.png differ