[cdc_rsync] Improve README (#50)
Adds more info about how cdc_rsync works and why it's faster. Fixes #49
84
README.md
@@ -38,22 +38,98 @@ version of the files available in the target directory.
|
|||||||
</p>
|
</p>
|
||||||
|
|
||||||
The remote diffing algorithm is based on CDC. In our tests, it is up to 30x
|
The remote diffing algorithm is based on CDC. In our tests, it is up to 30x
|
||||||
faster than the one used in rsync (1500 MB/s vs 50 MB/s).
|
faster than the one used in `rsync` (1500 MB/s vs 50 MB/s).
|
||||||
|
|
||||||
The following chart shows a comparison of `cdc_rsync` and Linux rsync running
|
The following chart shows a comparison of `cdc_rsync` and Linux `rsync` running
|
||||||
under Cygwin on Windows. The test data consists of 58 development builds
|
under Cygwin on Windows. The test data consists of 58 development builds
|
||||||
of some game provided to us for evaluation purposes. The builds are 40-45 GB
|
of some game provided to us for evaluation purposes. The builds are 40-45 GB
|
||||||
large. For this experiment, we uploaded the first build, then synced the second
|
large. For this experiment, we uploaded the first build, then synced the second
|
||||||
build with each of the two tools and measured the time. For example, syncing
|
build with each of the two tools and measured the time. For example, syncing
|
||||||
from build 1 to build 2 took 210 seconds with the Linux rsync, but only 75
|
from build 1 to build 2 took 210 seconds with the Cygwin `rsync`, but only 75
|
||||||
seconds with `cdc_rsync`. The three outliers are probably feature drops from
|
seconds with `cdc_rsync`. The three outliers are probably feature drops from
|
||||||
another development branch, where the delta was much higher. Overall,
|
another development branch, where the delta was much higher. Overall,
|
||||||
`cdc_rsync` syncs files about **3 times faster** than Linux rsync.
|
`cdc_rsync` syncs files about **3 times faster** than Cygwin `rsync`.
|
||||||
|
|
||||||
<p align="center">
|
<p align="center">
|
||||||
<img src="docs/cdc_rsync_vs_cygwin_rsync.png" alt="Comparison of cdc_rsync and Linux rsync running in Cygwin" width="753" />
|
<img src="docs/cdc_rsync_vs_cygwin_rsync.png" alt="Comparison of cdc_rsync and Linux rsync running in Cygwin" width="753" />
|
||||||
</p>
|
</p>
|
||||||
|
|
||||||
|
We also ran the experiment with the native Linux `rsync`, i.e. syncing Linux to
Linux, to rule out issues with Cygwin. Linux `rsync` performed on average 35%
worse than Cygwin `rsync`, which can be attributed to CPU differences. We did
not include it in the figure because of this, but you can find it
[here](docs/cdc_rsync_vs_cygwin_rsync_vs_linux_rsync.png).
|
||||||
|
|
||||||
|
### How does it work and why is it faster?

The standard Linux `rsync` splits a file into fixed-size chunks of typically
several KB.

<p align="center">
<img src="docs/fixed_size_chunks.png" alt="Linux rsync uses fixed size chunks" width="258" />
</p>

If the file is modified in the middle, e.g. by inserting `xxxx` after `567`,
this usually means that <span style="color: red">the modified chunks as well as
all subsequent chunks</span> change.

<p align="center">
<img src="docs/fixed_size_chunks_inserted.png" alt="Fixed size chunks after inserting data" width="301" />
</p>
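The effect of an insertion on fixed-size chunks can be reproduced with a few lines of Python. This is only an illustrative sketch, not code from `rsync`: the helper name `fixed_chunks` and the chunk size of 5 bytes are assumptions chosen to keep the example small.

```python
def fixed_chunks(data: bytes, size: int) -> list[bytes]:
    """Split data into consecutive chunks of `size` bytes (the last may be shorter)."""
    return [data[i:i + size] for i in range(0, len(data), size)]


old = b"0123456789abcdefghij"
new = old[:8] + b"xxxx" + old[8:]  # insert "xxxx" after "567"

old_chunks = fixed_chunks(old, 5)
new_chunks = fixed_chunks(new, 5)

# Chunks that exist in both versions, regardless of position.
shared = set(old_chunks) & set(new_chunks)

print(old_chunks)
print(new_chunks)
print(shared)
```

Only the chunk in front of the insertion survives. Every chunk behind it now starts at a shifted offset and therefore has different content.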
|
||||||
|
|
||||||
|
The standard `rsync` algorithm hashes the chunks of the remote "old" file
and sends the hashes to the local device. The local device then figures out
which parts of the "new" file match known chunks.

<p align="center">
<img src="docs/linux_rsync_animation.gif" alt="Syncing a file with the standard Linux rsync" width="855" />
<br>
Standard rsync algorithm
</p>

This is a simplification. The actual algorithm is more complicated and uses
two hashes, a weak rolling hash and a strong hash; see
[the rsync technical report](https://rsync.samba.org/tech_report/) for a great overview. What makes
`rsync` relatively slow is the "no match" situation where the rolling hash does
not match any remote hash, and the algorithm has to roll the hash forward and
perform a hash map lookup for each byte. `rsync` goes to
[great lengths](https://github.com/librsync/librsync/blob/master/src/hashtable.h)
to optimize lookups.
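The per-byte cost in the "no match" case can be seen in the following sketch. It is a simplified model in the spirit of the weak rolling checksum from the technical report, not the actual `rsync` code: the names `weak_sum`, `roll` and `find_matches` are hypothetical, the block size is tiny, and a direct byte comparison stands in for the strong hash.

```python
M = 1 << 16  # modulus of the two checksum halves


def weak_sum(block: bytes) -> tuple[int, int]:
    """Weak checksum of a block: a plain sum and a position-weighted sum."""
    n = len(block)
    a = sum(block) % M
    b = sum((n - i) * x for i, x in enumerate(block)) % M
    return a, b


def roll(a: int, b: int, out_byte: int, in_byte: int, n: int) -> tuple[int, int]:
    """Slide a window of length n forward by one byte in O(1)."""
    a = (a - out_byte + in_byte) % M
    b = (b - n * out_byte + a) % M
    return a, b


def find_matches(old: bytes, new: bytes, size: int) -> list[tuple[int, int]]:
    """Return (offset in new, block index in old) for every matched block."""
    table: dict[tuple[int, int], list[tuple[int, bytes]]] = {}
    for start in range(0, len(old) - size + 1, size):
        block = old[start:start + size]
        table.setdefault(weak_sum(block), []).append((start // size, block))

    matches: list[tuple[int, int]] = []
    if len(new) < size:
        return matches
    pos = 0
    a, b = weak_sum(new[:size])
    while pos + size <= len(new):
        hit = None
        # One hash map lookup per window position.
        for idx, block in table.get((a, b), []):
            if block == new[pos:pos + size]:  # stand-in for the strong hash
                hit = idx
                break
        if hit is not None:
            matches.append((pos, hit))
            pos += size  # skip over the matched block
            if pos + size <= len(new):
                a, b = weak_sum(new[pos:pos + size])
        else:
            if pos + size < len(new):
                a, b = roll(a, b, new[pos], new[pos + size], size)
            pos += 1  # no match: advance by a single byte
    return matches


old = b"0123456789abcdefghij"
new = old[:8] + b"xxxx" + old[8:]
matches = find_matches(old, new, 5)
print(matches)
```

In this example the window has to be rolled byte by byte from offset 5 to offset 14 before the next known block is found, with one table lookup at every step.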
|
||||||
|
|
||||||
|
`cdc_rsync` does not use fixed-size chunks, but instead variable-size,
content-defined chunks. That means chunk boundaries are determined by the
*local content* of the file, in practice a 64-byte sliding window. For more
details, see
[the FastCDC paper](https://www.usenix.org/conference/atc16/technical-sessions/presentation/xia)
or take a look at [our implementation](fastcdc/fastcdc.h).

<p align="center">
<img src="docs/variable_size_chunks.png" alt="cdc_rsync uses variable, content-defined size chunks" width="260" />
</p>

If the file is modified in the middle, only <span style="color: red">the modified
chunks</span>, but not <span style="color: #38761d">subsequent chunks</span>
change (unless they are less than 64 bytes away from the modifications).

<p align="center">
<img src="docs/variable_size_chunks_inserted.png" alt="Content-defined chunks after inserting data" width="314" />
</p>

Computing the chunk boundaries is cheap and involves only a left-shift, a memory
lookup, an `add` and an `and` operation for each input byte. This is cheaper
than the hash map lookup for the standard `rsync` algorithm.
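The per-byte work described above can be sketched as a gear-style rolling hash. This is a simplified illustration and not the code in `fastcdc/fastcdc.h`: the table `GEAR`, the mask `MASK` and the function names are assumptions, and there is no minimum or maximum chunk size. The mask tests the top four hash bits, so a boundary occurs about every 16 bytes on random data and each decision depends on the last 64 input bytes.

```python
import random

# Hypothetical lookup table: one random 64-bit value per possible byte.
_table_rng = random.Random(1234)
GEAR = [_table_rng.getrandbits(64) for _ in range(256)]

U64 = (1 << 64) - 1
MASK = 0xF << 60  # hypothetical boundary mask (top 4 bits)
WINDOW = 64       # bytes that can influence a 64-bit hash shifted by 1 per byte


def chunk_ends(data: bytes, mask: int = MASK) -> list[int]:
    """Return the exclusive end offset of every content-defined chunk."""
    ends = []
    h = 0
    for i, byte in enumerate(data):
        h = ((h << 1) + GEAR[byte]) & U64  # left-shift, memory lookup, add
        if (h & mask) == 0:                # and
            ends.append(i + 1)
    if data and (not ends or ends[-1] != len(data)):
        ends.append(len(data))             # the last chunk ends with the data
    return ends


def cdc_chunks(data: bytes) -> list[bytes]:
    """Split data at the offsets computed by chunk_ends."""
    ends = chunk_ends(data)
    starts = [0] + ends[:-1]
    return [data[s:e] for s, e in zip(starts, ends)]


data_rng = random.Random(42)
old = bytes(data_rng.getrandbits(8) for _ in range(4096))
q = 2000
new = old[:q] + b"xxxx" + old[q:]  # insert 4 bytes in the middle

old_ends, new_ends = chunk_ends(old), chunk_ends(new)
old_chunks, new_chunks = cdc_chunks(old), cdc_chunks(new)
changed = [c for c in new_chunks if c not in set(old_chunks)]
print(len(new_chunks), "chunks,", len(changed), "changed")
```

Because the hash at any position depends only on the preceding 64 bytes, every boundary that lies at least 64 bytes behind the insertion reappears in the new file, shifted by the length of the insertion. Chunks in front of the insertion are untouched.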
|
||||||
|
|
||||||
|
Because of this, the `cdc_rsync` algorithm is faster than the standard
`rsync`. It is also simpler. Since chunk boundaries move along with insertions
or deletions, the task of matching local and remote hashes is a trivial set
difference operation. It does not involve a per-byte hash map lookup.
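The matching step can be sketched as follows. This is a minimal illustration under assumptions: chunks are identified by a SHA-256 digest, the remote side holds the "old" file and the local side the "new" one, and the chunk lists are hand-written rather than produced by a real chunker. The names `chunk_id` and `missing_chunks` are hypothetical.

```python
import hashlib


def chunk_id(chunk: bytes) -> str:
    """Identify a chunk by a hash of its content (SHA-256 is an assumption)."""
    return hashlib.sha256(chunk).hexdigest()


def missing_chunks(local_chunks: list[bytes], remote_ids: set[str]) -> list[bytes]:
    """Return the local chunks whose hash is unknown to the remote side."""
    return [c for c in local_chunks if chunk_id(c) not in remote_ids]


# Content-defined chunks of the old (remote) and new (local) file.
remote_chunks = [b"0123", b"4567", b"89a", b"bcdef"]
local_chunks = [b"0123", b"4567xxxx", b"89a", b"bcdef"]

remote_ids = {chunk_id(c) for c in remote_chunks}
to_send = missing_chunks(local_chunks, remote_ids)
print(to_send)
```

Only the chunk that contains the inserted data has to be transferred. All other chunks are found with a single set membership test per chunk rather than per byte.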
|
||||||
|
|
||||||
|
<p align="center">
<img src="docs/cdc_rsync_animation.gif" alt="Syncing a file with cdc_rsync" width="857" />
<br>
cdc_rsync algorithm
</p>
|
||||||
|
|
||||||
## CDC Stream
|
## CDC Stream
|
||||||
|
|
||||||
`cdc_stream` is a tool to stream files and directories from a Windows machine to a
|
`cdc_stream` is a tool to stream files and directories from a Windows machine to a
|
||||||
|
|||||||
BIN
docs/cdc_rsync_animation.gif
Normal file
|
After Width: | Height: | Size: 128 KiB |
BIN
docs/cdc_rsync_vs_cygwin_rsync_vs_linux_rsync.png
Normal file
|
After Width: | Height: | Size: 78 KiB |
BIN
docs/fixed_size_chunks.png
Normal file
|
After Width: | Height: | Size: 7.2 KiB |
BIN
docs/fixed_size_chunks_inserted.png
Normal file
|
After Width: | Height: | Size: 9.1 KiB |
BIN
docs/linux_rsync_animation.gif
Normal file
|
After Width: | Height: | Size: 276 KiB |
BIN
docs/variable_size_chunks.png
Normal file
|
After Width: | Height: | Size: 7.2 KiB |
BIN
docs/variable_size_chunks_inserted.png
Normal file
|
After Width: | Height: | Size: 9.3 KiB |