This CL changes the chunking algorithm from "normalized chunking" to
simple "regression chunking", and changes the hash criteria from
'hash&mask' to 'hash<=threshold'. These are all ideas taken from
testing and analysis done at
https://github.com/dbaarda/rollsum-chunking/blob/master/RESULTS.rst
Regression chunking was introduced in
https://www.usenix.org/system/files/conference/atc12/atc12-final293.pdf
The algorithm uses an arbitrary number of regressions using power-of-2
regression target lengths. This means we can use a simple bitmask for
the regression hash criteria.
Regression chunking yields high deduplication rates even for lower max
chunk sizes, so the cdc_stream max chunk size can be reduced from
1024K to 512K. This fixes potential latency spikes from large chunks.
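
The regression idea above can be illustrated with a minimal Python
sketch. It is not the cdc_stream implementation: the gear table, the
function names, the default sizes, and the tie-breaking rule (a later
position with at least as many leading zero bits replaces the current
fallback) are all assumptions made for illustration. The primary
criterion is 'hash<=threshold'; the regression criteria are "the top k
bits are zero", which is a plain bitmask test.

```python
import random

MASK64 = (1 << 64) - 1

# Hypothetical gear table: 256 pseudo-random 64-bit values (fixed seed).
_rng = random.Random(1)
GEAR = [_rng.getrandbits(64) for _ in range(256)]


def next_chunk_len(data, start, min_size, avg_size, max_size):
    """Return the length of the chunk that begins at data[start]."""
    remaining = len(data) - start
    n = min(remaining, max_size)
    # 'hash <= threshold' matches with probability of roughly
    # 1 / (avg_size - min_size + 1) per position.
    threshold = MASK64 // (avg_size - min_size + 1)
    h = MASK64   # all ones, so an empty history never matches
    rc_len = n   # regression fallback; defaults to a cut at max_size
    rc_mask = 0  # top bits that must be zero to replace the fallback
    for length in range(1, n + 1):
        h = ((h << 1) + GEAR[data[start + length - 1]]) & MASK64
        if length < min_size:
            continue
        if h & rc_mask == 0:
            if h <= threshold:
                return length  # primary criterion met
            # Weaker (regression) match: remember it and require at
            # least as many leading zero bits from later candidates.
            rc_len = length
            leading_zeros = 64 - h.bit_length()
            rc_mask = MASK64 ^ (MASK64 >> leading_zeros)
    # Regress only when max_size forced the cut; otherwise emit the tail.
    return rc_len if remaining > max_size else n


def chunk(data, min_size=256, avg_size=1024, max_size=4096):
    """Split data into a list of content-defined chunks."""
    chunks, pos = [], 0
    while pos < len(data):
        n = next_chunk_len(data, pos, min_size, avg_size, max_size)
        chunks.append(data[pos:pos + n])
        pos += n
    return chunks
```

Because each regression level halves the number of required zero bits'
worth of selectivity, the fallback cut point is still content-defined
rather than a fixed cut at max_size, which is what lets a smaller max
chunk size keep deduplication high.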
Adds support for local syncs of files and folders on the same Windows
machine, e.g. cdc_rsync C:\source C:\dest. The main changes are:
- Skip the check whether the port is available remotely with PortManager.
- Do not deploy cdc_rsync_server.
- Run cdc_rsync_server directly, not through an SSH tunnel.
The current implementation is not optimal as it starts
cdc_rsync_server as a separate process and communicates with it via a
TCP port.
* Fix #76: fastcdc chunk boundary off-by-one.
This ensures that the last byte included in the gear-hash that identified the
chunk boundary is included in the chunk, so chunks are still matched when the
byte immediately after them is changed.
* Init gear hash to all 1's to prevent zero-length chunks with min_size=0.
Also change the `MaxChunkSize` test to use min_size=0 to verify this.
* Add a GitHub action for building and testing.
On Windows, -- -//third_party/... doesn't seem to work, so add all test directories manually. Also run the tests_*. We run only fastbuild tests here, since the opt tests will be run in the release workflow.
Also fix a number of compilation and test issues found along the way.
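
The gear-hash boundary fix and the all-ones initialization described in
the fastcdc items above can be sketched in Python as follows. This is an
illustration under assumptions, not the fastcdc code: the gear table,
the threshold value, the sample data, and the name `boundary` are
hypothetical. The loop tests the hash before consuming the next byte, so
a returned length i means the hash covered exactly data[:i].

```python
import random

MASK64 = (1 << 64) - 1

# Hypothetical gear table and sample input (fixed seeds).
_rng = random.Random(0)
GEAR = [_rng.getrandbits(64) for _ in range(256)]
DATA = bytes(random.Random(3).getrandbits(8) for _ in range(4096))

# Assumed threshold: roughly one match per 64 positions.
THRESHOLD = MASK64 // 64


def boundary(data, threshold, init_hash=MASK64):
    """Return the length of the first chunk of data (min_size=0)."""
    h = init_hash
    for i in range(len(data)):
        # h covers data[:i] only. Cutting here keeps the last hashed
        # byte, data[i-1], inside the chunk, and the chunk does not
        # depend on data[i].
        if h <= threshold:
            return i
        h = ((h << 1) + GEAR[data[i]]) & MASK64
    return len(data)


# With a zero initial hash, the very first test passes at i == 0 and
# yields a zero-length chunk. All ones is above any useful threshold.
zero_len = boundary(DATA, THRESHOLD, init_hash=0)  # returns 0
first_cut = boundary(DATA, THRESHOLD)              # always >= 1 here
```

Since the hash that triggers the cut never includes the byte after the
chunk, changing that byte leaves the chunk and its boundary unchanged.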
The tools allow efficient and fast synchronization of large directory
trees from a Windows workstation to a Linux target machine.
cdc_rsync* support efficient copy of files by using content-defined
chunking (CDC) to identify chunks within files that can be reused.
asset_stream_manager + cdc_fuse_fs support efficient streaming of a
local directory to a remote virtual file system based on FUSE. It also
employs CDC to identify and reuse unchanged data chunks.