This CL changes the chunking algorithm from "normalized chunking" to
simple "regression chunking", and changes the has criteria from
'hash&mask' to 'hash<=threshold'. These are all ideas taken from
testing and analysis done at
https://github.com/dbaarda/rollsum-chunking/blob/master/RESULTS.rst
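
For illustration only (the constants here are hypothetical, not the
actual cdc_stream values), the criterion change amounts to swapping a
bitmask test for a threshold comparison; one practical difference is
that a threshold can encode any expected chunk size, not just powers
of two:

    #include <cstdint>

    // Old style: boundary where the low bits of the hash are zero.
    // The expected chunk size is forced to a power of two (2^16 in
    // this hypothetical example).
    bool IsBoundaryMask(uint64_t hash) {
      constexpr uint64_t kMask = (uint64_t{1} << 16) - 1;
      return (hash & kMask) == 0;
    }

    // New style: boundary where the hash is at most a threshold. The
    // hit probability is about threshold / 2^64, so any expected
    // chunk size can be encoded.
    bool IsBoundaryThreshold(uint64_t hash, uint64_t threshold) {
      return hash <= threshold;
    }
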
Regression chunking was introduced in
https://www.usenix.org/system/files/conference/atc12/atc12-final293.pdf
The algorithm supports an arbitrary number of regressions with
power-of-2 regression target lengths, which means a simple bitmask can
be used for the regression hash criteria.
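
A minimal sketch of the combined scheme, assuming a gear-style rolling
hash (all names, constants, and the hash below are hypothetical, not
the actual cdc_stream implementation):

    #include <algorithm>
    #include <cstddef>
    #include <cstdint>

    // Hypothetical gear-style rolling hash update; a real
    // implementation would use a table of 256 random 64-bit
    // constants.
    inline uint64_t Gear(uint8_t b) {
      uint64_t h = uint64_t{b} * 0x9E3779B97F4A7C15ull;
      h ^= h >> 29;
      return h * 0xBF58476D1CE4E5B9ull;
    }

    size_t FindChunkBoundary(const uint8_t* data, size_t size,
                             uint64_t threshold, size_t min_size,
                             size_t max_size) {
      // Illustrative: three regression levels with power-of-2 target
      // lengths, strongest (rarest) first. A top-bit mask turns the
      // level check into 'hash < 2^(64 - bits)', a weaker cousin of
      // the main 'hash <= threshold' criterion.
      constexpr int kBits[] = {19, 18, 17};
      constexpr int kRegressions = 3;
      size_t regression_pos[kRegressions] = {};
      uint64_t hash = 0;
      const size_t limit = std::min(size, max_size);
      for (size_t pos = 0; pos < limit; ++pos) {
        hash = (hash << 1) + Gear(data[pos]);
        if (pos + 1 < min_size) continue;  // respect min chunk size
        if (hash <= threshold) return pos + 1;  // main criterion hit
        // Remember the last position meeting each regression
        // criterion; a position passing a stronger mask also passes
        // the weaker ones, so only its strongest level is updated.
        for (int i = 0; i < kRegressions; ++i) {
          uint64_t mask = ~((uint64_t{1} << (64 - kBits[i])) - 1);
          if ((hash & mask) == 0) {
            regression_pos[i] = pos + 1;
            break;
          }
        }
      }
      // Max size reached with no boundary: regress to the last
      // position that met the strongest regression criterion.
      for (int i = 0; i < kRegressions; ++i) {
        if (regression_pos[i] != 0) return regression_pos[i];
      }
      return limit;  // no regression point; cut at max size
    }

Regressing to the strongest available criterion keeps the cut points
content-defined rather than falling back to arbitrary max-size cuts,
which is what preserves deduplication at smaller max chunk sizes.
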
Regression chunking yields high deduplication rates even for lower max
chunk sizes, so the cdc_stream max chunk size can be reduced from 1024K
to 512K. This avoids potential latency spikes caused by large chunks.

The tools allow efficient and fast synchronization of large directory
trees from a Windows workstation to a Linux target machine.
The cdc_rsync* tools support efficient copying of files by using
content-defined chunking (CDC) to identify chunks within files that
can be reused.
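
As a rough sketch of the reuse idea (hypothetical types and names, not
the actual cdc_rsync protocol; a real tool would use a strong
collision-resistant chunk hash rather than uint64_t): both sides chunk
their copy of a file with the same CDC parameters, so unchanged
regions produce identical chunks even when insertions or deletions
shift offsets, and only chunks missing on the receiver are sent:

    #include <cstddef>
    #include <cstdint>
    #include <unordered_set>
    #include <vector>

    struct Chunk {
      uint64_t hash;   // hash of the chunk contents
      size_t offset;   // position of the chunk within the file
      size_t size;     // chunk length in bytes
    };

    // Returns the sender's chunks whose contents the receiver does
    // not already have; everything else can be reused in place.
    std::vector<Chunk> ChunksToSend(
        const std::vector<Chunk>& sender_chunks,
        const std::unordered_set<uint64_t>& receiver_hashes) {
      std::vector<Chunk> missing;
      for (const Chunk& c : sender_chunks) {
        if (receiver_hashes.count(c.hash) == 0) missing.push_back(c);
      }
      return missing;
    }
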
asset_stream_manager + cdc_fuse_fs support efficient streaming of a
local directory to a remote virtual file system based on FUSE. This
pair also employs CDC to identify and reuse unchanged data chunks.