3 Commits

Author SHA1 Message Date
Donovan Baarda
fcc4cbc3f3 Change fastcdc to a better and simpler algorithm. (#79)
This CL changes the chunking algorithm from "normalized chunking" to
simple "regression chunking", and changes the hash criteria from
'hash&mask' to 'hash<=threshold'. These are all ideas taken from
testing and analysis done at
  https://github.com/dbaarda/rollsum-chunking/blob/master/RESULTS.rst
Regression chunking was introduced in
  https://www.usenix.org/system/files/conference/atc12/atc12-final293.pdf
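
To illustrate the two hash criteria, here is a minimal sketch, assuming a 32-bit hash and a mask over the low bits. The names `HASH_BITS`, `mask_boundary`, `threshold_for` and `threshold_boundary` are hypothetical and not taken from the repository. A mask test can only express power-of-2 target lengths, while a threshold test works for any target length.

```python
HASH_BITS = 32  # assumed hash width for this sketch


def mask_boundary(h, avg_len):
    """'hash&mask' criterion; avg_len must be a power of 2."""
    assert avg_len & (avg_len - 1) == 0, "mask criterion needs a power of 2"
    return (h & (avg_len - 1)) == 0


def threshold_for(avg_len):
    """Largest accepted hash so that about 1 in avg_len hash values match."""
    return (1 << HASH_BITS) // avg_len - 1


def threshold_boundary(h, avg_len):
    """'hash<=threshold' criterion; works for any avg_len >= 1."""
    return h <= threshold_for(avg_len)
```

For a power-of-2 target such as 1024, both tests accept the same fraction of hash values; only the threshold form can also express a target such as 1000.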

The algorithm uses an arbitrary number of regressions using power-of-2
regression target lengths. This means we can use a simple bitmask for
the regression hash criteria.
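
The idea can be sketched as follows. This is an illustration under stated assumptions, not the code from this commit: the gear table is a seeded pseudo-random stand-in, the hash is 32 bits wide, the primary target length is restricted to a power of 2 (`2**avg_bits`), and `find_boundary`, `split` and `top_bits_mask` are hypothetical names. Regression level `j` halves the target length `j` times, so its test is a mask over the top `avg_bits - j` bits of the hash.

```python
import random

HASH_BITS = 32
HASH_MASK = (1 << HASH_BITS) - 1

# Hypothetical gear table: 256 pseudo-random 32-bit values from a fixed seed.
_rng = random.Random(0)
GEAR = [_rng.getrandbits(HASH_BITS) for _ in range(256)]


def top_bits_mask(nbits):
    """Mask selecting the nbits most significant bits of the hash."""
    return HASH_MASK ^ (HASH_MASK >> nbits)


def find_boundary(data, min_size, avg_bits, max_size, regressions=3):
    """Return the length of the next chunk at the start of data."""
    assert regressions <= avg_bits
    n = min(len(data), max_size)
    if n <= min_size:
        return n
    # Primary criterion: hash <= threshold, target length 2**avg_bits.
    threshold = HASH_MASK >> avg_bits
    # Regression level j (1-based): target length 2**(avg_bits - j).
    masks = [top_bits_mask(avg_bits - j) for j in range(1, regressions + 1)]
    last_match = [None] * regressions
    h = HASH_MASK
    for i in range(n):
        h = ((h << 1) + GEAR[data[i]]) & HASH_MASK
        length = i + 1  # the chunk includes the byte just hashed
        if length < min_size:
            continue
        if h <= threshold:
            return length
        # Remember the latest position at the strongest level it satisfies.
        for j, mask in enumerate(masks):
            if h & mask == 0:
                last_match[j] = length
                break
    # No primary match before max_size: regress to the strongest level seen.
    for length in last_match:
        if length is not None:
            return length
    return n


def split(data, min_size, avg_bits, max_size, regressions=3):
    """Split data into chunks using find_boundary repeatedly."""
    chunks, pos = [], 0
    while pos < len(data):
        window = data[pos:pos + max_size]
        size = find_boundary(window, min_size, avg_bits, max_size, regressions)
        chunks.append(data[pos:pos + size])
        pos += size
    return chunks
```

With `regressions=0` the sketch falls back to cutting at `max_size` whenever no primary match is found, which is the case that regression chunking is meant to avoid.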

Regression chunking yields high deduplication rates even for lower max
chunk sizes, so that the cdc_stream max chunk size can be reduced to 512K
from 1024K. This fixes potential latency spikes from large chunks.
2023-02-08 15:06:41 +01:00
Donovan Baarda
9cf71cae65 Fix #76 fastcdc chunk boundary off-by-one. (#78)
* Fix #76 fastcdc chunk boundary off-by-one.

This ensures that the last byte included in the gear-hash that identified the
chunk boundary is included in the chunk, so that chunks are still matched
when the byte immediately after them is changed.
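
The effect of the off-by-one can be shown with a toy model. This is a sketch under assumptions, not the repository code: the gear table is a simple multiplicative stand-in, the hash is 32 bits wide, and `find_boundary`, `change_byte_after` and the `include_last_hashed_byte` flag are hypothetical names. With the flag off, the chunk ends just before the byte that produced the matching hash, so the boundary depends on a byte outside the chunk.

```python
HASH_BITS = 32
HASH_MASK = (1 << HASH_BITS) - 1

# Toy gear table (an assumption of this sketch): one distinct value per byte.
GEAR = [(b * 0x9E3779B1) & HASH_MASK for b in range(256)]
THRESHOLD = HASH_MASK >> 4  # about 1 in 16 hash values match


def find_boundary(data, include_last_hashed_byte):
    """Return the chunk length; the flag selects fixed or old behaviour."""
    h = HASH_MASK
    for i, byte in enumerate(data):
        h = ((h << 1) + GEAR[byte]) & HASH_MASK
        if h <= THRESHOLD:
            # Byte i is the last byte that fed the matching hash.
            return i + 1 if include_last_hashed_byte else i
    return len(data)


def change_byte_after(data, boundary):
    """Return a copy of data with the byte right after the chunk changed."""
    changed = bytearray(data)
    old = changed[boundary]
    changed[boundary] = old + 1 if old < 255 else 254
    return bytes(changed)


data = bytes((i * 37 + 11) % 256 for i in range(512))
fixed = find_boundary(data, include_last_hashed_byte=True)
if fixed < len(data):
    # The fixed boundary only depends on bytes inside the chunk.
    edited = change_byte_after(data, fixed)
    assert find_boundary(edited, include_last_hashed_byte=True) == fixed
```

Running the same experiment with the flag off changes the very byte that triggered the match, so the old boundary generally moves and the chunk no longer matches.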

* Init gear hash to all 1's to prevent zero-length chunks with min_size=0.

Also change the `MaxChunkSize` test to use min_size=0 to test this works.
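
Why the initial value matters can be seen in a small sketch. It assumes that the match test runs before each byte is consumed, uses a seeded pseudo-random stand-in for the gear table, and a 32-bit hash; `find_boundary` and its `init_hash` parameter are hypothetical names for this illustration. With a zero initial hash and min_size=0, the very first test passes before any byte has been read.

```python
import random

HASH_BITS = 32
HASH_MASK = (1 << HASH_BITS) - 1

# Hypothetical gear table: 256 pseudo-random 32-bit values from a fixed seed.
_rng = random.Random(0)
GEAR = [_rng.getrandbits(HASH_BITS) for _ in range(256)]
THRESHOLD = HASH_MASK >> 8  # about 1 in 256 hash values match


def find_boundary(data, min_size, max_size, init_hash):
    """Return the chunk length, testing the hash before consuming byte i."""
    n = min(len(data), max_size)
    h = init_hash
    for i in range(n):
        if i >= min_size and h <= THRESHOLD:
            return i
        h = ((h << 1) + GEAR[data[i]]) & HASH_MASK
    return n


data = bytes(range(256))
# A zero initial hash satisfies the criterion before any byte is read.
assert find_boundary(data, 0, 128, init_hash=0) == 0
# An all-ones initial hash can never satisfy it, so the chunk is non-empty.
assert find_boundary(data, 0, 128, init_hash=HASH_MASK) >= 1
```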
2023-01-23 14:39:02 +01:00
Christian Schneider
4326e972ac Releasing the former Stadia file transfer tools
The tools allow efficient and fast synchronization of large directory
trees from a Windows workstation to a Linux target machine.

cdc_rsync* support efficient copy of files by using content-defined
chunking (CDC) to identify chunks within files that can be reused.

asset_stream_manager + cdc_fuse_fs support efficient streaming of a
local directory to a remote virtual file system based on FUSE. It also
employs CDC to identify and reuse unchanged data chunks.
2022-11-03 10:39:10 +01:00