mirror of
https://github.com/nestriness/cdc-file-transfer.git
synced 2026-01-30 14:45:37 +02:00
Change fastcdc to a better and simpler algorithm. (#79)
This CL changes the chunking algorithm from "normalized chunking" to the simpler "regression chunking", and changes the hash criteria from `hash & mask` to `hash <= threshold`. These ideas are taken from the testing and analysis at https://github.com/dbaarda/rollsum-chunking/blob/master/RESULTS.rst. Regression chunking was introduced in https://www.usenix.org/system/files/conference/atc12/atc12-final293.pdf

The algorithm uses an arbitrary number of regressions with power-of-2 regression target lengths, which means a simple bitmask can be used for the regression hash criteria. Regression chunking yields high deduplication rates even for lower max chunk sizes, so the cdc_stream max chunk size can be reduced from 1024K to 512K. This fixes potential latency spikes caused by large chunks.
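To illustrate the idea, here is a minimal, self-contained sketch of regression chunking with a `hash <= threshold` main criterion. It is not the repo's implementation: the gear function, constants, and the exact regression policy are assumptions for illustration. The key points it shows are (a) the primary cut test `hash <= threshold`, and (b) on reaching the max chunk size without a hit, regressing to the last offset that passed a weaker power-of-2 criterion, which can be checked with a simple bitmask.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for a 64-bit gear table entry (not the repo's table).
static uint64_t Gear(uint8_t b) {
  uint64_t x = b + 0x9e3779b97f4a7c15ull;
  x ^= x >> 30; x *= 0xbf58476d1ce4e5b9ull;
  x ^= x >> 27; x *= 0x94d049bb133111ebull;
  return x ^ (x >> 31);
}

struct Chunker {
  size_t min_size;
  size_t max_size;
  uint64_t threshold;  // Main cut criterion: hash <= threshold.
  int regressions;     // Number of power-of-2 regression levels.

  // Returns the length of the next chunk starting at data[0].
  size_t NextChunkLen(const uint8_t* data, size_t len) const {
    if (len <= min_size) return len;
    const size_t end = len < max_size ? len : max_size;
    // best[k] = last offset (past min_size) whose hash passed the k-th
    // (progressively weaker) regression criterion.
    std::vector<size_t> best(regressions, 0);
    uint64_t hash = 0;
    for (size_t i = 0; i < end; ++i) {
      hash = (hash << 1) + Gear(data[i]);
      if (i + 1 < min_size) continue;
      if (hash <= threshold) return i + 1;  // Main criterion hit: cut here.
      for (int k = 0; k < regressions; ++k) {
        // Weaker criterion k: the top (regressions - k) hash bits are zero.
        // Because the targets are powers of 2, this is a plain bitmask test.
        const uint64_t mask = ~(~0ull >> (regressions - k));
        if ((hash & mask) == 0) { best[k] = i + 1; break; }
      }
    }
    if (len < max_size) return len;  // Input exhausted: final chunk.
    // Regress: prefer the strongest weaker criterion that fired.
    for (int k = 0; k < regressions; ++k)
      if (best[k] != 0) return best[k];
    return max_size;  // Nothing fired at all: hard cut at max size.
  }
};
```

Compared to normalized chunking, the hard cut at `max_size` becomes rare: one of the weaker bitmask criteria almost always fires somewhere in the window, so chunk boundaries stay content-defined even with a small max chunk size.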
@@ -14,7 +14,7 @@ experimentation. See the file `indexer.h` for preprocessor macros that can be
 enabled, for example:
 
 ```
 bazel build -c opt --copt=-DCDC_GEAR_TABLE=1 //cdc_indexer
 bazel build -c opt --copt=-DCDC_GEAR_BITS=32 //cdc_indexer
 ```
 
 At the end of the operation, the indexer outputs a summary of the results such
@@ -25,7 +25,7 @@ as the following:
 Operation succeeded.
 
 Chunk size (min/avg/max): 128 KB / 256 KB / 1024 KB | Threads: 12
-gear_table: 64 bit | mask_s: 0x49249249249249 | mask_l: 0x1249249249
+gear_table: 64 bit | threshold: 0x7fffc0001fff
 Duration: 00:03
 Total files: 2
 Total chunks: 39203