Files
Donovan Baarda fcc4cbc3f3 Change fastcdc to a better and simpler algorithm. (#79)
This CL changes the chunking algorithm from "normalized chunking" to
simple "regression chunking", and changes the has criteria from
'hash&mask' to 'hash<=threshold'. These are all ideas taken from
testing and analysis done at
  https://github.com/dbaarda/rollsum-chunking/blob/master/RESULTS.rst
Regression chunking was introduced in
  https://www.usenix.org/system/files/conference/atc12/atc12-final293.pdf

The algorithm uses an arbitrary number of regressions using power-of-2
regression target lengths. This means we can use a simple bitmask for
the regression hash criteria.

Regression chunking yields high deduplication rates even for lower max
chunk sizes, so that the cdc_stream max chunk can be reduced to 512K
from 1024K. This fixes potential latency spikes from large chunks.
2023-02-08 15:06:41 +01:00

3.2 KiB

CDC Indexer

This directory contains a CDC indexer based on our implementation of FastCDC.

Run the sample with Bazel:

bazel run -c opt //cdc_indexer -- --inputs '/path/to/files'

The CDC algorithm can be tweaked with a few compile-time constants for experimentation. See the file indexer.h for preprocessor macros that can be enabled, for example:

bazel build -c opt --copt=-DCDC_GEAR_BITS=32 //cdc_indexer

At the end of the operation, the indexer outputs a summary of the results such as the following:

00:02   7.44 GB in 2 files processed at 3.1 GB/s, 50% deduplication
Operation succeeded.

Chunk size (min/avg/max): 128 KB / 256 KB / 1024 KB  |  Threads: 12
gear_table: 64 bit  |  threshold: 0x7fffc0001fff
           Duration:           00:03
        Total files:               2
       Total chunks:           39203
      Unique chunks:           20692
         Total data:         9.25 GB
        Unique data:         4.88 GB
         Throughput:       3.07 GB/s
    Avg. chunk size:          247 KB
      Deduplication:           47.2%

 160 KB #########                                      1419 ( 7%)
 192 KB ########                                       1268 ( 6%)
 224 KB ###################                            2996 (14%)
 256 KB ########################################       6353 (31%)
 288 KB ######################                         3466 (17%)
 320 KB ##########################                     4102 (20%)
 352 KB ######                                          946 ( 5%)
 384 KB                                                  75 ( 0%)
 416 KB                                                  27 ( 0%)
 448 KB                                                   7 ( 0%)
 480 KB                                                   5 ( 0%)
 512 KB                                                   1 ( 0%)
 544 KB                                                   4 ( 0%)
 576 KB                                                   2 ( 0%)
 608 KB                                                   3 ( 0%)
 640 KB                                                   3 ( 0%)
 672 KB                                                   3 ( 0%)
 704 KB                                                   2 ( 0%)
 736 KB                                                   0 ( 0%)
 768 KB                                                   0 ( 0%)
 800 KB                                                   1 ( 0%)
 832 KB                                                   0 ( 0%)
 864 KB                                                   0 ( 0%)
 896 KB                                                   0 ( 0%)
 928 KB                                                   0 ( 0%)
 960 KB                                                   0 ( 0%)
 992 KB                                                   0 ( 0%)
1024 KB                                                   9 ( 0%)

For testing multiple combinations and comparing the results, the indexer also features a flag --results_file="results.csv" which appends the raw data to the given file in CSV format. Combine this flag with --description to label each experiment with additional columns.