Change fastcdc to a better and simpler algorithm. (#79)

This CL changes the chunking algorithm from "normalized chunking" to
simple "regression chunking", and changes the hash criterion from
'hash&mask' to 'hash<=threshold'. These are all ideas taken from
testing and analysis done at
  https://github.com/dbaarda/rollsum-chunking/blob/master/RESULTS.rst
Regression chunking was introduced in
  https://www.usenix.org/system/files/conference/atc12/atc12-final293.pdf
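
As a rough illustration of the two hash criteria named above, the
following C++ sketch contrasts them. The function names, the "== 0"
form of the mask test, and the UINT64_MAX / avg_len threshold
derivation are assumptions made for illustration; they are not taken
from the code in this change.

```cpp
#include <cstdint>

// Mask-style criterion (assumed form): cut when all masked hash bits are
// zero. With k bits set in the mask, a uniformly distributed hash passes
// with probability 2^-k.
inline bool IsCutMask(uint64_t hash, uint64_t mask) {
  return (hash & mask) == 0;
}

// Threshold-style criterion: cut when the hash is at or below a threshold.
inline bool IsCutThreshold(uint64_t hash, uint64_t threshold) {
  return hash <= threshold;
}

// Hypothetical helper: picks a threshold so that a uniformly distributed
// 64-bit hash passes roughly once every avg_len positions.
inline uint64_t ThresholdForAverage(uint64_t avg_len) {
  return avg_len == 0 ? UINT64_MAX : UINT64_MAX / avg_len;
}
```

Under these assumptions a mask can only express cut probabilities that
are powers of two, while a threshold can approximate any target average.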

The algorithm uses an arbitrary number of regressions with power-of-2
regression target lengths. This means we can use a simple bitmask for
the regression hash criteria.
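
The idea in the paragraph above can be sketched as follows. This is a
minimal, simplified sketch and not the implementation in this change:
the names MakeGearTable and FindChunkBoundary, the splitmix64-generated
gear table, and the handling of short inputs are all assumptions. The
regression candidate is tracked with a bitmask over the most
significant hash bits, so each regression level corresponds to a
power-of-2 relaxation of the primary criterion.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical gear table: 256 pseudo-random values from splitmix64.
inline std::vector<uint64_t> MakeGearTable() {
  std::vector<uint64_t> gear(256);
  uint64_t state = 0x9E3779B97F4A7C15ULL;
  for (auto& g : gear) {
    state += 0x9E3779B97F4A7C15ULL;
    uint64_t z = state;
    z = (z ^ (z >> 30)) * 0xBF58476D1CE4E5B9ULL;
    z = (z ^ (z >> 27)) * 0x94D049BB133111EBULL;
    g = z ^ (z >> 31);
  }
  return gear;
}

// Returns the length of the next chunk at the start of data[0, len).
inline size_t FindChunkBoundary(const uint8_t* data, size_t len,
                                size_t min_size, size_t max_size,
                                uint64_t threshold,
                                const std::vector<uint64_t>& gear) {
  const size_t limit = len < max_size ? len : max_size;
  uint64_t hash = 0;
  // rc_mask covers the leading bits that were zero in the best regression
  // candidate so far. A mask of 0 accepts any hash.
  uint64_t rc_mask = 0;
  size_t rc_len = limit;
  for (size_t i = 0; i < limit; ++i) {
    hash = (hash << 1) + gear[data[i]];       // gear rolling hash
    if (i + 1 < min_size) continue;           // too short to cut here
    if ((hash & rc_mask) != 0) continue;      // worse than the candidate
    if (hash <= threshold) return i + 1;      // primary criterion met
    // At least as good as the current candidate: remember this position
    // and require as many leading zero bits from later hashes.
    rc_len = i + 1;
    rc_mask = ~uint64_t{0};
    while (hash & rc_mask) rc_mask <<= 1;
  }
  // At max_size, fall back to the best regression candidate. If the input
  // ended before max_size, the remainder forms the chunk.
  return limit == max_size ? rc_len : limit;
}
```

Skipping hashes that fail rc_mask is safe in this sketch because any
hash at or below the threshold has at least as many leading zero bits
as a candidate that exceeded the threshold.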

Regression chunking yields high deduplication rates even for lower max
chunk sizes, so that the cdc_stream max chunk can be reduced to 512K
from 1024K. This fixes potential latency spikes from large chunks.
Donovan Baarda
2023-02-09 01:06:41 +11:00
committed by GitHub
parent 24906eb36e
commit fcc4cbc3f3
10 changed files with 121 additions and 331 deletions

@@ -27,16 +27,10 @@
 #include "fastcdc/fastcdc.h"
 // Compile-time parameters for the FastCDC algorithm.
-#define CDC_GEAR_32BIT 1
-#define CDC_GEAR_64BIT 2
-#ifndef CDC_GEAR_TABLE
-#define CDC_GEAR_TABLE CDC_GEAR_64BIT
-#endif
-#ifndef CDC_MASK_STAGES
-#define CDC_MASK_STAGES 7
-#endif
-#ifndef CDC_MASK_BIT_LSHIFT_AMOUNT
-#define CDC_MASK_BIT_LSHIFT_AMOUNT 3
+#define CDC_GEAR_32BIT 32
+#define CDC_GEAR_64BIT 64
+#ifndef CDC_GEAR_BITS
+#define CDC_GEAR_BITS CDC_GEAR_64BIT
 #endif
 namespace cdc_ft {
@@ -66,23 +60,20 @@ struct IndexerConfig {
 uint32_t num_threads;
 // Which hash function to use.
 HashType hash_type;
-// The masks will be populated by the indexer, setting them here has no
-// effect. They are in this struct so that they can be conveniently accessed
-// when printing the operation summary (and since they are derived from the
-// configuration, they are technically part of it).
-uint64_t mask_s;
-uint64_t mask_l;
+// The threshold will be populated by the indexer, setting it here has no
+// effect. It is in this struct so that it can be conveniently accessed
+// when printing the operation summary (and since it is derived from the
+// configuration, it is technically part of it).
+uint64_t threshold;
 };
 class Indexer {
 public:
 using hash_t = std::string;
-#if CDC_GEAR_TABLE == CDC_GEAR_32BIT
-typedef fastcdc::Chunker32<CDC_MASK_STAGES, CDC_MASK_BIT_LSHIFT_AMOUNT>
-Chunker;
-#elif CDC_GEAR_TABLE == CDC_GEAR_64BIT
-typedef fastcdc::Chunker64<CDC_MASK_STAGES, CDC_MASK_BIT_LSHIFT_AMOUNT>
-Chunker;
+#if CDC_GEAR_BITS == CDC_GEAR_32BIT
+typedef fastcdc::Chunker32<> Chunker;
+#elif CDC_GEAR_BITS == CDC_GEAR_64BIT
+typedef fastcdc::Chunker64<> Chunker;
 #else
 #error "Unknown gear table"
 #endif