feat(SplitBlob/SpliceBlob): add chunking algorithm #357
Conversation
Introduce a ChunkingFunction enum: a set of known chunking algorithms that the server can recommend to the client. Provide FastCDC_2020 as the first explicit chunking algorithm. The server advertises these through a new chunking_configuration field in the CacheCapabilities message, where it may set the chunking functions it supports as well as the relevant configuration parameters for each chunking algorithm.
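For concreteness, here is a rough sketch of what that could look like in proto3. Everything below is illustrative: the names come from the description above, but the field numbers, types, and exact message layout are my assumptions, not this PR's actual diff.

```proto
// Illustrative sketch only -- not the PR's actual diff.
syntax = "proto3";

// Known chunking algorithms a server can recommend to clients.
enum ChunkingFunction {
  CHUNKING_FUNCTION_UNKNOWN = 0;
  // FastCDC as specified in the 2020 IEEE paper.
  FastCDC_2020 = 1;
}

// Advertised by the server so clients can produce identical chunks.
message ChunkingConfiguration {
  // Chunking functions the server supports.
  repeated ChunkingFunction chunking_functions = 1;

  // FastCDC 2020 parameters discussed in this PR.
  int64 avg_chunk_size_bytes = 2;  // e.g. 512 KiB by default
  int32 normalization_level = 3;   // see the normalization discussion below
  uint64 seed = 4;
}

message CacheCapabilities {
  // ... existing fields elided ...

  // New: the server's supported chunking functions and their parameters.
  ChunkingConfiguration chunking_configuration = 100;  // placeholder number
}
```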
Pull request overview
This pull request adds Content-Defined Chunking (CDC) algorithm negotiation to the Remote Execution API, enabling distributed, deterministic, and reproducible chunking between clients and servers. It introduces FastCDC 2020 as the first supported algorithm with configuration parameters for optimal deduplication.
Changes:
- Adds ChunkingFunction enum and ChunkingConfiguration message to define supported chunking algorithms and their parameters
- Extends SplitBlobRequest, SplitBlobResponse, and SpliceBlobRequest messages with chunking_function fields (see the sketch after this list)
- Introduces FastCDC 2020 algorithm support with configurable parameters (avg_chunk_size_bytes, normalization_level, seed) and sensible defaults (512 KiB average, 2 MiB threshold)
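A minimal sketch of the message extensions from the second bullet above. The field names come from the PR text; the field numbers are placeholders, the existing fields are elided, and the per-field comments describe one plausible reading of each field's role rather than the PR's actual documentation.

```proto
// Illustrative sketch only -- placeholder field numbers, existing fields elided.
message SplitBlobRequest {
  // ... existing fields elided ...
  // Chunking function the client asks the server to split the blob with.
  ChunkingFunction chunking_function = 10;
}

message SplitBlobResponse {
  // ... existing fields elided ...
  // Chunking function that was actually used to produce the returned chunks.
  ChunkingFunction chunking_function = 10;
}

message SpliceBlobRequest {
  // ... existing fields elided ...
  // Chunking function that produced the chunks being spliced back together.
  ChunkingFunction chunking_function = 10;
}
```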
sluongng left a comment:
I'm a bit occupied this week, so I plan to give this a better read next week. Got some small nits but LGTM overall.
I think it would be nice if we could provide a test vector for this, similar to https://github.com/bazelbuild/remote-apis/blob/main/build/bazel/remote/execution/v2/sha256tree_test_vectors.txt, which was added for sha256tree. This way, folks can test their implementation against the test vector to verify that the generated chunks are identical across implementations.
// - Level 2: Most chunks match desired size (recommended)
// - Level 3: Nearly all chunks are the desired size
//
// If unset, clients SHOULD use 2 as the default.
Why are we using 2 here when https://docs.rs/fastcdc/3.2.1/fastcdc/v2020/struct.FastCDC.html uses 1 as default?
I think it works a bit better with build artifacts, but either is fine with me:
See https://github.com/buildbuddy-io/fastcdc2020/blob/main/fastcdc/fastcdc.go#L21-L41
Personally I think we should provide a concrete benchmark setup so that folks can verify against their own data sets and see what's best.
If we think 2 is definitively best for the RBE use case, let's just document 2 as the default and remove the normalization level config knob here. That would help simplify the setup a bit more. Whoever is interested in using a different level could add it back later.
WDYT?
// The minimum chunk size will be avg_chunk_size_bytes / 4.
// The maximum chunk size will be avg_chunk_size_bytes * 4.
s/will/SHOULD/
This leaves room for additional params to be added in the future if folks need them.
// Servers which support this chunking function SHOULD advertise the following
// configuration parameters through the CacheCapabilities message:
// - avg_chunk_size_bytes
// - normalization_level
// - seed
We could make it s/SHOULD/MUST/ here to force servers that support FastCDC to always advertise the relevant chunking config.
This was why I added defaults: if the server uses the defaults, it can leave these unset and clients can assume the defaults. Do you think it's necessary to require them even with the defaults in place?
I think the chunking may happen on both the client side and the server side. If the server bothers to add FastCDC to the chunking algorithms it supports, we should mandate that it MUST set these fields to make sure that the client and server stay consistent over time (the defaults may change over time).
For CDC (Content-Defined Chunking), having the client and server agree on a chunking algorithm unlocks a new level of possible improvements, where we can have distributed, deterministic, and reproducible chunking.
This PR adds chunking algorithm negotiation to GetCapabilities. Most notably, it introduces a ChunkingFunction enum, a ChunkingConfiguration message advertised through CacheCapabilities, and chunking_function fields on the SplitBlob and SpliceBlob messages.
Why FastCDC 2020?
FastCDC 2020 is very fast (~5GB/s on my AMD Ryzen 9950), and is backed by a clear spec described in the paper: IEEE paper pdf. Of the CDC algorithms, it is the most popular, with https://github.com/nlfiedler/fastcdc-rs mirroring the paper's implementation.
This algorithm has a couple more very important benefits:
Thank you to @sluongng for much of the initial version of this PR
Why the threshold?
We only chunk blobs larger than the threshold size. Making the threshold >= the max chunk size is an important design decision for the first iterations of the implementation here. Mainly, there is a cyclic re-chunking possibility if it's not upheld. For example, if the chunking threshold is only 1MB and a chunker (avg size 512k, max 2MB) produces a 1.5MB chunk, the server will try to re-chunk that chunk again. It would also mean the CAS can't tell whether a stored object is a full blob, a chunk, or a file that was chunked into a single chunk. Having the threshold >= the max chunk size also has the benefit of guaranteeing that we'll always get more than one chunk, which simplifies things a lot.
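Spelling out the arithmetic behind this with the defaults used in this PR (nothing new, just the numbers already stated above):

$$
\text{min} = \frac{512\ \text{KiB}}{4} = 128\ \text{KiB}, \qquad \text{max} = 4 \times 512\ \text{KiB} = 2\ \text{MiB} \le \text{threshold} = 2\ \text{MiB}
$$

So a blob is only chunked when it is larger than 2 MiB, every produced chunk is at most 2 MiB, and therefore no chunk can itself cross the threshold, which is what rules out the re-chunking cycle. It also guarantees at least two chunks per chunked blob.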
Why the default chunk size?
512kB may seem a little large for CDC, but it hits the sweet spot with Bazel. For medium-sized repos (for example https://github.com/buildbuddy-io/buildbuddy) with many GoLink objects, it performs well with the size of the artifacts.
Using my --disk_cache from the past 1-2 months of development, I ran some benchmarks using different chunk sizes. We get good benefits from starting on the larger end, at 512k with a 2MB threshold. This can be adjusted later, but deduplication is strong here (35%), and this only affects 4% of files.
We could drop to 64k, but we'd only get ~10% more deduplication savings and would still need to chunk 3x the number of files. Another option is 256k with a 1MB threshold, but that would roughly double the chunking overhead for only a small improvement. I think 256k and 512k would both be good options, but the larger threshold reduces the amount of chunking overhead we have, which significantly helps performance.