
Conversation

@tyler-french (Contributor) commented Jan 7, 2026

For CDC (Content-Defined Chunking), having the client and server agree upon a chunking algorithm unlocks a new level of improvements: distributed, deterministic, and reproducible chunking.

This PR adds chunking-algorithm negotiation to GetCapabilities (a rough client-side sketch follows the list below). Most notably, it:

  • Defines a chunking threshold (algorithm agnostic)
  • Defaults to no algorithm (the server is not expected to be able to chunk, but can verify/store chunking info)
  • Adds FastCDC 2020 as a supported algorithm with suggested defaults
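
As a rough illustration of what that negotiation could look like from the client side, here is a hedged Go sketch. The ChunkingConfiguration message, the GetChunkingConfiguration/ChunkingThresholdBytes accessors, and the enum value names are assumptions about how this PR's proto changes might surface in generated code, not the final API; only CapabilitiesClient, GetCapabilitiesRequest, and CacheCapabilities exist in the current remote-apis stubs.

```go
package cdc

import (
	"context"

	// Assumed import path for the generated remote-apis Go stubs.
	repb "github.com/bazelbuild/remote-apis/build/bazel/remote/execution/v2"
)

// pickChunking asks the server what it supports and falls back to unchunked
// uploads when no algorithm is advertised. All chunking-related names below
// are hypothetical stand-ins for whatever this PR's proto changes generate.
func pickChunking(ctx context.Context, caps repb.CapabilitiesClient, instance string) (*repb.ChunkingConfiguration, error) {
	resp, err := caps.GetCapabilities(ctx, &repb.GetCapabilitiesRequest{InstanceName: instance})
	if err != nil {
		return nil, err
	}
	cc := resp.GetCacheCapabilities().GetChunkingConfiguration() // hypothetical accessor
	if cc == nil || cc.GetChunkingFunction() == repb.ChunkingFunction_CHUNKING_FUNCTION_UNSPECIFIED {
		return nil, nil // server did not pick an algorithm: upload blobs whole
	}
	// Fall back to the suggested defaults for anything the server left unset.
	if cc.GetAvgChunkSizeBytes() == 0 {
		cc.AvgChunkSizeBytes = 512 * 1024 // 512 KiB average chunk size
	}
	if cc.GetChunkingThresholdBytes() == 0 {
		cc.ChunkingThresholdBytes = 2 * 1024 * 1024 // 2 MiB: only larger blobs get chunked
	}
	return cc, nil
}
```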

Why FastCDC 2020?

FastCDC 2020 is very fast (~5GB/s on my AMD Ryzen 9950), and is backed by a clear spec described in the paper: IEEE paper pdf. Of the CDC algorithms, it is the most popular, with https://github.com/nlfiedler/fastcdc-rs mirroring the paper's implementation.
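
For readers who haven't seen the algorithm, below is a simplified, self-contained Go sketch of gear-hash chunking with FastCDC-style normalized chunking (a strict mask before the target size, a loose one after). The mask derivation, gear-table generation, and sizes are illustrative placeholders, not the constants from the paper or the defaults proposed in this PR.

```go
package cdc

import (
	"math"
	"math/rand"
)

// gearTable derives 256 pseudo-random 64-bit values from a seed. The FastCDC
// paper uses a fixed random table; this PR proposes a seed parameter so that
// client and server can derive the same table deterministically.
func gearTable(seed int64) [256]uint64 {
	var t [256]uint64
	r := rand.New(rand.NewSource(seed))
	for i := range t {
		t[i] = r.Uint64()
	}
	return t
}

// masks derives a strict and a loose cut mask from the average chunk size and
// a normalization level: more mask bits before the target size make an early
// cut less likely, fewer bits after it make a late cut more likely.
func masks(avg, level int) (maskS, maskL uint64) {
	bits := uint(math.Round(math.Log2(float64(avg))))
	return (uint64(1) << (bits + uint(level))) - 1, (uint64(1) << (bits - uint(level))) - 1
}

// nextCut returns the length of the next chunk of data using a gear rolling
// hash, scanning from min toward max and switching masks at the target size.
func nextCut(data []byte, gear [256]uint64, min, avg, max int, maskS, maskL uint64) int {
	n := len(data)
	if n <= min {
		return n // remainder is too small to split further
	}
	if n > max {
		n = max // never emit a chunk larger than max
	}
	target := avg
	if n < target {
		target = n
	}
	var fp uint64
	i := min
	for ; i < target; i++ { // before the target size: strict mask
		fp = (fp << 1) + gear[data[i]]
		if fp&maskS == 0 {
			return i
		}
	}
	for ; i < n; i++ { // past the target size: loose mask
		fp = (fp << 1) + gear[data[i]]
		if fp&maskL == 0 {
			return i
		}
	}
	return i // forced cut at max (or end of data)
}
```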

This algorithm has a couple more very important benefits:

Thank you to @sluongng for much of the initial version of this PR

Why the threshold?

We only chunk blobs larger than the threshold size. Making the threshold >= the max chunk size is an important design decision for the first iterations of the implementation here, mainly because a cyclic re-chunking possibility opens up if it's not upheld. For example, if the chunking threshold is only 1MB and a chunker (avg size 512k, max 2MB) produces a 1.5MB chunk, the server is going to try to re-chunk it again. Without the invariant, the CAS also can't tell whether an entry is a full blob, a chunk of a larger blob, or a file that was chunked into a single chunk. Having threshold >= max chunk size has the further benefit of guaranteeing that we always get more than one chunk, which simplifies things a lot.
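
A small sketch of how that invariant might be enforced, using the numbers discussed below (512 KiB average, min = avg/4 and max = avg*4 as in the proposed proto comments, 2 MiB threshold). The type and helper names here are illustrative, not from the PR:

```go
package cdc

import "fmt"

// chunkingConfig mirrors (hypothetically) the parameters discussed in this PR:
// an average chunk size with min = avg/4 and max = avg*4, plus a threshold.
type chunkingConfig struct {
	avgChunkSizeBytes int64 // e.g. 512 * 1024
	thresholdBytes    int64 // e.g. 2 * 1024 * 1024; only larger blobs are chunked
}

func (c chunkingConfig) maxChunkSizeBytes() int64 { return c.avgChunkSizeBytes * 4 }

// shouldChunk enforces threshold >= max chunk size, which guarantees that
//   - no produced chunk can itself exceed the threshold, so nothing is ever
//     re-chunked, and
//   - any blob that does get chunked is strictly larger than the max chunk
//     size, so it always splits into more than one chunk.
func (c chunkingConfig) shouldChunk(blobSizeBytes int64) (bool, error) {
	if c.thresholdBytes < c.maxChunkSizeBytes() {
		return false, fmt.Errorf("chunking threshold %d must be >= max chunk size %d",
			c.thresholdBytes, c.maxChunkSizeBytes())
	}
	return blobSizeBytes > c.thresholdBytes, nil
}
```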

Why the default chunk size?

512kB may seem a little large for CDC, but it hits the sweet spot with Bazel. For medium-sized repos (for example https://github.com/buildbuddy-io/buildbuddy) with many GoLink objects, it matches the size of the artifacts well.

Using my --disk_cache from the past 1-2 months of development, I ran benchmarks with different average chunk sizes and got the following results:

AvgSize  │ %Chunked   │ %Reused    │ %Unique    │ Dedup%   │ Saved
─────────────────────────────────────────────────────────────────────────────────────
16KB    │      18.1% │      48.0% │      52.0% │    50.0% │    141.91 GB
32KB    │      15.8% │      46.0% │      54.0% │    48.2% │    136.90 GB
64KB    │      13.2% │      42.3% │      57.7% │    45.8% │    130.18 GB
128KB   │       7.5% │      38.4% │      61.6% │    43.1% │    122.24 GB
256KB   │       6.1% │      32.5% │      67.5% │    39.8% │    112.96 GB
512KB   │       4.2% │      24.0% │      76.0% │    35.8% │    101.52 GB
1MB     │       1.9% │      16.4% │      83.6% │    31.0% │     88.00 GB
2MB     │       1.6% │       8.8% │      91.2% │    25.9% │     73.47 GB

We get good benefits starting on the larger end, at 512k with a 2MB threshold. This can be adjusted later, but de-duplication is strong here (35%), and only about 4% of files need to be chunked at all.

We could drop to 64k, but we'd only get ~10% more de-duplication savings and we'd still need to chunk 3x as many files. Another option is 256k with a 1MB threshold, but that would double the chunking overhead for only a small improvement. I think 256k and 512k would both be good options, but the larger threshold reduces the amount of chunking overhead we have, which significantly helps performance.

Introduce ChunkingFunction, an enum of known chunking algorithms that the
server can recommend to the client.

Provide FastCDC_2020 as the first explicit chunking algorithm.

The server advertises these through a new chunking_configuration field in
the CacheCapabilities message. There, the server may set the chunking
functions that it supports, as well as the relevant configuration
parameters for each chunking algorithm.
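
For symmetry with the client-side sketch above, a minimal sketch of a server advertising the suggested defaults; the generated Go names (and whether the field holds one configuration or a list) are assumptions rather than part of this PR:

```go
package cdc

import (
	// Assumed import path for the generated remote-apis Go stubs.
	repb "github.com/bazelbuild/remote-apis/build/bazel/remote/execution/v2"
)

// advertiseChunking builds the cache capabilities a server might return.
// All chunking-related names are hypothetical stand-ins for this PR's protos.
func advertiseChunking() *repb.CacheCapabilities {
	return &repb.CacheCapabilities{
		ChunkingConfiguration: &repb.ChunkingConfiguration{
			ChunkingFunction:       repb.ChunkingFunction_FASTCDC_2020,
			AvgChunkSizeBytes:      512 * 1024,      // 512 KiB average (min = avg/4, max = avg*4)
			NormalizationLevel:     2,               // suggested default; see the review thread below
			Seed:                   0,               // shared seed so both sides derive the same gear table
			ChunkingThresholdBytes: 2 * 1024 * 1024, // hypothetical threshold field, 2 MiB
		},
		// ...plus the existing cache capability fields (digest functions, batch limits, etc.)
	}
}
```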
@tyler-french force-pushed the chunking-algo branch 2 times, most recently from 5050f99 to e7d3f11 on January 26, 2026 19:22
@tyler-french changed the title from "WIP: server should prefer a chunking algorithm" to "feat(SplitBlob/SpliceBlob): add chunking algorithm" on Jan 26, 2026
@tyler-french marked this pull request as ready for review on January 27, 2026 03:42
Copilot AI review requested due to automatic review settings on January 27, 2026 03:42
Copilot AI left a comment


Pull request overview

This pull request adds Content-Defined Chunking (CDC) algorithm negotiation to the Remote Execution API, enabling distributed, deterministic, and reproducible chunking between clients and servers. It introduces FastCDC 2020 as the first supported algorithm with configuration parameters for optimal deduplication.

Changes:

  • Adds ChunkingFunction enum and ChunkingConfiguration message to define supported chunking algorithms and their parameters
  • Extends SplitBlobRequest, SplitBlobResponse, and SpliceBlobRequest messages with chunking_function fields
  • Introduces FastCDC 2020 algorithm support with configurable parameters (avg_chunk_size_bytes, normalization_level, seed) and sensible defaults (512 KiB average, 2 MiB threshold)


@tyler-french force-pushed the chunking-algo branch 2 times, most recently from 685803d to d2d2404 on January 29, 2026 02:37
@sluongng (Collaborator) left a comment


I'm a bit occupied this week, so I plan to give this a better read next week. Got some small nits, but LGTM overall.

I think it would be nice if we could provide a test vector for this, similar to https://github.com/bazelbuild/remote-apis/blob/main/build/bazel/remote/execution/v2/sha256tree_test_vectors.txt, which was added for sha256tree. This way, folks can test their implementation against the test vector to verify that the generated chunks are identical across implementations.

// - Level 2: Most chunks match desired size (recommended)
// - Level 3: Nearly all chunks are the desired size
//
// If unset, clients SHOULD use 2 as the default.
Collaborator


Why are we using 2 here when https://docs.rs/fastcdc/3.2.1/fastcdc/v2020/struct.FastCDC.html uses 1 as default?

Contributor Author


I think it works a bit better with build artifacts, but either is fine with me:

See https://github.com/buildbuddy-io/fastcdc2020/blob/main/fastcdc/fastcdc.go#L21-L41

Collaborator


Personally I think we should provide a concrete benchmark setup so that folks can verify against their own data sets to see what's best.

If we think 2 is definitively best for the RBE use case, let's just document that 2 should be the default and remove the normalization level config knob here. That would simplify the setup a bit more. Whoever is interested in using a different level could add it back later.

WDYT?

Comment on lines +2383 to +2396
// The minimum chunk size will be avg_chunk_size_bytes / 4.
// The maximum chunk size will be avg_chunk_size_bytes * 4.
Collaborator


s/will/SHOULD/

This leaves room for additional params to be added in the future if folks need them.

Comment on lines +2196 to +2200
// Servers which support this chunking function SHOULD advertise the following
// configuration parameters through the CacheCapabilities message:
// - avg_chunk_size_bytes
// - normalization_level
// - seed
Collaborator


We could make it s/SHOULD/MUST/ here to force servers that support FastCDC to always advertise the relevant chunking config.

Contributor Author


This was why I added defaults: if the server uses the defaults, it can leave these fields unset and we can assume the default values. Do you think it's necessary to require them even with the defaults?

Collaborator


I think chunking may happen on both the client side and the server side. If a server bothers to add FastCDC to the chunking algorithms it supports, we should mandate that it MUST set these fields to make sure the client and server stay consistent over time (the defaults may change over time).

