feat(SplitBlob/SpliceBlob): add chunking algorithm #357
Conversation
Introduce a ChunkingFunction enum: a set of known chunking algorithms that the server can recommend to the client. Provide FastCDC_2020 as the first explicit chunking algorithm. The server advertises these through a new chunking_configuration field in the CacheCapabilities message, where it may set the chunking functions it supports as well as the relevant configuration parameters for each chunking algorithm.
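For concreteness, here is a rough sketch of what that could look like in proto3. Everything below is illustrative: the names come from the description above, but the field numbers, types, and exact message layout are my assumptions, not this PR's actual diff.

```proto
// Illustrative sketch only -- not the PR's actual diff.
syntax = "proto3";

// Known chunking algorithms a server can recommend to clients.
enum ChunkingFunction {
  CHUNKING_FUNCTION_UNKNOWN = 0;
  // FastCDC as specified in the 2020 IEEE paper.
  FastCDC_2020 = 1;
}

// Advertised by the server so clients can produce identical chunks.
message ChunkingConfiguration {
  // Chunking functions the server supports.
  repeated ChunkingFunction chunking_functions = 1;

  // FastCDC 2020 parameters discussed in this PR.
  int64 avg_chunk_size_bytes = 2;  // e.g. 512 KiB by default
  int32 normalization_level = 3;   // see the normalization discussion below
  uint64 seed = 4;
}

message CacheCapabilities {
  // ... existing fields elided ...

  // New: the server's supported chunking functions and their parameters.
  ChunkingConfiguration chunking_configuration = 100;  // placeholder number
}
```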
Pull request overview
This pull request adds Content-Defined Chunking (CDC) algorithm negotiation to the Remote Execution API, enabling distributed, deterministic, and reproducible chunking between clients and servers. It introduces FastCDC 2020 as the first supported algorithm with configuration parameters for optimal deduplication.
Changes:
- Adds ChunkingFunction enum and ChunkingConfiguration message to define supported chunking algorithms and their parameters
- Extends SplitBlobRequest, SplitBlobResponse, and SpliceBlobRequest messages with chunking_function fields (see the sketch after this list)
- Introduces FastCDC 2020 algorithm support with configurable parameters (avg_chunk_size_bytes, normalization_level, seed) and sensible defaults (512 KiB average, 2 MiB threshold)
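A minimal sketch of the message extensions from the second bullet above. The field names come from the PR text; the field numbers are placeholders, the existing fields are elided, and the per-field comments describe one plausible reading of each field's role rather than the PR's actual documentation.

```proto
// Illustrative sketch only -- placeholder field numbers, existing fields elided.
message SplitBlobRequest {
  // ... existing fields elided ...
  // Chunking function the client asks the server to split the blob with.
  ChunkingFunction chunking_function = 10;
}

message SplitBlobResponse {
  // ... existing fields elided ...
  // Chunking function that was actually used to produce the returned chunks.
  ChunkingFunction chunking_function = 10;
}

message SpliceBlobRequest {
  // ... existing fields elided ...
  // Chunking function that produced the chunks being spliced back together.
  ChunkingFunction chunking_function = 10;
}
```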
sluongng left a comment:
I'm a bit occupied this week, so I plan to give this a better read next week. Got some small nits but LGTM overall.
I think it would be nice if we could provide a test vector for this, similar to https://github.com/bazelbuild/remote-apis/blob/main/build/bazel/remote/execution/v2/sha256tree_test_vectors.txt, which was added for sha256tree. This way, folks can test their implementation against the test vector to verify that the generated chunks are identical across implementations.
// - Level 2: Most chunks match desired size (recommended)
// - Level 3: Nearly all chunks are the desired size
//
// If unset, clients SHOULD use 2 as the default.
Why are we using 2 here when https://docs.rs/fastcdc/3.2.1/fastcdc/v2020/struct.FastCDC.html uses 1 as default?
I think it works a bit better with build artifacts, but either is fine with me:
See https://github.com/buildbuddy-io/fastcdc2020/blob/main/fastcdc/fastcdc.go#L21-L41
Personally I think we should provide a concrete benchmark setup so that folks can verify against their own data sets and see what's best.
If we think 2 is definitively best for the RBE use case, let's just document 2 as the default and remove the normalization level config knob here. That would help simplify the setup a bit more. Whoever is interested in using a different level could add it back later.
WDYT?
// The minimum chunk size will be avg_chunk_size_bytes / 4.
// The maximum chunk size will be avg_chunk_size_bytes * 4.
s/will/SHOULD/
This leaves room for additional params to be added in the future if folks need them.
// Servers which support this chunking function SHOULD advertise the following
// configuration parameters through the CacheCapabilities message:
// - avg_chunk_size_bytes
// - normalization_level
// - seed
We could make it s/SHOULD/MUST/ here to force servers that support FastCDC to always advertise the relevant chunking config.
This was why I added defaults: if the server uses the defaults, it can leave these unset and clients can assume the defaults. Do you think it's necessary to require them even with the defaults in place?
I think the chunking may happen on both the client side and the server side. If the server bothers to add FastCDC to the chunking algorithms it supports, we should mandate that it MUST set these fields to make sure that the client and server stay consistent over time (the defaults may change over time).
For CDC (Content-Defined Chunking), having the client and server agree on a chunking algorithm unlocks a new level of possible improvements, where we can have distributed, deterministic, and reproducible chunking.
This PR adds chunking algorithm negotiation to GetCapabilities. Most notably, it introduces a ChunkingFunction enum, a ChunkingConfiguration message advertised through CacheCapabilities, and chunking_function fields on the SplitBlob and SpliceBlob messages.
Why FastCDC 2020?
FastCDC 2020 is very fast (~5GB/s on my AMD Ryzen 9950), and is backed by a clear spec described in the paper: IEEE paper pdf. Of the CDC algorithms, it is the most popular, with https://github.com/nlfiedler/fastcdc-rs mirroring the paper's implementation.
This algorithm has a couple more very important benefits:
Thank you to @sluongng for much of the initial version of this PR
Why the threshold?
We only chunk blobs larger than the threshold size. Making the threshold >= the max chunk size is an important design decision for the first iterations of the implementation here. Mainly, there is a cyclic re-chunking possibility if it's not upheld. For example, if the chunking threshold is only 1MB and a chunker (avg size 512k, max 2MB) produces a 1.5MB chunk, the server will try to re-chunk that chunk again. It would also mean the CAS can't tell whether a stored object is a full blob, a chunk, or a file that was chunked into a single chunk. Having the threshold >= the max chunk size also has the benefit of guaranteeing that we'll always get more than one chunk, which simplifies things a lot.
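Spelling out the arithmetic behind this with the defaults used in this PR (nothing new, just the numbers already stated above):

$$
\text{min} = \frac{512\ \text{KiB}}{4} = 128\ \text{KiB}, \qquad \text{max} = 4 \times 512\ \text{KiB} = 2\ \text{MiB} \le \text{threshold} = 2\ \text{MiB}
$$

So a blob is only chunked when it is larger than 2 MiB, every produced chunk is at most 2 MiB, and therefore no chunk can itself cross the threshold, which is what rules out the re-chunking cycle. It also guarantees at least two chunks per chunked blob.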
Why the default chunk size?
512kB may seem a little large for CDC, but it hits the sweet spot with Bazel. For medium-sized repos (for example https://github.com/buildbuddy-io/buildbuddy) with many GoLink objects, it performs well with the size of the artifacts.
Using my --disk_cache from the past 1-2 months of development, I ran some benchmarks using different chunk sizes. We get good benefits from starting on the larger end, at 512k with a 2MB threshold. This can be adjusted later, but deduplication is strong here (35%), and this only affects 4% of files.
We could drop to 64k, but we'd only get ~10% more deduplication savings and would still need to chunk 3x the number of files. Another option is 256k with a 1MB threshold, but that would roughly double the chunking overhead for only a small improvement. I think 256k and 512k would both be good options, but the larger threshold reduces the amount of chunking overhead we have, which significantly helps performance.