Skip to content

Compress virtual chunk locations.#1776

Merged
paraseba merged 10 commits into
mainfrom
push-sulplzntlnyo
Mar 9, 2026
Merged

Compress virtual chunk locations.#1776
paraseba merged 10 commits into
mainfrom
push-sulplzntlnyo

Conversation

@paraseba
Copy link
Copy Markdown
Collaborator

@paraseba paraseba commented Mar 6, 2026

It only compresses on IC2, if there are enough virtual chunks and the
locations are long enough.

It only compresses on IC2, if there are enough virtual chunks and the
locations are long enough.
@paraseba paraseba force-pushed the push-sulplzntlnyo branch from 53da127 to 8499a04 Compare March 6, 2026 03:50
@paraseba paraseba changed the title Add compressed_location and location_dictionary fields to manifest fl… Compress virtual chunk locations. Mar 6, 2026
Comment thread icechunk-python/python/icechunk/_icechunk_python.pyi
Comment thread icechunk-python/src/config.rs Outdated
#[pymethods]
impl PyManifestVirtualChunkLocationCompressionConfig {
#[new]
#[pyo3(signature = (min_virtual_chunks_to_compress=None, dictionary_max_training_samples=None, dictionary_max_size_bytes=None, compression_level=None))]
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
#[pyo3(signature = (min_virtual_chunks_to_compress=None, dictionary_max_training_samples=None, dictionary_max_size_bytes=None, compression_level=None))]
#[pyo3(signature = (min_virtual_chunks_to_compress=None, *, dictionary_max_training_samples=None, dictionary_max_size_bytes=None, compression_level=None))]

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps this should be min_num_chunks instead of min_virtual_chunks_to_compress. The "virtual" and "compress" is already in ManifestVirtualChunkLocationCompressionConfig

impl PyManifestConfig {
#[new]
#[pyo3(signature = (preload=None, splitting=None))]
#[pyo3(signature = (preload=None, splitting=None, virtual_chunk_location_compression=None))]
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
#[pyo3(signature = (preload=None, splitting=None, virtual_chunk_location_compression=None))]
#[pyo3(signature = (*, preload=None, splitting=None, virtual_chunk_location_compression=None))]

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this would break api

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My personal preference is that this is an OK break. At the very least we can stick * after splitting

// compression algorithm used for compressed_location fields
// 0 = none, 1 = zstd dictionary compression
// Introduced in spec version 2.0
compression_algorithm: uint8 = 1;
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A spec file should be updated. Are you planning to do that in one go at the end?

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah ... the spec is very outdated

Comment thread icechunk/src/format/manifest.rs
virtual_chunks_compression_config: Option<
&ManifestVirtualChunkLocationCompressionConfig,
>,
) -> IcechunkResult<Option<Self>> {
Copy link
Copy Markdown
Collaborator

@dcherian dcherian Mar 9, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What can cause a failure now?

I see it's train_location_dictionary (so hard to spot a ? in the diff :) )

Comment thread icechunk/src/format/manifest.rs Outdated
let compressed_locations =
match (&location_compression_dict, virtual_chunks_compression_config) {
(Some(d), Some(config)) => compress_locations(&sorted_chunks, d, config),
_ => vec![None; sorted_chunks.len()],
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems weird to me that the config can be None here. Doesn't that mean the default, which is to compress? Did I misunderstand the change to config.rs?

Copy link
Copy Markdown
Collaborator

@dcherian dcherian Mar 9, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Aha, the default is not to compress! Would be good to explicitly write that out somewhere, maybe in config.rs

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The intention was the default to be compression for IC2, and of course we cannot compress in IC1. Did I get it wrong?

Comment thread icechunk/src/format/manifest.rs
Comment thread icechunk/src/format/manifest.rs
Comment thread icechunk/src/format/manifest.rs Outdated
Comment thread icechunk/src/format/manifest.rs
Comment thread icechunk/src/format/manifest.rs Outdated
assert_eq!(repo.spec_version(), SpecVersionBin::V2dot0);

// Rewrite manifests (now with IC2 compression enabled)
rewrite_manifests(
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nice!

paraseba added 3 commits March 9, 2026 13:27
<!-- Please review our AI & contribution guidelines:
https://icechunk.io/en/latest/ai-policy/ -->

## Description

We add a new `extra` field pretty much where it makes sense. There is no
significant space impact because these are all flatbuffers tables and
the field is optional.

Code-wise, we don't do anything. All these fields will remain empty
until we find a use for them after Icechunk 2.0.

<!-- Describe your changes and the reasoning behind them. -->
@paraseba paraseba enabled auto-merge (squash) March 9, 2026 17:43
@paraseba paraseba merged commit d7e28e3 into main Mar 9, 2026
19 of 20 checks passed
@paraseba paraseba deleted the push-sulplzntlnyo branch March 9, 2026 17:51
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants