Compress virtual chunk locations.#1776
Conversation
It only compresses on IC2, if there are enough virtual chunks and the locations are long enough.
53da127 to
8499a04
Compare
| #[pymethods] | ||
| impl PyManifestVirtualChunkLocationCompressionConfig { | ||
| #[new] | ||
| #[pyo3(signature = (min_virtual_chunks_to_compress=None, dictionary_max_training_samples=None, dictionary_max_size_bytes=None, compression_level=None))] |
There was a problem hiding this comment.
| #[pyo3(signature = (min_virtual_chunks_to_compress=None, dictionary_max_training_samples=None, dictionary_max_size_bytes=None, compression_level=None))] | |
| #[pyo3(signature = (min_virtual_chunks_to_compress=None, *, dictionary_max_training_samples=None, dictionary_max_size_bytes=None, compression_level=None))] |
There was a problem hiding this comment.
Perhaps this should be min_num_chunks instead of min_virtual_chunks_to_compress. The "virtual" and "compress" is already in ManifestVirtualChunkLocationCompressionConfig
| impl PyManifestConfig { | ||
| #[new] | ||
| #[pyo3(signature = (preload=None, splitting=None))] | ||
| #[pyo3(signature = (preload=None, splitting=None, virtual_chunk_location_compression=None))] |
There was a problem hiding this comment.
| #[pyo3(signature = (preload=None, splitting=None, virtual_chunk_location_compression=None))] | |
| #[pyo3(signature = (*, preload=None, splitting=None, virtual_chunk_location_compression=None))] |
There was a problem hiding this comment.
I think this would break api
There was a problem hiding this comment.
My personal preference is that this is an OK break. At the very least we can stick * after splitting
| // compression algorithm used for compressed_location fields | ||
| // 0 = none, 1 = zstd dictionary compression | ||
| // Introduced in spec version 2.0 | ||
| compression_algorithm: uint8 = 1; |
There was a problem hiding this comment.
A spec file should be updated. Are you planning to do that in one go at the end?
There was a problem hiding this comment.
yeah ... the spec is very outdated
| virtual_chunks_compression_config: Option< | ||
| &ManifestVirtualChunkLocationCompressionConfig, | ||
| >, | ||
| ) -> IcechunkResult<Option<Self>> { |
There was a problem hiding this comment.
What can cause a failure now?
I see it's train_location_dictionary (so hard to spot a ? in the diff :) )
| let compressed_locations = | ||
| match (&location_compression_dict, virtual_chunks_compression_config) { | ||
| (Some(d), Some(config)) => compress_locations(&sorted_chunks, d, config), | ||
| _ => vec![None; sorted_chunks.len()], |
There was a problem hiding this comment.
Seems weird to me that the config can be None here. Doesn't that mean the default, which is to compress? Did I misunderstand the change to config.rs?
There was a problem hiding this comment.
Aha, the default is not to compress! Would be good to explicitly write that out somewhere, maybe in config.rs
There was a problem hiding this comment.
The intention was the default to be compression for IC2, and of course we cannot compress in IC1. Did I get it wrong?
| assert_eq!(repo.spec_version(), SpecVersionBin::V2dot0); | ||
|
|
||
| // Rewrite manifests (now with IC2 compression enabled) | ||
| rewrite_manifests( |
<!-- Please review our AI & contribution guidelines: https://icechunk.io/en/latest/ai-policy/ --> ## Description We add a new `extra` field pretty much where it makes sense. There is no significant space impact because these are all flatbuffers tables and the field is optional. Code-wise, we don't do anything. All these fields will remain empty until we find a use for them after Icechunk 2.0. <!-- Describe your changes and the reasoning behind them. -->
It only compresses on IC2, if there are enough virtual chunks and the
locations are long enough.