-
Notifications
You must be signed in to change notification settings - Fork 1.8k
feat: add compression level configuration for JSON/CSV writers #18954
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
|
Hi @andygrove, @Dandandan, @viirya! |
Adds `compression_level` option to `JsonOptions` and `CsvOptions` allowing users to specify compression level for ZSTD, GZIP, BZIP2, and XZ algorithms. - Add compression_level field to JsonOptions and CsvOptions in config.rs - Add convert_async_writer_with_level method (non-breaking, extends API) - Keep original convert_async_writer for backward compatibility - Update JsonWriterOptions and CsvWriterOptions with compression_level - Update ObjectWriterBuilder to support compression level - Update JSON and CSV sinks to pass compression level through - Update proto definitions and conversions for serialization Closes apache#18947
a7efa3c to
b3691fc
Compare
|
A tiny |
I wanted to do it but I think @andygrove triggered it before I did. 🙂 |
| pub fn new_with_level( | ||
| writer_options: WriterBuilder, | ||
| compression: CompressionTypeVariant, | ||
| compression_level: Option<u32>, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it better to just have compression_level: u32 and direct users to use new if they want default (None) compression level? Thoughts? 🤔
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looking at flate2::Compression, it uses new(level: u32) + default() rather than Option.
My rationale was config system integration. CsvOptions.compression_level is Option<u32> because users may or may not specify it in config. The new_with_level(..., Option<u32>) signature makes the TryFrom<&CsvOptions> impl straightforward.
But I agree the public API could be cleaner. I could:
- Keep
new_with_level(..., compression_level: u32)as you suggest (non-optional) - Let
TryFrominternally callnew()whencompression_levelisNone, ornew_with_level()whenSome
Would you prefer that approach?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this response LLM generated? It doesn't make sense with regards to the codebase; TryFrom<&CsvOptions> doesn't use new() or new_with_level()
| /// Compression level for the output file. The valid range depends on the | ||
| /// compression algorithm: | ||
| /// - ZSTD: 1 to 22 (default: 3) | ||
| /// - GZIP: 0 to 10 (default: varies by implementation) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What does varies by implementation mean here? Depends on system library, depends on rust crate dependency (in which case ideally we'd know which it is)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good catch @Jefffrey ! You're right that "varies by implementation" is vague. Let me clarify:
The GZIP compression in async-compression uses flate2 under the hood. Looking at the flate2 source code, the default is level 6:
// From https://github.com/rust-lang/flate2-rs/blob/main/src/lib.rs#L220-L224
impl Default for Compression {
fn default() -> Compression {
Compression(6)
}
}This is the standard zlib/gzip default (going back to the original zlib implementation). The valid range is 0-9 (not 0-10 as I incorrectly wrote).
I'll update the comment to be more precise:
/// - GZIP: 0 to 9 (default: 6)
The reason I was initially cautious is that some compression libraries allow you to swap backends (e.g., flate2 can use miniz_oxide, zlib-rs, or native zlib), but they all follow the same 0-9 range and default to 6 for compatibility.
Would you like me to also update the code to fix the comment and rerun CI workflows?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes we should specify the default if known; having it left as implementation specific is very confusing to users
| /// - BZIP2: 0 to 9 (default: 6) | ||
| /// - XZ: 0 to 9 (default: 6) | ||
| /// If not specified, the default level for the compression algorithm is used. | ||
| pub compression_level: Option<u32>, default = None |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I just realize that there is no impl JsonOptions with all with_xyz(mut self, ...) setters like the CsvOptions.
Jefffrey
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we need a roundtrip test for proto (enhancing an existing one if possible) and possibly and end-to-end test to show this new config in use when writing a file
| /// - BZIP2: 0 to 9 (default: 6) | ||
| /// - XZ: 0 to 9 (default: 6) | ||
| /// If not specified, the default level for the compression algorithm is used. | ||
| pub compression_level: Option<u32>, default = None |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we include level inside compression type CompressionTypeVariant, like
pub enum CompressionTypeVariant {
/// Gzip-ed file, level 1–9
Gzip { level: u32 },
....
This introduces some API changes, but I think it's cleaner and better for the long term 🤔
Which issue does this PR close?
Closes #18947
Rationale for this change
Currently, DataFusion uses default compression levels when writing compressed JSON and CSV files. For ZSTD, this means level 3, which prioritizes speed over compression ratio. Users working with large datasets who want to optimize for storage costs or network transfer have no way to increase the compression level.
This is particularly important for cloud data lake scenarios where storage and egress costs can be significant.
What changes are included in this PR?
compression_level: Option<u32>field toJsonOptionsandCsvOptionsinconfig.rsconvert_async_writer_with_level()method toFileCompressionType(non-breaking API extension)convert_async_writer()as a convenience wrapper for backward compatibilityJsonWriterOptionsandCsvWriterOptionswithcompression_levelfieldObjectWriterBuilderto support compression leveludf.rs(conditional compilation for debug-only imports)Are these changes tested?
The changes follow the existing patterns used throughout the codebase. The implementation was verified by:
cargo buildcargo test --package datafusion-protoAre there any user-facing changes?
Yes, users can now specify compression level when writing JSON/CSV files:
Supported compression levels:
This is a non-breaking change - the original
convert_async_writer()method signature is preserved for backward compatibility.