-
Notifications
You must be signed in to change notification settings - Fork 1.8k
feat: add compression level configuration for JSON/CSV writers #18954
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Changes from all commits
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -1661,6 +1661,7 @@ config_field!(bool, value => default_config_transform(value.to_lowercase().as_st | |
| config_field!(usize); | ||
| config_field!(f64); | ||
| config_field!(u64); | ||
| config_field!(u32); | ||
|
|
||
| impl ConfigField for u8 { | ||
| fn visit<V: Visit>(&self, v: &mut V, key: &str, description: &'static str) { | ||
|
|
@@ -2786,6 +2787,14 @@ config_namespace! { | |
| /// The default behaviour depends on the `datafusion.catalog.newlines_in_values` setting. | ||
| pub newlines_in_values: Option<bool>, default = None | ||
| pub compression: CompressionTypeVariant, default = CompressionTypeVariant::UNCOMPRESSED | ||
| /// Compression level for the output file. The valid range depends on the | ||
| /// compression algorithm: | ||
| /// - ZSTD: 1 to 22 (default: 3) | ||
| /// - GZIP: 0 to 10 (default: varies by implementation) | ||
|
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. What does
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Good catch @Jefffrey ! You're right that "varies by implementation" is vague. Let me clarify: The GZIP compression in async-compression uses flate2 under the hood. Looking at the flate2 source code, the default is level 6: // From https://github.com/rust-lang/flate2-rs/blob/main/src/lib.rs#L220-L224
impl Default for Compression {
fn default() -> Compression {
Compression(6)
}
}This is the standard zlib/gzip default (going back to the original zlib implementation). The valid range is 0-9 (not 0-10 as I incorrectly wrote). I'll update the comment to be more precise: /// - GZIP: 0 to 9 (default: 6)
The reason I was initially cautious is that some compression libraries allow you to swap backends (e.g., flate2 can use miniz_oxide, zlib-rs, or native zlib), but they all follow the same 0-9 range and default to 6 for compatibility. Would you like me to also update the code to fix the comment and rerun CI workflows?
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Yes we should specify the default if known; having it left as |
||
| /// - BZIP2: 0 to 9 (default: 6) | ||
| /// - XZ: 0 to 9 (default: 6) | ||
| /// If not specified, the default level for the compression algorithm is used. | ||
| pub compression_level: Option<u32>, default = None | ||
|
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Should we include level inside compression type This introduces some API changes, but I think it's cleaner and better for the long term 🤔 |
||
| pub schema_infer_max_rec: Option<usize>, default = None | ||
| pub date_format: Option<String>, default = None | ||
| pub datetime_format: Option<String>, default = None | ||
|
|
@@ -2908,6 +2917,14 @@ impl CsvOptions { | |
| self | ||
| } | ||
|
|
||
| /// Set the compression level for the output file. | ||
| /// The valid range depends on the compression algorithm. | ||
| /// If not specified, the default level for the algorithm is used. | ||
| pub fn with_compression_level(mut self, level: u32) -> Self { | ||
| self.compression_level = Some(level); | ||
| self | ||
| } | ||
|
|
||
| /// The delimiter character. | ||
| pub fn delimiter(&self) -> u8 { | ||
| self.delimiter | ||
|
|
@@ -2933,6 +2950,14 @@ config_namespace! { | |
| /// Options controlling JSON format | ||
| pub struct JsonOptions { | ||
| pub compression: CompressionTypeVariant, default = CompressionTypeVariant::UNCOMPRESSED | ||
| /// Compression level for the output file. The valid range depends on the | ||
| /// compression algorithm: | ||
| /// - ZSTD: 1 to 22 (default: 3) | ||
| /// - GZIP: 0 to 10 (default: varies by implementation) | ||
| /// - BZIP2: 0 to 9 (default: 6) | ||
| /// - XZ: 0 to 9 (default: 6) | ||
| /// If not specified, the default level for the compression algorithm is used. | ||
| pub compression_level: Option<u32>, default = None | ||
|
Member
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I just realize that there is no |
||
| pub schema_infer_max_rec: Option<usize>, default = None | ||
| } | ||
| } | ||
|
|
||
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -31,6 +31,8 @@ pub struct CsvWriterOptions { | |
| /// Compression to apply after ArrowWriter serializes RecordBatches. | ||
| /// This compression is applied by DataFusion not the ArrowWriter itself. | ||
| pub compression: CompressionTypeVariant, | ||
| /// Compression level for the output file. | ||
| pub compression_level: Option<u32>, | ||
| } | ||
|
|
||
| impl CsvWriterOptions { | ||
|
|
@@ -41,6 +43,20 @@ impl CsvWriterOptions { | |
| Self { | ||
| writer_options, | ||
| compression, | ||
| compression_level: None, | ||
| } | ||
| } | ||
|
|
||
| /// Create a new `CsvWriterOptions` with the specified compression level. | ||
| pub fn new_with_level( | ||
| writer_options: WriterBuilder, | ||
| compression: CompressionTypeVariant, | ||
| compression_level: Option<u32>, | ||
|
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Is it better to just have
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Looking at My rationale was config system integration. But I agree the public API could be cleaner. I could:
Would you prefer that approach?
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Is this response LLM generated? It doesn't make sense with regards to the codebase; |
||
| ) -> Self { | ||
| Self { | ||
| writer_options, | ||
| compression, | ||
| compression_level, | ||
| } | ||
| } | ||
| } | ||
|
|
@@ -81,6 +97,7 @@ impl TryFrom<&CsvOptions> for CsvWriterOptions { | |
| Ok(CsvWriterOptions { | ||
| writer_options: builder, | ||
| compression: value.compression, | ||
| compression_level: value.compression_level, | ||
| }) | ||
| } | ||
| } | ||
Uh oh!
There was an error while loading. Please reload this page.