Use compression type in CSV file suffices #16609

theirix · 2025-06-28T22:42:51Z

Which issue does this PR close?

Closes #16260 (Adding correct file extension).

Rationale for this change

As mentioned in that issue, it's reasonable to derive the file suffix from the compression type. For example, a gzipped CSV file should end in .csv.gz. The suffix is used both when writing to a file and when reading from a directory.

What changes are included in this PR?

The physical planner needs to know whether compression was specified, so I extended the abstract FileFormat trait to provide a compression type. Based on this information, we can construct a new extension, falling back to the original one (.csv).
The second change, to TableProviderFactory, covers reading: we construct a glob expression based on the compression type. For a new factory call, there are no files from which to infer extensions (although some logic was previously in place), so detection is also added here.
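The two steps described above can be illustrated with a minimal, standalone sketch. It is not the code from this PR: the `Compression` enum, `full_extension`, and `glob_for_dir` are hypothetical names chosen for illustration, and the suffixes are just the conventional ones for each codec.

```rust
/// Hypothetical stand-in for a compression setting; not DataFusion's own type.
#[allow(dead_code)]
#[derive(Clone, Copy, Debug, PartialEq)]
enum Compression {
    Uncompressed,
    Gzip,
    Bzip2,
    Xz,
    Zstd,
}

impl Compression {
    /// Conventional file suffix for each codec; empty when uncompressed.
    fn suffix(self) -> &'static str {
        match self {
            Compression::Uncompressed => "",
            Compression::Gzip => ".gz",
            Compression::Bzip2 => ".bz2",
            Compression::Xz => ".xz",
            Compression::Zstd => ".zst",
        }
    }
}

/// Build the full extension. When the format reports no compression
/// (`None`), fall back to the base extension unchanged.
fn full_extension(base: &str, compression: Option<Compression>) -> String {
    let suffix = compression.map(Compression::suffix).unwrap_or("");
    format!("{}{}", base, suffix)
}

/// Build a glob pattern that selects files with that extension inside `dir`.
fn glob_for_dir(dir: &str, base: &str, compression: Option<Compression>) -> String {
    format!(
        "{}/*{}",
        dir.trim_end_matches('/'),
        full_extension(base, compression)
    )
}
```

Under these assumptions, `full_extension(".csv", Some(Compression::Gzip))` yields `.csv.gz`, while a format that reports no compression keeps `.csv`.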

Are these changes tested?

Added unit tests
Ran the example program manually with cargo run --example dataframe and got a gzipped file, datafusion-examples/test_csv/kuz6EXRCgJqdmob3_0.csv.gz
Verified copy.slt with compression cases

Are there any user-facing changes?

If the user specifies a directory to write CSV files (not a single file path) and the compression type, then files will have .csv.gz extensions instead of .csv. Similarly, when reading from a directory, the .csv.gz filter will be used.

I'm not sure whether we need to maintain backward compatibility for existing directories of compressed files that still carry a plain .csv extension.
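To make the compatibility concern concrete, here is a small sketch that assumes the directory read selects files purely by suffix. The helper `filter_by_extension` and the file names are hypothetical and are not part of DataFusion.

```rust
/// Keep only the file names that end with the expected extension.
/// Hypothetical helper for illustration; it assumes plain suffix matching.
fn filter_by_extension<'a>(names: &[&'a str], extension: &str) -> Vec<&'a str> {
    names
        .iter()
        .copied()
        .filter(|name| name.ends_with(extension))
        .collect()
}
```

With a directory listing of `a.csv`, `b.csv.gz`, and `c.csv.gz`, filtering on `.csv.gz` keeps only the last two names. Under this assumption, a gzip-compressed file that was written earlier under the name `a.csv` would no longer be selected.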

- Add FileFormat::compression_type method
- Specify meaningful values for CSV only
- Use compression type as a part of extension for files

alamb

Thank you @theirix -- this looks good to me

alamb · 2025-07-07T11:28:26Z

Thanks again @theirix

theirix added 2 commits June 28, 2025 23:39

Use compression type in file suffices

d50b908

- Add FileFormat::compression_type method
- Specify meaningful values for CSV only
- Use compression type as a part of extension for files

Add CSV tests

414dda6

github-actions bot added the core (Core DataFusion crate) and datasource (Changes to the datasource crate) labels Jun 28, 2025

theirix added 3 commits June 29, 2025 18:57

Add glob dep, use env logging

a2f821a

Use a glob pattern with compression suffix for TableProviderFactory

4103276

Conform to clippy standards

c9e3ea1

theirix marked this pull request as ready for review June 29, 2025 18:55

alamb approved these changes Jul 5, 2025

Merge branch 'main' into ext-compress

290cf30

alamb merged commit 2741c60 into apache:main Jul 7, 2025
29 checks passed
