theirix commented Jun 28, 2025

Which issue does this PR close?

Rationale for this change

As mentioned in that issue, it is reasonable to autodetect the file suffix from the compression type. For example, gzipped CSV should use .csv.gz. The suffix is used both when writing files and when reading from a directory.

What changes are included in this PR?

  1. The physical planner needs to know whether compression was specified, so I extended the abstract FileFormat trait with a method that returns the compression type. Based on this information, we construct a new extension (for example, .csv.gz), falling back to the original one (.csv) when no compression is set.

  2. The second change, to TableProviderFactory, covers reading: we construct a glob expression based on the compression type. For a new factory call, there are no files from which to infer extensions (although some logic was previously in place), so detection is added here as well.
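
The two changes above can be illustrated with a small standalone sketch. It does not use DataFusion's types: the Compression enum and the helper functions extension_with_compression and directory_glob are hypothetical names introduced only for this illustration, and the suffix table is an assumption about conventional extensions rather than a statement of what the PR implements.

```rust
/// Hypothetical stand-in for a compression setting (not DataFusion's actual type).
#[derive(Clone, Copy, Debug, PartialEq)]
enum Compression {
    Gzip,
    Bzip2,
    Xz,
    Zstd,
}

impl Compression {
    /// Conventional file suffix for each codec, including the leading dot.
    fn suffix(self) -> &'static str {
        match self {
            Compression::Gzip => ".gz",
            Compression::Bzip2 => ".bz2",
            Compression::Xz => ".xz",
            Compression::Zstd => ".zst",
        }
    }
}

/// Build the full extension from the base one (for example ".csv").
/// Falls back to the base extension when no compression is set.
fn extension_with_compression(base: &str, compression: Option<Compression>) -> String {
    match compression {
        Some(c) => format!("{}{}", base, c.suffix()),
        None => base.to_string(),
    }
}

/// Build a glob expression for reading every matching file in a directory.
fn directory_glob(dir: &str, base: &str, compression: Option<Compression>) -> String {
    let ext = extension_with_compression(base, compression);
    format!("{}/*{}", dir.trim_end_matches('/'), ext)
}
```

With these helpers, extension_with_compression(".csv", Some(Compression::Gzip)) yields ".csv.gz" and directory_glob("data/", ".csv", None) yields "data/*.csv". Keeping the fallback in one place means the write path and the read path agree on the extension.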

Are these changes tested?

  1. Added unit tests
  2. Ran the example program manually (cargo run --example dataframe) and got a gzipped file datafusion-examples/test_csv/kuz6EXRCgJqdmob3_0.csv.gz
  3. Verified copy.slt with compression cases

Are there any user-facing changes?

If the user specifies a directory for writing CSV files (rather than a single file path) together with a compression type, the files will have the .csv.gz extension instead of .csv. Similarly, when reading from a directory, the .csv.gz filter will be used.

I'm not sure whether we need to maintain backward compatibility for existing directories of compressed files that still use the plain .csv extension.
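
To make the backward-compatibility concern concrete, here is a minimal sketch of suffix-based filtering over a directory listing. The function filter_by_extension and the file names are hypothetical and exist only for illustration; the real listing logic in DataFusion may differ.

```rust
/// Hypothetical suffix filter: keep only the file names that end with
/// the expected extension.
fn filter_by_extension<'a>(names: &[&'a str], extension: &str) -> Vec<&'a str> {
    names
        .iter()
        .copied()
        .filter(|name| name.ends_with(extension))
        .collect()
}
```

Given the names "a_0.csv.gz" and "legacy_0.csv", a ".csv.gz" filter keeps only the first one. A directory of gzipped files written earlier under the plain .csv extension would therefore no longer be picked up when a compression type is specified.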

theirix added 2 commits June 28, 2025 23:39
- Add FileFormat::compression_type method
- Specify meaningful values for CSV only
- Use compression type as a part of extension for files
github-actions bot added the core (Core DataFusion crate) and datasource (Changes to the datasource crate) labels Jun 28, 2025
theirix marked this pull request as ready for review June 29, 2025 18:55

alamb left a comment

Thank you @theirix -- this looks good to me

alamb merged commit 2741c60 into apache:main Jul 7, 2025
29 checks passed

alamb commented Jul 7, 2025

Thanks again @theirix

Development

Successfully merging this pull request may close these issues.

Adding correct file extension