Use compression type in CSV file suffixes #16609
Merged
Which issue does this PR close?
Rationale for this change
As mentioned in that issue, it's reasonable to autodetect the file suffix. For example, a gzipped CSV file should end in `.csv.gz`. The suffix is used both when writing to a file and when reading from a directory.

What changes are included in this PR?
The physical planner needs to know whether compression was specified, so I extended the abstract `FileFormat` trait to provide a compression type. Based on this information, we can construct a new extension, with a fallback to the original one (`.csv`).

A second change, to `TableProviderFactory`, is for reading: we construct a glob expression based on the compression type. For a new factory call, there are no files to infer extensions from (although some logic was previously in place), so detection is also added here.

Are these changes tested?
- Ran `cargo run --example dataframe` and got a gzipped file `datafusion-examples/test_csv/kuz6EXRCgJqdmob3_0.csv.gz`
- `copy.slt` with compression cases

Are there any user-facing changes?
If the user specifies a directory to write CSV files to (not a single file path) together with a compression type, the files will have the `.csv.gz` extension instead of `.csv`. Similarly, when reading from a directory, the `.csv.gz` filter will be used.

Not sure if we have to maintain backward compatibility for existing directories of compressed `.csv` files.
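
The suffix selection described above can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: the `Compression` enum and the `csv_extension` and `csv_glob` helpers are hypothetical names, and the suffixes other than `.gz` are assumptions based on common conventions.

```rust
/// Hypothetical stand-in for a compression setting; not DataFusion's actual type.
#[allow(dead_code)]
#[derive(Clone, Copy, Debug, PartialEq)]
enum Compression {
    Gzip,
    Zstd,
    Uncompressed,
}

impl Compression {
    /// Suffix appended after the base extension; empty when uncompressed.
    /// (`.zst` is an assumed conventional suffix, not taken from the PR.)
    fn suffix(self) -> &'static str {
        match self {
            Compression::Gzip => ".gz",
            Compression::Zstd => ".zst",
            Compression::Uncompressed => "",
        }
    }
}

/// Build the extension for CSV files, falling back to plain `.csv`
/// when no compression type is known.
#[allow(dead_code)]
fn csv_extension(compression: Option<Compression>) -> String {
    let suffix = compression.map(Compression::suffix).unwrap_or("");
    format!(".csv{suffix}")
}

/// Build a glob pattern for reading CSV files from a directory,
/// using the same extension logic as for writing.
#[allow(dead_code)]
fn csv_glob(dir: &str, compression: Option<Compression>) -> String {
    format!("{}/*{}", dir.trim_end_matches('/'), csv_extension(compression))
}
```

With these helpers, `csv_extension(Some(Compression::Gzip))` yields `.csv.gz`, `csv_extension(None)` falls back to `.csv`, and `csv_glob("data/", Some(Compression::Gzip))` yields `data/*.csv.gz`. Sharing one helper between the write and read paths keeps the two suffixes from drifting apart.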