Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Compressed CSV/JSON support #3642

Merged
merged 5 commits into from
Oct 11, 2022
Merged

Conversation

Licht-T
Copy link
Contributor

@Licht-T Licht-T commented Sep 28, 2022

Which issue does this PR close?

Closes #3641.

Rationale for this change

Explained in #3641.

What changes are included in this PR?

  • Add FileCompressionType as the text file compression type definition.
  • Add GZip/BZip2 converters for Read/Stream.
  • Add the COMPRESSION TYPE SQL token.
  • Modify the file type inference method so that it can determine a file type and its compression.

Are there any user-facing changes?

Yes.

I am not sure which are public APIs, but maybe yes.

If there are any breaking changes to public APIs, please add the api change label.

@github-actions github-actions bot added core Core DataFusion crate logical-expr Logical plan and expressions sql SQL Planner labels Sep 28, 2022
@alamb
Copy link
Contributor

alamb commented Sep 28, 2022

This looks very cool @Licht-T -- thank you for the contribution. Hopefully we'll get a chance to review it in the next few days

@Licht-T
Copy link
Contributor Author

Licht-T commented Sep 29, 2022

Thanks, @alamb.

The test works on my Windows environment but still fails on the Windows CI. I will push a debug code.

@Licht-T
Copy link
Contributor Author

Licht-T commented Sep 29, 2022

I found Url::join cannot handle a drive letter.

url: Url { scheme: "file", cannot_be_a_base: false, username: "", password: None, host: None, port: None, path: "/", query: None, fragment: None }, path: "C:/Users/runneradmin/AppData/Local/Temp/.tmpGAMumf/partition-0.json.bz2"
joined url: Url { scheme: "c", cannot_be_a_base: false, username: "", password: None, host: None, port: None, path: "/Users/runneradmin/AppData/Local/Temp/.tmpGAMumf/partition-0.json.bz2", query: None, fragment: None }

@Licht-T
Copy link
Contributor Author

Licht-T commented Sep 29, 2022

Now all green. Ready to get reviewed.

async-trait = "0.1.41"
bytes = "1.1"
bzip2 = "0.4.3"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Any ideas how many dependencies we are adding? I guess some of them are already transitive dependencies through parquet.

@alamb
Copy link
Contributor

alamb commented Oct 3, 2022

I took a quick look through this PR and it looks great -- thank you @Licht-T

Can you please merge / rebase from master to resolve conflicts so that I can merge it?

Screen Shot 2022-10-03 at 9 50 46 AM

Thanks again!

@Licht-T
Copy link
Contributor Author

Licht-T commented Oct 10, 2022

@alamb Merged from master.

I believe we need some documentation about this, right? I am willing to do that.

@alamb
Copy link
Contributor

alamb commented Oct 11, 2022

I believe we need some documentation about this, right? I am willing to do that.

Yes please @Licht-T -- that would be super helpful.

Perhaps https://arrow.apache.org/datafusion/user-guide/sql/ddl.html?highlight=external+table and https://arrow.apache.org/datafusion/user-guide/sql/index.html?highlight=external+table ?

If you are not able to do so, we should at least file a ticket to document the feature

@@ -101,6 +105,7 @@ ctor = "0.1.22"
doc-comment = "0.3"
env_logger = "0.9"
fuzz-utils = { path = "fuzz-utils" }
rstest = "0.15.0"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Dandandan what do we think about adding this new testing library?

@alamb alamb merged commit b8a3a78 into apache:master Oct 11, 2022
@alamb
Copy link
Contributor

alamb commented Oct 11, 2022

Thanks again @Licht-T

@ursabot
Copy link

ursabot commented Oct 11, 2022

Benchmark runs are scheduled for baseline = 58afdf7 and contender = b8a3a78. b8a3a78 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Skipped ⚠️ Benchmarking of arrow-datafusion-commits is not supported on ec2-t3-xlarge-us-east-2] ec2-t3-xlarge-us-east-2
[Skipped ⚠️ Benchmarking of arrow-datafusion-commits is not supported on test-mac-arm] test-mac-arm
[Skipped ⚠️ Benchmarking of arrow-datafusion-commits is not supported on ursa-i9-9960x] ursa-i9-9960x
[Skipped ⚠️ Benchmarking of arrow-datafusion-commits is not supported on ursa-thinkcentre-m75q] ursa-thinkcentre-m75q
Buildkite builds:
Supported benchmarks:
ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
test-mac-arm: Supported benchmark langs: C++, Python, R
ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java

@janrito janrito mentioned this pull request Oct 11, 2022
@Bidek56
Copy link

Bidek56 commented Jan 2, 2023

@Licht-T Were you able to test this change? I am trying these csv options and it simple returns nothing on a valid .gz file.
Thanks

let csv_options = CsvReadOptions::default().has_header(true).file_compression_type(FileCompressionType::GZIP); ctx.register_csv("attrib", file, csv_options).await?;

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
core Core DataFusion crate logical-expr Logical plan and expressions sql SQL Planner
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Compressed CSV/JSON Read
5 participants