Add HTTP object store example #7602

Merged 2 commits into apache:main from the http-store-example branch on Sep 25, 2023

@pka commented Sep 19, 2023

Which issue does this PR close?

Closes #.

Rationale for this change

There's no example for accessing files over HTTP.

What changes are included in this PR?

Adds an example for reading a CSV file over HTTP.

DRAFT: requires apache/arrow-rs#4837 to be merged

Are these changes tested?

N/A

Are there any user-facing changes?

No

ctx.runtime_env()
.register_object_store(&base_url, Arc::new(http_store));

// register csv file with the execution context

this is cool

@pka marked this pull request as ready for review September 24, 2023 21:25
// register csv file with the execution context
ctx.register_csv(
"aggregate_test_100",
"https://github.com/apache/arrow-testing/raw/master/data/csv/aggregate_test_100.csv",

@alamb commented Sep 25, 2023

When I ran this example, even after merging up from main to pick up the update to arrow 47, I got an error:

cargo run --example query-http-csv

Running `/Users/alamb/Software/target-df2/debug/examples/query-http-csv`
Error: ObjectStore(NotFound { path: "apache/arrow-testing/raw/master/data/csv/aggregate_test_100.csv"
...
</html>\n", source: Some(reqwest::Error { kind: Status(404), url: Url { scheme: "https", cannot_be_a_base: false, username: "", password: None, host: Some(Domain("github.com")), port: None, path: "/apache/arrow-testing/raw/master/data/csv/aggregate_test_100.csv", query: None, fragment: None } }), status: Some(404) } })

Did it work for you?

Update: it needs an object store release (tracked by apache/arrow-rs#4858)
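
For readers puzzling over the `path` field in the error above: the store is registered for the base URL (`https://github.com` here), and the rest of the CSV URL is handed to it as the object path, which is why the error reports `apache/arrow-testing/raw/...` without a host. Below is a minimal, std-only sketch of that split. The helper name `split_store_url` is hypothetical and this is not the code DataFusion or `object_store` actually uses; it only illustrates the mapping.

```rust
/// Hypothetical helper (not part of DataFusion or object_store):
/// split a URL into its "scheme://authority" base and the remaining path.
fn split_store_url(url: &str) -> Option<(String, String)> {
    // Locate the scheme separator; a URL without one is rejected.
    let scheme_end = url.find("://")?;
    let rest = &url[scheme_end + 3..];

    // Everything up to the first '/' is the authority (host[:port]);
    // everything after it is the path relative to the store root.
    let (authority, path) = match rest.find('/') {
        Some(i) => (&rest[..i], &rest[i + 1..]),
        None => (rest, ""),
    };
    if authority.is_empty() {
        return None;
    }

    let base = format!("{}://{}", &url[..scheme_end], authority);
    Some((base, path.to_string()))
}

fn main() {
    let url = "https://github.com/apache/arrow-testing/raw/master/data/csv/aggregate_test_100.csv";
    let (base, path) = split_store_url(url).expect("URL must have a scheme and host");
    println!("store base:  {base}");
    println!("object path: {path}");
}
```

The base is what gets passed to `register_object_store`, and the relative path matches the one shown in the `NotFound` error.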

@alamb left a comment

Thank you @pka -- I took the liberty of merging this branch from master, and then tested with the (unreleased) version of object_store and it works great!

I applied this patch:

diff --git a/Cargo.toml b/Cargo.toml
index 60ff770d0d..dae6e3c04f 100644
--- a/Cargo.toml
+++ b/Cargo.toml
@@ -74,3 +74,6 @@ opt-level = 3
 overflow-checks = false
 panic = 'unwind'
 rpath = false
+
+[patch.crates-io]
+object_store = {  git = "https://github.com/apache/arrow-rs.git" }

And then ran it like this:

cargo run --example query-http-csv
   Compiling object_store v0.7.0 (https://github.com/apache/arrow-rs.git#2c9e2e9a)
   Compiling parquet v47.0.0
   Compiling datafusion-common v31.0.0 (/Users/alamb/Software/arrow-datafusion2/datafusion/common)
   Compiling datafusion-expr v31.0.0 (/Users/alamb/Software/arrow-datafusion2/datafusion/expr)
   Compiling datafusion-physical-expr v31.0.0 (/Users/alamb/Software/arrow-datafusion2/datafusion/physical-expr)
   Compiling datafusion-execution v31.0.0 (/Users/alamb/Software/arrow-datafusion2/datafusion/execution)
   Compiling datafusion-sql v31.0.0 (/Users/alamb/Software/arrow-datafusion2/datafusion/sql)
   Compiling datafusion-physical-plan v31.0.0 (/Users/alamb/Software/arrow-datafusion2/datafusion/physical-plan)
   Compiling datafusion-optimizer v31.0.0 (/Users/alamb/Software/arrow-datafusion2/datafusion/optimizer)
   Compiling datafusion v31.0.0 (/Users/alamb/Software/arrow-datafusion2/datafusion/core)
   Compiling datafusion-examples v31.0.0 (/Users/alamb/Software/arrow-datafusion2/datafusion-examples)
    Finished dev [unoptimized + debuginfo] target(s) in 1m 00s
     Running `/Users/alamb/Software/target-df2/debug/examples/query-http-csv`
+----+----+-----+
| c1 | c2 | c3  |
+----+----+-----+
| c  | 2  | 1   |
| d  | 5  | -40 |
| b  | 1  | 29  |
| a  | 1  | -85 |
| b  | 5  | -82 |
+----+----+-----+
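
A note for anyone reproducing this before the `object_store` release lands: the patch above tracks whatever the default branch of arrow-rs points at, so results may drift over time. One possible variation, offered as an assumption rather than what was actually run here, is to pin the patch to the commit shown in the build log (`2c9e2e9a`):

```toml
[patch.crates-io]
# Assumption: pin to the commit reported in the build log above.
# A full commit hash is safer if the short form ever becomes ambiguous.
object_store = { git = "https://github.com/apache/arrow-rs.git", rev = "2c9e2e9a" }
```

Once the `object_store` release tracked by apache/arrow-rs#4858 is available, the `[patch.crates-io]` section should no longer be needed.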

@alamb merged commit b1d134e into apache:main Sep 25, 2023
22 checks passed
alamb commented Sep 25, 2023

Thanks again @pka

@pka deleted the http-store-example branch September 25, 2023 21:39
Ted-Jiang pushed a commit to Ted-Jiang/arrow-datafusion that referenced this pull request Oct 7, 2023
Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>