Add Rust's DataFusion (arrow) #107

Open
andygrove opened this issue Oct 16, 2019 · 9 comments

Comments

@andygrove

DataFusion is an in-memory query engine that uses Apache Arrow as the memory model. It supports executing SQL queries against CSV and Parquet files as well as querying directly against in-memory data.

DataFusion supports projection, selection, and simple aggregate queries.

https://github.com/apache/arrow/tree/master/rust/datafusion
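As a rough illustration of the three query shapes mentioned above (selection, projection, and a simple aggregate), here is a minimal sketch in plain Rust over in-memory columns. It deliberately does not use the DataFusion API: the `Sales` struct, its column names, and the helper functions are all hypothetical and exist only for this example.

```rust
use std::collections::HashMap;

/// A tiny columnar "table" (hypothetical, not a DataFusion type):
/// each field is one column, and row i is (region[i], amount[i]).
struct Sales {
    region: Vec<String>,
    amount: Vec<f64>,
}

/// Selection: indices of rows with amount > min,
/// roughly `WHERE amount > min`.
fn select_rows(t: &Sales, min: f64) -> Vec<usize> {
    t.amount
        .iter()
        .enumerate()
        .filter(|(_, a)| **a > min)
        .map(|(i, _)| i)
        .collect()
}

/// Projection: keep only the region column for the given rows,
/// roughly `SELECT region`.
fn project_region(t: &Sales, rows: &[usize]) -> Vec<String> {
    rows.iter().map(|&i| t.region[i].clone()).collect()
}

/// Simple aggregate: sum of amount per region over the given rows,
/// roughly `SELECT region, SUM(amount) ... GROUP BY region`.
fn sum_by_region(t: &Sales, rows: &[usize]) -> HashMap<String, f64> {
    let mut sums = HashMap::new();
    for &i in rows {
        *sums.entry(t.region[i].clone()).or_insert(0.0) += t.amount[i];
    }
    sums
}
```

Calling `select_rows` first and passing its result to `project_region` or `sum_by_region` mirrors a filter followed by a projection or a GROUP BY. A real engine would run these steps over Arrow arrays rather than `Vec`s.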

@jangorecki
Contributor

Thanks for filing the request.
I would appreciate it if someone could ping me here once it supports joins.

@andygrove
Author

Here are the latest benchmarks for GROUP BY. I think DataFusion is mature enough to consider adding here, although it doesn't support JOIN yet. Is that a prerequisite for getting it on this site?

https://andygrove.io/rust_bigdata_benchmarks/

@jangorecki
Contributor

Definitely not a prerequisite. It looks competitive. Should one expect to see similar performance compared to other tools that use Arrow as a backend? In that case we would be benchmarking Arrow via its Rust interface. That still makes sense; I'm just asking for a better understanding.

@andygrove
Author

andygrove commented Oct 20, 2019 via email

jangorecki changed the title from "Add DataFusion" to "Add Rust's DataFusion (arrow)" on Feb 23, 2020
@andygrove
Author

> Thanks for filing the request.
> I would appreciate it if someone could ping me here once it supports joins.

@jangorecki FYI, DataFusion 3.0.0 (due to be released any day now) supports joins.

@jangorecki
Contributor

@andygrove Thanks for the update. Note that another Rust-based solution, Polars, was merged recently. The process was very smooth because the author of Polars submitted groupby and join benchmark scripts in a PR. This helped a lot. Writing those scripts properly is not an easy job, because I need to figure out not only how to answer the questions, but how to answer them in the most performant way.

@MrPowers

@jangorecki - can I submit a pull request with the DataFusion script to help with the process?

@andygrove
Author

If it helps, we could even publish a dedicated Rust crate containing the DataFusion h2o benchmarks.

@c21

c21 commented Jul 14, 2022

It would be great to add DataFusion to the benchmark!
