Include datafusion in the benchmark #5

Closed
kszlim opened this issue Apr 17, 2023 · 9 comments
Labels
Solution Include new solution

Comments

@kszlim

kszlim commented Apr 17, 2023

Datafusion is another stateless query engine/dataframe library I'd be interested in seeing results for.

https://github.com/apache/arrow-datafusion

@Tmonster
Collaborator

Hi Kevin, thanks for the suggestion!

I currently don't have a lot of bandwidth to add a whole new solution to the benchmark, but if you want to open a PR that adds the necessary setup-datafusion.sh, ver-datafusion.sh, upg-datafusion.sh, groupby-datafusion.rs, and join-datafusion.rs, then I'd be happy to review. Take a look at the files in the other solution folders; that should give you a good idea of what is necessary. It may require more steps, though: datafusion doesn't have any R or Python APIs, so you may also need to add or modify some files in _launcher and _helpers.

See repro.sh for steps to run the benchmark either locally or on an AWS instance. If no errors are thrown for the 0.5GB & 5GB datasets I'd be happy to merge your PR and re-run the benchmark to include results for datafusion.
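For anyone preparing such a PR, a quick pre-flight check along these lines might help. This is only a minimal sketch: the file names come from the comment above, but the helper `missing_solution_files` and the assumption that the solution sits in a folder named after it are hypothetical and not part of the benchmark repository.

```python
from pathlib import Path

# File-name patterns taken from the comment above; "{name}" is the solution name.
REQUIRED_PATTERNS = [
    "setup-{name}.sh",
    "ver-{name}.sh",
    "upg-{name}.sh",
    "groupby-{name}.rs",
    "join-{name}.rs",
]


def missing_solution_files(solution_dir, name):
    """Return the required file names that are absent from solution_dir.

    Hypothetical helper for illustration; not part of the benchmark repo.
    """
    root = Path(solution_dir)
    expected = [pattern.format(name=name) for pattern in REQUIRED_PATTERNS]
    return [fname for fname in expected if not (root / fname).is_file()]


if __name__ == "__main__":
    # Assumption: the solution lives in a folder named after it.
    missing = missing_solution_files("datafusion", "datafusion")
    print("missing:", missing or "none")
```

An empty result only means the expected files exist; whether they actually run is what repro.sh is for.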

@Tmonster Tmonster added the Solution Include new solution label Apr 19, 2023
@kszlim
Author

kszlim commented Apr 19, 2023

There is actually a Python API, though it's not documented well:
https://github.com/apache/arrow-datafusion-python

If I have time I'll try to port the benchmarks to it.

@MrPowers

MrPowers commented May 2, 2023

Looks like almost all of this work is done already: https://github.com/apache/arrow-datafusion/tree/main/benchmarks/db-benchmark

Would you like to add the PR @kszlim or would you like me to take a stab?

@kszlim
Author

kszlim commented May 2, 2023

Go ahead, I don't have the time!

@Tmonster
Collaborator

@MrPowers Was just looking at this again. Looks like the db-benchmark for datafusion is here now?
https://github.com/apache/arrow-datafusion-python/tree/main/benchmarks/db-benchmark
Would you still like to open a PR? Some of the files contain benchmark initialization setup that would need to be trimmed, but I don't think it would be much work.

@hkpeaks

hkpeaks commented Jun 6, 2023

@kszlim I'm interested in including datafusion in the upcoming benchmarking (#13 (comment)). Does it support streaming (the larger-than-memory data scenario)?

@kszlim
Author

kszlim commented Jun 6, 2023

> @kszlim I'm interested in including datafusion in the upcoming benchmarking (#13 (comment)). Does it support streaming (the larger-than-memory data scenario)?

The Rust library does; I'm not sure if the Python bindings expose it.

@Dandandan

Closed by #18

@kszlim
Author

kszlim commented Dec 6, 2023

Thanks!

@kszlim kszlim closed this as completed Dec 6, 2023

5 participants