Description
Is your feature request related to a problem or challenge?
This is a crazy idea
Now that DataFusion is the fastest engine for Parquet in ClickBench
- Update ClickBench benchmarks with DataFusion 43.0.0 #13099
- [DISCUSSION] Make DataFusion the fastest engine for querying parquet data in ClickBench #12821
A natural follow-on question is “what would it take to make it the fastest overall engine”?
Describe the solution you'd like
TL;DR: I think it needs a special file format. A custom format is fine and is consistent with other systems in ClickBench, which use various proprietary formats.
So, as a fascinating experiment / academic project, someone could design / hack up a “ClickBench” file format and DataFusion TableProvider, specifically designed for getting the fastest ClickBench results.
While I suspect this format would not be particularly general purpose, I think it would show how easy it is to make custom formats for particular use cases with DataFusion (you don’t have to worry about all the rest of the query engine machinery).
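To make the idea a bit more concrete, here is a toy sketch of the storage side only. It is not DataFusion code and not a proposed design: `Chunk`, `ToyColumnFile`, `from_values`, and `count_eq` are hypothetical names invented for illustration, and the layout (one integer column split into chunks with min/max statistics) is just an assumed example of the kind of choice a purpose-built format could make. A real version would sit behind a DataFusion TableProvider and return Arrow data.

```rust
// Hypothetical sketch: none of these types or methods come from DataFusion.

/// One chunk of a single integer column, with min/max statistics.
struct Chunk {
    min: i64,
    max: i64,
    values: Vec<i64>,
}

/// A toy single-column "file": values split into fixed-size chunks.
struct ToyColumnFile {
    chunks: Vec<Chunk>,
}

impl ToyColumnFile {
    /// Build the file from raw values. `chunk_size` must be non-zero.
    fn from_values(values: &[i64], chunk_size: usize) -> Self {
        let chunks = values
            .chunks(chunk_size)
            .map(|c| Chunk {
                // `chunks` never yields an empty slice, so min/max exist.
                min: *c.iter().min().unwrap(),
                max: *c.iter().max().unwrap(),
                values: c.to_vec(),
            })
            .collect();
        ToyColumnFile { chunks }
    }

    /// Count rows equal to `target`.
    /// Returns (number of matches, number of chunks actually scanned).
    fn count_eq(&self, target: i64) -> (usize, usize) {
        let mut matches = 0;
        let mut scanned = 0;
        for chunk in &self.chunks {
            // Skip chunks whose statistics rule out any match.
            if target < chunk.min || target > chunk.max {
                continue;
            }
            scanned += 1;
            matches += chunk.values.iter().filter(|&&v| v == target).count();
        }
        (matches, scanned)
    }
}

fn main() {
    let values: Vec<i64> = (0..1000).collect();
    let file = ToyColumnFile::from_values(&values, 100);
    let (matches, scanned) = file.count_eq(250);
    println!("matches={matches} chunks_scanned={scanned}");
}
```

The point of the sketch is only that the format author controls layout and statistics, while filtering, aggregation, and the rest of query execution would be left to the engine.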
Describe alternatives you've considered
No response
Additional context
This was inspired by @pauldix talking about using DataFusion to innovate at “the edges” of database design: https://twitter.com/pauldix/status/1855330035974160483