
[EPIC] Make DataFusion the top of the ClickBench Parquet leaderboard #18489

@alamb

Description


"It is time to regain our rightful place at the top of the leaderboard" - me

Is your feature request related to a problem or challenge?

The ClickBench Benchmark measures the performance of filtering and aggregation, two of the core

Being on top of ClickBench is something of a vanity metric: in my opinion, all the engines within a factor of 2 of each other likely offer similar user experiences (and the exact speed will depend on real user queries, etc.)

That being said, being the engine at the top of the benchmark is good for publicity, and the DataFusion community is certainly not against using it as such (see our blog post, Apache DataFusion is now the fastest single node engine for querying Apache Parquet files)

Also, ClickBench has more recently added more realistic benchmark machines

This ticket tracks improving the ClickBench performance even more

Here are results with DataFusion 47

[Results c6a.2xlarge (8 core, 16 GB)](https://benchmark.clickhouse.com/#system=+hBp|curp|fqo|ti%20rud|As(|kBP%20t|laDd|d(%20t|aseaa|Sa%20i&type=-&machine=+a2l&cluster_size=-&opensource=+s&tuned=-&metric=combined&queries=-)


[Results c6a.4xlarge (16 core, 32 GB)](https://benchmark.clickhouse.com/#system=+hBp|curp|fqo|ti%20rud|As(|kBP%20t|laDd|d(%20t|aseaa|Sa%20i&type=-&machine=+ca4e&cluster_size=-&opensource=+s&tuned=-&metric=combined&queries=-)


Here is where we stand with DataFusion 50 on the benchmark
(TODO: @pmcgleenon is running the benchmark over the next few days, see #17721 (comment), and then I will update)

Describe the solution you'd like

Get DataFusion back on top of ClickBench for reading partitioned Parquet

While being at the absolute top might seem appealing, I think it is likely not general purpose enough

Describe alternatives you've considered

While we could clearly implement ClickBench-specific optimizations, I don't think that is really a valuable exercise for users. I would very much like to focus our efforts on actually useful optimizations -- if someone wants to go nuts with BenchMaxxing, check out

Real Improvements

Potential Benchmaxxing (only really helps ClickBench) improvements

Misc

What I would like is for people to profile queries and try to find ways to improve them

Additional context

See related discussions on
