Description
"It is time to regain our rightful place at the top of the leaderboard" - me
Is your feature request related to a problem or challenge?
The ClickBench Benchmark measures the performance of filtering and aggregation, two of the core
Being on top of ClickBench is somewhat of a vanity exercise: in my opinion all the engines within a factor of 2 likely have similar user experiences (and the exact speed will depend on real user queries, etc.)
That being said, being the engine at the top of the benchmark is good for publicity, and the DataFusion community is certainly not against using it as such (see our blog post, "Apache DataFusion is now the fastest single node engine for querying Apache Parquet files").
Also, ClickBench has more recently added more realistic benchmark machines
This ticket tracks improving the ClickBench performance even more
Here are results with DataFusion 47
[Results c6a.2xlarge (8 core, 16 GB)](https://benchmark.clickhouse.com/#system=+hBp|curp|fqo|ti%20rud|As(|kBP%20t|laDd|d(%20t|aseaa|Sa%20i&type=-&machine=+a2l&cluster_size=-&opensource=+s&tuned=-&metric=combined&queries=-)
[Results c6a.4xlarge (16 core, 32 GB)](https://benchmark.clickhouse.com/#system=+hBp|curp|fqo|ti%20rud|As(|kBP%20t|laDd|d(%20t|aseaa|Sa%20i&type=-&machine=+ca4e&cluster_size=-&opensource=+s&tuned=-&metric=combined&queries=-)
Here is where we stand with DataFusion 50 on the benchmark
(TODO: @pmcgleenon is running over the next few days, see #17721 (comment) -- and then I will update)
Describe the solution you'd like
Get DataFusion back on top of ClickBench for reading partitioned parquet
While being at the absolute top might seem appealing, I think the work needed to get there is likely not general purpose enough.
Describe alternatives you've considered
While we could clearly implement ClickBench-specific optimizations, I don't think that is really a valuable exercise for users. I would very much like to focus our efforts on actually useful optimizations -- if someone wants to go nuts with BenchMaxxing, check out
Real Improvements
- Enable parquet filter pushdown (`filter_pushdown`) by default #3463
- [Parquet] Pre-fetch the next row group when reading parquet files #18470
- TPCH q1 with no predicates is 2x slower than duckdb #18411
Potential Benchmaxxing (only really helps ClickBench) improvements
- Make Clickbench Q29 5x faster for datafusion by extracting `SUM(..)` clauses #15524
- Improve performance of ClickBench Q18, Q35, #13449
Misc
What I would like is for people to profile queries and try to find ways to improve them.
Additional context
See related discussions on