Description
Is your feature request related to a problem or challenge?
While looking at the results of the most recent ClickBench run (ClickBench page: link), I see there are a few queries where DataFusion is significantly slower.
The queries are:
Q18:
- Improve vectorized operations of GroupColumn #13275
- Potential performance regression for TPCH q18 #13188
- Optimize date_part
Q35:
Describe the solution you'd like
I would like the queries to go faster
Describe alternatives you've considered
Both queries look like
SELECT COUNT(...) cnt ... ORDER BY cnt DESC LIMIT 10
In other words, they are "top 10 count" style queries.
By default, DataFusion will compute the counts for all groups, and then pick only the top 10.
I suspect there is a smarter way to do this, perhaps by finding the top 10 count values while emitting from the group operator. It would be interesting to see what other engines, such as DuckDB, do with this query.
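To make the "count everything, then pick the top 10" shape concrete, here is a minimal Python sketch. It is not DataFusion's implementation. The function name `top_k_counts` and the sample data are hypothetical and only illustrate the general idea: the counts for all groups are still computed, but the final selection keeps a small bounded structure instead of sorting every group.

```python
import heapq
from collections import Counter


def top_k_counts(keys, k=10):
    """Count every group, then return the k groups with the largest counts.

    Hypothetical helper for illustration only; it does not mirror any
    DataFusion operator.
    """
    # Full aggregation: one counter per distinct group key.
    counts = Counter(keys)
    # Select the k largest (key, count) pairs without fully sorting
    # all groups; results come back in descending count order.
    return heapq.nlargest(k, counts.items(), key=lambda kv: kv[1])


# Example: the two most frequent keys in a small input.
rows = ["a", "b", "a", "c", "a", "b"]
top2 = top_k_counts(rows, k=2)
```

In this sketch the aggregation still materializes a count for every distinct key, which is the cost described above. Only the final ordering step is reduced to a top-k selection. Whether the group operator itself could prune groups earlier is the open question in this issue.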
Additional context
No response