Skip to content

Improve performance of ClickBench Q18, Q35,  #13449

Open
@alamb

Description

@alamb

Is your feature request related to a problem or challenge?

While looking at the results of the most recent clickbench run

Here is the ClickBench page (link)

I see there are a few queries where DataFusion is significantly slower
Screenshot 2024-11-16 at 7 56 17 AM

The queries are:

Q18:

SELECT "UserID", extract(minute FROM to_timestamp_seconds("EventTime")) AS m, "SearchPhrase", COUNT(*) FROM hits GROUP BY "UserID", m, "SearchPhrase" ORDER BY COUNT(*) DESC LIMIT 10;

Q35:

SELECT "ClientIP", "ClientIP" - 1, "ClientIP" - 2, "ClientIP" - 3, COUNT(*) AS c FROM hits GROUP BY "ClientIP", "ClientIP" - 1, "ClientIP" - 2, "ClientIP" - 3 ORDER BY c DESC LIMIT 10;

Describe the solution you'd like

I would like the queries to go faster

Describe alternatives you've considered

Both queries look like

SELECT COUNT(...) cnt ... ORDER BY cnt DESC LIMIT 10

In other words they are "top 10 count" style queries

By default, DataFusion will compute the counts for all groups, and then pick only the top 10.

I suspect there is some fancier way to do this, perhaps by finding the top 10 values of count when emitting from the group operator or something. It would be interesting to see if we can see what other engines like DuckDB do with this query

Additional context

No response

Metadata

Metadata

Assignees

No one assigned

    Labels

    enhancementNew feature or request

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions