Skip to content

Improve the performance of COUNT DISTINCT queries for high cardinality groups #5547

Closed
@alamb

Description

@alamb

Is your feature request related to a problem or challenge? Please describe what you are trying to do.

Queries like this (which compute distinct values for high cardinality columns) are currently relatively slow (if there are many values of UserID):

SELECT 
  SUM(EngineId), 
  COUNT(*) AS c, 
  COUNT(DISTINCT "UserID")
FROM 
  hits 
GROUP BY 
  "RegionID" 
ORDER BY 
  c DESC 
LIMIT 10;

Here is a specific clickbench query from the discussion on #5276

SELECT "RegionID", SUM("AdvEngineID"), COUNT(*) AS c, AVG("ResolutionWidth"), COUNT(DISTINCT "UserID") FROM hits GROUP BY "RegionID" ORDER BY c DESC LIMIT 10;

Describe the solution you'd like
We could make this type of query faster. Hopefully we can collect ideas here

Describe alternatives you've considered
TBD

Additional context
There are thoughts on improving aggregate performance in general #4973

This is one area where clickhouse and duckdb are particularly strong

See #5276 and #5276 (comment)

Metadata

Metadata

Assignees

No one assigned

    Labels

    enhancementNew feature or request

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions