Improve the performance of COUNT DISTINCT queries for high cardinality groups #5547

**Is your feature request related to a problem or challenge? Please describe what you are trying to do.**

Queries like this, which compute distinct counts over a high cardinality column, are currently relatively slow when there are many distinct values of `UserID`:
```sql
SELECT 
  SUM(EngineId), 
  COUNT(*) AS c, 
  COUNT(DISTINCT "UserID")
FROM 
  hits 
GROUP BY 
  "RegionID" 
ORDER BY 
  c DESC 
LIMIT 10;
```
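
To make the cost concrete, here is a minimal Python sketch of what this query has to compute. It is an illustration only, not DataFusion's actual implementation: the function name `count_distinct_by_group` and the `(region_id, engine_id, user_id)` row layout are assumptions made for the example, and the region id is included in the output only for readability. The point is that `SUM` and `COUNT(*)` need one number per group, while `COUNT(DISTINCT ...)` needs a set per group whose size grows with the number of distinct `UserID` values.

```python
from collections import defaultdict


def count_distinct_by_group(rows):
    """Naive evaluation of the query above.

    `rows` is an iterable of (region_id, engine_id, user_id) tuples
    (a hypothetical layout chosen for this sketch).
    """
    sums = defaultdict(int)            # SUM(EngineId): one integer per group
    counts = defaultdict(int)          # COUNT(*): one integer per group
    distinct_users = defaultdict(set)  # COUNT(DISTINCT "UserID"): one set per group

    for region_id, engine_id, user_id in rows:
        sums[region_id] += engine_id
        counts[region_id] += 1
        # This set is the expensive part: it holds every distinct
        # user id seen so far for the group.
        distinct_users[region_id].add(user_id)

    result = [
        (region, sums[region], counts[region], len(distinct_users[region]))
        for region in counts
    ]
    # ORDER BY c DESC LIMIT 10
    result.sort(key=lambda row: row[2], reverse=True)
    return result[:10]
```

In this sketch the memory and hashing work for the distinct sets is proportional to the number of distinct `(RegionID, UserID)` pairs, which is why high cardinality columns hurt far more than the other aggregates.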

Here is a specific ClickBench query from the discussion in https://github.com/apache/arrow-datafusion/issues/5276:
```sql
SELECT "RegionID", SUM("AdvEngineID"), COUNT(*) AS c, AVG("ResolutionWidth"), COUNT(DISTINCT "UserID") FROM hits GROUP BY "RegionID" ORDER BY c DESC LIMIT 10;
```

**Describe the solution you'd like**
We could make this type of query faster. Hopefully we can collect ideas for how to do so here.

**Describe alternatives you've considered**
TBD

**Additional context**
There are thoughts on improving aggregate performance in general in https://github.com/apache/arrow-datafusion/issues/4973.

This is one area where ClickHouse and DuckDB are particularly strong.

See https://github.com/apache/arrow-datafusion/issues/5276 and https://github.com/apache/arrow-datafusion/issues/5276#issuecomment-1433650162
