Add support for COUNT(DISTINCT expr, expr1, ...) #2292

@andygrove

What is the problem the feature request solves?

The expression COUNT(DISTINCT expr) is relatively common and is used in TPC-H, so it would be good to be able to accelerate it in Comet.

Spark supports multiple expressions, e.g. COUNT(DISTINCT a, b, c), but DataFusion does not, so we should only attempt to accelerate this when there is a single input expression.
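The single-expression rule above could be sketched as a small guard. This is only an illustration: `AggCall` and `can_accelerate_count_distinct` are hypothetical names invented for this example and are not part of Comet, Spark, or DataFusion.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class AggCall:
    """Hypothetical, simplified description of an aggregate call."""
    function: str          # aggregate name, e.g. "count"
    args: Sequence[str]    # input expressions, e.g. ["a", "b"]
    distinct: bool = False


def can_accelerate_count_distinct(call: AggCall) -> bool:
    """Return True only for COUNT(DISTINCT expr) with exactly one input."""
    if call.function.lower() == "count" and call.distinct:
        return len(call.args) == 1
    # Anything else is outside the scope of this sketch.
    return False
```

Under this sketch, COUNT(DISTINCT a) would be accepted, while COUNT(DISTINCT a, b, c) would be rejected and left for Spark to execute.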

Implementing this feature is not trivial because of a design issue in how we currently support partial aggregates: we do not report the correct output schema from the partial aggregate. For the aggregate expressions that we currently support this doesn't matter, because the partial and final aggregates have the same output type; SUM(int_column) is one example. For COUNT(DISTINCT int_column), however, the output of the partial aggregate will be a list of int and the output of the final aggregate will be a long.
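To illustrate why the partial and final schemas differ, here is a minimal sketch of a two-phase distinct count in plain Python. It is not how Comet or DataFusion implement the aggregate; the function names `partial_count_distinct` and `final_count_distinct` are assumptions made for this example, and the sample partitions are invented.

```python
from typing import Iterable, List, Optional


def partial_count_distinct(values: Iterable[Optional[int]]) -> List[int]:
    """Partial phase: emit the distinct non-null values seen in one partition.

    The output is a list of int, not a count.
    """
    return sorted({v for v in values if v is not None})


def final_count_distinct(partials: Iterable[List[int]]) -> int:
    """Final phase: union the partial lists and return a single count."""
    seen = set()
    for state in partials:
        seen.update(state)
    return len(seen)


# Three made-up partitions with overlapping values and one NULL.
partitions = [[1, 2, 2, None], [2, 3], [3, 3, 4]]
states = [partial_count_distinct(p) for p in partitions]
total = final_count_distinct(states)
print(states)  # [[1, 2], [2, 3], [3, 4]]
print(total)   # 4
```

The partial phase has to ship the values themselves because the same value can appear in several partitions. Adding up per-partition distinct counts here would give 6 rather than 4, which is why the partial output is a list and only the final output is a long.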

Describe the potential solution

No response

Additional context

No response

Labels: enhancement
