Description
What is the problem the feature request solves?
The expression COUNT(DISTINCT expr) is relatively common and is used in TPC-H, so it would be good to be able to accelerate it in Comet.
Spark supports multiple input expressions, e.g. COUNT(DISTINCT a, b, c), but DataFusion does not, so we should only attempt to accelerate this when there is a single input expression.
Implementing this feature is not trivial because of a design issue in how we currently support partial aggregates: we do not report the correct output schema from the partial aggregate. For the aggregate expressions that we currently support, this does not matter because the partial and final aggregates have the same output type. For example, SUM(int_column) has the output type int for both partial and final. For COUNT(DISTINCT int_column), however, the output of the partial aggregate is a list of int and the output of the final aggregate is a long.
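To make the schema mismatch concrete, here is a minimal Python sketch of two-phase aggregation. It is only an illustration of the idea and is not Comet, Spark, or DataFusion code; the function names (`partial_sum`, `final_sum`, `partial_count_distinct`, `final_count_distinct`) and the sample partitions are assumptions made up for this example.

```python
from typing import Iterable, List, Set


def partial_sum(partition: Iterable[int]) -> int:
    # Partial state for SUM is a single number, the same shape as the final result.
    return sum(partition)


def final_sum(partials: Iterable[int]) -> int:
    # Merging SUM states is just another sum.
    return sum(partials)


def partial_count_distinct(partition: Iterable[int]) -> List[int]:
    # Partial state for COUNT(DISTINCT ...) is the list of distinct values seen
    # in this partition, NOT a count. Counts cannot be merged across partitions
    # because the same value may appear in more than one partition.
    return sorted(set(partition))


def final_count_distinct(partials: Iterable[List[int]]) -> int:
    # The final phase unions the per-partition lists and only then counts.
    merged: Set[int] = set()
    for state in partials:
        merged.update(state)
    return len(merged)


# Hypothetical input: one int column split across two partitions.
partitions = [[1, 2, 2, 3], [3, 4, 4]]

sum_states = [partial_sum(p) for p in partitions]
distinct_states = [partial_count_distinct(p) for p in partitions]

print(sum_states, final_sum(sum_states))
print(distinct_states, final_count_distinct(distinct_states))
```

In this sketch the SUM states are plain integers in both phases, while the COUNT(DISTINCT) states are lists of integers that only collapse to a single integer in the final phase. That difference is why the partial aggregate would need to report its own output schema.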
Describe the potential solution
No response
Additional context
No response