Skip to content

avg(distinct) support #2408

Open
Open
@WinkerDu

Description

@WinkerDu

Is your feature request related to a problem or challenge? Please describe what you are trying to do.

DataFusion does not avg(distinct ..):

> select avg(distinct a), avg(distinct b) from values(1, 2) t(a, b);
Execution error: avg(DISTINCT) aggregations are not available

Somewhat confusingly DataFusion does appear to support a single distinct avg

> select avg(distinct a) from values(1, 2) t(a, b);
+-------------------+
| avg(DISTINCT t.a) |
+-------------------+
| 1.0               |
+-------------------+
1 row(s) fetched.
Elapsed 0.015 seconds.

But it does this by rewriting effectively rewriting the query select avg(distinct a) to select avg(a) from (select distinct a ...) in the SingleDistinctToGroupBy rule

Describe the solution you'd like
Support multile distinct avg aggregates:

if acc_args.is_distinct {
return exec_err!("avg(DISTINCT) aggregations are not available");
}

Describe alternatives you've considered
Basically we could follow the model of COUNT(distinct ...) aggregates:

In this case, there is a separate accumulator type that is implemented for count distinct

https://github.com/apache/datafusion/blob/5d08325165c1a7b32e5e35164919e83d46735e98/datafusion/functions-aggregate/src/count.rs#L234-L233

Additional context
Add any other context or screenshots about the feature request here.

Metadata

Metadata

Assignees

Labels

enhancementNew feature or requesthelp wantedExtra attention is needed

Type

No type

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions