-
Notifications
You must be signed in to change notification settings - Fork 1.8k
Description
Is your feature request related to a problem or challenge?
For many of the same reasons as listed on #8045, having two types of aggregate functions ("built in" -- datafusion::physical_plan::aggregates::AggregateFunction) and AggregateUDF is problematic for two reasons:
- There are some features not available to User Defined Aggregate Functions (such as the faster
GroupsAccumulatorinterface) - Users can not easily choose which aggregate functions to include (for example, do they want to allow (and pay the code size / compile time) for the Statistical and Approximate functions
The second also ends up causing pushback on adding new aggregates like ARRAY_SUM in #8325 and geospatial support #7859.
Describe the solution you'd like
I propose moving DataFusion to only use AggregateUDFs and remove the built in list of AggregateFunctions for the same reasons as #8045
We will keep the existing AggregateUDF interface as much as possible, while also potentially providing an easier way to define them.
New AggregateUDF is in functions-aggregate crate
Old Aggregate functions are in datafusion/physical-expr/src/aggregate
Describe alternatives you've considered
Additional context
Proposed implementation steps:
- Use lowercase
namefor aggregate function #10695 - Implement trait based API for defining
AggregateUDF#8710 - Support
GroupsAccumulatoraccumulator for User Defined Functions #8793 - Support ORDER BY in AggregateUDF #8984
- Move
AggregateExpr,PhysicalExprandPhysicalSortExprto physical-expr-core #9926 - Simplify API Add a
AggregateUDFImpl::simplfy()API #9526 - Covariance Sample Move
Covariance(Sample)covar/covar_sampto be a User Defined Aggregate Function #10372 - Covariance Population Move Covariance (Population) covar_pop to be a User Defined Aggregate Function #10418
- Count Move
Counttofunctions-aggregate, update MSRV to rust 1.75 #10484 - Median Move Median to
functions-aggregateand Introduce Numeric signature #10644 - First and Last Make FirstValue an UDAF, Change
AggregateUDFImpl::accumulatorsignature, support ORDER BY for UDAFs #9874 Convert first, last aggregate function to UDAF #10648 - Sum Introduce Sum UDAF #10651
- Expression API Introduce expr builder for aggregate function #10560
- Variance Sample Convert
Variance Sampleto UDAF #10667 - Variance Population Convert Variance Population to UDAF #10668
- Stddev Convert
stddevto udaf #10827 - Approx Distinct Convert approx_distinct to UDAF #10851
- approx_median Convert
approx_medianto UDAF #10838 - approx_percentile_cont approx_percentile_cont_with_weight Convert
ApproxPercentileContand `ApproxPercentileContWithWeightto UDAF #10870 - regr Convert
Regrto UDAF #10883 - correlation Convert
Correlationto UDAF #10884 - BitAnd / Or / Xor Convert
BitAnd,BitOr,BitXorto UDAF #10907 - Average Convert
Averageto UDAF #10942 - Min Max Convert
MinandMaxto UDAF #10943 - String agg Convert
StringAggto UDAF #10945 - ArrayAgg Convert
ArrayAggto UDAF #10999 - BoolAndOR Convert
BoolAndOrto UDAF #11008 - Convert
MinandMaxto UDAF #10943 - Remove special casting of
Min/Maxbuilt inAggregateFunctions#11151
Move rust test to sqllogictest if possible #10384
Good first issue list
- grouping Convert
Groupingto UDAF #10906
Pending
- nth_value
- array_agg_distinct
- array_agg_ordered
impl FromStr for AggregateFunction {
type Err = DataFusionError;
fn from_str(name: &str) -> Result<AggregateFunction> {
Ok(match name {
// general
"avg" => AggregateFunction::Avg,
"bit_and" => AggregateFunction::BitAnd,
"bit_or" => AggregateFunction::BitOr,
"bit_xor" => AggregateFunction::BitXor,
"bool_and" => AggregateFunction::BoolAnd,
"bool_or" => AggregateFunction::BoolOr,
"max" => AggregateFunction::Max,
"mean" => AggregateFunction::Avg,
"min" => AggregateFunction::Min,
"array_agg" => AggregateFunction::ArrayAgg,
"nth_value" => AggregateFunction::NthValue,
"string_agg" => AggregateFunction::StringAgg,
// statistical
"corr" => AggregateFunction::Correlation,
// other
"grouping" => AggregateFunction::Grouping,
_ => {
return plan_err!("There is no built-in function named {name}");
}
})
}
}Feel free to file an issue if you are interested in working on any of the above in the pending list.