Commit b477816

Enforce explicit opt-in for WITHIN GROUP syntax in aggregate UDAFs (#18607)
## Which issue does this PR close?

Closes #18109.

## Rationale for this change

Previously, the SQL planner accepted `WITHIN GROUP` clauses for all aggregate UDAFs, even those that did not explicitly support ordered-set semantics. This behavior was too permissive and inconsistent with PostgreSQL. For example, queries such as `SUM(x) WITHIN GROUP (ORDER BY x)` were allowed, even though `SUM` is not an ordered-set aggregate.

This PR enforces stricter validation so that only UDAFs that explicitly return `true` from `supports_within_group_clause()` may use `WITHIN GROUP`. All other aggregates now produce a clear planner error when this syntax is used.

## What changes are included in this PR?

* Added type alias `WithinGroupExtraction` to simplify complex tuple return types used by helper functions.
* Introduced a new helper method `extract_and_prepend_within_group_args` to centralize logic for handling `WITHIN GROUP` argument rewriting.
* Updated the planner to:
  * Validate that only UDAFs with `supports_within_group_clause()` can accept `WITHIN GROUP`.
  * Prepend `WITHIN GROUP` ordering expressions to function arguments only for supported ordered-set aggregates.
  * Produce clear error messages when `WITHIN GROUP` is used incorrectly.
* Added comprehensive unit tests verifying correct behavior and failure cases:
  * `WITHIN GROUP` rejected for non-ordered-set aggregates (`MIN`, `SUM`, etc.).
  * `WITHIN GROUP` accepted for ordered-set aggregates such as `percentile_cont`.
  * Validation for named arguments, multiple ordering expressions, and semantic conflicts with `OVER` clauses.
* Updated SQL logic tests (`aggregate.slt`) to reflect new rejection behavior.
* Updated documentation:
  * `aggregate_functions.md` and developer docs to clarify when and how `WITHIN GROUP` can be used.
  * `upgrading.md` to inform users of this stricter enforcement and migration guidance.

## Are these changes tested?

✅ Yes.

* New tests in `sql_integration.rs` validate acceptance, rejection, and argument behavior of `WITHIN GROUP` for both valid and invalid cases.
* SQL logic tests (`aggregate.slt`) include negative test cases confirming planner rejections.

## Are there any user-facing changes?

✅ Yes.

* Users attempting to use `WITHIN GROUP` with regular aggregates (e.g. `SUM`, `AVG`, `MIN`, `MAX`) will now see a planner error:

  > `WITHIN GROUP is only supported for ordered-set aggregate functions`

* Documentation has been updated to clearly describe `WITHIN GROUP` semantics and provide examples of valid and invalid usage.

No API-breaking changes were introduced; only stricter planner validation and improved error messaging.
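The validation rule described above can be sketched in isolation. This is a minimal, self-contained model and not DataFusion's planner code: the `AggregateImpl` trait, the `Sum` and `PercentileCont` structs, and `validate_within_group` are hypothetical stand-ins. Only the method name `supports_within_group_clause()` and the error text are taken from the commit; the `false` default is an assumption of this sketch.

```rust
/// Hypothetical stand-in for DataFusion's `AggregateUDFImpl`; only the
/// opt-in method relevant to this commit is modeled.
trait AggregateImpl {
    /// Opt-in flag. Assumed to default to `false`, so an aggregate has to
    /// override it before `WITHIN GROUP` is accepted.
    fn supports_within_group_clause(&self) -> bool {
        false
    }
}

/// A regular aggregate: keeps the default and therefore does not opt in.
struct Sum;
impl AggregateImpl for Sum {}

/// An ordered-set aggregate: explicitly opts in.
struct PercentileCont;
impl AggregateImpl for PercentileCont {
    fn supports_within_group_clause(&self) -> bool {
        true
    }
}

/// Mirrors the planner check: a non-empty WITHIN GROUP list is an error
/// unless the aggregate opted in. `within_group_len` is the number of
/// ordering expressions in the clause (0 when the clause is absent).
fn validate_within_group(
    udaf: &dyn AggregateImpl,
    within_group_len: usize,
) -> Result<(), String> {
    if within_group_len > 0 && !udaf.supports_within_group_clause() {
        return Err(
            "WITHIN GROUP is only supported for ordered-set aggregate functions"
                .to_string(),
        );
    }
    Ok(())
}
```

In this model `validate_within_group(&Sum, 1)` returns the planner error, while `validate_within_group(&PercentileCont, 1)` and any call with `within_group_len == 0` succeed.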
1 parent 37dbf9e commit b477816

6 files changed: 180 additions, 38 deletions
datafusion/sql/src/expr/function.rs

Lines changed: 64 additions & 29 deletions
@@ -22,11 +22,12 @@ use datafusion_common::{
     internal_datafusion_err, internal_err, not_impl_err, plan_datafusion_err, plan_err,
     DFSchema, Dependency, Diagnostic, Result, Span,
 };
-use datafusion_expr::expr::{
-    NullTreatment, ScalarFunction, Unnest, WildcardOptions, WindowFunction,
+use datafusion_expr::{
+    expr,
+    expr::{NullTreatment, ScalarFunction, Unnest, WildcardOptions, WindowFunction},
+    planner::{PlannerResult, RawAggregateExpr, RawWindowExpr},
+    Expr, ExprSchemable, SortExpr, WindowFrame, WindowFunctionDefinition,
 };
-use datafusion_expr::planner::{PlannerResult, RawAggregateExpr, RawWindowExpr};
-use datafusion_expr::{expr, Expr, ExprSchemable, WindowFrame, WindowFunctionDefinition};
 use sqlparser::ast::{
     DuplicateTreatment, Expr as SQLExpr, Function as SQLFunction, FunctionArg,
     FunctionArgExpr, FunctionArgumentClause, FunctionArgumentList, FunctionArguments,
@@ -212,6 +213,9 @@ impl FunctionArgs {
     }
 }

+// Helper type for extracting WITHIN GROUP ordering and prepended args
+type WithinGroupExtraction = (Vec<SortExpr>, Vec<Expr>, Vec<Option<String>>);
+
 impl<S: ContextProvider> SqlToRel<'_, S> {
     pub(super) fn sql_function_to_expr(
         &self,
@@ -490,31 +494,30 @@ impl<S: ContextProvider> SqlToRel<'_, S> {
                 let (mut args, mut arg_names) =
                     self.function_args_to_expr_with_names(args, schema, planner_context)?;

-                let order_by = if fm.supports_within_group_clause() {
-                    let within_group = self.order_by_to_sort_expr(
-                        within_group,
-                        schema,
-                        planner_context,
-                        false,
-                        None,
-                    )?;
-
-                    // Add the WITHIN GROUP ordering expressions to the front of the argument list
-                    // So function(arg) WITHIN GROUP (ORDER BY x) becomes function(x, arg)
-                    if !within_group.is_empty() {
-                        // Prepend None arg names for each WITHIN GROUP expression
-                        let within_group_count = within_group.len();
-                        arg_names = std::iter::repeat_n(None, within_group_count)
-                            .chain(arg_names)
-                            .collect();
-
-                        args = within_group
-                            .iter()
-                            .map(|sort| sort.expr.clone())
-                            .chain(args)
-                            .collect::<Vec<_>>();
-                    }
-                    within_group
+                // UDAFs must opt-in via `supports_within_group_clause()` to
+                // accept a WITHIN GROUP clause.
+                let supports_within_group = fm.supports_within_group_clause();
+
+                if !within_group.is_empty() && !supports_within_group {
+                    return plan_err!(
+                        "WITHIN GROUP is only supported for ordered-set aggregate functions"
+                    );
+                }
+
+                // If the UDAF supports WITHIN GROUP, convert the ordering into
+                // sort expressions and prepend them as unnamed function args.
+                let order_by = if supports_within_group {
+                    let (within_group_sorts, new_args, new_arg_names) = self
+                        .extract_and_prepend_within_group_args(
+                            within_group,
+                            args,
+                            arg_names,
+                            schema,
+                            planner_context,
+                        )?;
+                    args = new_args;
+                    arg_names = new_arg_names;
+                    within_group_sorts
                 } else {
                     let order_by = if !order_by.is_empty() {
                         order_by
@@ -807,6 +810,38 @@ impl<S: ContextProvider> SqlToRel<'_, S> {
         Ok((exprs, names))
     }

+    fn extract_and_prepend_within_group_args(
+        &self,
+        within_group: Vec<OrderByExpr>,
+        mut args: Vec<Expr>,
+        mut arg_names: Vec<Option<String>>,
+        schema: &DFSchema,
+        planner_context: &mut PlannerContext,
+    ) -> Result<WithinGroupExtraction> {
+        let within_group = self.order_by_to_sort_expr(
+            within_group,
+            schema,
+            planner_context,
+            false,
+            None,
+        )?;
+
+        if !within_group.is_empty() {
+            let within_group_count = within_group.len();
+            arg_names = std::iter::repeat_n(None, within_group_count)
+                .chain(arg_names)
+                .collect();
+
+            args = within_group
+                .iter()
+                .map(|sort| sort.expr.clone())
+                .chain(args)
+                .collect::<Vec<_>>();
+        }
+
+        Ok((within_group, args, arg_names))
+    }
+
     pub(crate) fn check_unnest_arg(arg: &Expr, schema: &DFSchema) -> Result<()> {
         // Check argument type, array types are supported
         match arg.get_type(schema)? {
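The argument rewriting done by `extract_and_prepend_within_group_args` can be illustrated without any DataFusion types. The sketch below is an assumption-laden model: expressions are plain `String`s, the function name `prepend_within_group` is hypothetical, and `std::iter::repeat(..).take(n)` is used in place of the `repeat_n` call in the diff so that it also builds on older toolchains.

```rust
/// Prepend the WITHIN GROUP ordering expressions to the argument list and
/// pad the argument names with `None` for each prepended expression.
/// Expressions are modeled as plain strings (an assumption of this sketch).
fn prepend_within_group(
    within_group: &[String],
    args: Vec<String>,
    arg_names: Vec<Option<String>>,
) -> (Vec<String>, Vec<Option<String>>) {
    // Nothing to prepend: return the inputs unchanged.
    if within_group.is_empty() {
        return (args, arg_names);
    }

    // One unnamed slot per ordering expression, followed by the original names.
    let new_names: Vec<Option<String>> = std::iter::repeat(None::<String>)
        .take(within_group.len())
        .chain(arg_names)
        .collect();

    // Ordering expressions first, then the original arguments.
    let new_args: Vec<String> = within_group.iter().cloned().chain(args).collect();

    (new_args, new_names)
}
```

For `percentile_cont(0.5) WITHIN GROUP (ORDER BY value)` this model produces the argument order `(value, 0.5)`, which matches the comment removed in the earlier hunk: `function(arg) WITHIN GROUP (ORDER BY x)` becomes `function(x, arg)`. The names vector stays the same length as the arguments vector, so named arguments keep lining up with their expressions.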

datafusion/sql/tests/sql_integration.rs

Lines changed: 21 additions & 3 deletions
@@ -38,10 +38,12 @@ use datafusion_sql::{
 use crate::common::{CustomExprPlanner, CustomTypePlanner, MockSessionState};
 use datafusion_functions::core::planner::CoreFunctionPlanner;
 use datafusion_functions_aggregate::{
-    approx_median::approx_median_udaf, count::count_udaf, min_max::max_udaf,
-    min_max::min_udaf,
+    approx_median::approx_median_udaf,
+    average::avg_udaf,
+    count::count_udaf,
+    grouping::grouping_udaf,
+    min_max::{max_udaf, min_udaf},
 };
-use datafusion_functions_aggregate::{average::avg_udaf, grouping::grouping_udaf};
 use datafusion_functions_nested::make_array::make_array_udf;
 use datafusion_functions_window::{rank::rank_udwf, row_number::row_number_udwf};
 use insta::{allow_duplicates, assert_snapshot};
@@ -233,6 +235,22 @@ fn parse_ident_normalization_4() {
     );
 }

+#[test]
+fn within_group_rejected_for_non_ordered_set_udaf() {
+    // MIN is order-sensitive by nature but does not implement the
+    // ordered-set `WITHIN GROUP` opt-in. The planner must reject
+    // explicit `WITHIN GROUP` syntax for functions that do not
+    // advertise `supports_within_group_clause()`.
+    let sql = "SELECT min(c1) WITHIN GROUP (ORDER BY c1) FROM person";
+    let err = logical_plan(sql)
+        .expect_err("expected planning to fail for MIN WITHIN GROUP")
+        .to_string();
+    assert_contains!(
+        err,
+        "WITHIN GROUP is only supported for ordered-set aggregate functions"
+    );
+}
+
 #[test]
 fn parse_ident_normalization_5() {
     let sql = "SELECT AGE FROM PERSON";

datafusion/sqllogictest/test_files/aggregate.slt

Lines changed: 14 additions & 6 deletions
@@ -129,6 +129,16 @@ CREATE TABLE group_median_table_nullable (
 # Error tests
 #######

+statement error DataFusion error: Error during planning: WITHIN GROUP is only supported for ordered-set aggregate functions
+SELECT SUM(c2) WITHIN GROUP (ORDER BY c2) FROM aggregate_test_100
+
+# WITHIN GROUP rejected for non-ordered-set UDAF
+# MIN does not implement ordered-set semantics (`supports_within_group_clause()`),
+# so the planner should reject the WITHIN GROUP syntax.
+statement error DataFusion error: Error during planning: WITHIN GROUP is only supported for ordered-set aggregate functions
+SELECT MIN(c) WITHIN GROUP (ORDER BY c) FROM (VALUES (1),(2)) as t(c);
+
+
 # https://github.com/apache/datafusion/issues/3353
 statement error DataFusion error: Schema error: Schema contains duplicate unqualified field name "approx_distinct\(aggregate_test_100\.c9\)"
 SELECT approx_distinct(c9) count_c9, approx_distinct(cast(c9 as varchar)) count_c9_str FROM aggregate_test_100
@@ -7867,17 +7877,15 @@ VALUES
 ----
 x 1

-query ?
+query error Error during planning: WITHIN GROUP is only supported for ordered-set aggregate functions
 SELECT array_agg(a_varchar) WITHIN GROUP (ORDER BY a_varchar)
 FROM (VALUES ('a'), ('d'), ('c'), ('a')) t(a_varchar);
-----
-[a, a, c, d]

-query ?
+
+query error Error during planning: WITHIN GROUP is only supported for ordered-set aggregate functions
 SELECT array_agg(DISTINCT a_varchar) WITHIN GROUP (ORDER BY a_varchar)
 FROM (VALUES ('a'), ('d'), ('c'), ('a')) t(a_varchar);
-----
-[a, c, d]
+

 query error Error during planning: ORDER BY and WITHIN GROUP clauses cannot be used together in the same aggregate function
 SELECT array_agg(a_varchar order by a_varchar) WITHIN GROUP (ORDER BY a_varchar)

dev/update_function_docs.sh

Lines changed: 30 additions & 0 deletions
@@ -78,6 +78,36 @@ FROM employees;
 ```

 Note: When no rows pass the filter, `COUNT` returns `0` while `SUM`/`AVG`/`MIN`/`MAX` return `NULL`.
+
+## WITHIN GROUP / Ordered-set aggregates
+
+Some aggregate functions accept the SQL `WITHIN GROUP (ORDER BY ...)` clause to specify the ordering the
+aggregate relies on. In DataFusion this is opt-in: only aggregate functions whose implementation returns
+`true` from `AggregateUDFImpl::supports_within_group_clause()` accept the `WITHIN GROUP` clause. Attempting to
+use `WITHIN GROUP` with a regular aggregate (for example, `SELECT SUM(x) WITHIN GROUP (ORDER BY x)`) will fail
+during planning with an error: "WITHIN GROUP is only supported for ordered-set aggregate functions".
+
+Currently, the built-in aggregate functions that support `WITHIN GROUP` are:
+
+- `percentile_cont` — exact percentile aggregate (also available as `percentile_cont(column, percentile)`)
+- `approx_percentile_cont` — approximate percentile using the t-digest algorithm
+- `approx_percentile_cont_with_weight` — approximate weighted percentile using the t-digest algorithm
+
+Note: rank-like functions such as `rank()`, `dense_rank()`, and `percent_rank()` are window functions and
+use the `OVER (...)` clause; they are not ordered-set aggregates that accept `WITHIN GROUP` in DataFusion.
+
+Example (ordered-set aggregate):
+
+```sql
+percentile_cont(0.5) WITHIN GROUP (ORDER BY value)
+```
+
+Example (invalid usage — planner will error):
+
+```sql
+-- This will fail: SUM is not an ordered-set aggregate
+SELECT SUM(x) WITHIN GROUP (ORDER BY x) FROM t;
+```
 EOF

 echo "Running CLI and inserting aggregate function docs table"

docs/source/library-user-guide/upgrading.md

Lines changed: 21 additions & 0 deletions
@@ -25,6 +25,27 @@

 You can see the current [status of the `52.0.0` release here](https://github.com/apache/datafusion/issues/18566)

+### Planner now requires explicit opt-in for WITHIN GROUP syntax
+
+The SQL planner now enforces the aggregate UDF contract more strictly: the
+`WITHIN GROUP (ORDER BY ...)` syntax is accepted only if the aggregate UDAF
+explicitly advertises support by returning `true` from
+`AggregateUDFImpl::supports_within_group_clause()`.
+
+Previously the planner forwarded a `WITHIN GROUP` clause to order-sensitive
+aggregates even when they did not implement ordered-set semantics, which could
+cause queries such as `SUM(x) WITHIN GROUP (ORDER BY x)` to plan successfully.
+This behavior was too permissive and has been changed to match PostgreSQL and
+the documented semantics.
+
+Migration: If your UDAF intentionally implements ordered-set semantics and
+wants to accept the `WITHIN GROUP` SQL syntax, update your implementation to
+return `true` from `supports_within_group_clause()` and handle the ordering
+semantics in your accumulator implementation. If your UDAF is merely
+order-sensitive (but not an ordered-set aggregate), do not advertise
+`supports_within_group_clause()` and clients should use alternative function
+signatures (for example, explicit ordering as a function argument) instead.
+
 ### `AggregateUDFImpl::supports_null_handling_clause` now defaults to `false`

 This method specifies whether an aggregate function allows `IGNORE NULLS`/`RESPECT NULLS`

docs/source/user-guide/sql/aggregate_functions.md

Lines changed: 30 additions & 0 deletions
@@ -48,6 +48,36 @@ FROM employees;

 Note: When no rows pass the filter, `COUNT` returns `0` while `SUM`/`AVG`/`MIN`/`MAX` return `NULL`.

+## WITHIN GROUP / Ordered-set aggregates
+
+Some aggregate functions accept the SQL `WITHIN GROUP (ORDER BY ...)` clause to specify the ordering the
+aggregate relies on. In DataFusion this is opt-in: only aggregate functions whose implementation returns
+`true` from `AggregateUDFImpl::supports_within_group_clause()` accept the `WITHIN GROUP` clause. Attempting to
+use `WITHIN GROUP` with a regular aggregate (for example, `SELECT SUM(x) WITHIN GROUP (ORDER BY x)`) will fail
+during planning with an error: "WITHIN GROUP is only supported for ordered-set aggregate functions".
+
+Currently, the built-in aggregate functions that support `WITHIN GROUP` are:
+
+- `percentile_cont` — exact percentile aggregate (also available as `percentile_cont(column, percentile)`)
+- `approx_percentile_cont` — approximate percentile using the t-digest algorithm
+- `approx_percentile_cont_with_weight` — approximate weighted percentile using the t-digest algorithm
+
+Note: rank-like functions such as `rank()`, `dense_rank()`, and `percent_rank()` are window functions and
+use the `OVER (...)` clause; they are not ordered-set aggregates that accept `WITHIN GROUP` in DataFusion.
+
+Example (ordered-set aggregate):
+
+```sql
+percentile_cont(0.5) WITHIN GROUP (ORDER BY value)
+```
+
+Example (invalid usage — planner will error):
+
+```sql
+-- This will fail: SUM is not an ordered-set aggregate
+SELECT SUM(x) WITHIN GROUP (ORDER BY x) FROM t;
+```
+
 ## General Functions

 - [array_agg](#array_agg)
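To make "ordered-set aggregate" concrete, the following is a minimal sketch of what `percentile_cont(p) WITHIN GROUP (ORDER BY value)` computes under the usual SQL definition: sort the inputs, locate the fractional position `p * (n - 1)`, and interpolate linearly between the two neighbouring values. It is an illustration only and not DataFusion's implementation. The function name `percentile_cont_sketch`, the `f64`-only input, and returning `None` for empty input or a fraction outside `[0, 1]` are all assumptions of this sketch.

```rust
/// Exact continuous percentile over a slice, using linear interpolation.
/// Hypothetical helper: argument order follows the rewritten call
/// `percentile_cont(column, percentile)` mentioned in the docs above.
fn percentile_cont_sketch(values: &[f64], p: f64) -> Option<f64> {
    // Assumption: empty input or an out-of-range (or NaN) fraction yields None.
    if values.is_empty() || !(0.0..=1.0).contains(&p) {
        return None;
    }

    // The ORDER BY of the WITHIN GROUP clause: the result depends on the
    // sorted order of the inputs, not on their arrival order.
    let mut sorted = values.to_vec();
    sorted.sort_by(|a, b| a.total_cmp(b));

    // Fractional zero-based position of the requested percentile.
    let rank = p * (sorted.len() - 1) as f64;
    let lo = rank.floor() as usize;
    let hi = rank.ceil() as usize;
    let frac = rank - lo as f64;

    // Linear interpolation between the two neighbouring sorted values.
    Some(sorted[lo] + (sorted[hi] - sorted[lo]) * frac)
}
```

The sort step is what separates this from aggregates such as `SUM`: the ordering is part of the function's definition, which is why only aggregates of this kind opt in to `WITHIN GROUP`.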
