Skip to content

Support old syntax for approx_percentile_cont and approx_percentile_cont_with_weight #16955

@alamb

Description

@alamb

Is your feature request related to a problem or challenge?

We recently rolled out DataFusion 48.0.1 to production and had several customers report that queries that previously ran without issue

We traced it back to this PR from @Garamda which updated DataFusion SQL to use the standard WITHIN GROUP syntax:

For example, given this data

> create table test(c1 int, c2 float) as values (1, 100), (2, 200), (5, 500), (4, 400), (3, 300), (2, 200);
0 row(s) fetched.

> select * from test;
+----+-------+
| c1 | c2    |
+----+-------+
| 1  | 100.0 |
| 2  | 200.0 |
| 5  | 500.0 |
| 4  | 400.0 |
| 3  | 300.0 |
| 2  | 200.0 |
+----+-------+
6 row(s) fetched.

DataFusion 47.0.0

> SELECT approx_percentile_cont_with_weight(c1, c2, 0.95) FROM test;
+-------------------------------------------------------------------+
| approx_percentile_cont_with_weight(test.c1,test.c2,Float64(0.95)) |
+-------------------------------------------------------------------+
| 5.0                                                               |
+-------------------------------------------------------------------+
1 row(s) fetched.
Elapsed 0.001 seconds.

Same query errors DataFusion 48.0.0

> SELECT approx_percentile_cont_with_weight(c1, c2, 0.95) FROM test;
Error during planning: WITHIN GROUP clause is required when calling ordered set aggregate function(approx_percentile_cont_with_weight)

Instead you must use a different syntax

> SELECT approx_percentile_cont_with_weight(c2, 0.95) WITHIN GROUP (ORDER BY c1) FROM test;
+-------------------------------------------------------------------------------------------------+
| approx_percentile_cont_with_weight(test.c2,Float64(0.95)) WITHIN GROUP [test.c1 ASC NULLS LAST] |
+-------------------------------------------------------------------------------------------------+
| 5.0                                                                                             |
+-------------------------------------------------------------------------------------------------+
1 row(s) fetched.
Elapsed 0.003 seconds.

The functions affected are:

  • approx_percentile_cont
    • 47.0.0 : approx_percentile_cont(expression, percentile, centroids)
    • 48.0.0 : approx_percentile_cont(percentile, centroids) WITHIN GROUP (ORDER BY expression)
  • approx_percentile_cont_with_weight
    • 47.0.0 : approx_percentile_cont_with_weight(expression, weight, percentile)
    • 48.0.0 : approx_percentile_cont_with_weight(weight, percentile) WITHIN GROUP (ORDER BY expression)

Describe the solution you'd like

I would like DataFusion to support both the new and old syntax for approx_percentile_cont and approx_percentile_cont_with_weight

For example, given this table:

> create table test(c1 int, c2 float) as values (1, 100), (2, 200), (5, 500), (4, 400), (3, 300), (2, 200);


For example, that means in addition to supporting the  new syntax from DataFusion 48.0.0:
```sql
> SELECT approx_percentile_cont_with_weight(c1, c2, 0.95) FROM test;
+-------------------------------------------------------------------+
| approx_percentile_cont_with_weight(test.c1,test.c2,Float64(0.95)) |
+-------------------------------------------------------------------+
| 5.0                                                               |
+-------------------------------------------------------------------+
1 row(s) fetched.
Elapsed 0.001 seconds.

A user should also be able to write this (as they were able in DataFusion 47)

> SELECT approx_percentile_cont_with_weight(c1, c2, 0.95) FROM test;
+-------------------------------------------------------------------+
| approx_percentile_cont_with_weight(test.c1,test.c2,Float64(0.95)) |
+-------------------------------------------------------------------+
| 5.0                                                               |
+-------------------------------------------------------------------+
1 row(s) fetched.

And get the same answer

Describe alternatives you've considered

We could only make this change on our own fork

Additional context

Looking at other issues linked to that PR, this change seems to have caused downstream breakage for others too:

Metadata

Metadata

Assignees

Labels

enhancementNew feature or request

Type

No type

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions