-
Notifications
You must be signed in to change notification settings - Fork 1.8k
Description
Is your feature request related to a problem or challenge?
We recently rolled out DataFusion 48.0.1 to production and had several customers report that queries that previously ran without issue
We traced it back to this PR from @Garamda which updated DataFusion SQL to use the standard WITHIN GROUP syntax:
For example, given this data
> create table test(c1 int, c2 float) as values (1, 100), (2, 200), (5, 500), (4, 400), (3, 300), (2, 200);
0 row(s) fetched.
> select * from test;
+----+-------+
| c1 | c2 |
+----+-------+
| 1 | 100.0 |
| 2 | 200.0 |
| 5 | 500.0 |
| 4 | 400.0 |
| 3 | 300.0 |
| 2 | 200.0 |
+----+-------+
6 row(s) fetched.DataFusion 47.0.0
> SELECT approx_percentile_cont_with_weight(c1, c2, 0.95) FROM test;
+-------------------------------------------------------------------+
| approx_percentile_cont_with_weight(test.c1,test.c2,Float64(0.95)) |
+-------------------------------------------------------------------+
| 5.0 |
+-------------------------------------------------------------------+
1 row(s) fetched.
Elapsed 0.001 seconds.Same query errors DataFusion 48.0.0
> SELECT approx_percentile_cont_with_weight(c1, c2, 0.95) FROM test;
Error during planning: WITHIN GROUP clause is required when calling ordered set aggregate function(approx_percentile_cont_with_weight)Instead you must use a different syntax
> SELECT approx_percentile_cont_with_weight(c2, 0.95) WITHIN GROUP (ORDER BY c1) FROM test;
+-------------------------------------------------------------------------------------------------+
| approx_percentile_cont_with_weight(test.c2,Float64(0.95)) WITHIN GROUP [test.c1 ASC NULLS LAST] |
+-------------------------------------------------------------------------------------------------+
| 5.0 |
+-------------------------------------------------------------------------------------------------+
1 row(s) fetched.
Elapsed 0.003 seconds.The functions affected are:
- approx_percentile_cont
- 47.0.0 :
approx_percentile_cont(expression, percentile, centroids) - 48.0.0 :
approx_percentile_cont(percentile, centroids) WITHIN GROUP (ORDER BY expression)
- 47.0.0 :
- approx_percentile_cont_with_weight
- 47.0.0 :
approx_percentile_cont_with_weight(expression, weight, percentile) - 48.0.0 :
approx_percentile_cont_with_weight(weight, percentile) WITHIN GROUP (ORDER BY expression)
- 47.0.0 :
Describe the solution you'd like
I would like DataFusion to support both the new and old syntax for approx_percentile_cont and approx_percentile_cont_with_weight
For example, given this table:
> create table test(c1 int, c2 float) as values (1, 100), (2, 200), (5, 500), (4, 400), (3, 300), (2, 200);
For example, that means in addition to supporting the new syntax from DataFusion 48.0.0:
```sql
> SELECT approx_percentile_cont_with_weight(c1, c2, 0.95) FROM test;
+-------------------------------------------------------------------+
| approx_percentile_cont_with_weight(test.c1,test.c2,Float64(0.95)) |
+-------------------------------------------------------------------+
| 5.0 |
+-------------------------------------------------------------------+
1 row(s) fetched.
Elapsed 0.001 seconds.A user should also be able to write this (as they were able in DataFusion 47)
> SELECT approx_percentile_cont_with_weight(c1, c2, 0.95) FROM test;
+-------------------------------------------------------------------+
| approx_percentile_cont_with_weight(test.c1,test.c2,Float64(0.95)) |
+-------------------------------------------------------------------+
| 5.0 |
+-------------------------------------------------------------------+
1 row(s) fetched.And get the same answer
Describe alternatives you've considered
We could only make this change on our own fork
Additional context
Looking at other issues linked to that PR, this change seems to have caused downstream breakage for others too: