-
Notifications
You must be signed in to change notification settings - Fork 1.7k
Keep aggregate udaf schema names unique when missing an order-by #17731
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Keep aggregate udaf schema names unique when missing an order-by #17731
Conversation
approx_percentile_cont
names unique when missing an order-byapprox_percentile_cont
schema names unique when missing an order-by
Failing test is due to github flake. If someone could please initiate a restart? 🙏🏼 |
approx_percentile_cont
schema names unique when missing an order-byThere was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you @wiedld -- this clearly fixes a bug so looks good to me
I am somewhat confused about how order_by
can be empty in the first place, however (see my other comments)
e 115 | ||
|
||
|
||
# using approx_percentile_cont on 2 columns with same signature |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I verified this test fails as follows without the code change
1. query failed: DataFusion error: Schema error: Schema contains duplicate unqualified field name "approx_percentile_cont(Float64(0.95))"
[SQL] SELECT c1, approx_percentile_cont(c2, 0.95) AS c2, approx_percentile_cont(c3, 0.95) AS c3 FROM aggregate_test_100 GROUP BY 1 ORDER BY 1
at /Users/andrewlamb/Software/datafusion/datafusion/sqllogictest/test_files/aggregate.slt:1840
// exclude the first function argument(= column) in ordered set aggregate function, | ||
// because it is duplicated with the WITHIN GROUP clause in schema name. | ||
let args = if func.is_ordered_set_aggregate() { | ||
let args = if func.is_ordered_set_aggregate() && !order_by.is_empty() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This clearly fixes a bug so I think we can merge this. However, it seems somewhat strange to me given that my understanding of ordered set aggregate functions is that they require an ORDER BY
clause 🤔
I wonder if this PR is just fixing a symptom, and the root cause is that approx_percentile_cont should actually be created with an order by expression 🤔
I tried to update the docs here
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe this is because we are currently supporting the corrected syntax (which includes an order-by), as well as the older syntax which does not include an order-by in the UDAF clause itself.
Docs: https://datafusion.apache.org/user-guide/sql/aggregate_functions.html#approx-percentile-cont
New, corrected syntax:
SELECT approx_percentile_cont(0.75) WITHIN GROUP (ORDER BY column_name) FROM table_name;
Older syntax, still supported:
SELECT approx_percentile_cont(column_name, 0.75) FROM table_name;
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Note that this bug occurred with customers using the older syntax, since it was shortening the schema names in such a way that they were non-unique is only the 1st argument was different.
Which issue does this PR close?
approx_percentile_cont
query fails: "Schema contains duplicate unqualified field name #17730Rationale for this change
The schema names are kept unique using the
WITHIN GROUP clause
. But since we permit this clause to be missing, we need to maintain uniqueness by not shaving off the column identifier. See test case for details.Also, note the commit with initial failure: influxdata@5cbeb1c
What changes are included in this PR?
Adds to a conditional.
Are these changes tested?
Yes.
Are there any user-facing changes?
Fixes a regression.