Skip to content

Conversation

wiedld
Copy link
Contributor

@wiedld wiedld commented Sep 22, 2025

Which issue does this PR close?

Rationale for this change

The schema names are kept unique using the WITHIN GROUP clause. But since we permit this clause to be missing, we need to maintain uniqueness by not shaving off the column identifier. See test case for details.

Also, note the commit with initial failure: influxdata@5cbeb1c

What changes are included in this PR?

Adds to a conditional.

Are these changes tested?

Yes.

Are there any user-facing changes?

Fixes a regression.

@github-actions github-actions bot added logical-expr Logical plan and expressions sqllogictest SQL Logic Tests (.slt) labels Sep 22, 2025
@wiedld wiedld changed the title Keep approx_percentile_cont names unique when missing an order-by Keep approx_percentile_cont schema names unique when missing an order-by Sep 22, 2025
@wiedld
Copy link
Contributor Author

wiedld commented Sep 22, 2025

Failing test is due to github flake. If someone could please initiate a restart? 🙏🏼

@alamb alamb changed the title Keep approx_percentile_cont schema names unique when missing an order-by Keep aggregate udaf schema names unique when missing an order-by Sep 23, 2025
Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @wiedld -- this clearly fixes a bug so looks good to me

I am somewhat confused about how order_by can be empty in the first place, however (see my other comments)

e 115


# using approx_percentile_cont on 2 columns with same signature
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I verified this test fails as follows without the code change


1. query failed: DataFusion error: Schema error: Schema contains duplicate unqualified field name "approx_percentile_cont(Float64(0.95))"
[SQL] SELECT c1, approx_percentile_cont(c2, 0.95) AS c2, approx_percentile_cont(c3, 0.95) AS c3 FROM aggregate_test_100 GROUP BY 1 ORDER BY 1
at /Users/andrewlamb/Software/datafusion/datafusion/sqllogictest/test_files/aggregate.slt:1840

// exclude the first function argument(= column) in ordered set aggregate function,
// because it is duplicated with the WITHIN GROUP clause in schema name.
let args = if func.is_ordered_set_aggregate() {
let args = if func.is_ordered_set_aggregate() && !order_by.is_empty() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This clearly fixes a bug so I think we can merge this. However, it seems somewhat strange to me given that my understanding of ordered set aggregate functions is that they require an ORDER BY clause 🤔

I wonder if this PR is just fixing a symptom, and the root cause is that approx_percentile_cont should actually be created with an order by expression 🤔

I tried to update the docs here

Copy link
Contributor Author

@wiedld wiedld Sep 24, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe this is because we are currently supporting the corrected syntax (which includes an order-by), as well as the older syntax which does not include an order-by in the UDAF clause itself.

Docs: https://datafusion.apache.org/user-guide/sql/aggregate_functions.html#approx-percentile-cont

New, corrected syntax:
SELECT approx_percentile_cont(0.75) WITHIN GROUP (ORDER BY column_name) FROM table_name;

Older syntax, still supported:
SELECT approx_percentile_cont(column_name, 0.75) FROM table_name;

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note that this bug occurred with customers using the older syntax, since it was shortening the schema names in such a way that they were non-unique is only the 1st argument was different.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
logical-expr Logical plan and expressions sqllogictest SQL Logic Tests (.slt)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

approx_percentile_cont query fails: "Schema contains duplicate unqualified field name
2 participants