Skip to content

Conversation

@nuno-faria
Copy link
Contributor

@nuno-faria nuno-faria commented Aug 6, 2025

Which issue does this PR close?

Rationale for this change

There was a regression in 49.0.0 in the string_agg function, causing it to not consider the ORDER BY clauses.

What changes are included in this PR?

Are these changes tested?

Yes.

Are there any user-facing changes?

No.

@github-actions github-actions bot added sqllogictest SQL Logic Tests (.slt) functions Changes to functions implementation labels Aug 6, 2025
@findepi
Copy link
Member

findepi commented Aug 7, 2025

This PR adds new feature code.
Would be good to understand the source of regression #17011.
Is there a smaller & safer fix we could make in the 49 maintenance branch?

@nuno-faria
Copy link
Contributor Author

Would be good to understand the source of regression #17011.

The issue comes from #16217, however I am not sure of the exact cause as that PR is very large.

Is there a smaller & safer fix we could make in the 49 maintenance branch?

I had found that the minimum/easiest fix would be to add the sort method and self.sort() in the execute method, but decided to backport is_input_pre_ordered just in case.

@alamb alamb changed the title fix: string_agg not respecting ORDER BY [branch-49] fix: string_agg not respecting ORDER BY Aug 7, 2025
The `ArrayAgg` struct is stateful, therefore it must implement
`AggregateUDFImpl::equals` and `hash_value` functions.
@findepi
Copy link
Member

findepi commented Aug 8, 2025

The issue comes from #16217, however I am not sure of the exact cause as that PR is very large.

I don't know either and the naive idea to revert that change on the release branch sounds much much less safe than adding a new but small feature work in lie of a bug fix.

I am therefore fine with this backport in current shape. @nuno-faria can you please make sure the build is green?

@nuno-faria
Copy link
Contributor Author

I am therefore fine with this backport in current shape. @nuno-faria can you please make sure the build is green?

@findepi The build error comes from the cherry-picked #17065, since udf_equals_hash! does not exist in version 49. Should we backport udf_equals_hash! to 49.0.1 or cancel cherry picking #17065?

@findepi
Copy link
Member

findepi commented Aug 8, 2025

I think it's best to replace usage of udf_equals_hash with a direct implementation of equals and hash_value, delegating to Eq and Hash that are derived on the struct.

Copy link
Contributor

@berkaysynnada berkaysynnada left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The fix looks good to me, thank you @nuno-faria, but it doesn’t fully address the change introduced in #16217.

I think you can focus on the changes in AggregateExec::try_new_with_schema() from #16217, because that PR modifies the planning results, and you are reacting to it by adapting the accumulator behavior (but I’m not sure if this approach causes any issues or actually improves the situation)

As I said, anyone can debug the point where the order requirements are calculated in AggregateExec::try_new_with_schema(), comparing the behavior before and after #16217, to see the real change that triggers this issue

--

before that PR, a SortExec emerges just before AggregateExec, and sends the sorted data. That SortExec is lost now

@nuno-faria
Copy link
Contributor Author

Thanks @berkaysynnada for the pointers. I did a little more digging and found that the get_finer_aggregate_exprs_requirement function returns a different value for string_agg and array_agg. The source of the difference is that AggregateUDFImpl::reverse_expr is not defined for string_agg. This means the SortExec will not be created for string_agg, and that is still true in the main branch of datafusion.

# string_agg
| physical_plan after CombinePartialFinalAggregate           | OutputRequirementExec: order_by=[], dist_by=Unspecified
|                                                            |   AggregateExec: mode=Single, gby=[], aggr=[string_agg(t.k,Utf8(",")) ORDER BY [t.v ASC NULLS LAST]]
|                                                            |     DataSourceExec: partitions=1, partition_sizes=[1]
|                                                            |
| physical_plan after EnforceSorting                         | SAME TEXT AS ABOVE


# array_agg
| physical_plan after CombinePartialFinalAggregate           | OutputRequirementExec: order_by=[], dist_by=Unspecified
|                                                            |   AggregateExec: mode=Single, gby=[], aggr=[array_agg(t.k) ORDER BY [t.v ASC NULLS LAST]]
|                                                            |     DataSourceExec: partitions=1, partition_sizes=[1]
|                                                            |
| physical_plan after EnforceSorting                         | OutputRequirementExec: order_by=[], dist_by=Unspecified
|                                                            |   AggregateExec: mode=Single, gby=[], aggr=[array_agg(t.k) ORDER BY [t.v ASC NULLS LAST]]
|                                                            |     SortExec: expr=[v@1 ASC NULLS LAST], preserve_partitioning=[false]
|                                                            |       DataSourceExec: partitions=1, partition_sizes=[1]

The fix is to simply implement AggregateUDFImpl::reverse_expr for StringAgg. @alamb @findepi what do you think of this approach? With it there is no need cherry pick a new feature into 49.0.1.

@berkaysynnada
Copy link
Contributor

The fix is to simply implement AggregateUDFImpl::reverse_expr for StringAgg

I prefer this option. There’s no need to overwhelm operators if there wouldn’t be any benefit.

@alamb
Copy link
Contributor

alamb commented Aug 12, 2025

I am hoping we can make a release candidate with this fix in it today or tomorrow. Are we waiting on a PR for an alternate approach "The fix is to simply implement AggregateUDFImpl::reverse_expr for StringAgg" ?

@alamb alamb mentioned this pull request Aug 12, 2025
5 tasks
@nuno-faria
Copy link
Contributor Author

I am hoping we can make a release candidate with this fix in it today or tomorrow. Are we waiting on a PR for an alternate approach "The fix is to simply implement AggregateUDFImpl::reverse_expr for StringAgg" ?

* [Release DataFusion `49.0.1` (patch) #17036](https://github.com/apache/datafusion/issues/17036)

I will push the alternative solution then if no one disagrees.

@nuno-faria
Copy link
Contributor Author

Fix pushed. It might also be worth thinking about pushing this fix to main, since currently the SortExec is also missing there for string_agg:

> create table t (k varchar, v int);
0 row(s) fetched.
Elapsed 0.005 seconds.

> insert into t values ('a', 2), ('b', 3), ('c', 1);
+-------+
| count |
+-------+
| 3     |
+-------+
1 row(s) fetched.
Elapsed 0.011 seconds.

> explain select string_agg(k, ',' order by v) from t;
+---------------+-------------------------------+
| plan_type     | plan                          |
+---------------+-------------------------------+
| physical_plan | ┌───────────────────────────┐ |
|               | │       AggregateExec       │ |
|               | │    --------------------   │ |
|               | │           aggr:           │ |
|               | │  string_agg(t.k, ,) ORDER │ |
|               | │   BY [t.v ASC NULLS LAST] │ |
|               | │                           │ |
|               | │        mode: Single       │ |
|               | └─────────────┬─────────────┘ |
|               | ┌─────────────┴─────────────┐ |
|               | │       DataSourceExec      │ |
|               | │    --------------------   │ |
|               | │         bytes: 296        │ |
|               | │       format: memory      │ |
|               | │          rows: 1          │ |
|               | └───────────────────────────┘ |
|               |                               |
+---------------+-------------------------------+
1 row(s) fetched.
Elapsed 0.010 seconds.

> explain select array_agg(k order by v) from t;
+---------------+-------------------------------+
| plan_type     | plan                          |
+---------------+-------------------------------+
| physical_plan | ┌───────────────────────────┐ |
|               | │       AggregateExec       │ |
|               | │    --------------------   │ |
|               | │            aggr           │ |
|               | │                           │ |
|               | │        mode: Single       │ |
|               | └─────────────┬─────────────┘ |
|               | ┌─────────────┴─────────────┐ |
|               | │          SortExec         │ |
|               | │    --------------------   │ |
|               | │     v@1 ASC NULLS LAST    │ |
|               | └─────────────┬─────────────┘ |
|               | ┌─────────────┴─────────────┐ |
|               | │       DataSourceExec      │ |
|               | │    --------------------   │ |
|               | │         bytes: 296        │ |
|               | │       format: memory      │ |
|               | │          rows: 1          │ |
|               | └───────────────────────────┘ |
|               |                               |
+---------------+-------------------------------+

@findepi findepi requested a review from berkaysynnada August 12, 2025 11:26
@findepi
Copy link
Member

findepi commented Aug 12, 2025

Fix pushed. It might also be worth thinking about pushing this fix to main, since currently the SortExec is also missing there for string_agg:

I don't think it's imperative to implement the reverse_expr for string_agg on the maintenance branch. Is it required here, @berkaysynnada ?
I would feel more comfortable approving a smaller PR, especially when it targets a maintenance branch.

I am all for improving the state of things on main. I would prefer improvements such as adding reverse_expr for string_agg are reviewed and merged on main first, and then backported to maintenance branch.

@berkaysynnada
Copy link
Contributor

@findepi I think it's fine to fix 49.0 with this patch, as it is indeed a bug fix rather than an improvement (if anyone requests it to be fixed). Besides, this should be fixed on main as well in any case.

@nuno-faria thank you for spotting this issue. I also recommend adding another slt test to assert the planning result. Otherwise, it’s good to go (after deciding where to land)

BTW, how does #16217 affect this? Were you able to see that those queries used array_agg before, and after that PR they now use string_agg? I’m a bit confused about that.

@nuno-faria
Copy link
Contributor Author

I would feel more comfortable approving a smaller PR, especially when it targets a maintenance branch.

The new fix makes the PR smaller, since it only implements AggregateUDFImpl::reverse_expr for StringAgg.

I also recommend adding another slt test to assert the planning result. Otherwise, it’s good to go (after deciding where to land)

Updated. I think the fix should be good to go.

BTW, how does #16217 affect this? Were you able to see that those queries used array_agg before, and after that PR they now use string_agg? I’m a bit confused about that.

They still used string_agg, but the SortExec was dropped after #16217. I just used array_agg as a comparison since the order on that function still worked after #16217.

)))
}

fn reverse_expr(&self) -> datafusion_expr::ReversedUDAF {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you please add this on main first?
This would make it impossible that anyone updating from 49.0.1 to a newer DF version experiences a behavioral regression.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PR open: #17165

Copy link
Member

@findepi findepi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@alamb
Copy link
Contributor

alamb commented Aug 13, 2025

Other than https://github.com/apache/datafusion/pull/17058/files#r2272648661, this LGTM

@findepi is it ok with you to merge this PR now, or should we wait for the forward port to main to merge first?

@findepi
Copy link
Member

findepi commented Aug 13, 2025

if there is urgency to merge & release 49.0.1, i am fine
if there isn't, I'd like to merge to main first.

@alamb
Copy link
Contributor

alamb commented Aug 14, 2025

if there is urgency to merge & release 49.0.1, i am fine if there isn't, I'd like to merge to main first.

I would like to get the 49.0.1 release candidate (as there is then still a 3 day voting period before we can release) so I will merge this one in and follow along on #17165 and try and help as possible

@alamb alamb merged commit f05b128 into apache:branch-49 Aug 14, 2025
27 checks passed
@alamb
Copy link
Contributor

alamb commented Aug 14, 2025

Thanks again @nuno-faria @findepi and @berkaysynnada for getting this done for the 49.0.1 release

LiaCastaneda pushed a commit to DataDog/datafusion that referenced this pull request Aug 14, 2025
* fix: string_agg not respecting ORDER BY

* Fix equality of parametrizable ArrayAgg function (apache#17065)

The `ArrayAgg` struct is stateful, therefore it must implement
`AggregateUDFImpl::equals` and `hash_value` functions.

* Implement AggregateUDFImpl::equals and AggregateUDFImpl::hash_value for ArrayAgg

* Implement alternative fix

* Remove 'use std::any::Any'

* Add sqllogictest for string_agg plan

* Revert as_any to their original implementations

---------

Co-authored-by: Piotr Findeisen <piotr.findeisen@gmail.com>
Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
(cherry picked from commit f05b128)
@nuno-faria nuno-faria deleted the fix_string_agg_regression branch August 14, 2025 12:46
LiaCastaneda added a commit to DataDog/datafusion that referenced this pull request Aug 14, 2025
* fix: string_agg not respecting ORDER BY

* Fix equality of parametrizable ArrayAgg function (apache#17065)

The `ArrayAgg` struct is stateful, therefore it must implement
`AggregateUDFImpl::equals` and `hash_value` functions.

* Implement AggregateUDFImpl::equals and AggregateUDFImpl::hash_value for ArrayAgg

* Implement alternative fix

* Remove 'use std::any::Any'

* Add sqllogictest for string_agg plan

* Revert as_any to their original implementations

---------



(cherry picked from commit f05b128)

Co-authored-by: Nuno Faria <nunofpfaria@gmail.com>
Co-authored-by: Piotr Findeisen <piotr.findeisen@gmail.com>
Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

functions Changes to functions implementation sqllogictest SQL Logic Tests (.slt)

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants