-
Notifications
You must be signed in to change notification settings - Fork 1.8k
[branch-49] fix: string_agg not respecting ORDER BY #17058
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
This PR adds new feature code. |
The issue comes from #16217, however I am not sure of the exact cause as that PR is very large.
I had found that the minimum/easiest fix would be to add the |
The `ArrayAgg` struct is stateful, therefore it must implement `AggregateUDFImpl::equals` and `hash_value` functions.
I don't know either and the naive idea to revert that change on the release branch sounds much much less safe than adding a new but small feature work in lie of a bug fix. I am therefore fine with this backport in current shape. @nuno-faria can you please make sure the build is green? |
@findepi The build error comes from the cherry-picked #17065, since |
|
I think it's best to replace usage of |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The fix looks good to me, thank you @nuno-faria, but it doesn’t fully address the change introduced in #16217.
I think you can focus on the changes in AggregateExec::try_new_with_schema() from #16217, because that PR modifies the planning results, and you are reacting to it by adapting the accumulator behavior (but I’m not sure if this approach causes any issues or actually improves the situation)
As I said, anyone can debug the point where the order requirements are calculated in AggregateExec::try_new_with_schema(), comparing the behavior before and after #16217, to see the real change that triggers this issue
--
before that PR, a SortExec emerges just before AggregateExec, and sends the sorted data. That SortExec is lost now
|
Thanks @berkaysynnada for the pointers. I did a little more digging and found that the The fix is to simply implement |
I prefer this option. There’s no need to overwhelm operators if there wouldn’t be any benefit. |
|
I am hoping we can make a release candidate with this fix in it today or tomorrow. Are we waiting on a PR for an alternate approach "The fix is to simply implement AggregateUDFImpl::reverse_expr for StringAgg" ? |
I will push the alternative solution then if no one disagrees. |
|
Fix pushed. It might also be worth thinking about pushing this fix to main, since currently the > create table t (k varchar, v int);
0 row(s) fetched.
Elapsed 0.005 seconds.
> insert into t values ('a', 2), ('b', 3), ('c', 1);
+-------+
| count |
+-------+
| 3 |
+-------+
1 row(s) fetched.
Elapsed 0.011 seconds.
> explain select string_agg(k, ',' order by v) from t;
+---------------+-------------------------------+
| plan_type | plan |
+---------------+-------------------------------+
| physical_plan | ┌───────────────────────────┐ |
| | │ AggregateExec │ |
| | │ -------------------- │ |
| | │ aggr: │ |
| | │ string_agg(t.k, ,) ORDER │ |
| | │ BY [t.v ASC NULLS LAST] │ |
| | │ │ |
| | │ mode: Single │ |
| | └─────────────┬─────────────┘ |
| | ┌─────────────┴─────────────┐ |
| | │ DataSourceExec │ |
| | │ -------------------- │ |
| | │ bytes: 296 │ |
| | │ format: memory │ |
| | │ rows: 1 │ |
| | └───────────────────────────┘ |
| | |
+---------------+-------------------------------+
1 row(s) fetched.
Elapsed 0.010 seconds.
> explain select array_agg(k order by v) from t;
+---------------+-------------------------------+
| plan_type | plan |
+---------------+-------------------------------+
| physical_plan | ┌───────────────────────────┐ |
| | │ AggregateExec │ |
| | │ -------------------- │ |
| | │ aggr │ |
| | │ │ |
| | │ mode: Single │ |
| | └─────────────┬─────────────┘ |
| | ┌─────────────┴─────────────┐ |
| | │ SortExec │ |
| | │ -------------------- │ |
| | │ v@1 ASC NULLS LAST │ |
| | └─────────────┬─────────────┘ |
| | ┌─────────────┴─────────────┐ |
| | │ DataSourceExec │ |
| | │ -------------------- │ |
| | │ bytes: 296 │ |
| | │ format: memory │ |
| | │ rows: 1 │ |
| | └───────────────────────────┘ |
| | |
+---------------+-------------------------------+ |
I don't think it's imperative to implement the I am all for improving the state of things on |
|
@findepi I think it's fine to fix 49.0 with this patch, as it is indeed a bug fix rather than an improvement (if anyone requests it to be fixed). Besides, this should be fixed on main as well in any case. @nuno-faria thank you for spotting this issue. I also recommend adding another slt test to assert the planning result. Otherwise, it’s good to go (after deciding where to land) BTW, how does #16217 affect this? Were you able to see that those queries used |
The new fix makes the PR smaller, since it only implements
Updated. I think the fix should be good to go.
They still used |
| ))) | ||
| } | ||
|
|
||
| fn reverse_expr(&self) -> datafusion_expr::ReversedUDAF { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you please add this on main first?
This would make it impossible that anyone updating from 49.0.1 to a newer DF version experiences a behavioral regression.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
PR open: #17165
findepi
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Other than https://github.com/apache/datafusion/pull/17058/files#r2272648661, this LGTM
@findepi is it ok with you to merge this PR now, or should we wait for the forward port to main to merge first? |
|
if there is urgency to merge & release 49.0.1, i am fine |
I would like to get the 49.0.1 release candidate (as there is then still a 3 day voting period before we can release) so I will merge this one in and follow along on #17165 and try and help as possible |
|
Thanks again @nuno-faria @findepi and @berkaysynnada for getting this done for the 49.0.1 release |
* fix: string_agg not respecting ORDER BY * Fix equality of parametrizable ArrayAgg function (apache#17065) The `ArrayAgg` struct is stateful, therefore it must implement `AggregateUDFImpl::equals` and `hash_value` functions. * Implement AggregateUDFImpl::equals and AggregateUDFImpl::hash_value for ArrayAgg * Implement alternative fix * Remove 'use std::any::Any' * Add sqllogictest for string_agg plan * Revert as_any to their original implementations --------- Co-authored-by: Piotr Findeisen <piotr.findeisen@gmail.com> Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org> (cherry picked from commit f05b128)
* fix: string_agg not respecting ORDER BY * Fix equality of parametrizable ArrayAgg function (apache#17065) The `ArrayAgg` struct is stateful, therefore it must implement `AggregateUDFImpl::equals` and `hash_value` functions. * Implement AggregateUDFImpl::equals and AggregateUDFImpl::hash_value for ArrayAgg * Implement alternative fix * Remove 'use std::any::Any' * Add sqllogictest for string_agg plan * Revert as_any to their original implementations --------- (cherry picked from commit f05b128) Co-authored-by: Nuno Faria <nunofpfaria@gmail.com> Co-authored-by: Piotr Findeisen <piotr.findeisen@gmail.com> Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
Which issue does this PR close?
string_aggdoes not respectORDER BYon49.0.0#17011.49.0.1(patch) #17036Rationale for this change
There was a regression in
49.0.0in thestring_aggfunction, causing it to not consider theORDER BYclauses.What changes are included in this PR?
array_aggaggregations #16625 to version 49 (I don't think the entire PR can be backported since it appears to have breaking changes).Are these changes tested?
Yes.
Are there any user-facing changes?
No.