-
Notifications
You must be signed in to change notification settings - Fork 1.8k
Removed incorrect union check in enforce_sorting and updated tests #18661
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Removed incorrect union check in enforce_sorting and updated tests #18661
Conversation
|
Cc: @NGA-TRAN - leaving this in draft while I finish writing documentation on the enforce sorting rule, think this is the fix though 😄 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
@rgehan : this will fix your request #18380
Can you help review it?
@gene-bordegaray : You may not need any long document explaining this unless you really want to do so. a short description is good enough
| AggregateExec: mode=Partial, gby=[id@0 as id], aggr=[], ordering_mode=Sorted | ||
| UnionExec | ||
| DataSourceExec: file_groups={1 group: [[{testdata}/alltypes_tiny_pages.parquet]]}, projection=[id], output_ordering=[id@0 ASC NULLS LAST], file_type=parquet | ||
| SortExec: expr=[id@0 ASC NULLS LAST], preserve_partitioning=[false] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great push down
| SortExec: expr=[nullable_col@0 ASC], preserve_partitioning=[false] | ||
| DataSourceExec: file_groups={1 group: [[x]]}, projection=[nullable_col, non_nullable_col], file_type=parquet | ||
| SortExec: expr=[nullable_col@0 ASC], preserve_partitioning=[false] | ||
| DataSourceExec: file_groups={1 group: [[x]]}, projection=[nullable_col, non_nullable_col], file_type=parquet |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you explain why this is no longer needed? Because the input and output are now the same?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I was confused at first but this is what it means.
| || !plan.maintains_input_order()[idx] | ||
| || is_union(plan) | ||
| { | ||
| } else if physical_ordering.is_none() || !plan.maintains_input_order()[idx] { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Wow. This is simple.
I wonder why this is_union is here in the first place 🤔
All the tests pass and @rgehan's reproducer now also pass means this is likely the right work
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes I was very confused as well when stepping through the logic. The only thing I could think of is if it was thought that UnionExec was never able to maintain input order but it is implemented for the operator.
Interesting though since this is quoted from the documentation online about the operator: "UnionExec combines multiple inputs with the same schema by concatenating the partitions. It does not mix or copy data within or across partitions. Thus if the input partitions are sorted, the output partitions of the union are also sorted."
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Very nice analysis
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Wow. This is simple. I wonder why this
is_unionis here in the first place 🤔
I guess when implementing optimizer rules, it’s more common to be conservative than comprehensive if not sure, to avoid bugs.
Ok sounds good. More doing so for my own understanding and to give to others but shouldn't hold back this PR 😄 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, although I'm not knowledgeable enough to understand the impact of dropping this is_union() condition.
In addition to fixing the tests I previously added, I can confirm it also fixes my production use-cases 👍
2010YOUY01
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Amazing work, thank you!
| || !plan.maintains_input_order()[idx] | ||
| || is_union(plan) | ||
| { | ||
| } else if physical_ordering.is_none() || !plan.maintains_input_order()[idx] { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Wow. This is simple. I wonder why this
is_unionis here in the first place 🤔
I guess when implementing optimizer rules, it’s more common to be conservative than comprehensive if not sure, to avoid bugs.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you @gene-bordegaray @rgehan, and @NGA-TRAN
I reviewed this plan and code carefully and I agree there is no reason to force the sort after a union if the inputs are already sorted.
| assert_snapshot!( | ||
| union_with_mix_of_presorted_and_explicitly_resorted_inputs_impl(false).await?, | ||
| @r#" | ||
| @r" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I reviewed this plan carefully and it looks good to me
Which issue does this PR close?
UnionExecinputs instead of introducing a suboptimal top-level sort #18380.Rationale for this change
There was an overly aggressive condition enforce_sorting rule was not handling UnionExec correctly. This conditions assumed that Unions did not maintain order causing SortExec nodes to be removed and then eventually added at a higher level, less efficiently.
What changes are included in this PR?
I removed this condition that now has changed the logic to properly take into account UnionExec's ability to maintain input ordering.
Are these changes tested?
Yes, previously failing tests were ignored and now are unignored and passing.
Are there any user-facing changes?
No