Skip to content

[SPARK-33183][SQL][2.4] Fix Optimizer rule EliminateSorts and add a physical rule to remove redundant sorts #30194

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed

Conversation

allisonwang-db
Copy link
Contributor

Backport #30093 for branch-2.4.

What changes were proposed in this pull request?

This PR aims to fix a correctness bug in the optimizer rule EliminateSorts. It also adds a new physical rule to remove redundant sorts that cannot be eliminated in the Optimizer rule after the bugfix.

Why are the changes needed?

A global sort should not be eliminated even if its child is ordered since we don't know if its child ordering is global or local. For example, in the following scenario, the first sort shouldn't be removed because it has a stronger guarantee than the second sort even if the sort orders are the same for both sorts.

Sort(orders, global = True, ...)
  Sort(orders, global = False, ...)

Since there is no straightforward way to identify whether a node's output ordering is local or global, we should not remove a global sort even if its child is already ordered.

Does this PR introduce any user-facing change?

Yes

How was this patch tested?

Unit tests

…al rule to remove redundant sorts

This PR aims to fix a correctness bug in the optimizer rule `EliminateSorts`. It also adds a new physical rule to remove redundant sorts that cannot be eliminated in the Optimizer rule after the bugfix.

A global sort should not be eliminated even if its child is ordered since we don't know if its child ordering is global or local. For example, in the following scenario, the first sort shouldn't be removed because it has a stronger guarantee than the second sort even if the sort orders are the same for both sorts.

```
Sort(orders, global = True, ...)
  Sort(orders, global = False, ...)
```

Since there is no straightforward way to identify whether a node's output ordering is local or global, we should not remove a global sort even if its child is already ordered.

Yes

Unit tests

Closes apache#30093 from allisonwang-db/fix-sort.

Authored-by: allisonwang-db <66282705+allisonwang-db@users.noreply.github.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit 9fb4536)
Signed-off-by: allisonwang-db <66282705+allisonwang-db@users.noreply.github.com>
@SparkQA
Copy link

SparkQA commented Oct 30, 2020

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/35036/

@SparkQA
Copy link

SparkQA commented Oct 30, 2020

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/35036/

@SparkQA
Copy link

SparkQA commented Oct 30, 2020

Test build #130431 has finished for PR 30194 at commit 3bd5aff.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Copy link
Contributor

GA passed, merging to 2.4!

cloud-fan pushed a commit that referenced this pull request Oct 30, 2020
…hysical rule to remove redundant sorts

Backport #30093 for branch-2.4.

### What changes were proposed in this pull request?
This PR aims to fix a correctness bug in the optimizer rule EliminateSorts. It also adds a new physical rule to remove redundant sorts that cannot be eliminated in the Optimizer rule after the bugfix.

### Why are the changes needed?
A global sort should not be eliminated even if its child is ordered since we don't know if its child ordering is global or local. For example, in the following scenario, the first sort shouldn't be removed because it has a stronger guarantee than the second sort even if the sort orders are the same for both sorts.
```
Sort(orders, global = True, ...)
  Sort(orders, global = False, ...)
```
Since there is no straightforward way to identify whether a node's output ordering is local or global, we should not remove a global sort even if its child is already ordered.

### Does this PR introduce any user-facing change?
Yes

### How was this patch tested?
Unit tests

Closes #30194 from allisonwang-db/SPARK-33183-branch-2.4.

Authored-by: allisonwang-db <66282705+allisonwang-db@users.noreply.github.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
@cloud-fan cloud-fan closed this Oct 30, 2020
dongjoon-hyun pushed a commit that referenced this pull request Nov 19, 2020
… version

### What changes were proposed in this pull request?
This PR is a follow up for #30093 to updates the config `spark.sql.execution.removeRedundantSorts` version to 2.4.8.

### Why are the changes needed?
To update the rule version it has been backported to 2.4. #30194

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
N/A

Closes #30420 from allisonwang-db/spark-33183-follow-up.

Authored-by: allisonwang-db <66282705+allisonwang-db@users.noreply.github.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
@allisonwang-db allisonwang-db deleted the SPARK-33183-branch-2.4 branch January 19, 2024 01:22
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants