Skip to content

[SPARK-32041][SQL] Fix Exchange reuse issues when subqueries are involved #28881

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed

Conversation

prakharjain09
Copy link
Contributor

@prakharjain09 prakharjain09 commented Jun 21, 2020

What changes were proposed in this pull request?

Create a single post-order rule for ReuseExchange and ReuseSubquery which traverses the plan in 1 single post order and replaces duplicated nodes with ReusedExchangeExec, ReuseSubqueryExec.

This fixes the ReusedExchangeExec Reference issue where a ReusedExchangeExec points to an Exchange which doesn't exist in entire query plan.

Why are the changes needed?

Currently Spark do 3 iterations on plan to identify and replace nodes which can be ReusedExchangeExec and ReusedSubqueryExec:
Phase-1: First one is done in ReuseExchange rule to replace Exchange with ReusedExchangeExec.
Phase-2: Seconds one is introduces by DPP in ReuseExchange rule to find out all the InSubqueryExec and traverse the plans inside it and replace relevant Exchange with ReusedSubqueryExec.
Phase-3: Third we do in ReuseSubquery rule to identify ExecSubqueryExpression which are reusable and replace them with ReuseSubqueryExec.

When any change is done by Phase-2/Phase-3 in a subtree of Exchange, then the id of exchange will change. and sometimes this leads to another ReusedExchangeExec pointing to Exchange which doesn't exist in plan.

Example: Suppose this is the plan after Phase-1 when we try to do self join of a view.

                                 SORTMERGEJOIN         
       Exchange (id=1234)                          ReusedExchangeExec (points-to-id=1234)
                      |
                 ChildSubtree

Suppose ChildSubtree has DPP applied inside it. So Phase-2 will try to convert plan inside InSubqueryExec to use ReuseBroadcast and in that process, complete hierarchy of ChildSubtree will also change. i.e.

                                 SORTMERGEJOIN         
       Exchange (id=1878)                        ReusedExchangeExec (points-to-id=1234)
                      |
                NewChildSubtree

But the ReusedExchangeExec (points-to-id=1234) is still pointing to id 1234 and so no reuse will happen.

This PR fixes this issue by merging Phase1,Phase2 and Phase3 into a single post order traversal.

Does this PR introduce any user-facing change?

No

How was this patch tested?

Added UTs.

Also tried the fix on TPCDS 1000 scale with Spark 2.4.5 + DPP backported.

  Time taken before Time taken after Improvement
query14a 166991 129895 22.21
query14b 168852 114782 32.02
query23a 656295 495019 24.57
query23b 604754 414849 31.4
query47 53506 39816 25.59
query57 37825 29619 21.69

@AmplabJenkins
Copy link

Can one of the admins verify this patch?

@peter-toth
Copy link
Contributor

@prakharjain09 , it seems we both opened PRs (#28885 is mine) to fix the issue with exchange and subquery reuse. It looks like we came to the same conclusion that the separate reuse rules needs to be unified. My PR does a bit more that that and actually does the combined reuse in a bit different way than yours. I also see that you opened the ticket SPARK-32041 for the issue. If you don't mind I would add that ticket to my PR as well.

@HyukjinKwon
Copy link
Member

Closing as a dup

@prakharjain09
Copy link
Contributor Author

@peter-toth sure. Lets collaborate on #28885 to fix SPARK-32041/SPARK-28940.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants