[SPARK-32041][SQL] Fix Exchange reuse issues when subqueries are involved #28881

prakharjain09 · 2020-06-21T05:44:05Z

What changes were proposed in this pull request?

Create a single post-order rule for ReuseExchange and ReuseSubquery that traverses the plan in one post-order pass and replaces duplicated nodes with ReusedExchangeExec or ReusedSubqueryExec.

This fixes the ReusedExchangeExec reference issue, where a ReusedExchangeExec points to an Exchange that doesn't exist anywhere in the query plan.
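To make the "points to an Exchange which doesn't exist" condition concrete, here is a minimal sketch of a check for it. It is a toy model, not Spark code: `Node`, `walk`, `dangling_reuses` and the `points_to` field are hypothetical names invented for illustration, and ids are plain integers.

```python
from dataclasses import dataclass, field
from typing import Iterator, List, Optional


@dataclass
class Node:
    """Toy plan node; a stand-in, not Spark's SparkPlan."""
    name: str
    id: int
    children: List["Node"] = field(default_factory=list)
    points_to: Optional[int] = None  # set only on ReusedExchangeExec nodes


def walk(node: Node) -> Iterator[Node]:
    """Yield the node and all of its descendants."""
    yield node
    for child in node.children:
        yield from walk(child)


def dangling_reuses(root: Node) -> List[Node]:
    """Return ReusedExchangeExec nodes whose target id is not an Exchange in the plan."""
    exchange_ids = {n.id for n in walk(root) if n.name == "Exchange"}
    return [
        n for n in walk(root)
        if n.name == "ReusedExchangeExec" and n.points_to not in exchange_ids
    ]


# A reference to id 1234 when the only Exchange in the plan has id 1878.
broken = Node("SortMergeJoin", 1, [
    Node("Exchange", 1878, [Node("NewChildSubtree", 2)]),
    Node("ReusedExchangeExec", 3, points_to=1234),
])
print([n.points_to for n in dangling_reuses(broken)])  # prints [1234]
```

A plan is consistent in this sense when `dangling_reuses` returns an empty list.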

Why are the changes needed?

Currently Spark makes 3 passes over the plan to identify and replace nodes that can become ReusedExchangeExec or ReusedSubqueryExec:
Phase-1: The first is done in the ReuseExchange rule to replace Exchange with ReusedExchangeExec.
Phase-2: The second was introduced by DPP (dynamic partition pruning) in the ReuseExchange rule; it finds every InSubqueryExec, traverses the plan inside it, and replaces relevant Exchange nodes with ReusedExchangeExec.
Phase-3: The third is done in the ReuseSubquery rule to identify reusable ExecSubqueryExpression nodes and replace them with ReusedSubqueryExec.

When Phase-2 or Phase-3 changes anything in the subtree of an Exchange, the id of that Exchange changes. Sometimes this leaves another ReusedExchangeExec pointing to an Exchange that no longer exists in the plan.

Example: Suppose this is the plan after Phase-1 for a self-join of a view.

                                 SORTMERGEJOIN         
       Exchange (id=1234)                          ReusedExchangeExec (points-to-id=1234)
                      |
                 ChildSubtree

Suppose ChildSubtree has DPP applied inside it. Phase-2 will then try to convert the plan inside InSubqueryExec to reuse a broadcast exchange, and in that process the whole hierarchy of ChildSubtree also changes, i.e.

                                 SORTMERGEJOIN         
       Exchange (id=1878)                        ReusedExchangeExec (points-to-id=1234)
                      |
                NewChildSubtree

But ReusedExchangeExec (points-to-id=1234) still points to id 1234, which is no longer in the plan, so no reuse happens.
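The scenario above can be reproduced in a small toy model. This is only a sketch under assumptions: nodes are immutable, every newly constructed node gets a fresh id, and a bottom-up rewrite rebuilds each ancestor of a changed node. `Node`, `rewrite` and `phase2` are hypothetical names, not Spark APIs, and the ids will not match 1234/1878.

```python
import itertools
from dataclasses import dataclass, field
from typing import Callable, Optional, Tuple

_ids = itertools.count(1)


@dataclass(frozen=True)
class Node:
    """Immutable toy plan node; each new instance receives a fresh id."""
    name: str
    children: Tuple["Node", ...] = ()
    points_to: Optional[int] = None  # set only on ReusedExchangeExec nodes
    id: int = field(default_factory=lambda: next(_ids))


def rewrite(node: Node, fn: Callable[[Node], Node]) -> Node:
    """Bottom-up rewrite: a parent is rebuilt (new id) if any child changed."""
    new_children = tuple(rewrite(c, fn) for c in node.children)
    if any(a is not b for a, b in zip(new_children, node.children)):
        node = Node(node.name, new_children, node.points_to)
    return fn(node)


# Plan after Phase-1: the right side reuses the left Exchange by id.
exchange = Node("Exchange", (Node("ChildSubtree"),))
reused = Node("ReusedExchangeExec", points_to=exchange.id)
join = Node("SortMergeJoin", (exchange, reused))


def phase2(n: Node) -> Node:
    """Stand-in for the DPP-related rewrite that changes ChildSubtree."""
    return Node("NewChildSubtree") if n.name == "ChildSubtree" else n


new_exchange, still_reused = rewrite(join, phase2).children
# The Exchange was rebuilt with a new id, but the reference kept the old one.
print(new_exchange.id != exchange.id, still_reused.points_to == exchange.id)
```

After the later rewrite, the only Exchange left in the plan has the new id, so the old reference has no target.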

This PR fixes the issue by merging Phase-1, Phase-2 and Phase-3 into a single post-order traversal.
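As a rough illustration of why one post-order pass avoids stale references, here is a toy sketch. It is not the code in this PR: `Node`, `Reused`, `REUSABLE` and `reuse_all` are hypothetical names, subquery plans are modelled as an extra tuple on each node, and structural equality stands in for plan canonicalization.

```python
from dataclasses import dataclass
from typing import Dict

# Toy node names mapped to the name of their "reused" wrapper (assumption).
REUSABLE = {"Exchange": "ReusedExchangeExec", "Subquery": "ReusedSubqueryExec"}


@dataclass(frozen=True)
class Node:
    name: str
    children: tuple = ()
    subqueries: tuple = ()  # plans nested in expressions, e.g. DPP filters


@dataclass(frozen=True)
class Reused:
    name: str
    target: Node  # the node that stays in the final plan


def reuse_all(root: Node) -> Node:
    """Single post-order pass handling exchanges and subqueries together."""
    seen: Dict[Node, Node] = {}  # original form -> rewritten node kept in the plan

    def visit(node: Node):
        # Children and nested subquery plans are finalized first...
        rebuilt = Node(
            node.name,
            tuple(visit(c) for c in node.children),
            tuple(visit(s) for s in node.subqueries),
        )
        # ...so a node is registered only once its subtree can no longer change.
        if node.name in REUSABLE:
            if node in seen:
                return Reused(REUSABLE[node.name], seen[node])
            seen[node] = rebuilt
        return rebuilt

    return visit(root)


dim = Node("Exchange", (Node("Scan dim"),))
fact = Node("Scan fact", subqueries=(Node("Subquery", (dim,)),))
side = Node("Exchange", (fact,))
left, right = reuse_all(Node("SortMergeJoin", (side, side))).children
print(right.name, right.target is left)  # prints ReusedExchangeExec True
```

Because the surviving Exchange is registered after its whole subtree (including subquery plans) has been rewritten, the reused node on the right side refers to exactly the node that ends up in the plan.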

Does this PR introduce any user-facing change?

No

How was this patch tested?

Added UTs.

Also tried the fix on TPCDS 1000 scale with Spark 2.4.5 + DPP backported.

Query      Time taken before   Time taken after   Improvement (%)
query14a   166991              129895             22.21
query14b   168852              114782             32.02
query23a   656295              495019             24.57
query23b   604754              414849             31.4
query47    53506               39816              25.59
query57    37825               29619              21.69
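The Improvement column appears to be the percentage reduction in time, (before - after) / before * 100. A short sketch that recomputes it from the two timing columns, where `timings` and `improvement_pct` are names made up for this check:

```python
# (time before, time after) per query, copied from the table above.
timings = {
    "query14a": (166991, 129895),
    "query14b": (168852, 114782),
    "query23a": (656295, 495019),
    "query23b": (604754, 414849),
    "query47": (53506, 39816),
    "query57": (37825, 29619),
}


def improvement_pct(before: int, after: int) -> float:
    """Percentage reduction in time relative to the 'before' measurement."""
    return round((before - after) / before * 100, 2)


for query, (before, after) in timings.items():
    print(query, improvement_pct(before, after))
```

The recomputed values match the reported column to two decimal places.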

AmplabJenkins · 2020-06-21T05:49:13Z

Can one of the admins verify this patch?

peter-toth · 2020-06-21T09:19:17Z

@prakharjain09, it seems we both opened PRs (#28885 is mine) to fix the issue with exchange and subquery reuse. It looks like we came to the same conclusion that the separate reuse rules need to be unified. My PR does a bit more than that and does the combined reuse in a slightly different way than yours. I also see that you opened the ticket SPARK-32041 for the issue. If you don't mind, I would add that ticket to my PR as well.

HyukjinKwon · 2020-06-22T02:52:38Z

Closing as a dup

prakharjain09 · 2020-06-22T08:15:37Z

@peter-toth sure. Let's collaborate on #28885 to fix SPARK-32041/SPARK-28940.

Fix ReuseExchange issues when subqueries are involved

937c90b

probot-autolabeler bot added the SQL label Jun 21, 2020

peter-toth mentioned this pull request Jun 21, 2020

[SPARK-29375][SPARK-28940][SPARK-32041][SQL] Whole plan exchange and subquery reuse #28885

Closed

HyukjinKwon closed this Jun 22, 2020
