
[SPARK-42660][SQL] Infer filters for Join produced by IN and EXISTS clause (RewritePredicateSubquery rule) #40266


Closed
wants to merge 6 commits

Conversation

mskapilks
Contributor

What changes were proposed in this pull request?

We should run InferFiltersFromConstraints again after the RewritePredicateSubquery rule. RewritePredicateSubquery rewrites IN and EXISTS queries to LEFT SEMI/LEFT ANTI joins, but we don't infer filters for these newly generated joins. On TPCH 1TB q21 we noticed that inferring filters for these new joins eliminates one lineitem table scan, because a ReusedExchange gets introduced. Previously, due to a mismatch in filter predicates, reuse was not happening.
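
For illustration, a minimal runnable sketch of the kind of query involved (the data, view names, and object name here are made up for this example; only the Spark APIs are real):

import org.apache.spark.sql.SparkSession

object RewriteDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[1]").appName("demo").getOrCreate()
    import spark.implicits._

    Seq((1, "a"), (2, "b")).toDF("key", "value").createOrReplaceTempView("t1")
    Seq(1, 3).toDF("a").createOrReplaceTempView("t2")

    val df = spark.sql("SELECT * FROM t1 WHERE key IN (SELECT a FROM t2)")
    // The optimized plan contains "Join LeftSemi, (key = a)" produced by
    // RewritePredicateSubquery; the idea of this PR is to also run
    // InferFiltersFromConstraints on such joins after the rewrite.
    println(df.queryExecution.optimizedPlan)
    spark.stop()
  }
}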

Why are the changes needed?

Can improve query performance.

Does this PR introduce any user-facing change?

No

How was this patch tested?

PlanStability test

@github-actions github-actions bot added the SQL label Mar 3, 2023
@mskapilks
Contributor Author

TPCH q21 plan change

Before / After plan comparison (screenshots omitted)

@mskapilks mskapilks changed the title [SPARK-42660] Infer filters for Join produced by IN and EXISTS clause (RewritePredicateSubquery rule) [SPARK-42660][SQL] Infer filters for Join produced by IN and EXISTS clause (RewritePredicateSubquery rule) Mar 3, 2023
@mskapilks
Contributor Author

cc @cloud-fan

@mskapilks
Contributor Author

cc: @wangyum @peter-toth

@peter-toth
Contributor

This change makes sense to me and the new plans look OK to me.
However, InferFiltersFromConstraints seemingly has a dedicated place in the optimizer: there are two special batches, Operator Optimization before Inferring Filters and Operator Optimization after Inferring Filters, placed before and after the rule to make sure the inferred filters are optimized. It also seems like the RewriteSubquery batch slowly becomes larger and larger with rules from those batches (see SPARK-39511, SPARK-22662, SPARK-36280), and now you want to add InferFiltersFromConstraints too. So I wonder if RewritePredicateSubquery is at the right place, or what else would make sense to be executed after RewritePredicateSubquery? Maybe rerunning a full operatorOptimizationBatch would make sense, despite the cost?

@wangyum
Member

wangyum commented Mar 10, 2023

I had a change like this before: #22778.

@peter-toth
Contributor

I had a change like this before: #22778.

Ah, OK, thanks @wangyum! It looks like the very same discussion has come up before: #22778 (comment)

@mskapilks
Contributor Author

@wangyum @peter-toth Thanks for pointing out the previous attempts.

It does seem that moving the RewritePredicateSubquery rule is the right way, so that in the future we don't add any more rules to that batch (RewriteSubquery).

In PR #17520 they tried to put RewritePredicateSubquery right after the Subquery batch (of OptimizeSubqueries), so that operatorOptimizationBatch runs after it. They also added a rule to push LeftSemi/LeftAnti through joins, but that was added in 3.0 by SPARK-19712. So now we only need to change the rule's position.

If this seems right to you, I can update this PR to move RewritePredicateSubquery after the Subquery batch.

@peter-toth
Contributor

Looks like there are a few failures after moving the rule (22e7886). @mskapilks, do you think you can look into those failures?

@mskapilks
Contributor Author

Looks like there are a few failures after moving the rule (22e7886). @mskapilks, do you think you can look into those failures?

Yup, I am working on them. I had a wrong SPARK_HOME setup, so I missed the plan changes.

@@ -1158,12 +1158,12 @@ class JoinSuite extends QueryTest with SharedSparkSession with AdaptiveSparkPlan
     var joinExec = assertJoin((
       "select * from testData where key not in (select a from testData2)",
       classOf[BroadcastHashJoinExec]))
-    assert(joinExec.asInstanceOf[BroadcastHashJoinExec].isNullAwareAntiJoin)
+    assert(!joinExec.asInstanceOf[BroadcastHashJoinExec].isNullAwareAntiJoin)
Contributor Author

These two queries don't need a null-aware anti join (NAAJ) now due to more inferred filters.

Contributor

Can you please elaborate on this a bit more?

Contributor Author

Plan for this query before this change:

Join LeftAnti, ((key#13 = a#23) OR isnull((key#13 = a#23)))
:- SerializeFromObject [knownnotnull(assertnotnull(input[0, org.apache.spark.sql.test.SQLTestData$TestData, true])).key AS key#13, staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, knownnotnull(assertnotnull(input[0, org.apache.spark.sql.test.SQLTestData$TestData, true])).value, true, false, true) AS value#14]
:  +- ExternalRDD [obj#12]
+- SerializeFromObject [knownnotnull(assertnotnull(input[0, org.apache.spark.sql.test.SQLTestData$TestData2, true])).a AS a#23]
   +- ExternalRDD [obj#22]

New plan:

Join LeftAnti, (key#13 = a#23)
:- SerializeFromObject [knownnotnull(assertnotnull(input[0, org.apache.spark.sql.test.SQLTestData$TestData, true])).key AS key#13, staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, knownnotnull(assertnotnull(input[0, org.apache.spark.sql.test.SQLTestData$TestData, true])).value, true, false, true) AS value#14]
:  +- ExternalRDD [obj#12]
+- SerializeFromObject [knownnotnull(assertnotnull(input[0, org.apache.spark.sql.test.SQLTestData$TestData2, true])).a AS a#23]
   +- ExternalRDD [obj#22]

The isnull((key#13 = a#23)) condition got removed by the NullPropagation rule (as now all optimization rules run after the subquery rewrite).

So now the join does not get converted to a null-aware anti join, as that only happens when a condition of the shape LeftAnti(condition: Or(EqualTo(a=b), IsNull(EqualTo(a=b)))) exists (code).
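
For reference, a simplified, compilable paraphrase of that condition shape (a sketch assuming spark-catalyst on the classpath; the helper name is made up, and the real matcher in Spark is more involved):

import org.apache.spark.sql.catalyst.expressions.{EqualTo, Expression, IsNull, Or}

object NaajShape {
  // Simplified: true only for conditions of the shape
  // Or(EqualTo(l, r), IsNull(EqualTo(l, r))), the shape that single-column
  // null-aware anti join planning looks for. Once NullPropagation drops the
  // IsNull branch (legal here because key and a are non-nullable), the
  // condition no longer matches this shape.
  def looksLikeNaajCondition(cond: Expression): Boolean = cond match {
    case Or(e1 @ EqualTo(_, _), IsNull(e2 @ EqualTo(_, _))) => e1.semanticEquals(e2)
    case _ => false
  }
}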

Contributor

Hm, then I think we need to fix the test query (and not the expected result), as NOT IN can't be rewritten to a simple (not null-aware) BroadcastHashJoinExec if we don't know key's and a's nullability. I think the problem here is that we use TestData and TestData2, where key and a are Ints and not Integers.
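
To make the nullability point concrete, a small hypothetical demo (PrimKey and BoxedKey are made-up names, not Spark's TestData classes):

import org.apache.spark.sql.SparkSession

// A primitive Int field yields nullable = false in the derived schema,
// while java.lang.Integer yields nullable = true. Only the nullable case
// forces the null-aware anti join shape for NOT IN.
case class PrimKey(key: Int)
case class BoxedKey(key: java.lang.Integer)

object NullabilityDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[1]").getOrCreate()
    import spark.implicits._
    println(Seq(PrimKey(1)).toDS().schema)  // key: IntegerType, nullable = false
    println(Seq(BoxedKey(1)).toDS().schema) // key: IntegerType, nullable = true
    spark.stop()
  }
}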

+- * ColumnarToRow (42)
+- Scan parquet spark_catalog.default.customer_demographics (41)
+- * Filter (46)
+- * SortMergeJoin ExistenceJoin(exists#1) (45)
Contributor Author

Seems pushdown is not happening? Need to check this.

Contributor Author

Seems PushLeftSemiLeftAntiThroughJoin / PushDownLeftSemiAntiJoin don't consider ExistenceJoin. Might need to update these rules, or do predicate pushdown before the subquery rewrite (which may not be ideal)?
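
To illustrate the suspicion, a simplified guess at the shape of such a guard (the helper is hypothetical and this is not the actual rule code; only the JoinType classes are real):

import org.apache.spark.sql.catalyst.plans.{ExistenceJoin, JoinType, LeftAnti, LeftSemi}

object PushdownGuard {
  // A match on LeftSemi | LeftAnti never fires for ExistenceJoin, which is
  // its own JoinType subclass, so joins produced as ExistenceJoin would be
  // skipped by pushdown rules guarded this way.
  def handledBySemiAntiPushdown(jt: JoinType): Boolean = jt match {
    case LeftSemi | LeftAnti => true
    case _: ExistenceJoin    => false // would need explicit support
    case _                   => false
  }
}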

Contributor

Filter (46)'s condition is exists#2 OR exists#1, which can't be pushed down. But that's OK, as it is basically the same as the old Filter (30).
In the new plan the order of the joins is a bit different, but I'm not sure the new plan would be worse. Actually, we have 3 SMJ + 5 BHJ now, whereas we had 4 + 4...

Contributor

Can you please run a TPCDS benchmark to make sure we don't introduce a performance regression?

@mskapilks
Contributor Author

More failures. Seems this might take real effort to make it work, like other rule modifications.

@peter-toth
Contributor

More failures. Seems this might take real effort to make it work, like other rule modifications.

Why was the latest commit (b1ed7be) needed?

@peter-toth
Contributor

peter-toth commented Apr 21, 2023

@mskapilks, do you have any update on this? I can take over this PR and investigate the idea further if you don't have time for it.

@github-actions

We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable.
If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag!

@github-actions github-actions bot added the Stale label Jul 31, 2023
@github-actions github-actions bot closed this Aug 1, 2023