Commit 5e62e3a
committed
[SPARK-41162][SQL] Fix anti- and semi-join for self-join with aggregations
Rule `PushDownLeftSemiAntiJoin` should not push an anti-join below an `Aggregate` when the join condition references an attribute that exists in its right plan and its left plan's child. This usually happens when the anti-join / semi-join is a self-join while `DeduplicateRelations` cannot deduplicate those attributes (in this example due to the projection of `value` to `id`).
This behaviour already exists for `Project` and `Union`, but `Aggregate` lacks this safety guard.
Without this change, the optimizer creates an incorrect plan.
This example fails with `distinct()` (an aggregation), and succeeds without `distinct()`, but both queries are identical:
```scala
val ids = Seq(1, 2, 3).toDF("id").distinct()
val result = ids.withColumn("id", $"id" + 1).join(ids, Seq("id"), "left_anti").collect()
assert(result.length == 1)
```
With `distinct()`, rule `PushDownLeftSemiAntiJoin` creates a join condition `(value#907 + 1) = value#907`, which can never be true. This effectively removes the anti-join.
**Before this PR:**
The anti-join is fully removed from the plan.
```
== Physical Plan ==
AdaptiveSparkPlan (16)
+- == Final Plan ==
LocalTableScan (1)
(16) AdaptiveSparkPlan
Output [1]: [id#900]
Arguments: isFinalPlan=true
```
This is caused by `PushDownLeftSemiAntiJoin` adding join condition `(value#907 + 1) = value#907`, which is wrong as because `id#910` in `(id#910 + 1) AS id#912` exists in the right child of the join as well as in the left grandchild:
```
=== Applying Rule org.apache.spark.sql.catalyst.optimizer.PushDownLeftSemiAntiJoin ===
!Join LeftAnti, (id#912 = id#910) Aggregate [id#910], [(id#910 + 1) AS id#912]
!:- Aggregate [id#910], [(id#910 + 1) AS id#912] +- Project [value#907 AS id#910]
!: +- Project [value#907 AS id#910] +- Join LeftAnti, ((value#907 + 1) = value#907)
!: +- LocalRelation [value#907] :- LocalRelation [value#907]
!+- Aggregate [id#910], [id#910] +- Aggregate [id#910], [id#910]
! +- Project [value#914 AS id#910] +- Project [value#914 AS id#910]
! +- LocalRelation [value#914] +- LocalRelation [value#914]
```
The right child of the join and in the left grandchild would become the children of the pushed-down join, which creates an invalid join condition.
**After this PR:**
Join condition `(id#910 + 1) AS id#912` is understood to become ambiguous as both sides of the prospect join contain `id#910`. Hence, the join is not pushed down. The rule is then not applied any more.
The final plan contains the anti-join:
```
== Physical Plan ==
AdaptiveSparkPlan (24)
+- == Final Plan ==
* BroadcastHashJoin LeftSemi BuildRight (14)
:- * HashAggregate (7)
: +- AQEShuffleRead (6)
: +- ShuffleQueryStage (5), Statistics(sizeInBytes=48.0 B, rowCount=3)
: +- Exchange (4)
: +- * HashAggregate (3)
: +- * Project (2)
: +- * LocalTableScan (1)
+- BroadcastQueryStage (13), Statistics(sizeInBytes=1024.0 KiB, rowCount=3)
+- BroadcastExchange (12)
+- * HashAggregate (11)
+- AQEShuffleRead (10)
+- ShuffleQueryStage (9), Statistics(sizeInBytes=48.0 B, rowCount=3)
+- ReusedExchange (8)
(8) ReusedExchange [Reuses operator id: 4]
Output [1]: [id#898]
(24) AdaptiveSparkPlan
Output [1]: [id#900]
Arguments: isFinalPlan=true
```
It fixes correctness.
Unit tests in `DataFrameJoinSuite` and `LeftSemiAntiJoinPushDownSuite`.
Closes apache#39131 from EnricoMi/branch-antijoin-selfjoin-fix.
Authored-by: Enrico Minack <github@enrico.minack.dev>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>1 parent 61e0348 commit 5e62e3a
File tree
3 files changed
+63
-25
lines changed- sql
- catalyst/src
- main/scala/org/apache/spark/sql/catalyst/optimizer
- test/scala/org/apache/spark/sql/catalyst/optimizer
- core/src/test/scala/org/apache/spark/sql
3 files changed
+63
-25
lines changedsql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/PushDownLeftSemiAntiJoin.scala
Lines changed: 8 additions & 7 deletions
| Original file line number | Diff line number | Diff line change | |
|---|---|---|---|
| |||
52 | 52 | | |
53 | 53 | | |
54 | 54 | | |
55 | | - | |
| 55 | + | |
56 | 56 | | |
57 | | - | |
| 57 | + | |
| 58 | + | |
58 | 59 | | |
59 | 60 | | |
60 | 61 | | |
| |||
100 | 101 | | |
101 | 102 | | |
102 | 103 | | |
103 | | - | |
104 | | - | |
105 | | - | |
106 | | - | |
107 | | - | |
| 104 | + | |
| 105 | + | |
| 106 | + | |
| 107 | + | |
| 108 | + | |
108 | 109 | | |
109 | 110 | | |
110 | 111 | | |
| |||
Lines changed: 37 additions & 18 deletions
| Original file line number | Diff line number | Diff line change | |
|---|---|---|---|
| |||
26 | 26 | | |
27 | 27 | | |
28 | 28 | | |
29 | | - | |
| 29 | + | |
30 | 30 | | |
31 | 31 | | |
32 | 32 | | |
| |||
45 | 45 | | |
46 | 46 | | |
47 | 47 | | |
48 | | - | |
| 48 | + | |
49 | 49 | | |
50 | 50 | | |
51 | 51 | | |
| |||
58 | 58 | | |
59 | 59 | | |
60 | 60 | | |
61 | | - | |
| 61 | + | |
62 | 62 | | |
63 | 63 | | |
64 | 64 | | |
| |||
67 | 67 | | |
68 | 68 | | |
69 | 69 | | |
70 | | - | |
| 70 | + | |
71 | 71 | | |
72 | 72 | | |
73 | 73 | | |
| |||
82 | 82 | | |
83 | 83 | | |
84 | 84 | | |
85 | | - | |
| 85 | + | |
86 | 86 | | |
87 | 87 | | |
88 | 88 | | |
| |||
94 | 94 | | |
95 | 95 | | |
96 | 96 | | |
97 | | - | |
| 97 | + | |
98 | 98 | | |
99 | 99 | | |
100 | 100 | | |
| |||
108 | 108 | | |
109 | 109 | | |
110 | 110 | | |
111 | | - | |
| 111 | + | |
112 | 112 | | |
113 | 113 | | |
114 | 114 | | |
| |||
141 | 141 | | |
142 | 142 | | |
143 | 143 | | |
144 | | - | |
| 144 | + | |
145 | 145 | | |
146 | 146 | | |
147 | 147 | | |
| |||
150 | 150 | | |
151 | 151 | | |
152 | 152 | | |
153 | | - | |
| 153 | + | |
154 | 154 | | |
155 | 155 | | |
156 | 156 | | |
| |||
165 | 165 | | |
166 | 166 | | |
167 | 167 | | |
168 | | - | |
| 168 | + | |
169 | 169 | | |
170 | 170 | | |
171 | 171 | | |
| |||
183 | 183 | | |
184 | 184 | | |
185 | 185 | | |
186 | | - | |
| 186 | + | |
187 | 187 | | |
188 | 188 | | |
189 | 189 | | |
| |||
223 | 223 | | |
224 | 224 | | |
225 | 225 | | |
226 | | - | |
| 226 | + | |
227 | 227 | | |
228 | 228 | | |
229 | 229 | | |
| |||
249 | 249 | | |
250 | 250 | | |
251 | 251 | | |
252 | | - | |
| 252 | + | |
253 | 253 | | |
254 | 254 | | |
255 | 255 | | |
| |||
264 | 264 | | |
265 | 265 | | |
266 | 266 | | |
267 | | - | |
| 267 | + | |
268 | 268 | | |
269 | 269 | | |
270 | 270 | | |
| |||
279 | 279 | | |
280 | 280 | | |
281 | 281 | | |
282 | | - | |
| 282 | + | |
283 | 283 | | |
284 | 284 | | |
285 | 285 | | |
| |||
295 | 295 | | |
296 | 296 | | |
297 | 297 | | |
298 | | - | |
| 298 | + | |
299 | 299 | | |
300 | 300 | | |
301 | 301 | | |
| |||
305 | 305 | | |
306 | 306 | | |
307 | 307 | | |
308 | | - | |
| 308 | + | |
309 | 309 | | |
310 | 310 | | |
311 | 311 | | |
| |||
315 | 315 | | |
316 | 316 | | |
317 | 317 | | |
318 | | - | |
| 318 | + | |
319 | 319 | | |
320 | 320 | | |
321 | 321 | | |
| |||
421 | 421 | | |
422 | 422 | | |
423 | 423 | | |
| 424 | + | |
| 425 | + | |
| 426 | + | |
| 427 | + | |
| 428 | + | |
| 429 | + | |
| 430 | + | |
| 431 | + | |
| 432 | + | |
| 433 | + | |
| 434 | + | |
| 435 | + | |
| 436 | + | |
| 437 | + | |
| 438 | + | |
| 439 | + | |
| 440 | + | |
| 441 | + | |
| 442 | + | |
424 | 443 | | |
425 | 444 | | |
426 | 445 | | |
| |||
Lines changed: 18 additions & 0 deletions
| Original file line number | Diff line number | Diff line change | |
|---|---|---|---|
| |||
288 | 288 | | |
289 | 289 | | |
290 | 290 | | |
| 291 | + | |
| 292 | + | |
| 293 | + | |
| 294 | + | |
| 295 | + | |
| 296 | + | |
| 297 | + | |
| 298 | + | |
| 299 | + | |
| 300 | + | |
| 301 | + | |
| 302 | + | |
| 303 | + | |
| 304 | + | |
| 305 | + | |
| 306 | + | |
| 307 | + | |
| 308 | + | |
291 | 309 | | |
292 | 310 | | |
293 | 311 | | |
| |||
0 commit comments