Commit 7eca60d
[SPARK-41162][SQL][3.3] Fix anti- and semi-join for self-join with aggregations
### What changes were proposed in this pull request?
Backport #39131 to branch-3.3.
Rule `PushDownLeftSemiAntiJoin` should not push an anti-join below an `Aggregate` when the join condition references an attribute that exists in its right plan and its left plan's child. This usually happens when the anti-join / semi-join is a self-join while `DeduplicateRelations` cannot deduplicate those attributes (in this example due to the projection of `value` to `id`).
This behaviour already exists for `Project` and `Union`, but `Aggregate` lacks this safety guard.
### Why are the changes needed?
Without this change, the optimizer creates an incorrect plan.
This example fails with `distinct()` (an aggregation), and succeeds without `distinct()`, but both queries are identical:
```scala
val ids = Seq(1, 2, 3).toDF("id").distinct()
val result = ids.withColumn("id", $"id" + 1).join(ids, Seq("id"), "left_anti").collect()
assert(result.length == 1)
```
With `distinct()`, rule `PushDownLeftSemiAntiJoin` creates a join condition `(value#907 + 1) = value#907`, which can never be true. This effectively removes the anti-join.
**Before this PR:**
The anti-join is fully removed from the plan.
```
== Physical Plan ==
AdaptiveSparkPlan (16)
+- == Final Plan ==
LocalTableScan (1)
(16) AdaptiveSparkPlan
Output [1]: [id#900]
Arguments: isFinalPlan=true
```
This is caused by `PushDownLeftSemiAntiJoin` adding join condition `(value#907 + 1) = value#907`, which is wrong as because `id#910` in `(id#910 + 1) AS id#912` exists in the right child of the join as well as in the left grandchild:
```
=== Applying Rule org.apache.spark.sql.catalyst.optimizer.PushDownLeftSemiAntiJoin ===
!Join LeftAnti, (id#912 = id#910) Aggregate [id#910], [(id#910 + 1) AS id#912]
!:- Aggregate [id#910], [(id#910 + 1) AS id#912] +- Project [value#907 AS id#910]
!: +- Project [value#907 AS id#910] +- Join LeftAnti, ((value#907 + 1) = value#907)
!: +- LocalRelation [value#907] :- LocalRelation [value#907]
!+- Aggregate [id#910], [id#910] +- Aggregate [id#910], [id#910]
! +- Project [value#914 AS id#910] +- Project [value#914 AS id#910]
! +- LocalRelation [value#914] +- LocalRelation [value#914]
```
The right child of the join and in the left grandchild would become the children of the pushed-down join, which creates an invalid join condition.
**After this PR:**
Join condition `(id#910 + 1) AS id#912` is understood to become ambiguous as both sides of the prospect join contain `id#910`. Hence, the join is not pushed down. The rule is then not applied any more.
The final plan contains the anti-join:
```
== Physical Plan ==
AdaptiveSparkPlan (24)
+- == Final Plan ==
* BroadcastHashJoin LeftSemi BuildRight (14)
:- * HashAggregate (7)
: +- AQEShuffleRead (6)
: +- ShuffleQueryStage (5), Statistics(sizeInBytes=48.0 B, rowCount=3)
: +- Exchange (4)
: +- * HashAggregate (3)
: +- * Project (2)
: +- * LocalTableScan (1)
+- BroadcastQueryStage (13), Statistics(sizeInBytes=1024.0 KiB, rowCount=3)
+- BroadcastExchange (12)
+- * HashAggregate (11)
+- AQEShuffleRead (10)
+- ShuffleQueryStage (9), Statistics(sizeInBytes=48.0 B, rowCount=3)
+- ReusedExchange (8)
(8) ReusedExchange [Reuses operator id: 4]
Output [1]: [id#898]
(24) AdaptiveSparkPlan
Output [1]: [id#900]
Arguments: isFinalPlan=true
```
### Does this PR introduce _any_ user-facing change?
It fixes correctness.
### How was this patch tested?
Unit tests in `DataFrameJoinSuite` and `LeftSemiAntiJoinPushDownSuite`.
Closes #39409 from EnricoMi/branch-antijoin-selfjoin-fix-3.3.
Authored-by: Enrico Minack <github@enrico.minack.dev>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit b97f79d)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>1 parent 0f5e231 commit 7eca60d
File tree
3 files changed
+63
-25
lines changed- sql
- catalyst/src
- main/scala/org/apache/spark/sql/catalyst/optimizer
- test/scala/org/apache/spark/sql/catalyst/optimizer
- core/src/test/scala/org/apache/spark/sql
3 files changed
+63
-25
lines changedsql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/PushDownLeftSemiAntiJoin.scala
Lines changed: 7 additions & 6 deletions
| Original file line number | Diff line number | Diff line change | |
|---|---|---|---|
| |||
56 | 56 | | |
57 | 57 | | |
58 | 58 | | |
59 | | - | |
| 59 | + | |
60 | 60 | | |
61 | 61 | | |
| 62 | + | |
62 | 63 | | |
63 | 64 | | |
64 | 65 | | |
| |||
105 | 106 | | |
106 | 107 | | |
107 | 108 | | |
108 | | - | |
109 | | - | |
110 | | - | |
111 | | - | |
112 | | - | |
| 109 | + | |
| 110 | + | |
| 111 | + | |
| 112 | + | |
| 113 | + | |
113 | 114 | | |
114 | 115 | | |
115 | 116 | | |
| |||
Lines changed: 38 additions & 19 deletions
| Original file line number | Diff line number | Diff line change | |
|---|---|---|---|
| |||
27 | 27 | | |
28 | 28 | | |
29 | 29 | | |
30 | | - | |
| 30 | + | |
31 | 31 | | |
32 | 32 | | |
33 | 33 | | |
| |||
46 | 46 | | |
47 | 47 | | |
48 | 48 | | |
49 | | - | |
| 49 | + | |
50 | 50 | | |
51 | 51 | | |
52 | 52 | | |
| |||
59 | 59 | | |
60 | 60 | | |
61 | 61 | | |
62 | | - | |
| 62 | + | |
63 | 63 | | |
64 | 64 | | |
65 | 65 | | |
| |||
68 | 68 | | |
69 | 69 | | |
70 | 70 | | |
71 | | - | |
| 71 | + | |
72 | 72 | | |
73 | 73 | | |
74 | 74 | | |
| |||
83 | 83 | | |
84 | 84 | | |
85 | 85 | | |
86 | | - | |
| 86 | + | |
87 | 87 | | |
88 | 88 | | |
89 | 89 | | |
| |||
95 | 95 | | |
96 | 96 | | |
97 | 97 | | |
98 | | - | |
| 98 | + | |
99 | 99 | | |
100 | 100 | | |
101 | 101 | | |
| |||
109 | 109 | | |
110 | 110 | | |
111 | 111 | | |
112 | | - | |
| 112 | + | |
113 | 113 | | |
114 | 114 | | |
115 | 115 | | |
| |||
142 | 142 | | |
143 | 143 | | |
144 | 144 | | |
145 | | - | |
| 145 | + | |
146 | 146 | | |
147 | 147 | | |
148 | 148 | | |
| |||
151 | 151 | | |
152 | 152 | | |
153 | 153 | | |
154 | | - | |
| 154 | + | |
155 | 155 | | |
156 | 156 | | |
157 | 157 | | |
| |||
166 | 166 | | |
167 | 167 | | |
168 | 168 | | |
169 | | - | |
| 169 | + | |
170 | 170 | | |
171 | 171 | | |
172 | 172 | | |
| |||
184 | 184 | | |
185 | 185 | | |
186 | 186 | | |
187 | | - | |
| 187 | + | |
188 | 188 | | |
189 | 189 | | |
190 | 190 | | |
| |||
224 | 224 | | |
225 | 225 | | |
226 | 226 | | |
227 | | - | |
| 227 | + | |
228 | 228 | | |
229 | 229 | | |
230 | 230 | | |
| |||
240 | 240 | | |
241 | 241 | | |
242 | 242 | | |
243 | | - | |
| 243 | + | |
244 | 244 | | |
245 | 245 | | |
246 | 246 | | |
| |||
259 | 259 | | |
260 | 260 | | |
261 | 261 | | |
262 | | - | |
| 262 | + | |
263 | 263 | | |
264 | 264 | | |
265 | 265 | | |
| |||
274 | 274 | | |
275 | 275 | | |
276 | 276 | | |
277 | | - | |
| 277 | + | |
278 | 278 | | |
279 | 279 | | |
280 | 280 | | |
| |||
289 | 289 | | |
290 | 290 | | |
291 | 291 | | |
292 | | - | |
| 292 | + | |
293 | 293 | | |
294 | 294 | | |
295 | 295 | | |
| |||
305 | 305 | | |
306 | 306 | | |
307 | 307 | | |
308 | | - | |
| 308 | + | |
309 | 309 | | |
310 | 310 | | |
311 | 311 | | |
| |||
315 | 315 | | |
316 | 316 | | |
317 | 317 | | |
318 | | - | |
| 318 | + | |
319 | 319 | | |
320 | 320 | | |
321 | 321 | | |
| |||
325 | 325 | | |
326 | 326 | | |
327 | 327 | | |
328 | | - | |
| 328 | + | |
329 | 329 | | |
330 | 330 | | |
331 | 331 | | |
| |||
431 | 431 | | |
432 | 432 | | |
433 | 433 | | |
| 434 | + | |
| 435 | + | |
| 436 | + | |
| 437 | + | |
| 438 | + | |
| 439 | + | |
| 440 | + | |
| 441 | + | |
| 442 | + | |
| 443 | + | |
| 444 | + | |
| 445 | + | |
| 446 | + | |
| 447 | + | |
| 448 | + | |
| 449 | + | |
| 450 | + | |
| 451 | + | |
| 452 | + | |
434 | 453 | | |
435 | 454 | | |
436 | 455 | | |
| |||
Lines changed: 18 additions & 0 deletions
| Original file line number | Diff line number | Diff line change | |
|---|---|---|---|
| |||
288 | 288 | | |
289 | 289 | | |
290 | 290 | | |
| 291 | + | |
| 292 | + | |
| 293 | + | |
| 294 | + | |
| 295 | + | |
| 296 | + | |
| 297 | + | |
| 298 | + | |
| 299 | + | |
| 300 | + | |
| 301 | + | |
| 302 | + | |
| 303 | + | |
| 304 | + | |
| 305 | + | |
| 306 | + | |
| 307 | + | |
| 308 | + | |
291 | 309 | | |
292 | 310 | | |
293 | 311 | | |
| |||
0 commit comments