[SPARK-45846][SQL] optimizeNullAwareAntiJoin should respect autoBroadcastJoinThreshold #53670
What changes were proposed in this pull request?
This PR fixes an issue where null-aware anti-joins (enabled via spark.sql.optimizeNullAwareAntiJoin) were unconditionally using BroadcastHashJoinExec without checking if the right side was small enough to broadcast according to spark.sql.autoBroadcastJoinThreshold.
Why are the changes needed?
When spark.sql.optimizeNullAwareAntiJoin is enabled, queries using NOT IN with a subquery would always attempt to broadcast the right side, even when it exceeded the broadcast threshold. This could lead to OOM errors with large datasets.
Does this PR introduce any user-facing change?
Yes. When spark.sql.autoBroadcastJoinThreshold is set to -1 (or a small value), null-aware anti-joins will now respect this configuration and fall back to BroadcastNestedLoopJoinExec instead of attempting to broadcast large tables with BroadcastHashJoinExec.
Join Strategy Selection:
BroadcastHashJoinExec with isNullAwareAntiJoin=true (optimized O(M) hash lookup, but risk of OOM)
BroadcastHashJoinExec with isNullAwareAntiJoin=true (optimized)
BroadcastNestedLoopJoinExec (slower O(M*N), but avoids OOM)
How was this patch tested?
Added a new test case "SPARK-45846: optimizeNullAwareAntiJoin should respect autoBroadcastJoinThreshold" in JoinSuite that verifies null-aware anti-joins do not use BroadcastHashJoinExec when broadcast is disabled.
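The behavior described above can be illustrated with a small, Spark-free model. The sketch below is not the code from this PR or Spark's planner: the function names (choose_join_strategy, not_in_hash, not_in_nested_loop) are hypothetical, the size check is a simplification of the threshold comparison, and the two evaluation functions only mimic single-column NOT IN semantics to show why the hash path and the nested-loop path return the same rows at different cost.

```python
from typing import Iterable, List, Optional


def choose_join_strategy(right_size_bytes: int, auto_broadcast_threshold: int) -> str:
    """Hypothetical model of the selection described in this PR (not Spark's planner code)."""
    # A threshold of -1 disables the hash-join broadcast path: no size can satisfy the check.
    can_broadcast = 0 <= right_size_bytes <= auto_broadcast_threshold
    return "BroadcastHashJoinExec" if can_broadcast else "BroadcastNestedLoopJoinExec"


def not_in_hash(left: Iterable[Optional[int]], right: Iterable[Optional[int]]) -> List[Optional[int]]:
    """Single-column NOT IN via a hash set: one lookup per left row."""
    right_rows = list(right)
    if not right_rows:
        return list(left)  # empty subquery: every left row passes, NULLs included
    if any(r is None for r in right_rows):
        return []  # a NULL on the right means NOT IN is never true
    keys = set(right_rows)
    return [l for l in left if l is not None and l not in keys]


def not_in_nested_loop(left: Iterable[Optional[int]], right: Iterable[Optional[int]]) -> List[Optional[int]]:
    """Same semantics by comparing every (left, right) pair."""
    right_rows = list(right)
    result = []
    for l in left:
        # A left row survives only if its comparison with every right row is definitely false.
        if all(l is not None and r is not None and l != r for r in right_rows):
            result.append(l)
    return result


if __name__ == "__main__":
    ten_mb = 10 * 1024 * 1024
    print(choose_join_strategy(1024, ten_mb))
    print(choose_join_strategy(1024, -1))
    print(not_in_hash([1, 2, None, 3], [2]))
    print(not_in_nested_loop([1, 2, None, 3], [2]))
```

In this model, a threshold of -1 always selects the nested-loop strategy, which mirrors the condition the new JoinSuite test is described as checking.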