Skip to content

Conversation

@callmepandey
Copy link

@callmepandey callmepandey commented Jan 4, 2026

What changes were proposed in this pull request?

This PR fixes an issue where null-aware anti-joins (enabled via spark.sql.optimizeNullAwareAntiJoin) were unconditionally using BroadcastHashJoinExec without checking if the right side was small enough to broadcast according to spark.sql.autoBroadcastJoinThreshold.

Why are the changes needed?

When spark.sql.optimizeNullAwareAntiJoin is enabled, queries using NOT IN with a subquery would always attempt to broadcast the right side, even when it exceeded the broadcast threshold. This could lead to OOM errors with large datasets.

Does this PR introduce any user-facing change?

Yes. When spark.sql.autoBroadcastJoinThreshold is set to -1 (or a small value), null-aware anti-joins will now respect this configuration and fall back to BroadcastNestedLoopJoinExec instead of attempting to broadcast large tables with BroadcastHashJoinExec.

Join Strategy Selection:

  • Before fix: Always uses BroadcastHashJoinExec with isNullAwareAntiJoin=true (optimized O(M) hash lookup, but risk of OOM)
  • After fix:
    • If right side is small enough to broadcast: Uses BroadcastHashJoinExec with isNullAwareAntiJoin=true (optimized)
    • If broadcast is disabled or right side is too large: Falls back to BroadcastNestedLoopJoinExec (slower O(M*N), but avoids OOM)

How was this patch tested?

Added a new test case "SPARK-45846: optimizeNullAwareAntiJoin should respect autoBroadcastJoinThreshold" in JoinSuite that verifies null-aware anti-joins do not use BroadcastHashJoinExec when broadcast is disabled.

@github-actions
Copy link

github-actions bot commented Jan 4, 2026

JIRA Issue Information

=== Improvement SPARK-45846 ===
Summary: spark.sql.optimizeNullAwareAntiJoin should respect spark.sql.autoBroadcastJoinThreshold
Assignee: None
Status: Open
Affected: ["4.0.0"]


This comment was automatically generated by GitHub Actions

@github-actions github-actions bot added the SQL label Jan 4, 2026
@callmepandey callmepandey force-pushed the SPARK-45846 branch 3 times, most recently from 9a879d0 to 58a32ae Compare January 4, 2026 18:33
…castJoinThreshold

### What changes were proposed in this pull request?

This PR fixes an issue where null-aware anti-joins (enabled via spark.sql.optimizeNullAwareAntiJoin) were unconditionally using BroadcastHashJoinExec without checking if the right side was small enough to broadcast according to spark.sql.autoBroadcastJoinThreshold.

### Why are the changes needed?

When spark.sql.optimizeNullAwareAntiJoin is enabled, queries using NOT IN with a subquery would always attempt to broadcast the right side, even when it exceeded the broadcast threshold. This could lead to OOM errors with large datasets.

### Does this PR introduce any user-facing change?

Yes. When spark.sql.autoBroadcastJoinThreshold is set to -1 (or a small value), null-aware anti-joins will now respect this configuration and fall back to alternative join strategies instead of attempting to broadcast large tables.

### How was this patch tested?

Added a new test case "SPARK-45846: optimizeNullAwareAntiJoin should respect autoBroadcastJoinThreshold" in JoinSuite that verifies null-aware anti-joins do not use BroadcastHashJoinExec when broadcast is disabled.
@callmepandey
Copy link
Author

Hi @cloud-fan,

I've opened this PR to fix an issue where null-aware anti-joins (from SPARK-32290) are not respecting the spark.sql.autoBroadcastJoinThreshold configuration, which can cause OOM errors when attempting to broadcast large tables.

The Problem:
When spark.sql.optimizeNullAwareAntiJoin is enabled, NOT IN subqueries always use BroadcastHashJoinExec regardless of the broadcast threshold setting, even when the right side is too large to broadcast safely.

The Fix:
Added a canBroadcastBySize(j.right, conf) check in SparkStrategies.scala:332 so that:

  • With broadcast enabled: Uses BroadcastHashJoinExec (optimized O(M) hash lookup)
  • With broadcast disabled: Falls back to BroadcastNestedLoopJoinExec (slower O(M×N), but avoids OOM)

Testing:
Added test case in JoinSuite.scala that verifies null-aware anti-joins respect the threshold configuration when broadcast is disabled.

The PR is rebased on latest master and CI is passing. Would greatly appreciate your review since you signed-off on the original SPARK-32290 implementation.

Thank you!

@callmepandey
Copy link
Author

cc @dongjoon-hyun

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant