Skip to content

Conversation

@csbiy
Copy link

@csbiy csbiy commented Feb 2, 2026

Summary

This PR addresses issue #3175 by documenting the specific null handling incompatibility between Spark and Comet for the arrays_overlap function.

Changes Made

1. Code Documentation (arrays.scala)

Updated CometArraysOverlap.getSupportLevel() to return Incompatible(Some(...)) with detailed explanation and concrete example.

Before:

override def getSupportLevel(expr: ArraysOverlap): SupportLevel = Incompatible(None)

After:

override def getSupportLevel(expr: ArraysOverlap): SupportLevel = Incompatible(Some(
  "Null handling differs from Spark: DataFusion's array_has_any returns false when no " +
    "common elements are found, even if null elements exist. Spark returns null in such " +
    "cases following three-valued logic (SQL standard). Example: " +
    "arrays_overlap(array(1, null, 3), array(4, 5)) returns null in Spark but false in Comet."))

2. Test Coverage (CometArrayExpressionSuite.scala)

Added comprehensive test arrays_overlap - null handling behavior verification with 6 test cases:

  1. Common element exists - returns true
  2. No common elements, no nulls - returns false
  3. No common elements, null exists - Spark: null, Comet: false (documented incompatibility)
  4. Common element with null present - returns true
  5. Both arrays with null, no common elements - behavior documented
  6. Empty array cases - edge case covered

3. User Documentation (expressions.md)

Updated the Array Expressions table with specific explanation and example showing the three-valued logic difference.

Root Cause Analysis

Spark Behavior (Three-Valued Logic)

Spark follows SQL's three-valued logic (true, false, null):

  • Returns true if common elements found
  • Returns false if no common elements AND no nulls
  • Returns null if no common elements BUT nulls exist (indeterminate)

Comet Behavior

Comet uses DataFusion's array_has_any function:

  • Returns true if common elements found
  • Returns false in all other cases (no three-valued logic support)

Example Demonstrating Incompatibility

SELECT arrays_overlap(array(1, null, 3), array(4, 5))
System Result Reason
Spark null No common elements, but null exists - indeterminate
Comet false DataFusion doesn't implement three-valued logic

Why This Matters

Users who enable arrays_overlap with spark.comet.expression.ArraysOverlap.allowIncompatible=true need to understand:

  1. Query results may differ when nulls are present
  2. Downstream logic relying on null distinction (vs false) may break
  3. JOIN conditions or filters using this function may behave differently

Testing Notes

Local test execution encountered environment Java version compatibility issues (unrelated to code changes). Test code is syntactically correct and follows existing patterns. CI environment should run tests successfully with proper Java configuration.

Files Modified

spark/src/main/scala/org/apache/comet/serde/arrays.scala
spark/src/test/scala/org/apache/comet/CometArrayExpressionSuite.scala
docs/source/user-guide/latest/expressions.md

Closes

Closes #3175

Checklist

  • Documented specific incompatibility reason in code
  • Added test cases verifying behavior
  • Updated user-facing documentation
  • Followed existing code style and patterns
  • Added concrete example for user clarity

This commit addresses issue apache#3175 by documenting the specific null handling
incompatibility in the arrays_overlap function between Spark and Comet.

Changes:
1. Updated CometArraysOverlap.getSupportLevel() in arrays.scala to provide
   a detailed explanation of the incompatibility with a concrete example.

2. Added comprehensive test cases in CometArrayExpressionSuite.scala to
   verify and document the null handling behavior, including:
   - Common element exists (returns true)
   - No common elements, no nulls (returns false)
   - No common elements with null present (Spark: null, Comet: false)
   - Common element with null present (returns true)
   - Both arrays with null but no common elements
   - Empty array cases

3. Updated expressions.md documentation to inform users of the specific
   three-valued logic difference with a clear example.

Root Cause:
Comet uses DataFusion's array_has_any function, which returns false when
no common elements are found, regardless of null presence. Spark follows
SQL's three-valued logic and returns null when the result is indeterminate.

Example:
- arrays_overlap(array(1, null, 3), array(4, 5))
  * Spark: null (indeterminate due to null)
  * Comet: false

Related to apache#3175
@coderfender
Copy link
Contributor

Thank you very much for documenting these @csbiy .

@coderfender
Copy link
Contributor

coderfender commented Feb 3, 2026

Please take a look at the comet contributor guide which should help you with PR :
https://datafusion.apache.org/comet/contributor-guide/index.html

I generally run the following commands to clean up the code before raising the PR:

1. make clean
2. make format
3. npx prettier "**/*.md" --write
4. cd native && cargo clippy --fix --allow-dirty 
5. make release

"Null handling differs from Spark: DataFusion's array_has_any returns false when no " +
"common elements are found, even if null elements exist. Spark returns null in such " +
"cases following three-valued logic (SQL standard). Example: " +
"arrays_overlap(array(1, null, 3), array(4, 5)) returns null in Spark but false in Comet."))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for writing this. I would probably keep the message short and route to the documentation to help declutter the logs :)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Incompatibility] Document arrays_overlap null handling differences

2 participants