
Conversation

@martin-augment (Owner)

2643: To review by AI

coderabbitai bot commented Nov 11, 2025

Walkthrough

Adds null checks in Java vector classes to return null for null entries. Rewrites the array_insert Rust function to align with Spark semantics using proper offset ranges, null padding, and final length validation. Adds a test case for array_reverse on binary arrays.

Changes

  • Null-check guards in vector classes
    Files: common/src/main/java/org/apache/comet/vector/CometListVector.java, CometMapVector.java, CometPlainVector.java, CometVector.java
    Summary: Added early null checks in array/map/decimal accessor methods. Each method now returns null immediately when isNullAt() is true, bypassing offset computation and object construction for null entries.

  • Array insertion semantics rewrite
    Files: native/spark-expr/src/array_funcs/array_insert.rs
    Summary: Rewrote core array_insert logic to implement Spark semantics with proper Arrow half-open offset ranges. Added per-row final length tracking (MAX_ROUNDED_ARRAY_LENGTH), separated positive/negative index handling, implemented null padding (see the sketch below this list), and revised validity bitmap updates. Updated test coverage with edge case validation.

  • Test case for array_reverse
    Files: spark/src/test/scala/org/apache/comet/CometArrayExpressionSuite.scala
    Summary: Added new test "array_reverse 2" validating reverse() on array<binary> columns with nullable elements. Includes ArrayType import and Parquet data generation.


@augmentcode augmentcode bot left a comment


Review completed. No suggestions at this time.

Comment augment review to trigger a new review at any time.


claude bot commented Nov 11, 2025

Code Review for PR #19: Fix null handling in CometVector implementations

Summary

This PR addresses null handling issues in CometVector implementations and refactors the array_insert function. The changes fix issue apache#2643 related to null handling in binary arrays and other data types.

Code Quality & Best Practices

Java Changes (CometVector classes)

Positive aspects:

  • ✅ Consistent null checks added across all affected methods
  • ✅ Simple, defensive programming approach - checking isNullAt() before processing
  • ✅ Follows the Spark ColumnVector API contract which requires returning null for null entries
  • ✅ Minimal changes that reduce risk of introducing new bugs

Observations:

  • The null checks in CometPlainVector.java:126, CometPlainVector.java:151, CometVector.java:88, CometListVector.java:48, and CometMapVector.java:68 are straightforward and correct
  • These align with Spark's ColumnVector behavior where accessor methods should return null when isNullAt(rowId) is true

Rust Changes (array_insert.rs)

Positive aspects:

  • ✅ Significant improvement in code clarity and documentation
  • ✅ More accurate implementation of Spark's array_insert semantics
  • ✅ Better handling of edge cases (positive/negative indices, out-of-bounds positions)
  • ✅ Clearer variable naming (final_len, corrected, padding)
  • ✅ Improved comments explaining the logic
  • ✅ Proper handling of half-open offset ranges in Arrow

Concerns:

  1. Large refactoring scope: This is a substantial rewrite of the function logic (143 additions, 71 deletions). While the new code appears cleaner, such large changes increase risk:

    • Consider breaking this into smaller, reviewable chunks if possible
    • The mixing of bug fixes with refactoring makes it harder to verify correctness
  2. Logic changes: The offset calculation logic has changed significantly:

    • Old: corrected_pos = if pos > 0 { (pos - 1).as_usize() } else { ... }
    • New: Multiple branches with different handling for positive/negative indices
    • Recommendation: Ensure extensive testing covers all edge cases
  3. Removed imports: The removal of Field and ArrowNativeType imports suggests code simplification, which is good

Potential Bugs or Issues

Critical

  1. Array length calculation (lines 197-297 in array_insert.rs):

    • The new logic has multiple branches for calculating final_len
    • Verify that final_len is always correct, especially for:
      • Negative indices beyond array start
      • Positive indices beyond array end
      • Both legacy and non-legacy modes
  2. Null handling consistency:

    • The Java changes add null checks, but verify that the Rust side properly handles nulls in the items_array
    • The old code had is_item_null checks that set validity - ensure the new code maintains correct validity bits

Medium

  1. Padding calculation (e.g., line 258):

    • let padding = corrected - len; - confirm this can never underflow (see the sketch after this list)
    • let padding = target_len.saturating_sub(len + 1); - good use of saturating_sub, but verify the surrounding logic
  2. Off-by-one errors: The refactoring changes multiple offset calculations. Key areas to verify:

    • Line 248: mutable_values.extend(0, start, start + corrected);
    • Line 267: let corrected = len - k + base_offset;
    • Legacy mode offset differences
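A tiny illustration of the underflow concern in the padding bullets above, assuming the operands are usize as in the kernel; the variable names are taken from the review comments rather than verified against the exact code:

```rust
fn main() {
    let corrected: usize = 2;
    let len: usize = 5;

    // A bare `corrected - len` would underflow here: it panics in debug builds
    // and wraps around in release builds, so the subtraction needs a guard.
    assert_eq!(corrected.checked_sub(len), None); // explicit "no padding possible"
    assert_eq!(corrected.saturating_sub(len), 0); // clamps to zero instead

    // The second padding expression quoted above follows the saturating pattern:
    // it yields 0 when no padding is required rather than underflowing.
    let target_len: usize = 4;
    assert_eq!(target_len.saturating_sub(len + 1), 0);
}
```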

Performance Considerations

Positive:

  • The new code pre-allocates capacity correctly
  • Uses Vec::with_capacity for better performance
  • Maintains zero-copy operations where possible

Neutral:

  • Performance should be similar to the old implementation
  • No obvious performance regressions expected

Security Concerns

Good:

  • ✅ Proper bounds checking with MAX_ROUNDED_ARRAY_LENGTH validation (lines 283-287; see the sketch below)
  • ✅ Validates position != 0 (line 234)
  • ✅ Uses saturating_sub to prevent underflow

No major security issues identified
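For illustration, a minimal sketch of the kind of final-length validation referenced in the bounds-checking bullet above. It assumes the constant mirrors Spark's ByteArrayMethods.MAX_ROUNDED_ARRAY_LENGTH and uses a plain String error to stay self-contained, so it is not the actual Comet code:

```rust
// Assumption: the limit mirrors Spark's ByteArrayMethods.MAX_ROUNDED_ARRAY_LENGTH.
const MAX_ROUNDED_ARRAY_LENGTH: usize = i32::MAX as usize - 15;

// Reject results that would exceed the maximum array length instead of building them.
fn validate_final_len(final_len: usize) -> Result<(), String> {
    if final_len > MAX_ROUNDED_ARRAY_LENGTH {
        return Err(format!(
            "Cannot create an array with {final_len} elements; limit is {MAX_ROUNDED_ARRAY_LENGTH}"
        ));
    }
    Ok(())
}

fn main() {
    assert!(validate_final_len(10).is_ok());
    assert!(validate_final_len(MAX_ROUNDED_ARRAY_LENGTH + 1).is_err());
}
```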

Test Coverage

Existing Tests

  • The PR includes existing test coverage for array_insert in the test suite
  • Tests cover: positive indices, negative indices, null items, out-of-bounds positions

New Tests

Positive:

  • ✅ New test test_array_insert_bug_repro_null_item_pos1_fixed() (lines 454-485) specifically tests the null handling bug
  • ✅ Test validates null item insertion at position 1
  • ✅ Uses proper assertions comparing Arrow array data

Recommendations:

  1. Add more edge case tests:

    • Test insertion at array boundaries (position = length + 1)
    • Test very large positive/negative indices
    • Test legacy vs non-legacy mode differences explicitly
    • Test with all-null arrays
    • Test with mixed null/non-null items
  2. Add the new "array_reverse 2" test (lines 778-800):

    • Good: Tests with randomly generated data
    • Good: Tests various array types including binary
    • This appears to be testing array_reverse, not array_insert - verify this is intentional

Recommendations

Must Fix

  1. Add more comprehensive unit tests for the refactored array_insert logic covering all branches
  2. Document the legacy mode behavior more clearly in comments

Should Consider

  1. Split the PR: Consider separating the Java null-check fixes from the Rust refactoring

    • This makes review easier and reduces risk
    • Java changes are straightforward bug fixes
    • Rust changes are a significant refactor
  2. Add inline examples in the Rust code comments showing what happens for specific inputs

  3. Verify Spark compatibility: Run the full Spark compatibility test suite to ensure behavior matches exactly

Nice to Have

  1. Add property-based tests (quickcheck/proptest) to verify invariants (see the sketch after this list)
  2. Add benchmarks to ensure performance hasn't regressed
  3. Consider adding debug assertions for intermediate calculations
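As a sketch of what such a property-based test could look like, using proptest against a simple reference model of positive-index insertion (the same illustrative model as the sketch near the top of this page); the model and invariants are assumptions for demonstration, not the kernel's actual API:

```rust
use proptest::prelude::*;

// Hypothetical reference model of positive-index insertion with null padding,
// used only to state invariants; it is not the Comet kernel.
fn model_insert(arr: &[Option<i32>], pos: usize, item: Option<i32>) -> Vec<Option<i32>> {
    let mut out = arr.to_vec();
    if pos <= arr.len() + 1 {
        out.insert(pos - 1, item);
    } else {
        out.extend(std::iter::repeat(None).take(pos - 1 - arr.len()));
        out.push(item);
    }
    out
}

proptest! {
    #[test]
    fn insertion_invariants(
        arr in prop::collection::vec(prop::option::of(any::<i32>()), 0..8),
        pos in 1usize..16,
    ) {
        let out = model_insert(&arr, pos, Some(42));
        // The inserted item lands at the 1-based position...
        prop_assert_eq!(out[pos - 1], Some(42));
        // ...and the result length is max(original length + 1, pos).
        prop_assert_eq!(out.len(), std::cmp::max(arr.len() + 1, pos));
    }
}
```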

Overall Assessment

Strengths:

  • Addresses real bugs in null handling
  • Improves code clarity and documentation
  • Good test coverage for the specific bug

Concerns:

  • Large refactoring scope makes verification challenging
  • Would benefit from more comprehensive tests
  • Mixing bug fixes with refactoring

Recommendation: The changes appear correct but would benefit from:

  1. Additional edge case testing
  2. Verification against Spark's reference implementation
  3. Consider splitting into smaller PRs if possible

The Java changes look good to merge. The Rust changes need careful verification of the offset calculation logic against all edge cases before merging.


Reviewed by Claude Code


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between da3c57b and 6a1eb5e.

📒 Files selected for processing (6)
  • common/src/main/java/org/apache/comet/vector/CometListVector.java (1 hunks)
  • common/src/main/java/org/apache/comet/vector/CometMapVector.java (1 hunks)
  • common/src/main/java/org/apache/comet/vector/CometPlainVector.java (2 hunks)
  • common/src/main/java/org/apache/comet/vector/CometVector.java (1 hunks)
  • native/spark-expr/src/array_funcs/array_insert.rs (3 hunks)
  • spark/src/test/scala/org/apache/comet/CometArrayExpressionSuite.scala (2 hunks)
🧰 Additional context used
🧠 Learnings (1)
📓 Common learnings
Learnt from: martin-augment
Repo: martin-augment/datafusion-comet PR: 7
File: native/spark-expr/src/math_funcs/abs.rs:201-302
Timestamp: 2025-11-04T14:26:48.750Z
Learning: In the abs function in native/spark-expr/src/math_funcs/abs.rs (Rust), NULL values for signed integers (Int8, Int16, Int32, Int64) and decimals (Decimal128, Decimal256) should return the argument as-is (e.g., ColumnarValue::Scalar(ScalarValue::Int8(None))) rather than panicking on unwrap().
🧬 Code graph analysis (1)
spark/src/test/scala/org/apache/comet/CometArrayExpressionSuite.scala (2)
common/src/main/scala/org/apache/comet/CometConf.scala (1)
  • CometConf (46-767)
spark/src/main/scala/org/apache/comet/testing/FuzzDataGenerator.scala (3)
  • SchemaGenOptions (238-254)
  • SchemaGenOptions (256-260)
  • DataGenOptions (262-268)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: claude-review
🔇 Additional comments (2)
common/src/main/java/org/apache/comet/vector/CometPlainVector.java (2)

126-126: LGTM! Proper null handling added.

The null check prevents unnecessary offset computation and memory access for null values. This defensive pattern is consistent with similar changes in other vector classes mentioned in the PR.


151-151: LGTM! Consistent null handling.

The null check prevents unnecessary offset computation and memory copying for null values. This mirrors the pattern added to getUTF8String at line 126 and aligns with the PR's objective.

Comment on lines +221 to 238

     for (row_index, window) in offsets.windows(2).enumerate() {
         let start = window[0].as_usize();
         let end = window[1].as_usize();
         let len = end - start;
         let pos = pos_data.value(row_index);

         if list_array.is_null(row_index) {
-            // In Spark if value of the array is NULL than nothing happens
-            mutable_values.extend_nulls(1);
-            new_offsets.push(new_offsets[row_index] + O::one());
-            new_nulls.push(false);
+            // Top-level list row is NULL: do not write any child values and do not advance offset
+            new_offsets.push(new_offsets[row_index]);
+            list_valid.push(false);
             continue;
         }

         if pos == 0 {
             return Err(DataFusionError::Internal(
-                "Position for array_insert should be greter or less than zero".to_string(),
+                "Position for array_insert should be greater or less than zero".to_string(),
             ));
         }

@coderabbitai coderabbitai bot Nov 11, 2025


⚠️ Potential issue | 🔴 Critical

Handle NULL positions before calling Int32Array::value

Int32Array::value(row_index) panics when the slot is NULL. Spark allows array_insert to receive NULL positions (the entire result should be NULL in that row), but this implementation dereferences the value before checking for validity, so any NULL index triggers an abort instead of returning a NULL row. Please guard pos_data in the same way we already guard the list itself.

-        let pos = pos_data.value(row_index);
-
-        if list_array.is_null(row_index) {
+        if list_array.is_null(row_index) {
             // Top-level list row is NULL: do not write any child values and do not advance offset
             new_offsets.push(new_offsets[row_index]);
             list_valid.push(false);
             continue;
         }
 
+        if pos_data.is_null(row_index) {
+            new_offsets.push(new_offsets[row_index]);
+            list_valid.push(false);
+            continue;
+        }
+
+        let pos = pos_data.value(row_index);
+
         if pos == 0 {
             return Err(DataFusionError::Internal(
                 "Position for array_insert should be greater or less than zero".to_string(),
             ));
         }
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
    for (row_index, window) in offsets.windows(2).enumerate() {
        let start = window[0].as_usize();
        let end = window[1].as_usize();
        let len = end - start;

        if list_array.is_null(row_index) {
            // Top-level list row is NULL: do not write any child values and do not advance offset
            new_offsets.push(new_offsets[row_index]);
            list_valid.push(false);
            continue;
        }

        if pos_data.is_null(row_index) {
            new_offsets.push(new_offsets[row_index]);
            list_valid.push(false);
            continue;
        }

        let pos = pos_data.value(row_index);

        if pos == 0 {
            return Err(DataFusionError::Internal(
                "Position for array_insert should be greater or less than zero".to_string(),
            ));
        }
🤖 Prompt for AI Agents
In native/spark-expr/src/array_funcs/array_insert.rs around lines 221 to 238,
the code calls pos_data.value(row_index) before checking whether the position
slot is NULL which panics on NULL positions; instead, first test
pos_data.is_null(row_index) and if true push the current new_offsets entry, push
false to list_valid and continue (so the result row is NULL), only then call
pos_data.value(row_index) and proceed with the existing pos==0 check and
subsequent logic.

@martin-augment (Owner, Author)


value:useful; category:bug; feedback: The CodeRabbit AI reviewer is correct that a null check is needed before using PrimitiveArray::value(). The Arrow docs note: "This method does not check for nulls and the value is arbitrary if is_null returns true for the index." (https://github.com/apache/arrow-rs/blob/a0db1985c3a0f3190cfc5166b428933a28c740f9/arrow-array/src/array/primitive_array.rs#L766-L767)



@martin-augment, thank you for confirming the issue and providing the Arrow documentation reference! You're absolutely right—the docs clearly state that value() doesn't perform null checks and returns arbitrary data for null slots.

The fix I suggested will ensure that when pos_data contains a NULL at any row, the code will properly produce a NULL result row instead of panicking or producing undefined behavior. This aligns with Spark's semantics where a NULL position should yield a NULL result.


✏️ Learnings added
Learnt from: martin-augment
Repo: martin-augment/datafusion-comet PR: 19
File: native/spark-expr/src/array_funcs/array_insert.rs:221-238
Timestamp: 2025-11-11T20:44:04.950Z
Learning: In Rust code using Apache Arrow arrays, always check `is_null(index)` before calling `value(index)` on `PrimitiveArray` types (such as `Int32Array`, `Int64Array`, etc.), because `value()` does not check for nulls and returns arbitrary values for null slots. This applies to functions in `native/spark-expr/src` that process Arrow arrays.

Note: Learnings are effective only in the context of similar code segments. To apply general coding guidelines, please configure review instructions. You can manage existing learnings in the app.


🧠 Learnings used
Learnt from: martin-augment
Repo: martin-augment/datafusion-comet PR: 7
File: native/spark-expr/src/math_funcs/abs.rs:201-302
Timestamp: 2025-11-04T14:26:48.750Z
Learning: In the abs function in native/spark-expr/src/math_funcs/abs.rs (Rust), NULL values for signed integers (Int8, Int16, Int32, Int64) and decimals (Decimal128, Decimal256) should return the argument as-is (e.g., ColumnarValue::Scalar(ScalarValue::Int8(None))) rather than panicking on unwrap().
