
Conversation

@martin-augment (Owner)

2643: To review by AI

coderabbitai bot commented Nov 11, 2025

Walkthrough

Adds null checks in Java vector classes to return null for null entries. Rewrites the array_insert Rust function to align with Spark semantics using proper offset ranges, null padding, and final length validation. Adds a test case for array_reverse on binary arrays.

Changes

  • Null-check guards in vector classes
    Files: common/src/main/java/org/apache/comet/vector/CometListVector.java, CometMapVector.java, CometPlainVector.java, CometVector.java
    Summary: Added early null checks in array/map/decimal accessor methods. Each method now returns null immediately when isNullAt() is true, bypassing offset computation and object construction for null entries.

  • Array insertion semantics rewrite
    Files: native/spark-expr/src/array_funcs/array_insert.rs
    Summary: Rewrote core array_insert logic to implement Spark semantics with proper Arrow half-open offset ranges. Added per-row final length tracking (MAX_ROUNDED_ARRAY_LENGTH), separated positive/negative index handling, implemented null padding (see the sketch below this list), and revised validity bitmap updates. Updated test coverage with edge case validation.

  • Test case for array_reverse
    Files: spark/src/test/scala/org/apache/comet/CometArrayExpressionSuite.scala
    Summary: Added new test "array_reverse 2" validating reverse() on array<binary> columns with nullable elements. Includes ArrayType import and Parquet data generation.


@augmentcode augmentcode bot left a comment


Review completed. No suggestions at this time.

Comment augment review to trigger a new review at any time.


claude bot commented Nov 11, 2025

Code Review for PR #19: Fix null handling in CometVector implementations

Summary

This PR addresses null handling issues in CometVector implementations and refactors the array_insert function. The changes fix issue apache#2643 related to null handling in binary arrays and other data types.

Code Quality & Best Practices

Java Changes (CometVector classes)

Positive aspects:

  • ✅ Consistent null checks added across all affected methods
  • ✅ Simple, defensive programming approach - checking isNullAt() before processing
  • ✅ Follows the Spark ColumnVector API contract which requires returning null for null entries
  • ✅ Minimal changes that reduce risk of introducing new bugs

Observations:

  • The null checks in CometPlainVector.java:126, CometPlainVector.java:151, CometVector.java:88, CometListVector.java:48, and CometMapVector.java:68 are straightforward and correct
  • These align with Spark's ColumnVector behavior where accessor methods should return null when isNullAt(rowId) is true

Rust Changes (array_insert.rs)

Positive aspects:

  • ✅ Significant improvement in code clarity and documentation
  • ✅ More accurate implementation of Spark's array_insert semantics
  • ✅ Better handling of edge cases (positive/negative indices, out-of-bounds positions)
  • ✅ Clearer variable naming (final_len, corrected, padding)
  • ✅ Improved comments explaining the logic
  • ✅ Proper handling of half-open offset ranges in Arrow

Concerns:

  1. Large refactoring scope: This is a substantial rewrite of the function logic (143 additions, 71 deletions). While the new code appears cleaner, such large changes increase risk:

    • Consider breaking this into smaller, reviewable chunks if possible
    • The mixing of bug fixes with refactoring makes it harder to verify correctness
  2. Logic changes: The offset calculation logic has changed significantly:

    • Old: corrected_pos = if pos > 0 { (pos - 1).as_usize() } else { ... }
    • New: Multiple branches with different handling for positive/negative indices
    • Recommendation: Ensure extensive testing covers all edge cases
  3. Removed imports: The removal of Field and ArrowNativeType imports suggests code simplification, which is good

Potential Bugs or Issues

Critical

  1. Array length calculation (lines 197-297 in array_insert.rs):

    • The new logic has multiple branches for calculating final_len
    • Verify that final_len is always correct, especially for:
      • Negative indices beyond array start
      • Positive indices beyond array end
      • Both legacy and non-legacy modes
  2. Null handling consistency:

    • The Java changes add null checks, but verify that the Rust side properly handles nulls in the items_array
    • The old code had is_item_null checks that set validity - ensure the new code maintains correct validity bits

Medium

  1. Padding calculation (e.g., line 258):

    • let padding = corrected - len; - confirm this can never underflow (see the sketch after this list)
    • let padding = target_len.saturating_sub(len + 1); - good use of saturating_sub, but verify the surrounding logic
  2. Off-by-one errors: The refactoring changes multiple offset calculations. Key areas to verify:

    • Line 248: mutable_values.extend(0, start, start + corrected);
    • Line 267: let corrected = len - k + base_offset;
    • Legacy mode offset differences
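A tiny illustration of the underflow concern in the padding bullets above, assuming the operands are usize as in the kernel; the variable names are taken from the review comments rather than verified against the exact code:

```rust
fn main() {
    let corrected: usize = 2;
    let len: usize = 5;

    // A bare `corrected - len` would underflow here: it panics in debug builds
    // and wraps around in release builds, so the subtraction needs a guard.
    assert_eq!(corrected.checked_sub(len), None); // explicit "no padding possible"
    assert_eq!(corrected.saturating_sub(len), 0); // clamps to zero instead

    // The second padding expression quoted above follows the saturating pattern:
    // it yields 0 when no padding is required rather than underflowing.
    let target_len: usize = 4;
    assert_eq!(target_len.saturating_sub(len + 1), 0);
}
```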

Performance Considerations

Positive:

  • The new code pre-allocates capacity correctly
  • Uses Vec::with_capacity for better performance
  • Maintains zero-copy operations where possible

Neutral:

  • Performance should be similar to the old implementation
  • No obvious performance regressions expected

Security Concerns

Good:

  • ✅ Proper bounds checking with MAX_ROUNDED_ARRAY_LENGTH validation (lines 283-287; see the sketch below)
  • ✅ Validates position != 0 (line 234)
  • ✅ Uses saturating_sub to prevent underflow

No major security issues identified
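For illustration, a minimal sketch of the kind of final-length validation referenced in the bounds-checking bullet above. It assumes the constant mirrors Spark's ByteArrayMethods.MAX_ROUNDED_ARRAY_LENGTH and uses a plain String error to stay self-contained, so it is not the actual Comet code:

```rust
// Assumption: the limit mirrors Spark's ByteArrayMethods.MAX_ROUNDED_ARRAY_LENGTH.
const MAX_ROUNDED_ARRAY_LENGTH: usize = i32::MAX as usize - 15;

// Reject results that would exceed the maximum array length instead of building them.
fn validate_final_len(final_len: usize) -> Result<(), String> {
    if final_len > MAX_ROUNDED_ARRAY_LENGTH {
        return Err(format!(
            "Cannot create an array with {final_len} elements; limit is {MAX_ROUNDED_ARRAY_LENGTH}"
        ));
    }
    Ok(())
}

fn main() {
    assert!(validate_final_len(10).is_ok());
    assert!(validate_final_len(MAX_ROUNDED_ARRAY_LENGTH + 1).is_err());
}
```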

Test Coverage

Existing Tests

  • The PR includes existing test coverage for array_insert in the test suite
  • Tests cover: positive indices, negative indices, null items, out-of-bounds positions

New Tests

Positive:

  • ✅ New test test_array_insert_bug_repro_null_item_pos1_fixed() (lines 454-485) specifically tests the null handling bug
  • ✅ Test validates null item insertion at position 1
  • ✅ Uses proper assertions comparing Arrow array data

Recommendations:

  1. Add more edge case tests:

    • Test insertion at array boundaries (position = length + 1)
    • Test very large positive/negative indices
    • Test legacy vs non-legacy mode differences explicitly
    • Test with all-null arrays
    • Test with mixed null/non-null items
  2. Add the new "array_reverse 2" test (lines 778-800):

    • Good: Tests with randomly generated data
    • Good: Tests various array types including binary
    • This appears to be testing array_reverse, not array_insert - verify this is intentional

Recommendations

Must Fix

  1. Add more comprehensive unit tests for the refactored array_insert logic covering all branches
  2. Document the legacy mode behavior more clearly in comments

Should Consider

  1. Split the PR: Consider separating the Java null-check fixes from the Rust refactoring

    • This makes review easier and reduces risk
    • Java changes are straightforward bug fixes
    • Rust changes are a significant refactor
  2. Add inline examples in the Rust code comments showing what happens for specific inputs

  3. Verify Spark compatibility: Run the full Spark compatibility test suite to ensure behavior matches exactly

Nice to Have

  1. Add property-based tests (quickcheck/proptest) to verify invariants (see the sketch after this list)
  2. Add benchmarks to ensure performance hasn't regressed
  3. Consider adding debug assertions for intermediate calculations
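As a sketch of what such a property-based test could look like, using proptest against a simple reference model of positive-index insertion (the same illustrative model as the sketch near the top of this page); the model and invariants are assumptions for demonstration, not the kernel's actual API:

```rust
use proptest::prelude::*;

// Hypothetical reference model of positive-index insertion with null padding,
// used only to state invariants; it is not the Comet kernel.
fn model_insert(arr: &[Option<i32>], pos: usize, item: Option<i32>) -> Vec<Option<i32>> {
    let mut out = arr.to_vec();
    if pos <= arr.len() + 1 {
        out.insert(pos - 1, item);
    } else {
        out.extend(std::iter::repeat(None).take(pos - 1 - arr.len()));
        out.push(item);
    }
    out
}

proptest! {
    #[test]
    fn insertion_invariants(
        arr in prop::collection::vec(prop::option::of(any::<i32>()), 0..8),
        pos in 1usize..16,
    ) {
        let out = model_insert(&arr, pos, Some(42));
        // The inserted item lands at the 1-based position...
        prop_assert_eq!(out[pos - 1], Some(42));
        // ...and the result length is max(original length + 1, pos).
        prop_assert_eq!(out.len(), std::cmp::max(arr.len() + 1, pos));
    }
}
```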

Overall Assessment

Strengths:

  • Addresses real bugs in null handling
  • Improves code clarity and documentation
  • Good test coverage for the specific bug

Concerns:

  • Large refactoring scope makes verification challenging
  • Would benefit from more comprehensive tests
  • Mixing bug fixes with refactoring

Recommendation: The changes appear correct but would benefit from:

  1. Additional edge case testing
  2. Verification against Spark's reference implementation
  3. Consider splitting into smaller PRs if possible

The Java changes look good to merge. The Rust changes need careful verification of the offset calculation logic against all edge cases before merging.


Reviewed by Claude Code


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between da3c57b and 6a1eb5e.

📒 Files selected for processing (6)
  • common/src/main/java/org/apache/comet/vector/CometListVector.java (1 hunks)
  • common/src/main/java/org/apache/comet/vector/CometMapVector.java (1 hunks)
  • common/src/main/java/org/apache/comet/vector/CometPlainVector.java (2 hunks)
  • common/src/main/java/org/apache/comet/vector/CometVector.java (1 hunks)
  • native/spark-expr/src/array_funcs/array_insert.rs (3 hunks)
  • spark/src/test/scala/org/apache/comet/CometArrayExpressionSuite.scala (2 hunks)
🧰 Additional context used
🧠 Learnings (1)
📓 Common learnings
Learnt from: martin-augment
Repo: martin-augment/datafusion-comet PR: 7
File: native/spark-expr/src/math_funcs/abs.rs:201-302
Timestamp: 2025-11-04T14:26:48.750Z
Learning: In the abs function in native/spark-expr/src/math_funcs/abs.rs (Rust), NULL values for signed integers (Int8, Int16, Int32, Int64) and decimals (Decimal128, Decimal256) should return the argument as-is (e.g., ColumnarValue::Scalar(ScalarValue::Int8(None))) rather than panicking on unwrap().
🧬 Code graph analysis (1)
spark/src/test/scala/org/apache/comet/CometArrayExpressionSuite.scala (2)
common/src/main/scala/org/apache/comet/CometConf.scala (1)
  • CometConf (46-767)
spark/src/main/scala/org/apache/comet/testing/FuzzDataGenerator.scala (3)
  • SchemaGenOptions (238-254)
  • SchemaGenOptions (256-260)
  • DataGenOptions (262-268)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: claude-review
🔇 Additional comments (2)
common/src/main/java/org/apache/comet/vector/CometPlainVector.java (2)

126-126: LGTM! Proper null handling added.

The null check prevents unnecessary offset computation and memory access for null values. This defensive pattern is consistent with similar changes in other vector classes mentioned in the PR.


151-151: LGTM! Consistent null handling.

The null check prevents unnecessary offset computation and memory copying for null values. This mirrors the pattern added to getUTF8String at line 126 and aligns with the PR's objective.

Comment on lines +221 to 238

     for (row_index, window) in offsets.windows(2).enumerate() {
         let start = window[0].as_usize();
         let end = window[1].as_usize();
         let len = end - start;
         let pos = pos_data.value(row_index);

         if list_array.is_null(row_index) {
-            // In Spark if value of the array is NULL than nothing happens
-            mutable_values.extend_nulls(1);
-            new_offsets.push(new_offsets[row_index] + O::one());
-            new_nulls.push(false);
+            // Top-level list row is NULL: do not write any child values and do not advance offset
+            new_offsets.push(new_offsets[row_index]);
+            list_valid.push(false);
             continue;
         }

         if pos == 0 {
             return Err(DataFusionError::Internal(
-                "Position for array_insert should be greter or less than zero".to_string(),
+                "Position for array_insert should be greater or less than zero".to_string(),
             ));
         }

@coderabbitai coderabbitai bot Nov 11, 2025


⚠️ Potential issue | 🔴 Critical

Handle NULL positions before calling Int32Array::value

Int32Array::value(row_index) panics when the slot is NULL. Spark allows array_insert to receive NULL positions (the entire result should be NULL in that row), but this implementation dereferences the value before checking for validity, so any NULL index triggers an abort instead of returning a NULL row. Please guard pos_data in the same way we already guard the list itself.

-        let pos = pos_data.value(row_index);
-
-        if list_array.is_null(row_index) {
+        if list_array.is_null(row_index) {
             // Top-level list row is NULL: do not write any child values and do not advance offset
             new_offsets.push(new_offsets[row_index]);
             list_valid.push(false);
             continue;
         }
 
+        if pos_data.is_null(row_index) {
+            new_offsets.push(new_offsets[row_index]);
+            list_valid.push(false);
+            continue;
+        }
+
+        let pos = pos_data.value(row_index);
+
         if pos == 0 {
             return Err(DataFusionError::Internal(
                 "Position for array_insert should be greater or less than zero".to_string(),
             ));
         }
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
    for (row_index, window) in offsets.windows(2).enumerate() {
        let start = window[0].as_usize();
        let end = window[1].as_usize();
        let len = end - start;

        if list_array.is_null(row_index) {
            // Top-level list row is NULL: do not write any child values and do not advance offset
            new_offsets.push(new_offsets[row_index]);
            list_valid.push(false);
            continue;
        }

        if pos_data.is_null(row_index) {
            new_offsets.push(new_offsets[row_index]);
            list_valid.push(false);
            continue;
        }

        let pos = pos_data.value(row_index);

        if pos == 0 {
            return Err(DataFusionError::Internal(
                "Position for array_insert should be greater or less than zero".to_string(),
            ));
        }
🤖 Prompt for AI Agents
In native/spark-expr/src/array_funcs/array_insert.rs around lines 221 to 238,
the code calls pos_data.value(row_index) before checking whether the position
slot is NULL which panics on NULL positions; instead, first test
pos_data.is_null(row_index) and if true push the current new_offsets entry, push
false to list_valid and continue (so the result row is NULL), only then call
pos_data.value(row_index) and proceed with the existing pos==0 check and
subsequent logic.

@martin-augment (Owner, Author)


value:useful; category:bug; feedback: The CodeRabbit AI reviewer is correct that a null check is needed before using PrimitiveArray::value(). The Arrow docs note: "This method does not check for nulls and the value is arbitrary if is_null returns true for the index." (https://github.com/apache/arrow-rs/blob/a0db1985c3a0f3190cfc5166b428933a28c740f9/arrow-array/src/array/primitive_array.rs#L766-L767)



@martin-augment, thank you for confirming the issue and providing the Arrow documentation reference! You're absolutely right—the docs clearly state that value() doesn't perform null checks and returns arbitrary data for null slots.

The fix I suggested will ensure that when pos_data contains a NULL at any row, the code will properly produce a NULL result row instead of panicking or producing undefined behavior. This aligns with Spark's semantics where a NULL position should yield a NULL result.


✏️ Learnings added
Learnt from: martin-augment
Repo: martin-augment/datafusion-comet PR: 19
File: native/spark-expr/src/array_funcs/array_insert.rs:221-238
Timestamp: 2025-11-11T20:44:04.950Z
Learning: In Rust code using Apache Arrow arrays, always check `is_null(index)` before calling `value(index)` on `PrimitiveArray` types (such as `Int32Array`, `Int64Array`, etc.), because `value()` does not check for nulls and returns arbitrary values for null slots. This applies to functions in `native/spark-expr/src` that process Arrow arrays.

Note: Learnings are effective only in the context of similar code segments. To apply general coding guidelines, please configure review instructions. You can manage existing learnings in the app.


🧠 Learnings used
Learnt from: martin-augment
Repo: martin-augment/datafusion-comet PR: 7
File: native/spark-expr/src/math_funcs/abs.rs:201-302
Timestamp: 2025-11-04T14:26:48.750Z
Learning: In the abs function in native/spark-expr/src/math_funcs/abs.rs (Rust), NULL values for signed integers (Int8, Int16, Int32, Int64) and decimals (Decimal128, Decimal256) should return the argument as-is (e.g., ColumnarValue::Scalar(ScalarValue::Int8(None))) rather than panicking on unwrap().
