2643: Fix: Fix null handling in CometVector implementations #19
Conversation
Walkthrough
Adds null checks in Java vector classes to return null for null entries. Rewrites the array_insert Rust function to align with Spark semantics using proper offset ranges, null padding, and final length validation. Adds a test case for array_reverse on binary arrays.
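To make the null-padding behavior concrete, here is a minimal sketch of Spark-style array_insert semantics for positive positions, written against plain Rust Vecs rather than the Arrow-based kernel in array_insert.rs; the function name and types are illustrative only and negative positions are deliberately omitted.

```rust
// Minimal sketch only: Spark-style array_insert for positive positions on a
// plain Vec, not the Arrow-based kernel in array_insert.rs. Negative
// positions and Arrow offset handling are deliberately omitted.
fn array_insert_positive(arr: &[Option<i32>], pos: i64, item: Option<i32>) -> Vec<Option<i32>> {
    assert!(pos > 0, "this sketch only covers pos > 0");
    let idx = (pos - 1) as usize; // Spark positions are 1-based
    let mut out = arr.to_vec();
    if idx >= out.len() {
        // Inserting past the end pads the gap with nulls first.
        out.resize(idx, None);
        out.push(item);
    } else {
        out.insert(idx, item);
    }
    out
}

fn main() {
    // array_insert([1, 2], 5, 9) pads with nulls: [1, 2, NULL, NULL, 9]
    let result = array_insert_positive(&[Some(1), Some(2)], 5, Some(9));
    assert_eq!(result, vec![Some(1), Some(2), None, None, Some(9)]);
    println!("{result:?}");
}
```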
Review completed. No suggestions at this time.
Code Review for PR #19: Fix null handling in CometVector implementations

Summary
This PR addresses null handling issues in CometVector implementations and refactors the array_insert Rust function.

Code Quality & Best Practices

Java Changes (CometVector classes)
Positive aspects:
Observations:

Rust Changes (array_insert.rs)
Positive aspects:
Concerns:

Potential Bugs or Issues
Critical
Medium

Performance Considerations
Positive:
Neutral:

Security Concerns
Good: No major security issues identified.

Test Coverage
Existing Tests
New Tests
Positive:
Recommendations:

Recommendations
Must Fix
Should Consider
Nice to Have

Overall Assessment
Strengths:
Concerns:
Recommendation: The changes appear correct but would benefit from:

The Java changes look good to merge. The Rust changes need careful verification of the offset calculation logic against all edge cases before merging.

Reviewed by Claude Code
Actionable comments posted: 1
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (6)
common/src/main/java/org/apache/comet/vector/CometListVector.java (1 hunks)
common/src/main/java/org/apache/comet/vector/CometMapVector.java (1 hunks)
common/src/main/java/org/apache/comet/vector/CometPlainVector.java (2 hunks)
common/src/main/java/org/apache/comet/vector/CometVector.java (1 hunks)
native/spark-expr/src/array_funcs/array_insert.rs (3 hunks)
spark/src/test/scala/org/apache/comet/CometArrayExpressionSuite.scala (2 hunks)
🧰 Additional context used
🧠 Learnings (1)
📓 Common learnings
Learnt from: martin-augment
Repo: martin-augment/datafusion-comet PR: 7
File: native/spark-expr/src/math_funcs/abs.rs:201-302
Timestamp: 2025-11-04T14:26:48.750Z
Learning: In the abs function in native/spark-expr/src/math_funcs/abs.rs (Rust), NULL values for signed integers (Int8, Int16, Int32, Int64) and decimals (Decimal128, Decimal256) should return the argument as-is (e.g., ColumnarValue::Scalar(ScalarValue::Int8(None))) rather than panicking on unwrap().
🧬 Code graph analysis (1)
spark/src/test/scala/org/apache/comet/CometArrayExpressionSuite.scala (2)
common/src/main/scala/org/apache/comet/CometConf.scala (1)
CometConf (46-767)
spark/src/main/scala/org/apache/comet/testing/FuzzDataGenerator.scala (3)
SchemaGenOptions (238-254)
SchemaGenOptions (256-260)
DataGenOptions (262-268)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: claude-review
🔇 Additional comments (2)
common/src/main/java/org/apache/comet/vector/CometPlainVector.java (2)
126-126: LGTM! Proper null handling added. The null check prevents unnecessary offset computation and memory access for null values. This defensive pattern is consistent with similar changes in other vector classes mentioned in the PR.
151-151: LGTM! Consistent null handling. The null check prevents unnecessary offset computation and memory copying for null values. This mirrors the pattern added to getUTF8String at line 126 and aligns with the PR's objective.
     for (row_index, window) in offsets.windows(2).enumerate() {
         let start = window[0].as_usize();
         let end = window[1].as_usize();
         let len = end - start;
         let pos = pos_data.value(row_index);

         if list_array.is_null(row_index) {
-            // In Spark if value of the array is NULL than nothing happens
-            mutable_values.extend_nulls(1);
-            new_offsets.push(new_offsets[row_index] + O::one());
-            new_nulls.push(false);
+            // Top-level list row is NULL: do not write any child values and do not advance offset
+            new_offsets.push(new_offsets[row_index]);
+            list_valid.push(false);
             continue;
         }

         if pos == 0 {
             return Err(DataFusionError::Internal(
-                "Position for array_insert should be greter or less than zero".to_string(),
+                "Position for array_insert should be greater or less than zero".to_string(),
             ));
         }
Handle NULL positions before calling Int32Array::value
Int32Array::value(row_index) does not check validity, so reading a NULL slot yields arbitrary data. Spark allows array_insert to receive NULL positions (the entire result should be NULL in that row), but this implementation dereferences the value before checking for validity, so a NULL position produces an undefined result instead of a NULL row. Please guard pos_data in the same way we already guard the list itself.
-    let pos = pos_data.value(row_index);
-
-    if list_array.is_null(row_index) {
+    if list_array.is_null(row_index) {
         // Top-level list row is NULL: do not write any child values and do not advance offset
         new_offsets.push(new_offsets[row_index]);
         list_valid.push(false);
         continue;
     }
+    if pos_data.is_null(row_index) {
+        new_offsets.push(new_offsets[row_index]);
+        list_valid.push(false);
+        continue;
+    }
+
+    let pos = pos_data.value(row_index);
+
     if pos == 0 {
         return Err(DataFusionError::Internal(
             "Position for array_insert should be greater or less than zero".to_string(),
         ));
     }

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
    for (row_index, window) in offsets.windows(2).enumerate() {
        let start = window[0].as_usize();
        let end = window[1].as_usize();
        let len = end - start;

        if list_array.is_null(row_index) {
            // Top-level list row is NULL: do not write any child values and do not advance offset
            new_offsets.push(new_offsets[row_index]);
            list_valid.push(false);
            continue;
        }

        if pos_data.is_null(row_index) {
            new_offsets.push(new_offsets[row_index]);
            list_valid.push(false);
            continue;
        }

        let pos = pos_data.value(row_index);

        if pos == 0 {
            return Err(DataFusionError::Internal(
                "Position for array_insert should be greater or less than zero".to_string(),
            ));
        }
🤖 Prompt for AI Agents
In native/spark-expr/src/array_funcs/array_insert.rs around lines 221 to 238,
the code calls pos_data.value(row_index) before checking whether the position
slot is NULL which panics on NULL positions; instead, first test
pos_data.is_null(row_index) and if true push the current new_offsets entry, push
false to list_valid and continue (so the result row is NULL), only then call
pos_data.value(row_index) and proceed with the existing pos==0 check and
subsequent logic.
value: useful; category: bug; feedback: The CodeRabbit AI reviewer is correct that a check for null is needed before using PrimitiveArray::value(). The Arrow docs note: "This method does not check for nulls and the value is arbitrary if is_null returns true for the index." (https://github.com/apache/arrow-rs/blob/a0db1985c3a0f3190cfc5166b428933a28c740f9/arrow-array/src/array/primitive_array.rs#L766-L767)
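A small standalone check of the behavior described above, assuming the arrow crate is available; the position values are made up for illustration.

```rust
use arrow::array::{Array, Int32Array};

fn main() {
    // Hypothetical position column containing a NULL slot.
    let pos_data = Int32Array::from(vec![Some(2), None, Some(-1)]);

    for row_index in 0..pos_data.len() {
        if pos_data.is_null(row_index) {
            // Guard first: value() does not check validity and returns an
            // arbitrary value for null slots, so the row must become NULL.
            println!("row {row_index}: NULL position -> NULL result row");
            continue;
        }
        let pos = pos_data.value(row_index);
        println!("row {row_index}: pos = {pos}");
    }
}
```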
@martin-augment, thank you for confirming the issue and providing the Arrow documentation reference! You're absolutely right—the docs clearly state that value() doesn't perform null checks and returns arbitrary data for null slots.
The fix I suggested will ensure that when pos_data contains a NULL at any row, the code will properly produce a NULL result row instead of panicking or producing undefined behavior. This aligns with Spark's semantics where a NULL position should yield a NULL result.
✏️ Learnings added
Learnt from: martin-augment
Repo: martin-augment/datafusion-comet PR: 19
File: native/spark-expr/src/array_funcs/array_insert.rs:221-238
Timestamp: 2025-11-11T20:44:04.950Z
Learning: In Rust code using Apache Arrow arrays, always check `is_null(index)` before calling `value(index)` on `PrimitiveArray` types (such as `Int32Array`, `Int64Array`, etc.), because `value()` does not check for nulls and returns arbitrary values for null slots. This applies to functions in `native/spark-expr/src` that process Arrow arrays.
Note: Learnings are effective only in the context of similar code segments. To apply general coding guidelines, please configure review instructions. You can manage existing learnings in the app.
🧠 Learnings used
Learnt from: martin-augment
Repo: martin-augment/datafusion-comet PR: 7
File: native/spark-expr/src/math_funcs/abs.rs:201-302
Timestamp: 2025-11-04T14:26:48.750Z
Learning: In the abs function in native/spark-expr/src/math_funcs/abs.rs (Rust), NULL values for signed integers (Int8, Int16, Int32, Int64) and decimals (Decimal128, Decimal256) should return the argument as-is (e.g., ColumnarValue::Scalar(ScalarValue::Int8(None))) rather than panicking on unwrap().
2643: To review by AI