-
Notifications
You must be signed in to change notification settings - Fork 1.5k
Fix: Map functions crash on out of bounds cases #16203
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
Detect if array elements contain structs with null-typed fields and emit a proper null array instead of trying to extract malformed data. This prevents Arrow's NullArray buffer validation from panicking.
1. Reproduce the state in the issue and test the fix has worked. 2. Case where the array has mixed null fields. 3. Case with out-of-bounds array access.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull Request Overview
This PR resolves a crash occurring during array indexing when map functions encounter structs with Null-typed fields. Key changes include adding detection of structs containing Null fields in extract.rs and returning a properly typed null array, along with comprehensive tests for the crash scenario, mixed null/valid fields, and out-of-bounds access.
let mut null_array_data = Vec::with_capacity(array.len()); | ||
for _ in 0..array.len() { | ||
null_array_data.push(None::<i32>); | ||
} | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The variable 'null_array_data' is initialized and populated but never used, which may lead to confusion. Consider removing this unused code to simplify the function.
let mut null_array_data = Vec::with_capacity(array.len()); | |
for _ in 0..array.len() { | |
null_array_data.push(None::<i32>); | |
} |
Copilot uses AI. Check for mistakes.
@@ -225,6 +225,23 @@ where | |||
return Ok(Arc::new(NullArray::new(array.len()))); | |||
} | |||
|
|||
if let DataType::Struct(fields) = values.data_type() { | |||
let has_null_fields = fields.iter().any(|field| field.data_type() == &Null); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please help me to understand why we check by structure field types?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
oh, now I realize the problem happens only if the struct has with with null
as the value
let has_null_fields = fields.iter().any(|field| field.data_type() == &Null); | ||
if has_null_fields { | ||
// Instead of trying to extract from malformed struct data and return appropriate nulls | ||
let mut null_array_data = Vec::with_capacity(array.len()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
to return I think we can return NullArray like here?
if values.data_type().is_null() {
return Ok(Arc::new(NullArray::new(array.len())));
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @krishvishal
Please add SQL tests as well to array.slt
file
what comes to my mind is
select [named_struct('a', 1, 'b', null)][0];
select [named_struct('a', 1, 'b', null)][1];
select [named_struct('a', 1, 'b', null)][-1];
select [named_struct('a', 1, 'b', null)][2]
select map_values(map([named_struct('a', 1, 'b', null)], [named_struct('a', 2, 'b', null)]))[0]
select map_values(map([named_struct('a', 1, 'b', null)], [named_struct('a', 2, 'b', null)]))[1]
select map_values(map([named_struct('a', 1, 'b', null)], [named_struct('a', 2, 'b', null)]))[-1]
select map_values(map([named_struct('a', 1, 'b', null)], [named_struct('a', 2, 'b', null)]))[2]
select map_keys(map([named_struct('a', 1, 'b', null)], [named_struct('a', 2, 'b', null)]))[0]
select map_keys(map([named_struct('a', 1, 'b', null)], [named_struct('a', 2, 'b', null)]))[1]
select map_keys(map([named_struct('a', 1, 'b', null)], [named_struct('a', 2, 'b', null)]))[-1]
select map_keys(map([named_struct('a', 1, 'b', null)], [named_struct('a', 2, 'b', null)]))[2]
;
@krishvishal are you still planning to wrap this PR up? |
returning all nulls Improves upon the previous fix that detected null-typed fields in structs but incorrectly returned all null results. Now properly extracts valid struct elements while handling null-typed fields gracefully. Changes: - Replace blanket null return with handle_struct_with_null_fields() - Process each struct field individually using MutableArrayData for valid fields - Create proper NullArrays only for null-typed fields - Maintain correct null buffer for the overall result This allows queries like [named_struct('a', 1, 'b', null)][1] to correctly return {a: 1, b: NULL} instead of NULL. Fixes: apache#16187
acess of struct arrays with null fields
@comphead I've changed the implementation a bit to handle nulls properly. Previous fix just outputs I've also added tests to I think its ready. Apologies for the delay. |
Thanks @krishvishal the latest version becomes much more complicated compared to prev one. This can be a subject to check the performance. What is the reason for adding the specific handler, some tests not passing? |
@comphead I've had to add the handler because the previous fix caused wrong behavior. For example the following query currently returns: > select [named_struct('a', 1, 'b', null)][1];
+-----------------------------------------------------------------------+
| make_array(named_struct(Utf8("a"),Int64(1),Utf8("b"),NULL))[Int64(1)] |
+-----------------------------------------------------------------------+
| {a: 1, b: NULL} |
+-----------------------------------------------------------------------+
1 row(s) fetched.
Elapsed 0.007 seconds. This is correct. But my previous fix (commit: d9a699f) returned |
for (row_index, offset_window) in array.offsets().windows(2).enumerate() { | ||
let start = offset_window[0]; | ||
let end = offset_window[1]; | ||
let len = end - start; | ||
|
||
// array is null | ||
if len == O::usize_as(0) { | ||
if array.is_null(row_index) || len == O::usize_as(0) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
seems these 2 if statements can be merged into 1?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
From my understanding
is_null()
checks if an element at an index is NULL
.
2nd condition checks if the element is an empty list.
They are checking different things right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
right I had a feeling we can merge them like
if array.is_null(row_index) || len == O::usize_as(0) || indexes.is_null(row_index) {
mutable.extend_nulls(1);
continue;
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh, I understand what you mean. Yes they can be merged.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
i64: TryInto<O>, | ||
{ | ||
let index: O = index.try_into().map_err(|_| { | ||
DataFusionError::Execution(format!("array_element got invalid index: {index}")) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
DataFusionError::Execution(format!("array_element got invalid index: {index}")) | |
exec_err!("array_element got invalid index: {index}") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
- merge conditionals - use `exec_err!` - Fix warning raised by clippy
@comphead I've pushed additional fixes you've asked. I think the PR is ready. Let me know if it is okay. |
Thanks @krishvishal let me quickly double check if we can do it with less data massaging. Since this will happen on execution layer it would be called for every batch of data possibly hitting the performance |
@comphead, can you please tell if there is something I can do to move this forward? Are there any relevant benches I could either run or adapt for this case? |
Sorry I didn't have a chance to look into that, I'm planning to do it this week |
Which issue does this PR close?
null
value #16187Rationale for this change
This PR fixes a panic crash in array indexing operations when accessing elements from
map_values()
results that contain structs withNull
-typed fields.What changes are included in this PR?
datafusion/functions-nested/src/extract.rs
: Added null field detection and safe handlingAre these changes tested?
Yes - unit tests are added covering:
Null
-typed fieldsAre there any user-facing changes?
No, only behavior change.