-
Notifications
You must be signed in to change notification settings - Fork 1.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Minor: Support nulls
in array_replace
, avoid a copy
#8054
Conversation
let from_array = &args[1]; | ||
let to_array = &args[2]; | ||
|
||
/// For each element of `list_array[i]`, replaces up to `arr_n[i]` occurences |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In general, having a clear description of what the function does, especially when it is mind bending likegeneral_replace
helps a lot. I also find that writing such a description often helps improve the code as well as results in better naming
For example, I struggled to explain how this function worked when it took &[ArrayRef]
because then the explanations were in terms fof arg[0]
, arg[1]
, and arg[2]
. Giving the parameters names made things much clearer
|
||
for (row_index, (arr, n)) in list_array.iter().zip(arr_n.iter()).enumerate() { | ||
// n is the number of elements to replace in this row | ||
for (row_index, (list_array_row, n)) in |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I rename these variables to better explain what they are (rather than arr
a shorthand for array
which is already overloaded in this function
bool_values.push(Some(a.eq(&from_a))); | ||
} else { | ||
return internal_err!( | ||
"Null value is not supported in array_replace" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't know why nulls aren't supported -- all of the kernels support them well, and the code is more concise without the special case
} | ||
}; | ||
|
||
// Use MutableArrayData to build the replaced array | ||
let original_data = list_array_row.to_data(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Rather than using a new array with two indexes, I found naming them helped a lot
new_empty_array(&data_type) | ||
} else { | ||
let new_values: Vec<_> = new_values.iter().map(|a| a.as_ref()).collect(); | ||
arrow::compute::concat(&new_values)? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The old logic called concat
for each row, which I think is O(N^2)
as it keeps copying the same values over and over again. I changed it to call concat
just once
I think it is probably possible to avoid calling concat
entirely (use MutableArrayData
to build up values directly). We can do that as a follow on PR perhaps.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We used to replace MutableArrayData with ListArray, as it is more straightforward. However, perhaps we can also construct ListArray using MutableArrayData and still maintain readability. Particularly in this case, doing so can enhance performance. We should certainly consider building with MutableArrayData!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, what I was getting at is that the contents of values
are each already created using MutableArrayData
-- and so I think rather than making many small arrays and then concat
ing them together, the could could just use the same MutableArrayData
to build the results directly
@@ -2763,6 +2790,52 @@ mod tests { | |||
); | |||
} | |||
|
|||
#[test] | |||
fn test_array_replace_with_null() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since the null cases were not covered, I added new coverage. Maybe this style of test could help as we clean up the other aray functions
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would prefer having test in slt file if possible, we can cleanup test in the follow on PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
slt is a good idea -- I will move the tests
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done in c6dabcf
nulls
in array_replace
, avoid a copy
#[test] | ||
fn test_array_replace_with_null() { | ||
// ([3, 1, NULL, 3], 3, 4, 2) => [4, 1, NULL, 4] NULL not matched | ||
// ([3, 1, NULL, 3], NULL, 5, 2) => [3, 1, NULL, 3] NULL not replaced (not eq) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why null is not replaced with 5 => [3, 1, 5, 3]?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems that arrow_ord::cmp::eq([3,1,null,3], null) return [null,null,null,null] which I expect to see [false,false,true,false]
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe it's reasonable for arrow-rs to handle null comparisons
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it is because in SQL, NULL = NULL
is NOT true
like you might expect, but is actually NULL
If we want the NULL = NULL
= true behavior, we could use the not_distinct
kernel instead: https://docs.rs/arrow/latest/arrow/compute/kernels/cmp/fn.not_distinct.html
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it is because in SQL,
NULL = NULL
is NOTtrue
like you might expect, but is actuallyNULL
If we want the
NULL = NULL
= true behavior, we could use thenot_distinct
kernel instead: https://docs.rs/arrow/latest/arrow/compute/kernels/cmp/fn.not_distinct.html
I would prefer having not_distinct
here, it can let us replace null value with another, if we use eq
here, replace with null value is meaningless, also there is no other DB have the similar function #7072, so I think we can have our own definition of what array_replace
do.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok, sounds good. I will update this PR
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done in acbffa0
@@ -2763,6 +2790,52 @@ mod tests { | |||
); | |||
} | |||
|
|||
#[test] | |||
fn test_array_replace_with_null() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would prefer having test in slt file if possible, we can cleanup test in the follow on PR.
nulls
in array_replace
, avoid a copynulls
in array_replace
, avoid a copy
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM!
@jackwener or @liukun4515 can I trouble you for an approval of this PR so I can merge it? @jayzhan211 has already reviewed it |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice job! Thanks @alamb
Thank you for the review @jackwener and @jayzhan211 -- very much appreciated. |
Co-authored-by: jakevin <jakevingoo@gmail.com>
Which issue does this PR close?
Follow on to #8050 from @jayzhan211 --
Rationale for this change
The change in #8050 is a huge improvement, and while reviewing it I saw a few more things to improve, specifically have it handle nulls handing and avoiding some copies.
I also wanted to try and help illustrate some patterns for other array function rewrites that I think @jayzhan211 is planning
What changes are included in this PR?
general_array
and change names of parameters and variables so it is easier (in my opinion) to follow the logicAre these changes tested?
Are there any user-facing changes?
array_replace
now properly handlesNULL
values