Improve unnest_column performance #6903
Conversation
Thanks @vincev -- I plan to review this PR tomorrow
This code looks really nice @vincev -- one concern I have is that it seems to no longer handle StringArray (which is not a PrimitiveArray). However, since all the CI tests pass, that either means I don't fully understand how this code works or we have a gap in testing. I will review this a bit more later today to try and figure that out.
Thank you @alamb, just double-checked the tests here, we cover for
DataType::Int32 => {
    let list_lengths = as_primitive_array::<Int32Type>(&list_lengths)?;
    let indices = create_take_indices(list_lengths, unnested_array.len());
    batch_from_indices(batch, schema, column.index(), &unnested_array, &indices)
I am struggling to understand why the batch is still needed here -- once we have the list array it seems like the code could just unnest that directly (and not have to do anything to the other columns).
@tustvold any chance you have some time to help review this code too? If not I'll do it tomorrow when my mind is a little fresher
We need this final step so that all the other columns have the same number of values as the unnest_array. Let's say we have the following batch:
+----+-----------------+
| id | tags |
+----+-----------------+
| 1 | [t11] |
| 2 | [t21, t22] |
| 3 | [t31, t32, t33] |
+----+-----------------+
and we want to unnest the tags column, the unnest_array will have 6 values:
+-------+
| tags |
+-------+
| t11 |
| t21 |
| t22 |
| t31 |
| t32 |
| t33 |
+-------+
now we need to adjust the id column so that it has the same number of values as the unnested tags column while maintaining the association with the unnested list values, so that we have:
+----+-------+
| id | tags |
+----+-------+
| 1 | t11 |
| 2 | t21 |
| 2 | t22 |
| 3 | t31 |
| 3 | t32 |
| 3 | t33 |
+----+-------+
The batch_from_indices function expands all the other columns using the take kernel to replicate their values so that they match the unnested array; in this example the indices array contains:
[0, 1, 1, 2, 2, 2]
so that the row with id 2 is repeated twice and the row with id 3 is repeated 3 times. Hopefully this clarifies the behavior.
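For illustration, here is a minimal standalone sketch (not the PR code itself) of how the take kernel expands the id column with those indices:

```rust
use arrow::array::{Int64Array, UInt32Array};
use arrow::compute;

fn main() -> Result<(), arrow::error::ArrowError> {
    // The `id` column before unnesting: one row per list.
    let ids = Int64Array::from(vec![1, 2, 3]);

    // One index per unnested value: row 0 once, row 1 twice, row 2 three times.
    let indices = UInt32Array::from(vec![0u32, 1, 1, 2, 2, 2]);

    // `take` gathers the rows at those indices, replicating ids 2 and 3.
    let expanded = compute::take(&ids, &indices, None)?;
    // `expanded` is now [1, 2, 2, 3, 3, 3], matching the unnested `tags` column.
    println!("{expanded:?}");
    Ok(())
}
```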
Thanks @vincev -- what I was confused about is that if I look at this description:
// Create an array with the unnested values of the list array, given the list
// array:
//
// [1], null, [2, 3, 4], null, [5, 6]
//
// the result array is:
//
// 1, null, 2, 3, 4, null, 5, 6
//
let unnested_array = unnest_array(list_array)?;
This looks very much the same to me as calling `list_array.values()` to get access to the underlying values: https://docs.rs/arrow/latest/arrow/array/struct.GenericListArray.html#method.values
In this case the values array would be more like
[1, 2, 3, 4, 5, 6]
And the offsets of the list array would be like (I think):
[0, 1, 1, 4, 4, 6]
With a null mask showing the second and fourth elements are null.
So I was thinking you could calculate the take indices directly from the offsets / nulls without having to copy all the values out of the underlying array.
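As a rough sketch of that idea (take_indices_from_offsets is a hypothetical helper, and null handling is simplified), the indices for the other columns could be derived directly from the list offsets and validity:

```rust
use arrow::array::{Array, ListArray, UInt32Array};

// Hypothetical helper: build take indices for the non-list columns directly
// from the list offsets, without materializing the unnested values first.
// Null and empty lists still contribute one output row here, mirroring the
// null rows in the example above; a real implementation would also need a
// matching null slot on the values side.
fn take_indices_from_offsets(list: &ListArray) -> UInt32Array {
    let offsets = list.value_offsets();
    let mut indices = Vec::new();
    for row in 0..list.len() {
        let len = (offsets[row + 1] - offsets[row]) as usize;
        if list.is_null(row) || len == 0 {
            // One output row for a null/empty list.
            indices.push(row as u32);
        } else {
            // One output row per element in this list.
            indices.extend(std::iter::repeat(row as u32).take(len));
        }
    }
    UInt32Array::from(indices)
}
```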
I see, if you prefer I can try and use this approach, please let me know.
Well, I think it would make unnest even faster than it currently is. If you would like to try it, great, but I think we can also leave it as follow-on work.
I filed #6961 to track the idea.
Added a test for
alamb
left a comment
Thank you @vincev -- I reviewed this code and while I think there may be a way to make it even faster (I left a detailed comment) this seems much better than what is on main. Also I found it very well commented and tested.
I also ran the changes in this PR against the IOx tests (https://github.com/influxdata/influxdb_iox/pull/8229) and everything passed 👍
I'll file a ticket upstream to add support for FixedSizeBinary to the length kernel
alamb
left a comment
@vincev can you please merge this PR up with the latest main and then we can merge it in?
If possible could you also add a link to apache/arrow-rs#4517 (I left a suggestion) in the comments so we can potentially clean that up later?
Thanks again
There is a clippy error after merging main into this branch but it doesn't look related to the changes in this PR.
Thank you @alamb for your review.
Which issue does this PR close?
Closes #5198.
Rationale for this change
This PR implements the suggestion in #5198 to avoid copying unnested values using `ScalarValue` and instead use the `take` kernel. With the changes in this PR, unnesting a 5M row Parquet file that unnests to 250M rows takes ~6.5 secs, while the current version takes around ~10 secs.
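As a rough illustration of the approach (a simplified sketch, not the actual create_take_indices in the PR; it ignores nulls and the Int64 length case), the per-row list lengths produced by the `length` kernel can be turned into take indices that repeat each source row once per unnested value:

```rust
use arrow::array::{Array, Int32Array, UInt32Array};

// Sketch only: turn per-row list lengths (as produced by the `length` kernel)
// into take indices that repeat each source row once per unnested value.
fn create_take_indices(list_lengths: &Int32Array, capacity: usize) -> UInt32Array {
    let mut builder = UInt32Array::builder(capacity);
    for row in 0..list_lengths.len() {
        // Null/empty lists still get one output row in this simplified sketch.
        let repeat = list_lengths.value(row).max(1) as usize;
        for _ in 0..repeat {
            builder.append_value(row as u32);
        }
    }
    builder.finish()
}
```

The resulting indices are then passed to the `take` kernel for every non-unnested column, as in the example earlier in the conversation.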
Current version with `ScalarValue` copy:
This PR version with the `take` kernel:
What changes are included in this PR?
Use the `take` kernel to replicate values in the unnested columns.
Are these changes tested?
Ran tests locally and they all passed.
Are there any user-facing changes?
No