Skip to content

Improve unnest_column performance#6903

Merged
alamb merged 8 commits intoapache:mainfrom
vincev:unnest
Jul 13, 2023
Merged

Improve unnest_column performance#6903
alamb merged 8 commits intoapache:mainfrom
vincev:unnest

Conversation

@vincev
Copy link
Contributor

@vincev vincev commented Jul 10, 2023

Which issue does this PR close?

Closes #5198.

Rationale for this change

This PR implements the suggestion in #5198 to avoid copying unnested values using ScalarValue and use instead the take kernel.

With the changes in this PR unnesting a 5M rows parquet file that unnests to 250M rows takes ~6.5secs while the current version takes around ~10secs.

Current version with ScalarValue copy:

$ time ./target/release/datafusion-unnest
+-----------+
| points    |
+-----------+
| 247563120 |
+-----------+

real    0m10.111s
user    0m40.949s
sys     0m6.411s

this PR version with the take kernel:

$ time ./target/release/datafusion-unnest
+-----------+
| points    |
+-----------+
| 247563120 |
+-----------+

real    0m6.497s
user    0m25.862s
sys     0m4.777s

What changes are included in this PR?

Use the take kernel to replicate values in the unnested columns.

Are these changes tested?

Run tests locally and they all passed.

Are there any user-facing changes?

No

@github-actions github-actions bot added the core Core DataFusion crate label Jul 10, 2023
@vincev vincev changed the title Improve unnest performance Improve unnest_column performance Jul 10, 2023
@alamb
Copy link
Contributor

alamb commented Jul 10, 2023

Thanks @vincev -- I plan to review this PR tomorrow

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This code looks really nice @vincev -- one concern I have is that it seems to no longer handle StringArray (which is not a PrimitiveArray. However, since all the CI tests pass that either means I don't fully understand how this code works or we have a gap in testing. I will review this a bit more later today to try and figure that out

@vincev
Copy link
Contributor Author

vincev commented Jul 11, 2023

This code looks really nice @vincev -- one concern I have is that it seems to no longer handle StringArray (which is not a PrimitiveArray. However, since all the CI tests pass that either means I don't fully understand how this code works or we have a gap in testing.

Thank you @alamb , just doubled checked the tests here, we cover for StringArray. As far I can see the PrimitiveArray is only used for the list lengths and the indices that are used by the take kernel, I'll improve the comments on how it works after your review.

DataType::Int32 => {
let list_lengths = as_primitive_array::<Int32Type>(&list_lengths)?;
let indices = create_take_indices(list_lengths, unnested_array.len());
batch_from_indices(batch, schema, column.index(), &unnested_array, &indices)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am struggling to understand why the batch is still needed here -- once we have the list array it seems like the code could just unest that directly (and not have to do anything to the other columns).

@tustvold any chance you have some time to help review this code too? If not I'll do it tomorrow when my mind is a little fresher

Copy link
Contributor Author

@vincev vincev Jul 12, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We need this final step so that all the other columns have the same number of values as the unnest_array. Let's say we have the following batch:

+----+-----------------+
| id | tags            |
+----+-----------------+
| 1  | [t11]           |
| 2  | [t21, t22]      |
| 3  | [t31, t32, t33] |
+----+-----------------+

and we want to unnest the tags column, the unnest_array will have 6 values:

+-------+
| tags  |
+-------+
| t11   |
| t21   |
| t22   |
| t31   |
| t32   |
| t33   |
+-------+

now we need to adjust the id column so that it has the same number of values as the unnested tags column while maintaining the association with the unnested list values, so that we have:

+----+-------+
| id | tags  |
+----+-------+
| 1  | t11   |
| 2  | t21   |
| 2  | t22   |
| 3  | t31   |
| 3  | t32   |
| 3  | t33   |
+----+-------+

The batch_from_indices function expands all the other unnested columns using the take kernel to replicate their values so that they match the unnested array, in this example the indices array contains:

[0, 1, 1, 2, 2, 2]

so that the row with id 2 is repeated twice and the row with id 3 is repeated 3 times. Hopefully this clarify the behavior.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @vincev -- what I was confused about is that if I look at this description:

    // Create an array with the unnested values of the list array, given the list
    // array:
    //
    //   [1], null, [2, 3, 4], null, [5, 6]
    //
    // the result array is:
    //
    //   1, null, 2, 3, 4, null, 5, 6
    //
    let unnested_array = unnest_array(list_array)?;
``

This looks very much the same to me as calling `list_array.vaules()` to get access to the underlying values: https://docs.rs/arrow/latest/arrow/array/struct.GenericListArray.html#method.values

In this case the values array would be more like

[1, 2, 3, 4, 5, 6]


And the offsets of the list array would be would be like (I think):

[0, 1, 1, 3, 3, 6]


With a null mask showing the second and fourth element are null



So I was thinking you could calculate the take indices directly from the offsets / nulls without having to copy all the values out of the underlying array

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see, if you prefer I can try and use this approach, please let me know.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well, I think it would make unest even faster than it currently is. If you would like to try it great, but I think we can also leave it a follow on work.

I filed #6961 to track it the idea

@alamb
Copy link
Contributor

alamb commented Jul 12, 2023

This PR is very much on my list / radar @vincev -- i am sorry for the delay in approving (I have been busy with #6904)

I will give it the proper review / attention it needs tomorrow. Thank you for understanding

@vincev
Copy link
Contributor Author

vincev commented Jul 12, 2023

This PR is very much on my list / radar @vincev -- i am sorry for the delay in approving (I have been busy with #6904)

I will give it the proper review / attention it needs tomorrow. Thank you for understanding

Thank you @alamb, no worries take your time.

@vincev
Copy link
Contributor Author

vincev commented Jul 13, 2023

Added a test for FixedSizeList and a fix, this was broken on main as well.

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @vincev -- I reviewed this code and while I think there may be a way to make it even faster (I left a detailed comment) this seems much better than what is on main. Also I found it very well commented and tested.

I also ran the changes in this PR against the IOx tests (https://github.com/influxdata/influxdb_iox/pull/8229) and everything passed 👍

I'll file a ticket upstream to add support for FixedSizeBinary to the length kernel

DataType::Int32 => {
let list_lengths = as_primitive_array::<Int32Type>(&list_lengths)?;
let indices = create_take_indices(list_lengths, unnested_array.len());
batch_from_indices(batch, schema, column.index(), &unnested_array, &indices)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @vincev -- what I was confused about is that if I look at this description:

    // Create an array with the unnested values of the list array, given the list
    // array:
    //
    //   [1], null, [2, 3, 4], null, [5, 6]
    //
    // the result array is:
    //
    //   1, null, 2, 3, 4, null, 5, 6
    //
    let unnested_array = unnest_array(list_array)?;
``

This looks very much the same to me as calling `list_array.vaules()` to get access to the underlying values: https://docs.rs/arrow/latest/arrow/array/struct.GenericListArray.html#method.values

In this case the values array would be more like

[1, 2, 3, 4, 5, 6]


And the offsets of the list array would be would be like (I think):

[0, 1, 1, 3, 3, 6]


With a null mask showing the second and fourth element are null



So I was thinking you could calculate the take indices directly from the offsets / nulls without having to copy all the values out of the underlying array

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@vincev can you please merge this PR up with the latest main and then we can merge it in?

If possible could you also add a link to apache/arrow-rs#4517 (I left a suggestion) in the comments so we can potentially clean that up later?

Thanks again

@vincev
Copy link
Contributor Author

vincev commented Jul 13, 2023

There is a clippy error after merging main into this branch but it doesn't look related to the changes in this PR.

@alamb
Copy link
Contributor

alamb commented Jul 13, 2023

The clippy error was fixed in #6959 so I think this PR is good to go. Thanks again @vincev

@alamb alamb merged commit 03f7cc9 into apache:main Jul 13, 2023
@vincev
Copy link
Contributor Author

vincev commented Jul 13, 2023

Thank you @alamb for your review.

@vincev vincev deleted the unnest branch July 13, 2023 18:08
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

core Core DataFusion crate

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Improve unnest column performance.

2 participants