Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add MutableArrayData::extend_ranges #1229

Closed
tustvold opened this issue Jan 23, 2022 · 3 comments
Closed

Add MutableArrayData::extend_ranges #1229

tustvold opened this issue Jan 23, 2022 · 3 comments
Labels
enhancement Any new improvement worthy of a entry in the changelog

Comments

@tustvold
Copy link
Contributor

tustvold commented Jan 23, 2022

Is your feature request related to a problem or challenge? Please describe what you are trying to do.

MutableArrayData is created with one or more ArrayData and can be used to copy across rows from the source arrays to a destination array. It does this by constructing the following for each of the arrays. These can then be used to copy a range of values from the source array's null mask and data respectively.

type ExtendNullBits<'a> = Box<dyn Fn(&mut _MutableArrayData, usize, usize) + 'a>;
type Extend<'a> = Box<dyn Fn(&mut _MutableArrayData, usize, usize, usize) + 'a>;

It then also constructs

type ExtendNulls = Box<dyn Fn(&mut _MutableArrayData, usize)>;

Which can be used to append null values to the in-progress array.

Users don't call these boxed functions directly, but instead call MutableArrayData::extend or MutableArrayData::extend_nulls which in turn call the appropriate functions.

This works really well for kernels such as concat which call MutableArrayData with large ranges, however, it performs poorly in kernels such as take and filter where the contiguous ranges may be very small.

Edit: The take kernel in fact has custom implementations for each array, likely because using MutableArrayData would be painfully slow, perhaps with this we could unify the implementations 🤔

Describe the solution you'd like

Modify the signatures of these functions to a slice of ranges, and add

MutableArrayData::extend_ranges(&mut self, index: usize, ranges: &[Range<usize>])

This will not only amortise the cost of the extend functions, but will also allow implementations to do more performant gather operations where possible

Additional context

The Filter returned by build_filter and used when filtering a record batch with more than one column, already computes a Vec of ranges - and so this would be effectively free.

@tustvold tustvold added the enhancement Any new improvement worthy of a entry in the changelog label Jan 23, 2022
@tustvold
Copy link
Contributor Author

Linking apache/datafusion#416 and apache/datafusion#1572 as at least historically MutableArrayData was one of the bottlenecks in SortPreservingMerge. In particular the code used to reconstruct a record batch https://github.com/apache/arrow-datafusion/blob/master/datafusion/src/physical_plan/sorts/sort_preserving_merge.rs#L452

FYI @yjshen @alamb

@tustvold
Copy link
Contributor Author

Having thought about this a bit more, filter would likely be better off with specialized impls as it can then elide range checks, etc... I'm going to take a stab at that and see what I can come up with.

I'll leave this ticket open as it may still aid SortPreservingMerge

@tustvold
Copy link
Contributor Author

Closing this as my understanding is we intend to migrate SortPreservingMerge to the row format introduced by @yjshen

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement Any new improvement worthy of a entry in the changelog
Projects
None yet
Development

No branches or pull requests

1 participant