Optimize sort preserving merge #416

Closed
1 of 2 tasks
alamb opened this issue May 24, 2021 · 2 comments
Labels
datafusion Changes in the datafusion crate enhancement New feature or request

Comments

@alamb
Contributor

alamb commented May 24, 2021

Is your feature request related to a problem or challenge? Please describe what you are trying to do.
The new sort preserving merge operator, introduced in #379, likely has room for performance improvement.

Describe the solution you'd like

  1. Create a benchmark for the merging operator
  2. Optimize the operator as the benchmark results suggest

Here is a suggestion from @jhorstmann (https://github.com/apache/arrow-datafusion/pull/379/files#r637948151), recorded here as a separate ticket so it doesn't get lost:

For a larger number of partitions, storing the cursors in a BinaryHeap, sorted by their current item, would be beneficial.

A Rust implementation of that approach can be seen in this blog post and the first comment under it. I have implemented the same approach in Java before. I agree with @alamb, though, that we should make it work first and optimize later.
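
For readers who want to see the shape of the heap-based idea, here is a minimal sketch. It is not the code from #379 or from the blog post: the function name `merge_sorted`, the use of plain `Vec<i32>` inputs in place of record-batch cursors, and the `(value, input index, offset)` tuple layout are all assumptions made for illustration.

```rust
use std::cmp::Reverse;
use std::collections::BinaryHeap;

/// Merge already-sorted inputs into one sorted output (illustrative only).
/// Each heap entry is (current value, input index, offset within that input),
/// which plays the role of a "cursor" ordered by its current item.
fn merge_sorted(inputs: &[Vec<i32>]) -> Vec<i32> {
    // `BinaryHeap` is a max-heap, so wrap entries in `Reverse` to pop the minimum.
    let mut heap = BinaryHeap::with_capacity(inputs.len());
    for (idx, input) in inputs.iter().enumerate() {
        if let Some(&first) = input.first() {
            heap.push(Reverse((first, idx, 0usize)));
        }
    }

    let mut out = Vec::with_capacity(inputs.iter().map(Vec::len).sum());
    while let Some(Reverse((value, idx, pos))) = heap.pop() {
        out.push(value);
        // Advance the cursor that just produced a value and re-insert it.
        if let Some(&next) = inputs[idx].get(pos + 1) {
            heap.push(Reverse((next, idx, pos + 1)));
        }
    }
    out
}
```

With k inputs, each output row costs O(log k) heap work instead of a linear scan over all cursors, which is where the benefit for many partitions would come from. Ties between equal values are broken by input index in this sketch.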

@alamb alamb added enhancement New feature or request datafusion Changes in the datafusion crate labels May 24, 2021
@jorgecarleitao
Member

Also for inspiration: https://github.com/jorgecarleitao/arrow2/blob/main/src/compute/merge_sort/mod.rs

@tustvold
Contributor

Closing this ticket as I believe it is not tracking anything anymore.

SortPreservingMerge is now implemented as an n-way tournament tree, making use of an order-preserving row encoding for multi-column sorts and specialized cursors for single-column sorts. I'm not aware of any major low-hanging fruit to make it run faster.
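
To illustrate what "order-preserving row encoding" means in general, here is a simplified sketch. It is not DataFusion's actual row format, which also has to deal with nulls, variable-length values, and sort options; the function `encode_row` and the fixed `(i64, u32)` column layout are hypothetical.

```rust
/// Encode an (i64, u32) row so that comparing the byte arrays lexicographically
/// gives the same order as comparing the tuples (illustrative only).
fn encode_row(a: i64, b: u32) -> [u8; 12] {
    let mut out = [0u8; 12];
    // Flipping the sign bit maps i64::MIN..=i64::MAX onto 0..=u64::MAX in order.
    let a_bits = (a as u64) ^ (1u64 << 63);
    // Big-endian puts the most significant byte first, so byte order matches
    // numeric order.
    out[..8].copy_from_slice(&a_bits.to_be_bytes());
    out[8..].copy_from_slice(&b.to_be_bytes());
    out
}
```

Once every row is encoded this way, a multi-column comparison becomes a single byte-wise comparison, so the merge does not need to dispatch on column types for each row.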
