Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Consolidate the N-way merging code and SortPreservingMergeStream (which has quite good tests of what is often quite tricky code, and it will be performance critical) #1572

Closed
Tracked by #1568
alamb opened this issue Jan 15, 2022 · 6 comments
Labels
datafusion Changes in the datafusion crate

Comments

@alamb
Copy link
Contributor

alamb commented Jan 15, 2022

No description provided.

@yjshen
Copy link
Member

yjshen commented Jan 15, 2022

Thanks @alamb for bringing it up.

I propose using heap-sort for N-way merge, but consolidate all the codes we have now in in_mem_sort and SortPreservingMergeStream(SPMS) into SPMS, and remove in_mem_sort.

The main reason to use heap_sort is to minimize the number of comparisons before emitting each record. Since we have "N" partial ordered parts already, we only need to maintain a heap of size "N" at a time. And benefit from the complexity of O(1) to take out of the smallest record, and O(log n) for new item insertion if the last smallest stream or record batch is not exhausted.

For the current sorting methods in SPMS, I think it's to compare all starting records from all streams at once for each time we emit the least. The comparison might not be an issue when N is relatively small. But would deteriorate fast as N grows.

@tustvold, what's your opinion on the merging algorithm to choose?

@yjshen
Copy link
Member

yjshen commented Jan 15, 2022

Also cc @houqp for more minds.

@tustvold
Copy link
Contributor

tustvold commented Jan 15, 2022

I propose using heap-sort for N-way merge
what's your opinion on the merging algorithm to choose?

SGTM 👍. This was in fact mentioned on the original PR that added the operator here and led to the filing of #416.

I would describe the current SortPreservingMergeStream somewhat as an MVP, at the time I did not have a need to support a large number of partitions and so went with simple and stupid. I'm aware of some optimisation efforts since then, but unless I've missed something, nothing major. It would be really cool for this component to get some love, I suspect there is substantial potential for performance improvement both in algorithm and implementation 👍

@alamb
Copy link
Contributor Author

alamb commented Jan 15, 2022

Great! @yjshen added a heap based implementation https://github.com/apache/arrow-datafusion/blob/master/datafusion/src/physical_plan/sorts/in_mem_sort.rs#L60 -- standardizing on that one sounds like a great idea

@yjshen
Copy link
Member

yjshen commented Jan 16, 2022

Thanks @tustvold @alamb. I'll start from here.

@alamb
Copy link
Contributor Author

alamb commented Jan 25, 2022

I think this was fixed in #1596

@alamb alamb closed this as completed Jan 25, 2022
@alamb alamb added the datafusion Changes in the datafusion crate label Feb 10, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
datafusion Changes in the datafusion crate
Projects
None yet
Development

No branches or pull requests

3 participants