
Sort preserving merge (#362) #379

Merged 6 commits on Jun 1, 2021

Conversation

@tustvold (Contributor) commented May 21, 2021

Closes #362.

Creating as a draft because this currently builds on top of #378, since it uses a partitioned SortExec as part of its tests.

This PR adds a SortPreservingMergeExec operator that allows merging together multiple sorted partitions into a single partition.

The main implementation is contained within SortPreservingMergeStream and SortKeyCursor:

SortKeyCursor provides the ability to compare the sort keys of the next row that could be yielded for each stream, in order to determine which one to yield.

SortPreservingMergeStream maintains a list of SortKeyCursor, one per stream, and builds up a list of sorted indices identifying rows within these cursors. When it reads the last row of a RecordBatch, it fetches another from the input. Once it has accumulated `target_batch_size` row indices (or exhausted all input streams) it combines the relevant rows from the buffered RecordBatches into a single RecordBatch, drops any cursors it no longer needs, and yields the batch.
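
A minimal, self-contained sketch of this merge strategy, using plain sorted Vec<i32> partitions in place of RecordBatch streams (only `target_batch_size` is taken from the description above; everything else is illustrative rather than the PR's actual code):

```rust
/// Merge already-sorted partitions into sorted output batches of at most
/// `target_batch_size` rows, always taking the smallest available value next.
fn sort_preserving_merge(partitions: Vec<Vec<i32>>, target_batch_size: usize) -> Vec<Vec<i32>> {
    // One cursor (current index) per input partition.
    let mut cursors = vec![0usize; partitions.len()];
    let mut output = Vec::new();
    let mut batch = Vec::with_capacity(target_batch_size);

    loop {
        // Pick the partition whose next unconsumed value is smallest.
        let next = partitions
            .iter()
            .enumerate()
            .filter(|(i, p)| cursors[*i] < p.len())
            .min_by_key(|(i, p)| p[cursors[*i]]);

        match next {
            Some((i, p)) => {
                batch.push(p[cursors[i]]);
                cursors[i] += 1;
                // Yield a batch once target_batch_size rows have accumulated.
                if batch.len() == target_batch_size {
                    output.push(std::mem::take(&mut batch));
                }
            }
            // All inputs are exhausted: flush any remaining rows and finish.
            None => {
                if !batch.is_empty() {
                    output.push(batch);
                }
                return output;
            }
        }
    }
}

fn main() {
    let merged = sort_preserving_merge(vec![vec![1, 4, 7], vec![2, 3, 9], vec![]], 4);
    assert_eq!(merged, vec![vec![1, 2, 3, 4], vec![7, 9]]);
}
```

The real operator does the equivalent over Arrow arrays: it accumulates row indices into the buffered RecordBatches rather than copying values, and only materializes an output RecordBatch when a batch is emitted.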

@@ -99,11 +99,11 @@ impl ExecutionPlan for SortExec {

/// Get the output partitioning of this plan
fn output_partitioning(&self) -> Partitioning {
-        Partitioning::UnknownPartitioning(1)
+        self.input.output_partitioning()
Comment from the PR author (@tustvold):

This is the change from #377

Review from @alamb (Contributor):

Thank you @tustvold -- I think this is the last missing physical operator we need in DataFusion to start enabling sort-based optimizations (e.g. sort-merge join, etc.)

I think this is pretty amazing work -- I am sure there will be more work to optimize this, but I like the overall structure and I think it is looking very cool.

I think we should let at least one other pair of eyes read it carefully so I will hold off on clicking approve until that happens. But from what I can see at this point, this PR is basically ready to go

@@ -113,3 +118,29 @@ fn build_file_list_recurse(
}
Ok(())
}

/// Spawns a task to the tokio threadpool and writes its outputs to the provided mpsc sender
pub(crate) fn spawn_execution(
Reviewer comment:

this is a nice abstraction (and we can probably use it elsewhere)
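
The general shape of such a helper, sketched with generic stream and item types rather than DataFusion's actual `SendableRecordBatchStream` and error handling (so the code below is illustrative, not the PR's implementation):

```rust
use futures::{Stream, StreamExt};
use tokio::sync::mpsc::Sender;
use tokio::task::JoinHandle;

/// Drive a stream on the tokio threadpool, forwarding each item to an mpsc
/// sender so a consumer can merge several such channels.
fn spawn_forwarding_task<T, S>(mut stream: S, sender: Sender<T>) -> JoinHandle<()>
where
    T: Send + 'static,
    S: Stream<Item = T> + Unpin + Send + 'static,
{
    tokio::spawn(async move {
        while let Some(item) = stream.next().await {
            // Stop early if the receiving side has been dropped.
            if sender.send(item).await.is_err() {
                break;
            }
        }
    })
}
```

This must be called from within a tokio runtime; the returned JoinHandle lets the caller observe a panicked or cancelled producer task.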

Partitioning::UnknownPartitioning(1)
}

fn required_child_distribution(&self) -> Distribution {
Reviewer comment:

eventually (not as part of this PR) we should add something like required_child_sort_order so the operators can report on what sortedness they are assuming.
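
A purely hypothetical sketch of what that could look like; neither this trait nor `required_child_sort_order` exists in DataFusion as of this PR, and `SortColumn` stands in for whatever sort-expression type would actually be used:

```rust
/// Stand-in for a physical sort expression (column name plus direction).
pub struct SortColumn {
    pub name: String,
    pub descending: bool,
}

/// Hypothetical extension mirroring `required_child_distribution`: operators
/// could report the sort order they assume of each child's output.
pub trait SortAwareExec {
    fn required_child_sort_order(&self) -> Option<Vec<SortColumn>> {
        // Default: no particular sort order is required.
        None
    }
}
```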

(true, false) => return Ok(Ordering::Less),
(false, false) => {}
(true, true) => {
// TODO: Building the predicate each time is sub-optimal
Reviewer comment:

I predict this line will be the bottleneck of this operator.

However, I feel like getting it in and working, and then optimizing as a follow-on, is the correct course of action in this case.
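
One possible shape for that follow-on optimization (an assumption here, not this PR's plan) is to build the per-column comparators once per pair of batches using arrow's `build_compare` / `DynComparator` and reuse them for every row comparison; note this sketch ignores SortOptions such as descending order and nulls_first:

```rust
use std::cmp::Ordering;

use arrow::array::{build_compare, ArrayRef, DynComparator};
use arrow::error::Result;

/// Illustrative only: comparators for one (left batch, right batch) pair of
/// sort-key columns, built once instead of on every row comparison.
struct BatchComparators {
    comparators: Vec<DynComparator>,
}

impl BatchComparators {
    fn try_new(left_keys: &[ArrayRef], right_keys: &[ArrayRef]) -> Result<Self> {
        let comparators = left_keys
            .iter()
            .zip(right_keys)
            .map(|(l, r)| build_compare(l.as_ref(), r.as_ref()))
            .collect::<Result<Vec<_>>>()?;
        Ok(Self { comparators })
    }

    /// Compare row `l_row` of the left keys with row `r_row` of the right keys,
    /// column by column, without rebuilding anything per row.
    fn compare(&self, l_row: usize, r_row: usize) -> Ordering {
        for cmp in &self.comparators {
            match cmp(l_row, r_row) {
                Ordering::Equal => continue,
                other => return other,
            }
        }
        Ordering::Equal
    }
}
```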

"+---+---+-------------------------------+",
"| 1 | | 1970-01-01 00:00:00.000000008 |",
"| 1 | | 1970-01-01 00:00:00.000000008 |",
"| 2 | a | |",
Reviewer comment:

In order to cover the nulls_first: false case for "c", I think you need several rows here with a tie for a and b, and both a null and a non-null value for c. I didn't see any such case (though I may have missed it).

Perhaps adding a row like the following would be enough

                "| 7 | b | NULL |",

Reply from the PR author (@tustvold):

The sort key is just b and c, so don't the lines

"| 7 | b | 1970-01-01 00:00:00.000000006 |",
"| 2 | b |                               |",

test this?

assert_eq!(basic, partition);
}

// Split the provided record batch into multiple batch_size record batches
Reviewer comment:

This might be a function that we could add to RecordBatch itself? I can file a ticket to do so if you would like
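
If it does move into arrow, it would presumably look something like this sketch built on `RecordBatch::slice` (which is zero-copy); the helper name here is made up and the PR's test utility may differ:

```rust
use arrow::record_batch::RecordBatch;

/// Split `batch` into chunks of at most `batch_size` rows.
fn split_batch(batch: &RecordBatch, batch_size: usize) -> Vec<RecordBatch> {
    let mut batches = Vec::new();
    let mut offset = 0;
    while offset < batch.num_rows() {
        let length = batch_size.min(batch.num_rows() - offset);
        // `slice` shares the underlying buffers, so no data is copied.
        batches.push(batch.slice(offset, length));
        offset += length;
    }
    batches
}
```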

}

#[tokio::test]
async fn test_partition_sort_streaming_input_output() {
Reviewer comment:

I think this test covers the case where each input stream has more than one RecordBatch, right? (Each input partition has three record batches.)

Is there any value to another test that has input streams with differing numbers of input batches (I am thinking of an input with 3 partitions: 0 record batches, 1 record batch, and "many" (aka 2 or 3))?

@codecov-commenter commented May 24, 2021

Codecov Report

Merging #379 (7d3dbc5) into master (3593d1f) will increase coverage by 0.54%.
The diff coverage is 81.62%.

@@            Coverage Diff             @@
##           master     #379      +/-   ##
==========================================
+ Coverage   74.85%   75.39%   +0.54%     
==========================================
  Files         146      148       +2     
  Lines       24565    25242     +677     
==========================================
+ Hits        18387    19031     +644     
- Misses       6178     6211      +33     
Impacted Files Coverage Δ
datafusion/src/physical_plan/mod.rs 78.70% <ø> (-4.06%) ⬇️
datafusion/src/physical_plan/common.rs 84.21% <77.77%> (-2.00%) ⬇️
...afusion/src/physical_plan/sort_preserving_merge.rs 81.66% <81.66%> (ø)
datafusion/src/physical_plan/merge.rs 75.00% <100.00%> (+0.71%) ⬆️
datafusion/src/physical_plan/window_functions.rs 85.71% <0.00%> (-3.01%) ⬇️
datafusion/src/scalar.rs 56.19% <0.00%> (-2.13%) ⬇️
ballista/rust/client/src/context.rs 0.00% <0.00%> (ø)
datafusion/src/physical_plan/expressions/mod.rs 71.42% <0.00%> (ø)
...fusion/src/physical_plan/expressions/row_number.rs 81.25% <0.00%> (ø)
... and 17 more

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data

/// if all cursors for all streams are exhausted
fn next_stream_idx(&self) -> Result<Option<usize>> {
let mut min_cursor: Option<(usize, &SortKeyCursor)> = None;
for (idx, candidate) in self.cursors.iter().enumerate() {
Comment from @jhorstmann (Contributor):

For a bigger number of partitions, storing the cursors in a BinaryHeap, sorted by their current item, would be beneficial.

A Rust implementation of that approach can be seen in this blog post and the first comment under it. I have implemented the same approach in Java before. I agree with @alamb, though, to make it work first and then optimize later.
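
As a rough illustration of that approach (using plain sorted Vec<i32> inputs rather than the SortKeyCursor type, so this is not the PR's code), a min-heap keyed on each cursor's current value turns the per-row selection from a linear scan into an O(log n) pop and push:

```rust
use std::cmp::Reverse;
use std::collections::BinaryHeap;

fn heap_merge(partitions: Vec<Vec<i32>>) -> Vec<i32> {
    // Min-heap of (current value, partition index, position); ties are broken
    // by partition index via the tuple ordering.
    let mut heap = BinaryHeap::new();
    for (idx, p) in partitions.iter().enumerate() {
        if let Some(&value) = p.first() {
            heap.push(Reverse((value, idx, 0usize)));
        }
    }

    let mut output = Vec::new();
    while let Some(Reverse((value, idx, pos))) = heap.pop() {
        output.push(value);
        // Re-insert this partition's cursor, advanced by one row.
        if let Some(&next) = partitions[idx].get(pos + 1) {
            heap.push(Reverse((next, idx, pos + 1)));
        }
    }
    output
}

fn main() {
    assert_eq!(heap_merge(vec![vec![1, 4], vec![2, 3], vec![]]), vec![1, 2, 3, 4]);
}
```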

Reviewer comment:

great suggestion @jhorstmann -- thank you -- I filed #416 so it is more visible

@alamb mentioned this pull request May 24, 2021
@tustvold marked this pull request as ready for review May 26, 2021
@tustvold (author):

Will rebase to remove merges

@alamb (Contributor) commented May 26, 2021

This PR appears to need some rebasing / test fixing love:

https://github.com/apache/arrow-datafusion/pull/379/checks?check_run_id=2674096854



---- physical_plan::sort_preserving_merge::tests::test_partition_sort stdout ----
thread 'physical_plan::sort_preserving_merge::tests::test_partition_sort' panicked at 'called `Result::unwrap()` on an `Err` value: Internal("SortExec requires a single input partition")', datafusion/src/physical_plan/sort_preserving_merge.rs:627:47
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace


failures:

@tustvold (author):

Apologies - I stripped out the merge that fixed the logical conflict 🤦

Pushed a commit that fixes it 😄

Review from @alamb (Contributor):

I think this PR is ready -- thanks again @tustvold

What do you think @Dandandan / @andygrove ? Any objections to merging this (as a step towards a more sorted future in DataFusion)?

@alamb (Contributor) commented Jun 1, 2021

I just fixed a merge conflict -- if the tests pass I plan to merge this PR in

Labels: datafusion (Changes in the datafusion crate), enhancement (New feature or request)

Linked issue: Add an Order Preserving merge operator (#362)

5 participants