Pipeline-friendly Bounded Memory Window Executor #4777

mustafasrepo · 2022-12-30T15:42:07Z

Which issue does this PR close?

Improves the situation on #4285.

Rationale for this change

NOTE: The discussion below is a simplified version of the more detailed exposition in the streaming execution proposal.

Unlike how the current implementation works, queries involving window expressions can actually be executed without materializing the entire table in memory when certain conditions are met. These conditions for a bounded implementation are as follows:

In order to run WindowExec with bounded memory (without seeing the whole table), the window frame boundaries of the given window expression must be bounded; i.e. we cannot run queries involving either UNBOUNDED PRECEDING or UNBOUNDED FOLLOWING.
We should be able to produce query results as we scan the table incrementally. For this to be possible, the columns used in the ORDER BY clauses should already be ordered in accordance with the ORDER BY specification. When this condition is met, we can remove the PhysicalSort expression before WindowExec and generate results as we scan the table.
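The two conditions above can be sketched as a small predicate. This is only an illustration in plain Rust, not code from this PR: FrameBound, Frame, and can_run_with_bounded_memory are hypothetical names, and the ordering condition is passed in as a flag because a real planner would derive it from plan metadata rather than from the data itself.

```rust
/// One end of a window frame. `None` offsets stand for
/// UNBOUNDED PRECEDING / UNBOUNDED FOLLOWING. (Hypothetical type.)
#[derive(Debug, Clone, Copy, PartialEq)]
enum FrameBound {
    Preceding(Option<u64>),
    CurrentRow,
    Following(Option<u64>),
}

/// A window frame such as `BETWEEN 1 PRECEDING AND 10 FOLLOWING`.
#[derive(Debug, Clone, Copy, PartialEq)]
struct Frame {
    start: FrameBound,
    end: FrameBound,
}

impl FrameBound {
    /// True for UNBOUNDED PRECEDING and UNBOUNDED FOLLOWING.
    fn is_unbounded(&self) -> bool {
        matches!(self, FrameBound::Preceding(None) | FrameBound::Following(None))
    }
}

/// Condition 1: neither end of the frame may be unbounded.
fn frame_is_bounded(frame: &Frame) -> bool {
    !frame.start.is_unbounded() && !frame.end.is_unbounded()
}

/// Both conditions together. `input_satisfies_order_by` is an assumption
/// supplied by the caller: it says the input already arrives in ORDER BY order.
fn can_run_with_bounded_memory(frame: &Frame, input_satisfies_order_by: bool) -> bool {
    frame_is_bounded(frame) && input_satisfies_order_by
}
```

With this sketch, a frame of 1 PRECEDING to 10 FOLLOWING over pre-sorted input qualifies, while any frame with an UNBOUNDED end, or any unsorted input, does not.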

If the above conditions are met, we can run a query like the one below

SELECT
    SUM(inc_col) OVER(ORDER BY inc_col ASC RANGE BETWEEN 1 PRECEDING AND 10 FOLLOWING)
FROM annotated_data

with a bounded memory algorithm. Consider the physical plan of the above query:

+---------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| plan_type     | plan                                                                                                                                                                                        |
+---------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+                                                                                                                                       |
| physical_plan | ProjectionExec: expr=[SUM(annotated_data.inc_col) ORDER BY [annotated_data.inc_col ASC NULLS LAST] RANGE BETWEEN 1 PRECEDING AND 10 FOLLOWING@0 as SUM(annotated_data.inc_col)]             |
|               |   WindowAggExec: wdw=[SUM(annotated_data.inc_col): Ok(Field { name: "SUM(annotated_data.inc_col)", data_type: Int64, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: None })] |
|               |     SortExec: [inc_col@0 ASC NULLS LAST]                                                                                                                                                    |
|               |       MemoryExec: partitions=1, partition_sizes=[51]                                                                                                                                        |
|               |                                                                                                                                                                                             |
+---------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+

If we know that the column inc_col in the table is monotonically increasing, we can deduce that the SortExec: [inc_col@0 ASC NULLS LAST] step in the physical plan is unnecessary. Hence, we can remove this step from the physical plan (see #4691). Furthermore, we also know that the frame RANGE BETWEEN 1 PRECEDING AND 10 FOLLOWING describes a bounded range. Therefore, we can turn the above physical plan into the one below:

+---------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| plan_type     | plan                                                                                                                                                                                              |
+---------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+                                                                                                                                             |
| physical_plan | ProjectionExec: expr=[SUM(annotated_data.inc_col) ORDER BY [annotated_data.inc_col ASC NULLS LAST] RANGE BETWEEN 1 PRECEDING AND 10 FOLLOWING@0 as SUM(annotated_data.inc_col)]                   |
|               |   BoundedWindowAggExec: wdw=[SUM(annotated_data.inc_col): Ok(Field { name: "SUM(annotated_data.inc_col)", data_type: Int64, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: None })] |
|               |     MemoryExec: partitions=1, partition_sizes=[51]                                                                                                                                                |
|               |                                                                                                                                                                                                   |
+---------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
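To illustrate how a bounded RANGE frame over already-sorted input can be evaluated incrementally, here is a minimal sketch in plain Rust. It is not the BoundedWindowAggExec code from this PR; BoundedRangeSum, push_batch, and finish are hypothetical names. It assumes a single ascending i64 column that is both the ORDER BY key and the summed column (like inc_col above), non-negative offsets, no NULLs, and no overflow.

```rust
use std::collections::VecDeque;

/// Streaming `SUM(x) OVER (ORDER BY x RANGE BETWEEN p PRECEDING AND f FOLLOWING)`
/// over input that is already sorted ascending. (Hypothetical sketch.)
struct BoundedRangeSum {
    preceding: i64,
    following: i64,
    /// Rows that are still needed: pending rows plus rows inside their frames.
    buffer: VecDeque<i64>,
    /// Index into `buffer` of the first row that has no result yet.
    pending: usize,
}

impl BoundedRangeSum {
    fn new(preceding: i64, following: i64) -> Self {
        Self { preceding, following, buffer: VecDeque::new(), pending: 0 }
    }

    /// Feed one batch and return the results that became final.
    fn push_batch(&mut self, batch: &[i64]) -> Vec<i64> {
        self.buffer.extend(batch.iter().copied());
        self.emit(false)
    }

    /// Signal end of input and return the remaining results.
    fn finish(&mut self) -> Vec<i64> {
        self.emit(true)
    }

    fn emit(&mut self, input_done: bool) -> Vec<i64> {
        let mut out = Vec::new();
        while self.pending < self.buffer.len() {
            let current = self.buffer[self.pending];
            let (lo, hi) = (current - self.preceding, current + self.following);
            // The frame is final once a buffered value lies beyond its upper
            // end: the input is sorted, so no later row can fall inside it.
            let newest = self.buffer[self.buffer.len() - 1];
            if !input_done && newest <= hi {
                break;
            }
            // Rows below the lower end are never needed again, because later
            // frames start at or after this one.
            while self.buffer.front().map_or(false, |&v| v < lo) {
                self.buffer.pop_front();
                self.pending -= 1;
            }
            let sum = self.buffer.iter().take_while(|&&v| v <= hi).sum::<i64>();
            out.push(sum);
            self.pending += 1;
        }
        out
    }
}
```

Feeding the batches [1, 2, 3], [5, 8], [] and [13, 21] with 1 PRECEDING and 2 FOLLOWING, then calling finish, gives the same sums as evaluating over the whole column at once, while the buffer only ever holds the rows of the open frames plus the most recent batch. The sketch recomputes each sum from the buffer for clarity; keeping a running sum would avoid that cost.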

Performance Indicators

We analyzed the new Bounded Memory Window Executor and compared it with the existing implementation via

Benchmarking (criterion) for CPU-time analysis, and
Memory profiling the binary executable (Heaptrack).

Test conditions for both executors are as follows:

No partition
Single query (SUM(x) OVER (ORDER BY a RANGE BETWEEN 10 PRECEDING AND 10 FOLLOWING))
No sorting is included since the input data is already sorted.
The input is generated as a RecordBatch stream. The batches have varying sizes in the range of 0 to 50.

NOTE: We did not include the benchmarking code in this PR.

Benchmarking

We measure the execution duration of each operator. The input size is 100_000.

Average execution time for WindowAggExec: 226.72 ms
Average execution time for BoundedWindowAggExec: 154.71 ms

which shows that overall performance improves. This is due to searching RANGE boundaries in a smaller batch since we maintain a bounded state.

Heaptrack

We used a simple test case for memory consumption; the input size is 1_000_000.

WindowAggExec Memory Profiling

peak heap memory consumption: 161.9 MB after 1min23s
peak RSS (including heaptrack overhead): 202.7 MB

BoundedWindowAggExec Profiling

peak heap memory consumption: 78.6 MB after 08.633s
peak RSS (including heaptrack overhead): 115.4 MB

These findings support the claim that the sliding window approach is memory efficient: peak heap consumption is roughly halved.

What changes are included in this PR?

This PR includes a bounded memory variant of the already-existing WindowAggExec. We add a rule to choose between WindowAggExec and BoundedWindowAggExec. The strategy is as follows: if window_expr can generate its result without seeing the whole table, we choose BoundedWindowAggExec; otherwise, we choose WindowAggExec. Please note that it is possible (but certainly not trivial) to unify these executors. However, we left this as future work.

In this implementation, we also added bounded execution support for COUNT, SUM, MIN, MAX among AggregateFunctions. Among BuiltInWindowFunctions, we added support for ROW_NUMBER, RANK, DENSE_RANK, LAG, LEAD, FIRST_VALUE, LAST_VALUE, NTH_VALUE. If the window function used is not one of these, we fall back to WindowAggExec.
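The selection strategy described above can be sketched roughly as follows. This is a simplified illustration, not the rule added in this PR: choose_executor, BOUNDED_FUNCTIONS, and the boolean inputs are hypothetical, and the actual rule inspects the window expressions themselves rather than function names.

```rust
/// The two executors the rule chooses between.
#[derive(Debug, PartialEq)]
enum WindowExecutor {
    WindowAggExec,
    BoundedWindowAggExec,
}

/// Functions listed in the PR description as supporting bounded execution.
const BOUNDED_FUNCTIONS: [&str; 12] = [
    "COUNT", "SUM", "MIN", "MAX",
    "ROW_NUMBER", "RANK", "DENSE_RANK", "LAG", "LEAD",
    "FIRST_VALUE", "LAST_VALUE", "NTH_VALUE",
];

/// Pick the bounded executor only when the function is supported, the frame
/// is bounded, and the input already arrives in the required order; otherwise
/// fall back to the executor that materializes the whole partition.
/// (Hypothetical signature; the flags are assumptions supplied by the caller.)
fn choose_executor(
    function_name: &str,
    frame_is_bounded: bool,
    input_is_ordered: bool,
) -> WindowExecutor {
    let supported = BOUNDED_FUNCTIONS
        .iter()
        .any(|name| name.eq_ignore_ascii_case(function_name));
    if supported && frame_is_bounded && input_is_ordered {
        WindowExecutor::BoundedWindowAggExec
    } else {
        WindowExecutor::WindowAggExec
    }
}
```

For example, choose_executor("SUM", true, true) picks BoundedWindowAggExec, while an unsupported function such as NTILE, an unbounded frame, or unsorted input falls back to WindowAggExec.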

Are these changes tested?

We have added fuzz tests comparing the results of WindowAggExec and BoundedWindowAggExec. We also added SQL tests that run on an already sorted Parquet file. Approximately 700 lines of the changes come from tests and test utilities.
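The idea behind such a comparison test can be sketched as below. This is not the test code from this PR: it uses plain Rust with a tiny hand-rolled generator instead of the real executors, and all names (Lcg, range_sum_full, range_sum_batched, fuzz_once) are hypothetical. It generates sorted data split into batches of 0 to 50 rows, evaluates a RANGE sum once over the whole column and once batch by batch, and checks that the results agree.

```rust
/// Tiny deterministic pseudo-random generator so the sketch needs no crates.
struct Lcg(u64);

impl Lcg {
    /// Returns a value in `0..bound` (`bound` must be non-zero).
    fn next_below(&mut self, bound: u64) -> u64 {
        self.0 = self.0.wrapping_mul(6364136223846793005).wrapping_add(1442695040888963407);
        (self.0 >> 33) % bound
    }
}

/// Reference result: sees the whole sorted column at once.
fn range_sum_full(values: &[i64], preceding: i64, following: i64) -> Vec<i64> {
    values
        .iter()
        .map(|&cur| {
            values
                .iter()
                .filter(|&&v| v >= cur - preceding && v <= cur + following)
                .sum::<i64>()
        })
        .collect()
}

/// Incremental result: consumes batches and keeps only rows still needed.
fn range_sum_batched(batches: &[Vec<i64>], preceding: i64, following: i64) -> Vec<i64> {
    let mut buffer: Vec<i64> = Vec::new();
    let mut pending = 0usize; // first buffered row without a result
    let mut out = Vec::new();
    for (i, batch) in batches.iter().enumerate() {
        buffer.extend_from_slice(batch);
        let is_last = i + 1 == batches.len();
        while pending < buffer.len() {
            let cur = buffer[pending];
            let newest = buffer[buffer.len() - 1];
            // Wait until a value beyond the frame's upper end has arrived.
            if !is_last && newest <= cur + following {
                break;
            }
            let sum = buffer
                .iter()
                .filter(|&&v| v >= cur - preceding && v <= cur + following)
                .sum::<i64>();
            out.push(sum);
            pending += 1;
        }
        // Drop rows that no pending frame can reach any more.
        if let Some(&next) = buffer.get(pending) {
            let keep_from = buffer.partition_point(|&v| v < next - preceding);
            buffer.drain(..keep_from);
            pending -= keep_from;
        }
    }
    out
}

/// One fuzz round: random sorted input, random batch sizes, random frame.
fn fuzz_once(seed: u64) -> bool {
    let mut rng = Lcg(seed);
    let mut batches = Vec::new();
    let mut value = 0i64;
    for _ in 0..20 {
        let size = rng.next_below(51); // batch sizes in 0..=50
        let batch: Vec<i64> = (0..size)
            .map(|_| {
                value += rng.next_below(4) as i64; // non-decreasing values
                value
            })
            .collect();
        batches.push(batch);
    }
    let all: Vec<i64> = batches.concat();
    let (p, f) = (rng.next_below(10) as i64, rng.next_below(10) as i64);
    range_sum_full(&all, p, f) == range_sum_batched(&batches, p, f)
}
```

Running fuzz_once over a range of seeds exercises empty batches, duplicate values, and different frame widths; any disagreement between the two functions would point to a bug in the incremental path.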

Are there any user-facing changes?

None.


ozankabak · 2022-12-30T16:03:04Z

I am very happy that we are finally sending this upstream. We have been working on making windows pipeline-friendly for a while now, and this provides a great foundation for that. There are a few things we will improve with follow-on PRs (e.g. extending this to support GROUPS mode), but the core functionality is already there.

Looking forward to receiving feedback!

alamb · 2022-12-31T12:41:46Z

Thank you for this PR -- I will try and review it later this weekend but I may not have time until early next week

alamb · 2023-01-03T14:24:39Z

Starting to check this out

alamb

Thank you @mustafasrepo

I went through this PR as carefully as I could given its size. Sorry for the delay in review -- finding the contiguous uninterrupted time to review has been hard. I had some suggestions but overall I think it could also be merged as is. As in your past PRs I found it well commented, well tested, and overall a pleasure to read.

I did not review all the logic in the window functions but I did review the tests carefully as well as skimmed all the code.

> which shows that overall performance improves. This is due to searching RANGE boundaries in a smaller batch since we maintain a bounded state.

Very impressive benchmark results

.github/workflows/rust.yml

datafusion/core/Cargo.toml

datafusion/core/src/execution/context.rs

datafusion/core/src/physical_optimizer/pipeline_checker.rs

datafusion/core/src/physical_plan/common.rs

datafusion/physical-expr/src/window/partition_evaluator.rs

datafusion/physical-expr/src/window/rank.rs

datafusion/physical-expr/src/window/built_in.rs

datafusion/core/src/physical_plan/windows/bounded_window_agg_exec.rs

ozankabak · 2023-01-03T17:23:55Z

Thank you for the detailed review! We will go through your reviews and let you know when we are done so you can merge.

ozankabak · 2023-01-04T19:33:15Z

@alamb, we just did our final review with @mustafasrepo and this is good to go. Feel free to merge after CI passes 🚀

alamb · 2023-01-04T19:58:36Z

> @alamb, we just did our final review with @mustafasrepo and this is good to go. Feel free to merge after CI passes 🚀

Will do -- thank you @mustafasrepo and @ozankabak

ursabot · 2023-01-04T20:31:57Z

Benchmark runs are scheduled for baseline = e1dc962 and contender = 80abc94. 80abc94 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Skipped ⚠️ Benchmarking of arrow-datafusion-commits is not supported on ec2-t3-xlarge-us-east-2] ec2-t3-xlarge-us-east-2
[Skipped ⚠️ Benchmarking of arrow-datafusion-commits is not supported on test-mac-arm] test-mac-arm
[Skipped ⚠️ Benchmarking of arrow-datafusion-commits is not supported on ursa-i9-9960x] ursa-i9-9960x
[Skipped ⚠️ Benchmarking of arrow-datafusion-commits is not supported on ursa-thinkcentre-m75q] ursa-thinkcentre-m75q
Buildkite builds:
Supported benchmarks:
ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
test-mac-arm: Supported benchmark langs: C++, Python, R
ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java

mustafasrepo and others added 30 commits December 13, 2022 16:38

Sort Removal rule initial commit

56db313

move ordering satisfy to the util

343fafb

update test and change repartition maintain_input_order impl

dfb6683

simplifications

0a42315

partition by refactor (#28)

c2a1593

* partition by refactor * minor changes * Unnecessary tuple to Range conversion is removed * move transpose under common

Add naive sort removal rule

bf7bd11

Add todo for finer Sort removal handling

4cb7258

Merge branch 'apache:master' into feature/sort_removal_rule

dbc30ab

Refactors to improve readability and reduce nesting

aa4f739

reverse expr returns Option (no need for support check)

6309b01

Merge branch 'master' into feature/sort_removal_rule

d0d06de

fix tests

91629b8

partition by and order by no longer ends up at the same window group

ae451a4

Bounded window exec

94c784b

Merge branch 'feature/sort_removal_rule' into feature/bounded_window_…

7c4bcb9

…exec

solve merge problems

0068566

Refactor to simplify code

0e73945

Better comments, change method names

4f145dd

Merge branch 'feature/sort_removal_rule' into feature/bounded_window_…

c63057f

…exec

resolve merge conflicts

838972c

Merge branch 'apache:master' into feature/sort_removal_rule

6d9a876

Merge branch 'apache:master' into feature/bounded_window_exec

f2c7286

Resolve errors introduced by syncing

6b07621

Merge branch 'feature/sort_removal_rule' into feature/bounded_window_…

d62bbdc

…exec

remove set_state, make ntile debuggable

a2d2229

remove locked flag

63d77a6

address reviews

ba388cb

address reviews

572a1a4

Merge branch 'feature/sort_removal_rule' into feature/bounded_window_…

fa30d91

…exec

mustafasrepo and others added 4 commits December 29, 2022 18:17

rename some members

9ceb137

Move rule to physical planning

8b9aa6f

Minor stylistic/comment changes

e13d6e0

Merge branch 'master' into feature/bounded_window_exec

93b8d80

github-actions bot added core Core DataFusion crate logical-expr Logical plan and expressions physical-expr Physical Expressions labels Dec 30, 2022

mustafasrepo commented Dec 30, 2022

ozankabak added 2 commits December 31, 2022 10:52

Simplify batch-merging utility functions

d97a1ad

Remove unnecessary clones, simplify code

29007ea

github-actions bot removed the logical-expr Logical plan and expressions label Jan 2, 2023

Merge branch 'master' into feature/bounded_window_exec

a5019c3

alamb approved these changes Jan 3, 2023

update cargo lock file

ac2f248

mustafasrepo and others added 5 commits January 4, 2023 11:29

address reviews

0ca3889

Merge branch 'master' into feature/bounded_window_exec

516e512

update comments

1e764dd

resolve linter error

28d68bb

Tidy up comments after final review

c4b61c5

This was referenced Jan 4, 2023

Allow concat_batches to take non owned RecordBatch apache/arrow-rs#3456

Closed

Minor: Add link to upstream arrow-rs ticket #4824

Closed

alamb merged commit 80abc94 into apache:master Jan 4, 2023

mustafasrepo deleted the feature/bounded_window_exec branch January 10, 2023 11:32
