[multistage] Add ROWS support for aggregation window functions with ORDER BY within OVER by somandal · Pull Request #11577 · apache/pinot

somandal · 2023-09-12T18:41:44Z

This PR adds support for ROWS type window frames for aggregation window functions with an ORDER BY clause within the OVER(). The default frame type for ORDER BY within OVER() is RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW. This PR adds support for the same default frame boundaries but with ROWS type.

Issue: #11406

Example queries this can support:

SELECT AVG(a.col1) OVER(ORDER BY a.col2 ROWS UNBOUNDED PRECEDING) from table;
SELECT COUNT(a.col3) OVER (PARTITION BY a.col1 ORDER BY a.col2 ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) from table;

This also allows the use of ROW_NUMBER() (which only allows ROWS type frames) with aggregation functions, such as:

SELECT ROW_NUMBER() OVER (PARTITION BY a.col1 ORDER BY a.col2), SUM(a.col3) OVER (PARTITION BY a.col1 ORDER BY a.col2 ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) from table;

The supported functions with this change are:

COUNT
SUM
AVG
MIN
MAX
BOOL_AND
BOOL_OR
ROW_NUMBER (this was already supported, no new changes in this PR for this)

cc @siddharthteotia @walterddr @Jackie-Jiang

…n OVER

codecov-commenter · 2023-09-12T19:22:50Z

Codecov Report

Merging #11577 (60f0b4c) into master (bc0d9fb) will decrease coverage by 0.03%.
The diff coverage is 100.00%.

@@             Coverage Diff              @@
##             master   #11577      +/-   ##
============================================
- Coverage     63.05%   63.03%   -0.03%     
- Complexity     1108     1111       +3     
============================================
  Files          2325     2325              
  Lines        124797   124799       +2     
  Branches      19056    19057       +1     
============================================
- Hits          78693    78667      -26     
- Misses        40508    40535      +27     
- Partials       5596     5597       +1

Flag	Coverage Δ
integration	`<0.01% <0.00%> (ø)`
integration1	`<0.01% <0.00%> (ø)`
integration2	`0.00% <0.00%> (ø)`
java-11	`63.00% <100.00%> (+12.97%)`	⬆️
java-17	`62.90% <100.00%> (-0.02%)`	⬇️
java-20	`62.91% <100.00%> (+0.01%)`	⬆️
temurin	`63.03% <100.00%> (-0.03%)`	⬇️
unittests	`63.03% <100.00%> (-0.03%)`	⬇️
unittests1	`67.44% <100.00%> (-0.04%)`	⬇️
unittests2	`14.48% <0.00%> (+<0.01%)`	⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

Files Changed	Coverage Δ
...uery/runtime/operator/WindowAggregateOperator.java	`96.53% <100.00%> (+0.53%)`	⬆️

... and 11 files with indirect coverage changes

📣 We’re building smart automated test selection to slash your CI/CD build times. Learn more

siddharthteotia · 2023-09-14T06:11:31Z

pinot-query-planner/src/test/resources/queries/WindowFunctionPlans.json

          "\n"
        ]
      },
+      {


Do we already have a test for the scenario where if we use ROWS frame type for an OVER() clause without ORDER BY, it throws syntax error or implicit conversion to RANGE frame type ?

OVER() or OVER() with only PARTITION BY both only take default frame as "ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING/CURRENT ROW". Calcite throws an exception if you modify this to be of RANGE type saying an ORDER BY clause must be present.

Calcite gets a bit odd here though, even though it accepts "ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING/CURRENT ROW", if you open a debugger and check what values we get for the frame clauses you'll see that _isRows is overridden to FALSE by Calcite (meaning it is treating it as RANGE) and the frame boundaries are always set to UNBOUNDED PRECEDING and UNBOUNDED FOLLOWING, irrespective of if you passed CURRENT ROW for the upper limit or not.

So I guess you can say Calcite does implicit RANGE conversion and doesn't even allow a difference in behavior between an upper bound of UNBOUNDED FOLLOWING vs CURRENT ROW if an ORDER BY is not present within the OVER() clause.

Interestingly i had checked PostgreSQL behavior prior to starting this PR and in PostgreSQL they do allow OVER() without ORDER BY to use different upper bounds with ROWS and treats the output differently in that case (will do a rolling aggregation for CURRENT ROW).

Hope that answers your question (though a very long answer 😅 )

Jackie-Jiang

Out of the scope of this PR, can you share how you usually generate the expected result for the query plan test? We have added so many hardcoded plan tests, and whenever we need to add a feature that changes the query plan it is extremely painful because we need to modify thousands of lines of test. #11418 is an example where the actual change might be 20 lines but almost 2000 lines of test change.
We need to either modify the test to only test the part related, or provide a way to regenerate the expected results automatically

somandal · 2023-09-19T15:09:45Z

Out of the scope of this PR, can you share how you usually generate the expected result for the query plan test? We have added so many hardcoded plan tests, and whenever we need to add a feature that changes the query plan it is extremely painful because we need to modify thousands of lines of test. #11418 is an example where the actual change might be 20 lines but almost 2000 lines of test change. We need to either modify the test to only test the part related, or provide a way to regenerate the expected results automatically

I usually let the test run and look at the difference in the plans and verify whether the planner side makes sense. At least during development of planner side code I've found this very useful to catch bugs or find opportunities for improvements. Unfortunately I agree that this is a very tedious process to maintain this long list of plans and keep modifying them for any planner side changes. I don't know of a good way to generate these automatically though (unlike runtime where for most queries H2 can be used). Would it be helpful to reduce duplication in the plans by removing some of the unnecessary ones (e.g. in window functions initially we had tests for single window functions and corresponding tests using multiple window functions - this was good for the initial development to ensure both work but may not be so useful now)?

walterddr · 2023-09-19T16:06:32Z

I usually let the test run and look at the difference in the plans and verify whether the planner side makes sense. At least during development of planner side code I've found this very useful to catch bugs or find opportunities for improvements. Unfortunately I agree that this is a very tedious process to maintain this long list of plans and keep modifying them for any planner side changes. I don't know of a good way to generate these automatically though (unlike runtime where for most queries H2 can be used). Would it be helpful to reduce duplication in the plans by removing some of the unnecessary ones (e.g. in window functions initially we had tests for single window functions and corresponding tests using multiple window functions - this was good for the initial development to ensure both work but may not be so useful now)?

one way is to utilize something similar to https://github.com/apache/calcite/blob/9c758942ab51af22a1095087ea2daeabc7d692ea/core/src/test/resources/org/apache/calcite/test/RelOptRulesTest.xml where

the RulesTest.xml is generated and compared against the reference during build/test time; and
once we validate all the deltas look good, we can directly copy-paste over the changes from the generated xml to the ref xml
create the PR then merged/checked-in the ref --> so there's no hand modification required

thoughts?

somandal · 2025-07-08T21:26:45Z

Closing this as it's a really old PR and we'll mostly need to revisit all this anyways

Add ROWS support for aggregation window functions with ORDER BY withi…

60f0b4c

…n OVER

siddharthteotia reviewed Sep 14, 2023

View reviewed changes

Jackie-Jiang reviewed Sep 17, 2023

View reviewed changes

somandal closed this Jul 8, 2025

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[multistage] Add ROWS support for aggregation window functions with ORDER BY within OVER#11577

[multistage] Add ROWS support for aggregation window functions with ORDER BY within OVER#11577
somandal wants to merge 1 commit intoapache:masterfrom
somandal:rows-support-aggregation-window-functions

somandal commented Sep 12, 2023 •

edited

Loading

Uh oh!

codecov-commenter commented Sep 12, 2023 •

edited

Loading

Uh oh!

siddharthteotia Sep 14, 2023

Uh oh!

somandal Sep 14, 2023 •

edited

Loading

Uh oh!

Jackie-Jiang left a comment

Uh oh!

somandal commented Sep 19, 2023

Uh oh!

walterddr commented Sep 19, 2023

Uh oh!

somandal commented Jul 8, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

5 participants

Conversation

somandal commented Sep 12, 2023 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

codecov-commenter commented Sep 12, 2023 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

siddharthteotia Sep 14, 2023

Choose a reason for hiding this comment

Uh oh!

somandal Sep 14, 2023 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Jackie-Jiang left a comment

Choose a reason for hiding this comment

Uh oh!

somandal commented Sep 19, 2023

Uh oh!

walterddr commented Sep 19, 2023

Uh oh!

somandal commented Jul 8, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

5 participants

somandal commented Sep 12, 2023 •

edited

Loading

codecov-commenter commented Sep 12, 2023 •

edited

Loading

somandal Sep 14, 2023 •

edited

Loading