[multistage] Add ROWS support for aggregation window functions with ORDER BY within OVER#11577
[multistage] Add ROWS support for aggregation window functions with ORDER BY within OVER#11577somandal wants to merge 1 commit intoapache:masterfrom
Conversation
Codecov Report
@@ Coverage Diff @@
## master #11577 +/- ##
============================================
- Coverage 63.05% 63.03% -0.03%
- Complexity 1108 1111 +3
============================================
Files 2325 2325
Lines 124797 124799 +2
Branches 19056 19057 +1
============================================
- Hits 78693 78667 -26
- Misses 40508 40535 +27
- Partials 5596 5597 +1
Flags with carried forward coverage won't be shown. Click here to find out more.
... and 11 files with indirect coverage changes 📣 We’re building smart automated test selection to slash your CI/CD build times. Learn more |
| "\n" | ||
| ] | ||
| }, | ||
| { |
There was a problem hiding this comment.
Do we already have a test for the scenario where if we use ROWS frame type for an OVER() clause without ORDER BY, it throws syntax error or implicit conversion to RANGE frame type ?
There was a problem hiding this comment.
OVER() or OVER() with only PARTITION BY both only take default frame as "ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING/CURRENT ROW". Calcite throws an exception if you modify this to be of RANGE type saying an ORDER BY clause must be present.
Calcite gets a bit odd here though, even though it accepts "ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING/CURRENT ROW", if you open a debugger and check what values we get for the frame clauses you'll see that _isRows is overridden to FALSE by Calcite (meaning it is treating it as RANGE) and the frame boundaries are always set to UNBOUNDED PRECEDING and UNBOUNDED FOLLOWING, irrespective of if you passed CURRENT ROW for the upper limit or not.
So I guess you can say Calcite does implicit RANGE conversion and doesn't even allow a difference in behavior between an upper bound of UNBOUNDED FOLLOWING vs CURRENT ROW if an ORDER BY is not present within the OVER() clause.
Interestingly i had checked PostgreSQL behavior prior to starting this PR and in PostgreSQL they do allow OVER() without ORDER BY to use different upper bounds with ROWS and treats the output differently in that case (will do a rolling aggregation for CURRENT ROW).
Hope that answers your question (though a very long answer 😅 )
Jackie-Jiang
left a comment
There was a problem hiding this comment.
Out of the scope of this PR, can you share how you usually generate the expected result for the query plan test? We have added so many hardcoded plan tests, and whenever we need to add a feature that changes the query plan it is extremely painful because we need to modify thousands of lines of test. #11418 is an example where the actual change might be 20 lines but almost 2000 lines of test change.
We need to either modify the test to only test the part related, or provide a way to regenerate the expected results automatically
I usually let the test run and look at the difference in the plans and verify whether the planner side makes sense. At least during development of planner side code I've found this very useful to catch bugs or find opportunities for improvements. Unfortunately I agree that this is a very tedious process to maintain this long list of plans and keep modifying them for any planner side changes. I don't know of a good way to generate these automatically though (unlike runtime where for most queries H2 can be used). Would it be helpful to reduce duplication in the plans by removing some of the unnecessary ones (e.g. in window functions initially we had tests for single window functions and corresponding tests using multiple window functions - this was good for the initial development to ensure both work but may not be so useful now)? |
one way is to utilize something similar to https://github.com/apache/calcite/blob/9c758942ab51af22a1095087ea2daeabc7d692ea/core/src/test/resources/org/apache/calcite/test/RelOptRulesTest.xml where
thoughts? |
|
Closing this as it's a really old PR and we'll mostly need to revisit all this anyways |
This PR adds support for
ROWStype window frames for aggregation window functions with anORDER BYclause within theOVER(). The default frame type forORDER BYwithinOVER()isRANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW. This PR adds support for the same default frame boundaries but withROWStype.Issue: #11406
Example queries this can support:
SELECT AVG(a.col1) OVER(ORDER BY a.col2 ROWS UNBOUNDED PRECEDING) from table;SELECT COUNT(a.col3) OVER (PARTITION BY a.col1 ORDER BY a.col2 ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) from table;This also allows the use of
ROW_NUMBER()(which only allowsROWStype frames) with aggregation functions, such as:SELECT ROW_NUMBER() OVER (PARTITION BY a.col1 ORDER BY a.col2), SUM(a.col3) OVER (PARTITION BY a.col1 ORDER BY a.col2 ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) from table;The supported functions with this change are:
cc @siddharthteotia @walterddr @Jackie-Jiang