Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Introduce resultSize in IndexedTable #7420

Merged
merged 1 commit into from
Sep 13, 2021

Conversation

Jackie-Jiang
Copy link
Contributor

Introduce resultSize (number of records kept after calling finish()) in IndexedTable which can be different from the trimSize.
On the broker side, we only need to keep limit records in the final result because the groups are already fully merged. This can significantly reduce the cost of final record sorting.
Also separate the logic for queries with/without order-by and move the common methods into the IndexedTable to reduce the duplicate code.

@codecov-commenter
Copy link

codecov-commenter commented Sep 9, 2021

Codecov Report

Merging #7420 (ced0dad) into master (5976922) will decrease coverage by 1.26%.
The diff coverage is 85.00%.

Impacted file tree graph

@@             Coverage Diff              @@
##             master    #7420      +/-   ##
============================================
- Coverage     71.87%   70.61%   -1.27%     
  Complexity     3347     3347              
============================================
  Files          1517     1517              
  Lines         75039    75040       +1     
  Branches      10921    10930       +9     
============================================
- Hits          53932    52987     -945     
- Misses        17482    18430     +948     
+ Partials       3625     3623       -2     
Flag Coverage Δ
integration1 ?
integration2 29.09% <76.66%> (+<0.01%) ⬆️
unittests1 69.69% <85.00%> (-0.02%) ⬇️
unittests2 14.53% <0.00%> (+<0.01%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files Coverage Δ
.../java/org/apache/pinot/core/util/GroupByUtils.java 100.00% <ø> (ø)
.../pinot/core/data/table/ConcurrentIndexedTable.java 78.26% <73.68%> (-3.56%) ⬇️
...not/core/query/reduce/GroupByDataTableReducer.java 89.36% <83.33%> (+0.07%) ⬆️
...ache/pinot/core/data/table/SimpleIndexedTable.java 80.00% <85.71%> (-12.31%) ⬇️
...org/apache/pinot/core/data/table/IndexedTable.java 84.74% <91.66%> (+5.25%) ⬆️
...re/data/table/UnboundedConcurrentIndexedTable.java 100.00% <100.00%> (+73.91%) ⬆️
...perator/combine/GroupByOrderByCombineOperator.java 82.71% <100.00%> (ø)
...pinot/minion/exception/TaskCancelledException.java 0.00% <0.00%> (-100.00%) ⬇️
...nverttorawindex/ConvertToRawIndexTaskExecutor.java 0.00% <0.00%> (-100.00%) ⬇️
...ore/startree/executor/StarTreeGroupByExecutor.java 0.00% <0.00%> (-86.67%) ⬇️
... and 101 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 5976922...ced0dad. Read the comment docs.

@Jackie-Jiang Jackie-Jiang force-pushed the optimize_indexed_table branch from b504c02 to ced0dad Compare September 9, 2021 23:47
@richardstartin
Copy link
Member

This can significantly reduce the cost of final record sorting.

How is this evaluated? Is it repeatable?

@Jackie-Jiang
Copy link
Contributor Author

@richardstartin This is repeatable for all group-by order-by queries without having clause (having clause has other issues, and I just put a TODO and keep the current behavior). For example, currently when we do SELECT ... GROUP BY ... ORDER BY ... LIMIT 10, on the broker side we will get and sort the top 5000 (trim size) records using heap sort. With the change, we only sort on the top 10 records with the same algorithm, and smaller heap is guaranteed to give better performance.

_hasOrderBy = false;
_tableResizer = null;
_trimSize = trimSize;
_trimThreshold = trimSize;
_trimSize = Integer.MAX_VALUE;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For group by without order by cases, why not use the value from limit clause as the trimSize?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is no trim required for cases without order-by, so trim size and trim threshold doesn't really matter

@Jackie-Jiang Jackie-Jiang merged commit f3068bc into apache:master Sep 13, 2021
@Jackie-Jiang Jackie-Jiang deleted the optimize_indexed_table branch September 13, 2021 23:35
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants