This rearranges the group key assignment in two steps. First, it splits the loop that computes the group ids from the loop that marks the existence of the groups in a block. Second, it specialises the group id calculation loop for the cases of 2 and 3 group by expressions. This simplifies the group id calculation loop to the extent that C2 can vectorize it on JDK11. In a microbenchmark small enough to examine the generated code, this speeds the code up by at least 5x compared to a loop that can handle any number of group by expressions.
The vectorized group id assignment is so fast that it is no longer the bottleneck: nearly 70% of the time is now spent in group marking, and only 20% in computing group ids. The marking can probably be improved, but it's harder to parallelise than computing group ids.
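For readers skimming the thread, the split described above can be sketched roughly as follows. This is an illustrative sketch only, not the code in this PR: the class and method names, the use of `BitSet` for marking, and the exact formula combining dictionary ids with cardinalities are all assumptions made for the example.

```java
import java.util.BitSet;

// Hypothetical sketch: names and layout are assumptions, not the PR's actual code.
final class GroupKeySketch {

    // Specialised loop for exactly 2 group by expressions.
    // A simple counted loop over int arrays with no branches or calls,
    // which is the shape of loop C2 is able to auto-vectorize.
    static void computeGroupIds2(int[] d0, int[] d1, int card0, int[] out, int length) {
        for (int i = 0; i < length; i++) {
            out[i] = d0[i] + d1[i] * card0;
        }
    }

    // Specialised loop for exactly 3 group by expressions.
    static void computeGroupIds3(int[] d0, int[] d1, int[] d2,
                                 int card0, int card1, int[] out, int length) {
        int card01 = card0 * card1; // hoisted so the loop body is only multiplies and adds
        for (int i = 0; i < length; i++) {
            out[i] = d0[i] + d1[i] * card0 + d2[i] * card01;
        }
    }

    // Separate marking loop: records which group ids occur in the block.
    // Each iteration writes to a data-dependent location, so it is kept
    // out of the arithmetic loop above.
    static void markGroups(int[] groupIds, int length, BitSet seen) {
        for (int i = 0; i < length; i++) {
            seen.set(groupIds[i]);
        }
    }

    public static void main(String[] args) {
        // Cardinalities 100 x 30 x 50, as in the 3D group by mentioned in this PR.
        int[] d0 = {1, 99};
        int[] d1 = {2, 29};
        int[] d2 = {3, 49};
        int[] groupIds = new int[2];
        computeGroupIds3(d0, d1, d2, 100, 30, groupIds, 2);
        BitSet seen = new BitSet(100 * 30 * 50);
        markGroups(groupIds, 2, seen);
        System.out.println(groupIds[0] + " " + groupIds[1] + " " + seen.cardinality());
    }
}
```

Keeping the marking in its own loop means the arithmetic loop has no side effects other than sequential writes to `out`, which is what leaves it simple enough for the JIT to vectorize.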
Merging #7949 (76084cd) into master (b6eeaf3) will decrease coverage by 43.66%.
The diff coverage is 90.00%.
❗ Current head 76084cd differs from pull request most recent head 10670dd. Consider uploading reports for the commit 10670dd to get more accurate results
I am guessing that vpaddd and the other similar instructions are the SIMD instructions used in the second region of the JIT compiled code, which suggests that the new code is getting auto-vectorized. Is this correct?
Yes, the multiplications (vpmulld) and additions (vpaddd) are vectorised so 8 are performed at a time after the transformation, which explains the speedup.
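To picture what "8 at a time" means, here is a scalar sketch that mimics the lane-wise behaviour. It assumes 256-bit registers holding eight 32-bit ints (8 x 32 = 256); the class name, method name, and `LANES` constant are hypothetical. It only emulates the effect in plain Java: the real vector instructions are emitted by C2, not written by hand.

```java
// Hypothetical emulation of an 8-lane multiply-add; not actual JIT output.
final class LaneSketch {
    static final int LANES = 8; // assumption: 256-bit register / 32-bit int

    static void multiplyAdd(int[] a, int[] b, int scale, int[] out) {
        int i = 0;
        int bound = a.length - (a.length % LANES);
        // Main loop: each outer iteration corresponds to one vector multiply
        // (like vpmulld) and one vector add (like vpaddd) over 8 ints.
        for (; i < bound; i += LANES) {
            for (int lane = 0; lane < LANES; lane++) {
                out[i + lane] = a[i + lane] + b[i + lane] * scale;
            }
        }
        // Scalar tail for the remaining elements that do not fill a full vector.
        for (; i < a.length; i++) {
            out[i] = a[i] + b[i] * scale;
        }
    }
}
```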
The difference in code generation can be seen with perfasm (for a 3D group by, cardinalities 100 x 30 x 50):
before: