8346964: C2: Improve integer multiplication with constant in MulINode::Ideal() #22922
base: master
Conversation
Constant multiplication x*C can be optimized into LEFT SHIFT, ADD or SUB instructions, since these instructions generally have smaller latency and larger throughput on most architectures. For example:

1. x*8 can be optimized as x<<3.
2. x*9 can be optimized as x+x<<3, and x+x<<3 can be lowered as one SHIFT-ADD (ADD instruction combined with LEFT SHIFT) instruction on some architectures, like aarch64 and x86_64.

Currently OpenJDK implements a few such patterns in the mid-end, including:

1. |C| = 1<<n (n>0)
2. |C| = (1<<n) - 1 (n>0)
3. |C| = (1<<m) + (1<<n) (m>n, n>=0)

The first two are OK, because on most architectures they are lowered as only one ADD/SUB/SHIFT instruction. But the third pattern doesn't always perform well on some architectures, such as aarch64. The third pattern can be split into the following sub-patterns:

3.1. C = (1<<n) + 1 (n>0)
3.2. C = -((1<<n) + 1) (n>0)
3.3. C = (1<<m) + (1<<n) (m>n, n>0)
3.4. C = -((1<<m) + (1<<n)) (m>n, n>0)

According to the Arm optimization guide, if the shift amount is > 4, the latency and throughput of the ADD instruction are the same as those of the MUL instruction, so in that case converting MUL to ADD is not profitable. Take a[i] * C on aarch64 as an example.

Before (MUL is not converted):
```
mov x1, #C
mul x2, x1, x0
```

Now (MUL is converted):

For 3.1:
```
add x2, x0, x0, lsl #n
```

For 3.2:
```
add x2, x0, x0, lsl #n   // same cost with mul if n > 4
neg x2, x2
```

For 3.3:
```
lsl x1, x0, #m
add x2, x1, x0, lsl #n   // same cost with mul if n > 4
```

For 3.4:
```
lsl x1, x0, #m
add x2, x1, x0, lsl #n   // same cost with mul if n > 4
neg x2, x2
```

Test results (ns/op) on Arm Neoverse V2:
```
            Before    Now       Uplift       Pattern  Notes
testInt9    103.379   60.702    1.70305756   3.1
testIntN33  103.231   106.825   0.96635619   3.2      n > 4
testIntN9   103.448   103.005   1.004300762  3.2      n <= 4
testInt18   103.354   99.271    1.041129837  3.3      m <= 4, n <= 4
testInt36   103.396   99.186    1.042445506  3.3      m > 4, n <= 4
testInt96   103.337   105.416   0.980278136  3.3      m > 4, n > 4
testIntN18  103.333   139.258   0.742025593  3.4      m <= 4, n <= 4
testIntN36  103.208   139.132   0.741799155  3.4      m > 4, n <= 4
testIntN96  103.367   139.471   0.74113615   3.4      m > 4, n > 4
```

**(S1) From this point on, we should treat pattern 3 as follows:**

3.1 C = (1<<n) + 1 (n>0)
3.2 C = -((1<<n) + 1) (0<n<=4)
3.3 C = (1<<m) + (1<<n) (m>n, 0<n<=4)
3.4 C = -((1<<m) + (1<<n)) (disable)

Since this conversion is implemented in the mid-end, it impacts other optimizations, such as auto-vectorization. Assume there is the following loop, which can be vectorized.

Vector-A:
```
for (int i=0; i<len; i++) {
  sum += a[i] * C;
}
```

Before:
```
movi v19.4s, #C              // this will be hoisted out of the loop
mla  v16.4s, v17.4s, v19.4s
```

After:

For 3.1:
```
shl v19.4s, v17.4s, #m
add v17.4s, v19.4s, v17.4s
add v16.4s, v16.4s, v17.4s
```

For 3.2:
```
(add w11, w11, w11, lsl #m
 sub w11, w12, w11) * 4      // *4 is for 4 ints
```

For 3.3:
```
shl v18.4s, v17.4s, #m
shl v19.4s, v17.4s, #n
add v18.4s, v19.4s, v18.4s
add v16.4s, v16.4s, v18.4s
```

For 3.4:
```
(lsl w12, w4, #m
 add w11, w12, w4, lsl #n
 sub w13, w13, w11) * 4      // *4 is for 4 ints
```

The instructions generated before the conversion are simpler and perform better:
```
                  Before  Now      Uplift       Pattern
testInt9AddSum    47.958  63.696   0.752920121  3.1
testIntN33AddSum  48.013  147.834  0.324776438  3.2
testIntN9AddSum   48.026  149.149  0.322000148  3.2
testInt18AddSum   47.971  69.393   0.691294511  3.3
testInt36AddSum   47.98   69.395   0.69140428   3.3
testInt96AddSum   47.992  69.453   0.690999669  3.3
testIntN18AddSum  48.014  157.132  0.305564748  3.4
testIntN36AddSum  48.02   157.094  0.305676856  3.4
testIntN96AddSum  48.032  153.642  0.312622851  3.4
```

**(S2) From this point on, we should disable pattern 3 totally.**

But we can have different cases, for example:

Vector-B:
```
for (int i=0; i<100000; i++) {
  a[i] = a[i] * C;
}
```

Test results:
```
                 Before  Now     Uplift       Pattern
testInt9Store    43.392  33.338  1.301577779  3.1
testIntN33Store  43.365  75.993  0.570644665  3.2
testIntN9Store   43.5    75.452  0.576525473  3.2
testInt18Store   43.442  41.847  1.038115038  3.3
testInt36Store   43.369  41.843  1.03646966   3.3
testInt96Store   43.389  41.931  1.03477141   3.3
testIntN18Store  43.372  57.909  0.748968209  3.4
testIntN36Store  43.373  57.042  0.760369552  3.4
testIntN96Store  43.405  58.145  0.746495829  3.4
```

**(S3) From this point on, we should treat pattern 3 as follows:**

3.1 C = (1<<n) + 1 (n>0)
3.2 C = -((1<<n) + 1) (disable)
3.3 C = (1<<m) + (1<<n) (m>n, n>0)
3.4 C = -((1<<m) + (1<<n)) (disable)

Combining S1, S2 and S3, we get (each cell shows the proposed condition and the measured uplift):

Pattern  S1                   S2               S3
3.1      (n>0, 1.7)           (disable, 0.75)  (n>0, 1.3)
3.2      (0<n<=4, 1.0)        (disable, 0.32)  (disable, 0.57)
3.3      (m>n, 0<n<=4, 1.04)  (disable, 0.69)  (m>n, n>0, 1.03)
3.4      (disable, 0.74)      (disable, 0.30)  (disable, 0.74)

For 3.1, it's similar to pattern 2 and is usually lowered as only one instruction, so we tend to keep it in the mid-end.
For 3.2, we tend to disable it in the mid-end and do S1 in the back-end if it's profitable.
For 3.3, although S3 has a 3% performance gain, S2 has a 31% performance regression, so we tend to disable it in the mid-end and redo S1 in the back-end.
For 3.4, we shouldn't do this optimization anywhere.

In theory, auto-vectorization should be able to generate the best vectorized code, and cases that cannot be vectorized would be converted into other more optimal scalar instructions in the architecture backend (this is what gcc and llvm do). However, we currently do not have a cost model and VPlan, and the results of auto-vectorization are significantly affected by its input. Therefore, this patch turns off patterns 3.2, 3.3 and 3.4 in the mid-end; if it's profitable, these patterns can then be implemented in the backend. If we implement a cost model and VPlan in the future, it would be best to move all patterns to the backend; this patch does not conflict with that direction.

I also tested this patch on Arm N1, Intel SPR and AMD Genoa machines; no noticeable performance degradation was seen on any of them.

Here are the test results on an Arm V2 and an AMD Genoa machine:
```
Benchmark         V2-now      V2-after    Uplift  Genoa-now   Genoa-after  Uplift  Pattern  Notes
testInt8          60.36989    60.276736   1       116.768294  116.772547   0.99    1
testInt8AddSum    63.658064   63.797732   0.99    16.04973    16.051491    0.99    1
testInt8Store     38.829618   39.054129   0.99    19.857453   20.006321    0.99    1
testIntN8         59.99655    60.150053   0.99    132.269926  132.252473   1       1
testIntN8AddSum   145.678098  146.181549  0.99    158.546226  158.806476   0.99    1
testIntN8Store    32.802445   32.897907   0.99    19.047873   19.065941    0.99    1
testInt7          98.978213   99.176574   0.99    114.07026   113.08989    1       2
testInt7AddSum    62.675636   62.310799   1       23.370851   20.971655    1.11    2
testInt7Store     32.850828   32.923315   0.99    23.884952   23.628681    1.01    2
testIntN7         60.27949    60.668158   0.99    174.224893  174.102295   1       2
testIntN7AddSum   62.746696   62.288476   1       20.93192    20.964557    0.99    2
testIntN7Store    32.812906   32.851355   0.99    23.810024   23.526074    1.01    2
testInt9          60.820402   60.331938   1       108.850777  108.846161   1       3.1
testInt9AddSum    62.24679    62.374637   0.99    20.698749   20.741137    0.99    3.1
testInt9Store     32.871723   32.912065   0.99    19.055537   19.080735    0.99    3.1
testIntN33        106.517618  103.450746  1.02    153.894345  140.641135   1.09    3.2      n > 4
testIntN33AddSum  147.589815  47.911612   3.08    153.851885  17.008453    9.04    3.2
testIntN33Store   75.434513   43.473053   1.73    26.612181   20.436323    1.3     3.2
testIntN9         102.173268  103.70682   0.98    155.858169  140.718967   1.1     3.2      n <= 4
testIntN9AddSum   148.724952  47.963305   3.1     186.902111  20.249414    9.23    3.2
testIntN9Store    74.783788   43.339188   1.72    20.150159   20.888448    0.96    3.2
testInt18         98.905625   102.942092  0.96    142.480636  140.748778   1.01    3.3      m <= 4, n <= 4
testInt18AddSum   68.695585   48.103536   1.42    26.88524    16.77886     1.6     3.3
testInt18Store    41.307909   43.385183   0.95    21.233238   20.875026    1.01    3.3
testInt36         99.039742   103.714745  0.95    142.265806  142.334039   0.99    3.3      m > 4, n <= 4
testInt36AddSum   68.736756   47.952189   1.43    26.868362   17.030035    1.57    3.3
testInt36Store    41.403698   43.414093   0.95    21.225454   20.52266     1.03    3.3
testInt96         105.00287   103.528144  1.01    237.649526  140.643255   1.68    3.3      m > 4, n > 4
testInt96AddSum   68.481133   48.04549    1.42    26.877407   16.918209    1.58    3.3
testInt96Store    41.276292   43.512994   0.94    23.456117   20.540181    1.14    3.3
testIntN18        138.629044  103.269657  1.34    210.315628  140.716818   1.49    3.4      m <= 4, n <= 4
testIntN18AddSum  156.635652  48.003989   3.26    215.807135  16.917665    12.75   3.4
testIntN18Store   57.584487   43.410415   1.32    26.819827   20.707778    1.29    3.4
testIntN36        139.068861  103.766774  1.34    209.522432  140.720322   1.48    3.4      m > 4, n <= 4
testIntN36AddSum  156.36928   48.027779   3.25    215.705842  16.893192    12.76   3.4
testIntN36Store   57.715418   43.493958   1.32    21.651252   20.676877    1.04    3.4
testIntN96        139.151761  103.453665  1.34    269.254161  140.753499   1.91    3.4      m > 4, n > 4
testIntN96AddSum  153.123557  48.110524   3.18    263.262635  17.011144    15.47   3.4
testIntN96Store   57.793179   43.47574    1.32    24.444592   20.530219    1.19    3.4
```

Limitations:
1. This patch only analyzes two vector cases; there may be other vector cases that get a performance regression with this patch.
2. This patch does not implement the disabled patterns in the backend; I will propose a follow-up patch to implement these patterns in the aarch64 backend.
3. This patch does not handle the long type, because different architectures have different auto-vectorization support for the long type, resulting in very different performance, and it is difficult to find a solution that does not introduce significant performance degradation.
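As a sanity check of the arithmetic behind sub-patterns 3.1-3.4, here is a small standalone C++ sketch (illustrative only, not part of this patch; the shl helper, the shift amounts m=5/n=3 and the sampled value range are arbitrary choices) that verifies the scalar identities the transformation relies on:

```
#include <cassert>
#include <cstdint>
#include <cstdio>

// Shift helper matching Java's int << semantics (two's-complement wrap-around).
static int32_t shl(int32_t x, int n) {
  return (int32_t)((uint32_t)x << n);
}

int main() {
  const int m = 5, n = 3;  // arbitrary shift amounts with m > n > 0
  for (int32_t x = -1000; x <= 1000; x++) {
    // 3.1: C = (1<<n) + 1 = 9          ->  x*C == (x<<n) + x
    assert(x * 9 == shl(x, n) + x);
    // 3.2: C = -((1<<n) + 1) = -9      ->  x*C == -((x<<n) + x)
    assert(x * -9 == -(shl(x, n) + x));
    // 3.3: C = (1<<m) + (1<<n) = 40    ->  x*C == (x<<m) + (x<<n)
    assert(x * 40 == shl(x, m) + shl(x, n));
    // 3.4: C = -((1<<m) + (1<<n)) = -40 -> x*C == -((x<<m) + (x<<n))
    assert(x * -40 == -(shl(x, m) + shl(x, n)));
  }
  printf("all four decompositions hold for the sampled range\n");
  return 0;
}
```

Because Java int multiplication and shifts both wrap modulo 2^32, these identities hold for all int values, which is why the shape of the constant alone decides whether the rewrite is legal; the question in this PR is only whether it is profitable.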
Hi @erifan, welcome to this OpenJDK project and thanks for contributing! We do not recognize you as Contributor and need to ensure you have signed the Oracle Contributor Agreement (OCA). If you have not signed the OCA, please follow the instructions. Please fill in your GitHub username in the "Username" field of the application. Once you have signed the OCA, please let us know by writing /signed in a comment. If you already are an OpenJDK Author, Committer or Reviewer, please click here to open a new issue so that we can record that fact. Please use "Add GitHub user erifan" as summary for the issue. If you are contributing this work on behalf of your employer and your employer has signed the OCA, please let us know by writing /covered in a comment.
❗ This change is not yet ready to be integrated.
/signed
Thank you! Please allow for up to two weeks to process your OCA, although it is usually done within one to two business days. Also, please note that pull requests that are pending an OCA check will not usually be evaluated, so your patience is appreciated!
/covered
Thank you! Please allow for a few business days to verify that your employer has signed the OCA. Also, please note that pull requests that are pending an OCA check will not usually be evaluated, so your patience is appreciated!
Hi @erifan
I have some first questions / comments. I only scanned through quickly.
My biggest question:
You only mention aarch64. But we would need to know that your changes also work well on x64.
Also:
Can you summarize whether your changes are only for the performance of vectorization, or also for scalar code?
test/hotspot/jtreg/compiler/loopopts/superword/TestAlignVector.java
@@ -250,27 +250,33 @@ Node *MulINode::Ideal(PhaseGVN *phase, bool can_reshape) {
    sign_flip = true;
  }

  // Get low bit; check for being the only bit
  // TODO: abs_con = (1<<m)+(1<<n) and con = -((1<<n)+1) was supported
Is this TODO here on purpose?
Yes, it's a reminder to implement these two patterns for scalar cases in backends if they are worthwhile on the specific architecture.
Aha, I see. We don't leave TODOs in the code, because nobody will ever look at them again. If we agree to go ahead with this, then you should rather file an RFE to keep track of it. But before you file it, let's first discuss the overall strategy.
  private static int addTo6(int a) {
-     return a + a + a + a + a + a; // a*6 => (a<<1) + (a<<2)
+     return a + a + a + a + a + a;
Is this an improvement?
Is this an improvement on aarch64 for all implementations? What about x64?
If a*6 is in a loop and can be vectorized, there may be a big performance improvement. If it's not vectorized, there may be some small performance loss. See the test results of a*18 = (a<<4) + (a<<1) (same as a*6 = (a<<2) + (a<<1)) in three different cases:
Benchmark V2-now V2-after Uplift Genoa-now Genoa-after Uplift Notes
testInt18 98.90 102.94 0.96 142.48 140.75 1.01 scalar
testInt18AddSum 68.70 48.10 1.42 26.88 16.78 1.6 vectorized
testInt18Store 41.31 43.39 0.95 21.23 20.88 1.01 vectorized
We can see that for the scalar case the conversion a*6 => (a<<2) + (a<<1) is profitable on aarch64. I have a follow-up patch that reimplements this pattern in the aarch64 backend; I'll file it later. But for x64, there is no obvious performance change whether or not we do this conversion. That is also why I left a TODO in mulnode.cpp.
Benchmark V2-now V2-after Uplift Genoa-now Genoa-after Uplift Notes
testInt18 98.90 102.94 0.96 142.48 140.75 1.01 scalar
Ok, that would be a 4% regression on V2. It is not much, but still possibly relevant.
I think I would need to see a clear strategy that we can actually pull off. Otherwise it may be that you introduce a regression here, that then nobody gets around to fixing later.
// But if it's not vectorizable, maybe it's profitable to do the conversion on
// some architectures, support it in backends if it's worthwhile.
If it is all about vectorization only: we could consider delaying the Ideal optimization until after loop-opts. Then we can keep the multiplication for vectorization, and only use shift/add once we know that we cannot vectorize.
But for some cases, it's better to do the conversion before vectorization, for example x * 8 => x << 3. Through test results and theoretical analysis (only on aarch64, see the commit message), I found that it is better to do the conversion before vectorization if the multiplication can be transformed into no more than one or two shift/add instructions.
In theory we could also do a transform with vectors, and convert a vector mul to a vector shift, right?
It is a bit scary to have these cases where some are better before and some better after vectorization. Makes performance quite unpredictable - your change may introduce improvements in some cases but regressions in others.
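For reference, HotSpot already has a mechanism for the kind of deferral suggested above: a node's Ideal() can bail out before loop opts and register itself for the post-loop-opts IGVN round. The sketch below is only an illustration of that idea, not code from this PR; it assumes the post-loop-opts IGVN hooks (Compile::post_loop_opts_phase / record_for_post_loop_opts_igvn) present in recent HotSpot and is not compilable outside the HotSpot source tree.

```
// Sketch only (not from this PR): gate MulINode's constant strength reduction
// so that it runs in the IGVN round after loop opts, keeping the original
// multiply visible to auto-vectorization.
Node* MulINode::Ideal(PhaseGVN* phase, bool can_reshape) {
  const TypeInt* t2 = phase->type(in(2))->isa_int();
  if (t2 != nullptr && t2->is_con() &&
      can_reshape && !phase->C->post_loop_opts_phase()) {
    // Too early: remember this node and retry once loop opts are done.
    phase->C->record_for_post_loop_opts_igvn(this);
    return nullptr;
  }
  // ... the existing shift/add strength reduction would run here ...
  return MulNode::Ideal(phase, can_reshape);
}
```

Whether such a deferral is the right trade-off is exactly the open question in this thread, since some shapes (e.g. x*8 => x<<3) are still best converted before vectorization.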
Node* n1 = phase->transform(
    new LShiftINode(in(1), phase->intcon(log2i_exact(abs_con - 1))));
res = new AddINode(in(1), n1);
} else if (is_power_of_2(abs_con + 1)) {
So now you only check for power_of_2 +- 1, right? But before we also looked at patterns with two bits set, such as 64 + 8.
You would really need to prove that this is not a loss on any of the platforms we care about, incl. x64.
Yes, I have tested the patch on several x64 machines, including AMD Genoa, Intel SPR and some older x86 machines; there's no noticeable performance loss. Test results on aarch64 and AMD Genoa are included in the commit message, please take a look, thanks!
So now you only check for power_of_2 +- 1, right?
Yes, now I only check these patterns:
1, Const = 1<<n, (n > 0)
2, Const = -(1<<n), (n > 0)
3, Const = (1<<n) + 1, (n > 0)
4, Const = (1<<n) - 1, (n > 0)
5, Const = -((1<<n) -1) (n > 0)
Removed these patterns:
1, Const = (1<<m) + (1<<n) (m > 0, n > 0)
2, Const = -((1<<m) + (1<<n)) (m > 0, n > 0)
3, Const = -((1<<n) + 1), (n > 0)
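To make the retained vs. removed sets above concrete, here is a small standalone C++ sketch (illustrative only; the classify helper and the sample constants are made up, and this is not the actual mulnode.cpp logic) that checks a 32-bit constant against the patterns kept in the mid-end after this patch:

```
#include <cstdint>
#include <cstdio>

// Returns a short description of how the retained mid-end patterns would
// handle multiplication by 'con', or "keep mul" if none of them applies.
// Patterns kept (per the list above, all with n > 0):
//   1) C = 1<<n          2) C = -(1<<n)
//   3) C = (1<<n) + 1    4) C = (1<<n) - 1    5) C = -((1<<n) - 1)
static const char* classify(int32_t con) {
  auto is_pow2 = [](int64_t v) { return v > 1 && (v & (v - 1)) == 0; };
  int64_t c = con;                 // widen to avoid overflow at INT_MIN
  if (is_pow2(c))       return "x << n";
  if (is_pow2(-c))      return "-(x << n)";
  if (is_pow2(c - 1))   return "(x << n) + x";
  if (is_pow2(c + 1))   return "(x << n) - x";
  if (is_pow2(-c + 1))  return "x - (x << n)";
  return "keep mul (maybe strength-reduce in the back-end)";
}

int main() {
  const int32_t samples[] = {8, -8, 9, 7, -7, -9, 18, 36, -36, 100};
  for (int32_t c : samples) {
    printf("x * %d -> %s\n", (int)c, classify(c));
  }
  return 0;
}
```

Constants such as 18 or -36, which the two-bit pattern previously decomposed, now fall through to "keep mul" and are candidates for back-end strength reduction instead.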
Hi @eme64, thanks for your review!
Yes, this patch also benefits the x86_64 platform. I tested the patch on aarch64 V2 and N1 processors, and on AMD Genoa and Intel SPR processors (I can provide the test results if necessary). The test results show that for some cases the performance uplift is very large and there is no obvious performance degradation. I'm not familiar with the x86_64 instruction set, so I didn't do any theoretical analysis on x86_64; I did a very detailed theoretical analysis on aarch64. I don't have machines with architectures other than aarch64 and x64, so this patch has not been tested on other platforms.

For both vectorization and scalar cases. For example, the pattern x * C => -((x<<m) + (x<<n)) is currently supported; it is a negative optimization from both the vectorization and scalar perspectives on both aarch64 and x64, and this patch removes it. As mentioned in the commit message, this patch is good for most cases, but there is a small performance loss in some cases. Ideally, if we implement VPlan, it would be better to keep the multiplication operation before vectorization and let the vectorizer generate the optimal vector code, and then, for scalar cases that cannot be vectorized, convert to other more optimal instructions in the backend (or in this pass, if it's merged) if necessary.
Hi @erifan, thanks for your responses. I now looked at the benchmark results. Maybe you could even create one big patch with everything in it, just so that we can see that there are no regressions, and then split it up into parts (multiple RFEs) for easier review. Also: it would be nice to see benchmarks on as many different architectures as you can show. And please make sure that the table is nicely aligned - currently it is a bit difficult to read.
What do you think?
About the OCA: I am an employee of NVIDIA's Java compiler team, and NVIDIA has signed the OCA.
Ok, I'll combine the patches and file one big patch, and update the test results.
OK, I can provide test results on aarch64 V2, N1, AMD64 Genoa and Intel SPR processors.
Yes, I mean LLVM VPlan, I noticed your related work, it's very interesting.
Yes.
@erifan I did some more thinking when falling asleep / waking up. This is a really interesting problem here. For

The issue is that different platforms have different characteristics here for these instructions - we would have to see how they differ. As far as I remember

Additionally: if your workload has other

And then the characteristics of scalar ops may not be identical to vector ops. It would be interesting to have a really solid benchmark, where you explore the impact of these different effects. Maybe perfect tuning is not possible. Maybe we are willing to take a

Anyway, I don't right now have much time to investigate and work on this myself. So you'd have to do the work, benchmark, explanation etc. But I think the

As to what to do in sequence, here is a suggestion:

This way you can separate scalar and vector changes. And please investigate when

We may also want to get people from ARM and Intel into this discussion at some point.

That's enough for now. Let me know what you think :)
Hi @eme64 thanks for your review.
I agree with you. I am actually working on
AARCH64 doesn't actually have a
Yes, this is very tricky; the actual performance is related to many aspects, such as pipeline, latency, throughput, ROB, and even memory performance. We can only do optimization based on certain references and generalities, such as the latency and throughput of different instructions. But when it comes to generalities, it is actually difficult to say which scenario is more general.
I don't know such a benchmark suite yet. For AARCH64, I usually refer to the Arm Optimization Guide, but some instructions seem to be missing there. I guess AMD and Intel should have similar documents? I'm not sure.
Yes I agree, I will deal with the performance loss.
Do you mean shift add => lea ? I think this is already done on x64.
Yes.
@erifan Amazing, thanks for the enthusiasm :)
I also think that moving mul-related idealization to happen after loop opts would be a good idea. In the past I've noticed that patterns such as
@erifan this pull request can not be integrated into master due to one or more merge conflicts. To resolve these merge conflicts and update this pull request you can run the following commands in the local repository for your personal fork:
git checkout JDK-8346964
git fetch https://git.openjdk.org/jdk.git master
git merge FETCH_HEAD
# resolve conflicts and follow the instructions given by git merge
git commit -m "Merge master"
git push
@erifan This pull request has been inactive for more than 4 weeks and will be automatically closed if another 4 weeks passes without any activity. To avoid this, simply add a new comment to the pull request. Feel free to ask for assistance if you need help with progressing this pull request towards integration!
@erifan This pull request has been inactive for more than 8 weeks and will now be automatically closed. If you would like to continue working on this pull request in the future, feel free to reopen it! This can be done using the /open pull request command.
/open |
@erifan This pull request is now open
@erifan you opened this again. Does that mean we should review again? I see that you did not make any changes since our last conversation. If it is not ready for review, could you please convert it to Draft, so it is clear that you are not asking for reviews currently? |
Oh, sorry to bother you. I'm working on something else. For this PR, I have actually finished the modification according to your suggestion, and tests on aarch64 and x86_64 also showed that there's no performance regression. But I didn't file the patch for the following reasons: Thanks!
@erifan This pull request has been inactive for more than 8 weeks and will be automatically closed if another 8 weeks passes without any activity. To avoid this, simply issue a /touch or /keepalive command to the pull request.
/keepalive |
@erifan The pull request is being re-evaluated and the inactivity timeout has been reset.
@erifan This pull request has been inactive for more than 8 weeks and will be automatically closed if another 8 weeks passes without any activity. To avoid this, simply issue a /touch or /keepalive command to the pull request.
Reviewing
Using git
Checkout this PR locally:
$ git fetch https://git.openjdk.org/jdk.git pull/22922/head:pull/22922
$ git checkout pull/22922
Update a local copy of the PR:
$ git checkout pull/22922
$ git pull https://git.openjdk.org/jdk.git pull/22922/head
Using Skara CLI tools
Checkout this PR locally:
$ git pr checkout 22922
View PR using the GUI difftool:
$ git pr show -t 22922
Using diff file
Download this PR as a diff file:
https://git.openjdk.org/jdk/pull/22922.diff
Using Webrev
Link to Webrev Comment