[AArch64] Set MaxInterleaving to 4 for Neoverse V2 and V3 #100385
Conversation
@llvm/pr-subscribers-llvm-analysis @llvm/pr-subscribers-llvm-transforms

Author: Sjoerd Meijer (sjoerdmeijer)

Changes: This helps loop-based workloads and benchmarks quite a lot; SPEC INT is unaffected.

Full diff: https://github.com/llvm/llvm-project/pull/100385.diff

3 Files Affected:
diff --git a/llvm/lib/Target/AArch64/AArch64Subtarget.cpp b/llvm/lib/Target/AArch64/AArch64Subtarget.cpp
index 32a355fe38f1c..280083ae824cd 100644
--- a/llvm/lib/Target/AArch64/AArch64Subtarget.cpp
+++ b/llvm/lib/Target/AArch64/AArch64Subtarget.cpp
@@ -233,9 +233,11 @@ void AArch64Subtarget::initializeProperties(bool HasMinSize) {
PrefLoopAlignment = Align(32);
MaxBytesForLoopAlignment = 16;
break;
+ case NeoverseV2:
+ MaxInterleaveFactor = 4;
+ LLVM_FALLTHROUGH;
case NeoverseN2:
case NeoverseN3:
- case NeoverseV2:
case NeoverseV3:
PrefFunctionAlignment = Align(16);
PrefLoopAlignment = Align(32);
diff --git a/llvm/test/Transforms/LoopVectorize/AArch64/interleaving-load-store.ll b/llvm/test/Transforms/LoopVectorize/AArch64/interleaving-load-store.ll
index 0e54bd15e5ea5..6a47e0ea34a50 100644
--- a/llvm/test/Transforms/LoopVectorize/AArch64/interleaving-load-store.ll
+++ b/llvm/test/Transforms/LoopVectorize/AArch64/interleaving-load-store.ll
@@ -5,6 +5,7 @@
; RUN: opt -passes=loop-vectorize -mtriple=arm64-apple-macos -mcpu=apple-a14 -S %s | FileCheck --check-prefix=INTERLEAVE-4 %s
; RUN: opt -passes=loop-vectorize -mtriple=arm64-apple-macos -mcpu=apple-a15 -S %s | FileCheck --check-prefix=INTERLEAVE-4 %s
; RUN: opt -passes=loop-vectorize -mtriple=arm64-apple-macos -mcpu=apple-a16 -S %s | FileCheck --check-prefix=INTERLEAVE-4 %s
+; RUN: opt -passes=loop-vectorize -mtriple=arm64 -mcpu=neoverse-v2 -S %s | FileCheck --check-prefix=INTERLEAVE-4 %s
; Tests for selecting interleave counts for loops with loads and stores.
diff --git a/llvm/test/Transforms/LoopVectorize/AArch64/interleaving-reduction.ll b/llvm/test/Transforms/LoopVectorize/AArch64/interleaving-reduction.ll
index 72d528d8748ba..9211cdfaebd15 100644
--- a/llvm/test/Transforms/LoopVectorize/AArch64/interleaving-reduction.ll
+++ b/llvm/test/Transforms/LoopVectorize/AArch64/interleaving-reduction.ll
@@ -5,6 +5,7 @@
; RUN: opt -passes=loop-vectorize -mtriple=arm64-apple-macos -mcpu=apple-a14 -S %s | FileCheck --check-prefix=INTERLEAVE-4 %s
; RUN: opt -passes=loop-vectorize -mtriple=arm64-apple-macos -mcpu=apple-a15 -S %s | FileCheck --check-prefix=INTERLEAVE-4 %s
; RUN: opt -passes=loop-vectorize -mtriple=arm64-apple-macos -mcpu=apple-a16 -S %s | FileCheck --check-prefix=INTERLEAVE-4 %s
+; RUN: opt -passes=loop-vectorize -mtriple=arm64 -mcpu=neoverse-v2 -S %s | FileCheck --check-prefix=INTERLEAVE-4 %s
; Tests for selecting the interleave count for loops with reductions.
I agree this makes sense in principle, considering the number of pipelines available. Whenever I have tried it in practice the performance has not looked great, I believe because it creates loop bodies that are so large that they are executed less often, and you end up with worse performance overall. This can depend heavily on the dynamic trip counts though, and different domains of code will typically have different expected loop trip counts. Can you give more details on what kind of performance validation you have run?
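For illustration only, a minimal sketch (not taken from the patch) of the kind of loop where this trade-off shows up, assuming 128-bit NEON vectors, i.e. a VF of 4 x i32: with an interleave count of 4 the main vector loop consumes 16 elements per iteration, so a loop whose dynamic trip count is below that runs entirely in the scalar remainder.

```cpp
// Hypothetical example, not from this PR: a simple accumulation loop.
// With <4 x i32> vectors and interleave count 4, the vectorized main loop
// advances 16 elements per iteration; if n is small (say n < 16), only the
// scalar remainder loop executes and the wider body brings no benefit.
int sum(const int *a, int n) {
  int s = 0;
  for (int i = 0; i < n; ++i)
    s += a[i];
  return s;
}
```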
Here are some motivating examples from TSVC. The numbers are ratios of "after" to "before" (after / before), i.e. with and without this patch applied; values lower than 1 are improvements. This shows ~2x improvements for the first three cases, and then a few more decent improvements. This is a modified version of TSVC where each kernel is compiled into its own executable to reduce noise; I do not trust the TSVC numbers from the llvm test-suite (but that is a separate story):
The benchmarks that I ran are the llvm test-suite, TSVC, RAJAPerf, and SPEC INT.
@@ -233,9 +233,11 @@ void AArch64Subtarget::initializeProperties(bool HasMinSize) {
    PrefLoopAlignment = Align(32);
    MaxBytesForLoopAlignment = 16;
    break;
  case NeoverseV2:
A comment here to explain why "4" is chosen would be nice.
Here are the RAJAPerf numbers:
There is one regression, which I did not spot before; I am not sure if it is noise or something else, but let's assume it is real. The geomean number is 0.984, so an overall 1.5% improvement. TODO:
I have checked SPEC FP (the C/C++ apps) and that is unaffected. I double checked the llvm test-suite (MultiSource):
What do you think @davemgreen?
A lot more vectorisation happens in the Fortran SPEC FP benchmarks, particularly fotonik3d, so if possible it might be good to check those too.
I have never done this before, but if it is just a matter of building flang I can certainly give this a try.
I had some problems with flang-new and building some FP apps, but I did manage to build and run fotonik3d, which showed a 1.3% improvement. Of course, I am not sure whether that is caused exactly by this change or by second-order effects, but it is not a regression, which is the most important thing.
@ceseo do you know if flang is capable of building Fortran SPEC FP properly yet? It would save Sjoerd the hassle if we (Linaro) know it doesn't work yet.
I think the compilation of 521.wrf_r hangs. Getting flang-new compiled was a pain too: I tried 2 different AArch64 platforms and 4 different compilers, and saw weird behaviour, from GCC erroring:
and clang somehow silently crashing or erroring. Anyway, that would be good to fix, but I don't think it should hold up this patch. From all the numbers collected, this seems a generally good thing; do you agree @davemgreen?
Yes, it can. We even presented comparisons between Flang and gfortran for SPEC CPU 2017 rate at Linaro Connect in May.
Running flang-new on SPEC should work. It might be worth making sure the others in it are OK.

I think this makes sense when you look at it from the point of view of the CPU. There are 4 vector pipelines, so it makes sense to start interleaving 4x, and other cores do the same. The other dimension is the code that you are running and the dynamic trip count of the important loops in that code. Certain domains tend to have higher trip counts, like HPC, ML and certain DSP routines. Benchmarks often have a super high trip count to make the benchmark meaningful. DSP/image processing can often be 16 x 16 tiles, and for certain other domains the trip counts are often very low, possibly closer to 1 than anything else. So I'm not sure I trust the more benchmark-y results as much as the others.

I think it makes sense to have a higher maximum interleave factor, it just might not always be best to use it. Having it higher for reductions might be more of a win than other cases, for example. I think this is a good idea so long as the performance is OK. Can you add Neoverse-V3 to the same switch? I think it should work similarly.
Thanks for the thoughts and suggestions Dave.
Thanks - there is a cam_r in there too I think.
All these benchmarks work fine for us. Can you try increasing the stack setting to unlimited? It should help 503.bwaves_r.
Thanks @kiranchandramohan, Dave also helped me, and I was able to run the Fortran benchmarks.
Thanks @davemgreen, I have now been able to run that, and I can confirm the 4.5% regression in cam4. I would like to think that these are two exceptions, and that interleaving with a maximum of 4 is the sensible and generally beneficial thing to do; cam4 is the exception among all the codes that I ran. The way forward is to follow up on this and look into epilogue vectorisation. At the moment, epilogue vectorisation is effectively disabled because it only kicks in when the main loop vectorisation factor is 16 or more:
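For reference, this is approximately the threshold being referred to, a paraphrased sketch of the option in llvm/lib/Transforms/Vectorize/LoopVectorize.cpp; the exact description string may differ upstream.

```cpp
// Paraphrased sketch of the existing threshold in LoopVectorize.cpp:
// epilogue vectorization is only considered when the main loop's
// vectorization factor reaches this value (default 16).
static cl::opt<unsigned> EpilogueVectorizationMinVF(
    "epilogue-vectorization-minimum-VF", cl::init(16), cl::Hidden,
    cl::desc("Only loops with vectorization factor equal or larger than "
             "the specified value are considered for epilogue "
             "vectorization."));
```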
By lowering this to 8, I have verified that we gain almost all performance back for cam4. WDYT?
Hi - I think that sounds good. It might cost a bit of code size, but should hopefully mitigate the perf issues we've seen before. In the long run we should look into predicated epilogue vectorization again - that should be especially good in cases where we can remove the scalar loop entirely.
One issue with interleaving is that epilogue vectorization only considers VF, but not VF x UF. There are a number of cases where epilogue vectorization would be beneficial on AArch64 when the VF < 16 but UF > 1. @juliannagele is currently looking into adjusting the cost model for that.
Ok, thanks. Is this then a fair summary: that work is slightly orthogonal, it will help this work and will make it even better?
With different ways forward, are you happy to land this @davemgreen?
I hadn't realised that we didn't account for the UF already. That sounds like a good thing to fix, thanks for the info. @juliannagele @fhahn do you have a timeframe for when such a patch will be ready? I would like to avoid the regressions if we can, and would not want to end up relying on a patch that never materializes. Otherwise perhaps @sjoerdmeijer could increase the limit for cases where the UF is 4, and we can improve it further in the future where needed.
@davemgreen: in the meantime, I will prepare a patch to create a TTI hook that controls the minimum VF for epilogue vectorisation.
Do you need to take interleaving into account when lowering the minimum? The reason I ask is that for large loops we may not interleave at all, so you'll have an epilogue even when the main vector loop has IC=1, VF=8. Maybe that's fine, but I can imagine some downsides in terms of increased code size and worse performance for some low trip counts.
Hi @david-arm , thanks for the suggestion, that makes perfect sense. Fixing this generally is going to be a two-step solution anyway, I think: the first step is to make that single magic number 16 target dependent with a TTI hook, and the second step is coming up with a better heuristic that takes the minimum epilogue VF and other things like VF x UF into account, which could be implemented in a helper function in the vectoriser, or perhaps in the subtarget-specific TTI hook implementation that also sets this minimum epilogue VF. Either way, I think this is a good heuristic that can be added in the TTI hook that I will start drafting now. A sketch of what the two steps could look like is below.
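A hypothetical, self-contained sketch of the plan described above; the class and method names here are illustrative assumptions, not the final upstream LLVM API.

```cpp
// Step 1 (illustrative): replace the hard-coded "16" with a per-target hook,
// so a subtarget such as Neoverse V2/V3 can lower it to 8.
struct TargetInfo {
  // Minimum main-loop width at which epilogue vectorization is considered.
  virtual unsigned getEpilogueVectorizationMinVF() const { return 16; }
  virtual ~TargetInfo() = default;
};

struct NeoverseV2Info : TargetInfo {
  unsigned getEpilogueVectorizationMinVF() const override { return 8; }
};

// Step 2 (illustrative): a profitability check that looks at VF x IC rather
// than VF alone, so an interleaved main loop (e.g. VF=4, IC=4) can still get
// a vector epilogue.
bool shouldConsiderEpilogue(unsigned VF, unsigned IC, const TargetInfo &TI) {
  return VF * IC >= TI.getEpilogueVectorizationMinVF();
}
```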
Hi, I've got an initial version of this ready; I'll clean it up a bit and share a draft tomorrow.
Here's a first version of a patch for epilogue vectorisation: 6b0f7de4f3347b7b5fc6362e7aa7764465501dfb. It considers VF x UF when checking the profitability of epilogue vectorisation and adjusts the cost model accordingly (the main idea is to use the known upper bound for the trip count of the epilogue). There are also some accompanying microbenchmarks here: 1e1576e6cb1f09965c3e7520c5a51958508df23b. Let me know what you think!
Thanks for working on this @juliannagele!
@juliannagele is OOO till the end of next week unfortunately. @sjoerdmeijer @davemgreen any chance you could check if the commit fixes the regressions that were mentioned earlier here?
Sure, no problem, I was going to suggest and do it anyway. :)
Just to cross-reference, #100385 is the PR to adjust the interleave heuristics.
✅ With the latest revision this PR passed the C/C++ code formatter.
This now includes lowering the minimum epilogue vectorisation factor from 16 to 8 for the Neoverse V2, which mostly recovers the one regression we spotted in cam4_r. A bit of TTI plumbing and a new EpilogueVectorizationMinVF hook are introduced to make this target specific. This patch now depends on #108190: it needs the EpilogueVectorizationMinVF TTI hook introduced there.
Can we add Neoverse V3 too? Otherwise this sounds OK to me.
Sorry Dave, I accidentally dropped that when I merged and reworked my patches. I will put it back in, and add a test too.
Restored the Neoverse V3 and added a test.
Thanks. Assuming all the test failures are related to the first patch, the second one here LGTM.
Thanks Dave, that's right, these failures are from the first patch that this relies on. I will make sure things are green when that lands.