[AArch64] Set MaxInterleaving to 4 for Neoverse V2 and V3 #100385


Merged
merged 1 commit into llvm:main from sjoerdmeijer:interleave4 on Nov 20, 2024

Conversation

sjoerdmeijer
Collaborator

This helps loop-based workloads and benchmarks quite a lot; SPEC INT is unaffected.

@llvmbot
Member

llvmbot commented Jul 24, 2024

@llvm/pr-subscribers-llvm-analysis
@llvm/pr-subscribers-backend-aarch64

@llvm/pr-subscribers-llvm-transforms

Author: Sjoerd Meijer (sjoerdmeijer)

Changes

This helps loop-based workloads and benchmarks quite a lot; SPEC INT is unaffected.


Full diff: https://github.com/llvm/llvm-project/pull/100385.diff

3 Files Affected:

  • (modified) llvm/lib/Target/AArch64/AArch64Subtarget.cpp (+3-1)
  • (modified) llvm/test/Transforms/LoopVectorize/AArch64/interleaving-load-store.ll (+1)
  • (modified) llvm/test/Transforms/LoopVectorize/AArch64/interleaving-reduction.ll (+1)
diff --git a/llvm/lib/Target/AArch64/AArch64Subtarget.cpp b/llvm/lib/Target/AArch64/AArch64Subtarget.cpp
index 32a355fe38f1c..280083ae824cd 100644
--- a/llvm/lib/Target/AArch64/AArch64Subtarget.cpp
+++ b/llvm/lib/Target/AArch64/AArch64Subtarget.cpp
@@ -233,9 +233,11 @@ void AArch64Subtarget::initializeProperties(bool HasMinSize) {
     PrefLoopAlignment = Align(32);
     MaxBytesForLoopAlignment = 16;
     break;
+  case NeoverseV2:
+    MaxInterleaveFactor = 4;
+    LLVM_FALLTHROUGH;
   case NeoverseN2:
   case NeoverseN3:
-  case NeoverseV2:
   case NeoverseV3:
     PrefFunctionAlignment = Align(16);
     PrefLoopAlignment = Align(32);
diff --git a/llvm/test/Transforms/LoopVectorize/AArch64/interleaving-load-store.ll b/llvm/test/Transforms/LoopVectorize/AArch64/interleaving-load-store.ll
index 0e54bd15e5ea5..6a47e0ea34a50 100644
--- a/llvm/test/Transforms/LoopVectorize/AArch64/interleaving-load-store.ll
+++ b/llvm/test/Transforms/LoopVectorize/AArch64/interleaving-load-store.ll
@@ -5,6 +5,7 @@
 ; RUN: opt -passes=loop-vectorize -mtriple=arm64-apple-macos -mcpu=apple-a14 -S %s | FileCheck --check-prefix=INTERLEAVE-4 %s
 ; RUN: opt -passes=loop-vectorize -mtriple=arm64-apple-macos -mcpu=apple-a15 -S %s | FileCheck --check-prefix=INTERLEAVE-4 %s
 ; RUN: opt -passes=loop-vectorize -mtriple=arm64-apple-macos -mcpu=apple-a16 -S %s | FileCheck --check-prefix=INTERLEAVE-4 %s
+; RUN: opt -passes=loop-vectorize -mtriple=arm64 -mcpu=neoverse-v2 -S %s | FileCheck --check-prefix=INTERLEAVE-4 %s
 
 ; Tests for selecting interleave counts for loops with loads and stores.
 
diff --git a/llvm/test/Transforms/LoopVectorize/AArch64/interleaving-reduction.ll b/llvm/test/Transforms/LoopVectorize/AArch64/interleaving-reduction.ll
index 72d528d8748ba..9211cdfaebd15 100644
--- a/llvm/test/Transforms/LoopVectorize/AArch64/interleaving-reduction.ll
+++ b/llvm/test/Transforms/LoopVectorize/AArch64/interleaving-reduction.ll
@@ -5,6 +5,7 @@
 ; RUN: opt -passes=loop-vectorize -mtriple=arm64-apple-macos -mcpu=apple-a14 -S %s | FileCheck --check-prefix=INTERLEAVE-4 %s
 ; RUN: opt -passes=loop-vectorize -mtriple=arm64-apple-macos -mcpu=apple-a15 -S %s | FileCheck --check-prefix=INTERLEAVE-4 %s
 ; RUN: opt -passes=loop-vectorize -mtriple=arm64-apple-macos -mcpu=apple-a16 -S %s | FileCheck --check-prefix=INTERLEAVE-4 %s
+; RUN: opt -passes=loop-vectorize -mtriple=arm64 -mcpu=neoverse-v2 -S %s | FileCheck --check-prefix=INTERLEAVE-4 %s
 
 ; Tests for selecting the interleave count for loops with reductions.
 

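For context: the subtarget value above is what caps the loop vectorizer's interleave count, via the AArch64 TTI hook. A paraphrased sketch, which may differ slightly from the exact upstream code:

unsigned AArch64TTIImpl::getMaxInterleaveFactor(ElementCount VF) {
  // Forwards the per-CPU value set in AArch64Subtarget::initializeProperties,
  // which this patch raises to 4 for Neoverse V2 (and later V3).
  return ST->getMaxInterleaveFactor();
}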
@davemgreen
Collaborator

I agree this makes sense in principle, considering the number of pipelines available. Whenever I have tried it in practice, the performance has not looked great, I believe because it creates loop bodies that are so large that they are entered less often, and you end up with worse performance overall. This can depend heavily on the dynamic trip counts, though, and different domains of code will typically have different expected loop trip counts.
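To make the trip-count concern concrete, here is a scalar C analogue of the shapes involved; the numbers are illustrative, assuming NEON <4 x i32> (VF = 4) and an interleave count of 4:

// VF = 4 (four 32-bit lanes) times IC = 4 means the vector body consumes
// 16 elements per iteration. For n = 15 the vector body below runs zero
// times and all 15 iterations fall into the scalar remainder loop.
void add_arrays(int *dst, const int *src, int n) {
  int i = 0;
  for (; i + 16 <= n; i += 16)      // vector body: conceptually 4 x <4 x i32>
    for (int j = 0; j < 16; ++j)
      dst[i + j] += src[i + j];
  for (; i < n; ++i)                // scalar remainder (epilogue)
    dst[i] += src[i];
}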

Can you give more details on what kind of performance validation you have run?

@sjoerdmeijer
Collaborator Author

sjoerdmeijer commented Jul 24, 2024

Here are some motivating examples from TSVC. The numbers are ratios of "after" to "before" (after / before), i.e. with and without this patch applied; values lower than 1 are improvements. This shows ~2x improvements for the first three cases, and then a few more decent improvements. This is a modified version of TSVC where each kernel is compiled into its own executable to reduce noise; I do not trust the TSVC numbers from the llvm test-suite (but that is a separate story):

kernel after / before
s3111 0.50
s314 0.50
s3113 0.59
s316 0.68
s252 0.70
s452 0.75
s4117 0.77
vif 0.78
s1112 0.79
s2101 0.83
s319 0.83
s312 0.83
s311 0.85
vsumr 0.85
s124 0.87
s243 0.89
s255 0.90
s1251 0.91
s115 0.92
s1281 0.92
s1279 0.92
s471 0.92
s272 0.94
s127 0.95
s453 0.95

The benchmarks that I ran are the llvm test-suite, TSVC, RAJAPerf, and SPEC INT.
I didn't see any regressions. I could double-check RAJAPerf, as I didn't run that many iterations of it.
I can easily test SPEC FP too.

@@ -233,9 +233,11 @@ void AArch64Subtarget::initializeProperties(bool HasMinSize) {
PrefLoopAlignment = Align(32);
MaxBytesForLoopAlignment = 16;
break;
case NeoverseV2:

A comment here to explain why "4" is chosen would be nice.

@sjoerdmeijer
Collaborator Author

Here are the RAJAPerf numbers:

Kernel after / before
Basic_INIT_VIEW1D 0.66
Basic_INIT_VIEW1D_OFFSET 0.70
Basic_NESTED_INIT 0.89
Basic_REDUCE3_INT 0.89
Polybench_JACOBI_2D 0.91
Polybench_FLOYD_WARSHALL 0.92
Basic_MAT_MAT_SHARED 0.94
Apps_FIR 0.94
Basic_MULADDSUB 1.25

There is one regression, which I did not spot before; not sure if it is noise or something else, but let's assume it is real. The geomean is 0.984, so an overall ~1.6% improvement.

TODO:

  • double-check the LLVM test-suite,
  • and add SPEC FP to the mix.

@sjoerdmeijer
Collaborator Author

I have checked SPEC FP (the C/C++ apps), and it is unaffected.

I double-checked the llvm test-suite (MultiSource):

  • I see small improvements in scimark2 and SPASS,
  • and I don't see any regressions.

What do you think, @davemgreen?

@david-arm
Contributor

A lot more vectorisation happens in the Fortran SPEC FP benchmarks, particularly fotonik3d, so if possible it might be good to check those too.

@sjoerdmeijer
Collaborator Author

> A lot more vectorisation happens in the Fortran SPEC FP benchmarks, particularly fotonik3d, so if possible it might be good to check those too.

I have never done this before, but if it is just a matter of building flang, I can certainly give it a try.

@sjoerdmeijer
Collaborator Author

I had some problems with flang-new and building some FP apps, but I did manage to build and run fotonik3d, which showed a 1.3% improvement. Of course, I am not sure whether that is caused directly by this change or by second-order effects, but it is not a regression, which is the most important thing.

@DavidSpickett
Collaborator

DavidSpickett commented Jul 25, 2024

@ceseo do you know if flang is capable of building Fortran SPEC FP properly yet? Would save Sjoerd the hassle if we (Linaro) know it doesn't work yet.

@sjoerdmeijer
Collaborator Author

> @ceseo do you know if flang is capable of building Fortran SPEC FP properly yet? Would save Sjoerd the hassle if we (Linaro) know it doesn't work yet.

I think the compilation of 521.wrf_r hangs.
And for 503.bwaves_r, I see a segmentation fault.

Getting flang-new compiled was a pain too: I tried two different AArch64 platforms and four different compilers, and saw weird behaviour, ranging from GCC erroring:

llvm-project/flang/include/flang/Evaluate/integer.h:839:27: error: writing 1 byte into a region of size 0 [-Werror=stringop-overflow=]
  839 |               product[to] = xy & partMask;
      |               ~~~~~~~~~~~~^~~~~~~~~~~~~~~  

to clang silently crashing or erroring.

Anyway, this would be good to fix, but I don't think it should hold up this patch.

From all the numbers collected, this seems a generally good thing; do you agree, @davemgreen?

@ceseo
Contributor

ceseo commented Jul 25, 2024

> @ceseo do you know if flang is capable of building Fortran SPEC FP properly yet? Would save Sjoerd the hassle if we (Linaro) know it doesn't work yet.

Yes, it can. We even presented comparisons between Flang and gfortran for SPEC CPU 2017 rate at Linaro Connect in May.

@davemgreen
Collaborator

Running flang-new on SPEC should work. It might be worth making sure the others in it are OK too.

I think this makes sense when you look at it from the point of view of the CPU: there are 4 vector pipelines, so it makes sense to start interleaving 4x, and other cores do the same. The other dimension is the code that you are running and the dynamic trip count of the important loops in that code. Certain domains tend to have higher trip counts, like HPC, ML, and certain DSP routines. Benchmarks often have a super-high trip count to make the benchmark meaningful. DSP/image processing can often be 16x16 tiles, and for certain other domains the trip counts are often very low, possibly closer to 1 than anything else.

So I'm not sure I trust the more benchmark-y results as much as the others. I think it makes sense to have a higher maximum interleave factor; it just might not always be best to use it. Having it higher for reductions might be more of a win than in other cases, for example (see the sketch below).
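To illustrate the reduction case: a scalar analogue of what 4x interleaving does to a sum (illustrative sketch only; after vectorization each accumulator below would live in its own vector register, one per pipeline):

// Interleaving a reduction by 4 splits the single loop-carried add chain
// into four independent partial sums, which can issue in parallel on the
// four vector pipelines; they are combined once after the loop.
float sum4(const float *a, int n) {
  float s0 = 0.f, s1 = 0.f, s2 = 0.f, s3 = 0.f;
  int i = 0;
  for (; i + 4 <= n; i += 4) {
    s0 += a[i + 0];
    s1 += a[i + 1];
    s2 += a[i + 2];
    s3 += a[i + 3];
  }
  float s = (s0 + s1) + (s2 + s3);
  for (; i < n; ++i)                // scalar remainder
    s += a[i];
  return s;
}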

I think this is a good idea so long as the performance is OK. Can you add Neoverse V3 to the same switch? I think it should work similarly.

@sjoerdmeijer
Collaborator Author

Thanks for the thoughts and suggestions, Dave.
I am pretty sure about what I saw with flang-new: the 521.wrf_r compilation hangs and 503.bwaves_r segfaults. That's why I tried only fotonik3d, but I will try again and check whether I can compile/run the other Fortran apps.
Sure, I will add it for the V3 too.

@davemgreen
Collaborator

Thanks - there is a cam_r in there too, I think.

@sjoerdmeijer
Collaborator Author

  • I have added the V3,
  • I ran more FP benchmarks: 554.roms and 507.cactuBSSN, for which performance is unaffected.
  • I did try cam_r, but that runs into a segfault. I briefly looked into it, and it looks genuine, but it is not related to this patch, of course.

@kiranchandramohan
Contributor

> Thanks for the thoughts and suggestions, Dave.
> I am pretty sure about what I saw with flang-new: the 521.wrf_r compilation hangs and 503.bwaves_r segfaults. That's why I tried only fotonik3d, but I will try again and check whether I can compile/run the other Fortran apps.
> Sure, I will add it for the V3 too.

All these benchmarks work fine for us. Can you try increasing the stack limit to unlimited (e.g. ulimit -s unlimited)? It should help 503.bwaves_r.
Do you have a setting like the following for 521.wrf_r?

521.wrf_r,621.wrf_s:  #lang='F,C'
   CPORTABILITY  = -DSPEC_CASE_FLAG 
   FPORTABILITY  = -fconvert=big-endian

@sjoerdmeijer
Collaborator Author

> All these benchmarks work fine for us. Can you try increasing the stack limit to unlimited (e.g. ulimit -s unlimited)? It should help 503.bwaves_r. Do you have a setting like the following for 521.wrf_r?
>
>    521.wrf_r,621.wrf_s:  #lang='F,C'
>       CPORTABILITY  = -DSPEC_CASE_FLAG
>       FPORTABILITY  = -fconvert=big-endian

Thanks @kiranchandramohan. Dave also helped me, and I was able to run the Fortran benchmarks.

@sjoerdmeijer
Collaborator Author

> Thanks - there is a cam_r in there too, I think.

Thanks @davemgreen, I have now been able to run that, and I can confirm a 4.5% regression in cam4.
Before, the vector body processed 8 elements with an interleave factor of 2; after interleaving 4x, it processes 16 elements. The vector body is now no longer executed because of the low trip count (< 16), and only the epilogue loop runs. Not only is the vector body not executed, but the function is called a lot and the epilogue is very hot.

I would like to think that these are isolated exceptions, and that interleaving with a maximum of 4 is the sensible and generally beneficial thing to do; cam4 is the one exception among all the codes that I ran.

The way forward is to follow up on this and look into epilogue vectorisation. At the moment, epilogue vectorisation is effectively disabled here, because it only kicks in when the vectorisation factor is 16 or more:

  static cl::opt<unsigned> EpilogueVectorizationMinVF(
      "epilogue-vectorization-minimum-VF", cl::init(16), cl::Hidden,
      cl::desc("Only loops with vectorization factor equal to or larger than "
               "the specified value are considered for epilogue vectorization."));

By lowering this to 8, I have verified that we gain almost all of the performance back for cam4.
I am hoping that lowering this to 8 for the Neoverse V2 is not going to be controversial, and that is what I would like to investigate next if we are happy with the interleaving.
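For illustration, a C analogue of the loop structure this aims for in the cam4-like case (numbers as described above; n = 12 as an example of a low trip count, loop shape is mine):

// Main loop: VF = 4 x IC = 4 strides 16 elements per iteration, so it is
// never entered for n < 16. A VF = 8 vector epilogue still covers 8 of
// the 12 iterations, leaving only 4 for the scalar remainder.
void kernel(float *a, const float *b, int n) {
  int i = 0;
  for (; i + 16 <= n; i += 16)      // main vector loop (skipped for n = 12)
    for (int j = 0; j < 16; ++j)
      a[i + j] += b[i + j];
  for (; i + 8 <= n; i += 8)        // vector epilogue, VF = 8 (runs once)
    for (int j = 0; j < 8; ++j)
      a[i + j] += b[i + j];
  for (; i < n; ++i)                // scalar remainder (4 iterations)
    a[i] += b[i];
}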

WDYT?

@davemgreen
Collaborator

Hi - I think that sounds good. It might cost a bit of code size, but it should hopefully mitigate the perf issues we've seen before. In the long run we should look into predicated epilogue vectorization again; that should be especially good in cases where we can remove the scalar loop entirely.
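A scalar sketch of what predicated (tail-folded) vectorization means here; illustrative only, the real thing uses masked vector instructions (e.g. SVE predicates):

// With predication, a single masked vector loop covers every iteration:
// inactive lanes in the final iteration are simply masked off, so the
// scalar remainder loop disappears entirely.
void kernel_predicated(float *a, const float *b, int n) {
  for (int i = 0; i < n; i += 4) {           // one conceptual vector step
    for (int j = 0; j < 4 && i + j < n; ++j) // lane active only if i+j < n
      a[i + j] += b[i + j];
  }
}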

@fhahn
Contributor

fhahn commented Aug 16, 2024

One issue with interleaving is that epilogue vectorization only considers VF, not VF x UF. There are a number of cases where epilogue vectorization would be beneficial on AArch64 when VF < 16 but UF > 1: for example, a main loop with VF = 8 and UF = 2 consumes 16 elements per iteration and can leave up to 15 iterations for the epilogue, yet the VF-only check (VF >= 16) rejects vectorizing that epilogue. @juliannagele is currently looking into adjusting the cost model for that.

@sjoerdmeijer
Collaborator Author

> One issue with interleaving is that epilogue vectorization only considers VF, not VF x UF. There are a number of cases where epilogue vectorization would be beneficial on AArch64 when VF < 16 but UF > 1. @juliannagele is currently looking into adjusting the cost model for that.

OK, thanks. Is this a fair summary, then: that work is orthogonal to this one, but it will help it and make it even better?

@sjoerdmeijer
Collaborator Author

With different ways forward, are you happy to land this @davemgreen ?

@davemgreen
Collaborator

> One issue with interleaving is that epilogue vectorization only considers VF, not VF x UF. There are a number of cases where epilogue vectorization would be beneficial on AArch64 when VF < 16 but UF > 1. @juliannagele is currently looking into adjusting the cost model for that.

I hadn't realised that we didn't account for the UF already. That sounds like a good thing to fix, thanks for the info. @juliannagele @fhahn, do you have a timeframe for when such a patch will be ready? I would like to avoid the regressions if we can, and would not want to end up relying on a patch that never materializes. Otherwise, perhaps @sjoerdmeijer could increase the limit for cases where the UF is 4, and we can improve it further in the future where needed.

@sjoerdmeijer
Collaborator Author

> I hadn't realised that we didn't account for the UF already. [...] Otherwise, perhaps @sjoerdmeijer could increase the limit for cases where the UF is 4, and we can improve it further in the future where needed.

@davemgreen: in the meantime, I will prepare a patch to create a TTI hook that controls epilogue-vectorization-minimum-VF. I would like to lower it to 8 for the V2, but I don't think it makes sense to do this for all targets.

@david-arm
Contributor

> @davemgreen: in the meantime, I will prepare a patch to create a TTI hook that controls epilogue-vectorization-minimum-VF. I would like to lower it to 8 for the V2, but I don't think it makes sense to do this for all targets.

Do you need to take interleaving into account when lowering the minimum? The reason I ask is that for large loops we may not interleave at all, so you'll have an epilogue even when the main vector loop has IC=1, VF=8. Maybe that's fine, but I can imagine some downsides in terms of increased code size and worse performance for some low trip counts.

@sjoerdmeijer
Collaborator Author

> Do you need to take interleaving into account when lowering the minimum? The reason I ask is that for large loops we may not interleave at all, so you'll have an epilogue even when the main vector loop has IC=1, VF=8. Maybe that's fine, but I can imagine some downsides in terms of increased code size and worse performance for some low trip counts.

Hi @david-arm, thanks for the suggestion, that makes perfect sense. Fixing this properly is going to be a two-step solution, I think: the first step is to make that single magic number 16 target-dependent with a TTI hook, and the second step is to come up with a better heuristic that takes the minimum epilogue VF into account, along with other things like VF x UF. That heuristic could live in a helper function in the vectoriser, or perhaps in the subtarget-specific TTI hook implementation that also sets the minimum epilogue VF. Either way, I think this is a good heuristic to add to the TTI hook that I will start drafting now (a rough sketch of the shape is below).
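A rough sketch of the hook's shape (names and plumbing are my assumption of the draft, not the final upstream API):

// Hypothetical: the vectorizer would ask TTI for the minimum main-loop VF
// at which epilogue vectorization is considered, instead of reading the
// global EpilogueVectorizationMinVF option directly.
unsigned AArch64TTIImpl::getEpilogueVectorizationMinVF() const {
  // Would return 8 on Neoverse V2 (set in initializeProperties), 16 elsewhere.
  return ST->getEpilogueVectorizationMinVF();
}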

@juliannagele
Member

> @juliannagele @fhahn, do you have a timeframe for when such a patch will be ready?

Hi, I've got an initial version of this ready; I'll clean it up a bit and share a draft tomorrow.

@juliannagele
Member

juliannagele commented Aug 23, 2024

Here's a first version of a patch for epilogue vectorisation: 6b0f7de4f3347b7b5fc6362e7aa7764465501dfb. It considers VF x UF when checking the profitability of epilogue vectorisation and adjusts the cost model accordingly; the main idea is to use the known upper bound for the trip count of the epilogue (e.g. with VF x UF = 16 in the main loop, the epilogue runs at most 15 iterations). There are also some accompanying microbenchmarks here: 1e1576e6cb1f09965c3e7520c5a51958508df23b. Let me know what you think!

@sjoerdmeijer
Collaborator Author

> Here's a first version of a patch for epilogue vectorisation: 6b0f7de4f3347b7b5fc6362e7aa7764465501dfb. [...] Let me know what you think!

Thanks for working on this, @juliannagele!
It's probably best if you open a pull request; then we can continue the discussion there.

@fhahn
Contributor

fhahn commented Aug 27, 2024

> Thanks for working on this, @juliannagele! It's probably best if you open a pull request; then we can continue the discussion there.

@juliannagele is OOO till the end of next week, unfortunately. @sjoerdmeijer @davemgreen, any chance you could check if the commit fixes the regressions that were mentioned earlier here?

@sjoerdmeijer
Collaborator Author

> @juliannagele is OOO till the end of next week, unfortunately. @sjoerdmeijer @davemgreen, any chance you could check if the commit fixes the regressions that were mentioned earlier here?

Sure, no problem, I was going to suggest that and do it anyway. :)

@fhahn
Contributor

fhahn commented Sep 11, 2024

Just to cross-reference, #100385 is the PR to adjust the interleave heuristics


github-actions bot commented Oct 8, 2024

✅ With the latest revision this PR passed the C/C++ code formatter.

@sjoerdmeijer
Collaborator Author

This now includes lowering the minimum epilogue vectorisation factor from 16 to 8 for the Neoverse V2, which mostly recovers the one regression we spotted in cam4_r. A bit of TTI plumbing and a new EpilogueVectorizationMinVF hook are introduced to make this target-specific.

This patch now depends on #108190: it needs the multiplier information that is now passed into the profitability calculation:

   bool LoopVectorizationCostModel::isEpilogueVectorizationProfitable(
  -    const ElementCount VF) const {
  +    const ElementCount VF, const unsigned Multiplier) const {

@davemgreen
Collaborator

Can we add Neoverse V3 too? Otherwise this sounds OK to me.

@sjoerdmeijer
Collaborator Author

> Can we add Neoverse V3 too? Otherwise this sounds OK to me.

Sorry Dave, I accidentally dropped that when I merged and reworked my patches; I will put it back in, together with a test.

@sjoerdmeijer
Collaborator Author

Restored the Neoverse V3 and added a test.

@sjoerdmeijer sjoerdmeijer changed the title [AArch64] Set MaxInterleaving to 4 for Neoverse V2 [AArch64] Set MaxInterleaving to 4 for Neoverse V2 and V3 Oct 9, 2024
@davemgreen left a comment
Collaborator

Thanks. Assuming all the test failures are related to the first patch, the second one here LGTM.

@sjoerdmeijer
Collaborator Author

> Thanks. Assuming all the test failures are related to the first patch, the second one here LGTM.

Thanks Dave, that's right: these failures are from the first patch that this relies on. I will make sure things are green when that lands.

@sjoerdmeijer sjoerdmeijer merged commit 9bccf61 into llvm:main Nov 20, 2024
6 of 8 checks passed
@sjoerdmeijer sjoerdmeijer deleted the interleave4 branch November 20, 2024 09:33