Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SLP]Reduce number of alternate instruction, where possible #123360

Conversation

alexey-bataev
Copy link
Member

Patch tries to remove wide alternate operations.
Currently SLP vectorizer emits something like this:

%0 = add i32
%1 = sub i32
%2 = add i32
%3 = sub i32
%4 = add i32
%5 = sub i32
%6 = add i32
%7 = sub i32

transformes to

%v1 = add <8 x i32>
%v2 = sub <8 x i32>
%res = shuffle %v1, %v2, <0, 9, 2, 11, 4, 13, 6, 15>

i.e. half of the results are just unused. This leads to increased
register pressure and potentially doubles number of operations.

Patch introduces SplitVectorize mode, where it splits the operations by
opcodes and produces instead something like this:

%v1 = add <4 x i32>
%v2 = sub <4 x i32>
%res = shuffle %v1, %v2, <0, 4, 1, 5, 2, 6, 3, 7>

It allows to improve the performance by reducing number of ops. Also, it
turns on some other improvements, like improved graph reordering.

-O3+LTO, AVX512
Metric: size..text
Program size..text
results results0 diff
test-suite :: MultiSource/Benchmarks/Prolangs-C/TimberWolfMC/timberwolfmc.test 277800.00 280536.00 1.0%
test-suite :: MultiSource/Benchmarks/FreeBench/pifft/pifft.test 81802.00 82426.00 0.8%
test-suite :: External/SPEC/CINT2006/464.h264ref/464.h264ref.test 790552.00 790952.00 0.1%
test-suite :: MultiSource/Applications/JM/ldecod/ldecod.test 383795.00 383987.00 0.1%
test-suite :: External/SPEC/CINT2017speed/600.perlbench_s/600.perlbench_s.test 2075541.00 2076501.00 0.0%
test-suite :: External/SPEC/CINT2017rate/500.perlbench_r/500.perlbench_r.test 2075541.00 2076501.00 0.0%
test-suite :: MultiSource/Benchmarks/Bullet/bullet.test 312702.00 312766.00 0.0%
test-suite :: External/SPEC/CFP2017rate/526.blender_r/526.blender_r.test 12569783.00 12569751.00 -0.0%
test-suite :: External/SPEC/CFP2017rate/510.parest_r/510.parest_r.test 2049374.00 2049358.00 -0.0%
test-suite :: External/SPEC/CINT2006/400.perlbench/400.perlbench.test 1091836.00 1091772.00 -0.0%
test-suite :: MultiSource/Applications/JM/lencod/lencod.test 852339.00 852211.00 -0.0%
test-suite :: MultiSource/Applications/oggenc/oggenc.test 190651.00 190523.00 -0.1%
test-suite :: MultiSource/Benchmarks/DOE-ProxyApps-C/miniGMG/miniGMG.test 44203.00 44155.00 -0.1%
test-suite :: SingleSource/UnitTests/Vector/AVX512BWVL/Vector-AVX512BWVL-mask_set_bw.test 12997.00 12981.00 -0.1%
test-suite :: External/SPEC/CINT2017speed/625.x264_s/625.x264_s.test 668971.00 658427.00 -1.6%
test-suite :: External/SPEC/CINT2017rate/525.x264_r/525.x264_r.test 668971.00 658427.00 -1.6%

Prolangs-C/TimberWolfMC/timberwolfmc - small variations, some code not
inlined
FreeBench/pifft - extra stores <8 x double> vectorized, some other extra
vectorizations
CINT2006/464.h264ref - some smaller code + changes similar to x264
JM/ldecod - changes similar x264
CINT2017speed/600.perlbench_s
CINT2017rate/500.perlbench_r - significantly compact vector code
Benchmarks/Bullet - small variations
CFP2017rate/526.blender_r - small variations
CFP2017rate/510.parest_r - small variations
CINT2006/400.perlbench - extra vector code
JM/lencod - extra store <16 x i32> and other changes similar x264
Applications/oggenc - extra store <16 x i8>, small variations
DOE-ProxyApps-C/miniGMG - small variations
Vector/AVX512BWVL/Vector-AVX512BWVL-mask_set_bw - better vector code
CINT2017speed/625.x264_s
CINT2017rate/525.x264_r - the number of instructions increased, but
looks like they are more performant. E.g., for function
x264_pixel_satd_8x8, llvm-mca reports better throughput - 84 for the
current version and 59 for the new version.

-O3+LTO, march=rva32u64

CINT2017rate/525.x264_r - similar to x86, extra code in pixel_hadamard_ac
function vectorized, idct4x4dc stopped being vectorized (looks like
issue with shuffles cost)
CINT2006/400.perlbench - better vector code
CINT2006/445.gobmk - some variations in vector code
CINT2006/464.h264ref - extra code vectorized
CINT2017rate/500.perlbench_r - small variations

-O3+LTO, mcpu=sifive-p470

Metric: size..text

Program size..text
results results0 diff
test-suite :: External/SPEC/CINT2006/464.h264ref/464.h264ref.test 587336.00 587668.00 0.1%
test-suite :: MultiSource/Applications/JM/lencod/lencod.test 643308.00 643614.00 0.0%
test-suite :: MultiSource/Applications/d/make_dparser.test 79678.00 79710.00 0.0%
test-suite :: MultiSource/Benchmarks/Bullet/bullet.test 277322.00 277420.00 0.0%
test-suite :: External/SPEC/CINT2006/400.perlbench/400.perlbench.test 933660.00 933682.00 0.0%
test-suite :: External/SPEC/CFP2017rate/526.blender_r/526.blender_r.test 9497722.00 9497682.00 -0.0%
test-suite :: External/SPEC/CINT2017rate/500.perlbench_r/500.perlbench_r.test 1767806.00 1767772.00 -0.0%
test-suite :: External/SPEC/CINT2017speed/600.perlbench_s/600.perlbench_s.test 1767806.00 1767772.00 -0.0%
test-suite :: MultiSource/Benchmarks/MiBench/consumer-lame/consumer-lame.test 148038.00 148024.00 -0.0%
test-suite :: MultiSource/Applications/JM/ldecod/ldecod.test 283036.00 283008.00 -0.0%
test-suite :: MultiSource/Benchmarks/mediabench/g721/g721encode/encode.test 4776.00 4772.00 -0.1%
test-suite :: External/SPEC/CINT2017rate/525.x264_r/525.x264_r.test 540582.00 511772.00 -5.3%
test-suite :: External/SPEC/CINT2017speed/625.x264_s/625.x264_s.test 540582.00 511772.00 -5.3%

CINT2006/464.h264ref - extra vector code in find_sad_16x16
JM/lencod - extra vector code in find_sad_16x16
d/make_dparser - smaller vector code
Benchmarks/Bullet - small variations
CINT2006/400.perlbench - smaller vector code
CFP2017rate/526.blender_r - small variations, extra store <8 x float> in
the loop, extra store <8 x i8> in loop
CINT2017rate/500.perlbench_r
CINT2017speed/600.perlbench_s - small variations
MiBench/consumer-lame - small variations
JM/ldecod - extra vector code
mediabench/g721/g721encode - small variations
CINT2017rate/525.x264_r
CINT2017speed/625.x264_s - reduced number of wide operations and
shuffles, saving the registers, similar to X86, extra code in
pixel_hadamard_ac vectorized, idct4x4dc not vectorized (issue with some
TTI costs)

Created using spr 1.3.5
@llvmbot
Copy link
Member

llvmbot commented Jan 17, 2025

@llvm/pr-subscribers-llvm-transforms

Author: Alexey Bataev (alexey-bataev)

Changes

Patch tries to remove wide alternate operations.
Currently SLP vectorizer emits something like this:

%0 = add i32
%1 = sub i32
%2 = add i32
%3 = sub i32
%4 = add i32
%5 = sub i32
%6 = add i32
%7 = sub i32

transformes to

%v1 = add &lt;8 x i32&gt;
%v2 = sub &lt;8 x i32&gt;
%res = shuffle %v1, %v2, &lt;0, 9, 2, 11, 4, 13, 6, 15&gt;

i.e. half of the results are just unused. This leads to increased
register pressure and potentially doubles number of operations.

Patch introduces SplitVectorize mode, where it splits the operations by
opcodes and produces instead something like this:

%v1 = add &lt;4 x i32&gt;
%v2 = sub &lt;4 x i32&gt;
%res = shuffle %v1, %v2, &lt;0, 4, 1, 5, 2, 6, 3, 7&gt;

It allows to improve the performance by reducing number of ops. Also, it
turns on some other improvements, like improved graph reordering.

-O3+LTO, AVX512
Metric: size..text
Program size..text
results results0 diff
test-suite :: MultiSource/Benchmarks/Prolangs-C/TimberWolfMC/timberwolfmc.test 277800.00 280536.00 1.0%
test-suite :: MultiSource/Benchmarks/FreeBench/pifft/pifft.test 81802.00 82426.00 0.8%
test-suite :: External/SPEC/CINT2006/464.h264ref/464.h264ref.test 790552.00 790952.00 0.1%
test-suite :: MultiSource/Applications/JM/ldecod/ldecod.test 383795.00 383987.00 0.1%
test-suite :: External/SPEC/CINT2017speed/600.perlbench_s/600.perlbench_s.test 2075541.00 2076501.00 0.0%
test-suite :: External/SPEC/CINT2017rate/500.perlbench_r/500.perlbench_r.test 2075541.00 2076501.00 0.0%
test-suite :: MultiSource/Benchmarks/Bullet/bullet.test 312702.00 312766.00 0.0%
test-suite :: External/SPEC/CFP2017rate/526.blender_r/526.blender_r.test 12569783.00 12569751.00 -0.0%
test-suite :: External/SPEC/CFP2017rate/510.parest_r/510.parest_r.test 2049374.00 2049358.00 -0.0%
test-suite :: External/SPEC/CINT2006/400.perlbench/400.perlbench.test 1091836.00 1091772.00 -0.0%
test-suite :: MultiSource/Applications/JM/lencod/lencod.test 852339.00 852211.00 -0.0%
test-suite :: MultiSource/Applications/oggenc/oggenc.test 190651.00 190523.00 -0.1%
test-suite :: MultiSource/Benchmarks/DOE-ProxyApps-C/miniGMG/miniGMG.test 44203.00 44155.00 -0.1%
test-suite :: SingleSource/UnitTests/Vector/AVX512BWVL/Vector-AVX512BWVL-mask_set_bw.test 12997.00 12981.00 -0.1%
test-suite :: External/SPEC/CINT2017speed/625.x264_s/625.x264_s.test 668971.00 658427.00 -1.6%
test-suite :: External/SPEC/CINT2017rate/525.x264_r/525.x264_r.test 668971.00 658427.00 -1.6%

Prolangs-C/TimberWolfMC/timberwolfmc - small variations, some code not
inlined
FreeBench/pifft - extra stores <8 x double> vectorized, some other extra
vectorizations
CINT2006/464.h264ref - some smaller code + changes similar to x264
JM/ldecod - changes similar x264
CINT2017speed/600.perlbench_s
CINT2017rate/500.perlbench_r - significantly compact vector code
Benchmarks/Bullet - small variations
CFP2017rate/526.blender_r - small variations
CFP2017rate/510.parest_r - small variations
CINT2006/400.perlbench - extra vector code
JM/lencod - extra store <16 x i32> and other changes similar x264
Applications/oggenc - extra store <16 x i8>, small variations
DOE-ProxyApps-C/miniGMG - small variations
Vector/AVX512BWVL/Vector-AVX512BWVL-mask_set_bw - better vector code
CINT2017speed/625.x264_s
CINT2017rate/525.x264_r - the number of instructions increased, but
looks like they are more performant. E.g., for function
x264_pixel_satd_8x8, llvm-mca reports better throughput - 84 for the
current version and 59 for the new version.

-O3+LTO, march=rva32u64

CINT2017rate/525.x264_r - similar to x86, extra code in pixel_hadamard_ac
function vectorized, idct4x4dc stopped being vectorized (looks like
issue with shuffles cost)
CINT2006/400.perlbench - better vector code
CINT2006/445.gobmk - some variations in vector code
CINT2006/464.h264ref - extra code vectorized
CINT2017rate/500.perlbench_r - small variations

-O3+LTO, mcpu=sifive-p470

Metric: size..text

Program size..text
results results0 diff
test-suite :: External/SPEC/CINT2006/464.h264ref/464.h264ref.test 587336.00 587668.00 0.1%
test-suite :: MultiSource/Applications/JM/lencod/lencod.test 643308.00 643614.00 0.0%
test-suite :: MultiSource/Applications/d/make_dparser.test 79678.00 79710.00 0.0%
test-suite :: MultiSource/Benchmarks/Bullet/bullet.test 277322.00 277420.00 0.0%
test-suite :: External/SPEC/CINT2006/400.perlbench/400.perlbench.test 933660.00 933682.00 0.0%
test-suite :: External/SPEC/CFP2017rate/526.blender_r/526.blender_r.test 9497722.00 9497682.00 -0.0%
test-suite :: External/SPEC/CINT2017rate/500.perlbench_r/500.perlbench_r.test 1767806.00 1767772.00 -0.0%
test-suite :: External/SPEC/CINT2017speed/600.perlbench_s/600.perlbench_s.test 1767806.00 1767772.00 -0.0%
test-suite :: MultiSource/Benchmarks/MiBench/consumer-lame/consumer-lame.test 148038.00 148024.00 -0.0%
test-suite :: MultiSource/Applications/JM/ldecod/ldecod.test 283036.00 283008.00 -0.0%
test-suite :: MultiSource/Benchmarks/mediabench/g721/g721encode/encode.test 4776.00 4772.00 -0.1%
test-suite :: External/SPEC/CINT2017rate/525.x264_r/525.x264_r.test 540582.00 511772.00 -5.3%
test-suite :: External/SPEC/CINT2017speed/625.x264_s/625.x264_s.test 540582.00 511772.00 -5.3%

CINT2006/464.h264ref - extra vector code in find_sad_16x16
JM/lencod - extra vector code in find_sad_16x16
d/make_dparser - smaller vector code
Benchmarks/Bullet - small variations
CINT2006/400.perlbench - smaller vector code
CFP2017rate/526.blender_r - small variations, extra store <8 x float> in
the loop, extra store <8 x i8> in loop
CINT2017rate/500.perlbench_r
CINT2017speed/600.perlbench_s - small variations
MiBench/consumer-lame - small variations
JM/ldecod - extra vector code
mediabench/g721/g721encode - small variations
CINT2017rate/525.x264_r
CINT2017speed/625.x264_s - reduced number of wide operations and
shuffles, saving the registers, similar to X86, extra code in
pixel_hadamard_ac vectorized, idct4x4dc not vectorized (issue with some
TTI costs)


Patch is 243.92 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/123360.diff

21 Files Affected:

  • (modified) llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp (+540-49)
  • (modified) llvm/test/Transforms/PhaseOrdering/AArch64/slpordering.ll (+102-76)
  • (modified) llvm/test/Transforms/SLPVectorizer/AArch64/gather-with-minbith-user.ll (+5-1)
  • (modified) llvm/test/Transforms/SLPVectorizer/AArch64/loadorder.ll (+96-64)
  • (modified) llvm/test/Transforms/SLPVectorizer/AArch64/tsc-s116.ll (+5-8)
  • (modified) llvm/test/Transforms/SLPVectorizer/RISCV/complex-loads.ll (+232-616)
  • (modified) llvm/test/Transforms/SLPVectorizer/RISCV/reductions.ll (+2-4)
  • (modified) llvm/test/Transforms/SLPVectorizer/X86/alternate-cast-inseltpoison.ll (+54-20)
  • (modified) llvm/test/Transforms/SLPVectorizer/X86/alternate-cast.ll (+54-20)
  • (modified) llvm/test/Transforms/SLPVectorizer/X86/alternate-fp-inseltpoison.ll (+64-16)
  • (modified) llvm/test/Transforms/SLPVectorizer/X86/alternate-fp.ll (+64-16)
  • (modified) llvm/test/Transforms/SLPVectorizer/X86/alternate-int-inseltpoison.ll (+86-20)
  • (modified) llvm/test/Transforms/SLPVectorizer/X86/alternate-int.ll (+86-20)
  • (modified) llvm/test/Transforms/SLPVectorizer/X86/buildvector-schedule-for-subvector.ll (+3-1)
  • (modified) llvm/test/Transforms/SLPVectorizer/X86/long-full-reg-stores.ll (+3-3)
  • (modified) llvm/test/Transforms/SLPVectorizer/X86/matched-shuffled-entries.ll (+16-13)
  • (modified) llvm/test/Transforms/SLPVectorizer/X86/non-load-reduced-as-part-of-bv.ll (+5-5)
  • (modified) llvm/test/Transforms/SLPVectorizer/X86/scatter-vectorize-reused-pointer.ll (+2-8)
  • (modified) llvm/test/Transforms/SLPVectorizer/X86/splat-score-adjustment.ll (+9-13)
  • (modified) llvm/test/Transforms/SLPVectorizer/addsub.ll (+4-8)
  • (modified) llvm/test/Transforms/SLPVectorizer/resized-alt-shuffle-after-minbw.ll (+18-17)
diff --git a/llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp b/llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
index b0b8f8249d657b..59063d6b4c9bc4 100644
--- a/llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
+++ b/llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
@@ -1461,6 +1461,7 @@ class BoUpSLP {
     VectorizableTree.clear();
     ScalarToTreeEntry.clear();
     MultiNodeScalars.clear();
+    ScalarsInSplitNodes.clear();
     MustGather.clear();
     NonScheduledFirst.clear();
     EntryToLastInstruction.clear();
@@ -3196,12 +3197,30 @@ class BoUpSLP {
 
     /// \returns Common mask for reorder indices and reused scalars.
     SmallVector<int> getCommonMask() const {
+      if (State == TreeEntry::SplitVectorize)
+        return {};
       SmallVector<int> Mask;
       inversePermutation(ReorderIndices, Mask);
       ::addMask(Mask, ReuseShuffleIndices);
       return Mask;
     }
 
+    /// \returns The mask for split nodes.
+    SmallVector<int> getSplitMask() const {
+      assert(State == TreeEntry::SplitVectorize && !ReorderIndices.empty() &&
+             "Expected only split vectorize node.");
+      SmallVector<int> Mask(getVectorFactor(), PoisonMaskElem);
+      unsigned CommonVF = std::max<unsigned>(
+          CombinedEntriesWithIndices.back().second,
+          Scalars.size() - CombinedEntriesWithIndices.back().second);
+      for (auto [Idx, I] : enumerate(ReorderIndices))
+        Mask[I] =
+            Idx + (Idx >= CombinedEntriesWithIndices.back().second
+                       ? CommonVF - CombinedEntriesWithIndices.back().second
+                       : 0);
+      return Mask;
+    }
+
     /// \returns true if the scalars in VL are equal to this entry.
     bool isSame(ArrayRef<Value *> VL) const {
       auto &&IsSame = [VL](ArrayRef<Value *> Scalars, ArrayRef<int> Mask) {
@@ -3293,6 +3312,8 @@ class BoUpSLP {
                          ///< complex node like select/cmp to minmax, mul/add to
                          ///< fma, etc. Must be used for the following nodes in
                          ///< the pattern, not the very first one.
+      SplitVectorize,    ///< Splits the node into 2 subnodes, vectorizes them
+                         ///< independently and then combines back.
     };
     EntryState State;
 
@@ -3324,7 +3345,7 @@ class BoUpSLP {
     /// The index of this treeEntry in VectorizableTree.
     unsigned Idx = 0;
 
-    /// For gather/buildvector/alt opcode (TODO) nodes, which are combined from
+    /// For gather/buildvector/alt opcode nodes, which are combined from
     /// other nodes as a series of insertvector instructions.
     SmallVector<std::pair<unsigned, unsigned>, 2> CombinedEntriesWithIndices;
 
@@ -3471,8 +3492,9 @@ class BoUpSLP {
                           SmallVectorImpl<Value *> *AltScalars = nullptr) const;
 
     /// Return true if this is a non-power-of-2 node.
-    bool isNonPowOf2Vec() const {
-      bool IsNonPowerOf2 = !has_single_bit(Scalars.size());
+    bool isNonPowOf2Vec(const TargetTransformInfo &TTI) const {
+      bool IsNonPowerOf2 = !hasFullVectorsOrPowerOf2(
+          TTI, getValueType(Scalars.front()), Scalars.size());
       return IsNonPowerOf2;
     }
 
@@ -3530,6 +3552,9 @@ class BoUpSLP {
       case CombinedVectorize:
         dbgs() << "CombinedVectorize\n";
         break;
+      case SplitVectorize:
+        dbgs() << "SplitVectorize\n";
+        break;
       }
       dbgs() << "MainOp: ";
       if (MainOp)
@@ -3611,8 +3636,10 @@ class BoUpSLP {
                           const EdgeInfo &UserTreeIdx,
                           ArrayRef<int> ReuseShuffleIndices = {},
                           ArrayRef<unsigned> ReorderIndices = {}) {
-    assert(((!Bundle && EntryState == TreeEntry::NeedToGather) ||
-            (Bundle && EntryState != TreeEntry::NeedToGather)) &&
+    assert(((!Bundle && (EntryState == TreeEntry::NeedToGather ||
+                         EntryState == TreeEntry::SplitVectorize)) ||
+            (Bundle && EntryState != TreeEntry::NeedToGather &&
+             EntryState != TreeEntry::SplitVectorize)) &&
            "Need to vectorize gather entry?");
     // Gathered loads still gathered? Do not create entry, use the original one.
     if (GatheredLoadsEntriesFirst.has_value() &&
@@ -3646,12 +3673,29 @@ class BoUpSLP {
                   return VL[Idx];
                 });
       InstructionsState S = getSameOpcode(Last->Scalars, *TLI);
-      if (S)
+      if (S) {
         Last->setOperations(S);
+      } else if (EntryState == TreeEntry::SplitVectorize) {
+        auto *MainOp =
+            cast<Instruction>(*find_if(Last->Scalars, IsaPred<Instruction>));
+        auto *AltOp = cast<Instruction>(*find_if(Last->Scalars, [=](Value *V) {
+          auto *I = dyn_cast<Instruction>(V);
+          return I && I->getOpcode() != MainOp->getOpcode();
+        }));
+        Last->setOperations(InstructionsState(MainOp, AltOp));
+        for (Value *V : VL) {
+          auto *I = dyn_cast<Instruction>(V);
+          if (!I)
+            continue;
+          ScalarsInSplitNodes.try_emplace(I, Last);
+        }
+      }
       Last->ReorderIndices.append(ReorderIndices.begin(), ReorderIndices.end());
     }
-    if (!Last->isGather()) {
+    if (!Last->isGather() && Last->State != TreeEntry::SplitVectorize) {
       for (Value *V : VL) {
+        if (isa<PoisonValue>(V))
+          continue;
         const TreeEntry *TE = getTreeEntry(V);
         assert((!TE || TE == Last || doesNotNeedToBeScheduled(V)) &&
                "Scalar already in tree!");
@@ -3679,7 +3723,7 @@ class BoUpSLP {
         }
       }
       assert(!BundleMember && "Bundle and VL out of sync");
-    } else {
+    } else if (Last->isGather()) {
       // Build a map for gathered scalars to the nodes where they are used.
       bool AllConstsOrCasts = true;
       for (Value *V : VL)
@@ -3745,6 +3789,9 @@ class BoUpSLP {
   /// nodes.
   SmallDenseMap<Value *, SmallVector<TreeEntry *>> MultiNodeScalars;
 
+  /// Scalars, used in split vectorize nodes.
+  SmallDenseMap<Value *, TreeEntry *> ScalarsInSplitNodes;
+
   /// Maps a value to the proposed vectorizable size.
   SmallDenseMap<Value *, unsigned> InstrElementSize;
 
@@ -5648,12 +5695,14 @@ BoUpSLP::getReorderingData(const TreeEntry &TE, bool TopToBottom) {
              }) &&
       (TE.ReorderIndices.empty() || isReverseOrder(TE.ReorderIndices)))
     return std::nullopt;
-  if ((TE.State == TreeEntry::Vectorize ||
-       TE.State == TreeEntry::StridedVectorize) &&
-      (isa<LoadInst, ExtractElementInst, ExtractValueInst>(TE.getMainOp()) ||
-       (TopToBottom && isa<StoreInst, InsertElementInst>(TE.getMainOp())))) {
-    assert(!TE.isAltShuffle() && "Alternate instructions are only supported by "
-                                 "BinaryOperator and CastInst.");
+  if (TE.State == TreeEntry::SplitVectorize ||
+      ((TE.State == TreeEntry::Vectorize ||
+        TE.State == TreeEntry::StridedVectorize) &&
+       (isa<LoadInst, ExtractElementInst, ExtractValueInst>(TE.getMainOp()) ||
+        (TopToBottom && isa<StoreInst, InsertElementInst>(TE.getMainOp()))))) {
+    assert((TE.State == TreeEntry::SplitVectorize || !TE.isAltShuffle()) &&
+           "Alternate instructions are only supported by "
+           "BinaryOperator and CastInst.");
     return TE.ReorderIndices;
   }
   if (TE.State == TreeEntry::Vectorize && TE.getOpcode() == Instruction::PHI) {
@@ -5938,7 +5987,7 @@ void BoUpSLP::reorderTopToBottom() {
     // Patterns like [fadd,fsub] can be combined into a single instruction in
     // x86. Reordering them into [fsub,fadd] blocks this pattern. So we need
     // to take into account their order when looking for the most used order.
-    if (TE->isAltShuffle()) {
+    if (TE->isAltShuffle() && TE->State != TreeEntry::SplitVectorize) {
       VectorType *VecTy =
           getWidenedType(TE->Scalars[0]->getType(), TE->Scalars.size());
       unsigned Opcode0 = TE->getOpcode();
@@ -5976,7 +6025,8 @@ void BoUpSLP::reorderTopToBottom() {
       }
       VFToOrderedEntries[TE->getVectorFactor()].insert(TE.get());
       if (!(TE->State == TreeEntry::Vectorize ||
-            TE->State == TreeEntry::StridedVectorize) ||
+            TE->State == TreeEntry::StridedVectorize ||
+            TE->State == TreeEntry::SplitVectorize) ||
           !TE->ReuseShuffleIndices.empty())
         GathersToOrders.try_emplace(TE.get(), *CurrentOrder);
       if (TE->State == TreeEntry::Vectorize &&
@@ -5985,6 +6035,30 @@ void BoUpSLP::reorderTopToBottom() {
     }
   });
 
+  auto UpdateSplitUserNode = [&](TreeEntry *UserTE, unsigned Idx,
+                                 ArrayRef<int> Mask, ArrayRef<int> MaskOrder) {
+    assert(UserTE->State == TreeEntry::SplitVectorize &&
+           "Expected split user node.");
+    SmallVector<int> NewMask(UserTE->getVectorFactor());
+    SmallVector<int> NewMaskOrder(UserTE->getVectorFactor());
+    std::iota(NewMask.begin(), NewMask.end(), 0);
+    std::iota(NewMaskOrder.begin(), NewMaskOrder.end(), 0);
+    if (Idx == 0) {
+      copy(Mask, NewMask.begin());
+      copy(MaskOrder, NewMaskOrder.begin());
+    } else {
+      assert(Idx == 1 && "Expected either 0 or 1 index.");
+      unsigned Offset = UserTE->CombinedEntriesWithIndices.back().second;
+      for (unsigned I : seq<unsigned>(Mask.size())) {
+        NewMask[I + Offset] = Mask[I] + Offset;
+        NewMaskOrder[I + Offset] = MaskOrder[I] + Offset;
+      }
+    }
+    reorderScalars(UserTE->Scalars, NewMask);
+    reorderOrder(UserTE->ReorderIndices, NewMaskOrder, /*BottomOrder=*/true);
+    if (isIdentityOrder(UserTE->ReorderIndices))
+      UserTE->ReorderIndices.clear();
+  };
   // Reorder the graph nodes according to their vectorization factor.
   for (unsigned VF = VectorizableTree.front()->getVectorFactor();
        !VFToOrderedEntries.empty() && VF > 1; VF -= 2 - (VF & 1U)) {
@@ -6007,7 +6081,8 @@ void BoUpSLP::reorderTopToBottom() {
     for (const TreeEntry *OpTE : OrderedEntries) {
       // No need to reorder this nodes, still need to extend and to use shuffle,
       // just need to merge reordering shuffle and the reuse shuffle.
-      if (!OpTE->ReuseShuffleIndices.empty() && !GathersToOrders.count(OpTE))
+      if (!OpTE->ReuseShuffleIndices.empty() && !GathersToOrders.count(OpTE) &&
+          OpTE->State != TreeEntry::SplitVectorize)
         continue;
       // Count number of orders uses.
       const auto &Order = [OpTE, &GathersToOrders, &AltShufflesToOrders,
@@ -6114,6 +6189,8 @@ void BoUpSLP::reorderTopToBottom() {
       // Just do the reordering for the nodes with the given VF.
       if (TE->Scalars.size() != VF) {
         if (TE->ReuseShuffleIndices.size() == VF) {
+          assert(TE->State != TreeEntry::SplitVectorize &&
+                 "Split vectorized not expected.");
           // Need to reorder the reuses masks of the operands with smaller VF to
           // be able to find the match between the graph nodes and scalar
           // operands of the given node during vectorization/cost estimation.
@@ -6121,7 +6198,8 @@ void BoUpSLP::reorderTopToBottom() {
                         [VF, &TE](const EdgeInfo &EI) {
                           return EI.UserTE->Scalars.size() == VF ||
                                  EI.UserTE->Scalars.size() ==
-                                     TE->Scalars.size();
+                                     TE->Scalars.size() ||
+                                 EI.UserTE->State == TreeEntry::SplitVectorize;
                         }) &&
                  "All users must be of VF size.");
           if (SLPReVec) {
@@ -6144,19 +6222,29 @@ void BoUpSLP::reorderTopToBottom() {
           // Update ordering of the operands with the smaller VF than the given
           // one.
           reorderNodeWithReuses(*TE, Mask);
+          // Update orders in user split vectorize nodes.
+          for (EdgeInfo &EI : TE->UserTreeIndices) {
+            if (EI.UserTE->State != TreeEntry::SplitVectorize)
+              continue;
+            UpdateSplitUserNode(EI.UserTE, EI.EdgeIdx, Mask, MaskOrder);
+          }
         }
         continue;
       }
-      if ((TE->State == TreeEntry::Vectorize ||
-           TE->State == TreeEntry::StridedVectorize) &&
-          (isa<ExtractElementInst, ExtractValueInst, LoadInst, StoreInst,
-               InsertElementInst>(TE->getMainOp()) ||
-           (SLPReVec && isa<ShuffleVectorInst>(TE->getMainOp())))) {
-        assert(!TE->isAltShuffle() &&
-               "Alternate instructions are only supported by BinaryOperator "
-               "and CastInst.");
-        // Build correct orders for extract{element,value}, loads and
-        // stores.
+      if ((TE->State == TreeEntry::SplitVectorize &&
+           TE->ReuseShuffleIndices.empty()) ||
+          ((TE->State == TreeEntry::Vectorize ||
+            TE->State == TreeEntry::StridedVectorize) &&
+           (isa<ExtractElementInst, ExtractValueInst, LoadInst, StoreInst,
+                InsertElementInst>(TE->getMainOp()) ||
+            (SLPReVec && isa<ShuffleVectorInst>(TE->getMainOp()))))) {
+        assert(
+            (!TE->isAltShuffle() || (TE->State == TreeEntry::SplitVectorize &&
+                                     TE->ReuseShuffleIndices.empty())) &&
+            "Alternate instructions are only supported by BinaryOperator "
+            "and CastInst.");
+        // Build correct orders for extract{element,value}, loads,
+        // stores and alternate (split) nodes.
         reorderOrder(TE->ReorderIndices, Mask);
         if (isa<InsertElementInst, StoreInst>(TE->getMainOp()))
           TE->reorderOperands(Mask);
@@ -6177,6 +6265,12 @@ void BoUpSLP::reorderTopToBottom() {
         addMask(NewReuses, TE->ReuseShuffleIndices);
         TE->ReuseShuffleIndices.swap(NewReuses);
       }
+      // Update orders in user split vectorize nodes.
+      for (EdgeInfo &EI : TE->UserTreeIndices) {
+        if (EI.UserTE->State != TreeEntry::SplitVectorize)
+          continue;
+        UpdateSplitUserNode(EI.UserTE, EI.EdgeIdx, Mask, MaskOrder);
+      }
     }
   }
 }
@@ -6189,7 +6283,8 @@ bool BoUpSLP::canReorderOperands(
     if (any_of(Edges, [I](const std::pair<unsigned, TreeEntry *> &OpData) {
           return OpData.first == I &&
                  (OpData.second->State == TreeEntry::Vectorize ||
-                  OpData.second->State == TreeEntry::StridedVectorize);
+                  OpData.second->State == TreeEntry::StridedVectorize ||
+                  OpData.second->State == TreeEntry::SplitVectorize);
         }))
       continue;
     if (TreeEntry *TE = getVectorizedOperand(UserTE, I)) {
@@ -6207,6 +6302,7 @@ bool BoUpSLP::canReorderOperands(
       // node, just reorder reuses mask.
       if (TE->State != TreeEntry::Vectorize &&
           TE->State != TreeEntry::StridedVectorize &&
+          TE->State != TreeEntry::SplitVectorize &&
           TE->ReuseShuffleIndices.empty() && TE->ReorderIndices.empty())
         GatherOps.push_back(TE);
       continue;
@@ -6216,6 +6312,7 @@ bool BoUpSLP::canReorderOperands(
                  [&Gather, UserTE, I](TreeEntry *TE) {
                    assert(TE->State != TreeEntry::Vectorize &&
                           TE->State != TreeEntry::StridedVectorize &&
+                          TE->State != TreeEntry::SplitVectorize &&
                           "Only non-vectorized nodes are expected.");
                    if (any_of(TE->UserTreeIndices,
                               [UserTE, I](const EdgeInfo &EI) {
@@ -6245,13 +6342,15 @@ void BoUpSLP::reorderBottomToTop(bool IgnoreReorder) {
   SmallVector<TreeEntry *> NonVectorized;
   for (const std::unique_ptr<TreeEntry> &TE : VectorizableTree) {
     if (TE->State != TreeEntry::Vectorize &&
-        TE->State != TreeEntry::StridedVectorize)
+        TE->State != TreeEntry::StridedVectorize &&
+        TE->State != TreeEntry::SplitVectorize)
       NonVectorized.push_back(TE.get());
     if (std::optional<OrdersType> CurrentOrder =
             getReorderingData(*TE, /*TopToBottom=*/false)) {
       OrderedEntries.insert(TE.get());
       if (!(TE->State == TreeEntry::Vectorize ||
-            TE->State == TreeEntry::StridedVectorize) ||
+            TE->State == TreeEntry::StridedVectorize ||
+            TE->State == TreeEntry::SplitVectorize) ||
           !TE->ReuseShuffleIndices.empty())
         GathersToOrders.insert(TE.get());
     }
@@ -6270,6 +6369,7 @@ void BoUpSLP::reorderBottomToTop(bool IgnoreReorder) {
     for (TreeEntry *TE : OrderedEntries) {
       if (!(TE->State == TreeEntry::Vectorize ||
             TE->State == TreeEntry::StridedVectorize ||
+            TE->State == TreeEntry::SplitVectorize ||
             (TE->isGather() && GathersToOrders.contains(TE))) ||
           TE->UserTreeIndices.empty() || !TE->ReuseShuffleIndices.empty() ||
           !all_of(drop_begin(TE->UserTreeIndices),
@@ -6295,6 +6395,51 @@ void BoUpSLP::reorderBottomToTop(bool IgnoreReorder) {
       return Data1.first->Idx > Data2.first->Idx;
     });
     for (auto &Data : UsersVec) {
+      if (Data.first->State == TreeEntry::SplitVectorize) {
+        assert(
+            Data.second.size() <= 2 &&
+            "Expected not greater than 2 operands for split vectorize node.");
+        if (any_of(Data.second, [](const auto &Op) {
+              return Op.second->UserTreeIndices.size() != 1;
+            }))
+          continue;
+        // Update orders in user split vectorize nodes.
+        for (const auto &P : Data.first->CombinedEntriesWithIndices) {
+          TreeEntry &OpTE = *VectorizableTree[P.first].get();
+          if (OpTE.isGather() || OpTE.ReorderIndices.empty())
+            continue;
+          SmallVector<int> Mask;
+          inversePermutation(OpTE.ReorderIndices, Mask);
+          SmallVector<int> MaskOrder(OpTE.ReorderIndices.size(),
+                                     PoisonMaskElem);
+          unsigned E = OpTE.ReorderIndices.size();
+          transform(OpTE.ReorderIndices, MaskOrder.begin(), [E](unsigned I) {
+            return I < E ? static_cast<int>(I) : PoisonMaskElem;
+          });
+          SmallVector<int> NewMask(Data.first->getVectorFactor());
+          SmallVector<int> NewMaskOrder(Data.first->getVectorFactor());
+          std::iota(NewMask.begin(), NewMask.end(), 0);
+          std::iota(NewMaskOrder.begin(), NewMaskOrder.end(), 0);
+          if (P.second == 0) {
+            copy(Mask, NewMask.begin());
+            copy(MaskOrder, NewMaskOrder.begin());
+          } else {
+            unsigned Offset = P.second;
+            for (unsigned I : seq<unsigned>(Mask.size())) {
+              NewMask[I + Offset] = Mask[I] + Offset;
+              NewMaskOrder[I + Offset] = MaskOrder[I] + Offset;
+            }
+          }
+          reorderScalars(Data.first->Scalars, NewMask);
+          reorderOrder(Data.first->ReorderIndices, NewMaskOrder,
+                       /*BottomOrder=*/true);
+          if (isIdentityOrder(Data.first->ReorderIndices))
+            Data.first->ReorderIndices.clear();
+          // Clear ordering of the operand.
+          OpTE.ReorderIndices.clear();
+        }
+        continue;
+      }
       // Check that operands are used only in the User node.
       SmallVector<TreeEntry *> GatherOps;
       if (!canReorderOperands(Data.first, Data.second, NonVectorized,
@@ -6451,6 +6596,7 @@ void BoUpSLP::reorderBottomToTop(bool IgnoreReorder) {
         // Gathers are processed separately.
         if (TE->State != TreeEntry::Vectorize &&
             TE->State != TreeEntry::StridedVectorize &&
+            TE->State != TreeEntry::SplitVectorize &&
             (TE->State != TreeEntry::ScatterVectorize ||
              TE->ReorderIndices.empty()))
           continue;
@@ -6521,7 +6667,7 @@ void BoUpSLP::buildExternalUses(
     TreeEntry *Entry = TEPtr.get();
 
     // No need to handle users of gathered values.
-    if (Entry->isGather())
+    if (Entry->isGather() || Entry->State == TreeEntry::SplitVectorize)
       continue;
 
     // For each lane:
@@ -8227,6 +8373,142 @@ void BoUpSLP::buildTree_rec(ArrayRef<Value *> VL, unsigned Depth,
     return;
   }
 
+  // Tries to build split node.
+  auto TrySplitNode = [&, &TTI = *TTI](unsigned SmallNodeSize,
+                                       const InstructionsState &LocalState) {
+    if (VL.size() <= SmallNodeSize)
+      return false;
+
+    // Any value is used in ...
[truncated]

@llvmbot
Copy link
Member

llvmbot commented Jan 17, 2025

@llvm/pr-subscribers-vectorizers

Author: Alexey Bataev (alexey-bataev)

Changes

Patch tries to remove wide alternate operations.
Currently SLP vectorizer emits something like this:

%0 = add i32
%1 = sub i32
%2 = add i32
%3 = sub i32
%4 = add i32
%5 = sub i32
%6 = add i32
%7 = sub i32

transformes to

%v1 = add &lt;8 x i32&gt;
%v2 = sub &lt;8 x i32&gt;
%res = shuffle %v1, %v2, &lt;0, 9, 2, 11, 4, 13, 6, 15&gt;

i.e. half of the results are just unused. This leads to increased
register pressure and potentially doubles number of operations.

Patch introduces SplitVectorize mode, where it splits the operations by
opcodes and produces instead something like this:

%v1 = add &lt;4 x i32&gt;
%v2 = sub &lt;4 x i32&gt;
%res = shuffle %v1, %v2, &lt;0, 4, 1, 5, 2, 6, 3, 7&gt;

It allows to improve the performance by reducing number of ops. Also, it
turns on some other improvements, like improved graph reordering.

-O3+LTO, AVX512
Metric: size..text
Program size..text
results results0 diff
test-suite :: MultiSource/Benchmarks/Prolangs-C/TimberWolfMC/timberwolfmc.test 277800.00 280536.00 1.0%
test-suite :: MultiSource/Benchmarks/FreeBench/pifft/pifft.test 81802.00 82426.00 0.8%
test-suite :: External/SPEC/CINT2006/464.h264ref/464.h264ref.test 790552.00 790952.00 0.1%
test-suite :: MultiSource/Applications/JM/ldecod/ldecod.test 383795.00 383987.00 0.1%
test-suite :: External/SPEC/CINT2017speed/600.perlbench_s/600.perlbench_s.test 2075541.00 2076501.00 0.0%
test-suite :: External/SPEC/CINT2017rate/500.perlbench_r/500.perlbench_r.test 2075541.00 2076501.00 0.0%
test-suite :: MultiSource/Benchmarks/Bullet/bullet.test 312702.00 312766.00 0.0%
test-suite :: External/SPEC/CFP2017rate/526.blender_r/526.blender_r.test 12569783.00 12569751.00 -0.0%
test-suite :: External/SPEC/CFP2017rate/510.parest_r/510.parest_r.test 2049374.00 2049358.00 -0.0%
test-suite :: External/SPEC/CINT2006/400.perlbench/400.perlbench.test 1091836.00 1091772.00 -0.0%
test-suite :: MultiSource/Applications/JM/lencod/lencod.test 852339.00 852211.00 -0.0%
test-suite :: MultiSource/Applications/oggenc/oggenc.test 190651.00 190523.00 -0.1%
test-suite :: MultiSource/Benchmarks/DOE-ProxyApps-C/miniGMG/miniGMG.test 44203.00 44155.00 -0.1%
test-suite :: SingleSource/UnitTests/Vector/AVX512BWVL/Vector-AVX512BWVL-mask_set_bw.test 12997.00 12981.00 -0.1%
test-suite :: External/SPEC/CINT2017speed/625.x264_s/625.x264_s.test 668971.00 658427.00 -1.6%
test-suite :: External/SPEC/CINT2017rate/525.x264_r/525.x264_r.test 668971.00 658427.00 -1.6%

Prolangs-C/TimberWolfMC/timberwolfmc - small variations, some code not
inlined
FreeBench/pifft - extra stores <8 x double> vectorized, some other extra
vectorizations
CINT2006/464.h264ref - some smaller code + changes similar to x264
JM/ldecod - changes similar x264
CINT2017speed/600.perlbench_s
CINT2017rate/500.perlbench_r - significantly compact vector code
Benchmarks/Bullet - small variations
CFP2017rate/526.blender_r - small variations
CFP2017rate/510.parest_r - small variations
CINT2006/400.perlbench - extra vector code
JM/lencod - extra store <16 x i32> and other changes similar x264
Applications/oggenc - extra store <16 x i8>, small variations
DOE-ProxyApps-C/miniGMG - small variations
Vector/AVX512BWVL/Vector-AVX512BWVL-mask_set_bw - better vector code
CINT2017speed/625.x264_s
CINT2017rate/525.x264_r - the number of instructions increased, but
looks like they are more performant. E.g., for function
x264_pixel_satd_8x8, llvm-mca reports better throughput - 84 for the
current version and 59 for the new version.

-O3+LTO, march=rva32u64

CINT2017rate/525.x264_r - similar to x86, extra code in pixel_hadamard_ac
function vectorized, idct4x4dc stopped being vectorized (looks like
issue with shuffles cost)
CINT2006/400.perlbench - better vector code
CINT2006/445.gobmk - some variations in vector code
CINT2006/464.h264ref - extra code vectorized
CINT2017rate/500.perlbench_r - small variations

-O3+LTO, mcpu=sifive-p470

Metric: size..text

Program size..text
results results0 diff
test-suite :: External/SPEC/CINT2006/464.h264ref/464.h264ref.test 587336.00 587668.00 0.1%
test-suite :: MultiSource/Applications/JM/lencod/lencod.test 643308.00 643614.00 0.0%
test-suite :: MultiSource/Applications/d/make_dparser.test 79678.00 79710.00 0.0%
test-suite :: MultiSource/Benchmarks/Bullet/bullet.test 277322.00 277420.00 0.0%
test-suite :: External/SPEC/CINT2006/400.perlbench/400.perlbench.test 933660.00 933682.00 0.0%
test-suite :: External/SPEC/CFP2017rate/526.blender_r/526.blender_r.test 9497722.00 9497682.00 -0.0%
test-suite :: External/SPEC/CINT2017rate/500.perlbench_r/500.perlbench_r.test 1767806.00 1767772.00 -0.0%
test-suite :: External/SPEC/CINT2017speed/600.perlbench_s/600.perlbench_s.test 1767806.00 1767772.00 -0.0%
test-suite :: MultiSource/Benchmarks/MiBench/consumer-lame/consumer-lame.test 148038.00 148024.00 -0.0%
test-suite :: MultiSource/Applications/JM/ldecod/ldecod.test 283036.00 283008.00 -0.0%
test-suite :: MultiSource/Benchmarks/mediabench/g721/g721encode/encode.test 4776.00 4772.00 -0.1%
test-suite :: External/SPEC/CINT2017rate/525.x264_r/525.x264_r.test 540582.00 511772.00 -5.3%
test-suite :: External/SPEC/CINT2017speed/625.x264_s/625.x264_s.test 540582.00 511772.00 -5.3%

CINT2006/464.h264ref - extra vector code in find_sad_16x16
JM/lencod - extra vector code in find_sad_16x16
d/make_dparser - smaller vector code
Benchmarks/Bullet - small variations
CINT2006/400.perlbench - smaller vector code
CFP2017rate/526.blender_r - small variations, extra store <8 x float> in
the loop, extra store <8 x i8> in loop
CINT2017rate/500.perlbench_r
CINT2017speed/600.perlbench_s - small variations
MiBench/consumer-lame - small variations
JM/ldecod - extra vector code
mediabench/g721/g721encode - small variations
CINT2017rate/525.x264_r
CINT2017speed/625.x264_s - reduced number of wide operations and
shuffles, saving the registers, similar to X86, extra code in
pixel_hadamard_ac vectorized, idct4x4dc not vectorized (issue with some
TTI costs)


Patch is 243.92 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/123360.diff

21 Files Affected:

  • (modified) llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp (+540-49)
  • (modified) llvm/test/Transforms/PhaseOrdering/AArch64/slpordering.ll (+102-76)
  • (modified) llvm/test/Transforms/SLPVectorizer/AArch64/gather-with-minbith-user.ll (+5-1)
  • (modified) llvm/test/Transforms/SLPVectorizer/AArch64/loadorder.ll (+96-64)
  • (modified) llvm/test/Transforms/SLPVectorizer/AArch64/tsc-s116.ll (+5-8)
  • (modified) llvm/test/Transforms/SLPVectorizer/RISCV/complex-loads.ll (+232-616)
  • (modified) llvm/test/Transforms/SLPVectorizer/RISCV/reductions.ll (+2-4)
  • (modified) llvm/test/Transforms/SLPVectorizer/X86/alternate-cast-inseltpoison.ll (+54-20)
  • (modified) llvm/test/Transforms/SLPVectorizer/X86/alternate-cast.ll (+54-20)
  • (modified) llvm/test/Transforms/SLPVectorizer/X86/alternate-fp-inseltpoison.ll (+64-16)
  • (modified) llvm/test/Transforms/SLPVectorizer/X86/alternate-fp.ll (+64-16)
  • (modified) llvm/test/Transforms/SLPVectorizer/X86/alternate-int-inseltpoison.ll (+86-20)
  • (modified) llvm/test/Transforms/SLPVectorizer/X86/alternate-int.ll (+86-20)
  • (modified) llvm/test/Transforms/SLPVectorizer/X86/buildvector-schedule-for-subvector.ll (+3-1)
  • (modified) llvm/test/Transforms/SLPVectorizer/X86/long-full-reg-stores.ll (+3-3)
  • (modified) llvm/test/Transforms/SLPVectorizer/X86/matched-shuffled-entries.ll (+16-13)
  • (modified) llvm/test/Transforms/SLPVectorizer/X86/non-load-reduced-as-part-of-bv.ll (+5-5)
  • (modified) llvm/test/Transforms/SLPVectorizer/X86/scatter-vectorize-reused-pointer.ll (+2-8)
  • (modified) llvm/test/Transforms/SLPVectorizer/X86/splat-score-adjustment.ll (+9-13)
  • (modified) llvm/test/Transforms/SLPVectorizer/addsub.ll (+4-8)
  • (modified) llvm/test/Transforms/SLPVectorizer/resized-alt-shuffle-after-minbw.ll (+18-17)
diff --git a/llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp b/llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
index b0b8f8249d657b..59063d6b4c9bc4 100644
--- a/llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
+++ b/llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
@@ -1461,6 +1461,7 @@ class BoUpSLP {
     VectorizableTree.clear();
     ScalarToTreeEntry.clear();
     MultiNodeScalars.clear();
+    ScalarsInSplitNodes.clear();
     MustGather.clear();
     NonScheduledFirst.clear();
     EntryToLastInstruction.clear();
@@ -3196,12 +3197,30 @@ class BoUpSLP {
 
     /// \returns Common mask for reorder indices and reused scalars.
     SmallVector<int> getCommonMask() const {
+      if (State == TreeEntry::SplitVectorize)
+        return {};
       SmallVector<int> Mask;
       inversePermutation(ReorderIndices, Mask);
       ::addMask(Mask, ReuseShuffleIndices);
       return Mask;
     }
 
+    /// \returns The mask for split nodes.
+    SmallVector<int> getSplitMask() const {
+      assert(State == TreeEntry::SplitVectorize && !ReorderIndices.empty() &&
+             "Expected only split vectorize node.");
+      SmallVector<int> Mask(getVectorFactor(), PoisonMaskElem);
+      unsigned CommonVF = std::max<unsigned>(
+          CombinedEntriesWithIndices.back().second,
+          Scalars.size() - CombinedEntriesWithIndices.back().second);
+      for (auto [Idx, I] : enumerate(ReorderIndices))
+        Mask[I] =
+            Idx + (Idx >= CombinedEntriesWithIndices.back().second
+                       ? CommonVF - CombinedEntriesWithIndices.back().second
+                       : 0);
+      return Mask;
+    }
+
     /// \returns true if the scalars in VL are equal to this entry.
     bool isSame(ArrayRef<Value *> VL) const {
       auto &&IsSame = [VL](ArrayRef<Value *> Scalars, ArrayRef<int> Mask) {
@@ -3293,6 +3312,8 @@ class BoUpSLP {
                          ///< complex node like select/cmp to minmax, mul/add to
                          ///< fma, etc. Must be used for the following nodes in
                          ///< the pattern, not the very first one.
+      SplitVectorize,    ///< Splits the node into 2 subnodes, vectorizes them
+                         ///< independently and then combines back.
     };
     EntryState State;
 
@@ -3324,7 +3345,7 @@ class BoUpSLP {
     /// The index of this treeEntry in VectorizableTree.
     unsigned Idx = 0;
 
-    /// For gather/buildvector/alt opcode (TODO) nodes, which are combined from
+    /// For gather/buildvector/alt opcode nodes, which are combined from
     /// other nodes as a series of insertvector instructions.
     SmallVector<std::pair<unsigned, unsigned>, 2> CombinedEntriesWithIndices;
 
@@ -3471,8 +3492,9 @@ class BoUpSLP {
                           SmallVectorImpl<Value *> *AltScalars = nullptr) const;
 
     /// Return true if this is a non-power-of-2 node.
-    bool isNonPowOf2Vec() const {
-      bool IsNonPowerOf2 = !has_single_bit(Scalars.size());
+    bool isNonPowOf2Vec(const TargetTransformInfo &TTI) const {
+      bool IsNonPowerOf2 = !hasFullVectorsOrPowerOf2(
+          TTI, getValueType(Scalars.front()), Scalars.size());
       return IsNonPowerOf2;
     }
 
@@ -3530,6 +3552,9 @@ class BoUpSLP {
       case CombinedVectorize:
         dbgs() << "CombinedVectorize\n";
         break;
+      case SplitVectorize:
+        dbgs() << "SplitVectorize\n";
+        break;
       }
       dbgs() << "MainOp: ";
       if (MainOp)
@@ -3611,8 +3636,10 @@ class BoUpSLP {
                           const EdgeInfo &UserTreeIdx,
                           ArrayRef<int> ReuseShuffleIndices = {},
                           ArrayRef<unsigned> ReorderIndices = {}) {
-    assert(((!Bundle && EntryState == TreeEntry::NeedToGather) ||
-            (Bundle && EntryState != TreeEntry::NeedToGather)) &&
+    assert(((!Bundle && (EntryState == TreeEntry::NeedToGather ||
+                         EntryState == TreeEntry::SplitVectorize)) ||
+            (Bundle && EntryState != TreeEntry::NeedToGather &&
+             EntryState != TreeEntry::SplitVectorize)) &&
            "Need to vectorize gather entry?");
     // Gathered loads still gathered? Do not create entry, use the original one.
     if (GatheredLoadsEntriesFirst.has_value() &&
@@ -3646,12 +3673,29 @@ class BoUpSLP {
                   return VL[Idx];
                 });
       InstructionsState S = getSameOpcode(Last->Scalars, *TLI);
-      if (S)
+      if (S) {
         Last->setOperations(S);
+      } else if (EntryState == TreeEntry::SplitVectorize) {
+        auto *MainOp =
+            cast<Instruction>(*find_if(Last->Scalars, IsaPred<Instruction>));
+        auto *AltOp = cast<Instruction>(*find_if(Last->Scalars, [=](Value *V) {
+          auto *I = dyn_cast<Instruction>(V);
+          return I && I->getOpcode() != MainOp->getOpcode();
+        }));
+        Last->setOperations(InstructionsState(MainOp, AltOp));
+        for (Value *V : VL) {
+          auto *I = dyn_cast<Instruction>(V);
+          if (!I)
+            continue;
+          ScalarsInSplitNodes.try_emplace(I, Last);
+        }
+      }
       Last->ReorderIndices.append(ReorderIndices.begin(), ReorderIndices.end());
     }
-    if (!Last->isGather()) {
+    if (!Last->isGather() && Last->State != TreeEntry::SplitVectorize) {
       for (Value *V : VL) {
+        if (isa<PoisonValue>(V))
+          continue;
         const TreeEntry *TE = getTreeEntry(V);
         assert((!TE || TE == Last || doesNotNeedToBeScheduled(V)) &&
                "Scalar already in tree!");
@@ -3679,7 +3723,7 @@ class BoUpSLP {
         }
       }
       assert(!BundleMember && "Bundle and VL out of sync");
-    } else {
+    } else if (Last->isGather()) {
       // Build a map for gathered scalars to the nodes where they are used.
       bool AllConstsOrCasts = true;
       for (Value *V : VL)
@@ -3745,6 +3789,9 @@ class BoUpSLP {
   /// nodes.
   SmallDenseMap<Value *, SmallVector<TreeEntry *>> MultiNodeScalars;
 
+  /// Scalars, used in split vectorize nodes.
+  SmallDenseMap<Value *, TreeEntry *> ScalarsInSplitNodes;
+
   /// Maps a value to the proposed vectorizable size.
   SmallDenseMap<Value *, unsigned> InstrElementSize;
 
@@ -5648,12 +5695,14 @@ BoUpSLP::getReorderingData(const TreeEntry &TE, bool TopToBottom) {
              }) &&
       (TE.ReorderIndices.empty() || isReverseOrder(TE.ReorderIndices)))
     return std::nullopt;
-  if ((TE.State == TreeEntry::Vectorize ||
-       TE.State == TreeEntry::StridedVectorize) &&
-      (isa<LoadInst, ExtractElementInst, ExtractValueInst>(TE.getMainOp()) ||
-       (TopToBottom && isa<StoreInst, InsertElementInst>(TE.getMainOp())))) {
-    assert(!TE.isAltShuffle() && "Alternate instructions are only supported by "
-                                 "BinaryOperator and CastInst.");
+  if (TE.State == TreeEntry::SplitVectorize ||
+      ((TE.State == TreeEntry::Vectorize ||
+        TE.State == TreeEntry::StridedVectorize) &&
+       (isa<LoadInst, ExtractElementInst, ExtractValueInst>(TE.getMainOp()) ||
+        (TopToBottom && isa<StoreInst, InsertElementInst>(TE.getMainOp()))))) {
+    assert((TE.State == TreeEntry::SplitVectorize || !TE.isAltShuffle()) &&
+           "Alternate instructions are only supported by "
+           "BinaryOperator and CastInst.");
     return TE.ReorderIndices;
   }
   if (TE.State == TreeEntry::Vectorize && TE.getOpcode() == Instruction::PHI) {
@@ -5938,7 +5987,7 @@ void BoUpSLP::reorderTopToBottom() {
     // Patterns like [fadd,fsub] can be combined into a single instruction in
     // x86. Reordering them into [fsub,fadd] blocks this pattern. So we need
     // to take into account their order when looking for the most used order.
-    if (TE->isAltShuffle()) {
+    if (TE->isAltShuffle() && TE->State != TreeEntry::SplitVectorize) {
       VectorType *VecTy =
           getWidenedType(TE->Scalars[0]->getType(), TE->Scalars.size());
       unsigned Opcode0 = TE->getOpcode();
@@ -5976,7 +6025,8 @@ void BoUpSLP::reorderTopToBottom() {
       }
       VFToOrderedEntries[TE->getVectorFactor()].insert(TE.get());
       if (!(TE->State == TreeEntry::Vectorize ||
-            TE->State == TreeEntry::StridedVectorize) ||
+            TE->State == TreeEntry::StridedVectorize ||
+            TE->State == TreeEntry::SplitVectorize) ||
           !TE->ReuseShuffleIndices.empty())
         GathersToOrders.try_emplace(TE.get(), *CurrentOrder);
       if (TE->State == TreeEntry::Vectorize &&
@@ -5985,6 +6035,30 @@ void BoUpSLP::reorderTopToBottom() {
     }
   });
 
+  auto UpdateSplitUserNode = [&](TreeEntry *UserTE, unsigned Idx,
+                                 ArrayRef<int> Mask, ArrayRef<int> MaskOrder) {
+    assert(UserTE->State == TreeEntry::SplitVectorize &&
+           "Expected split user node.");
+    SmallVector<int> NewMask(UserTE->getVectorFactor());
+    SmallVector<int> NewMaskOrder(UserTE->getVectorFactor());
+    std::iota(NewMask.begin(), NewMask.end(), 0);
+    std::iota(NewMaskOrder.begin(), NewMaskOrder.end(), 0);
+    if (Idx == 0) {
+      copy(Mask, NewMask.begin());
+      copy(MaskOrder, NewMaskOrder.begin());
+    } else {
+      assert(Idx == 1 && "Expected either 0 or 1 index.");
+      unsigned Offset = UserTE->CombinedEntriesWithIndices.back().second;
+      for (unsigned I : seq<unsigned>(Mask.size())) {
+        NewMask[I + Offset] = Mask[I] + Offset;
+        NewMaskOrder[I + Offset] = MaskOrder[I] + Offset;
+      }
+    }
+    reorderScalars(UserTE->Scalars, NewMask);
+    reorderOrder(UserTE->ReorderIndices, NewMaskOrder, /*BottomOrder=*/true);
+    if (isIdentityOrder(UserTE->ReorderIndices))
+      UserTE->ReorderIndices.clear();
+  };
   // Reorder the graph nodes according to their vectorization factor.
   for (unsigned VF = VectorizableTree.front()->getVectorFactor();
        !VFToOrderedEntries.empty() && VF > 1; VF -= 2 - (VF & 1U)) {
@@ -6007,7 +6081,8 @@ void BoUpSLP::reorderTopToBottom() {
     for (const TreeEntry *OpTE : OrderedEntries) {
       // No need to reorder this nodes, still need to extend and to use shuffle,
       // just need to merge reordering shuffle and the reuse shuffle.
-      if (!OpTE->ReuseShuffleIndices.empty() && !GathersToOrders.count(OpTE))
+      if (!OpTE->ReuseShuffleIndices.empty() && !GathersToOrders.count(OpTE) &&
+          OpTE->State != TreeEntry::SplitVectorize)
         continue;
       // Count number of orders uses.
       const auto &Order = [OpTE, &GathersToOrders, &AltShufflesToOrders,
@@ -6114,6 +6189,8 @@ void BoUpSLP::reorderTopToBottom() {
       // Just do the reordering for the nodes with the given VF.
       if (TE->Scalars.size() != VF) {
         if (TE->ReuseShuffleIndices.size() == VF) {
+          assert(TE->State != TreeEntry::SplitVectorize &&
+                 "Split vectorized not expected.");
           // Need to reorder the reuses masks of the operands with smaller VF to
           // be able to find the match between the graph nodes and scalar
           // operands of the given node during vectorization/cost estimation.
@@ -6121,7 +6198,8 @@ void BoUpSLP::reorderTopToBottom() {
                         [VF, &TE](const EdgeInfo &EI) {
                           return EI.UserTE->Scalars.size() == VF ||
                                  EI.UserTE->Scalars.size() ==
-                                     TE->Scalars.size();
+                                     TE->Scalars.size() ||
+                                 EI.UserTE->State == TreeEntry::SplitVectorize;
                         }) &&
                  "All users must be of VF size.");
           if (SLPReVec) {
@@ -6144,19 +6222,29 @@ void BoUpSLP::reorderTopToBottom() {
           // Update ordering of the operands with the smaller VF than the given
           // one.
           reorderNodeWithReuses(*TE, Mask);
+          // Update orders in user split vectorize nodes.
+          for (EdgeInfo &EI : TE->UserTreeIndices) {
+            if (EI.UserTE->State != TreeEntry::SplitVectorize)
+              continue;
+            UpdateSplitUserNode(EI.UserTE, EI.EdgeIdx, Mask, MaskOrder);
+          }
         }
         continue;
       }
-      if ((TE->State == TreeEntry::Vectorize ||
-           TE->State == TreeEntry::StridedVectorize) &&
-          (isa<ExtractElementInst, ExtractValueInst, LoadInst, StoreInst,
-               InsertElementInst>(TE->getMainOp()) ||
-           (SLPReVec && isa<ShuffleVectorInst>(TE->getMainOp())))) {
-        assert(!TE->isAltShuffle() &&
-               "Alternate instructions are only supported by BinaryOperator "
-               "and CastInst.");
-        // Build correct orders for extract{element,value}, loads and
-        // stores.
+      if ((TE->State == TreeEntry::SplitVectorize &&
+           TE->ReuseShuffleIndices.empty()) ||
+          ((TE->State == TreeEntry::Vectorize ||
+            TE->State == TreeEntry::StridedVectorize) &&
+           (isa<ExtractElementInst, ExtractValueInst, LoadInst, StoreInst,
+                InsertElementInst>(TE->getMainOp()) ||
+            (SLPReVec && isa<ShuffleVectorInst>(TE->getMainOp()))))) {
+        assert(
+            (!TE->isAltShuffle() || (TE->State == TreeEntry::SplitVectorize &&
+                                     TE->ReuseShuffleIndices.empty())) &&
+            "Alternate instructions are only supported by BinaryOperator "
+            "and CastInst.");
+        // Build correct orders for extract{element,value}, loads,
+        // stores and alternate (split) nodes.
         reorderOrder(TE->ReorderIndices, Mask);
         if (isa<InsertElementInst, StoreInst>(TE->getMainOp()))
           TE->reorderOperands(Mask);
@@ -6177,6 +6265,12 @@ void BoUpSLP::reorderTopToBottom() {
         addMask(NewReuses, TE->ReuseShuffleIndices);
         TE->ReuseShuffleIndices.swap(NewReuses);
       }
+      // Update orders in user split vectorize nodes.
+      for (EdgeInfo &EI : TE->UserTreeIndices) {
+        if (EI.UserTE->State != TreeEntry::SplitVectorize)
+          continue;
+        UpdateSplitUserNode(EI.UserTE, EI.EdgeIdx, Mask, MaskOrder);
+      }
     }
   }
 }
@@ -6189,7 +6283,8 @@ bool BoUpSLP::canReorderOperands(
     if (any_of(Edges, [I](const std::pair<unsigned, TreeEntry *> &OpData) {
           return OpData.first == I &&
                  (OpData.second->State == TreeEntry::Vectorize ||
-                  OpData.second->State == TreeEntry::StridedVectorize);
+                  OpData.second->State == TreeEntry::StridedVectorize ||
+                  OpData.second->State == TreeEntry::SplitVectorize);
         }))
       continue;
     if (TreeEntry *TE = getVectorizedOperand(UserTE, I)) {
@@ -6207,6 +6302,7 @@ bool BoUpSLP::canReorderOperands(
       // node, just reorder reuses mask.
       if (TE->State != TreeEntry::Vectorize &&
           TE->State != TreeEntry::StridedVectorize &&
+          TE->State != TreeEntry::SplitVectorize &&
           TE->ReuseShuffleIndices.empty() && TE->ReorderIndices.empty())
         GatherOps.push_back(TE);
       continue;
@@ -6216,6 +6312,7 @@ bool BoUpSLP::canReorderOperands(
                  [&Gather, UserTE, I](TreeEntry *TE) {
                    assert(TE->State != TreeEntry::Vectorize &&
                           TE->State != TreeEntry::StridedVectorize &&
+                          TE->State != TreeEntry::SplitVectorize &&
                           "Only non-vectorized nodes are expected.");
                    if (any_of(TE->UserTreeIndices,
                               [UserTE, I](const EdgeInfo &EI) {
@@ -6245,13 +6342,15 @@ void BoUpSLP::reorderBottomToTop(bool IgnoreReorder) {
   SmallVector<TreeEntry *> NonVectorized;
   for (const std::unique_ptr<TreeEntry> &TE : VectorizableTree) {
     if (TE->State != TreeEntry::Vectorize &&
-        TE->State != TreeEntry::StridedVectorize)
+        TE->State != TreeEntry::StridedVectorize &&
+        TE->State != TreeEntry::SplitVectorize)
       NonVectorized.push_back(TE.get());
     if (std::optional<OrdersType> CurrentOrder =
             getReorderingData(*TE, /*TopToBottom=*/false)) {
       OrderedEntries.insert(TE.get());
       if (!(TE->State == TreeEntry::Vectorize ||
-            TE->State == TreeEntry::StridedVectorize) ||
+            TE->State == TreeEntry::StridedVectorize ||
+            TE->State == TreeEntry::SplitVectorize) ||
           !TE->ReuseShuffleIndices.empty())
         GathersToOrders.insert(TE.get());
     }
@@ -6270,6 +6369,7 @@ void BoUpSLP::reorderBottomToTop(bool IgnoreReorder) {
     for (TreeEntry *TE : OrderedEntries) {
       if (!(TE->State == TreeEntry::Vectorize ||
             TE->State == TreeEntry::StridedVectorize ||
+            TE->State == TreeEntry::SplitVectorize ||
             (TE->isGather() && GathersToOrders.contains(TE))) ||
           TE->UserTreeIndices.empty() || !TE->ReuseShuffleIndices.empty() ||
           !all_of(drop_begin(TE->UserTreeIndices),
@@ -6295,6 +6395,51 @@ void BoUpSLP::reorderBottomToTop(bool IgnoreReorder) {
       return Data1.first->Idx > Data2.first->Idx;
     });
     for (auto &Data : UsersVec) {
+      if (Data.first->State == TreeEntry::SplitVectorize) {
+        assert(
+            Data.second.size() <= 2 &&
+            "Expected not greater than 2 operands for split vectorize node.");
+        if (any_of(Data.second, [](const auto &Op) {
+              return Op.second->UserTreeIndices.size() != 1;
+            }))
+          continue;
+        // Update orders in user split vectorize nodes.
+        for (const auto &P : Data.first->CombinedEntriesWithIndices) {
+          TreeEntry &OpTE = *VectorizableTree[P.first].get();
+          if (OpTE.isGather() || OpTE.ReorderIndices.empty())
+            continue;
+          SmallVector<int> Mask;
+          inversePermutation(OpTE.ReorderIndices, Mask);
+          SmallVector<int> MaskOrder(OpTE.ReorderIndices.size(),
+                                     PoisonMaskElem);
+          unsigned E = OpTE.ReorderIndices.size();
+          transform(OpTE.ReorderIndices, MaskOrder.begin(), [E](unsigned I) {
+            return I < E ? static_cast<int>(I) : PoisonMaskElem;
+          });
+          SmallVector<int> NewMask(Data.first->getVectorFactor());
+          SmallVector<int> NewMaskOrder(Data.first->getVectorFactor());
+          std::iota(NewMask.begin(), NewMask.end(), 0);
+          std::iota(NewMaskOrder.begin(), NewMaskOrder.end(), 0);
+          if (P.second == 0) {
+            copy(Mask, NewMask.begin());
+            copy(MaskOrder, NewMaskOrder.begin());
+          } else {
+            unsigned Offset = P.second;
+            for (unsigned I : seq<unsigned>(Mask.size())) {
+              NewMask[I + Offset] = Mask[I] + Offset;
+              NewMaskOrder[I + Offset] = MaskOrder[I] + Offset;
+            }
+          }
+          reorderScalars(Data.first->Scalars, NewMask);
+          reorderOrder(Data.first->ReorderIndices, NewMaskOrder,
+                       /*BottomOrder=*/true);
+          if (isIdentityOrder(Data.first->ReorderIndices))
+            Data.first->ReorderIndices.clear();
+          // Clear ordering of the operand.
+          OpTE.ReorderIndices.clear();
+        }
+        continue;
+      }
       // Check that operands are used only in the User node.
       SmallVector<TreeEntry *> GatherOps;
       if (!canReorderOperands(Data.first, Data.second, NonVectorized,
@@ -6451,6 +6596,7 @@ void BoUpSLP::reorderBottomToTop(bool IgnoreReorder) {
         // Gathers are processed separately.
         if (TE->State != TreeEntry::Vectorize &&
             TE->State != TreeEntry::StridedVectorize &&
+            TE->State != TreeEntry::SplitVectorize &&
             (TE->State != TreeEntry::ScatterVectorize ||
              TE->ReorderIndices.empty()))
           continue;
@@ -6521,7 +6667,7 @@ void BoUpSLP::buildExternalUses(
     TreeEntry *Entry = TEPtr.get();
 
     // No need to handle users of gathered values.
-    if (Entry->isGather())
+    if (Entry->isGather() || Entry->State == TreeEntry::SplitVectorize)
       continue;
 
     // For each lane:
@@ -8227,6 +8373,142 @@ void BoUpSLP::buildTree_rec(ArrayRef<Value *> VL, unsigned Depth,
     return;
   }
 
+  // Tries to build split node.
+  auto TrySplitNode = [&, &TTI = *TTI](unsigned SmallNodeSize,
+                                       const InstructionsState &LocalState) {
+    if (VL.size() <= SmallNodeSize)
+      return false;
+
+    // Any value is used in ...
[truncated]

Copy link

github-actions bot commented Jan 17, 2025

⚠️ undef deprecator found issues in your code. ⚠️

You can test this locally with the following command:
git diff -U0 --pickaxe-regex -S '([^a-zA-Z0-9#_-]undef[^a-zA-Z0-9_-]|UndefValue::get)' 466217eb0334d467ec8e9b96c52ee988aaadc389 238d71529221832bf99d72983c95b29c84c18a02 llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp llvm/test/Transforms/PhaseOrdering/AArch64/slpordering.ll llvm/test/Transforms/SLPVectorizer/AArch64/alternate-vectorization-split-node.ll llvm/test/Transforms/SLPVectorizer/AArch64/gather-with-minbith-user.ll llvm/test/Transforms/SLPVectorizer/AArch64/loadorder.ll llvm/test/Transforms/SLPVectorizer/AArch64/tsc-s116.ll llvm/test/Transforms/SLPVectorizer/RISCV/complex-loads.ll llvm/test/Transforms/SLPVectorizer/RISCV/reductions.ll llvm/test/Transforms/SLPVectorizer/X86/alternate-cast-inseltpoison.ll llvm/test/Transforms/SLPVectorizer/X86/alternate-cast.ll llvm/test/Transforms/SLPVectorizer/X86/alternate-fp-inseltpoison.ll llvm/test/Transforms/SLPVectorizer/X86/alternate-fp.ll llvm/test/Transforms/SLPVectorizer/X86/alternate-int-inseltpoison.ll llvm/test/Transforms/SLPVectorizer/X86/alternate-int.ll llvm/test/Transforms/SLPVectorizer/X86/buildvector-schedule-for-subvector.ll llvm/test/Transforms/SLPVectorizer/X86/long-full-reg-stores.ll llvm/test/Transforms/SLPVectorizer/X86/matched-shuffled-entries.ll llvm/test/Transforms/SLPVectorizer/X86/non-load-reduced-as-part-of-bv.ll llvm/test/Transforms/SLPVectorizer/X86/scatter-vectorize-reused-pointer.ll llvm/test/Transforms/SLPVectorizer/X86/splat-score-adjustment.ll llvm/test/Transforms/SLPVectorizer/addsub.ll llvm/test/Transforms/SLPVectorizer/resized-alt-shuffle-after-minbw.ll

The following files introduce new uses of undef:

  • llvm/test/Transforms/SLPVectorizer/AArch64/gather-with-minbith-user.ll
  • llvm/test/Transforms/SLPVectorizer/X86/non-load-reduced-as-part-of-bv.ll

Undef is now deprecated and should only be used in the rare cases where no replacement is possible. For example, a load of uninitialized memory yields undef. You should use poison values for placeholders instead.

In tests, avoid using undef and having tests that trigger undefined behavior. If you need an operand with some unimportant value, you can add a new argument to the function and use that instead.

For example, this is considered a bad practice:

define void @fn() {
  ...
  br i1 undef, ...
}

Please use the following instead:

define void @fn(i1 %cond) {
  ...
  br i1 %cond, ...
}

Please refer to the Undefined Behavior Manual for more information.

Copy link

github-actions bot commented Jan 17, 2025

✅ With the latest revision this PR passed the C/C++ code formatter.

Created using spr 1.3.5
@alexey-bataev
Copy link
Member Author

undefs in tests are generated by constant folder in InstructionBuilder

Created using spr 1.3.5
Created using spr 1.3.5
Copy link
Collaborator

@RKSimon RKSimon left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not clear whether this patch makes it easier or harder to remove the main-alt opcode kludge in the future and allow general splitting - copyable elements, more than 2 subnodes, non-matching instruction types (e.g. binop and intrinsic).

I guess I'm asking - is there a larger plan for this or is it focused on improving existing main-alt opcode performance?

llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp Outdated Show resolved Hide resolved
@alexey-bataev
Copy link
Member Author

alexey-bataev commented Jan 22, 2025

I'm not clear whether this patch makes it easier or harder to remove the main-alt opcode kludge in the future and allow general splitting - copyable elements, more than 2 subnodes, non-matching instruction types (e.g. binop and intrinsic).

  1. Main/alt opcode cannot be removed completely. X86 has instruction addsub, which effectively supports alt/main vectorization, plus too small alt/main combinations (<2) lead to inefficient vectorization.
  2. Copyable elements vectorization will require some other stuff (possibly like adding pseudo-instructions for correct scheduling). This is the work I plan for the next patch.

I guess I'm asking - is there a larger plan for this or is it focused on improving existing main-alt opcode performance?

Currently, this is mostly all we can do about main/alternate vectorization.

@alexey-bataev
Copy link
Member Author

I'm not clear whether this patch makes it easier or harder to remove the main-alt opcode kludge in the future and allow general splitting - copyable elements, more than 2 subnodes, non-matching instruction types (e.g. binop and intrinsic).

  1. Main/alt opcode cannot be removed completely. X86 has instruction addsub, which effectively supports alt/main vectorization, plus too small alt/main combinations (<2) lead to inefficient vectorization.
  2. Copyable elements vectorization will require some other stuff (possibly like adding pseudo-instructions for correct scheduling). This is the work I plan for the next patch.

I guess I'm asking - is there a larger plan for this or is it focused on improving existing main-alt opcode performance?

Currently, this is mostly all we can do about main/alternate vectorization.

Also, as the first step to improve alt/main vectorization (and copyable elements), one of our team members works on #112181. It has some issues with compile time, we still investigating how we can improve it

@alexey-bataev
Copy link
Member Author

I'm not clear whether this patch makes it easier or harder to remove the main-alt opcode kludge in the future and allow general splitting - copyable elements, more than 2 subnodes, non-matching instruction types (e.g. binop and intrinsic).

  1. Main/alt opcode cannot be removed completely. X86 has instruction addsub, which effectively supports alt/main vectorization, plus too small alt/main combinations (<2) lead to inefficient vectorization.
  2. Copyable elements vectorization will require some other stuff (possibly like adding pseudo-instructions for correct scheduling). This is the work I plan for the next patch.

I guess I'm asking - is there a larger plan for this or is it focused on improving existing main-alt opcode performance?

Currently, this is mostly all we can do about main/alternate vectorization.

Also, as the first step to improve alt/main vectorization (and copyable elements), one of our team members works on #112181. It has some issues with compile time, we still investigating how we can improve it

Just want to clarify two points here.

  1. If number of alternate opcodes <(or equal?)3, we can use [SLP] Make getSameOpcode support different instructions if they have same semantics. #112181 and copyable elements support, because nodes with 1 element are not vectorized and with <=(?) 3 might be not very effective from the perf point of view (this requires some extra investigation).
  2. In other cases, split node vectorization is much better, since it allows to reduce register pressure and replaces wide (multi-) register operations with smaller ones

Copy link
Contributor

@fhahn fhahn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like the current version causes some crashes on AArch64:

target datalayout = "e-m:o-p270:32:32-p271:32:32-p272:64:64-i64:64-i128:128-n32:64-S128-Fn32"
target triple = "arm64-apple-macosx13.0.0"

define i32 @test(ptr %c) {
entry:
  %bitlen = getelementptr i8, ptr %c, i64 136
  %0 = load i64, ptr %bitlen, align 8
  %incdec.ptr.4 = getelementptr i8, ptr %c, i64 122
  %shr45.4 = lshr i64 %0, 0
  %conv43.5 = trunc i64 %shr45.4 to i8
  %incdec.ptr.5 = getelementptr i8, ptr %c, i64 121
  store i8 %conv43.5, ptr %incdec.ptr.4, align 1
  %shr45.5 = lshr i64 %0, 0
  %conv43.6 = trunc i64 %shr45.5 to i8
  %incdec.ptr.6 = getelementptr i8, ptr %c, i64 120
  store i8 %conv43.6, ptr %incdec.ptr.5, align 1
  %conv43.7 = trunc i64 %0 to i8
  %incdec.ptr.7 = getelementptr i8, ptr %c, i64 119
  store i8 %conv43.7, ptr %incdec.ptr.6, align 1
  %arrayidx38.1 = getelementptr i8, ptr %c, i64 144
  %1 = load i64, ptr %arrayidx38.1, align 8
  %conv43.145 = trunc i64 %1 to i8
  %incdec.ptr.146 = getelementptr i8, ptr %c, i64 118
  store i8 %conv43.145, ptr %incdec.ptr.7, align 1
  %shr45.147 = lshr i64 %1, 0
  %conv43.1.1 = trunc i64 %shr45.147 to i8
  %incdec.ptr.1.1 = getelementptr i8, ptr %c, i64 117
  store i8 %conv43.1.1, ptr %incdec.ptr.146, align 1
  %shr45.1.1 = lshr i64 %1, 0
  %conv43.2.1 = trunc i64 %shr45.1.1 to i8
  %incdec.ptr.2.1 = getelementptr i8, ptr %c, i64 116
  store i8 %conv43.2.1, ptr %incdec.ptr.1.1, align 1
  %shr45.2.1 = lshr i64 %1, 0
  %conv43.3.1 = trunc i64 %shr45.2.1 to i8
  %incdec.ptr.3.1 = getelementptr i8, ptr %c, i64 115
  store i8 %conv43.3.1, ptr %incdec.ptr.2.1, align 1
  %shr45.3.1 = lshr i64 %1, 0
  %conv43.4.1 = trunc i64 %shr45.3.1 to i8
  store i8 %conv43.4.1, ptr %incdec.ptr.3.1, align 1
  ret i32 0
}

Assertion failed: ((!isa<FixedVectorType>(VTy) || (Index + NumSubElts) <= (int)cast<FixedVectorType>(VTy)->getNumElements()) && "SK_InsertSubvector index out of range"), function getInsertSubvectorOverhead, file BasicTTIImpl.h, line 166.

Created using spr 1.3.5
Created using spr 1.3.5
@alexey-bataev
Copy link
Member Author

Ping!

Copy link
Collaborator

@RKSimon RKSimon left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

rebase after #124914?

llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp Outdated Show resolved Hide resolved
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp Outdated Show resolved Hide resolved
Created using spr 1.3.5
Copy link
Collaborator

@RKSimon RKSimon left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM with a few minors

llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp Outdated Show resolved Hide resolved
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp Outdated Show resolved Hide resolved
Created using spr 1.3.5
Created using spr 1.3.5
@alexey-bataev alexey-bataev merged commit e588085 into main Jan 31, 2025
4 of 8 checks passed
@alexey-bataev alexey-bataev deleted the users/alexey-bataev/spr/slpreduce-number-of-alternate-instruction-where-possible branch January 31, 2025 11:56
github-actions bot pushed a commit to arm/arm-toolchain that referenced this pull request Jan 31, 2025
Patch tries to remove wide alternate operations.
Currently SLP vectorizer emits something like this:
```
%0 = add i32
%1 = sub i32
%2 = add i32
%3 = sub i32
%4 = add i32
%5 = sub i32
%6 = add i32
%7 = sub i32

transformes to

%v1 = add <8 x i32>
%v2 = sub <8 x i32>
%res = shuffle %v1, %v2, <0, 9, 2, 11, 4, 13, 6, 15>
```
i.e. half of the results are just unused. This leads to increased
register pressure and potentially doubles number of operations.

Patch introduces SplitVectorize mode, where it splits the operations by
opcodes and produces instead something like this:
```
%v1 = add <4 x i32>
%v2 = sub <4 x i32>
%res = shuffle %v1, %v2, <0, 4, 1, 5, 2, 6, 3, 7>
```
It allows to improve the performance by reducing number of ops. Also, it
turns on some other improvements, like improved graph reordering.

-O3+LTO, AVX512
Metric: size..text
Program                                                                         size..text
                                                                                            results     results0    diff
           test-suite :: MultiSource/Benchmarks/Prolangs-C/TimberWolfMC/timberwolfmc.test   277800.00   280536.00  1.0%
                          test-suite :: MultiSource/Benchmarks/FreeBench/pifft/pifft.test    81802.00    82426.00  0.8%
                        test-suite :: External/SPEC/CINT2006/464.h264ref/464.h264ref.test   790552.00   790952.00  0.1%
                             test-suite :: MultiSource/Applications/JM/ldecod/ldecod.test   383795.00   383987.00  0.1%
           test-suite :: External/SPEC/CINT2017speed/600.perlbench_s/600.perlbench_s.test  2075541.00  2076501.00  0.0%
            test-suite :: External/SPEC/CINT2017rate/500.perlbench_r/500.perlbench_r.test  2075541.00  2076501.00  0.0%
                                  test-suite :: MultiSource/Benchmarks/Bullet/bullet.test   312702.00   312766.00  0.0%
                 test-suite :: External/SPEC/CFP2017rate/526.blender_r/526.blender_r.test 12569783.00 12569751.00 -0.0%
                   test-suite :: External/SPEC/CFP2017rate/510.parest_r/510.parest_r.test  2049374.00  2049358.00 -0.0%
                    test-suite :: External/SPEC/CINT2006/400.perlbench/400.perlbench.test  1091836.00  1091772.00 -0.0%
                             test-suite :: MultiSource/Applications/JM/lencod/lencod.test   852339.00   852211.00 -0.0%
                                test-suite :: MultiSource/Applications/oggenc/oggenc.test   190651.00   190523.00 -0.1%
                test-suite :: MultiSource/Benchmarks/DOE-ProxyApps-C/miniGMG/miniGMG.test    44203.00    44155.00 -0.1%
test-suite :: SingleSource/UnitTests/Vector/AVX512BWVL/Vector-AVX512BWVL-mask_set_bw.test    12997.00    12981.00 -0.1%
                     test-suite :: External/SPEC/CINT2017speed/625.x264_s/625.x264_s.test   668971.00   658427.00 -1.6%
                      test-suite :: External/SPEC/CINT2017rate/525.x264_r/525.x264_r.test   668971.00   658427.00 -1.6%

Prolangs-C/TimberWolfMC/timberwolfmc - small variations, some code not
inlined
FreeBench/pifft - extra stores <8 x double> vectorized, some other extra
vectorizations
CINT2006/464.h264ref - some smaller code + changes similar to x264
JM/ldecod - changes similar x264
CINT2017speed/600.perlbench_s
CINT2017rate/500.perlbench_r - significantly compact vector code
Benchmarks/Bullet - small variations
CFP2017rate/526.blender_r - small variations
CFP2017rate/510.parest_r - small variations
CINT2006/400.perlbench - extra vector code
JM/lencod - extra store <16 x i32> and other changes similar x264
Applications/oggenc - extra store <16 x i8>, small variations
DOE-ProxyApps-C/miniGMG - small variations
Vector/AVX512BWVL/Vector-AVX512BWVL-mask_set_bw - better vector code
CINT2017speed/625.x264_s
CINT2017rate/525.x264_r - the number of instructions increased, but
looks like they are more performant. E.g., for function
x264_pixel_satd_8x8, llvm-mca reports better throughput - 84 for the
current version and 59 for the new version.

-O3+LTO, march=rva32u64

CINT2017rate/525.x264_r - similar to x86, extra code in pixel_hadamard_ac
function vectorized, idct4x4dc stopped being vectorized (looks like
issue with shuffles cost)
CINT2006/400.perlbench - better vector code
CINT2006/445.gobmk - some variations in vector code
CINT2006/464.h264ref - extra code vectorized
CINT2017rate/500.perlbench_r - small variations

-O3+LTO, mcpu=sifive-p470

Metric: size..text

Program                                                                                                                                                size..text
                                                                               results    results0   diff
             test-suite :: External/SPEC/CINT2006/464.h264ref/464.h264ref.test  587336.00  587668.00  0.1%
                  test-suite :: MultiSource/Applications/JM/lencod/lencod.test  643308.00  643614.00  0.0%
                    test-suite :: MultiSource/Applications/d/make_dparser.test   79678.00   79710.00  0.0%
                       test-suite :: MultiSource/Benchmarks/Bullet/bullet.test  277322.00  277420.00  0.0%
         test-suite :: External/SPEC/CINT2006/400.perlbench/400.perlbench.test  933660.00  933682.00  0.0%
      test-suite :: External/SPEC/CFP2017rate/526.blender_r/526.blender_r.test 9497722.00 9497682.00 -0.0%
 test-suite :: External/SPEC/CINT2017rate/500.perlbench_r/500.perlbench_r.test 1767806.00 1767772.00 -0.0%
test-suite :: External/SPEC/CINT2017speed/600.perlbench_s/600.perlbench_s.test 1767806.00 1767772.00 -0.0%
 test-suite :: MultiSource/Benchmarks/MiBench/consumer-lame/consumer-lame.test  148038.00  148024.00 -0.0%
                  test-suite :: MultiSource/Applications/JM/ldecod/ldecod.test  283036.00  283008.00 -0.0%
   test-suite :: MultiSource/Benchmarks/mediabench/g721/g721encode/encode.test    4776.00    4772.00 -0.1%
           test-suite :: External/SPEC/CINT2017rate/525.x264_r/525.x264_r.test  540582.00  511772.00 -5.3%
          test-suite :: External/SPEC/CINT2017speed/625.x264_s/625.x264_s.test  540582.00  511772.00 -5.3%

CINT2006/464.h264ref - extra vector code in find_sad_16x16
JM/lencod - extra vector code in find_sad_16x16
d/make_dparser - smaller vector code
Benchmarks/Bullet - small variations
CINT2006/400.perlbench - smaller vector code
CFP2017rate/526.blender_r - small variations, extra store <8 x float> in
the loop, extra store <8 x i8> in loop
CINT2017rate/500.perlbench_r
CINT2017speed/600.perlbench_s - small variations
MiBench/consumer-lame - small variations
JM/ldecod - extra vector code
mediabench/g721/g721encode - small variations
CINT2017rate/525.x264_r
CINT2017speed/625.x264_s - reduced number of wide operations and
shuffles, saving the registers, similar to X86, extra code in
pixel_hadamard_ac vectorized, idct4x4dc not vectorized (issue with some
TTI costs)

Reviewers: RKSimon, hiraditya

Reviewed By: RKSimon

Pull Request: llvm/llvm-project#123360
@asb
Copy link
Contributor

asb commented Jan 31, 2025

@alexey-bataev this appears to have caused a failure for the rva23 evl builder: https://lab.llvm.org/staging/#/builders/16/builds/717/steps/11/logs/stdio

[140/7284] Building C object lib/Support/BLAKE3/CMakeFiles/LLVMSupportBlake3.dir/blake3_portable.c.o
FAILED: lib/Support/BLAKE3/CMakeFiles/LLVMSupportBlake3.dir/blake3_portable.c.o 
/home/buildbot-worker/bbroot/clang-riscv-rva23-evl-vec-2stage/stage1.install/bin/clang --target=riscv64-linux-gnu --sysroot=/home/buildbot-worker/bbroot/clang-riscv-rva23-evl-vec-2stage/../rvsysroot -DBLAKE3_NO_AVX2 -DBLAKE3_NO_AVX512 -DBLAKE3_NO_SSE2 -DBLAKE3_NO_SSE41 -DGTEST_HAS_RTTI=0 -D_DEBUG -D_GLIBCXX_ASSERTIONS -D_GNU_SOURCE -D__STDC_CONSTANT_MACROS -D__STDC_FORMAT_MACROS -D__STDC_LIMIT_MACROS -I/home/buildbot-worker/bbroot/clang-riscv-rva23-evl-vec-2stage/stage2/lib/Support/BLAKE3 -I/home/buildbot-worker/bbroot/clang-riscv-rva23-evl-vec-2stage/llvm/llvm/lib/Support/BLAKE3 -I/home/buildbot-worker/bbroot/clang-riscv-rva23-evl-vec-2stage/stage2/include -I/home/buildbot-worker/bbroot/clang-riscv-rva23-evl-vec-2stage/llvm/llvm/include -march=rva23u64 -mllvm -force-tail-folding-style=data-with-evl -mllvm -prefer-predicate-over-epilogue=predicate-else-scalar-epilogue -fPIC -fno-semantic-interposition -Werror=date-time -Werror=unguarded-availability-new -Wall -Wextra -Wno-unused-parameter -Wwrite-strings -Wcast-qual -Wmissing-field-initializers -pedantic -Wno-long-long -Wc++98-compat-extra-semi -Wimplicit-fallthrough -Wcovered-switch-default -Wno-comment -Wstring-conversion -Wmisleading-indentation -Wctad-maybe-unsupported -fdiagnostics-color -ffunction-sections -fdata-sections -O3 -DNDEBUG -UNDEBUG -MD -MT lib/Support/BLAKE3/CMakeFiles/LLVMSupportBlake3.dir/blake3_portable.c.o -MF lib/Support/BLAKE3/CMakeFiles/LLVMSupportBlake3.dir/blake3_portable.c.o.d -o lib/Support/BLAKE3/CMakeFiles/LLVMSupportBlake3.dir/blake3_portable.c.o -c /home/buildbot-worker/bbroot/clang-riscv-rva23-evl-vec-2stage/llvm/llvm/lib/Support/BLAKE3/blake3_portable.c
clang: /home/buildbot-worker/bbroot/clang-riscv-rva23-evl-vec-2stage/llvm/llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp:6039: void llvm::slpvectorizer::BoUpSLP::TreeEntry::reorderSplitNode(unsigned int, ArrayRef<int>, ArrayRef<int>): Assertion `Idx == 1 && "Expected either 0 or 1 index."' failed.
PLEASE submit a bug report to https://github.com/llvm/llvm-project/issues/ and include the crash backtrace, preprocessed source, and associated run script.
Stack dump:
0.	Program arguments: /home/buildbot-worker/bbroot/clang-riscv-rva23-evl-vec-2stage/stage1.install/bin/clang --target=riscv64-linux-gnu --sysroot=/home/buildbot-worker/bbroot/clang-riscv-rva23-evl-vec-2stage/../rvsysroot -DBLAKE3_NO_AVX2 -DBLAKE3_NO_AVX512 -DBLAKE3_NO_SSE2 -DBLAKE3_NO_SSE41 -DGTEST_HAS_RTTI=0 -D_DEBUG -D_GLIBCXX_ASSERTIONS -D_GNU_SOURCE -D__STDC_CONSTANT_MACROS -D__STDC_FORMAT_MACROS -D__STDC_LIMIT_MACROS -I/home/buildbot-worker/bbroot/clang-riscv-rva23-evl-vec-2stage/stage2/lib/Support/BLAKE3 -I/home/buildbot-worker/bbroot/clang-riscv-rva23-evl-vec-2stage/llvm/llvm/lib/Support/BLAKE3 -I/home/buildbot-worker/bbroot/clang-riscv-rva23-evl-vec-2stage/stage2/include -I/home/buildbot-worker/bbroot/clang-riscv-rva23-evl-vec-2stage/llvm/llvm/include -march=rva23u64 -mllvm -force-tail-folding-style=data-with-evl -mllvm -prefer-predicate-over-epilogue=predicate-else-scalar-epilogue -fPIC -fno-semantic-interposition -Werror=date-time -Werror=unguarded-availability-new -Wall -Wextra -Wno-unused-parameter -Wwrite-strings -Wcast-qual -Wmissing-field-initializers -pedantic -Wno-long-long -Wc++98-compat-extra-semi -Wimplicit-fallthrough -Wcovered-switch-default -Wno-comment -Wstring-conversion -Wmisleading-indentation -Wctad-maybe-unsupported -fdiagnostics-color -ffunction-sections -fdata-sections -O3 -DNDEBUG -UNDEBUG -MD -MT lib/Support/BLAKE3/CMakeFiles/LLVMSupportBlake3.dir/blake3_portable.c.o -MF lib/Support/BLAKE3/CMakeFiles/LLVMSupportBlake3.dir/blake3_portable.c.o.d -o lib/Support/BLAKE3/CMakeFiles/LLVMSupportBlake3.dir/blake3_portable.c.o -c /home/buildbot-worker/bbroot/clang-riscv-rva23-evl-vec-2stage/llvm/llvm/lib/Support/BLAKE3/blake3_portable.c
1.	<eof> parser at end of file
2.	Optimizer
3.	Running pass "function<eager-inv>(float2int,lower-constant-intrinsics,chr,loop(loop-rotate<header-duplication;no-prepare-for-lto>,loop-deletion),loop-distribute,inject-tli-mappings,loop-vectorize<no-interleave-forced-only;no-vectorize-forced-only;>,infer-alignment,loop-load-elim,instcombine<max-iterations=1;no-verify-fixpoint>,simplifycfg<bonus-inst-threshold=1;forward-switch-cond;switch-range-to-icmp;switch-to-lookup;no-keep-loops;hoist-common-insts;no-hoist-loads-stores-with-cond-faulting;sink-common-insts;speculate-blocks;simplify-cond-branch;no-speculate-unpredictables>,slp-vectorizer,vector-combine,instcombine<max-iterations=1;no-verify-fixpoint>,loop-unroll<O3>,transform-warning,sroa<preserve-cfg>,infer-alignment,instcombine<max-iterations=1;no-verify-fixpoint>,loop-mssa(licm<allowspeculation>),alignment-from-assumptions,loop-sink,instsimplify,div-rem-pairs,tailcallelim,simplifycfg<bonus-inst-threshold=1;no-forward-switch-cond;switch-range-to-icmp;no-switch-to-lookup;keep-loops;no-hoist-common-insts;hoist-loads-stores-with-cond-faulting;no-sink-common-insts;speculate-blocks;simplify-cond-branch;speculate-unpredictables>)" on module "/home/buildbot-worker/bbroot/clang-riscv-rva23-evl-vec-2stage/llvm/llvm/lib/Support/BLAKE3/blake3_portable.c"
4.	Running pass "slp-vectorizer" on function "llvm_blake3_compress_in_place_portable"
 #0 0x00005aa8c3d78836 llvm::sys::PrintStackTrace(llvm::raw_ostream&, int) (/home/buildbot-worker/bbroot/clang-riscv-rva23-evl-vec-2stage/stage1.install/bin/clang+0x439d836)
 #1 0x00005aa8c3d7614e llvm::sys::RunSignalHandlers() (/home/buildbot-worker/bbroot/clang-riscv-rva23-evl-vec-2stage/stage1.install/bin/clang+0x439b14e)
 #2 0x00005aa8c3ce0b2d CrashRecoverySignalHandler(int) CrashRecoveryContext.cpp:0:0
 #3 0x0000780e6a24c1d0 (/usr/lib/libc.so.6+0x3d1d0)
 #4 0x0000780e6a2a5334 (/usr/lib/libc.so.6+0x96334)
 #5 0x0000780e6a24c120 raise (/usr/lib/libc.so.6+0x3d120)
 #6 0x0000780e6a2334c3 abort (/usr/lib/libc.so.6+0x244c3)
 #7 0x0000780e6a2333df (/usr/lib/libc.so.6+0x243df)
 #8 0x0000780e6a244177 (/usr/lib/libc.so.6+0x35177)
 #9 0x00005aa8c5994756 llvm::slpvectorizer::BoUpSLP::TreeEntry::reorderSplitNode(unsigned int, llvm::ArrayRef<int>, llvm::ArrayRef<int>) (/home/buildbot-worker/bbroot/clang-riscv-rva23-evl-vec-2stage/stage1.install/bin/clang+0x5fb9756)
#10 0x00005aa8c59998c0 llvm::slpvectorizer::BoUpSLP::reorderBottomToTop(bool) (/home/buildbot-worker/bbroot/clang-riscv-rva23-evl-vec-2stage/stage1.install/bin/clang+0x5fbe8c0)
#11 0x00005aa8c59fcf0a llvm::SLPVectorizerPass::vectorizeStoreChain(llvm::ArrayRef<llvm::Value*>, llvm::slpvectorizer::BoUpSLP&, unsigned int, unsigned int, unsigned int&) (/home/buildbot-worker/bbroot/clang-riscv-rva23-evl-vec-2stage/stage1.install/bin/clang+0x6021f0a)
#12 0x00005aa8c59ff742 llvm::SLPVectorizerPass::vectorizeStores(llvm::ArrayRef<llvm::StoreInst*>, llvm::slpvectorizer::BoUpSLP&, llvm::DenseSet<std::tuple<llvm::Value*, llvm::Value*, llvm::Value*, llvm::Value*, unsigned int>, llvm::DenseMapInfo<std::tuple<llvm::Value*, llvm::Value*, llvm::Value*, llvm::Value*, unsigned int>, void>>&)::$_0::operator()(std::set<std::pair<unsigned int, int>, llvm::SLPVectorizerPass::vectorizeStores(llvm::ArrayRef<llvm::StoreInst*>, llvm::slpvectorizer::BoUpSLP&, llvm::DenseSet<std::tuple<llvm::Value*, llvm::Value*, llvm::Value*, llvm::Value*, unsigned int>, llvm::DenseMapInfo<std::tuple<llvm::Value*, llvm::Value*, llvm::Value*, llvm::Value*, unsigned int>, void>>&)::StoreDistCompare, std::allocator<std::pair<unsigned int, int>>> const&) const SLPVectorizer.cpp:0:0
#13 0x00005aa8c59fe37b llvm::SLPVectorizerPass::vectorizeStores(llvm::ArrayRef<llvm::StoreInst*>, llvm::slpvectorizer::BoUpSLP&, llvm::DenseSet<std::tuple<llvm::Value*, llvm::Value*, llvm::Value*, llvm::Value*, unsigned int>, llvm::DenseMapInfo<std::tuple<llvm::Value*, llvm::Value*, llvm::Value*, llvm::Value*, unsigned int>, void>>&) (/home/buildbot-worker/bbroot/clang-riscv-rva23-evl-vec-2stage/stage1.install/bin/clang+0x602337b)
#14 0x00005aa8c59f90d9 llvm::SLPVectorizerPass::vectorizeStoreChains(llvm::slpvectorizer::BoUpSLP&) (/home/buildbot-worker/bbroot/clang-riscv-rva23-evl-vec-2stage/stage1.install/bin/clang+0x601e0d9)
#15 0x00005aa8c59f78ec llvm::SLPVectorizerPass::runImpl(llvm::Function&, llvm::ScalarEvolution*, llvm::TargetTransformInfo*, llvm::TargetLibraryInfo*, llvm::AAResults*, llvm::LoopInfo*, llvm::DominatorTree*, llvm::AssumptionCache*, llvm::DemandedBits*, llvm::OptimizationRemarkEmitter*) (/home/buildbot-worker/bbroot/clang-riscv-rva23-evl-vec-2stage/stage1.install/bin/clang+0x601c8ec)
#16 0x00005aa8c59f6e76 llvm::SLPVectorizerPass::run(llvm::Function&, llvm::AnalysisManager<llvm::Function>&) (/home/buildbot-worker/bbroot/clang-riscv-rva23-evl-vec-2stage/stage1.install/bin/clang+0x601be76)
#17 0x00005aa8c543f13d llvm::detail::PassModel<llvm::Function, llvm::SLPVectorizerPass, llvm::AnalysisManager<llvm::Function>>::run(llvm::Function&, llvm::AnalysisManager<llvm::Function>&) PassBuilder.cpp:0:0
#18 0x00005aa8c386ff5a llvm::PassManager<llvm::Function, llvm::AnalysisManager<llvm::Function>>::run(llvm::Function&, llvm::AnalysisManager<llvm::Function>&) (/home/buildbot-worker/bbroot/clang-riscv-rva23-evl-vec-2stage/stage1.install/bin/clang+0x3e94f5a)
#19 0x00005aa8c4635c1d llvm::detail::PassModel<llvm::Function, llvm::PassManager<llvm::Function, llvm::AnalysisManager<llvm::Function>>, llvm::AnalysisManager<llvm::Function>>::run(llvm::Function&, llvm::AnalysisManager<llvm::Function>&) BackendUtil.cpp:0:0
#20 0x00005aa8c387461f llvm::ModuleToFunctionPassAdaptor::run(llvm::Module&, llvm::AnalysisManager<llvm::Module>&) (/home/buildbot-worker/bbroot/clang-riscv-rva23-evl-vec-2stage/stage1.install/bin/clang+0x3e9961f)
#21 0x00005aa8c4630d7d llvm::detail::PassModel<llvm::Module, llvm::ModuleToFunctionPassAdaptor, llvm::AnalysisManager<llvm::Module>>::run(llvm::Module&, llvm::AnalysisManager<llvm::Module>&) BackendUtil.cpp:0:0
#22 0x00005aa8c386ef6a llvm::PassManager<llvm::Module, llvm::AnalysisManager<llvm::Module>>::run(llvm::Module&, llvm::AnalysisManager<llvm::Module>&) (/home/buildbot-worker/bbroot/clang-riscv-rva23-evl-vec-2stage/stage1.install/bin/clang+0x3e93f6a)
#23 0x00005aa8c462df8d (anonymous namespace)::EmitAssemblyHelper::RunOptimizationPipeline(clang::BackendAction, std::unique_ptr<llvm::raw_pwrite_stream, std::default_delete<llvm::raw_pwrite_stream>>&, std::unique_ptr<llvm::ToolOutputFile, std::default_delete<llvm::ToolOutputFile>>&, clang::BackendConsumer*) BackendUtil.cpp:0:0
#24 0x00005aa8c4624075 clang::emitBackendOutput(clang::CompilerInstance&, clang::CodeGenOptions&, llvm::StringRef, llvm::Module*, clang::BackendAction, llvm::IntrusiveRefCntPtr<llvm::vfs::FileSystem>, std::unique_ptr<llvm::raw_pwrite_stream, std::default_delete<llvm::raw_pwrite_stream>>, clang::BackendConsumer*) (/home/buildbot-worker/bbroot/clang-riscv-rva23-evl-vec-2stage/stage1.install/bin/clang+0x4c49075)
#25 0x00005aa8c464ef47 clang::BackendConsumer::HandleTranslationUnit(clang::ASTContext&) (/home/buildbot-worker/bbroot/clang-riscv-rva23-evl-vec-2stage/stage1.install/bin/clang+0x4c73f47)
#26 0x00005aa8c60eda79 clang::ParseAST(clang::Sema&, bool, bool) (/home/buildbot-worker/bbroot/clang-riscv-rva23-evl-vec-2stage/stage1.install/bin/clang+0x6712a79)
#27 0x00005aa8c4b1db44 clang::FrontendAction::Execute() (/home/buildbot-worker/bbroot/clang-riscv-rva23-evl-vec-2stage/stage1.install/bin/clang+0x5142b44)
#28 0x00005aa8c4a85d6d clang::CompilerInstance::ExecuteAction(clang::FrontendAction&) (/home/buildbot-worker/bbroot/clang-riscv-rva23-evl-vec-2stage/stage1.install/bin/clang+0x50aad6d)
#29 0x00005aa8c4c0f1cd clang::ExecuteCompilerInvocation(clang::CompilerInstance*) (/home/buildbot-worker/bbroot/clang-riscv-rva23-evl-vec-2stage/stage1.install/bin/clang+0x52341cd)
#30 0x00005aa8c2c04a8e cc1_main(llvm::ArrayRef<char const*>, char const*, void*) (/home/buildbot-worker/bbroot/clang-riscv-rva23-evl-vec-2stage/stage1.install/bin/clang+0x3229a8e)
#31 0x00005aa8c2c00a19 ExecuteCC1Tool(llvm::SmallVectorImpl<char const*>&, llvm::ToolContext const&) driver.cpp:0:0
#32 0x00005aa8c48f2999 void llvm::function_ref<void ()>::callback_fn<clang::driver::CC1Command::Execute(llvm::ArrayRef<std::optional<llvm::StringRef>>, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>>*, bool*) const::$_0>(long) Job.cpp:0:0
#33 0x00005aa8c3ce082e llvm::CrashRecoveryContext::RunSafely(llvm::function_ref<void ()>) (/home/buildbot-worker/bbroot/clang-riscv-rva23-evl-vec-2stage/stage1.install/bin/clang+0x430582e)
#34 0x00005aa8c48f1ff4 clang::driver::CC1Command::Execute(llvm::ArrayRef<std::optional<llvm::StringRef>>, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>>*, bool*) const (/home/buildbot-worker/bbroot/clang-riscv-rva23-evl-vec-2stage/stage1.install/bin/clang+0x4f16ff4)
#35 0x00005aa8c48b1958 clang::driver::Compilation::ExecuteCommand(clang::driver::Command const&, clang::driver::Command const*&, bool) const (/home/buildbot-worker/bbroot/clang-riscv-rva23-evl-vec-2stage/stage1.install/bin/clang+0x4ed6958)
#36 0x00005aa8c48b1b97 clang::driver::Compilation::ExecuteJobs(clang::driver::JobList const&, llvm::SmallVectorImpl<std::pair<int, clang::driver::Command const*>>&, bool) const (/home/buildbot-worker/bbroot/clang-riscv-rva23-evl-vec-2stage/stage1.install/bin/clang+0x4ed6b97)
#37 0x00005aa8c48ced89 clang::driver::Driver::ExecuteCompilation(clang::driver::Compilation&, llvm::SmallVectorImpl<std::pair<int, clang::driver::Command const*>>&) (/home/buildbot-worker/bbroot/clang-riscv-rva23-evl-vec-2stage/stage1.install/bin/clang+0x4ef3d89)
#38 0x00005aa8c2c00266 clang_main(int, char**, llvm::ToolContext const&) (/home/buildbot-worker/bbroot/clang-riscv-rva23-evl-vec-2stage/stage1.install/bin/clang+0x3225266)
#39 0x00005aa8c2c102d6 main (/home/buildbot-worker/bbroot/clang-riscv-rva23-evl-vec-2stage/stage1.install/bin/clang+0x32352d6)
#40 0x0000780e6a234e08 (/usr/lib/libc.so.6+0x25e08)
#41 0x0000780e6a234ecc __libc_start_main (/usr/lib/libc.so.6+0x25ecc)
#42 0x00005aa8c2bfe4e5 _start (/home/buildbot-worker/bbroot/clang-riscv-rva23-evl-vec-2stage/stage1.install/bin/clang+0x32234e5)
clang: error: clang frontend command failed with exit code 134 (use -v to see invocation)
clang version 21.0.0git (https://github.com/llvm/llvm-project.git c1163b843b63f775817b84f124cdcf33f25c28f6)
Target: riscv64-unknown-linux-gnu
Thread model: posix
InstalledDir: /home/buildbot-worker/bbroot/clang-riscv-rva23-evl-vec-2stage/stage1.install/bin
Build config: +assertions
clang: note: diagnostic msg: 
********************

alexey-bataev added a commit that referenced this pull request Jan 31, 2025
@asb
Copy link
Contributor

asb commented Jan 31, 2025

Thanks for the revert - I was just double-checking if we get it with plain RVV and indeed we do.

repro.zip

Shouldn't be necessary, but this reproduces it early on when bootstrapping LLVM (x86-64 host, cross-compiling for RISCV rva23u64 - this could be simplified, it's just adapted from what we do in CI):

cd llvm-project
mkdir -p build && cd build
mkdir -p stage1 stage1.install stage2
printf "!!!!!!!!!! Building stage 1 !!!!!!!!!!\n"
cd stage1
cmake -G Ninja \
  -DCMAKE_BUILD_TYPE=Release \
  -DLLVM_ENABLE_ASSERTIONS=True \
  -DLLVM_LIT_ARGS="-v" \
  -DCMAKE_INSTALL_PREFIX=../stage1.install \
  -DCMAKE_C_COMPILER=clang \
  -DCMAKE_CXX_COMPILER=clang++ \
  -DLLVM_ENABLE_LLD=True \
  -DLLVM_TARGETS_TO_BUILD="RISCV" \
  -DCMAKE_C_COMPILER_LAUNCHER=ccache \
  -DCMAKE_CXX_COMPILER_LAUNCHER=ccache \
  -DLLVM_ENABLE_PROJECTS="lld;clang;clang-tools-extra;llvm" \
  ../../llvm
ninja
printf "!!!!!!!!!! Installing stage 1 !!!!!!!!!!\n"
ninja install
cd ../stage2
printf "!!!!!!!!!! Building stage 2 !!!!!!!!!!\n"
cat - <<EOF > stage1-toolchain.cmake
set(CMAKE_SYSTEM_NAME Linux)
set(CMAKE_SYSROOT $HOME/rvsysroot)
set(CMAKE_C_COMPILER_TARGET riscv64-linux-gnu)
set(CMAKE_CXX_COMPILER_TARGET riscv64-linux-gnu)
set(CMAKE_C_FLAGS_INIT "-march=rva23u64")
set(CMAKE_CXX_FLAGS_INIT "-march=rva23u64")
set(CMAKE_LINKER_TYPE LLD)
set(CMAKE_C_COMPILER $(pwd)/../stage1.install/bin/clang)
set(CMAKE_CXX_COMPILER $(pwd)/../stage1.install/bin/clang++)
set(CMAKE_C_COMPILER_LAUNCHER ccache)
set(CMAKE_CXX_COMPILER_LAUNCHER ccache)
set(CMAKE_FIND_ROOT_PATH_MODE_PROGRAM NEVER)
set(CMAKE_FIND_ROOT_PATH_MODE_LIBRARY ONLY)
set(CMAKE_FIND_ROOT_PATH_MODE_INCLUDE ONLY)
set(CMAKE_FIND_ROOT_PATH_MODE_PACKAGE ONLY)
EOF
cmake -G Ninja \
  -DCMAKE_BUILD_TYPE=Release \
  -DLLVM_ENABLE_ASSERTIONS=True \
  -DLLVM_LIT_ARGS="-v" \
  -DLLVM_NATIVE_TOOL_DIR=$(pwd)/../stage1.install/bin \
  -DLLVM_BUILD_TESTS=True \
  -DLLVM_ENABLE_PROJECTS="lld;clang;clang-tools-extra;llvm" \
  -DCMAKE_TOOLCHAIN_FILE=$(pwd)/stage1-toolchain.cmake \
  -DLLVM_HOST_TRIPLE=riscv64-linux-gnu \
  ../../llvm
ninja

; CHECK-NEXT: [[TMP99:%.*]] = xor <16 x i32> [[TMP98]], [[TMP96]]
; CHECK-NEXT: [[TMP100:%.*]] = tail call i32 @llvm.vector.reduce.add.v16i32(<16 x i32> [[TMP99]])
; CHECK-NEXT: [[CONV118:%.*]] = and i32 [[TMP100]], 65535
; CHECK-NEXT: [[SHR:%.*]] = lshr i32 [[TMP100]], 16
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks quite a lot worse btw. I'm not sure what is going on, but I know the alt instructions, while not useful directly for aarch64, can lead to better vectorization if we can shuffle nicely in the right places.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did not find a suitable way to detect it, will disable it for AArch64/ARM targets for now

github-actions bot pushed a commit to arm/arm-toolchain that referenced this pull request Jan 31, 2025
@alexey-bataev
Copy link
Member Author

Thanks for the revert - I was just double-checking if we get it with plain RVV and indeed we do.

repro.zip

Shouldn't be necessary, but this reproduces it early on when bootstrapping LLVM (x86-64 host, cross-compiling for RISCV rva23u64 - this could be simplified, it's just adapted from what we do in CI):

cd llvm-project
mkdir -p build && cd build
mkdir -p stage1 stage1.install stage2
printf "!!!!!!!!!! Building stage 1 !!!!!!!!!!\n"
cd stage1
cmake -G Ninja \
  -DCMAKE_BUILD_TYPE=Release \
  -DLLVM_ENABLE_ASSERTIONS=True \
  -DLLVM_LIT_ARGS="-v" \
  -DCMAKE_INSTALL_PREFIX=../stage1.install \
  -DCMAKE_C_COMPILER=clang \
  -DCMAKE_CXX_COMPILER=clang++ \
  -DLLVM_ENABLE_LLD=True \
  -DLLVM_TARGETS_TO_BUILD="RISCV" \
  -DCMAKE_C_COMPILER_LAUNCHER=ccache \
  -DCMAKE_CXX_COMPILER_LAUNCHER=ccache \
  -DLLVM_ENABLE_PROJECTS="lld;clang;clang-tools-extra;llvm" \
  ../../llvm
ninja
printf "!!!!!!!!!! Installing stage 1 !!!!!!!!!!\n"
ninja install
cd ../stage2
printf "!!!!!!!!!! Building stage 2 !!!!!!!!!!\n"
cat - <<EOF > stage1-toolchain.cmake
set(CMAKE_SYSTEM_NAME Linux)
set(CMAKE_SYSROOT $HOME/rvsysroot)
set(CMAKE_C_COMPILER_TARGET riscv64-linux-gnu)
set(CMAKE_CXX_COMPILER_TARGET riscv64-linux-gnu)
set(CMAKE_C_FLAGS_INIT "-march=rva23u64")
set(CMAKE_CXX_FLAGS_INIT "-march=rva23u64")
set(CMAKE_LINKER_TYPE LLD)
set(CMAKE_C_COMPILER $(pwd)/../stage1.install/bin/clang)
set(CMAKE_CXX_COMPILER $(pwd)/../stage1.install/bin/clang++)
set(CMAKE_C_COMPILER_LAUNCHER ccache)
set(CMAKE_CXX_COMPILER_LAUNCHER ccache)
set(CMAKE_FIND_ROOT_PATH_MODE_PROGRAM NEVER)
set(CMAKE_FIND_ROOT_PATH_MODE_LIBRARY ONLY)
set(CMAKE_FIND_ROOT_PATH_MODE_INCLUDE ONLY)
set(CMAKE_FIND_ROOT_PATH_MODE_PACKAGE ONLY)
EOF
cmake -G Ninja \
  -DCMAKE_BUILD_TYPE=Release \
  -DLLVM_ENABLE_ASSERTIONS=True \
  -DLLVM_LIT_ARGS="-v" \
  -DLLVM_NATIVE_TOOL_DIR=$(pwd)/../stage1.install/bin \
  -DLLVM_BUILD_TESTS=True \
  -DLLVM_ENABLE_PROJECTS="lld;clang;clang-tools-extra;llvm" \
  -DCMAKE_TOOLCHAIN_FILE=$(pwd)/stage1-toolchain.cmake \
  -DLLVM_HOST_TRIPLE=riscv64-linux-gnu \
  ../../llvm
ninja

Thanks Alex, reproducer was enough to locate and fix the bug

alexey-bataev added a commit that referenced this pull request Feb 1, 2025
Patch tries to remove wide alternate operations.
Currently SLP vectorizer emits something like this:
```
%0 = add i32
%1 = sub i32
%2 = add i32
%3 = sub i32
%4 = add i32
%5 = sub i32
%6 = add i32
%7 = sub i32

transformes to

%v1 = add <8 x i32>
%v2 = sub <8 x i32>
%res = shuffle %v1, %v2, <0, 9, 2, 11, 4, 13, 6, 15>
```
i.e. half of the results are just unused. This leads to increased
register pressure and potentially doubles number of operations.

Patch introduces SplitVectorize mode, where it splits the operations by
opcodes and produces instead something like this:
```
%v1 = add <4 x i32>
%v2 = sub <4 x i32>
%res = shuffle %v1, %v2, <0, 4, 1, 5, 2, 6, 3, 7>
```
It allows to improve the performance by reducing number of ops. Also, it
turns on some other improvements, like improved graph reordering.

-O3+LTO, AVX512
Metric: size..text
Program                                                                         size..text
                                                                                            results     results0    diff
           test-suite :: MultiSource/Benchmarks/Prolangs-C/TimberWolfMC/timberwolfmc.test   277800.00   280536.00  1.0%
                          test-suite :: MultiSource/Benchmarks/FreeBench/pifft/pifft.test    81802.00    82426.00  0.8%
                        test-suite :: External/SPEC/CINT2006/464.h264ref/464.h264ref.test   790552.00   790952.00  0.1%
                             test-suite :: MultiSource/Applications/JM/ldecod/ldecod.test   383795.00   383987.00  0.1%
           test-suite :: External/SPEC/CINT2017speed/600.perlbench_s/600.perlbench_s.test  2075541.00  2076501.00  0.0%
            test-suite :: External/SPEC/CINT2017rate/500.perlbench_r/500.perlbench_r.test  2075541.00  2076501.00  0.0%
                                  test-suite :: MultiSource/Benchmarks/Bullet/bullet.test   312702.00   312766.00  0.0%
                 test-suite :: External/SPEC/CFP2017rate/526.blender_r/526.blender_r.test 12569783.00 12569751.00 -0.0%
                   test-suite :: External/SPEC/CFP2017rate/510.parest_r/510.parest_r.test  2049374.00  2049358.00 -0.0%
                    test-suite :: External/SPEC/CINT2006/400.perlbench/400.perlbench.test  1091836.00  1091772.00 -0.0%
                             test-suite :: MultiSource/Applications/JM/lencod/lencod.test   852339.00   852211.00 -0.0%
                                test-suite :: MultiSource/Applications/oggenc/oggenc.test   190651.00   190523.00 -0.1%
                test-suite :: MultiSource/Benchmarks/DOE-ProxyApps-C/miniGMG/miniGMG.test    44203.00    44155.00 -0.1%
test-suite :: SingleSource/UnitTests/Vector/AVX512BWVL/Vector-AVX512BWVL-mask_set_bw.test    12997.00    12981.00 -0.1%
                     test-suite :: External/SPEC/CINT2017speed/625.x264_s/625.x264_s.test   668971.00   658427.00 -1.6%
                      test-suite :: External/SPEC/CINT2017rate/525.x264_r/525.x264_r.test   668971.00   658427.00 -1.6%

Prolangs-C/TimberWolfMC/timberwolfmc - small variations, some code not
inlined
FreeBench/pifft - extra stores <8 x double> vectorized, some other extra
vectorizations
CINT2006/464.h264ref - some smaller code + changes similar to x264
JM/ldecod - changes similar x264
CINT2017speed/600.perlbench_s
CINT2017rate/500.perlbench_r - significantly compact vector code
Benchmarks/Bullet - small variations
CFP2017rate/526.blender_r - small variations
CFP2017rate/510.parest_r - small variations
CINT2006/400.perlbench - extra vector code
JM/lencod - extra store <16 x i32> and other changes similar x264
Applications/oggenc - extra store <16 x i8>, small variations
DOE-ProxyApps-C/miniGMG - small variations
Vector/AVX512BWVL/Vector-AVX512BWVL-mask_set_bw - better vector code
CINT2017speed/625.x264_s
CINT2017rate/525.x264_r - the number of instructions increased, but
looks like they are more performant. E.g., for function
x264_pixel_satd_8x8, llvm-mca reports better throughput - 84 for the
current version and 59 for the new version.

-O3+LTO, march=rva32u64

CINT2017rate/525.x264_r - similar to x86, extra code in pixel_hadamard_ac
function vectorized, idct4x4dc stopped being vectorized (looks like
issue with shuffles cost)
CINT2006/400.perlbench - better vector code
CINT2006/445.gobmk - some variations in vector code
CINT2006/464.h264ref - extra code vectorized
CINT2017rate/500.perlbench_r - small variations

-O3+LTO, mcpu=sifive-p470

Metric: size..text

Program                                                                                                                                                size..text
                                                                               results    results0   diff
             test-suite :: External/SPEC/CINT2006/464.h264ref/464.h264ref.test  587336.00  587668.00  0.1%
                  test-suite :: MultiSource/Applications/JM/lencod/lencod.test  643308.00  643614.00  0.0%
                    test-suite :: MultiSource/Applications/d/make_dparser.test   79678.00   79710.00  0.0%
                       test-suite :: MultiSource/Benchmarks/Bullet/bullet.test  277322.00  277420.00  0.0%
         test-suite :: External/SPEC/CINT2006/400.perlbench/400.perlbench.test  933660.00  933682.00  0.0%
      test-suite :: External/SPEC/CFP2017rate/526.blender_r/526.blender_r.test 9497722.00 9497682.00 -0.0%
 test-suite :: External/SPEC/CINT2017rate/500.perlbench_r/500.perlbench_r.test 1767806.00 1767772.00 -0.0%
test-suite :: External/SPEC/CINT2017speed/600.perlbench_s/600.perlbench_s.test 1767806.00 1767772.00 -0.0%
 test-suite :: MultiSource/Benchmarks/MiBench/consumer-lame/consumer-lame.test  148038.00  148024.00 -0.0%
                  test-suite :: MultiSource/Applications/JM/ldecod/ldecod.test  283036.00  283008.00 -0.0%
   test-suite :: MultiSource/Benchmarks/mediabench/g721/g721encode/encode.test    4776.00    4772.00 -0.1%
           test-suite :: External/SPEC/CINT2017rate/525.x264_r/525.x264_r.test  540582.00  511772.00 -5.3%
          test-suite :: External/SPEC/CINT2017speed/625.x264_s/625.x264_s.test  540582.00  511772.00 -5.3%

CINT2006/464.h264ref - extra vector code in find_sad_16x16
JM/lencod - extra vector code in find_sad_16x16
d/make_dparser - smaller vector code
Benchmarks/Bullet - small variations
CINT2006/400.perlbench - smaller vector code
CFP2017rate/526.blender_r - small variations, extra store <8 x float> in
the loop, extra store <8 x i8> in loop
CINT2017rate/500.perlbench_r
CINT2017speed/600.perlbench_s - small variations
MiBench/consumer-lame - small variations
JM/ldecod - extra vector code
mediabench/g721/g721encode - small variations
CINT2017rate/525.x264_r
CINT2017speed/625.x264_s - reduced number of wide operations and
shuffles, saving the registers, similar to X86, extra code in
pixel_hadamard_ac vectorized, idct4x4dc not vectorized (issue with some
TTI costs)

Reviewers: RKSimon, hiraditya

Reviewed By: RKSimon

Pull Request: #123360
github-actions bot pushed a commit to arm/arm-toolchain that referenced this pull request Feb 1, 2025
Patch tries to remove wide alternate operations.
Currently SLP vectorizer emits something like this:
```
%0 = add i32
%1 = sub i32
%2 = add i32
%3 = sub i32
%4 = add i32
%5 = sub i32
%6 = add i32
%7 = sub i32

transformes to

%v1 = add <8 x i32>
%v2 = sub <8 x i32>
%res = shuffle %v1, %v2, <0, 9, 2, 11, 4, 13, 6, 15>
```
i.e. half of the results are just unused. This leads to increased
register pressure and potentially doubles number of operations.

Patch introduces SplitVectorize mode, where it splits the operations by
opcodes and produces instead something like this:
```
%v1 = add <4 x i32>
%v2 = sub <4 x i32>
%res = shuffle %v1, %v2, <0, 4, 1, 5, 2, 6, 3, 7>
```
It allows to improve the performance by reducing number of ops. Also, it
turns on some other improvements, like improved graph reordering.

-O3+LTO, AVX512
Metric: size..text
Program                                                                         size..text
                                                                                            results     results0    diff
           test-suite :: MultiSource/Benchmarks/Prolangs-C/TimberWolfMC/timberwolfmc.test   277800.00   280536.00  1.0%
                          test-suite :: MultiSource/Benchmarks/FreeBench/pifft/pifft.test    81802.00    82426.00  0.8%
                        test-suite :: External/SPEC/CINT2006/464.h264ref/464.h264ref.test   790552.00   790952.00  0.1%
                             test-suite :: MultiSource/Applications/JM/ldecod/ldecod.test   383795.00   383987.00  0.1%
           test-suite :: External/SPEC/CINT2017speed/600.perlbench_s/600.perlbench_s.test  2075541.00  2076501.00  0.0%
            test-suite :: External/SPEC/CINT2017rate/500.perlbench_r/500.perlbench_r.test  2075541.00  2076501.00  0.0%
                                  test-suite :: MultiSource/Benchmarks/Bullet/bullet.test   312702.00   312766.00  0.0%
                 test-suite :: External/SPEC/CFP2017rate/526.blender_r/526.blender_r.test 12569783.00 12569751.00 -0.0%
                   test-suite :: External/SPEC/CFP2017rate/510.parest_r/510.parest_r.test  2049374.00  2049358.00 -0.0%
                    test-suite :: External/SPEC/CINT2006/400.perlbench/400.perlbench.test  1091836.00  1091772.00 -0.0%
                             test-suite :: MultiSource/Applications/JM/lencod/lencod.test   852339.00   852211.00 -0.0%
                                test-suite :: MultiSource/Applications/oggenc/oggenc.test   190651.00   190523.00 -0.1%
                test-suite :: MultiSource/Benchmarks/DOE-ProxyApps-C/miniGMG/miniGMG.test    44203.00    44155.00 -0.1%
test-suite :: SingleSource/UnitTests/Vector/AVX512BWVL/Vector-AVX512BWVL-mask_set_bw.test    12997.00    12981.00 -0.1%
                     test-suite :: External/SPEC/CINT2017speed/625.x264_s/625.x264_s.test   668971.00   658427.00 -1.6%
                      test-suite :: External/SPEC/CINT2017rate/525.x264_r/525.x264_r.test   668971.00   658427.00 -1.6%

Prolangs-C/TimberWolfMC/timberwolfmc - small variations, some code not
inlined
FreeBench/pifft - extra stores <8 x double> vectorized, some other extra
vectorizations
CINT2006/464.h264ref - some smaller code + changes similar to x264
JM/ldecod - changes similar x264
CINT2017speed/600.perlbench_s
CINT2017rate/500.perlbench_r - significantly compact vector code
Benchmarks/Bullet - small variations
CFP2017rate/526.blender_r - small variations
CFP2017rate/510.parest_r - small variations
CINT2006/400.perlbench - extra vector code
JM/lencod - extra store <16 x i32> and other changes similar x264
Applications/oggenc - extra store <16 x i8>, small variations
DOE-ProxyApps-C/miniGMG - small variations
Vector/AVX512BWVL/Vector-AVX512BWVL-mask_set_bw - better vector code
CINT2017speed/625.x264_s
CINT2017rate/525.x264_r - the number of instructions increased, but
looks like they are more performant. E.g., for function
x264_pixel_satd_8x8, llvm-mca reports better throughput - 84 for the
current version and 59 for the new version.

-O3+LTO, march=rva32u64

CINT2017rate/525.x264_r - similar to x86, extra code in pixel_hadamard_ac
function vectorized, idct4x4dc stopped being vectorized (looks like
issue with shuffles cost)
CINT2006/400.perlbench - better vector code
CINT2006/445.gobmk - some variations in vector code
CINT2006/464.h264ref - extra code vectorized
CINT2017rate/500.perlbench_r - small variations

-O3+LTO, mcpu=sifive-p470

Metric: size..text

Program                                                                                                                                                size..text
                                                                               results    results0   diff
             test-suite :: External/SPEC/CINT2006/464.h264ref/464.h264ref.test  587336.00  587668.00  0.1%
                  test-suite :: MultiSource/Applications/JM/lencod/lencod.test  643308.00  643614.00  0.0%
                    test-suite :: MultiSource/Applications/d/make_dparser.test   79678.00   79710.00  0.0%
                       test-suite :: MultiSource/Benchmarks/Bullet/bullet.test  277322.00  277420.00  0.0%
         test-suite :: External/SPEC/CINT2006/400.perlbench/400.perlbench.test  933660.00  933682.00  0.0%
      test-suite :: External/SPEC/CFP2017rate/526.blender_r/526.blender_r.test 9497722.00 9497682.00 -0.0%
 test-suite :: External/SPEC/CINT2017rate/500.perlbench_r/500.perlbench_r.test 1767806.00 1767772.00 -0.0%
test-suite :: External/SPEC/CINT2017speed/600.perlbench_s/600.perlbench_s.test 1767806.00 1767772.00 -0.0%
 test-suite :: MultiSource/Benchmarks/MiBench/consumer-lame/consumer-lame.test  148038.00  148024.00 -0.0%
                  test-suite :: MultiSource/Applications/JM/ldecod/ldecod.test  283036.00  283008.00 -0.0%
   test-suite :: MultiSource/Benchmarks/mediabench/g721/g721encode/encode.test    4776.00    4772.00 -0.1%
           test-suite :: External/SPEC/CINT2017rate/525.x264_r/525.x264_r.test  540582.00  511772.00 -5.3%
          test-suite :: External/SPEC/CINT2017speed/625.x264_s/625.x264_s.test  540582.00  511772.00 -5.3%

CINT2006/464.h264ref - extra vector code in find_sad_16x16
JM/lencod - extra vector code in find_sad_16x16
d/make_dparser - smaller vector code
Benchmarks/Bullet - small variations
CINT2006/400.perlbench - smaller vector code
CFP2017rate/526.blender_r - small variations, extra store <8 x float> in
the loop, extra store <8 x i8> in loop
CINT2017rate/500.perlbench_r
CINT2017speed/600.perlbench_s - small variations
MiBench/consumer-lame - small variations
JM/ldecod - extra vector code
mediabench/g721/g721encode - small variations
CINT2017rate/525.x264_r
CINT2017speed/625.x264_s - reduced number of wide operations and
shuffles, saving the registers, similar to X86, extra code in
pixel_hadamard_ac vectorized, idct4x4dc not vectorized (issue with some
TTI costs)

Reviewers: RKSimon, hiraditya

Reviewed By: RKSimon

Pull Request: llvm/llvm-project#123360
@mstorsjo
Copy link
Member

mstorsjo commented Feb 2, 2025

The relanded version of this is causing failed asserts:

typedef void (*filter_ss_t)(const short *, long long, short *, long long, int);
filter_ss_t pu_0_0;
template <int N, int width, int height>
void interp_vert_ss_c(const short *src, long long srcStride, short *dst,
                      long long dstStride, int) {
  int row, col;
  src -= (N / 2 - 1) * srcStride;
  row = 0;
  for (; row < height; row++) {
    col = 0;
    for (; col < width; col++) {
      int sum = src[col];
      sum += src[col + srcStride];
      sum += src[col + 2 * srcStride];
      sum += src[col + 3 * srcStride];
      sum += src[col + 4 * srcStride];
      sum += src[col + 5 * srcStride];
      sum += src[col + 6 * srcStride] * 6;
      sum += src[col + 7 * srcStride] * 7;
      dst[col] = sum;
    }
    src += srcStride;
    dst += dstStride;
  }
}
void setupFilterPrimitives_c() { pu_0_0 = interp_vert_ss_c<8, 16, 4>; }
$ clang -target x86_64-linux-gnu -c -O2 repro.cpp
clang: ../include/llvm/ADT/ArrayRef.h:260: const T& llvm::ArrayRef<T>::operator[](size_t) const [with T = int; size_t = long unsigned int]: Assertion `Index < Length && "Invalid index!"' failed.

I can push a revert later today unless you happen to get to fixing it sooner than that.

@alexey-bataev
Copy link
Member Author

Feel free to revert

mstorsjo added a commit that referenced this pull request Feb 2, 2025
This reverts commit d5a7a48.

That commit triggers failed asserts, see
#123360 for details.
@mstorsjo
Copy link
Member

mstorsjo commented Feb 2, 2025

Reverted it now, thanks!

github-actions bot pushed a commit to arm/arm-toolchain that referenced this pull request Feb 2, 2025
… possible"

This reverts commit d5a7a48.

That commit triggers failed asserts, see
llvm/llvm-project#123360 for details.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants