Skip to content

[LSV] Insert casts to vectorize mismatched types #134436

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 7 commits into
base: main
Choose a base branch
from

Conversation

gandhi56
Copy link
Contributor

@gandhi56 gandhi56 commented Apr 4, 2025

After collecting equivalence classes, loop over each distinct pair of them and check if they could be merged into one.

Consider classes A and B such that their leaders differ only by their scalar bitwidths. We do not yet merge them otherwise. Let N be the scalar bitwidth of the leader instruction in A. Iterate over all instructions in B and ensure their total bitwidths match the total bitwidth of the leader instruction of A. Finally, cast each instruction in B with a mismatched type to a pointer, integer or floating point type.

Resolve issue #97715

@llvmbot
Copy link
Member

llvmbot commented Apr 4, 2025

@llvm/pr-subscribers-backend-nvptx
@llvm/pr-subscribers-llvm-transforms

@llvm/pr-subscribers-vectorizers

Author: Anshil Gandhi (gandhi56)

Changes

After collecting equivalence classes, loop over each distinct pair of them and check if they could be merged into one.

Consider classes A and B such that their leaders differ only by their scalar bitwidth. (We do not merge them otherwise.) Let N be the scalar bitwidth of the leader instruction in A. Iterate over all instructions in B and ensure their total bitwidths match the total bitwidth of the leader instruction of A. Finally, cast each instruction in B with a mismatched type to an intN type.


Full diff: https://github.com/llvm/llvm-project/pull/134436.diff

2 Files Affected:

  • (modified) llvm/lib/Transforms/Vectorize/LoadStoreVectorizer.cpp (+81-1)
  • (modified) llvm/test/Transforms/LoadStoreVectorizer/AMDGPU/merge-vectors.ll (+27-3)
diff --git a/llvm/lib/Transforms/Vectorize/LoadStoreVectorizer.cpp b/llvm/lib/Transforms/Vectorize/LoadStoreVectorizer.cpp
index 04b392829f0d7..c94f10fb8b855 100644
--- a/llvm/lib/Transforms/Vectorize/LoadStoreVectorizer.cpp
+++ b/llvm/lib/Transforms/Vectorize/LoadStoreVectorizer.cpp
@@ -324,6 +324,10 @@ class Vectorizer {
       Instruction *ChainElem, Instruction *ChainBegin,
       const DenseMap<Instruction *, APInt /*OffsetFromLeader*/> &ChainOffsets);
 
+  /// Merge the equivalence classes if casts could be inserted in one to match
+  /// the scalar bitwidth of the instructions in the other class.
+  void insertCastsToMergeClasses(EquivalenceClassMap &EQClasses);
+
   /// Merges the equivalence classes if they have underlying objects that differ
   /// by one level of indirection (i.e., one is a getelementptr and the other is
   /// the base pointer in that getelementptr).
@@ -1310,6 +1314,82 @@ std::optional<APInt> Vectorizer::getConstantOffsetSelects(
   return std::nullopt;
 }
 
+void Vectorizer::insertCastsToMergeClasses(EquivalenceClassMap &EQClasses) {
+  if (EQClasses.size() < 2)
+    return;
+
+  // Loop over all equivalence classes and try to merge them. Keep track of
+  // classes that are merged into others.
+  DenseSet<EqClassKey> ClassesToErase;
+  for (auto EC1 : EQClasses) {
+    for (auto EC2 : EQClasses) {
+      if (ClassesToErase.contains(EC2.first) || EC1 <= EC2)
+        continue;
+
+      auto [Ptr1, AS1, TySize1, IsLoad1] = EC1.first;
+      auto [Ptr2, AS2, TySize2, IsLoad2] = EC2.first;
+
+      // Attempt to merge EC2 into EC1. Skip if the pointers, address spaces or
+      // whether the leader instruction is a load/store are different. Also skip
+      // if the scalar bitwidth of the first equivalence class is smaller than
+      // the second one to avoid reconsidering the same equivalence class pair.
+      if (Ptr1 != Ptr2 || AS1 != AS2 || IsLoad1 != IsLoad2 || TySize1 < TySize2)
+        continue;
+
+      // Ensure all instructions in EC2 can be bitcasted into NewTy.
+      /// TODO: NewTyBits is needed as stuctured binded variables cannot be
+      /// captured by a lambda until C++20.
+      auto NewTyBits = std::get<2>(EC1.first);
+      if (any_of(EC2.second, [&](Instruction *I) {
+            return DL.getTypeSizeInBits(getLoadStoreType(I)) != NewTyBits;
+          }))
+        continue;
+
+      // Create a new type for the equivalence class.
+      /// TODO: NewTy should be an FP type for an all-FP equivalence class.
+      auto *NewTy = Type::getIntNTy(EC2.second[0]->getContext(), NewTyBits);
+      for (auto *Inst : EC2.second) {
+        auto *Ptr = getLoadStorePointerOperand(Inst);
+        auto *OrigTy = Inst->getType();
+        if (OrigTy == NewTy)
+          continue;
+        if (auto *LI = dyn_cast<LoadInst>(Inst)) {
+          Builder.SetInsertPoint(LI->getIterator());
+          auto *NewLoad = Builder.CreateLoad(NewTy, Ptr);
+          auto *Cast = Builder.CreateBitOrPointerCast(
+              NewLoad, OrigTy, NewLoad->getName() + ".cast");
+          LI->replaceAllUsesWith(Cast);
+          LI->eraseFromParent();
+          EQClasses[EC1.first].emplace_back(NewLoad);
+        } else {
+          auto *SI = cast<StoreInst>(Inst);
+          Builder.SetInsertPoint(SI->getIterator());
+          auto *Cast = Builder.CreateBitOrPointerCast(
+              SI->getValueOperand(), NewTy,
+              SI->getValueOperand()->getName() + ".cast");
+          auto *NewStore = Builder.CreateStore(
+              Cast, getLoadStorePointerOperand(SI), SI->isVolatile());
+          SI->eraseFromParent();
+          EQClasses[EC1.first].emplace_back(NewStore);
+        }
+      }
+
+      // Sort the instructions in the equivalence class by their order in the
+      // basic block. This is important to ensure that the instructions are
+      // vectorized in the correct order.
+      std::sort(EQClasses[EC1.first].begin(), EQClasses[EC1.first].end(),
+                [](Instruction *A, Instruction *B) {
+                  return A && B && A->comesBefore(B);
+                });
+      ClassesToErase.insert(EC2.first);
+    }
+  }
+
+  // Erase the equivalence classes that were merged into others.
+  for (auto Key : ClassesToErase)
+    EQClasses.erase(Key);
+}
+
 void Vectorizer::mergeEquivalenceClasses(EquivalenceClassMap &EQClasses) const {
   if (EQClasses.size() < 2) // There is nothing to merge.
     return;
@@ -1495,7 +1575,7 @@ Vectorizer::collectEquivalenceClasses(BasicBlock::iterator Begin,
          /*IsLoad=*/LI != nullptr}]
         .emplace_back(&I);
   }
-
+  insertCastsToMergeClasses(Ret);
   mergeEquivalenceClasses(Ret);
   return Ret;
 }
diff --git a/llvm/test/Transforms/LoadStoreVectorizer/AMDGPU/merge-vectors.ll b/llvm/test/Transforms/LoadStoreVectorizer/AMDGPU/merge-vectors.ll
index ede2e4066c263..c364bc2da4c5d 100644
--- a/llvm/test/Transforms/LoadStoreVectorizer/AMDGPU/merge-vectors.ll
+++ b/llvm/test/Transforms/LoadStoreVectorizer/AMDGPU/merge-vectors.ll
@@ -95,10 +95,10 @@ entry:
   ret void
 }
 
-; Ideally this would be merged
 ; CHECK-LABEL: @merge_load_i32_v2i16(
-; CHECK: load i32,
-; CHECK: load <2 x i16>
+; CHECK: load <2 x i32>
+; CHECK: extractelement <2 x i32> %0, i32 0
+; CHECK: extractelement <2 x i32> %0, i32 1
 define amdgpu_kernel void @merge_load_i32_v2i16(ptr addrspace(1) nocapture %a) #0 {
 entry:
   %a.1 = getelementptr inbounds i32, ptr addrspace(1) %a, i32 1
@@ -111,3 +111,27 @@ entry:
 
 attributes #0 = { nounwind }
 attributes #1 = { nounwind readnone }
+
+; CHECK-LABEL: @merge_i32_2i16_float_4i8(
+; CHECK: load <4 x i32>
+; CHECK: store <2 x i32>
+; CHECK: store <2 x i32>
+define void @merge_i32_2i16_float_4i8(ptr addrspace(1) %ptr1, ptr addrspace(2) %ptr2) {
+  %gep1 = getelementptr inbounds i32, ptr addrspace(1) %ptr1, i64 0
+  %load1 = load i32, ptr addrspace(1) %gep1, align 4
+  %gep2 = getelementptr inbounds <2 x i16>, ptr addrspace(1) %ptr1, i64 1
+  %load2 = load <2 x i16>, ptr addrspace(1) %gep2, align 4
+  %gep3 = getelementptr inbounds float, ptr addrspace(1) %ptr1, i64 2
+  %load3 = load float, ptr addrspace(1) %gep3, align 4
+  %gep4 = getelementptr inbounds <4 x i8>, ptr addrspace(1) %ptr1, i64 3
+  %load4 = load <4 x i8>, ptr addrspace(1) %gep4, align 4
+  %store.gep1 = getelementptr inbounds i32, ptr addrspace(2) %ptr2, i64 0
+  store i32 %load1, ptr addrspace(2) %store.gep1, align 4
+  %store.gep2 = getelementptr inbounds <2 x i16>, ptr addrspace(2) %ptr2, i64 1
+  store <2 x i16> %load2, ptr addrspace(2) %store.gep2, align 4
+  %store.gep3 = getelementptr inbounds float, ptr addrspace(2) %ptr2, i64 2
+  store float %load3, ptr addrspace(2) %store.gep3, align 4
+  %store.gep4 = getelementptr inbounds <4 x i8>, ptr addrspace(2) %ptr2, i64 3
+  store <4 x i8> %load4, ptr addrspace(2) %store.gep4, align 4
+  ret void
+}

@gandhi56 gandhi56 force-pushed the issue-97715/lsv-insert-casts branch 3 times, most recently from 87f500d to a37da6c Compare April 9, 2025 19:20
Copy link

github-actions bot commented Apr 9, 2025

⚠️ C/C++ code formatter, clang-format found issues in your code. ⚠️

You can test this locally with the following command:
git-clang-format --diff HEAD~1 HEAD --extensions cpp,h -- llvm/include/llvm/Transforms/Utils/Local.h llvm/lib/Transforms/Utils/Local.cpp llvm/lib/Transforms/Vectorize/LoadStoreVectorizer.cpp
View the diff from clang-format here.
diff --git a/llvm/lib/Transforms/Vectorize/LoadStoreVectorizer.cpp b/llvm/lib/Transforms/Vectorize/LoadStoreVectorizer.cpp
index cfa6a02c2..8c117e990 100644
--- a/llvm/lib/Transforms/Vectorize/LoadStoreVectorizer.cpp
+++ b/llvm/lib/Transforms/Vectorize/LoadStoreVectorizer.cpp
@@ -239,9 +239,7 @@ void reorder(Instruction *I) {
 
 class Vectorizer {
 
-  enum ClassTyDist {
-    Int, Float, Ptr, Other
-  };
+  enum ClassTyDist { Int, Float, Ptr, Other };
 
   Function &F;
   AliasAnalysis &AA;
@@ -281,11 +279,11 @@ private:
 
   static int getTypeKind(Instruction *I) {
     unsigned ID = I->getType()->getTypeID();
-    switch(ID) {
-      case Type::IntegerTyID:
-      case Type::FloatTyID:
-      case Type::PointerTyID:
-        return ID;
+    switch (ID) {
+    case Type::IntegerTyID:
+    case Type::FloatTyID:
+    case Type::PointerTyID:
+      return ID;
     };
     return -1;
   }

@gandhi56 gandhi56 force-pushed the issue-97715/lsv-insert-casts branch from a37da6c to 9d03603 Compare April 9, 2025 23:23
@gandhi56 gandhi56 requested a review from broxigarchen April 10, 2025 18:24
@gandhi56 gandhi56 self-assigned this Apr 13, 2025
@gandhi56 gandhi56 force-pushed the issue-97715/lsv-insert-casts branch 3 times, most recently from 706665f to 3024610 Compare April 30, 2025 19:38
@@ -1495,7 +1629,7 @@ Vectorizer::collectEquivalenceClasses(BasicBlock::iterator Begin,
/*IsLoad=*/LI != nullptr}]
.emplace_back(&I);
}

insertCastsToMergeClasses(Ret);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is eagerly mutating the IR before vectorization is performed? Should try to only select a type, and coerce as part of the final vectorization

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After merging equivalence classes, LSV converts them into chains at which point it is too late to introduce cast instructions.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is it too late to introduce cast instructions at that point?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Theoretically, it may be possible. The necessary changes seem cumbersome to me, however. After collecting classes, LSV extracts chains from each class. These chains are split based on contiguity, alignment and MayAlias instructions, before vectorization. Merging chains after these splits would require careful handling of their instructions as vectorizeChain makes certain assumptions before determining the type of vectorized load/store.

I prefer to insert casts alongside mergeEquivalenceClasses(..) as gatherChains already understands what kinds of chains are handlable by vectorizeChain.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's important to not make pointless IR changes, and only do this if it vectorization will occur

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's a good point, I will postpone the merging of class until vectorization then. Thanks!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I took a deep dive into inserting casts before vectorizeChain.

  • Firstly, all gathered chains are guaranteed to be vectorized except if the chain only has one element.
  • gatherChains is responsible for calculating offsets. The splitChain functions further splits the gathered chains based on the legality of prospective vectorization. Inserting casts after these functions may break the legality of these chains as the offsets may no longer be correct. There are two ways to tackle this challenge:
    • Recompute the offsets and re-run the split chain functions - sounds like too much of an overhead to me.
    • Store chains in a heap-based data structure which can preserve legality, further demanding a lot of bookkeeping in order to replace the splitChain functions - this approach seems quite inscalable for longer chains.

gandhi56 added a commit that referenced this pull request May 1, 2025
gandhi56 added a commit that referenced this pull request May 1, 2025
Autogenerate checks for merge-vectors.ll and introduce
merge-vectors-complex.ll with mismatched types.
Related PR: #134436

This is a reland of #138155,
which was reverted due to missed nits.
IanWood1 pushed a commit to IanWood1/llvm-project that referenced this pull request May 6, 2025
IanWood1 pushed a commit to IanWood1/llvm-project that referenced this pull request May 6, 2025
Autogenerate checks for merge-vectors.ll and introduce
merge-vectors-complex.ll with mismatched types.
Related PR: llvm#134436

This is a reland of llvm#138155,
which was reverted due to missed nits.
IanWood1 pushed a commit to IanWood1/llvm-project that referenced this pull request May 6, 2025
IanWood1 pushed a commit to IanWood1/llvm-project that referenced this pull request May 6, 2025
Autogenerate checks for merge-vectors.ll and introduce
merge-vectors-complex.ll with mismatched types.
Related PR: llvm#134436

This is a reland of llvm#138155,
which was reverted due to missed nits.
IanWood1 pushed a commit to IanWood1/llvm-project that referenced this pull request May 6, 2025
IanWood1 pushed a commit to IanWood1/llvm-project that referenced this pull request May 6, 2025
Autogenerate checks for merge-vectors.ll and introduce
merge-vectors-complex.ll with mismatched types.
Related PR: llvm#134436

This is a reland of llvm#138155,
which was reverted due to missed nits.
llvm-sync bot pushed a commit to arm/arm-toolchain that referenced this pull request May 6, 2025
llvm-sync bot pushed a commit to arm/arm-toolchain that referenced this pull request May 6, 2025
Autogenerate checks for merge-vectors.ll and introduce
merge-vectors-complex.ll with mismatched types.
Related PR: llvm/llvm-project#134436

This is a reland of llvm/llvm-project#138155,
which was reverted due to missed nits.
@gandhi56 gandhi56 marked this pull request as draft May 6, 2025 17:26
GeorgeARM pushed a commit to GeorgeARM/llvm-project that referenced this pull request May 7, 2025
GeorgeARM pushed a commit to GeorgeARM/llvm-project that referenced this pull request May 7, 2025
Autogenerate checks for merge-vectors.ll and introduce
merge-vectors-complex.ll with mismatched types.
Related PR: llvm#134436

This is a reland of llvm#138155,
which was reverted due to missed nits.
Ankur-0429 pushed a commit to Ankur-0429/llvm-project that referenced this pull request May 9, 2025
Ankur-0429 pushed a commit to Ankur-0429/llvm-project that referenced this pull request May 9, 2025
Autogenerate checks for merge-vectors.ll and introduce
merge-vectors-complex.ll with mismatched types.
Related PR: llvm#134436

This is a reland of llvm#138155,
which was reverted due to missed nits.
@gandhi56 gandhi56 requested a review from Copilot May 27, 2025 16:12
Copy link
Contributor

@Copilot Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copilot encountered an error and was unable to review this pull request. You can try again by re-requesting a review.

@gandhi56 gandhi56 requested a review from Copilot May 27, 2025 16:15
Copy link
Contributor

@Copilot Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR introduces changes to several AMDGPU LLVM test files to insert casts that help vectorize mismatched types. Key updates include reordering and adjusting low‐level instruction sequences in test expectations, modified bit‐manipulation steps, and updates to disjoint flag check patterns to reflect the new cast insertion behavior.

Reviewed Changes

Copilot reviewed 35 out of 35 changed files in this pull request and generated no comments.

Show a summary per file
File Description
llvm/test/CodeGen/AMDGPU/dag-preserve-disjoint-flag.ll Updates expected disjoint V_OR_B32_e64 output with explicit naming
llvm/test/CodeGen/AMDGPU/build_vector.ll Reorders shift and move instructions for vector builds
llvm/test/CodeGen/AMDGPU/fcopysign.f32.ll Adjusts the ordering of BFI and register moves for copysign checks
llvm/test/CodeGen/AMDGPU/fabs.ll Refactors fabs test to use bitmasking (s_and_b32) for sign clearing
llvm/test/CodeGen/AMDGPU/fabs.f16.ll Updates fabs test for half types with similar bitmasking changes
Comments suppressed due to low confidence (5)

llvm/test/CodeGen/AMDGPU/build_vector.ll:271

  • The shift and move instructions have been reordered to build the vector from subvectors. Please ensure that the new sequence produces the exact intended bit layout across registers.
; GFX8-NEXT:    s_lshl_b32 s3, s3, 16

llvm/test/CodeGen/AMDGPU/fabs.f16.ll:218

  • The fabs test for half types has been updated to use s_and_b32 for masking the sign bits. Verify that the new mask value (0x7fff7fff) is correct and that the overall instruction ordering preserves the intended conversion from half to float and back.
; CI-NEXT:    s_and_b32 s3, s3, 0x7fff7fff

llvm/test/CodeGen/AMDGPU/dag-preserve-disjoint-flag.ll:51

  • The check pattern has been updated to use an explicit temporary (%10) instead of the previous placeholder. Please verify that the revised output string correctly reflects the expected hardware disjoint flag semantics and remains consistent with the overall test framework.
; CHECK-NEXT:   %10:vgpr_32 = disjoint V_OR_B32_e64 [[COPY1]], [[COPY]], implicit $exec

llvm/test/CodeGen/AMDGPU/fcopysign.f32.ll:475

  • The ordering of register moves and the use of the BFI instruction have changed in this file. Please verify that the new ordering correctly combines the magnitude and sign values without altering the intended bit-level semantics.
; SI-NEXT:    s_load_dwordx4 s[0:3], s[4:5], 0x9

llvm/test/CodeGen/AMDGPU/fabs.ll:120

  • The fabs transformation now uses s_and_b32 to clear the sign bit. Please double-check that this revised instruction sequence correctly removes the sign from the float value across all test scenarios.
; VI-NEXT:    s_and_b32 s0, s3, 0x7fffffff

@gandhi56 gandhi56 force-pushed the issue-97715/lsv-insert-casts branch from c0aa7c9 to 395d38c Compare June 1, 2025 22:54
@gandhi56 gandhi56 marked this pull request as ready for review June 1, 2025 23:56
gandhi56 added 7 commits June 2, 2025 22:08
This commit adds tests to introduce bitcasts
to vectorize loads and stores.
This commit adds tests to introduce bitcasts
for increased vectorization of loads and stores.
NFC.
After collecting equivalence classes, loop over
each distinct pair of them and check if they
could be merged into one.

Consider classes A and B such that their leaders
differ only by their scalar bitwidths. We do not
yet merge them otherwise. Let N be the scalar
bitwidth of the leader instruction in A. Iterate
over all instructions in B and ensure their total
bitwidths match the total bitwidth of the leader
instruction of A. Finally, cast each instruction
in B with a mismatched type to a pointer, integer
or floating point type.

Resolve issue llvm#97715

Change-Id: Ib64fd98de5c908262947648ad14dc53b61814642
@gandhi56 gandhi56 force-pushed the issue-97715/lsv-insert-casts branch from 0374ac2 to c8b279f Compare June 3, 2025 02:20
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants