
[VPlan] Use pointer to member 0 as VPInterleaveRecipe's pointer arg. #106431

Merged: 12 commits into llvm:main from vplan-interleave-start-address, Oct 6, 2024

Conversation

fhahn (Contributor) commented Aug 28, 2024

Update VPInterleaveRecipe to always use the pointer to member 0 as pointer argument. This in many cases helps to remove unneeded index adjustments and simplifies VPInterleaveRecipe::execute.

In some rare cases, the address of member 0 does not dominate the insert position of the interleave group. In those cases a PtrAdd VPInstruction is emitted to compute the address of member 0 based on the address of the insert position. Alternatively we could hoist the recipe computing the address of member 0.
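For illustration, a sketch of the effect on the generated IR (abridged from the interleave-cost.ll test changes below, with names shortened): instead of addressing the group through a later member and stepping back,

  %idx3 = or disjoint i64 %idx, 3
  %gep  = getelementptr float, ptr %dst, i64 %idx3
  %addr = getelementptr float, ptr %gep, i32 -3

the group is now addressed via member 0 directly,

  %gep  = getelementptr float, ptr %dst, i64 %idx
  %addr = getelementptr float, ptr %gep, i32 0

leaving only a redundant 0-index GEP, which the TODO in the patch expects a follow-up to remove.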

llvmbot (Collaborator) commented Aug 28, 2024

@llvm/pr-subscribers-llvm-transforms

Author: Florian Hahn (fhahn)

Patch is 32.76 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/106431.diff

10 Files Affected:

  • (modified) llvm/lib/Transforms/Vectorize/LoopVectorize.cpp (+2-2)
  • (modified) llvm/lib/Transforms/Vectorize/VPlan.h (+7)
  • (modified) llvm/lib/Transforms/Vectorize/VPlanRecipes.cpp (+7-5)
  • (modified) llvm/lib/Transforms/Vectorize/VPlanTransforms.cpp (+42-8)
  • (modified) llvm/lib/Transforms/Vectorize/VPlanTransforms.h (+3-1)
  • (modified) llvm/test/Transforms/LoopVectorize/X86/interleave-cost.ll (+7-10)
  • (modified) llvm/test/Transforms/LoopVectorize/X86/x86-interleaved-accesses-masked-group.ll (+6-12)
  • (modified) llvm/test/Transforms/LoopVectorize/interleaved-accesses-different-insert-position.ll (+5-3)
  • (modified) llvm/test/Transforms/LoopVectorize/interleaved-accesses.ll (+9-16)
  • (modified) llvm/test/Transforms/LoopVectorize/vplan-printing.ll (+2-3)
diff --git a/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp b/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
index cb104c4ed2d03d..1e92bda5494248 100644
--- a/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
+++ b/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
@@ -8993,8 +8993,8 @@ LoopVectorizationPlanner::tryToBuildVPlanWithVPRecipes(VFRange &Range) {
   // Interleave memory: for each Interleave Group we marked earlier as relevant
   // for this VPlan, replace the Recipes widening its memory instructions with a
   // single VPInterleaveRecipe at its insertion point.
-  VPlanTransforms::createInterleaveGroups(InterleaveGroups, RecipeBuilder,
-                                          CM.isScalarEpilogueAllowed());
+  VPlanTransforms::createInterleaveGroups(
+      *Plan, InterleaveGroups, RecipeBuilder, CM.isScalarEpilogueAllowed());
 
   for (ElementCount VF : Range)
     Plan->addVF(VF);
diff --git a/llvm/lib/Transforms/Vectorize/VPlan.h b/llvm/lib/Transforms/Vectorize/VPlan.h
index e7ea5cb23b90d3..b08cba798a9a9b 100644
--- a/llvm/lib/Transforms/Vectorize/VPlan.h
+++ b/llvm/lib/Transforms/Vectorize/VPlan.h
@@ -1330,6 +1330,13 @@ class VPInstruction : public VPRecipeWithIRFlags {
     assert(Opcode == Instruction::Or && "only OR opcodes can be disjoint");
   }
 
+  VPInstruction(VPValue *Ptr, VPValue *Offset, bool InBounds, DebugLoc DL = {},
+                const Twine &Name = "")
+      : VPRecipeWithIRFlags(VPDef::VPInstructionSC,
+                            ArrayRef<VPValue *>({Ptr, Offset}),
+                            GEPFlagsTy(InBounds), DL),
+        Opcode(VPInstruction::PtrAdd), Name(Name.str()) {}
+
   VPInstruction(unsigned Opcode, std::initializer_list<VPValue *> Operands,
                 FastMathFlags FMFs, DebugLoc DL = {}, const Twine &Name = "");
 
diff --git a/llvm/lib/Transforms/Vectorize/VPlanRecipes.cpp b/llvm/lib/Transforms/Vectorize/VPlanRecipes.cpp
index f84317ba51257a..df4e3a54812626 100644
--- a/llvm/lib/Transforms/Vectorize/VPlanRecipes.cpp
+++ b/llvm/lib/Transforms/Vectorize/VPlanRecipes.cpp
@@ -646,7 +646,9 @@ Value *VPInstruction::generatePerPart(VPTransformState &State, unsigned Part) {
            "can only generate first lane for PtrAdd");
     Value *Ptr = State.get(getOperand(0), Part, /* IsScalar */ true);
     Value *Addend = State.get(getOperand(1), Part, /* IsScalar */ true);
-    return Builder.CreatePtrAdd(Ptr, Addend, Name);
+    return Builder.CreatePtrAdd(Ptr, Addend, Name,
+                                isInBounds() ? GEPNoWrapFlags::inBounds()
+                                             : GEPNoWrapFlags::none());
   }
   case VPInstruction::ResumePhi: {
     if (Part != 0)
@@ -2393,7 +2395,6 @@ void VPInterleaveRecipe::execute(VPTransformState &State) {
 
   // Prepare for the new pointers.
   SmallVector<Value *, 2> AddrParts;
-  unsigned Index = Group->getIndex(Instr);
 
   // TODO: extend the masked interleaved-group support to reversed access.
   VPValue *BlockInMask = getMask();
@@ -2413,10 +2414,11 @@ void VPInterleaveRecipe::execute(VPTransformState &State) {
     Idx = State.Builder.CreateSub(RuntimeVF, State.Builder.getInt32(1));
     Idx = State.Builder.CreateMul(Idx,
                                   State.Builder.getInt32(Group->getFactor()));
-    Idx = State.Builder.CreateAdd(Idx, State.Builder.getInt32(Index));
     Idx = State.Builder.CreateNeg(Idx);
-  } else
-    Idx = State.Builder.getInt32(-Index);
+  } else {
+    // TODO: Drop redundant 0-index GEP as follow-up.
+    Idx = State.Builder.getInt32(0);
+  }
 
   VPValue *Addr = getAddr();
   for (unsigned Part = 0; Part < State.UF; Part++) {
diff --git a/llvm/lib/Transforms/Vectorize/VPlanTransforms.cpp b/llvm/lib/Transforms/Vectorize/VPlanTransforms.cpp
index ee7c7cea0b7670..4dae6a54b017f6 100644
--- a/llvm/lib/Transforms/Vectorize/VPlanTransforms.cpp
+++ b/llvm/lib/Transforms/Vectorize/VPlanTransforms.cpp
@@ -596,8 +596,7 @@ static void legalizeAndOptimizeInductions(VPlan &Plan, ScalarEvolution &SE) {
           Plan, InductionDescriptor::IK_IntInduction, Instruction::Add, nullptr,
           SE, nullptr, StartV, StepV, InsertPt);
 
-      auto *Recipe = new VPInstruction(VPInstruction::PtrAdd,
-                                       {PtrIV->getStartValue(), Steps},
+      auto *Recipe = new VPInstruction(PtrIV->getStartValue(), Steps, false,
                                        PtrIV->getDebugLoc(), "next.gep");
 
       Recipe->insertAfter(Steps);
@@ -1522,14 +1521,19 @@ void VPlanTransforms::dropPoisonGeneratingRecipes(
 }
 
 void VPlanTransforms::createInterleaveGroups(
-    const SmallPtrSetImpl<const InterleaveGroup<Instruction> *> &InterleaveGroups,
+    VPlan &Plan,
+    const SmallPtrSetImpl<const InterleaveGroup<Instruction> *>
+        &InterleaveGroups,
     VPRecipeBuilder &RecipeBuilder, bool ScalarEpilogueAllowed) {
+  if (InterleaveGroups.empty())
+    return;
+
   // Interleave memory: for each Interleave Group we marked earlier as relevant
   // for this VPlan, replace the Recipes widening its memory instructions with a
   // single VPInterleaveRecipe at its insertion point.
+  VPDominatorTree VPDT;
+  VPDT.recalculate(Plan);
   for (const auto *IG : InterleaveGroups) {
-    auto *Recipe =
-        cast<VPWidenMemoryRecipe>(RecipeBuilder.getRecipe(IG->getInsertPos()));
     SmallVector<VPValue *, 4> StoredValues;
     for (unsigned i = 0; i < IG->getFactor(); ++i)
       if (auto *SI = dyn_cast_or_null<StoreInst>(IG->getMember(i))) {
@@ -1539,9 +1543,39 @@ void VPlanTransforms::createInterleaveGroups(
 
     bool NeedsMaskForGaps =
         IG->requiresScalarEpilogue() && !ScalarEpilogueAllowed;
-    auto *VPIG = new VPInterleaveRecipe(IG, Recipe->getAddr(), StoredValues,
-                                        Recipe->getMask(), NeedsMaskForGaps);
-    VPIG->insertBefore(Recipe);
+
+    Instruction *IRInsertPos = IG->getInsertPos();
+    auto *InsertPos =
+        cast<VPWidenMemoryRecipe>(RecipeBuilder.getRecipe(IRInsertPos));
+    VPRecipeBase *IP = InsertPos;
+
+    // Get or create the start address for the interleave group.
+    auto *Start =
+        cast<VPWidenMemoryRecipe>(RecipeBuilder.getRecipe(IG->getMember(0)));
+    VPValue *Addr = Start->getAddr();
+    if (!VPDT.properlyDominates(Addr->getDefiningRecipe(), InsertPos)) {
+      bool InBounds = false;
+      if (auto *gep = dyn_cast<GetElementPtrInst>(
+              getLoadStorePointerOperand(IRInsertPos)->stripPointerCasts()))
+        InBounds = gep->isInBounds();
+
+      // We cannot re-use the address of the first member because it does not
+      // dominate the insert position. Use the address of the insert position
+      // and create a PtrAdd to adjust the index to start at the first member.
+      APInt Offset(32,
+                   getLoadStoreType(IRInsertPos)->getScalarSizeInBits() / 8 *
+                       IG->getIndex(IRInsertPos),
+                   /*IsSigned=*/true);
+      VPValue *OffsetVPV = Plan.getOrAddLiveIn(
+          ConstantInt::get(IRInsertPos->getParent()->getContext(), -Offset));
+      Addr = new VPInstruction(InsertPos->getAddr(), OffsetVPV, InBounds);
+      Addr->getDefiningRecipe()->insertAfter(InsertPos);
+      IP = Addr->getDefiningRecipe();
+    }
+    auto *VPIG = new VPInterleaveRecipe(IG, Addr, StoredValues,
+                                        InsertPos->getMask(), NeedsMaskForGaps);
+    VPIG->insertAfter(IP);
+
     unsigned J = 0;
     for (unsigned i = 0; i < IG->getFactor(); ++i)
       if (Instruction *Member = IG->getMember(i)) {
diff --git a/llvm/lib/Transforms/Vectorize/VPlanTransforms.h b/llvm/lib/Transforms/Vectorize/VPlanTransforms.h
index e714f69eeff1ab..434545c5ea67e1 100644
--- a/llvm/lib/Transforms/Vectorize/VPlanTransforms.h
+++ b/llvm/lib/Transforms/Vectorize/VPlanTransforms.h
@@ -112,7 +112,9 @@ struct VPlanTransforms {
   // widening its memory instructions with a single VPInterleaveRecipe at its
   // insertion point.
   static void createInterleaveGroups(
-      const SmallPtrSetImpl<const InterleaveGroup<Instruction> *> &InterleaveGroups,
+      VPlan &Plan,
+      const SmallPtrSetImpl<const InterleaveGroup<Instruction> *>
+          &InterleaveGroups,
       VPRecipeBuilder &RecipeBuilder, bool ScalarEpilogueAllowed);
 };
 
diff --git a/llvm/test/Transforms/LoopVectorize/X86/interleave-cost.ll b/llvm/test/Transforms/LoopVectorize/X86/interleave-cost.ll
index cc1d11754b27ed..9383799b181c82 100644
--- a/llvm/test/Transforms/LoopVectorize/X86/interleave-cost.ll
+++ b/llvm/test/Transforms/LoopVectorize/X86/interleave-cost.ll
@@ -89,12 +89,11 @@ define void @test_free_instructions_feeding_geps_for_interleave_groups(ptr noali
 ; CHECK-NEXT:    [[BROADCAST_SPLATINSERT:%.*]] = insertelement <2 x float> poison, float [[TMP40]], i64 0
 ; CHECK-NEXT:    [[BROADCAST_SPLAT:%.*]] = shufflevector <2 x float> [[BROADCAST_SPLATINSERT]], <2 x float> poison, <2 x i32> zeroinitializer
 ; CHECK-NEXT:    [[TMP41:%.*]] = shl i64 [[TMP39]], 2
+; CHECK-NEXT:    [[TMP44:%.*]] = getelementptr float, ptr [[DST_1]], i64 [[TMP41]]
 ; CHECK-NEXT:    [[TMP42:%.*]] = load float, ptr [[P_INVAR]], align 4
 ; CHECK-NEXT:    [[BROADCAST_SPLATINSERT27:%.*]] = insertelement <2 x float> poison, float [[TMP42]], i64 0
 ; CHECK-NEXT:    [[BROADCAST_SPLAT28:%.*]] = shufflevector <2 x float> [[BROADCAST_SPLATINSERT27]], <2 x float> poison, <2 x i32> zeroinitializer
-; CHECK-NEXT:    [[TMP43:%.*]] = or disjoint i64 [[TMP41]], 3
-; CHECK-NEXT:    [[TMP44:%.*]] = getelementptr float, ptr [[DST_1]], i64 [[TMP43]]
-; CHECK-NEXT:    [[TMP45:%.*]] = getelementptr float, ptr [[TMP44]], i32 -3
+; CHECK-NEXT:    [[TMP45:%.*]] = getelementptr float, ptr [[TMP44]], i32 0
 ; CHECK-NEXT:    [[TMP46:%.*]] = shufflevector <2 x float> [[BROADCAST_SPLAT]], <2 x float> [[BROADCAST_SPLAT28]], <4 x i32> <i32 0, i32 1, i32 2, i32 3>
 ; CHECK-NEXT:    [[TMP47:%.*]] = shufflevector <4 x float> [[TMP46]], <4 x float> zeroinitializer, <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7>
 ; CHECK-NEXT:    [[INTERLEAVED_VEC:%.*]] = shufflevector <8 x float> [[TMP47]], <8 x float> poison, <8 x i32> <i32 0, i32 2, i32 4, i32 6, i32 1, i32 3, i32 5, i32 7>
@@ -102,8 +101,8 @@ define void @test_free_instructions_feeding_geps_for_interleave_groups(ptr noali
 ; CHECK-NEXT:    [[TMP48:%.*]] = load float, ptr [[P_INVAR]], align 4
 ; CHECK-NEXT:    [[BROADCAST_SPLATINSERT29:%.*]] = insertelement <2 x float> poison, float [[TMP48]], i64 0
 ; CHECK-NEXT:    [[BROADCAST_SPLAT30:%.*]] = shufflevector <2 x float> [[BROADCAST_SPLATINSERT29]], <2 x float> poison, <2 x i32> zeroinitializer
-; CHECK-NEXT:    [[TMP49:%.*]] = getelementptr float, ptr [[DST_2]], i64 [[TMP43]]
-; CHECK-NEXT:    [[TMP50:%.*]] = getelementptr float, ptr [[TMP49]], i32 -3
+; CHECK-NEXT:    [[TMP49:%.*]] = getelementptr float, ptr [[DST_2]], i64 [[TMP41]]
+; CHECK-NEXT:    [[TMP50:%.*]] = getelementptr float, ptr [[TMP49]], i32 0
 ; CHECK-NEXT:    [[BROADCAST_SPLAT36:%.*]] = shufflevector <2 x float> [[BROADCAST_SPLAT30]], <2 x float> zeroinitializer, <4 x i32> <i32 0, i32 1, i32 2, i32 3>
 ; CHECK-NEXT:    [[TMP51:%.*]] = shufflevector <4 x float> [[BROADCAST_SPLAT36]], <4 x float> zeroinitializer, <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7>
 ; CHECK-NEXT:    [[INTERLEAVED_VEC31:%.*]] = shufflevector <8 x float> [[TMP51]], <8 x float> poison, <8 x i32> <i32 0, i32 2, i32 4, i32 6, i32 1, i32 3, i32 5, i32 7>
@@ -266,8 +265,7 @@ define void @geps_feeding_interleave_groups_with_reuse(ptr %arg, i64 %arg1, ptr
 ; CHECK-NEXT:    [[TMP35:%.*]] = fmul <2 x float> [[TMP34]], zeroinitializer
 ; CHECK-NEXT:    [[TMP36:%.*]] = fadd <2 x float> [[STRIDED_VEC16]], [[STRIDED_VEC20]]
 ; CHECK-NEXT:    [[TMP37:%.*]] = fmul <2 x float> [[TMP36]], zeroinitializer
-; CHECK-NEXT:    [[TMP38:%.*]] = getelementptr i8, ptr [[TMP28]], i64 12
-; CHECK-NEXT:    [[TMP39:%.*]] = getelementptr float, ptr [[TMP38]], i32 -3
+; CHECK-NEXT:    [[TMP39:%.*]] = getelementptr float, ptr [[TMP28]], i32 0
 ; CHECK-NEXT:    [[TMP40:%.*]] = shufflevector <2 x float> [[TMP31]], <2 x float> [[TMP33]], <4 x i32> <i32 0, i32 1, i32 2, i32 3>
 ; CHECK-NEXT:    [[TMP41:%.*]] = shufflevector <2 x float> [[TMP35]], <2 x float> [[TMP37]], <4 x i32> <i32 0, i32 1, i32 2, i32 3>
 ; CHECK-NEXT:    [[TMP42:%.*]] = shufflevector <4 x float> [[TMP40]], <4 x float> [[TMP41]], <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7>
@@ -490,11 +488,10 @@ define void @geps_feeding_interleave_groups_with_reuse2(ptr %A, ptr %B, i64 %N)
 ; CHECK-NEXT:    [[WIDE_VEC:%.*]] = load <16 x i32>, ptr [[TMP53]], align 4
 ; CHECK-NEXT:    [[STRIDED_VEC:%.*]] = shufflevector <16 x i32> [[WIDE_VEC]], <16 x i32> poison, <4 x i32> <i32 0, i32 4, i32 8, i32 12>
 ; CHECK-NEXT:    [[STRIDED_VEC34:%.*]] = shufflevector <16 x i32> [[WIDE_VEC]], <16 x i32> poison, <4 x i32> <i32 1, i32 5, i32 9, i32 13>
+; CHECK-NEXT:    [[TMP56:%.*]] = getelementptr i32, ptr [[A]], i64 [[TMP50]]
 ; CHECK-NEXT:    [[TMP54:%.*]] = getelementptr i32, ptr [[B]], <4 x i64> [[VEC_IND]]
 ; CHECK-NEXT:    [[WIDE_MASKED_GATHER:%.*]] = call <4 x i32> @llvm.masked.gather.v4i32.v4p0(<4 x ptr> [[TMP54]], i32 4, <4 x i1> <i1 true, i1 true, i1 true, i1 true>, <4 x i32> poison), !alias.scope [[META6:![0-9]+]]
-; CHECK-NEXT:    [[TMP55:%.*]] = or disjoint i64 [[TMP50]], 7
-; CHECK-NEXT:    [[TMP56:%.*]] = getelementptr i32, ptr [[A]], i64 [[TMP55]]
-; CHECK-NEXT:    [[TMP57:%.*]] = getelementptr i32, ptr [[TMP56]], i32 -7
+; CHECK-NEXT:    [[TMP57:%.*]] = getelementptr i32, ptr [[TMP56]], i32 0
 ; CHECK-NEXT:    [[TMP58:%.*]] = shufflevector <4 x i32> [[STRIDED_VEC]], <4 x i32> zeroinitializer, <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7>
 ; CHECK-NEXT:    [[TMP59:%.*]] = shufflevector <4 x i32> [[STRIDED_VEC34]], <4 x i32> zeroinitializer, <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7>
 ; CHECK-NEXT:    [[TMP60:%.*]] = shufflevector <4 x i32> [[WIDE_MASKED_GATHER]], <4 x i32> zeroinitializer, <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7>
diff --git a/llvm/test/Transforms/LoopVectorize/X86/x86-interleaved-accesses-masked-group.ll b/llvm/test/Transforms/LoopVectorize/X86/x86-interleaved-accesses-masked-group.ll
index 6b52023cfbcaec..968058134690bc 100644
--- a/llvm/test/Transforms/LoopVectorize/X86/x86-interleaved-accesses-masked-group.ll
+++ b/llvm/test/Transforms/LoopVectorize/X86/x86-interleaved-accesses-masked-group.ll
@@ -1419,13 +1419,11 @@ define dso_local void @masked_strided2(ptr noalias nocapture readonly %p, ptr no
 ; ENABLED_MASKED_STRIDED-NEXT:    [[WIDE_MASKED_VEC:%.*]] = call <16 x i8> @llvm.masked.load.v16i8.p0(ptr [[TMP2]], i32 1, <16 x i1> [[INTERLEAVED_MASK]], <16 x i8> poison)
 ; ENABLED_MASKED_STRIDED-NEXT:    [[STRIDED_VEC:%.*]] = shufflevector <16 x i8> [[WIDE_MASKED_VEC]], <16 x i8> poison, <8 x i32> <i32 0, i32 2, i32 4, i32 6, i32 8, i32 10, i32 12, i32 14>
 ; ENABLED_MASKED_STRIDED-NEXT:    [[STRIDED_VEC1:%.*]] = shufflevector <16 x i8> [[WIDE_MASKED_VEC]], <16 x i8> poison, <8 x i32> <i32 1, i32 3, i32 5, i32 7, i32 9, i32 11, i32 13, i32 15>
-; ENABLED_MASKED_STRIDED-NEXT:    [[TMP3:%.*]] = or disjoint i32 [[TMP1]], 1
 ; ENABLED_MASKED_STRIDED-NEXT:    [[TMP4:%.*]] = call <8 x i8> @llvm.smax.v8i8(<8 x i8> [[STRIDED_VEC]], <8 x i8> [[STRIDED_VEC1]])
+; ENABLED_MASKED_STRIDED-NEXT:    [[TMP6:%.*]] = getelementptr i8, ptr [[Q:%.*]], i32 [[TMP1]]
 ; ENABLED_MASKED_STRIDED-NEXT:    [[TMP5:%.*]] = sub <8 x i8> zeroinitializer, [[TMP4]]
-; ENABLED_MASKED_STRIDED-NEXT:    [[TMP6:%.*]] = getelementptr i8, ptr [[Q:%.*]], i32 [[TMP3]]
-; ENABLED_MASKED_STRIDED-NEXT:    [[TMP7:%.*]] = getelementptr i8, ptr [[TMP6]], i32 -1
 ; ENABLED_MASKED_STRIDED-NEXT:    [[INTERLEAVED_VEC:%.*]] = shufflevector <8 x i8> [[TMP4]], <8 x i8> [[TMP5]], <16 x i32> <i32 0, i32 8, i32 1, i32 9, i32 2, i32 10, i32 3, i32 11, i32 4, i32 12, i32 5, i32 13, i32 6, i32 14, i32 7, i32 15>
-; ENABLED_MASKED_STRIDED-NEXT:    call void @llvm.masked.store.v16i8.p0(<16 x i8> [[INTERLEAVED_VEC]], ptr [[TMP7]], i32 1, <16 x i1> [[INTERLEAVED_MASK]])
+; ENABLED_MASKED_STRIDED-NEXT:    call void @llvm.masked.store.v16i8.p0(<16 x i8> [[INTERLEAVED_VEC]], ptr [[TMP6]], i32 1, <16 x i1> [[INTERLEAVED_MASK]])
 ; ENABLED_MASKED_STRIDED-NEXT:    [[INDEX_NEXT]] = add nuw i32 [[INDEX]], 8
 ; ENABLED_MASKED_STRIDED-NEXT:    [[VEC_IND_NEXT]] = add <8 x i32> [[VEC_IND]], <i32 8, i32 8, i32 8, i32 8, i32 8, i32 8, i32 8, i32 8>
 ; ENABLED_MASKED_STRIDED-NEXT:    [[TMP8:%.*]] = icmp eq i32 [[INDEX_NEXT]], 1024
@@ -2555,13 +2553,11 @@ define dso_local void @masked_strided2_unknown_tc(ptr noalias nocapture readonly
 ; ENABLED_MASKED_STRIDED-NEXT:    [[WIDE_MASKED_VEC:%.*]] = call <16 x i8> @llvm.masked.load.v16i8.p0(ptr [[TMP3]], i32 1, <16 x i1> [[INTERLEAVED_MASK]], <16 x i8> poison)
 ; ENABLED_MASKED_STRIDED-NEXT:    [[STRIDED_VEC:%.*]] = shufflevector <16 x i8> [[WIDE_MASKED_VEC]], <16 x i8> poison, <8 x i32> <i32 0, i32 2, i32 4, i32 6, i32 8, i32 10, i32 12, i32 14>
 ; ENABLED_MASKED_STRIDED-NEXT:    [[STRIDED_VEC3:%.*]] = shufflevector <16 x i8> [[WIDE_MASKED_VEC]], <16 x i8> poison, <8 x i32> <i32 1, i32 3, i32 5, i32 7, i32 9, i32 11, i32 13, i32 15>
-; ENABLED_MASKED_STRIDED-NEXT:    [[TMP5:%.*]] = or disjoint i32 [[TMP2]], 1
 ; ENABLED_MASKED_STRIDED-NEXT:    [[TMP6:%.*]] = call <8 x i8> @llvm.smax.v8i8(<8 x i8> [[STRIDED_VEC]], <8 x i8> [[STRIDED_VEC3]])
+; ENABLED_MASKED_STRIDED-NEXT:    [[TMP8:%.*]] = getelementptr i8, ptr [[Q:%.*]], i32 [[TMP2]]
 ; ENABLED_MASKED_STRIDED-NEXT:    [[TMP7:%.*]] = sub <8 x i8> zeroinitializer, [[TMP6]]
-; ENABLED_MASKED_STRIDED-NEXT:    [[TMP8:%.*]] = getelementptr i8, ptr [[Q:%.*]], i32 [[TMP5]]
-; ENABLED_MASKED_STRIDED-NEXT:    [[TMP9:%.*]] = getelementptr i8, ptr [[TMP8]], i32 -1
 ; ENABLED_MASKED_STRIDED-NEXT:    [[INTERLEAVED_VEC:%.*]] = shufflevector <8 x i8> [[TMP6]], <8 x i8> [[TMP7]], <16 x i32> <i32 0, i32 8, i32 1, i32 9, i32 2, i32 10, i32 3, i32 11, i32 4, i32 12, i32 5, i32 13, i32 6, i32 14, i32 7, i32 15>
-; ENABLED_MASKED_STRIDED-NEXT:    call void @llvm.masked.store.v16i8.p0(<16 x i8> [[INTERLEAVED_VEC]], ptr [[TMP9]], i32 1, <16 x i1> [[INTERLEAVED_MASK]])
+; ENABLED_MASKED_STRIDED-NEXT:    call void @llvm.masked.store.v16i8.p0(<16 x i8> [[INTERLEAVED_VEC]], ptr [[TMP8]], i32 1, <16 x i1> [[INTERLEAVED_MASK]])
 ; ENABLED_MASKED_STRIDED-NEXT:    [[INDEX_NEXT]] = add i32 [[INDEX]], 8
 ; ENABLED_MASKED_STRIDED-NEXT:    [[VEC_IND_NEXT]] = add <8 x i32> [[VEC_IND]], <i32 8, i32 8, i32 8, i32 8, i32 8, i32 8, i32 8, i32 8>
 ; ENABLED_MASKED_STRIDED-NEXT:    [[TMP10:%.*]] = icmp eq i32 [[INDEX_NEXT]], [[N_VEC]]
@@ -2989,13 +2985,11 @@ define dso_local void @unconditional_masked_strided2_unknown_tc(ptr noalias noca
 ; ENABLED_MASKED_STRIDED-NEXT:    [[WIDE_MASKED_VEC:%.*]] = call <16 x i8> @llvm.masked.load.v16i8.p0(ptr [[TMP2]], i32 1, <16 x i1> [[INTERLEAVED_MASK]], <16 x i8> poison)
 ; ENABLED_MASKED_STRIDED-NEXT:    [[STRIDED_VEC:%.*]] = shufflevector <16 x i8> [[WIDE_MASKED_VEC]], <16 x i8> poison, <8 x i32> <i32 0, i32 2, i32 4, i32 6, i32 8, i32 10, i32 12, i32 14>
 ; ENABLED_MASKED_STRIDED-NEXT:    [[STRIDED_VEC3:%.*]] = shufflevector <16 x i8> [[WIDE_MASKED_VEC]], <16 x i8> poison, <8 x i32> <i32 1, i32 3, i32 5, i32 7, i32 9, i32 11, i32 13, i32 15>
-; ENABLED_MASKED_STRIDED-NEXT:    [[TMP3:%.*]] = or disjoint i32 [[TMP1]], 1
 ; ENABLED_MASKED_STRIDED-NEXT:    [[TMP4:%.*]] = call <8 x i8> @llvm.smax.v8i8(<8 x i8> [[STRIDED_VEC]], <8 x i8> [[STRIDED_VEC3]])
+; ENABLED_MASKED_STRIDED-NEXT:    [[TMP6:%.*]] = getelementptr inbounds i8, ptr [[Q:%.*]], i32 [[TMP1]]
 ; ENABLED_MASKED_STRIDED-NEXT:    [[TMP5:%.*]] = sub <8 x i8> zeroinitializer, [[TMP4]]
-; ENABLED_MASKED_STRIDED-NEXT:    [[TMP6:%.*]] = getelementptr inbounds i8, ptr [[Q:%.*]], i32 [[TMP3]]
-; ENABLED_MASKED_STRIDED-NEXT:    [[TMP7:%.*]] = getelementptr inbounds i8, ptr [[TMP6]], i32 -1
 ; ENABLED_MASKED_STRIDED-NEXT:    [[INTERLEAVED_VEC:%.*]] = shufflevector <8 x i8> [[TMP4]], <8 x i8> [[TMP5]], <16 x i32> <i32 0, i32 8, i32 1, i32 9, i32 2, i32 10, i32 3, i32 11, i32 4, i32 12, i32 5, i32 13, i32 6, i32 14, i32 7, i32 15>
-; ENABLED_MASKED_STRIDED-NEXT:    call void @llvm.masked.store.v16i8.p0(<16 x i8> [[INTERLEAVED_VEC]], ptr [[TMP7]], i32 1, <16 x i1> [[INTERLEAVED_MASK]])
+; ENABLED_MASKED_STRIDED-NEXT:    call void @llvm.masked.store.v16i8.p0(<16 x i8> [[INTERLEAVED_VEC]], ptr [[TMP6]], i32 1, <16 x i1> [[...
[truncated]

fhahn added a commit to fhahn/llvm-project that referenced this pull request Aug 28, 2024
This patch adds a new narrowInterleaveGroups transform, which tries to
convert a plan with interleave groups with VF elements to a plan that
instead replaces the interleave groups with wide loads and stores
processing VF elements.

This effectively is a very simple form of loop-aware SLP, where we
use interleave groups to identify candidates.

This initial version is quite restricted and hopefully serves as a
starting point for how to best model those kinds of transforms.

Depends on llvm#106431.

Fixes llvm#82936
fhahn (Contributor, Author) commented Sep 12, 2024

ping :)

artagnon (Contributor) left a comment

The change makes sense to me, thanks!

Resolved (outdated) review threads:
  • llvm/lib/Transforms/Vectorize/VPlan.h
  • llvm/lib/Transforms/Vectorize/VPlanTransforms.cpp (3 threads)
ayalz (Collaborator) left a comment

Nice simplification! Can provide a step towards facilitating SLP.

@@ -1330,6 +1330,13 @@ class VPInstruction : public VPRecipeWithIRFlags {
assert(Opcode == Instruction::Or && "only OR opcodes can be disjoint");
}

VPInstruction(VPValue *Ptr, VPValue *Offset, bool InBounds, DebugLoc DL = {},
ayalz (Collaborator):

nit: a bit odd to construct PtrAdd given two VPValues and a bool. Perhaps pass a GEPFlagsTy as third parameter instead of bool, analogous to passing FMFs?

fhahn (Contributor, Author):

Updated, thanks!

Comment on lines 650 to 651
isInBounds() ? GEPNoWrapFlags::inBounds()
: GEPNoWrapFlags::none());
ayalz (Collaborator):

Independent of this patch, but exercised only with it?

Comment on lines 649 to 651
return Builder.CreatePtrAdd(Ptr, Addend, Name,
isInBounds() ? GEPNoWrapFlags::inBounds()
: GEPNoWrapFlags::none());
ayalz (Collaborator):

Suggested change:
-    return Builder.CreatePtrAdd(Ptr, Addend, Name,
-                                isInBounds() ? GEPNoWrapFlags::inBounds()
-                                             : GEPNoWrapFlags::none());
+    auto Flags = isInBounds() ? GEPNoWrapFlags::inBounds()
+                              : GEPNoWrapFlags::none();
+    return Builder.CreatePtrAdd(Ptr, Addend, Name, Flags);

?

fhahn (Contributor, Author):

Updated to use CreateInBoundsPtrAdd/CreatePtrAdd.

@@ -2413,10 +2414,11 @@ void VPInterleaveRecipe::execute(VPTransformState &State) {
Idx = State.Builder.CreateSub(RuntimeVF, State.Builder.getInt32(1));
Idx = State.Builder.CreateMul(Idx,
State.Builder.getInt32(Group->getFactor()));
ayalz (Collaborator):

Does the above comment of adjusting the index from the first vector lane (lane zero?) rather than using the pointer of the last lane, still hold? If so, this adjusting is invariant and better be placed in the preheader, possibly later by licm?

fhahn (Contributor, Author):

Yes I think so, we just don't have to compute the index for the first lane (used to be done via the removed add)
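(A worked example with illustrative numbers: for VF = 4 and interleave factor 2, the reverse-access index is now computed as -((4 - 1) * 2) = -6 relative to the member-0 pointer, whereas it previously also added the insert position's member index before negating, e.g. -7 for member 1.)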

} else {
// TODO: Drop redundant 0-index GEP as follow-up.
Idx = State.Builder.getInt32(0);
}

VPValue *Addr = getAddr();
for (unsigned Part = 0; Part < State.UF; Part++) {
ayalz (Collaborator):

Should the "Notice current instruction ..." explanation below be updated?

fhahn (Contributor, Author):

It can be dropped now I think, done, thanks!

auto *Start =
cast<VPWidenMemoryRecipe>(RecipeBuilder.getRecipe(IG->getMember(0)));
VPValue *Addr = Start->getAddr();
if (!VPDT.properlyDominates(Addr->getDefiningRecipe(), InsertPos)) {
ayalz (Collaborator):

then hoist Addr's defining recipe to insert pos, possibly hoisting additional defining recipes as needed? And/or sink loads above member zero to join it.

In general, all members of an interleave group conceptually move to its insert pos, so may as well perform actual movement. This should facilitate subsequent SLP'ing.

Interleave stores are inserted at the last member, so the address of any member should be available there. Interleaved loads are inserted at the first member, so only its address is guaranteed to be available there, although others may as well.
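A hedged sketch of the rare case (a hypothetical loop, in the spirit of the modified interleaved-accesses-different-insert-position.ll test): the load group is inserted at its first member in program order, but that member is not member 0 of the group, and member 0's address is only computed afterwards:

  loop:
    %iv   = phi i64 [ 1, %entry ], [ %iv.next, %loop ]
    %gep1 = getelementptr i32, ptr %p, i64 %iv    ; member 1, the insert position
    %l1   = load i32, ptr %gep1
    %off  = add i64 %iv, -1
    %gep0 = getelementptr i32, ptr %p, i64 %off   ; member 0, defined after the insert position
    %l0   = load i32, ptr %gep0
    ...

Here %gep0 does not dominate the insert position, so the patch instead emits a PtrAdd based on %gep1.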

fhahn (Contributor, Author):

Sounds good, should be done as potential follow-up?

ayalz (Collaborator):

Sure, worth leaving behind a TODO.

fhahn (Contributor, Author):

Added, thanks!

Comment on lines +96 to +97
; CHECK-NEXT: [[TMP5:%.*]] = getelementptr i8, ptr [[TMP4]], i32 -4
; CHECK-NEXT: [[TMP51:%.*]] = getelementptr i16, ptr [[TMP5]], i32 0
ayalz (Collaborator):

Note: here a redundant gep with zero is added, replacing TMP4[-2:i16] with TMP4[-4:i8 + 0]

; CHECK-NEXT: [[INTERLEAVED_VEC:%.*]] = shufflevector <4 x i32> [[BROADCAST_SPLAT]], <4 x i32> [[BROADCAST_SPLAT2]], <8 x i32> <i32 0, i32 4, i32 1, i32 5, i32 2, i32 6, i32 3, i32 7>
; CHECK-NEXT: store <8 x i32> [[INTERLEAVED_VEC]], ptr [[TMP12]], align 4
; CHECK-NEXT: store <8 x i32> [[INTERLEAVED_VEC]], ptr [[TMP7]], align 4
ayalz (Collaborator):

Note: this stores INTERLEAVED_VEC at A[OFFSET_IDX], rather than to A[OFFSET_IDX+(1*4)-4]
Stores of X should be dce'd.

; ENABLED_MASKED_STRIDED-NEXT: [[INTERLEAVED_VEC:%.*]] = shufflevector <8 x i8> [[TMP4]], <8 x i8> [[TMP5]], <16 x i32> <i32 0, i32 8, i32 1, i32 9, i32 2, i32 10, i32 3, i32 11, i32 4, i32 12, i32 5, i32 13, i32 6, i32 14, i32 7, i32 15>
; ENABLED_MASKED_STRIDED-NEXT: call void @llvm.masked.store.v16i8.p0(<16 x i8> [[INTERLEAVED_VEC]], ptr [[TMP7]], i32 1, <16 x i1> [[INTERLEAVED_MASK]])
; ENABLED_MASKED_STRIDED-NEXT: call void @llvm.masked.store.v16i8.p0(<16 x i8> [[INTERLEAVED_VEC]], ptr [[TMP6]], i32 1, <16 x i1> [[INTERLEAVED_MASK]])
ayalz (Collaborator):

Note: Q[(TMP1+1)-1] --> Q[TMP1]

Resolved (outdated) review thread: llvm/lib/Transforms/Vectorize/VPlanTransforms.cpp
fhahn added a commit that referenced this pull request Sep 24, 2024
fhahn added a commit to fhahn/llvm-project that referenced this pull request Sep 25, 2024
augusto2112 pushed a commit to augusto2112/llvm-project that referenced this pull request Sep 26, 2024
xgupta pushed a commit to xgupta/llvm-project that referenced this pull request Oct 4, 2024
ayalz (Collaborator) left a comment

LGTM, adding a few minor comments.

@@ -1323,6 +1322,13 @@ class VPInstruction : public VPRecipeWithIRFlags,
assert(Opcode == Instruction::Or && "only OR opcodes can be disjoint");
}

VPInstruction(VPValue *Ptr, VPValue *Offset, GEPFlagsTy Flags = {false},
ayalz (Collaborator):

Suggested change:
-  VPInstruction(VPValue *Ptr, VPValue *Offset, GEPFlagsTy Flags = {false},
+  VPInstruction(VPValue *Ptr, VPValue *Offset, GEPFlagsTy Flags,

default unneeded and may lead to potential ambiguity?

fhahn (Contributor, Author):

Done, thanks!

DebugLoc DL = {}, const Twine &Name = "")
: VPRecipeWithIRFlags(VPDef::VPInstructionSC,
ArrayRef<VPValue *>({Ptr, Offset}),
GEPFlagsTy(Flags), DL),
ayalz (Collaborator):

Suggested change:
-                            GEPFlagsTy(Flags), DL),
+                            Flags, DL),

?

fhahn (Contributor, Author):

Done, thanks!

-  } else
-    Idx = State.Builder.getInt32(-Index);
+  } else {
+    // TODO: Drop redundant 0-index GEP as follow-up.
ayalz (Collaborator):

Your call; generating gep with zero also causes some test discrepancies.

auto *Start =
cast<VPWidenMemoryRecipe>(RecipeBuilder.getRecipe(IG->getMember(0)));
VPValue *Addr = Start->getAddr();
if (!VPDT.properlyDominates(Addr->getDefiningRecipe(), InsertPos)) {
ayalz (Collaborator):

Also handle(s) a live-in Addr?

fhahn (Contributor, Author):

Done by checking if there's a defining recipe, thanks!

// and create a PtrAdd to adjust the index to start at the first member.
APInt Offset(32,
getLoadStoreType(IRInsertPos)->getScalarSizeInBits() / 8 *
IG->getIndex(IRInsertPos),
ayalz (Collaborator):

Worth asserting Offset or index of IRInsertPos is non zero?
I.e., if insert pos is the first member, its address operand must dominate it.

fhahn (Contributor, Author):

Added assert, thanks!
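(A worked example with illustrative numbers: for a group of i32 members whose insert position is member 7, Offset = (32 / 8) * 7 = 28 bytes, and the emitted PtrAdd applies -28 to reach member 0, matching the "getelementptr i32, ptr %x, i32 -7" adjustments that disappear from the tests above.)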

getLoadStorePointerOperand(IRInsertPos)->stripPointerCasts()))
InBounds = Gep->isInBounds();

// We cannot re-use the address of the first member because it does not
ayalz (Collaborator):

Suggested change:
-    // We cannot re-use the address of the first member because it does not
+    // We cannot re-use the address of member zero because it does not

(may be confusing: insertion point appears "first" (for loads, last for stores); here we mean "first" in terms of memory addresses.)

fhahn (Contributor, Author):

Updated, thanks!

InBounds = Gep->isInBounds();

// We cannot re-use the address of the first member because it does not
// dominate the insert position. Use the address of the insert position
ayalz (Collaborator):

Suggested change:
-    // dominate the insert position. Use the address of the insert position
+    // dominate the insert position. Instead, use the address of the insert position

fhahn (Contributor, Author):

Updated, thanks.


// We cannot re-use the address of the first member because it does not
// dominate the insert position. Use the address of the insert position
// and create a PtrAdd to adjust the index to start at the first member.
ayalz (Collaborator):

Suggested change:
-    // and create a PtrAdd to adjust the index to start at the first member.
+    // and create a PtrAdd adjusting it to the address of member zero.

fhahn (Contributor, Author):

Updated, thanks!


fhahn merged commit 7f74651 into llvm:main on Oct 6, 2024 (6 of 8 checks passed).
fhahn deleted the vplan-interleave-start-address branch on October 6, 2024 at 21:53.
Kyvangka1610 added a commit to Kyvangka1610/llvm-project that referenced this pull request Oct 7, 2024
* commit 'FETCH_HEAD':
  [X86] getIntImmCostInst - pull out repeated Imm.getBitWidth() calls. NFC.
  [X86] Add test coverage for llvm#111323
  [Driver] Use empty multilib file in another test (llvm#111352)
  [clang][OpenMP][test] Use x86_64-linux-gnu triple for test referencing avx512f feature (llvm#111337)
  [doc] Fix Kaleidoscope tutorial chapter 3 code snippet and full listing discrepancies (llvm#111289)
  [Flang][OpenMP] Improve entry block argument creation and binding (llvm#110267)
  [x86] combineMul - handle 0/-1 KnownBits cases before MUL_IMM logic (REAPPLIED)
  [llvm-dis] Fix non-deterministic disassembly across multiple inputs (llvm#110988)
  [lldb][test] TestDataFormatterLibcxxOptionalSimulator.py: change order of ifdefs
  [lldb][test] Add libcxx-simulators test for std::optional (llvm#111133)
  [x86] combineMul - use computeKnownBits directly to find MUL_IMM constant splat. (REAPPLIED)
  Reland "[lldb][test] TestDataFormatterLibcxxStringSimulator.py: add new padding layout" (llvm#111123)
  Revert "[x86] combineMul - use computeKnownBits directly to find MUL_IMM constant splat."
  update_test_checks: fix a simple regression  (llvm#111347)
  [LegalizeVectorTypes] Always widen fabs (llvm#111298)
  [lsan] Make ReportUnsuspendedThreads return bool also for Fuchsia
  [mlir][vector] Add more tests for ConvertVectorToLLVM (6/n) (llvm#111121)
  [bazel] port 9144fed
  [SystemZ] Remove inlining threshold multiplier. (llvm#106058)
  [LegalizeVectorTypes] When widening don't check for libcalls if promoted (llvm#111297)
  [clang][Driver] Improve multilib custom error reporting (llvm#110804)
  [clang][Driver] Rename "FatalError" key to "Error" in multilib.yaml (llvm#110804)
  [LLVM][Maintainers] Update release managers (llvm#111164)
  [Clang][Driver] Add option to provide path for multilib's YAML config file (llvm#109640)
  [LoopVectorize] Remove redundant code in emitSCEVChecks (llvm#111132)
  [AMDGPU] Only emit SCOPE_SYS global_wb (llvm#110636)
  [ELF] Change Ctx::target to unique_ptr (llvm#111260)
  [ELF] Pass Ctx & to some free functions
  [RISCV] Only disassemble fcvtmod.w.d if the rounding mode is rtz. (llvm#111308)
  [Clang] Remove the special-casing for RequiresExprBodyDecl in BuildResolvedCallExpr() after fd87d76 (llvm#111277)
  [ELF] Pass Ctx & to InputFile
  [clang-format] Add AlignFunctionDeclarations to AlignConsecutiveDeclarations (llvm#108241)
  [AMDGPU] Support preloading hidden kernel arguments (llvm#98861)
  [ELF] Move static nextGroupId isInGroup to LinkerDriver
  [clangd] Add ArgumentLists config option under Completion (llvm#111322)
  [ELF] Pass Ctx & to SyntheticSections
  [ELF] Pass Ctx & to Symbols
  [ELF] Pass Ctx & to Symbols
  [ELF] getRelocTargetVA: pass Ctx and Relocation. NFC
  [clang-tidy] Avoid capturing a local variable in a static lambda in UseRangesCheck (llvm#111282)
  [VPlan] Use pointer to member 0 as VPInterleaveRecipe's pointer arg. (llvm#106431)
  [clangd] Simplify ternary expressions with std::optional::value_or (NFC) (llvm#111309)
  [libc++][format][2/3] Optimizes c-string arguments. (llvm#101805)
  [RISCV] Combine RVBUnary and RVKUnary into classes that are more similar to ALU(W)_r(r/i). NFC (llvm#111279)
  [ELF] Pass Ctx & to InputFiles
  [libc] GPU RPC interface: add return value to `rpc_host_call` (llvm#111288)

Signed-off-by: kyvangka1610 <kyvangka2002@gmail.com>
fhahn added a commit that referenced this pull request Oct 8, 2024
The GEP with offset 0 is redundant, remove it. This addresses a TODO
from 7f74651 (#106431).
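(A minimal illustration, with hypothetical values: the follow-up folds "%addr = getelementptr float, ptr %base, i32 0" into a direct use of %base, since a GEP with a zero offset returns its pointer operand unchanged.)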
dyung (Collaborator) commented Oct 8, 2024

Hi @fhahn, we are seeing an assertion failure on an internal test which I bisected back to this change. I have put the details in #111606, can you take a look?

fhahn (Contributor, Author) commented Oct 9, 2024

> Hi @fhahn, we are seeing an assertion failure on an internal test which I bisected back to this change. I have put the details in #111606, can you take a look?

Thanks @dyung, should be fixed by 01cbbc5

alexfh (Contributor) commented Oct 15, 2024

Hi @fhahn, a heads up: some of our tests started failing on aarch64 after this commit. Bisecting with -opt-bisect-limit points to LoopVectorizePass on a certain function. I'm trying to figure out if there's UB in the original code, but so far I don't see anything suspicious. I'll try to provide a test case if I don't find problems with the original code.

alexfh (Contributor) commented Oct 15, 2024

I looked closer at the original C++ code and it actually looks like a miscompile. The diff in the generated IR before and after this commit (and the follow-up):

--- /tmp/q-good.ll      2024-10-15 18:29:02.491364585 +0200
+++ /tmp/q-bad.ll       2024-10-15 18:28:29.535008374 +0200
@@ -1281,8 +1281,8 @@
   %41 = mul i64 %40, 24
   %42 = getelementptr i8, ptr %4, i64 %41
   %43 = insertelement <2 x i64> <i64 poison, i64 0>, i64 %12, i64 0
-  %44 = getelementptr i8, ptr %4, i64 8
-  %45 = getelementptr i8, ptr %4, i64 56
+  %44 = getelementptr i8, ptr %4, i64 16
+  %45 = getelementptr i8, ptr %4, i64 64
   br label %46

 46:                                               ; preds = %46, %36
@@ -1291,16 +1291,16 @@
   %49 = phi <2 x i64> [ zeroinitializer, %36 ], [ %68, %46 ]
   %50 = mul i64 %47, 24
   %51 = getelementptr i8, ptr %44, i64 %50
-  %52 = load <6 x ptr>, ptr %51, align 8
-  %53 = shufflevector <6 x ptr> %52, <6 x ptr> poison, <2 x i32> <i32 0, i32 3>
-  %54 = shufflevector <6 x ptr> %52, <6 x ptr> poison, <2 x i32> <i32 1, i32 4>
-  %55 = getelementptr i8, ptr %45, i64 %50
-  %56 = load <6 x ptr>, ptr %55, align 8
+  %52 = getelementptr i8, ptr %45, i64 %50
+  %53 = load <6 x ptr>, ptr %51, align 8
+  %54 = shufflevector <6 x ptr> %53, <6 x ptr> poison, <2 x i32> <i32 0, i32 3>
+  %55 = shufflevector <6 x ptr> %53, <6 x ptr> poison, <2 x i32> <i32 1, i32 4>
+  %56 = load <6 x ptr>, ptr %52, align 8
   %57 = shufflevector <6 x ptr> %56, <6 x ptr> poison, <2 x i32> <i32 0, i32 3>
   %58 = shufflevector <6 x ptr> %56, <6 x ptr> poison, <2 x i32> <i32 1, i32 4>
-  %59 = ptrtoint <2 x ptr> %54 to <2 x i64>
+  %59 = ptrtoint <2 x ptr> %55 to <2 x i64>
   %60 = ptrtoint <2 x ptr> %58 to <2 x i64>
-  %61 = ptrtoint <2 x ptr> %53 to <2 x i64>
+  %61 = ptrtoint <2 x ptr> %54 to <2 x i64>
   %62 = ptrtoint <2 x ptr> %57 to <2 x i64>
   %63 = sub <2 x i64> %59, %61
   %64 = sub <2 x i64> %60, %62
@@ -8549,7 +8549,7 @@
 !0 = !{i32 1, !"wchar_size", i32 4}
 !1 = !{i32 8, !"PIC Level", i32 2}
 !2 = !{i32 7, !"frame-pointer", i32 1}
-!3 = !{!"clang version  (00128a20eec27246719d73ba427bf821883b00b4)"}
+!3 = !{!"clang version  (efcfa6e711689ada546c323316145ecd749d380a)"}
 !4 = !{}
 !5 = distinct !{!5, !6}
 !6 = !{!"llvm.loop.mustprogress"}

I've reduced this to

target datalayout = "e-m:e-i8:8:32-i16:16:32-i64:64-i128:128-n32:64-S128-Fn32"
target triple = "aarch64-unknown-linux-gnu"

define i64 @_f(ptr %0, ptr %1) {
  br label %4

3:                                                ; preds = %4
  ret i64 %14

4:                                                ; preds = %4, %2
  %5 = phi i64 [ %14, %4 ], [ 0, %2 ]
  %6 = phi ptr [ %15, %4 ], [ %1, %2 ]
  %7 = getelementptr i8, ptr %6, i64 16
  %8 = load ptr, ptr %7, align 8
  %9 = ptrtoint ptr %8 to i64
  %10 = getelementptr i8, ptr %6, i64 8
  %11 = load ptr, ptr %10, align 8
  %12 = ptrtoint ptr %11 to i64
  %13 = or i64 %9, %12
  %14 = or i64 %13, %5
  %15 = getelementptr nusw i8, ptr %6, i64 24
  %16 = icmp eq ptr %6, %0
  br i1 %16, label %3, label %4

; uselistorder directives
  uselistorder i64 %14, { 1, 0 }
}
$ diff -u100 <(./clang-good --target=aarch64-linux-gnu -O3 -emit-llvm -S reduced.ll -o -) <(./clang-bad --target=aarch64-linux-gnu -O3 -emit-llvm -S reduced.ll -o -)
--- /dev/fd/63  2024-10-15 18:55:25.820702792 +0200
+++ /dev/fd/62  2024-10-15 18:55:25.824702835 +0200
@@ -1,90 +1,90 @@
 ; ModuleID = 'reduced.ll'
 source_filename = "reduced.ll"
 target datalayout = "e-m:e-i8:8:32-i16:16:32-i64:64-i128:128-n32:64-S128-Fn32"
 target triple = "aarch64-unknown-linux-gnu"

 ; Function Attrs: nofree norecurse nosync nounwind memory(argmem: read)
 define i64 @_f(ptr readnone %0, ptr readonly %1) local_unnamed_addr #0 {
   %3 = ptrtoint ptr %1 to i64
   %4 = ptrtoint ptr %0 to i64
   %5 = sub i64 %4, %3
   %min.iters.check = icmp ult i64 %5, 96
   br i1 %min.iters.check, label %scalar.ph.preheader, label %vector.ph

 scalar.ph.preheader:                              ; preds = %middle.block, %2
   %.ph = phi i64 [ 0, %2 ], [ %20, %middle.block ]
   %.ph7 = phi ptr [ %1, %2 ], [ %ind.end, %middle.block ]
   br label %scalar.ph

 vector.ph:                                        ; preds = %2
   %6 = udiv i64 %5, 24
   %7 = add nuw nsw i64 %6, 1
   %n.mod.vf = and i64 %7, 3
   %8 = icmp eq i64 %n.mod.vf, 0
   %9 = select i1 %8, i64 4, i64 %n.mod.vf
   %n.vec = sub nsw i64 %7, %9
   %10 = mul i64 %n.vec, 24
   %ind.end = getelementptr i8, ptr %1, i64 %10
-  %invariant.gep = getelementptr i8, ptr %1, i64 8
-  %invariant.gep12 = getelementptr i8, ptr %1, i64 56
+  %invariant.gep = getelementptr i8, ptr %1, i64 16
+  %invariant.gep12 = getelementptr i8, ptr %1, i64 64
   br label %vector.body

 vector.body:                                      ; preds = %vector.body, %vector.ph
   %index = phi i64 [ 0, %vector.ph ], [ %index.next, %vector.body ]
   %vec.phi = phi <2 x i64> [ zeroinitializer, %vector.ph ], [ %17, %vector.body ]
   %vec.phi1 = phi <2 x i64> [ zeroinitializer, %vector.ph ], [ %18, %vector.body ]
   %offset.idx = mul i64 %index, 24
   %gep = getelementptr i8, ptr %invariant.gep, i64 %offset.idx
+  %gep13 = getelementptr i8, ptr %invariant.gep12, i64 %offset.idx
   %wide.vec = load <6 x ptr>, ptr %gep, align 8
   %strided.vec = shufflevector <6 x ptr> %wide.vec, <6 x ptr> poison, <2 x i32> <i32 0, i32 3>
   %strided.vec3 = shufflevector <6 x ptr> %wide.vec, <6 x ptr> poison, <2 x i32> <i32 1, i32 4>
-  %gep13 = getelementptr i8, ptr %invariant.gep12, i64 %offset.idx
   %wide.vec4 = load <6 x ptr>, ptr %gep13, align 8
   %strided.vec5 = shufflevector <6 x ptr> %wide.vec4, <6 x ptr> poison, <2 x i32> <i32 0, i32 3>
   %strided.vec6 = shufflevector <6 x ptr> %wide.vec4, <6 x ptr> poison, <2 x i32> <i32 1, i32 4>
   %11 = ptrtoint <2 x ptr> %strided.vec3 to <2 x i64>
   %12 = ptrtoint <2 x ptr> %strided.vec6 to <2 x i64>
   %13 = ptrtoint <2 x ptr> %strided.vec to <2 x i64>
   %14 = ptrtoint <2 x ptr> %strided.vec5 to <2 x i64>
   %15 = or <2 x i64> %vec.phi, %11
   %16 = or <2 x i64> %vec.phi1, %12
   %17 = or <2 x i64> %15, %13
   %18 = or <2 x i64> %16, %14
   %index.next = add nuw i64 %index, 4
   %19 = icmp eq i64 %index.next, %n.vec
   br i1 %19, label %middle.block, label %vector.body, !llvm.loop !0

 middle.block:                                     ; preds = %vector.body
   %bin.rdx = or <2 x i64> %18, %17
   %20 = tail call i64 @llvm.vector.reduce.or.v2i64(<2 x i64> %bin.rdx)
   br label %scalar.ph.preheader

 21:                                               ; preds = %scalar.ph
   ret i64 %31

 scalar.ph:                                        ; preds = %scalar.ph.preheader, %scalar.ph
   %22 = phi i64 [ %31, %scalar.ph ], [ %.ph, %scalar.ph.preheader ]
   %23 = phi ptr [ %32, %scalar.ph ], [ %.ph7, %scalar.ph.preheader ]
   %24 = getelementptr i8, ptr %23, i64 16
   %25 = load ptr, ptr %24, align 8
   %26 = ptrtoint ptr %25 to i64
   %27 = getelementptr i8, ptr %23, i64 8
   %28 = load ptr, ptr %27, align 8
   %29 = ptrtoint ptr %28 to i64
   %30 = or i64 %22, %26
   %31 = or i64 %30, %29
   %32 = getelementptr nusw i8, ptr %23, i64 24
   %33 = icmp eq ptr %23, %0
   br i1 %33, label %21, label %scalar.ph, !llvm.loop !3
 }

 ; Function Attrs: nocallback nofree nosync nounwind speculatable willreturn memory(none)
 declare i64 @llvm.vector.reduce.or.v2i64(<2 x i64>) #1

 attributes #0 = { nofree norecurse nosync nounwind memory(argmem: read) }
 attributes #1 = { nocallback nofree nosync nounwind speculatable willreturn memory(none) }

 !0 = distinct !{!0, !1, !2}
 !1 = !{!"llvm.loop.isvectorized", i32 1}
 !2 = !{!"llvm.loop.unroll.runtime.disable"}
 !3 = distinct !{!3, !2, !1}

Hopefully, no important details were lost during reduction.

fhahn added a commit that referenced this pull request Oct 16, 2024
fhahn added a commit that referenced this pull request Oct 16, 2024
Use getTypeAllocSize to compute the offset to the start of interleave
groups instead of getScalarSizeInBits, which may return 0 for pointers.
This is in line with the analysis building the interleave groups and
fixes a mis-compile reported for
#106431.
fhahn (Contributor, Author) commented Oct 16, 2024

@alexfh thanks! Should be fixed in bbff5b8. The issue was that the patch was using getScalarSizeInBits, which returned 0 for pointers. Fixed by using getTypeAllocSize (in line with what the interleave analysis uses).
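(Concretely, as a sketch of the arithmetic rather than the exact patch: for a pointer-typed member, Type::getScalarSizeInBits() returns 0 because pointers have no primitive size, so the byte offset to member 0 was computed as 0 instead of 8 on a 64-bit target; DataLayout::getTypeAllocSize() yields 8 there. That matches the miscompile above, where the invariant GEPs shifted from base+8 and base+56 to base+16 and base+64, i.e. by exactly the one missing 8-byte member offset.)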

bricknerb pushed a commit to bricknerb/llvm-project that referenced this pull request Oct 16, 2024
bricknerb pushed a commit to bricknerb/llvm-project that referenced this pull request Oct 16, 2024
alexfh (Contributor) commented Oct 16, 2024

Thanks! bbff5b8 resolved the issues we found so far.

bricknerb pushed a commit to bricknerb/llvm-project that referenced this pull request Oct 17, 2024
bricknerb pushed a commit to bricknerb/llvm-project that referenced this pull request Oct 17, 2024