Skip to content

Conversation

@lukel97
Copy link
Contributor

@lukel97 lukel97 commented Aug 18, 2025

If we end up with a extract_element VPInstruction where both operands are live-ins, we will try to fold the live-ins even though the first operand is a vector whilst the live-in is scalar.

This fixes it by just returning the vector live-in instead of calling the folder, and removes the handling for insertelement where we aren't able to do the fold. From some quick testing we previously never hit this fold anyway, and were probably just missing test coverage.

Fixes #154045

If we end up with a extract_element VPInstruction where both operands are live-ins, we will try to fold the live-ins even though the first operand is vector and the live-in is scalar.

I don't think it's possible for us to fold these. We can't create a vector version of the live-in either because we don't know VF at this stage.

This removes the handling for opcodes that may have both scalar and vector operands. From some quick testing we previously never hit these folds anyway, and were probably just missing test coverage.

Fixes llvm#154045
@llvmbot
Copy link
Member

llvmbot commented Aug 18, 2025

@llvm/pr-subscribers-llvm-transforms

@llvm/pr-subscribers-vectorizers

Author: Luke Lau (lukel97)

Changes

If we end up with a extract_element VPInstruction where both operands are live-ins, we will try to fold the live-ins even though the first operand is vector and the live-in is scalar.

I don't think it's possible for us to fold these. We can't create a vector version of the live-in either because we don't know VF at this stage.

This removes the handling for opcodes that may have both scalar and vector operands. From some quick testing we previously never hit these folds anyway, and were probably just missing test coverage.

Fixes #154045


Full diff: https://github.com/llvm/llvm-project/pull/154067.diff

2 Files Affected:

  • (modified) llvm/lib/Transforms/Vectorize/VPlanTransforms.cpp (-5)
  • (added) llvm/test/Transforms/LoopVectorize/X86/pr154045-dont-fold-extractelement-livein.ll (+75)
diff --git a/llvm/lib/Transforms/Vectorize/VPlanTransforms.cpp b/llvm/lib/Transforms/Vectorize/VPlanTransforms.cpp
index de0c1e4d177bb..022d9b749f978 100644
--- a/llvm/lib/Transforms/Vectorize/VPlanTransforms.cpp
+++ b/llvm/lib/Transforms/Vectorize/VPlanTransforms.cpp
@@ -976,14 +976,9 @@ static Value *tryToFoldLiveIns(const VPRecipeBase &R, unsigned Opcode,
                           RFlags.getGEPNoWrapFlags());
   }
   case VPInstruction::PtrAdd:
-  case VPInstruction::WidePtrAdd:
     return Folder.FoldGEP(IntegerType::getInt8Ty(TypeInfo.getContext()), Ops[0],
                           Ops[1],
                           cast<VPRecipeWithIRFlags>(R).getGEPNoWrapFlags());
-  case Instruction::InsertElement:
-    return Folder.FoldInsertElement(Ops[0], Ops[1], Ops[2]);
-  case Instruction::ExtractElement:
-    return Folder.FoldExtractElement(Ops[0], Ops[1]);
   }
   return nullptr;
 }
diff --git a/llvm/test/Transforms/LoopVectorize/X86/pr154045-dont-fold-extractelement-livein.ll b/llvm/test/Transforms/LoopVectorize/X86/pr154045-dont-fold-extractelement-livein.ll
new file mode 100644
index 0000000000000..185071707b6e8
--- /dev/null
+++ b/llvm/test/Transforms/LoopVectorize/X86/pr154045-dont-fold-extractelement-livein.ll
@@ -0,0 +1,75 @@
+; NOTE: Assertions have been autogenerated by utils/update_test_checks.py UTC_ARGS: --check-globals none --version 5
+; RUN: opt -p loop-vectorize -S %s | FileCheck %s
+
+; Make sure we don't try to fold a Instruction::ExtractElement ir<0>, ir<0>, ,
+; since we can't materialize the live-in for the vector operand.
+
+target triple = "x86_64"
+
+define void @test(ptr %p, i1 %c, i64 %x) {
+; CHECK-LABEL: define void @test(
+; CHECK-SAME: ptr [[P:%.*]], i1 [[C:%.*]], i64 [[X:%.*]]) {
+; CHECK-NEXT:  [[ENTRY:.*:]]
+; CHECK-NEXT:    br i1 false, label %[[SCALAR_PH:.*]], label %[[VECTOR_PH:.*]]
+; CHECK:       [[VECTOR_PH]]:
+; CHECK-NEXT:    [[BROADCAST_SPLATINSERT:%.*]] = insertelement <2 x i1> poison, i1 [[C]], i64 0
+; CHECK-NEXT:    [[BROADCAST_SPLAT:%.*]] = shufflevector <2 x i1> [[BROADCAST_SPLATINSERT]], <2 x i1> poison, <2 x i32> zeroinitializer
+; CHECK-NEXT:    [[TMP0:%.*]] = xor <2 x i1> [[BROADCAST_SPLAT]], splat (i1 true)
+; CHECK-NEXT:    br label %[[VECTOR_BODY:.*]]
+; CHECK:       [[VECTOR_BODY]]:
+; CHECK-NEXT:    [[TMP1:%.*]] = extractelement <2 x i1> [[TMP0]], i32 0
+; CHECK-NEXT:    br i1 [[TMP1]], label %[[PRED_SREM_IF:.*]], label %[[PRED_SREM_CONTINUE:.*]]
+; CHECK:       [[PRED_SREM_IF]]:
+; CHECK-NEXT:    br label %[[PRED_SREM_CONTINUE]]
+; CHECK:       [[PRED_SREM_CONTINUE]]:
+; CHECK-NEXT:    [[TMP2:%.*]] = extractelement <2 x i1> [[TMP0]], i32 1
+; CHECK-NEXT:    br i1 [[TMP2]], label %[[PRED_SREM_IF1:.*]], label %[[PRED_SREM_CONTINUE2:.*]]
+; CHECK:       [[PRED_SREM_IF1]]:
+; CHECK-NEXT:    br label %[[PRED_SREM_CONTINUE2]]
+; CHECK:       [[PRED_SREM_CONTINUE2]]:
+; CHECK-NEXT:    store i32 0, ptr [[P]], align 4
+; CHECK-NEXT:    br label %[[MIDDLE_BLOCK:.*]]
+; CHECK:       [[MIDDLE_BLOCK]]:
+; CHECK-NEXT:    br label %[[EXIT:.*]]
+; CHECK:       [[SCALAR_PH]]:
+; CHECK-NEXT:    br label %[[LOOP:.*]]
+; CHECK:       [[LOOP]]:
+; CHECK-NEXT:    [[IV:%.*]] = phi i64 [ 0, %[[SCALAR_PH]] ], [ [[IV_NEXT:%.*]], %[[LATCH:.*]] ]
+; CHECK-NEXT:    br i1 [[C]], label %[[LATCH]], label %[[ELSE:.*]]
+; CHECK:       [[ELSE]]:
+; CHECK-NEXT:    [[REM:%.*]] = srem i64 0, [[X]]
+; CHECK-NEXT:    br label %[[LATCH]]
+; CHECK:       [[LATCH]]:
+; CHECK-NEXT:    [[PHI:%.*]] = phi i64 [ [[REM]], %[[ELSE]] ], [ 0, %[[LOOP]] ]
+; CHECK-NEXT:    [[PHI_TRUNC:%.*]] = trunc i64 [[PHI]] to i32
+; CHECK-NEXT:    [[SHL:%.*]] = shl i32 [[PHI_TRUNC]], 0
+; CHECK-NEXT:    store i32 [[SHL]], ptr [[P]], align 4
+; CHECK-NEXT:    [[IV_NEXT]] = add i64 [[IV]], 1
+; CHECK-NEXT:    [[EXITCOND:%.*]] = icmp eq i64 [[IV]], 1
+; CHECK-NEXT:    br i1 [[EXITCOND]], label %[[EXIT]], label %[[LOOP]], !llvm.loop [[LOOP0:![0-9]+]]
+; CHECK:       [[EXIT]]:
+; CHECK-NEXT:    ret void
+;
+entry:
+  br label %loop
+
+loop:
+  %iv = phi i64 [ 0, %entry ], [ %iv.next, %latch ]
+  br i1 %c, label %latch, label %else
+
+else:
+  %rem = srem i64 0, %x
+  br label %latch
+
+latch:
+  %phi = phi i64 [ %rem, %else], [ 0, %loop ]
+  %phi.trunc = trunc i64 %phi to i32
+  %shl = shl i32 %phi.trunc, 0
+  store i32 %shl, ptr %p
+  %iv.next = add i64 %iv, 1
+  %exitcond = icmp eq i64 %iv, 1
+  br i1 %exitcond, label %exit, label %loop
+
+exit:
+  ret void
+}

Copy link
Contributor

@artagnon artagnon left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not 100% sure I understand the problem yet.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it possible to make this test target-independent and combine it with constantfolder.ll, for instance?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I gave this a try but I couldn't get it to trigger without -mtriple x86_64. I think it has something to do with how the srem can't be widened on x86 and needs to be replicated instead. I'm not sure if there's an easy way of forcing that on the command line

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You should be able to move this to the target-independent file, it probably needs -force-widen-divrem-via-safe-divisor=false passing and the loop must have a trip count of at least 4

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Turns out it specifically needed a trip count of 2, otherwise a widened canonical IV recipe was created which seemed to throw things off. I've moved it into a separate target-agnostic file in caca6ea

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok that's strange, it worked as expected for me when I changed it to have trip count of 4 with VF=4. The wide canonical IV is created if the trip count is < VF, as then it tries to fold the tail instead of creating an un-excutable vector loop

@lukel97
Copy link
Contributor Author

lukel97 commented Aug 18, 2025

Not 100% sure I understand the problem yet.

live-ins are always scalar, and just broadcasted during recipe execution if a vector is needed AFAIK.

So currently we try to constant fold FoldExtractElement(Ops[0], Ops[1]) with two scalars. But the first operand needs to be a vector, hence the crash.

RFlags.getGEPNoWrapFlags());
}
case VPInstruction::PtrAdd:
case VPInstruction::WidePtrAdd:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need this? If both are constants, they both should be scalars, and the fold should just work, and the result re-broadcasted?

Copy link
Contributor Author

@lukel97 lukel97 Aug 18, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The second operand to WidePtrAdd is a vector. I don't think it will ever end up being a live-in? At least not the way it's used by expandVPWidenPointerInduction.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

At the moment yes, but that may change n the future. If it is a live-in, the value must be a scalar.

Comment on lines 983 to 986
case Instruction::InsertElement:
return Folder.FoldInsertElement(Ops[0], Ops[1], Ops[2]);
case Instruction::ExtractElement:
return Folder.FoldExtractElement(Ops[0], Ops[1]);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess we could still fold extracts if we have a constant for Ops[0], for which we could create a vector constant fairly easily

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can't create a vector constant because we don't know the exact VF, so we can't create a vector type

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right, so extracting any lane will return the live-in operand, which we can do w/o calling the folder, together with comment for why we are not calling the folder

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see. I don't think we can do that for InsertElement though, so are we ok to remove that fold?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You should be able to move this to the target-independent file, it probably needs -force-widen-divrem-via-safe-divisor=false passing and the loop must have a trip count of at least 4

@lukel97 lukel97 enabled auto-merge (squash) August 19, 2025 03:31
@lukel97 lukel97 merged commit 144736b into llvm:main Aug 19, 2025
9 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[LoopVectorize] Assertion `isa<To>(Val) && "cast<Ty>() argument of incompatible type!"' failed.

4 participants