[AMDGPU] Eliminate likely-spurious execz checks via intrinsic argument #123749

Open · wants to merge 5 commits into main

Conversation

@ritter-x2a (Member) commented Jan 21, 2025

Currently, we introduce branches to skip conditionally executed instructions if the EXEC mask is zero and only eliminate them if the scheduling model says that executing the skipped instructions is cheaper than taking the branch instruction.

This patch adds a heuristic to SIAnnotateControlFlow to determine if the lanes of a wavefront are likely to have dynamically divergent values for the branch condition.
This information is passed through new arguments/operands of the amdgcn.if/else intrinsics and the SI_IF/ELSE pseudo instructions to SILowerControlFlow, where the execz branch is inserted with corresponding branch probabilities.
This causes SIPreEmitPeephole to eliminate the corresponding execz branch if it is legal to do so.

This is an alternative to PR #117567, using a simpler heuristic and passing the LikelyDivergent information through new arguments for the amdgcn.if/else intrinsics and the SI_IF/ELSE pseudo instructions instead of abusing branch weight metadata.

Most test changes are caused by the new arguments for the amdgcn.if/else intrinsics and the SI_IF/ELSE pseudo instructions; the LikelyDivergent argument is set to false/0 in these existing tests. New tests for the functionality are in conditional-mem-no-cbranch-execz.ll and annotate-likely-divergent-branches.ll.
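As an illustration of the new intrinsic argument (a hypothetical example, not one of the added tests; the kernel and value names are made up), a branch whose condition is computed from the workitem ID would be annotated roughly as follows after SIAnnotateControlFlow:

; Input: the branch condition derives from the workitem ID, so the heuristic
; considers it likely to differ between the lanes of a wavefront.
declare i32 @llvm.amdgcn.workitem.id.x()

define amdgpu_kernel void @example(ptr addrspace(1) %out) {
entry:
  %tid = call i32 @llvm.amdgcn.workitem.id.x()
  %cond = icmp ult i32 %tid, 16
  br i1 %cond, label %then, label %end

then:
  store i32 1, ptr addrspace(1) %out
  br label %end

end:
  ret void
}

; After SIAnnotateControlFlow (wave64), the condition is routed through the
; annotated intrinsic with the new i1 argument set to true:
;   %if = call { i1, i64 } @llvm.amdgcn.if.i64(i1 %cond, i1 true)
; Conditions the heuristic cannot trace back to the workitem ID keep i1 false,
; which is the value used in the updated existing tests.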

For SWDEV-483228.

@llvmbot (Member) commented Jan 21, 2025

@llvm/pr-subscribers-llvm-analysis
@llvm/pr-subscribers-llvm-globalisel
@llvm/pr-subscribers-llvm-ir

@llvm/pr-subscribers-backend-amdgpu

Author: Fabian Ritter (ritter-x2a)

Changes

Currently, we introduce branches to skip conditionally executed instructions if the EXEC mask is zero and only eliminate them if the scheduling model says that executing the skipped instructions is cheaper than taking the branch instruction.

This patch adds a heuristic to SIAnnotateControlFlow to determine if the lanes of a wavefront are likely to have dynamically varying values for the branch condition.
This information is passed through new arguments/operands of the amdgcn.if/else intrinsics and the SI_IF/ELSE pseudo instructions to SILowerControlFlow, where the execz branch is inserted with corresponding branch probabilities.
This causes SIPreEmitPeephole to eliminate the corresponding execz branch if it is legal to do so.

This is an alternative to PR #117567, using a simpler heuristic and passing the LikelyVarying information through new arguments for the amdgcn.if/else intrinsics and the SI_IF/ELSE pseudo instructions instead of abusing branch weight metadata.

Most test changes are caused by the new arguments for the amdgcn.if/else intrinsics and the SI_IF/ELSE pseudo instructions; the LikelyVarying argument is set to false/0 in these existing tests. New tests for the functionality are in conditional-mem-no-cbranch-execz.ll and annotate-likely-varying-branches.ll.

For SWDEV-483228.


Patch is 273.07 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/123749.diff

75 Files Affected:

  • (modified) llvm/include/llvm/IR/IntrinsicsAMDGPU.td (+2-2)
  • (modified) llvm/lib/Target/AMDGPU/AMDGPUInstrInfo.td (+4-4)
  • (modified) llvm/lib/Target/AMDGPU/AMDGPULegalizerInfo.cpp (+3)
  • (modified) llvm/lib/Target/AMDGPU/SIAnnotateControlFlow.cpp (+93-4)
  • (modified) llvm/lib/Target/AMDGPU/SIInstructions.td (+5-5)
  • (modified) llvm/lib/Target/AMDGPU/SILowerControlFlow.cpp (+29-3)
  • (modified) llvm/lib/Target/AMDGPU/SIOptimizeVGPRLiveRange.cpp (+2-2)
  • (modified) llvm/lib/Target/AMDGPU/SIPreEmitPeephole.cpp (+4)
  • (modified) llvm/test/Analysis/DotMachineCFG/AMDGPU/irreducible.mir (+4-4)
  • (modified) llvm/test/Analysis/UniformityAnalysis/AMDGPU/MIR/control-flow-intrinsics.mir (+2-2)
  • (modified) llvm/test/Analysis/UniformityAnalysis/AMDGPU/MIR/uses-value-from-cycle.mir (+1-1)
  • (modified) llvm/test/Analysis/UniformityAnalysis/AMDGPU/control-flow-intrinsics.ll (+10-10)
  • (modified) llvm/test/Analysis/UniformityAnalysis/AMDGPU/deprecated/hidden-diverge.mir (+2-2)
  • (modified) llvm/test/Analysis/UniformityAnalysis/AMDGPU/deprecated/irreducible-1.mir (+2-2)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/divergence-divergent-i1-used-outside-loop.mir (+20-20)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/divergence-structurizer.mir (+24-24)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/divergence-temporal-divergent-i1.mir (+2-2)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/global-atomic-fadd.f32-no-rtn.ll (+8-8)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/global-atomic-fadd.f32-rtn.ll (+6-6)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/legalize-brcond.mir (+29-15)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/regbankselect-mui-regbanklegalize.mir (+6-6)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/regbankselect-mui-regbankselect.mir (+6-6)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/regbankselect-mui.mir (+9-9)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/vni8-across-blocks.ll (+6-12)
  • (added) llvm/test/CodeGen/AMDGPU/annotate-likely-varying-branches.ll (+621)
  • (modified) llvm/test/CodeGen/AMDGPU/atomic_optimizations_pixelshader.ll (+11-22)
  • (modified) llvm/test/CodeGen/AMDGPU/block-should-not-be-in-alive-blocks.mir (+2-2)
  • (modified) llvm/test/CodeGen/AMDGPU/branch-folding-implicit-def-subreg.ll (+1-1)
  • (modified) llvm/test/CodeGen/AMDGPU/cgp-addressing-modes-flat.ll (+12-24)
  • (modified) llvm/test/CodeGen/AMDGPU/collapse-endcf.ll (+4-8)
  • (modified) llvm/test/CodeGen/AMDGPU/collapse-endcf.mir (+31-31)
  • (added) llvm/test/CodeGen/AMDGPU/conditional-mem-no-cbranch-execz.ll (+124)
  • (modified) llvm/test/CodeGen/AMDGPU/constant-fold-imm-immreg.mir (+1-1)
  • (modified) llvm/test/CodeGen/AMDGPU/dagcombine-lshr-and-cmp.ll (+1-1)
  • (modified) llvm/test/CodeGen/AMDGPU/dpp_combine.mir (+1-1)
  • (modified) llvm/test/CodeGen/AMDGPU/dpp_combine_gfx11.mir (+1-1)
  • (modified) llvm/test/CodeGen/AMDGPU/global-atomic-fadd.f32-no-rtn.ll (+6-6)
  • (modified) llvm/test/CodeGen/AMDGPU/global-atomic-fadd.f32-rtn.ll (+6-6)
  • (modified) llvm/test/CodeGen/AMDGPU/i1_copy_phi_with_phi_incoming_value.mir (+4-4)
  • (modified) llvm/test/CodeGen/AMDGPU/insert-delay-alu-bug.ll (+16-17)
  • (modified) llvm/test/CodeGen/AMDGPU/local-atomicrmw-fadd.ll (+14-29)
  • (modified) llvm/test/CodeGen/AMDGPU/lower-control-flow-live-intervals.mir (+6-6)
  • (modified) llvm/test/CodeGen/AMDGPU/lower-control-flow-live-variables-update.mir (+6-6)
  • (modified) llvm/test/CodeGen/AMDGPU/lower-control-flow-live-variables-update.xfail.mir (+2-2)
  • (modified) llvm/test/CodeGen/AMDGPU/lower-control-flow-other-terminators.mir (+5-5)
  • (modified) llvm/test/CodeGen/AMDGPU/lower-i1-copies-clear-kills.mir (+4-4)
  • (modified) llvm/test/CodeGen/AMDGPU/machine-sink-ignorable-exec-use.mir (+14-14)
  • (modified) llvm/test/CodeGen/AMDGPU/machine-sink-lane-mask.mir (+2-2)
  • (modified) llvm/test/CodeGen/AMDGPU/machine-sink-loop-var-out-of-divergent-loop-swdev407790.mir (+8-8)
  • (modified) llvm/test/CodeGen/AMDGPU/memory-legalizer-atomic-insert-end.mir (+2-2)
  • (modified) llvm/test/CodeGen/AMDGPU/memory-legalizer-multiple-mem-operands-nontemporal-1.mir (+2-2)
  • (modified) llvm/test/CodeGen/AMDGPU/mmra.ll (+1-1)
  • (modified) llvm/test/CodeGen/AMDGPU/multi-divergent-exit-region.ll (+17-17)
  • (modified) llvm/test/CodeGen/AMDGPU/multilevel-break.ll (+1-1)
  • (modified) llvm/test/CodeGen/AMDGPU/nested-loop-conditions.ll (+5-5)
  • (modified) llvm/test/CodeGen/AMDGPU/opt-sgpr-to-vgpr-copy.mir (+6-6)
  • (modified) llvm/test/CodeGen/AMDGPU/opt-vgpr-live-range-verifier-error.mir (+3-3)
  • (modified) llvm/test/CodeGen/AMDGPU/phi-elimination-end-cf.mir (+4-4)
  • (modified) llvm/test/CodeGen/AMDGPU/si-annotate-dbg-info.ll (+3-3)
  • (modified) llvm/test/CodeGen/AMDGPU/si-fix-sgpr-copies.mir (+1-1)
  • (modified) llvm/test/CodeGen/AMDGPU/si-lower-control-flow.mir (+7-7)
  • (modified) llvm/test/CodeGen/AMDGPU/si-lower-i1-copies-order-of-phi-incomings.mir (+2-2)
  • (modified) llvm/test/CodeGen/AMDGPU/si-lower-i1-copies.mir (+1-1)
  • (modified) llvm/test/CodeGen/AMDGPU/si-opt-vgpr-liverange-bug-deadlanes.mir (+4-4)
  • (modified) llvm/test/CodeGen/AMDGPU/si-optimize-vgpr-live-range-dbg-instr.mir (+4-4)
  • (modified) llvm/test/CodeGen/AMDGPU/si-unify-exit-multiple-unreachables.ll (+4-4)
  • (modified) llvm/test/CodeGen/AMDGPU/si-unify-exit-return-unreachable.ll (+6-6)
  • (modified) llvm/test/CodeGen/AMDGPU/stale-livevar-in-twoaddr-pass.mir (+1-1)
  • (modified) llvm/test/CodeGen/AMDGPU/stop-tail-duplicate-cfg-intrinsic.mir (+2-2)
  • (modified) llvm/test/CodeGen/AMDGPU/unstructured-cfg-def-use-issue.ll (+5-5)
  • (modified) llvm/test/CodeGen/AMDGPU/vgpr-liverange-ir.ll (+12-12)
  • (modified) llvm/test/CodeGen/AMDGPU/vni8-across-blocks.ll (+7-14)
  • (modified) llvm/test/CodeGen/MIR/AMDGPU/machine-function-info-long-branch-reg-debug.ll (+2-2)
  • (modified) llvm/test/CodeGen/MIR/AMDGPU/machine-metadata.mir (+2-2)
  • (modified) llvm/test/CodeGen/MIR/AMDGPU/syncscopes.mir (+2-2)
diff --git a/llvm/include/llvm/IR/IntrinsicsAMDGPU.td b/llvm/include/llvm/IR/IntrinsicsAMDGPU.td
index b529642a558710..63c3e3c4d0eac8 100644
--- a/llvm/include/llvm/IR/IntrinsicsAMDGPU.td
+++ b/llvm/include/llvm/IR/IntrinsicsAMDGPU.td
@@ -3407,11 +3407,11 @@ def int_amdgcn_ashr_pk_u8_i32 : ClangBuiltin<"__builtin_amdgcn_ashr_pk_u8_i32">,
 // having side effects, which is sufficient to prevent optimizations without
 // having to mark them as convergent.
 def int_amdgcn_if : Intrinsic<[llvm_i1_ty, llvm_anyint_ty],
-  [llvm_i1_ty], [IntrWillReturn, IntrNoCallback, IntrNoFree]
+  [llvm_i1_ty, llvm_i1_ty], [ImmArg<ArgIndex<1>>, IntrWillReturn, IntrNoCallback, IntrNoFree]
 >;
 
 def int_amdgcn_else : Intrinsic<[llvm_i1_ty, llvm_anyint_ty],
-  [llvm_anyint_ty], [IntrWillReturn, IntrNoCallback, IntrNoFree]
+  [llvm_anyint_ty, llvm_i1_ty], [ImmArg<ArgIndex<1>>, IntrWillReturn, IntrNoCallback, IntrNoFree]
 >;
 
 def int_amdgcn_if_break : Intrinsic<[llvm_anyint_ty],
diff --git a/llvm/lib/Target/AMDGPU/AMDGPUInstrInfo.td b/llvm/lib/Target/AMDGPU/AMDGPUInstrInfo.td
index bec294a945d2fe..89304bfe51efe6 100644
--- a/llvm/lib/Target/AMDGPU/AMDGPUInstrInfo.td
+++ b/llvm/lib/Target/AMDGPU/AMDGPUInstrInfo.td
@@ -42,12 +42,12 @@ def AMDGPUFmasOp : SDTypeProfile<1, 4,
 def ImmOp : SDTypeProfile<0, 1, [SDTCisInt<0>]>;
 def AMDGPUKillSDT : SDTypeProfile<0, 1, [SDTCisInt<0>]>;
 
-def AMDGPUIfOp : SDTypeProfile<1, 2,
-  [SDTCisVT<0, i1>, SDTCisVT<1, i1>, SDTCisVT<2, OtherVT>]
+def AMDGPUIfOp : SDTypeProfile<1, 3,
+  [SDTCisVT<0, i1>, SDTCisVT<1, i1>, SDTCisVT<2, i1>, SDTCisVT<3, OtherVT>]
 >;
 
-def AMDGPUElseOp : SDTypeProfile<1, 2,
-  [SDTCisVT<0, i1>, SDTCisVT<1, i1>, SDTCisVT<2, OtherVT>]
+def AMDGPUElseOp : SDTypeProfile<1, 3,
+  [SDTCisVT<0, i1>, SDTCisVT<1, i1>, SDTCisVT<2, i1>, SDTCisVT<3, OtherVT>]
 >;
 
 def AMDGPULoopOp : SDTypeProfile<0, 2,
diff --git a/llvm/lib/Target/AMDGPU/AMDGPULegalizerInfo.cpp b/llvm/lib/Target/AMDGPU/AMDGPULegalizerInfo.cpp
index e9e47eaadd557f..9fb756232e83bc 100644
--- a/llvm/lib/Target/AMDGPU/AMDGPULegalizerInfo.cpp
+++ b/llvm/lib/Target/AMDGPU/AMDGPULegalizerInfo.cpp
@@ -7264,6 +7264,7 @@ bool AMDGPULegalizerInfo::legalizeIntrinsic(LegalizerHelper &Helper,
 
       Register Def = MI.getOperand(1).getReg();
       Register Use = MI.getOperand(3).getReg();
+      auto LikelyVarying = MI.getOperand(4).getImm();
 
       MachineBasicBlock *CondBrTarget = BrCond->getOperand(1).getMBB();
 
@@ -7275,11 +7276,13 @@ bool AMDGPULegalizerInfo::legalizeIntrinsic(LegalizerHelper &Helper,
         B.buildInstr(AMDGPU::SI_IF)
           .addDef(Def)
           .addUse(Use)
+          .addImm(LikelyVarying)
           .addMBB(UncondBrTarget);
       } else {
         B.buildInstr(AMDGPU::SI_ELSE)
             .addDef(Def)
             .addUse(Use)
+            .addImm(LikelyVarying)
             .addMBB(UncondBrTarget);
       }
 
diff --git a/llvm/lib/Target/AMDGPU/SIAnnotateControlFlow.cpp b/llvm/lib/Target/AMDGPU/SIAnnotateControlFlow.cpp
index 4ff6fc32b642dd..854eb7706416ae 100644
--- a/llvm/lib/Target/AMDGPU/SIAnnotateControlFlow.cpp
+++ b/llvm/lib/Target/AMDGPU/SIAnnotateControlFlow.cpp
@@ -22,6 +22,7 @@
 #include "llvm/IR/Dominators.h"
 #include "llvm/IR/IRBuilder.h"
 #include "llvm/IR/IntrinsicsAMDGPU.h"
+#include "llvm/IR/IntrinsicsR600.h"
 #include "llvm/Target/TargetMachine.h"
 #include "llvm/Transforms/Utils/BasicBlockUtils.h"
 #include "llvm/Transforms/Utils/Local.h"
@@ -36,6 +37,24 @@ namespace {
 using StackEntry = std::pair<BasicBlock *, Value *>;
 using StackVector = SmallVector<StackEntry, 16>;
 
+class LikelyVaryingHeuristic {
+public:
+  LikelyVaryingHeuristic(const Function &F, const GCNSubtarget &ST) {
+    IsSingleLaneExecution = ST.isSingleLaneExecution(F);
+  }
+
+  /// Check if \p V is likely to have dynamically varying values among the
+  /// workitems in each wavefront.
+  bool isLikelyVarying(const Value *V);
+
+private:
+  bool IsSingleLaneExecution = false;
+
+  bool isRelevantSourceOfDivergence(const Value *V) const;
+
+  ValueMap<const Value *, bool> LikelyVaryingCache;
+};
+
 class SIAnnotateControlFlow {
 private:
   Function *F;
@@ -62,6 +81,8 @@ class SIAnnotateControlFlow {
 
   LoopInfo *LI;
 
+  LikelyVaryingHeuristic LVHeuristic;
+
   void initialize(const GCNSubtarget &ST);
 
   bool isUniform(BranchInst *T);
@@ -99,7 +120,7 @@ class SIAnnotateControlFlow {
 public:
   SIAnnotateControlFlow(Function &F, const GCNSubtarget &ST, DominatorTree &DT,
                         LoopInfo &LI, UniformityInfo &UA)
-      : F(&F), UA(&UA), DT(&DT), LI(&LI) {
+      : F(&F), UA(&UA), DT(&DT), LI(&LI), LVHeuristic(F, ST) {
     initialize(ST);
   }
 
@@ -186,9 +207,14 @@ bool SIAnnotateControlFlow::openIf(BranchInst *Term) {
   if (isUniform(Term))
     return false;
 
+  // Check if it's likely that at least one lane will always follow the
+  // then-branch, i.e., the then-branch is never skipped completely.
+  Value *IsLikelyVarying =
+      LVHeuristic.isLikelyVarying(Term->getCondition()) ? BoolTrue : BoolFalse;
+
   IRBuilder<> IRB(Term);
   Value *IfCall = IRB.CreateCall(getDecl(If, Intrinsic::amdgcn_if, IntMask),
-                                 {Term->getCondition()});
+                                 {Term->getCondition(), IsLikelyVarying});
   Value *Cond = IRB.CreateExtractValue(IfCall, {0});
   Value *Mask = IRB.CreateExtractValue(IfCall, {1});
   Term->setCondition(Cond);
@@ -202,9 +228,16 @@ bool SIAnnotateControlFlow::insertElse(BranchInst *Term) {
     return false;
   }
 
+  Value *IncomingMask = popSaved();
+  // Check if it's likely that at least one lane will always follow the
+  // else-branch, i.e., the else-branch is never skipped completely.
+  Value *IsLikelyVarying =
+      LVHeuristic.isLikelyVarying(IncomingMask) ? BoolTrue : BoolFalse;
+
   IRBuilder<> IRB(Term);
-  Value *ElseCall = IRB.CreateCall(
-      getDecl(Else, Intrinsic::amdgcn_else, {IntMask, IntMask}), {popSaved()});
+  Value *ElseCall =
+      IRB.CreateCall(getDecl(Else, Intrinsic::amdgcn_else, {IntMask, IntMask}),
+                     {IncomingMask, IsLikelyVarying});
   Value *Cond = IRB.CreateExtractValue(ElseCall, {0});
   Value *Mask = IRB.CreateExtractValue(ElseCall, {1});
   Term->setCondition(Cond);
@@ -385,6 +418,62 @@ bool SIAnnotateControlFlow::run() {
   return Changed;
 }
 
+bool LikelyVaryingHeuristic::isRelevantSourceOfDivergence(
+    const Value *V) const {
+  auto *II = dyn_cast<IntrinsicInst>(V);
+  if (!II)
+    return false;
+
+  switch (II->getIntrinsicID()) {
+  case Intrinsic::amdgcn_workitem_id_z:
+  case Intrinsic::r600_read_tidig_z:
+  case Intrinsic::amdgcn_workitem_id_y:
+  case Intrinsic::r600_read_tidig_y:
+  case Intrinsic::amdgcn_workitem_id_x:
+  case Intrinsic::r600_read_tidig_x:
+  case Intrinsic::amdgcn_mbcnt_hi:
+  case Intrinsic::amdgcn_mbcnt_lo:
+    return true;
+  default:
+    return false;
+  }
+}
+
+bool LikelyVaryingHeuristic::isLikelyVarying(const Value *V) {
+  if (IsSingleLaneExecution)
+    return false;
+
+  if (isRelevantSourceOfDivergence(V))
+    return true;
+
+  auto *I = dyn_cast<Instruction>(V);
+  if (!I)
+    return false;
+
+  // ExtractValueInst and IntrinsicInst enable looking through the
+  // amdgcn_if/else intrinsics inserted by SIAnnotateControlFlow.
+  // This condition excludes PHINodes, which prevents infinite recursion.
+  if (!isa<BinaryOperator>(I) && !isa<UnaryOperator>(I) && !isa<CastInst>(I) &&
+      !isa<CmpInst>(I) && !isa<ExtractValueInst>(I) && !isa<IntrinsicInst>(I))
+    return false;
+
+  // Have we already checked V?
+  auto CacheEntry = LikelyVaryingCache.find(V);
+  if (CacheEntry != LikelyVaryingCache.end())
+    return CacheEntry->second;
+
+  // Does it use a likely varying Value?
+  bool Result = false;
+  for (const auto &Use : I->operands()) {
+    Result |= isLikelyVarying(Use);
+    if (Result)
+      break;
+  }
+
+  LikelyVaryingCache.insert({V, Result});
+  return Result;
+}
+
 PreservedAnalyses SIAnnotateControlFlowPass::run(Function &F,
                                                  FunctionAnalysisManager &FAM) {
   const GCNSubtarget &ST = TM.getSubtarget<GCNSubtarget>(F);
diff --git a/llvm/lib/Target/AMDGPU/SIInstructions.td b/llvm/lib/Target/AMDGPU/SIInstructions.td
index 40a20fa9cb15ea..4aafb7d502570d 100644
--- a/llvm/lib/Target/AMDGPU/SIInstructions.td
+++ b/llvm/lib/Target/AMDGPU/SIInstructions.td
@@ -417,8 +417,8 @@ def IGLP_OPT : SPseudoInstSI<(outs), (ins i32imm:$mask),
 let isTerminator = 1, isNotDuplicable = 1 in {
 
 def SI_IF: CFPseudoInstSI <
-  (outs SReg_1:$dst), (ins SReg_1:$vcc, brtarget:$target),
-  [(set i1:$dst, (AMDGPUif i1:$vcc, bb:$target))], 1, 1> {
+  (outs SReg_1:$dst), (ins SReg_1:$vcc, i1imm:$likelyvarying, brtarget:$target),
+  [(set i1:$dst, (AMDGPUif i1:$vcc, (i1 timm:$likelyvarying), bb:$target))], 1, 1> {
   let Constraints = "";
   let Size = 12;
   let hasSideEffects = 1;
@@ -427,7 +427,7 @@ def SI_IF: CFPseudoInstSI <
 
 def SI_ELSE : CFPseudoInstSI <
   (outs SReg_1:$dst),
-  (ins SReg_1:$src, brtarget:$target), [], 1, 1> {
+  (ins SReg_1:$src, i1imm:$likelyvarying, brtarget:$target), [], 1, 1> {
   let Size = 12;
   let hasSideEffects = 1;
   let IsNeverUniform = 1;
@@ -1049,8 +1049,8 @@ def : GCNPat<
 >;
 
 def : GCNPat<
-  (AMDGPUelse i1:$src, bb:$target),
-  (SI_ELSE $src, $target)
+  (AMDGPUelse i1:$src, i1:$likelyvarying, bb:$target),
+  (SI_ELSE $src, $likelyvarying, $target)
 >;
 
 def : Pat <
diff --git a/llvm/lib/Target/AMDGPU/SILowerControlFlow.cpp b/llvm/lib/Target/AMDGPU/SILowerControlFlow.cpp
index f8878f32f829d1..d59f3504a2e342 100644
--- a/llvm/lib/Target/AMDGPU/SILowerControlFlow.cpp
+++ b/llvm/lib/Target/AMDGPU/SILowerControlFlow.cpp
@@ -55,8 +55,11 @@
 #include "llvm/ADT/SmallSet.h"
 #include "llvm/CodeGen/LiveIntervals.h"
 #include "llvm/CodeGen/LiveVariables.h"
+#include "llvm/CodeGen/MachineBasicBlock.h"
 #include "llvm/CodeGen/MachineDominators.h"
 #include "llvm/CodeGen/MachineFunctionPass.h"
+#include "llvm/IR/LLVMContext.h"
+#include "llvm/Support/BranchProbability.h"
 #include "llvm/Target/TargetMachine.h"
 
 using namespace llvm;
@@ -221,9 +224,11 @@ void SILowerControlFlow::emitIf(MachineInstr &MI) {
   MachineOperand& Cond = MI.getOperand(1);
   assert(Cond.getSubReg() == AMDGPU::NoSubRegister);
 
-  MachineOperand &ImpDefSCC = MI.getOperand(4);
+  MachineOperand &ImpDefSCC = MI.getOperand(5);
   assert(ImpDefSCC.getReg() == AMDGPU::SCC && ImpDefSCC.isDef());
 
+  bool LikelyVarying = MI.getOperand(2).getImm();
+
   // If there is only one use of save exec register and that use is SI_END_CF,
   // we can optimize SI_IF by returning the full saved exec mask instead of
   // just cleared bits.
@@ -281,7 +286,17 @@ void SILowerControlFlow::emitIf(MachineInstr &MI) {
   // Insert the S_CBRANCH_EXECZ instruction which will be optimized later
   // during SIPreEmitPeephole.
   MachineInstr *NewBr = BuildMI(MBB, I, DL, TII->get(AMDGPU::S_CBRANCH_EXECZ))
-                            .add(MI.getOperand(2));
+                            .add(MI.getOperand(3));
+
+  if (LikelyVarying) {
+    MachineBasicBlock *ExeczDest = MI.getOperand(3).getMBB();
+    auto **E = MBB.succ_end();
+    for (auto **SI = MBB.succ_begin(); SI != E; ++SI) {
+      if (*SI == ExeczDest)
+        MBB.setSuccProbability(SI, BranchProbability::getZero());
+    }
+    MBB.normalizeSuccProbs();
+  }
 
   if (!LIS) {
     MI.eraseFromParent();
@@ -329,7 +344,9 @@ void SILowerControlFlow::emitElse(MachineInstr &MI) {
   if (LV)
     LV->replaceKillInstruction(SrcReg, MI, *OrSaveExec);
 
-  MachineBasicBlock *DestBB = MI.getOperand(2).getMBB();
+  bool LikelyVarying = MI.getOperand(2).getImm();
+
+  MachineBasicBlock *DestBB = MI.getOperand(3).getMBB();
 
   MachineBasicBlock::iterator ElsePt(MI);
 
@@ -352,6 +369,15 @@ void SILowerControlFlow::emitElse(MachineInstr &MI) {
       BuildMI(MBB, ElsePt, DL, TII->get(AMDGPU::S_CBRANCH_EXECZ))
           .addMBB(DestBB);
 
+  if (LikelyVarying) {
+    auto **E = MBB.succ_end();
+    for (auto **SI = MBB.succ_begin(); SI != E; ++SI) {
+      if (*SI == DestBB)
+        MBB.setSuccProbability(SI, BranchProbability::getZero());
+    }
+    MBB.normalizeSuccProbs();
+  }
+
   if (!LIS) {
     MI.eraseFromParent();
     return;
diff --git a/llvm/lib/Target/AMDGPU/SIOptimizeVGPRLiveRange.cpp b/llvm/lib/Target/AMDGPU/SIOptimizeVGPRLiveRange.cpp
index ff0b9b4a7574bf..f0791efe7d191c 100644
--- a/llvm/lib/Target/AMDGPU/SIOptimizeVGPRLiveRange.cpp
+++ b/llvm/lib/Target/AMDGPU/SIOptimizeVGPRLiveRange.cpp
@@ -184,7 +184,7 @@ MachineBasicBlock *
 SIOptimizeVGPRLiveRange::getElseTarget(MachineBasicBlock *MBB) const {
   for (auto &BR : MBB->terminators()) {
     if (BR.getOpcode() == AMDGPU::SI_ELSE)
-      return BR.getOperand(2).getMBB();
+      return BR.getOperand(3).getMBB();
   }
   return nullptr;
 }
@@ -682,7 +682,7 @@ bool SIOptimizeVGPRLiveRange::run(MachineFunction &MF) {
     for (auto &MI : MBB.terminators()) {
       // Detect the if-else blocks
       if (MI.getOpcode() == AMDGPU::SI_IF) {
-        MachineBasicBlock *IfTarget = MI.getOperand(2).getMBB();
+        MachineBasicBlock *IfTarget = MI.getOperand(3).getMBB();
         auto *Endif = getElseTarget(IfTarget);
         if (!Endif)
           continue;
diff --git a/llvm/lib/Target/AMDGPU/SIPreEmitPeephole.cpp b/llvm/lib/Target/AMDGPU/SIPreEmitPeephole.cpp
index 2bb70c138a50c4..55777d7da366b0 100644
--- a/llvm/lib/Target/AMDGPU/SIPreEmitPeephole.cpp
+++ b/llvm/lib/Target/AMDGPU/SIPreEmitPeephole.cpp
@@ -14,6 +14,7 @@
 #include "AMDGPU.h"
 #include "GCNSubtarget.h"
 #include "MCTargetDesc/AMDGPUMCTargetDesc.h"
+#include "llvm/ADT/Statistic.h"
 #include "llvm/CodeGen/MachineFunctionPass.h"
 #include "llvm/CodeGen/TargetSchedule.h"
 #include "llvm/Support/BranchProbability.h"
@@ -22,6 +23,8 @@ using namespace llvm;
 
 #define DEBUG_TYPE "si-pre-emit-peephole"
 
+STATISTIC(NumCBranchExeczElim, "Number of s_cbranch_execz eliminated.");
+
 namespace {
 
 class SIPreEmitPeephole : public MachineFunctionPass {
@@ -404,6 +407,7 @@ bool SIPreEmitPeephole::removeExeczBranch(MachineInstr &MI,
     return false;
 
   LLVM_DEBUG(dbgs() << "Removing the execz branch: " << MI);
+  ++NumCBranchExeczElim;
   MI.eraseFromParent();
   SrcMBB.removeSuccessor(TrueMBB);
 
diff --git a/llvm/test/Analysis/DotMachineCFG/AMDGPU/irreducible.mir b/llvm/test/Analysis/DotMachineCFG/AMDGPU/irreducible.mir
index 56ea4b528ba8f1..d618f21fff694b 100644
--- a/llvm/test/Analysis/DotMachineCFG/AMDGPU/irreducible.mir
+++ b/llvm/test/Analysis/DotMachineCFG/AMDGPU/irreducible.mir
@@ -14,10 +14,10 @@
 # MCFG-NEXT: Node{{[0-9A-Za-z]*}} -> Node{{[0-9A-Za-z]*}};
 # MCFG-NEXT: Node{{[0-9A-Za-z]*}} [shape=record,label="{bb.3:\l|\l successors: %bb.4(0x80000000)\l\l  %4:vgpr_32 = PHI %5:vgpr_32, %bb.1, %7:vgpr_32, %bb.2\l}"];
 # MCFG-NEXT: Node{{[0-9A-Za-z]*}} -> Node{{[0-9A-Za-z]*}};
-# MCFG-NEXT: Node{{[0-9A-Za-z]*}} [shape=record,label="{bb.4:\l|\l successors: %bb.2(0x40000000), %bb.5(0x40000000)\l\l  %8:vgpr_32 = V_AND_B32_e32 3, %1:vgpr_32, implicit $exec\l  %9:sreg_64 = V_CMP_EQ_U32_e64 %8:vgpr_32, 2, implicit $exec\l  %10:sreg_64 = SI_IF killed %9:sreg_64, %bb.2, implicit-def dead $exec,\l... implicit-def dead $scc, implicit $exec\l}"];
+# MCFG-NEXT: Node{{[0-9A-Za-z]*}} [shape=record,label="{bb.4:\l|\l successors: %bb.2(0x40000000), %bb.5(0x40000000)\l\l  %8:vgpr_32 = V_AND_B32_e32 3, %1:vgpr_32, implicit $exec\l  %9:sreg_64 = V_CMP_EQ_U32_e64 %8:vgpr_32, 2, implicit $exec\l  %10:sreg_64 = SI_IF killed %9:sreg_64, 0, %bb.2, implicit-def dead $exec,\l... implicit-def dead $scc, implicit $exec\l}"];
 # MCFG-NEXT: Node{{[0-9A-Za-z]*}} -> Node{{[0-9A-Za-z]*}};
 # MCFG-NEXT: Node{{[0-9A-Za-z]*}} -> Node{{[0-9A-Za-z]*}};
-# MCFG-NEXT: Node{{[0-9A-Za-z]*}} [shape=record,label="{bb.5:\l|\l successors: %bb.1(0x40000000), %bb.6(0x40000000)\l\l  %11:sreg_64 = V_CMP_EQ_U32_e64 %8:vgpr_32, 1, implicit $exec\l  %12:sreg_64 = SI_IF killed %11:sreg_64, %bb.1, implicit-def dead $exec,\l... implicit-def dead $scc, implicit $exec\l}"];
+# MCFG-NEXT: Node{{[0-9A-Za-z]*}} [shape=record,label="{bb.5:\l|\l successors: %bb.1(0x40000000), %bb.6(0x40000000)\l\l  %11:sreg_64 = V_CMP_EQ_U32_e64 %8:vgpr_32, 1, implicit $exec\l  %12:sreg_64 = SI_IF killed %11:sreg_64, 0, %bb.1, implicit-def dead $exec,\l... implicit-def dead $scc, implicit $exec\l}"];
 # MCFG-NEXT: Node{{[0-9A-Za-z]*}} -> Node{{[0-9A-Za-z]*}};
 # MCFG-NEXT: Node{{[0-9A-Za-z]*}} -> Node{{[0-9A-Za-z]*}};
 # MCFG-NEXT: Node{{[0-9A-Za-z]*}} [shape=record,label="{bb.6:\l|\l\l S_ENDPGM 0\l}"];
@@ -74,12 +74,12 @@ body:             |
 
     %50:vgpr_32 = V_AND_B32_e32 3, %2, implicit $exec
     %51:sreg_64 = V_CMP_EQ_U32_e64 %50, 2, implicit $exec
-    %52:sreg_64 = SI_IF killed %51:sreg_64, %bb.2, implicit-def dead $exec, implicit-def dead $scc, implicit $exec
+    %52:sreg_64 = SI_IF killed %51:sreg_64, 0, %bb.2, implicit-def dead $exec, implicit-def dead $scc, implicit $exec
 
   bb.5:
     successors: %bb.1, %bb.6
     %61:sreg_64 = V_CMP_EQ_U32_e64 %50, 1, implicit $exec
-    %62:sreg_64 = SI_IF killed %61:sreg_64, %bb.1, implicit-def dead $exec, implicit-def dead $scc, implicit $exec
+    %62:sreg_64 = SI_IF killed %61:sreg_64, 0, %bb.1, implicit-def dead $exec, implicit-def dead $scc, implicit $exec
 
   bb.6:
     S_ENDPGM 0
diff --git a/llvm/test/Analysis/UniformityAnalysis/AMDGPU/MIR/control-flow-intrinsics.mir b/llvm/test/Analysis/UniformityAnalysis/AMDGPU/MIR/control-flow-intrinsics.mir
index dec55e5662c8c6..58c710389b2b2b 100644
--- a/llvm/test/Analysis/UniformityAnalysis/AMDGPU/MIR/control-flow-intrinsics.mir
+++ b/llvm/test/Analysis/UniformityAnalysis/AMDGPU/MIR/control-flow-intrinsics.mir
@@ -10,7 +10,7 @@ body:             |
     ; CHECK-NOT: DIVERGENT: %1
     %1:sreg_64(s64) = G_IMPLICIT_DEF
     ; CHECK: DIVERGENT: {{.*}} SI_IF
-    %2:sreg_64 = SI_IF %1, %bb.2, implicit-def $exec, implicit-def $scc, implicit $exec
+    %2:sreg_64 = SI_IF %1, 0, %bb.2, implicit-def $exec, implicit-def $scc, implicit $exec
 
   bb.1:
     SI_RETURN
@@ -30,7 +30,7 @@ body:             |
     ; CHECK-NOT: DIVERGENT: %1
     %1:sreg_64(s64) = G_IMPLICIT_DEF
     ; CHECK: DIVERGENT: {{.*}} SI_ELSE
-    %2:sreg_64 = SI_ELSE %1, %bb.2, implicit-def $exec, implicit-def $scc, implicit $exec
+    %2:sreg_64 = SI_ELSE %1, 0, %bb.2, implicit-def $exec, implicit-def $scc, implicit $exec
 
   bb.1:
     SI_RETURN
diff --git a/llvm/test/Analysis/UniformityAnalysis/AMDGPU/MIR/uses-value-from-cycle.mir b/llvm/test/Analysis/UniformityAnalysis/AMDGPU/MIR/uses-value-from-cycle.mir
index b7e0d5449d2e8b..601566033c23cc 100644
--- a/llvm/test/Analysis/UniformityAnalysis/AMDGPU/MIR/uses-value-from-cycle.mir
+++ b/llvm/test/Analysis/UniformityAnalysis/AMDGPU/MIR/uses-value-from-cycle.mir
@@ -35,7 +35,7 @@ body:             |
     %13:_(s32) = G_LOAD %11(p0) :: (load (s32))
     %37:_(s32) = G_CONSTANT i32 0
     %14:sreg_32_xm0_xexec(s1) = G_ICMP intpred(slt), %13(s32), %37
-    %16:sreg_32_xm0_xexec(s32) = SI_IF %14(s1), %bb.5, implicit-def $exec, implicit-def $scc, implicit $exec
+    %16:sreg_32_xm0_xexec(s32) = SI_IF %14(s1), 0, %bb.5, implicit-def $exec, implicit-def $scc, implicit $exec
     G_BR %bb.7
   
   bb.5:
diff --git a/llvm/test/Analysis/UniformityAnalysis/AMDGPU/control-flow-intrinsics.ll b/llvm/test/Analysis/UniformityAnalysis/AMDGPU/control-flow-intrinsics.ll
index b92daa64040e43..9fadd900a7c64f 100644
--- a/llvm/test/Analysis/UniformityAnalysis/AMDGPU/control-flow-intrinsics.ll
+++ b/llvm/test/Analysis/UniformityAnalysis/AMDGPU/control-flow-intrinsics.ll
@@ -16,14 +16,14 @@ entry:
 
 ; CHECK: for function 'test_if':
 ; CHECK: DIVERGENT: %cond = icmp eq i32 %arg0, 0
-; CHECK-NEXT: DIVERGENT: %if = call { i1, i64 } @llvm.amdgcn.if.i64(i1 %cond)
+; CHECK-NEXT: DIVERGENT: %if = call { i1, i64 } @llvm.amdgcn.if.i64(i1 %cond, i1 false)
 ; CHECK-NEXT: DIVERGENT: %if.bool = extractvalue { i1, i64 } %if, 0
 ; CHECK-NOT: DIVERGENT
 ; CHECK: DIVERGENT: %if.bool.ext = zext i1 %if.bool to i32
 define void @test_if(i32 %arg0) {
 entry:
   %cond = icmp eq i32 %arg0, 0
-  %if = call { i1, i64 } @llvm.amdgcn.if.i64(i1 %cond)
+  %if = call { i1, i64 } @llvm.amdgcn.if.i64(i1 %cond, i1 false)
   %if.bool = extractvalue { i1, i64 } %if, 0
   %if.mask = extractvalue { i1, i64 } %if, 1
   %if.bool.ext = zext i1 %if.bool to i32
@@ -35,14 +35,14 @@ entry:
 ; The result should still be treated as divergent, even with a uniform source.
 ; CHECK: for function 'test_if_unifor...
[truncated]


github-actions bot commented Jan 21, 2025

✅ With the latest revision this PR passed the C/C++ code formatter.

@@ -36,6 +37,24 @@ namespace {
using StackEntry = std::pair<BasicBlock *, Value *>;
using StackVector = SmallVector<StackEntry, 16>;

class LikelyVaryingHeuristic {
Contributor

This is all phrased with too much uncertainty. You are detecting cases that you know with certainty will be dynamically divergent. Rename this to reflect that.

Contributor

I disagree. isLikelyVarying accepts arbitrarily complex expressions involving the workitem ID. There's no way we can know if these will be dynamically divergent. What about amdgcn.workitem.id.x & 0 or amdgcn.workitem.id.y >= 1000000 ???
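
(For illustration only, hypothetical IR that is not part of the patch: a condition like this depends on the workitem ID only syntactically and is dynamically uniform, yet the heuristic would still flag it.)

  %tid = call i32 @llvm.amdgcn.workitem.id.x()
  %masked = and i32 %tid, 0          ; always 0 in every lane
  %cond = icmp eq i32 %masked, 0     ; always true, i.e. uniform across the wavefront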

Contributor

Varying is still the wrong terminology; it sounds like a value property.

Member Author

8dd9600 renamed:
LikelyVaryingHeuristic -> DynamicDivergenceHeuristic ("Heuristic" should imply enough uncertainty here)
(is)LikelyVarying -> (is)LikelyDivergent

Contributor

renamed: LikelyVaryingHeuristic -> DynamicDivergenceHeuristic ("Heuristic" should imply enough uncertainty here)
(is)LikelyVarying -> (is)LikelyDivergent

I like that naming.

@@ -385,6 +418,62 @@ bool SIAnnotateControlFlow::run() {
return Changed;
}

bool LikelyVaryingHeuristic::isRelevantSourceOfDivergence(
Contributor

Rephrase; these are all functions of the workitem ID.

Member Author

8dd9600 also renamed:
isRelevantSourceOfDivergence -> isWorkitemID


// ExtractValueInst and IntrinsicInst enable looking through the
// amdgcn_if/else intrinsics inserted by SIAnnotateControlFlow.
// This condition excludes PHINodes, which prevents infinite recursion.
Contributor

The usual way to handle this is to track a visited set while walking up the uses. You should be able to handle phis, including the recursive case.

Member Author

I wouldn't say that this proposal is unable to handle phis; they are excluded here because I think phi values are too complex for this simple heuristic optimization, since they depend on branch conditions.
The more complex the computation between the thread ID and the branch condition, the less clear it is that the condition is dynamically varying.

I think we should only traverse through phis here if including phis whose values are affected by the thread id improves the heuristic.

Member Author

With 5915dd4, the heuristic also traverses phis and selects.
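
(Illustration of a pattern the extended traversal covers; hypothetical IR with made-up names, not taken from the PR: the workitem ID reaches the branch condition only through a select.)

  %tid = call i32 @llvm.amdgcn.workitem.id.x()
  %sel = select i1 %flag, i32 %tid, i32 0
  %cond = icmp ult i32 %sel, 16      ; now also treated as likely varying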

@@ -417,8 +417,8 @@ def IGLP_OPT : SPseudoInstSI<(outs), (ins i32imm:$mask),
let isTerminator = 1, isNotDuplicable = 1 in {

def SI_IF: CFPseudoInstSI <
(outs SReg_1:$dst), (ins SReg_1:$vcc, brtarget:$target),
[(set i1:$dst, (AMDGPUif i1:$vcc, bb:$target))], 1, 1> {
(outs SReg_1:$dst), (ins SReg_1:$vcc, i1imm:$likelyvarying, brtarget:$target),
Contributor

Also rename here

if (*SI == ExeczDest)
MBB.setSuccProbability(SI, BranchProbability::getZero());
}
MBB.normalizeSuccProbs();
Contributor

This code is also duplicated in SILowerControlFlow::emitElse. It would be nicer to have it in its own function, something like:

if (LikelyDivergent) {
    MachineBasicBlock *ExeczDest = MI.getOperand(3).getMBB();
    setSuccessorUnlikely(MBB, ExeczDest);
}

Member Author

Done in 9c90691, thanks!
