[AMDGPU] Eliminate likely-spurious execz checks via intrinsic argument #123749

Open · wants to merge 5 commits into main

Conversation

@ritter-x2a (Member) commented Jan 21, 2025

Currently, we introduce branches to skip conditionally executed instructions if the EXEC mask is zero and only eliminate them if the scheduling model says that executing the skipped instructions is cheaper than taking the branch instruction.

This patch adds a heuristic to SIAnnotateControlFlow to determine if the lanes of a wavefront are likely to have dynamically divergent values for the branch condition.
This information is passed through new arguments/operands of the amdgcn.if/else intrinsics and the SI_IF/ELSE pseudo instructions to SILowerControlFlow, where the execz branch is inserted with corresponding branch probabilities.
This causes SIPreEmitPeephole to eliminate the corresponding execz branch if it is legal to do so.

This is an alternative to PR #117567, using a simpler heuristic and passing the LikelyDivergent information through new arguments for the amdgcn.if/else intrinsics and the SI_IF/ELSE pseudo instructions instead of abusing branch weight metadata.

Most test changes are caused by the new arguments for the amdgcn.if/else intrinsics and the SI_IF/ELSE pseudo instructions; the LikelyDivergent argument is set to false/0 in these existing tests. New tests for the functionality are in conditional-mem-no-cbranch-execz.ll and annotate-likely-divergent-branches.ll.
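As an illustration of the new intrinsic argument (a hypothetical example, not one of the added tests; the kernel and value names are made up), a branch whose condition is computed from the workitem ID would be annotated roughly as follows after SIAnnotateControlFlow:

; Input: the branch condition derives from the workitem ID, so the heuristic
; considers it likely to differ between the lanes of a wavefront.
declare i32 @llvm.amdgcn.workitem.id.x()

define amdgpu_kernel void @example(ptr addrspace(1) %out) {
entry:
  %tid = call i32 @llvm.amdgcn.workitem.id.x()
  %cond = icmp ult i32 %tid, 16
  br i1 %cond, label %then, label %end

then:
  store i32 1, ptr addrspace(1) %out
  br label %end

end:
  ret void
}

; After SIAnnotateControlFlow (wave64), the condition is routed through the
; annotated intrinsic with the new i1 argument set to true:
;   %if = call { i1, i64 } @llvm.amdgcn.if.i64(i1 %cond, i1 true)
; Conditions the heuristic cannot trace back to the workitem ID keep i1 false,
; which is the value used in the updated existing tests.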

For SWDEV-483228.

@llvmbot (Member) commented Jan 21, 2025

@llvm/pr-subscribers-llvm-analysis
@llvm/pr-subscribers-llvm-globalisel
@llvm/pr-subscribers-llvm-ir

@llvm/pr-subscribers-backend-amdgpu

Author: Fabian Ritter (ritter-x2a)

Changes

Currently, we introduce branches to skip conditionally executed instructions if the EXEC mask is zero and only eliminate them if the scheduling model says that executing the skipped instructions is cheaper than taking the branch instruction.

This patch adds a heuristic to SIAnnotateControlFlow to determine if the lanes of a wavefront are likely to have dynamically varying values for the branch condition.
This information is passed through new arguments/operands of the amdgcn.if/else intrinsics and the SI_IF/ELSE pseudo instructions to SILowerControlFlow, where the execz branch is inserted with corresponding branch probabilities.
This causes SIPreEmitPeephole to eliminate the corresponding execz branch if it is legal to do so.

This is an alternative to PR #117567, using a simpler heuristic and passing the LikelyVarying information through new arguments for the amdgcn.if/else intrinsics and the SI_IF/ELSE pseudo instructions instead of abusing branch weight metadata.

Most test changes are caused by the new arguments for the amdgcn.if/else intrinsics and the SI_IF/ELSE pseudo instructions; the LikelyVarying argument is set to false/0 in these existing tests. New tests for the functionality are in conditional-mem-no-cbranch-execz.ll and annotate-likely-varying-branches.ll.

For SWDEV-483228.


Patch is 273.07 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/123749.diff

75 Files Affected:

  • (modified) llvm/include/llvm/IR/IntrinsicsAMDGPU.td (+2-2)
  • (modified) llvm/lib/Target/AMDGPU/AMDGPUInstrInfo.td (+4-4)
  • (modified) llvm/lib/Target/AMDGPU/AMDGPULegalizerInfo.cpp (+3)
  • (modified) llvm/lib/Target/AMDGPU/SIAnnotateControlFlow.cpp (+93-4)
  • (modified) llvm/lib/Target/AMDGPU/SIInstructions.td (+5-5)
  • (modified) llvm/lib/Target/AMDGPU/SILowerControlFlow.cpp (+29-3)
  • (modified) llvm/lib/Target/AMDGPU/SIOptimizeVGPRLiveRange.cpp (+2-2)
  • (modified) llvm/lib/Target/AMDGPU/SIPreEmitPeephole.cpp (+4)
  • (modified) llvm/test/Analysis/DotMachineCFG/AMDGPU/irreducible.mir (+4-4)
  • (modified) llvm/test/Analysis/UniformityAnalysis/AMDGPU/MIR/control-flow-intrinsics.mir (+2-2)
  • (modified) llvm/test/Analysis/UniformityAnalysis/AMDGPU/MIR/uses-value-from-cycle.mir (+1-1)
  • (modified) llvm/test/Analysis/UniformityAnalysis/AMDGPU/control-flow-intrinsics.ll (+10-10)
  • (modified) llvm/test/Analysis/UniformityAnalysis/AMDGPU/deprecated/hidden-diverge.mir (+2-2)
  • (modified) llvm/test/Analysis/UniformityAnalysis/AMDGPU/deprecated/irreducible-1.mir (+2-2)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/divergence-divergent-i1-used-outside-loop.mir (+20-20)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/divergence-structurizer.mir (+24-24)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/divergence-temporal-divergent-i1.mir (+2-2)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/global-atomic-fadd.f32-no-rtn.ll (+8-8)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/global-atomic-fadd.f32-rtn.ll (+6-6)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/legalize-brcond.mir (+29-15)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/regbankselect-mui-regbanklegalize.mir (+6-6)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/regbankselect-mui-regbankselect.mir (+6-6)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/regbankselect-mui.mir (+9-9)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/vni8-across-blocks.ll (+6-12)
  • (added) llvm/test/CodeGen/AMDGPU/annotate-likely-varying-branches.ll (+621)
  • (modified) llvm/test/CodeGen/AMDGPU/atomic_optimizations_pixelshader.ll (+11-22)
  • (modified) llvm/test/CodeGen/AMDGPU/block-should-not-be-in-alive-blocks.mir (+2-2)
  • (modified) llvm/test/CodeGen/AMDGPU/branch-folding-implicit-def-subreg.ll (+1-1)
  • (modified) llvm/test/CodeGen/AMDGPU/cgp-addressing-modes-flat.ll (+12-24)
  • (modified) llvm/test/CodeGen/AMDGPU/collapse-endcf.ll (+4-8)
  • (modified) llvm/test/CodeGen/AMDGPU/collapse-endcf.mir (+31-31)
  • (added) llvm/test/CodeGen/AMDGPU/conditional-mem-no-cbranch-execz.ll (+124)
  • (modified) llvm/test/CodeGen/AMDGPU/constant-fold-imm-immreg.mir (+1-1)
  • (modified) llvm/test/CodeGen/AMDGPU/dagcombine-lshr-and-cmp.ll (+1-1)
  • (modified) llvm/test/CodeGen/AMDGPU/dpp_combine.mir (+1-1)
  • (modified) llvm/test/CodeGen/AMDGPU/dpp_combine_gfx11.mir (+1-1)
  • (modified) llvm/test/CodeGen/AMDGPU/global-atomic-fadd.f32-no-rtn.ll (+6-6)
  • (modified) llvm/test/CodeGen/AMDGPU/global-atomic-fadd.f32-rtn.ll (+6-6)
  • (modified) llvm/test/CodeGen/AMDGPU/i1_copy_phi_with_phi_incoming_value.mir (+4-4)
  • (modified) llvm/test/CodeGen/AMDGPU/insert-delay-alu-bug.ll (+16-17)
  • (modified) llvm/test/CodeGen/AMDGPU/local-atomicrmw-fadd.ll (+14-29)
  • (modified) llvm/test/CodeGen/AMDGPU/lower-control-flow-live-intervals.mir (+6-6)
  • (modified) llvm/test/CodeGen/AMDGPU/lower-control-flow-live-variables-update.mir (+6-6)
  • (modified) llvm/test/CodeGen/AMDGPU/lower-control-flow-live-variables-update.xfail.mir (+2-2)
  • (modified) llvm/test/CodeGen/AMDGPU/lower-control-flow-other-terminators.mir (+5-5)
  • (modified) llvm/test/CodeGen/AMDGPU/lower-i1-copies-clear-kills.mir (+4-4)
  • (modified) llvm/test/CodeGen/AMDGPU/machine-sink-ignorable-exec-use.mir (+14-14)
  • (modified) llvm/test/CodeGen/AMDGPU/machine-sink-lane-mask.mir (+2-2)
  • (modified) llvm/test/CodeGen/AMDGPU/machine-sink-loop-var-out-of-divergent-loop-swdev407790.mir (+8-8)
  • (modified) llvm/test/CodeGen/AMDGPU/memory-legalizer-atomic-insert-end.mir (+2-2)
  • (modified) llvm/test/CodeGen/AMDGPU/memory-legalizer-multiple-mem-operands-nontemporal-1.mir (+2-2)
  • (modified) llvm/test/CodeGen/AMDGPU/mmra.ll (+1-1)
  • (modified) llvm/test/CodeGen/AMDGPU/multi-divergent-exit-region.ll (+17-17)
  • (modified) llvm/test/CodeGen/AMDGPU/multilevel-break.ll (+1-1)
  • (modified) llvm/test/CodeGen/AMDGPU/nested-loop-conditions.ll (+5-5)
  • (modified) llvm/test/CodeGen/AMDGPU/opt-sgpr-to-vgpr-copy.mir (+6-6)
  • (modified) llvm/test/CodeGen/AMDGPU/opt-vgpr-live-range-verifier-error.mir (+3-3)
  • (modified) llvm/test/CodeGen/AMDGPU/phi-elimination-end-cf.mir (+4-4)
  • (modified) llvm/test/CodeGen/AMDGPU/si-annotate-dbg-info.ll (+3-3)
  • (modified) llvm/test/CodeGen/AMDGPU/si-fix-sgpr-copies.mir (+1-1)
  • (modified) llvm/test/CodeGen/AMDGPU/si-lower-control-flow.mir (+7-7)
  • (modified) llvm/test/CodeGen/AMDGPU/si-lower-i1-copies-order-of-phi-incomings.mir (+2-2)
  • (modified) llvm/test/CodeGen/AMDGPU/si-lower-i1-copies.mir (+1-1)
  • (modified) llvm/test/CodeGen/AMDGPU/si-opt-vgpr-liverange-bug-deadlanes.mir (+4-4)
  • (modified) llvm/test/CodeGen/AMDGPU/si-optimize-vgpr-live-range-dbg-instr.mir (+4-4)
  • (modified) llvm/test/CodeGen/AMDGPU/si-unify-exit-multiple-unreachables.ll (+4-4)
  • (modified) llvm/test/CodeGen/AMDGPU/si-unify-exit-return-unreachable.ll (+6-6)
  • (modified) llvm/test/CodeGen/AMDGPU/stale-livevar-in-twoaddr-pass.mir (+1-1)
  • (modified) llvm/test/CodeGen/AMDGPU/stop-tail-duplicate-cfg-intrinsic.mir (+2-2)
  • (modified) llvm/test/CodeGen/AMDGPU/unstructured-cfg-def-use-issue.ll (+5-5)
  • (modified) llvm/test/CodeGen/AMDGPU/vgpr-liverange-ir.ll (+12-12)
  • (modified) llvm/test/CodeGen/AMDGPU/vni8-across-blocks.ll (+7-14)
  • (modified) llvm/test/CodeGen/MIR/AMDGPU/machine-function-info-long-branch-reg-debug.ll (+2-2)
  • (modified) llvm/test/CodeGen/MIR/AMDGPU/machine-metadata.mir (+2-2)
  • (modified) llvm/test/CodeGen/MIR/AMDGPU/syncscopes.mir (+2-2)
diff --git a/llvm/include/llvm/IR/IntrinsicsAMDGPU.td b/llvm/include/llvm/IR/IntrinsicsAMDGPU.td
index b529642a558710..63c3e3c4d0eac8 100644
--- a/llvm/include/llvm/IR/IntrinsicsAMDGPU.td
+++ b/llvm/include/llvm/IR/IntrinsicsAMDGPU.td
@@ -3407,11 +3407,11 @@ def int_amdgcn_ashr_pk_u8_i32 : ClangBuiltin<"__builtin_amdgcn_ashr_pk_u8_i32">,
 // having side effects, which is sufficient to prevent optimizations without
 // having to mark them as convergent.
 def int_amdgcn_if : Intrinsic<[llvm_i1_ty, llvm_anyint_ty],
-  [llvm_i1_ty], [IntrWillReturn, IntrNoCallback, IntrNoFree]
+  [llvm_i1_ty, llvm_i1_ty], [ImmArg<ArgIndex<1>>, IntrWillReturn, IntrNoCallback, IntrNoFree]
 >;
 
 def int_amdgcn_else : Intrinsic<[llvm_i1_ty, llvm_anyint_ty],
-  [llvm_anyint_ty], [IntrWillReturn, IntrNoCallback, IntrNoFree]
+  [llvm_anyint_ty, llvm_i1_ty], [ImmArg<ArgIndex<1>>, IntrWillReturn, IntrNoCallback, IntrNoFree]
 >;
 
 def int_amdgcn_if_break : Intrinsic<[llvm_anyint_ty],
diff --git a/llvm/lib/Target/AMDGPU/AMDGPUInstrInfo.td b/llvm/lib/Target/AMDGPU/AMDGPUInstrInfo.td
index bec294a945d2fe..89304bfe51efe6 100644
--- a/llvm/lib/Target/AMDGPU/AMDGPUInstrInfo.td
+++ b/llvm/lib/Target/AMDGPU/AMDGPUInstrInfo.td
@@ -42,12 +42,12 @@ def AMDGPUFmasOp : SDTypeProfile<1, 4,
 def ImmOp : SDTypeProfile<0, 1, [SDTCisInt<0>]>;
 def AMDGPUKillSDT : SDTypeProfile<0, 1, [SDTCisInt<0>]>;
 
-def AMDGPUIfOp : SDTypeProfile<1, 2,
-  [SDTCisVT<0, i1>, SDTCisVT<1, i1>, SDTCisVT<2, OtherVT>]
+def AMDGPUIfOp : SDTypeProfile<1, 3,
+  [SDTCisVT<0, i1>, SDTCisVT<1, i1>, SDTCisVT<2, i1>, SDTCisVT<3, OtherVT>]
 >;
 
-def AMDGPUElseOp : SDTypeProfile<1, 2,
-  [SDTCisVT<0, i1>, SDTCisVT<1, i1>, SDTCisVT<2, OtherVT>]
+def AMDGPUElseOp : SDTypeProfile<1, 3,
+  [SDTCisVT<0, i1>, SDTCisVT<1, i1>, SDTCisVT<2, i1>, SDTCisVT<3, OtherVT>]
 >;
 
 def AMDGPULoopOp : SDTypeProfile<0, 2,
diff --git a/llvm/lib/Target/AMDGPU/AMDGPULegalizerInfo.cpp b/llvm/lib/Target/AMDGPU/AMDGPULegalizerInfo.cpp
index e9e47eaadd557f..9fb756232e83bc 100644
--- a/llvm/lib/Target/AMDGPU/AMDGPULegalizerInfo.cpp
+++ b/llvm/lib/Target/AMDGPU/AMDGPULegalizerInfo.cpp
@@ -7264,6 +7264,7 @@ bool AMDGPULegalizerInfo::legalizeIntrinsic(LegalizerHelper &Helper,
 
       Register Def = MI.getOperand(1).getReg();
       Register Use = MI.getOperand(3).getReg();
+      auto LikelyVarying = MI.getOperand(4).getImm();
 
       MachineBasicBlock *CondBrTarget = BrCond->getOperand(1).getMBB();
 
@@ -7275,11 +7276,13 @@ bool AMDGPULegalizerInfo::legalizeIntrinsic(LegalizerHelper &Helper,
         B.buildInstr(AMDGPU::SI_IF)
           .addDef(Def)
           .addUse(Use)
+          .addImm(LikelyVarying)
           .addMBB(UncondBrTarget);
       } else {
         B.buildInstr(AMDGPU::SI_ELSE)
             .addDef(Def)
             .addUse(Use)
+            .addImm(LikelyVarying)
             .addMBB(UncondBrTarget);
       }
 
diff --git a/llvm/lib/Target/AMDGPU/SIAnnotateControlFlow.cpp b/llvm/lib/Target/AMDGPU/SIAnnotateControlFlow.cpp
index 4ff6fc32b642dd..854eb7706416ae 100644
--- a/llvm/lib/Target/AMDGPU/SIAnnotateControlFlow.cpp
+++ b/llvm/lib/Target/AMDGPU/SIAnnotateControlFlow.cpp
@@ -22,6 +22,7 @@
 #include "llvm/IR/Dominators.h"
 #include "llvm/IR/IRBuilder.h"
 #include "llvm/IR/IntrinsicsAMDGPU.h"
+#include "llvm/IR/IntrinsicsR600.h"
 #include "llvm/Target/TargetMachine.h"
 #include "llvm/Transforms/Utils/BasicBlockUtils.h"
 #include "llvm/Transforms/Utils/Local.h"
@@ -36,6 +37,24 @@ namespace {
 using StackEntry = std::pair<BasicBlock *, Value *>;
 using StackVector = SmallVector<StackEntry, 16>;
 
+class LikelyVaryingHeuristic {
+public:
+  LikelyVaryingHeuristic(const Function &F, const GCNSubtarget &ST) {
+    IsSingleLaneExecution = ST.isSingleLaneExecution(F);
+  }
+
+  /// Check if \p V is likely to have dynamically varying values among the
+  /// workitems in each wavefront.
+  bool isLikelyVarying(const Value *V);
+
+private:
+  bool IsSingleLaneExecution = false;
+
+  bool isRelevantSourceOfDivergence(const Value *V) const;
+
+  ValueMap<const Value *, bool> LikelyVaryingCache;
+};
+
 class SIAnnotateControlFlow {
 private:
   Function *F;
@@ -62,6 +81,8 @@ class SIAnnotateControlFlow {
 
   LoopInfo *LI;
 
+  LikelyVaryingHeuristic LVHeuristic;
+
   void initialize(const GCNSubtarget &ST);
 
   bool isUniform(BranchInst *T);
@@ -99,7 +120,7 @@ class SIAnnotateControlFlow {
 public:
   SIAnnotateControlFlow(Function &F, const GCNSubtarget &ST, DominatorTree &DT,
                         LoopInfo &LI, UniformityInfo &UA)
-      : F(&F), UA(&UA), DT(&DT), LI(&LI) {
+      : F(&F), UA(&UA), DT(&DT), LI(&LI), LVHeuristic(F, ST) {
     initialize(ST);
   }
 
@@ -186,9 +207,14 @@ bool SIAnnotateControlFlow::openIf(BranchInst *Term) {
   if (isUniform(Term))
     return false;
 
+  // Check if it's likely that at least one lane will always follow the
+  // then-branch, i.e., the then-branch is never skipped completely.
+  Value *IsLikelyVarying =
+      LVHeuristic.isLikelyVarying(Term->getCondition()) ? BoolTrue : BoolFalse;
+
   IRBuilder<> IRB(Term);
   Value *IfCall = IRB.CreateCall(getDecl(If, Intrinsic::amdgcn_if, IntMask),
-                                 {Term->getCondition()});
+                                 {Term->getCondition(), IsLikelyVarying});
   Value *Cond = IRB.CreateExtractValue(IfCall, {0});
   Value *Mask = IRB.CreateExtractValue(IfCall, {1});
   Term->setCondition(Cond);
@@ -202,9 +228,16 @@ bool SIAnnotateControlFlow::insertElse(BranchInst *Term) {
     return false;
   }
 
+  Value *IncomingMask = popSaved();
+  // Check if it's likely that at least one lane will always follow the
+  // else-branch, i.e., the else-branch is never skipped completely.
+  Value *IsLikelyVarying =
+      LVHeuristic.isLikelyVarying(IncomingMask) ? BoolTrue : BoolFalse;
+
   IRBuilder<> IRB(Term);
-  Value *ElseCall = IRB.CreateCall(
-      getDecl(Else, Intrinsic::amdgcn_else, {IntMask, IntMask}), {popSaved()});
+  Value *ElseCall =
+      IRB.CreateCall(getDecl(Else, Intrinsic::amdgcn_else, {IntMask, IntMask}),
+                     {IncomingMask, IsLikelyVarying});
   Value *Cond = IRB.CreateExtractValue(ElseCall, {0});
   Value *Mask = IRB.CreateExtractValue(ElseCall, {1});
   Term->setCondition(Cond);
@@ -385,6 +418,62 @@ bool SIAnnotateControlFlow::run() {
   return Changed;
 }
 
+bool LikelyVaryingHeuristic::isRelevantSourceOfDivergence(
+    const Value *V) const {
+  auto *II = dyn_cast<IntrinsicInst>(V);
+  if (!II)
+    return false;
+
+  switch (II->getIntrinsicID()) {
+  case Intrinsic::amdgcn_workitem_id_z:
+  case Intrinsic::r600_read_tidig_z:
+  case Intrinsic::amdgcn_workitem_id_y:
+  case Intrinsic::r600_read_tidig_y:
+  case Intrinsic::amdgcn_workitem_id_x:
+  case Intrinsic::r600_read_tidig_x:
+  case Intrinsic::amdgcn_mbcnt_hi:
+  case Intrinsic::amdgcn_mbcnt_lo:
+    return true;
+  default:
+    return false;
+  }
+}
+
+bool LikelyVaryingHeuristic::isLikelyVarying(const Value *V) {
+  if (IsSingleLaneExecution)
+    return false;
+
+  if (isRelevantSourceOfDivergence(V))
+    return true;
+
+  auto *I = dyn_cast<Instruction>(V);
+  if (!I)
+    return false;
+
+  // ExtractValueInst and IntrinsicInst enable looking through the
+  // amdgcn_if/else intrinsics inserted by SIAnnotateControlFlow.
+  // This condition excludes PHINodes, which prevents infinite recursion.
+  if (!isa<BinaryOperator>(I) && !isa<UnaryOperator>(I) && !isa<CastInst>(I) &&
+      !isa<CmpInst>(I) && !isa<ExtractValueInst>(I) && !isa<IntrinsicInst>(I))
+    return false;
+
+  // Have we already checked V?
+  auto CacheEntry = LikelyVaryingCache.find(V);
+  if (CacheEntry != LikelyVaryingCache.end())
+    return CacheEntry->second;
+
+  // Does it use a likely varying Value?
+  bool Result = false;
+  for (const auto &Use : I->operands()) {
+    Result |= isLikelyVarying(Use);
+    if (Result)
+      break;
+  }
+
+  LikelyVaryingCache.insert({V, Result});
+  return Result;
+}
+
 PreservedAnalyses SIAnnotateControlFlowPass::run(Function &F,
                                                  FunctionAnalysisManager &FAM) {
   const GCNSubtarget &ST = TM.getSubtarget<GCNSubtarget>(F);
diff --git a/llvm/lib/Target/AMDGPU/SIInstructions.td b/llvm/lib/Target/AMDGPU/SIInstructions.td
index 40a20fa9cb15ea..4aafb7d502570d 100644
--- a/llvm/lib/Target/AMDGPU/SIInstructions.td
+++ b/llvm/lib/Target/AMDGPU/SIInstructions.td
@@ -417,8 +417,8 @@ def IGLP_OPT : SPseudoInstSI<(outs), (ins i32imm:$mask),
 let isTerminator = 1, isNotDuplicable = 1 in {
 
 def SI_IF: CFPseudoInstSI <
-  (outs SReg_1:$dst), (ins SReg_1:$vcc, brtarget:$target),
-  [(set i1:$dst, (AMDGPUif i1:$vcc, bb:$target))], 1, 1> {
+  (outs SReg_1:$dst), (ins SReg_1:$vcc, i1imm:$likelyvarying, brtarget:$target),
+  [(set i1:$dst, (AMDGPUif i1:$vcc, (i1 timm:$likelyvarying), bb:$target))], 1, 1> {
   let Constraints = "";
   let Size = 12;
   let hasSideEffects = 1;
@@ -427,7 +427,7 @@ def SI_IF: CFPseudoInstSI <
 
 def SI_ELSE : CFPseudoInstSI <
   (outs SReg_1:$dst),
-  (ins SReg_1:$src, brtarget:$target), [], 1, 1> {
+  (ins SReg_1:$src, i1imm:$likelyvarying, brtarget:$target), [], 1, 1> {
   let Size = 12;
   let hasSideEffects = 1;
   let IsNeverUniform = 1;
@@ -1049,8 +1049,8 @@ def : GCNPat<
 >;
 
 def : GCNPat<
-  (AMDGPUelse i1:$src, bb:$target),
-  (SI_ELSE $src, $target)
+  (AMDGPUelse i1:$src, i1:$likelyvarying, bb:$target),
+  (SI_ELSE $src, $likelyvarying, $target)
 >;
 
 def : Pat <
diff --git a/llvm/lib/Target/AMDGPU/SILowerControlFlow.cpp b/llvm/lib/Target/AMDGPU/SILowerControlFlow.cpp
index f8878f32f829d1..d59f3504a2e342 100644
--- a/llvm/lib/Target/AMDGPU/SILowerControlFlow.cpp
+++ b/llvm/lib/Target/AMDGPU/SILowerControlFlow.cpp
@@ -55,8 +55,11 @@
 #include "llvm/ADT/SmallSet.h"
 #include "llvm/CodeGen/LiveIntervals.h"
 #include "llvm/CodeGen/LiveVariables.h"
+#include "llvm/CodeGen/MachineBasicBlock.h"
 #include "llvm/CodeGen/MachineDominators.h"
 #include "llvm/CodeGen/MachineFunctionPass.h"
+#include "llvm/IR/LLVMContext.h"
+#include "llvm/Support/BranchProbability.h"
 #include "llvm/Target/TargetMachine.h"
 
 using namespace llvm;
@@ -221,9 +224,11 @@ void SILowerControlFlow::emitIf(MachineInstr &MI) {
   MachineOperand& Cond = MI.getOperand(1);
   assert(Cond.getSubReg() == AMDGPU::NoSubRegister);
 
-  MachineOperand &ImpDefSCC = MI.getOperand(4);
+  MachineOperand &ImpDefSCC = MI.getOperand(5);
   assert(ImpDefSCC.getReg() == AMDGPU::SCC && ImpDefSCC.isDef());
 
+  bool LikelyVarying = MI.getOperand(2).getImm();
+
   // If there is only one use of save exec register and that use is SI_END_CF,
   // we can optimize SI_IF by returning the full saved exec mask instead of
   // just cleared bits.
@@ -281,7 +286,17 @@ void SILowerControlFlow::emitIf(MachineInstr &MI) {
   // Insert the S_CBRANCH_EXECZ instruction which will be optimized later
   // during SIPreEmitPeephole.
   MachineInstr *NewBr = BuildMI(MBB, I, DL, TII->get(AMDGPU::S_CBRANCH_EXECZ))
-                            .add(MI.getOperand(2));
+                            .add(MI.getOperand(3));
+
+  if (LikelyVarying) {
+    MachineBasicBlock *ExeczDest = MI.getOperand(3).getMBB();
+    auto **E = MBB.succ_end();
+    for (auto **SI = MBB.succ_begin(); SI != E; ++SI) {
+      if (*SI == ExeczDest)
+        MBB.setSuccProbability(SI, BranchProbability::getZero());
+    }
+    MBB.normalizeSuccProbs();
+  }
 
   if (!LIS) {
     MI.eraseFromParent();
@@ -329,7 +344,9 @@ void SILowerControlFlow::emitElse(MachineInstr &MI) {
   if (LV)
     LV->replaceKillInstruction(SrcReg, MI, *OrSaveExec);
 
-  MachineBasicBlock *DestBB = MI.getOperand(2).getMBB();
+  bool LikelyVarying = MI.getOperand(2).getImm();
+
+  MachineBasicBlock *DestBB = MI.getOperand(3).getMBB();
 
   MachineBasicBlock::iterator ElsePt(MI);
 
@@ -352,6 +369,15 @@ void SILowerControlFlow::emitElse(MachineInstr &MI) {
       BuildMI(MBB, ElsePt, DL, TII->get(AMDGPU::S_CBRANCH_EXECZ))
           .addMBB(DestBB);
 
+  if (LikelyVarying) {
+    auto **E = MBB.succ_end();
+    for (auto **SI = MBB.succ_begin(); SI != E; ++SI) {
+      if (*SI == DestBB)
+        MBB.setSuccProbability(SI, BranchProbability::getZero());
+    }
+    MBB.normalizeSuccProbs();
+  }
+
   if (!LIS) {
     MI.eraseFromParent();
     return;
diff --git a/llvm/lib/Target/AMDGPU/SIOptimizeVGPRLiveRange.cpp b/llvm/lib/Target/AMDGPU/SIOptimizeVGPRLiveRange.cpp
index ff0b9b4a7574bf..f0791efe7d191c 100644
--- a/llvm/lib/Target/AMDGPU/SIOptimizeVGPRLiveRange.cpp
+++ b/llvm/lib/Target/AMDGPU/SIOptimizeVGPRLiveRange.cpp
@@ -184,7 +184,7 @@ MachineBasicBlock *
 SIOptimizeVGPRLiveRange::getElseTarget(MachineBasicBlock *MBB) const {
   for (auto &BR : MBB->terminators()) {
     if (BR.getOpcode() == AMDGPU::SI_ELSE)
-      return BR.getOperand(2).getMBB();
+      return BR.getOperand(3).getMBB();
   }
   return nullptr;
 }
@@ -682,7 +682,7 @@ bool SIOptimizeVGPRLiveRange::run(MachineFunction &MF) {
     for (auto &MI : MBB.terminators()) {
       // Detect the if-else blocks
       if (MI.getOpcode() == AMDGPU::SI_IF) {
-        MachineBasicBlock *IfTarget = MI.getOperand(2).getMBB();
+        MachineBasicBlock *IfTarget = MI.getOperand(3).getMBB();
         auto *Endif = getElseTarget(IfTarget);
         if (!Endif)
           continue;
diff --git a/llvm/lib/Target/AMDGPU/SIPreEmitPeephole.cpp b/llvm/lib/Target/AMDGPU/SIPreEmitPeephole.cpp
index 2bb70c138a50c4..55777d7da366b0 100644
--- a/llvm/lib/Target/AMDGPU/SIPreEmitPeephole.cpp
+++ b/llvm/lib/Target/AMDGPU/SIPreEmitPeephole.cpp
@@ -14,6 +14,7 @@
 #include "AMDGPU.h"
 #include "GCNSubtarget.h"
 #include "MCTargetDesc/AMDGPUMCTargetDesc.h"
+#include "llvm/ADT/Statistic.h"
 #include "llvm/CodeGen/MachineFunctionPass.h"
 #include "llvm/CodeGen/TargetSchedule.h"
 #include "llvm/Support/BranchProbability.h"
@@ -22,6 +23,8 @@ using namespace llvm;
 
 #define DEBUG_TYPE "si-pre-emit-peephole"
 
+STATISTIC(NumCBranchExeczElim, "Number of s_cbranch_execz eliminated.");
+
 namespace {
 
 class SIPreEmitPeephole : public MachineFunctionPass {
@@ -404,6 +407,7 @@ bool SIPreEmitPeephole::removeExeczBranch(MachineInstr &MI,
     return false;
 
   LLVM_DEBUG(dbgs() << "Removing the execz branch: " << MI);
+  ++NumCBranchExeczElim;
   MI.eraseFromParent();
   SrcMBB.removeSuccessor(TrueMBB);
 
diff --git a/llvm/test/Analysis/DotMachineCFG/AMDGPU/irreducible.mir b/llvm/test/Analysis/DotMachineCFG/AMDGPU/irreducible.mir
index 56ea4b528ba8f1..d618f21fff694b 100644
--- a/llvm/test/Analysis/DotMachineCFG/AMDGPU/irreducible.mir
+++ b/llvm/test/Analysis/DotMachineCFG/AMDGPU/irreducible.mir
@@ -14,10 +14,10 @@
 # MCFG-NEXT: Node{{[0-9A-Za-z]*}} -> Node{{[0-9A-Za-z]*}};
 # MCFG-NEXT: Node{{[0-9A-Za-z]*}} [shape=record,label="{bb.3:\l|\l successors: %bb.4(0x80000000)\l\l  %4:vgpr_32 = PHI %5:vgpr_32, %bb.1, %7:vgpr_32, %bb.2\l}"];
 # MCFG-NEXT: Node{{[0-9A-Za-z]*}} -> Node{{[0-9A-Za-z]*}};
-# MCFG-NEXT: Node{{[0-9A-Za-z]*}} [shape=record,label="{bb.4:\l|\l successors: %bb.2(0x40000000), %bb.5(0x40000000)\l\l  %8:vgpr_32 = V_AND_B32_e32 3, %1:vgpr_32, implicit $exec\l  %9:sreg_64 = V_CMP_EQ_U32_e64 %8:vgpr_32, 2, implicit $exec\l  %10:sreg_64 = SI_IF killed %9:sreg_64, %bb.2, implicit-def dead $exec,\l... implicit-def dead $scc, implicit $exec\l}"];
+# MCFG-NEXT: Node{{[0-9A-Za-z]*}} [shape=record,label="{bb.4:\l|\l successors: %bb.2(0x40000000), %bb.5(0x40000000)\l\l  %8:vgpr_32 = V_AND_B32_e32 3, %1:vgpr_32, implicit $exec\l  %9:sreg_64 = V_CMP_EQ_U32_e64 %8:vgpr_32, 2, implicit $exec\l  %10:sreg_64 = SI_IF killed %9:sreg_64, 0, %bb.2, implicit-def dead $exec,\l... implicit-def dead $scc, implicit $exec\l}"];
 # MCFG-NEXT: Node{{[0-9A-Za-z]*}} -> Node{{[0-9A-Za-z]*}};
 # MCFG-NEXT: Node{{[0-9A-Za-z]*}} -> Node{{[0-9A-Za-z]*}};
-# MCFG-NEXT: Node{{[0-9A-Za-z]*}} [shape=record,label="{bb.5:\l|\l successors: %bb.1(0x40000000), %bb.6(0x40000000)\l\l  %11:sreg_64 = V_CMP_EQ_U32_e64 %8:vgpr_32, 1, implicit $exec\l  %12:sreg_64 = SI_IF killed %11:sreg_64, %bb.1, implicit-def dead $exec,\l... implicit-def dead $scc, implicit $exec\l}"];
+# MCFG-NEXT: Node{{[0-9A-Za-z]*}} [shape=record,label="{bb.5:\l|\l successors: %bb.1(0x40000000), %bb.6(0x40000000)\l\l  %11:sreg_64 = V_CMP_EQ_U32_e64 %8:vgpr_32, 1, implicit $exec\l  %12:sreg_64 = SI_IF killed %11:sreg_64, 0, %bb.1, implicit-def dead $exec,\l... implicit-def dead $scc, implicit $exec\l}"];
 # MCFG-NEXT: Node{{[0-9A-Za-z]*}} -> Node{{[0-9A-Za-z]*}};
 # MCFG-NEXT: Node{{[0-9A-Za-z]*}} -> Node{{[0-9A-Za-z]*}};
 # MCFG-NEXT: Node{{[0-9A-Za-z]*}} [shape=record,label="{bb.6:\l|\l\l S_ENDPGM 0\l}"];
@@ -74,12 +74,12 @@ body:             |
 
     %50:vgpr_32 = V_AND_B32_e32 3, %2, implicit $exec
     %51:sreg_64 = V_CMP_EQ_U32_e64 %50, 2, implicit $exec
-    %52:sreg_64 = SI_IF killed %51:sreg_64, %bb.2, implicit-def dead $exec, implicit-def dead $scc, implicit $exec
+    %52:sreg_64 = SI_IF killed %51:sreg_64, 0, %bb.2, implicit-def dead $exec, implicit-def dead $scc, implicit $exec
 
   bb.5:
     successors: %bb.1, %bb.6
     %61:sreg_64 = V_CMP_EQ_U32_e64 %50, 1, implicit $exec
-    %62:sreg_64 = SI_IF killed %61:sreg_64, %bb.1, implicit-def dead $exec, implicit-def dead $scc, implicit $exec
+    %62:sreg_64 = SI_IF killed %61:sreg_64, 0, %bb.1, implicit-def dead $exec, implicit-def dead $scc, implicit $exec
 
   bb.6:
     S_ENDPGM 0
diff --git a/llvm/test/Analysis/UniformityAnalysis/AMDGPU/MIR/control-flow-intrinsics.mir b/llvm/test/Analysis/UniformityAnalysis/AMDGPU/MIR/control-flow-intrinsics.mir
index dec55e5662c8c6..58c710389b2b2b 100644
--- a/llvm/test/Analysis/UniformityAnalysis/AMDGPU/MIR/control-flow-intrinsics.mir
+++ b/llvm/test/Analysis/UniformityAnalysis/AMDGPU/MIR/control-flow-intrinsics.mir
@@ -10,7 +10,7 @@ body:             |
     ; CHECK-NOT: DIVERGENT: %1
     %1:sreg_64(s64) = G_IMPLICIT_DEF
     ; CHECK: DIVERGENT: {{.*}} SI_IF
-    %2:sreg_64 = SI_IF %1, %bb.2, implicit-def $exec, implicit-def $scc, implicit $exec
+    %2:sreg_64 = SI_IF %1, 0, %bb.2, implicit-def $exec, implicit-def $scc, implicit $exec
 
   bb.1:
     SI_RETURN
@@ -30,7 +30,7 @@ body:             |
     ; CHECK-NOT: DIVERGENT: %1
     %1:sreg_64(s64) = G_IMPLICIT_DEF
     ; CHECK: DIVERGENT: {{.*}} SI_ELSE
-    %2:sreg_64 = SI_ELSE %1, %bb.2, implicit-def $exec, implicit-def $scc, implicit $exec
+    %2:sreg_64 = SI_ELSE %1, 0, %bb.2, implicit-def $exec, implicit-def $scc, implicit $exec
 
   bb.1:
     SI_RETURN
diff --git a/llvm/test/Analysis/UniformityAnalysis/AMDGPU/MIR/uses-value-from-cycle.mir b/llvm/test/Analysis/UniformityAnalysis/AMDGPU/MIR/uses-value-from-cycle.mir
index b7e0d5449d2e8b..601566033c23cc 100644
--- a/llvm/test/Analysis/UniformityAnalysis/AMDGPU/MIR/uses-value-from-cycle.mir
+++ b/llvm/test/Analysis/UniformityAnalysis/AMDGPU/MIR/uses-value-from-cycle.mir
@@ -35,7 +35,7 @@ body:             |
     %13:_(s32) = G_LOAD %11(p0) :: (load (s32))
     %37:_(s32) = G_CONSTANT i32 0
     %14:sreg_32_xm0_xexec(s1) = G_ICMP intpred(slt), %13(s32), %37
-    %16:sreg_32_xm0_xexec(s32) = SI_IF %14(s1), %bb.5, implicit-def $exec, implicit-def $scc, implicit $exec
+    %16:sreg_32_xm0_xexec(s32) = SI_IF %14(s1), 0, %bb.5, implicit-def $exec, implicit-def $scc, implicit $exec
     G_BR %bb.7
   
   bb.5:
diff --git a/llvm/test/Analysis/UniformityAnalysis/AMDGPU/control-flow-intrinsics.ll b/llvm/test/Analysis/UniformityAnalysis/AMDGPU/control-flow-intrinsics.ll
index b92daa64040e43..9fadd900a7c64f 100644
--- a/llvm/test/Analysis/UniformityAnalysis/AMDGPU/control-flow-intrinsics.ll
+++ b/llvm/test/Analysis/UniformityAnalysis/AMDGPU/control-flow-intrinsics.ll
@@ -16,14 +16,14 @@ entry:
 
 ; CHECK: for function 'test_if':
 ; CHECK: DIVERGENT: %cond = icmp eq i32 %arg0, 0
-; CHECK-NEXT: DIVERGENT: %if = call { i1, i64 } @llvm.amdgcn.if.i64(i1 %cond)
+; CHECK-NEXT: DIVERGENT: %if = call { i1, i64 } @llvm.amdgcn.if.i64(i1 %cond, i1 false)
 ; CHECK-NEXT: DIVERGENT: %if.bool = extractvalue { i1, i64 } %if, 0
 ; CHECK-NOT: DIVERGENT
 ; CHECK: DIVERGENT: %if.bool.ext = zext i1 %if.bool to i32
 define void @test_if(i32 %arg0) {
 entry:
   %cond = icmp eq i32 %arg0, 0
-  %if = call { i1, i64 } @llvm.amdgcn.if.i64(i1 %cond)
+  %if = call { i1, i64 } @llvm.amdgcn.if.i64(i1 %cond, i1 false)
   %if.bool = extractvalue { i1, i64 } %if, 0
   %if.mask = extractvalue { i1, i64 } %if, 1
   %if.bool.ext = zext i1 %if.bool to i32
@@ -35,14 +35,14 @@ entry:
 ; The result should still be treated as divergent, even with a uniform source.
 ; CHECK: for function 'test_if_unifor...
[truncated]


github-actions bot commented Jan 21, 2025

✅ With the latest revision this PR passed the C/C++ code formatter.

@@ -36,6 +37,24 @@ namespace {
using StackEntry = std::pair<BasicBlock *, Value *>;
using StackVector = SmallVector<StackEntry, 16>;

class LikelyVaryingHeuristic {
Contributor

This is all phrased with too much uncertainty. You are detecting cases that you know with certainty will be dynamically divergent. Rename this to reflect that.

Contributor

I disagree. isLikelyVarying accepts arbitrarily complex expressions involving the workitem ID. There's no way we can know if these will be dynamically divergent. What about amdgcn.workitem.id.x & 0 or amdgcn.workitem.id.y >= 1000000 ???
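
(For illustration only, hypothetical IR that is not part of the patch: a condition like this depends on the workitem ID only syntactically and is dynamically uniform, yet the heuristic would still flag it.)

  %tid = call i32 @llvm.amdgcn.workitem.id.x()
  %masked = and i32 %tid, 0          ; always 0 in every lane
  %cond = icmp eq i32 %masked, 0     ; always true, i.e. uniform across the wavefront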

Contributor

Varying is still the wrong terminology; it sounds like a value property.

Member Author

8dd9600 renamed:
LikelyVaryingHeuristic -> DynamicDivergenceHeuristic ("Heuristic" should imply enough uncertainty here)
(is)LikelyVarying -> (is)LikelyDivergent

Contributor

renamed: LikelyVaryingHeuristic -> DynamicDivergenceHeuristic ("Heuristic" should imply enough uncertainty here)
(is)LikelyVarying -> (is)LikelyDivergent

I like that naming.

@@ -385,6 +418,62 @@ bool SIAnnotateControlFlow::run() {
return Changed;
}

bool LikelyVaryingHeuristic::isRelevantSourceOfDivergence(
Contributor

Rephrase; these are all functions of the workitem ID.

Member Author

8dd9600 also renamed:
isRelevantSourceOfDivergence -> isWorkitemID


// ExtractValueInst and IntrinsicInst enable looking through the
// amdgcn_if/else intrinsics inserted by SIAnnotateControlFlow.
// This condition excludes PHINodes, which prevents infinite recursion.
Contributor

The usual way to handle this is to track a visited set while walking up the uses. You should be able to handle phis, including the recursive case.

Member Author

I wouldn't say that this proposal is unable to handle phis; they are excluded here because I think phi values are too complex for this simple heuristic optimization, since they depend on branch conditions.
The more complex the computation between the thread ID and the branch condition, the less clear it is that the condition is dynamically varying.

I think we should only traverse through phis here if including phis whose values are affected by the thread id improves the heuristic.

Member Author

With 5915dd4, the heuristic also traverses phis and selects.
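
(Illustration of a pattern the extended traversal covers; hypothetical IR with made-up names, not taken from the PR: the workitem ID reaches the branch condition only through a select.)

  %tid = call i32 @llvm.amdgcn.workitem.id.x()
  %sel = select i1 %flag, i32 %tid, i32 0
  %cond = icmp ult i32 %sel, 16      ; now also treated as likely varying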

@@ -417,8 +417,8 @@ def IGLP_OPT : SPseudoInstSI<(outs), (ins i32imm:$mask),
let isTerminator = 1, isNotDuplicable = 1 in {

def SI_IF: CFPseudoInstSI <
(outs SReg_1:$dst), (ins SReg_1:$vcc, brtarget:$target),
[(set i1:$dst, (AMDGPUif i1:$vcc, bb:$target))], 1, 1> {
(outs SReg_1:$dst), (ins SReg_1:$vcc, i1imm:$likelyvarying, brtarget:$target),
Contributor

Also rename here

if (*SI == ExeczDest)
MBB.setSuccProbability(SI, BranchProbability::getZero());
}
MBB.normalizeSuccProbs();
Contributor

This code is also duplicated in SILowerControlFlow::emitElse. It would be nicer to have it in its own function, something like:

if (LikelyDivergent) {
    MachineBasicBlock *ExeczDest = MI.getOperand(3).getMBB();
    setSuccessorUnlikely(MBB, ExeczDest);
}

Member Author

Done in 9c90691, thanks!
