[AMDGPU] Eliminate likely-spurious execz checks #117567


Closed
wants to merge 4 commits

Conversation

ritter-x2a (Member)

Currently, we introduce branches to skip conditionally executed instructions if the EXEC mask is zero and only eliminate them if the scheduling model says that executing the skipped instructions is cheaper than taking the branch instruction.

This patch adds branch weights to branches where the threads of a wavefront are likely (according to a heuristic) to have dynamically varying values for the branch condition. This causes SIPreEmitPeephole to eliminate the corresponding execz branch.

Currently, this is implemented as a new middle end pass with a rather optimistic heuristic, to gather initial feedback.
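For illustration, here is a rough IR sketch (not taken from the patch; the function name and the weight values are made up) of the shape this pass produces: SIAnnotateControlFlow has rewritten the branch to use llvm.amdgcn.if, and the new pass attaches !prof branch weights marking the guarded block as likely, which later lets SIPreEmitPeephole drop the s_cbranch_execz that would skip it.

; Sketch only: a kernel where the condition depends on the low bit of the
; thread id, so it varies within every wavefront.
define amdgpu_kernel void @cond_store(ptr addrspace(1) %out) {
entry:
  %tid = call i32 @llvm.amdgcn.workitem.id.x()
  %bit = and i32 %tid, 1
  %cond = icmp eq i32 %bit, 0
  ; Inserted by SIAnnotateControlFlow.
  %if = call { i1, i64 } @llvm.amdgcn.if.i64(i1 %cond)
  %guard = extractvalue { i1, i64 } %if, 0
  %mask = extractvalue { i1, i64 } %if, 1
  ; The new pass adds !prof, declaring the "true" successor likely.
  br i1 %guard, label %then, label %flow, !prof !0

then:
  %gep = getelementptr i32, ptr addrspace(1) %out, i32 %tid
  store i32 0, ptr addrspace(1) %gep
  br label %flow

flow:
  call void @llvm.amdgcn.end.cf.i64(i64 %mask)
  ret void
}

declare i32 @llvm.amdgcn.workitem.id.x()
declare { i1, i64 } @llvm.amdgcn.if.i64(i1)
declare void @llvm.amdgcn.end.cf.i64(i64)

!0 = !{!"branch_weights", i32 1000, i32 1} ; illustrative values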

llvmbot (Member) commented Nov 25, 2024

@llvm/pr-subscribers-backend-amdgpu

Author: Fabian Ritter (ritter-x2a)

Changes

Currently, we introduce branches to skip conditionally executed instructions if the EXEC mask is zero and only eliminate them if the scheduling model says that executing the skipped instructions is cheaper than taking the branch instruction.

This patch adds branch weights to branches where the threads of a wavefront are likely (according to a heuristic) to have dynamically varying values for the branch condition. This causes SIPreEmitPeephole to eliminate the corresponding execz branch.

Currently, this is implemented as a new middle end pass with a rather optimistic heuristic, to gather initial feedback.


Patch is 33.21 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/117567.diff

9 Files Affected:

  • (modified) llvm/lib/Target/AMDGPU/AMDGPU.h (+15)
  • (added) llvm/lib/Target/AMDGPU/AMDGPUAnnotateVaryingBranchWeights.cpp (+201)
  • (modified) llvm/lib/Target/AMDGPU/AMDGPUPassRegistry.def (+3)
  • (modified) llvm/lib/Target/AMDGPU/AMDGPUTargetMachine.cpp (+4)
  • (modified) llvm/lib/Target/AMDGPU/CMakeLists.txt (+1)
  • (added) llvm/test/CodeGen/AMDGPU/conditional-mem-no-cbranch-execz.ll (+66)
  • (modified) llvm/test/CodeGen/AMDGPU/llc-pipeline.ll (+5)
  • (modified) llvm/test/CodeGen/AMDGPU/machine-sink-temporal-divergence-swdev407790.ll (+84-92)
  • (modified) llvm/utils/gn/secondary/llvm/lib/Target/AMDGPU/BUILD.gn (+1)
diff --git a/llvm/lib/Target/AMDGPU/AMDGPU.h b/llvm/lib/Target/AMDGPU/AMDGPU.h
index 95d0ad0f9dc96a..e7914d1de4a8fb 100644
--- a/llvm/lib/Target/AMDGPU/AMDGPU.h
+++ b/llvm/lib/Target/AMDGPU/AMDGPU.h
@@ -380,6 +380,10 @@ FunctionPass *createAMDGPURewriteUndefForPHILegacyPass();
 void initializeAMDGPURewriteUndefForPHILegacyPass(PassRegistry &);
 extern char &AMDGPURewriteUndefForPHILegacyPassID;
 
+FunctionPass *createAMDGPUAnnotateVaryingBranchWeightsLegacyPass();
+void initializeAMDGPUAnnotateVaryingBranchWeightsLegacyPass(PassRegistry &);
+extern char &AMDGPUAnnotateVaryingBranchWeightsLegacyPassID;
+
 class AMDGPURewriteUndefForPHIPass
     : public PassInfoMixin<AMDGPURewriteUndefForPHIPass> {
 public:
@@ -397,6 +401,17 @@ class SIAnnotateControlFlowPass
   PreservedAnalyses run(Function &F, FunctionAnalysisManager &AM);
 };
 
+class AMDGPUAnnotateVaryingBranchWeightsPass
+    : public PassInfoMixin<AMDGPUAnnotateVaryingBranchWeightsPass> {
+private:
+  const AMDGPUTargetMachine &TM;
+
+public:
+  AMDGPUAnnotateVaryingBranchWeightsPass(const AMDGPUTargetMachine &TM)
+      : TM(TM) {}
+  PreservedAnalyses run(Function &F, FunctionAnalysisManager &AM);
+};
+
 void initializeSIAnnotateControlFlowLegacyPass(PassRegistry &);
 extern char &SIAnnotateControlFlowLegacyPassID;
 
diff --git a/llvm/lib/Target/AMDGPU/AMDGPUAnnotateVaryingBranchWeights.cpp b/llvm/lib/Target/AMDGPU/AMDGPUAnnotateVaryingBranchWeights.cpp
new file mode 100644
index 00000000000000..3c637290cbdbe3
--- /dev/null
+++ b/llvm/lib/Target/AMDGPU/AMDGPUAnnotateVaryingBranchWeights.cpp
@@ -0,0 +1,201 @@
+//===- AMDGPUAnnotateVaryingBranchWeights.cpp -----------------------------===//
+//
+// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
+// See https://llvm.org/LICENSE.txt for license information.
+// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
+//
+//===----------------------------------------------------------------------===//
+// Estimate if conditional branches for which SIAnnotateControlFlow introduced
+// amdgcn_if or amdgcn_else intrinsics are likely to have different outcomes for
+// the threads of each wavefront. If that is the case, BranchWeight metadata is
+// added to signal that "then" and "else" blocks are both likely to be executed.
+// This may introduce branch weights that would be self-contradictory in a
+// non-SIMT setting.
+//
+// A consequence of this is that SIPreEmitPeephole is more likely to eliminate
+// s_cbranch_execz instructions that were introduced to skip these blocks when
+// no thread in the wavefront is active for them.
+//
+// Should only run after SIAnnotateControlFlow.
+//===----------------------------------------------------------------------===//
+
+#include "AMDGPU.h"
+#include "AMDGPUTargetMachine.h"
+#include "GCNSubtarget.h"
+#include "llvm/Analysis/LazyValueInfo.h"
+#include "llvm/Analysis/TargetTransformInfo.h"
+#include "llvm/Analysis/ValueTracking.h"
+#include "llvm/CodeGen/TargetPassConfig.h"
+#include "llvm/IR/Analysis.h"
+#include "llvm/IR/Instructions.h"
+#include "llvm/IR/IntrinsicsAMDGPU.h"
+#include "llvm/IR/PatternMatch.h"
+#include "llvm/IR/ProfDataUtils.h"
+#include "llvm/InitializePasses.h"
+#include "llvm/Support/Casting.h"
+#include "llvm/Target/TargetMachine.h"
+
+using namespace llvm;
+
+#define DEBUG_TYPE "amdgpu-annotate-varying-branch-weights"
+
+namespace {
+
+class AMDGPUAnnotateVaryingBranchWeightsImpl {
+public:
+  AMDGPUAnnotateVaryingBranchWeightsImpl() = delete;
+  AMDGPUAnnotateVaryingBranchWeightsImpl(const GCNSubtarget &ST,
+                                         const TargetTransformInfo &TTI)
+      : ST(ST), TTI(TTI) {
+    // Determine weights that signal that a branch is very likely to be
+    // predicted correctly, i.e., whose ratio exceeds
+    // TTI.getPredictableBranchThreshold().
+    auto BranchProbThreshold = TTI.getPredictableBranchThreshold();
+    LikelyWeight = BranchProbThreshold.getNumerator();
+    UnlikelyWeight = BranchProbThreshold.getDenominator() - LikelyWeight;
+    if (UnlikelyWeight > 0)
+      --UnlikelyWeight;
+  }
+
+  bool run(Function &F);
+
+private:
+  const GCNSubtarget &ST;
+  const TargetTransformInfo &TTI;
+  uint32_t LikelyWeight;
+  uint32_t UnlikelyWeight;
+  ValueMap<const Value *, bool> LikelyVaryingCache;
+
+  /// Heuristically check if it is likely that a wavefront has dynamically
+  /// varying values for V.
+  bool isLikelyVarying(const Value *V);
+
+  /// Set branch weights that signal that the "true" successor of Term is the
+  /// likely destination, if no prior weights are present.
+  /// Return true if weights were set.
+  bool setTrueSuccessorLikely(BranchInst *Term);
+};
+
+class AMDGPUAnnotateVaryingBranchWeightsLegacy : public FunctionPass {
+public:
+  static char ID;
+  AMDGPUAnnotateVaryingBranchWeightsLegacy() : FunctionPass(ID) {
+    initializeAMDGPUAnnotateVaryingBranchWeightsLegacyPass(
+        *PassRegistry::getPassRegistry());
+  }
+
+  StringRef getPassName() const override {
+    return "AMDGPU Annotate Varying Branch Weights";
+  }
+
+  void getAnalysisUsage(AnalysisUsage &AU) const override {
+    AU.addRequired<TargetPassConfig>();
+    AU.setPreservesCFG();
+  }
+
+  bool runOnFunction(Function &F) override {
+    TargetPassConfig &TPC = getAnalysis<TargetPassConfig>();
+    const TargetMachine &TM = TPC.getTM<TargetMachine>();
+    const GCNSubtarget &ST = TM.getSubtarget<GCNSubtarget>(F);
+    const TargetTransformInfo &TTI = TM.getTargetTransformInfo(F);
+    return AMDGPUAnnotateVaryingBranchWeightsImpl(ST, TTI).run(F);
+  }
+};
+
+} // end anonymous namespace
+
+char AMDGPUAnnotateVaryingBranchWeightsLegacy::ID = 0;
+
+INITIALIZE_PASS_BEGIN(AMDGPUAnnotateVaryingBranchWeightsLegacy, DEBUG_TYPE,
+                      "Annotate Varying Branch Weights", false, false)
+INITIALIZE_PASS_END(AMDGPUAnnotateVaryingBranchWeightsLegacy, DEBUG_TYPE,
+                    "Annotate Varying Branch Weights", false, false)
+
+FunctionPass *llvm::createAMDGPUAnnotateVaryingBranchWeightsLegacyPass() {
+  return new AMDGPUAnnotateVaryingBranchWeightsLegacy();
+}
+
+PreservedAnalyses
+AMDGPUAnnotateVaryingBranchWeightsPass::run(Function &F,
+                                            FunctionAnalysisManager &AM) {
+  const GCNSubtarget &ST = TM.getSubtarget<GCNSubtarget>(F);
+  const TargetTransformInfo &TTI = TM.getTargetTransformInfo(F);
+  bool Changed = AMDGPUAnnotateVaryingBranchWeightsImpl(ST, TTI).run(F);
+
+  if (!Changed)
+    return PreservedAnalyses::all();
+
+  PreservedAnalyses PA;
+  PA.preserveSet<CFGAnalyses>();
+  return PA;
+}
+
+bool AMDGPUAnnotateVaryingBranchWeightsImpl::isLikelyVarying(const Value *V) {
+  // Check if V is a source of divergence or if it transitively uses one.
+  if (TTI.isSourceOfDivergence(V))
+    return true;
+
+  auto *U = dyn_cast<User>(V);
+  if (!U)
+    return false;
+
+  // Have we already checked V?
+  auto CacheEntry = LikelyVaryingCache.find(V);
+  if (CacheEntry != LikelyVaryingCache.end())
+    return CacheEntry->second;
+
+  // Does it use a likely varying Value?
+  bool Result = false;
+  for (const auto &Use : U->operands()) {
+    Result |= isLikelyVarying(Use);
+    if (Result)
+      break;
+  }
+
+  LikelyVaryingCache.insert({V, Result});
+  return Result;
+}
+
+bool AMDGPUAnnotateVaryingBranchWeightsImpl::setTrueSuccessorLikely(
+    BranchInst *Term) {
+  assert(Term->isConditional());
+
+  // Don't overwrite existing branch weights.
+  if (hasProfMD(*Term))
+    return false;
+
+  llvm::setBranchWeights(*Term, {LikelyWeight, UnlikelyWeight}, false);
+  LLVM_DEBUG(dbgs() << "Added branch weights: " << *Term << '\n');
+  return true;
+}
+
+bool AMDGPUAnnotateVaryingBranchWeightsImpl::run(Function &F) {
+  // If the workgroup has only a single thread, the condition cannot vary.
+  const auto WGSizes = ST.getFlatWorkGroupSizes(F);
+  if (WGSizes.first <= 1)
+    return false;
+
+  using namespace PatternMatch;
+
+  bool Changed = false;
+  for (auto &BB : F) {
+    auto *Term = BB.getTerminator();
+    // Look for conditional branches whose condition is an ExtractValueInst
+    // that extracts the return value of a call to the amdgcn_if or amdgcn_else
+    // intrinsic.
+    if (match(Term, m_Br(m_ExtractValue<0>(m_CombineOr(
+                             m_Intrinsic<Intrinsic::amdgcn_if>(),
+                             m_Intrinsic<Intrinsic::amdgcn_else>())),
+                         m_Value(), m_Value()))) {
+      // This condition is an artificial value resulting from the control
+      // flow intrinsic, not the actual branch condition. However, the
+      // intrinsics connect it via data flow with the actual condition
+      // (even for the amdgcn_else intrinsic, via the matching amdgcn_if
+      // intrinsic), so isLikelyVarying still produces meaningful results.
+      if (isLikelyVarying(cast<BranchInst>(Term)->getCondition()))
+        Changed |= setTrueSuccessorLikely(cast<BranchInst>(Term));
+    }
+  }
+
+  return Changed;
+}
diff --git a/llvm/lib/Target/AMDGPU/AMDGPUPassRegistry.def b/llvm/lib/Target/AMDGPU/AMDGPUPassRegistry.def
index 174a90f0aa419d..63a7b0a50c4455 100644
--- a/llvm/lib/Target/AMDGPU/AMDGPUPassRegistry.def
+++ b/llvm/lib/Target/AMDGPU/AMDGPUPassRegistry.def
@@ -65,6 +65,9 @@ FUNCTION_PASS("amdgpu-unify-divergent-exit-nodes",
               AMDGPUUnifyDivergentExitNodesPass())
 FUNCTION_PASS("amdgpu-usenative", AMDGPUUseNativeCallsPass())
 FUNCTION_PASS("si-annotate-control-flow", SIAnnotateControlFlowPass(*static_cast<const GCNTargetMachine *>(this)))
+FUNCTION_PASS("amdgpu-annotate-varying-branch-weights",
+              AMDGPUAnnotateVaryingBranchWeightsPass(
+                  *static_cast<const GCNTargetMachine *>(this)))
 #undef FUNCTION_PASS
 
 #ifndef FUNCTION_ANALYSIS
diff --git a/llvm/lib/Target/AMDGPU/AMDGPUTargetMachine.cpp b/llvm/lib/Target/AMDGPU/AMDGPUTargetMachine.cpp
index da18f2b20f1427..325db6ca9b7ace 100644
--- a/llvm/lib/Target/AMDGPU/AMDGPUTargetMachine.cpp
+++ b/llvm/lib/Target/AMDGPU/AMDGPUTargetMachine.cpp
@@ -501,6 +501,7 @@ extern "C" LLVM_EXTERNAL_VISIBILITY void LLVMInitializeAMDGPUTarget() {
   initializeAMDGPUReserveWWMRegsPass(*PR);
   initializeAMDGPURewriteOutArgumentsPass(*PR);
   initializeAMDGPURewriteUndefForPHILegacyPass(*PR);
+  initializeAMDGPUAnnotateVaryingBranchWeightsLegacyPass(*PR);
   initializeAMDGPUUnifyMetadataPass(*PR);
   initializeSIAnnotateControlFlowLegacyPass(*PR);
   initializeAMDGPUInsertDelayAluPass(*PR);
@@ -1315,6 +1316,7 @@ bool GCNPassConfig::addPreISel() {
   // analysis. This depends on stopping SIAnnotateControlFlow from making
   // control flow modifications.
   addPass(createAMDGPURewriteUndefForPHILegacyPass());
+  addPass(createAMDGPUAnnotateVaryingBranchWeightsLegacyPass());
 
   addPass(createLCSSAPass());
 
@@ -2003,6 +2005,8 @@ void AMDGPUCodeGenPassBuilder::addPreISel(AddIRPass &addPass) const {
   // control flow modifications.
   addPass(AMDGPURewriteUndefForPHIPass());
 
+  addPass(AMDGPUAnnotateVaryingBranchWeightsPass(TM));
+
   addPass(LCSSAPass());
 
   if (TM.getOptLevel() > CodeGenOptLevel::Less)
diff --git a/llvm/lib/Target/AMDGPU/CMakeLists.txt b/llvm/lib/Target/AMDGPU/CMakeLists.txt
index fed29c3e14aae2..7e85770c70d5ff 100644
--- a/llvm/lib/Target/AMDGPU/CMakeLists.txt
+++ b/llvm/lib/Target/AMDGPU/CMakeLists.txt
@@ -54,6 +54,7 @@ add_llvm_target(AMDGPUCodeGen
   AMDGPUCodeGenPrepare.cpp
   AMDGPUCombinerHelper.cpp
   AMDGPUCtorDtorLowering.cpp
+  AMDGPUAnnotateVaryingBranchWeights.cpp
   AMDGPUExportClustering.cpp
   AMDGPUFrameLowering.cpp
   AMDGPUGlobalISelDivergenceLowering.cpp
diff --git a/llvm/test/CodeGen/AMDGPU/conditional-mem-no-cbranch-execz.ll b/llvm/test/CodeGen/AMDGPU/conditional-mem-no-cbranch-execz.ll
new file mode 100644
index 00000000000000..f5f4c0a12eaeee
--- /dev/null
+++ b/llvm/test/CodeGen/AMDGPU/conditional-mem-no-cbranch-execz.ll
@@ -0,0 +1,66 @@
+; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py UTC_ARGS: --version 5
+; RUN: llc -mtriple=amdgcn-amd-amdhsa -mcpu=gfx942 -mattr=+sramecc,-xnack < %s | FileCheck %s
+
+; Check that simple conditional memory accesses that are guarded by likely
+; varying conditions are not lowered with an s_cbranch_execz to bypass them.
+
+declare i32 @llvm.amdgcn.workitem.id.x()
+declare i32 @llvm.amdgcn.workitem.id.y()
+
+define amdgpu_kernel void @cond_ops(ptr addrspace(1) inreg %x, ptr addrspace(1) inreg %y) #0 !reqd_work_group_size !0 {
+; CHECK-LABEL: cond_ops:
+; CHECK:         s_trap 2 ; Kernarg preload header. Trap with incompatible firmware that doesn't support preloading kernel arguments.
+; CHECK-NEXT:    .fill 63, 4, 0xbf800000 ; s_nop 0
+; CHECK-NEXT:  ; %bb.0: ; %entry
+; CHECK-NEXT:    v_and_b32_e32 v1, 0x3ff, v0
+; CHECK-NEXT:    v_bfe_u32 v0, v0, 10, 10
+; CHECK-NEXT:    v_lshl_or_b32 v5, v0, 6, v1
+; CHECK-NEXT:    v_lshrrev_b32_e32 v0, 4, v5
+; CHECK-NEXT:    v_cmp_gt_u32_e32 vcc, 15, v0
+; CHECK-NEXT:    v_mov_b32_e32 v0, 0
+; CHECK-NEXT:    v_lshlrev_b32_e32 v4, 4, v5
+; CHECK-NEXT:    v_mov_b32_e32 v1, 0
+; CHECK-NEXT:    v_mov_b32_e32 v2, 0
+; CHECK-NEXT:    v_mov_b32_e32 v3, 0
+; CHECK-NEXT:    s_and_saveexec_b64 s[0:1], vcc
+; CHECK-NEXT:  ; %bb.1: ; %do.load
+; CHECK-NEXT:    global_load_dwordx4 v[0:3], v4, s[8:9]
+; CHECK-NEXT:  ; %bb.2: ; %post.load
+; CHECK-NEXT:    s_or_b64 exec, exec, s[0:1]
+; CHECK-NEXT:    v_and_b32_e32 v5, 15, v5
+; CHECK-NEXT:    v_cmp_gt_u32_e32 vcc, 15, v5
+; CHECK-NEXT:    s_and_saveexec_b64 s[0:1], vcc
+; CHECK-NEXT:    s_cbranch_execz .LBB0_4
+; CHECK-NEXT:  ; %bb.3: ; %do.store
+; CHECK-NEXT:    s_waitcnt vmcnt(0)
+; CHECK-NEXT:    global_store_dwordx4 v4, v[0:3], s[10:11]
+; CHECK-NEXT:  .LBB0_4: ; %exit
+; CHECK-NEXT:    s_endpgm
+entry:
+  %tid.x = tail call range(i32 0, 64) i32 @llvm.amdgcn.workitem.id.x()
+  %tid.y = tail call range(i32 0, 4) i32 @llvm.amdgcn.workitem.id.y()
+  %tid.y.shift = shl nuw nsw i32 %tid.y, 6
+  %tid = or disjoint i32 %tid.x, %tid.y.shift
+  %k = lshr i32 %tid, 4
+  %j = and i32 %tid, 15
+  %load.cond = icmp ult i32 %k, 15
+  %tid.ext = zext nneg i32 %tid to i64
+  %my.x = getelementptr <4 x float>, ptr addrspace(1) %x, i64 %tid.ext
+  br i1 %load.cond, label %do.load, label %post.load
+do.load:
+  %loaded = load <4 x float>, ptr addrspace(1) %my.x
+  br label %post.load
+post.load:
+  %maybe.loaded = phi <4 x float> [ %loaded, %do.load ], [ zeroinitializer, %entry ]
+  %my.y = getelementptr <4 x float>, ptr addrspace(1) %y, i64 %tid.ext
+  %store.cond = icmp ult i32 %j, 15
+  br i1 %store.cond, label %do.store, label %exit
+do.store:
+  store <4 x float> %maybe.loaded, ptr addrspace(1) %my.y
+  br label %exit
+exit:
+  ret void
+}
+
+attributes #0 = {"uniform-work-group-size"="true" "amdgpu-flat-work-group-size"="256,256"}
+!0 = !{i32 64, i32 4, i32 1}
diff --git a/llvm/test/CodeGen/AMDGPU/llc-pipeline.ll b/llvm/test/CodeGen/AMDGPU/llc-pipeline.ll
index e77f4f69e265bb..65bb9a9652b17e 100644
--- a/llvm/test/CodeGen/AMDGPU/llc-pipeline.ll
+++ b/llvm/test/CodeGen/AMDGPU/llc-pipeline.ll
@@ -87,6 +87,7 @@
 ; GCN-O0-NEXT:        Cycle Info Analysis
 ; GCN-O0-NEXT:        Uniformity Analysis
 ; GCN-O0-NEXT:        AMDGPU Rewrite Undef for PHI
+; GCN-O0-NEXT:        AMDGPU Annotate Varying Branch Weights
 ; GCN-O0-NEXT:        LCSSA Verifier
 ; GCN-O0-NEXT:        Loop-Closed SSA Form Pass
 ; GCN-O0-NEXT:      DummyCGSCCPass
@@ -279,6 +280,7 @@
 ; GCN-O1-NEXT:        Cycle Info Analysis
 ; GCN-O1-NEXT:        Uniformity Analysis
 ; GCN-O1-NEXT:        AMDGPU Rewrite Undef for PHI
+; GCN-O1-NEXT:        AMDGPU Annotate Varying Branch Weights
 ; GCN-O1-NEXT:        LCSSA Verifier
 ; GCN-O1-NEXT:        Loop-Closed SSA Form Pass
 ; GCN-O1-NEXT:      DummyCGSCCPass
@@ -579,6 +581,7 @@
 ; GCN-O1-OPTS-NEXT:        Cycle Info Analysis
 ; GCN-O1-OPTS-NEXT:        Uniformity Analysis
 ; GCN-O1-OPTS-NEXT:        AMDGPU Rewrite Undef for PHI
+; GCN-O1-OPTS-NEXT:        AMDGPU Annotate Varying Branch Weights
 ; GCN-O1-OPTS-NEXT:        LCSSA Verifier
 ; GCN-O1-OPTS-NEXT:        Loop-Closed SSA Form Pass
 ; GCN-O1-OPTS-NEXT:      DummyCGSCCPass
@@ -891,6 +894,7 @@
 ; GCN-O2-NEXT:        Cycle Info Analysis
 ; GCN-O2-NEXT:        Uniformity Analysis
 ; GCN-O2-NEXT:        AMDGPU Rewrite Undef for PHI
+; GCN-O2-NEXT:        AMDGPU Annotate Varying Branch Weights
 ; GCN-O2-NEXT:        LCSSA Verifier
 ; GCN-O2-NEXT:        Loop-Closed SSA Form Pass
 ; GCN-O2-NEXT:      Analysis if a function is memory bound
@@ -1218,6 +1222,7 @@
 ; GCN-O3-NEXT:        Cycle Info Analysis
 ; GCN-O3-NEXT:        Uniformity Analysis
 ; GCN-O3-NEXT:        AMDGPU Rewrite Undef for PHI
+; GCN-O3-NEXT:        AMDGPU Annotate Varying Branch Weights
 ; GCN-O3-NEXT:        LCSSA Verifier
 ; GCN-O3-NEXT:        Loop-Closed SSA Form Pass
 ; GCN-O3-NEXT:      Analysis if a function is memory bound
diff --git a/llvm/test/CodeGen/AMDGPU/machine-sink-temporal-divergence-swdev407790.ll b/llvm/test/CodeGen/AMDGPU/machine-sink-temporal-divergence-swdev407790.ll
index c826980991f94f..5db6dd8d64d283 100644
--- a/llvm/test/CodeGen/AMDGPU/machine-sink-temporal-divergence-swdev407790.ll
+++ b/llvm/test/CodeGen/AMDGPU/machine-sink-temporal-divergence-swdev407790.ll
@@ -146,8 +146,8 @@ define protected amdgpu_kernel void @kernel_round1(ptr addrspace(1) nocapture no
 ; CHECK-NEXT:    v_mov_b32_e32 v47, 0
 ; CHECK-NEXT:    s_mov_b32 s47, 0
 ; CHECK-NEXT:  .LBB0_5: ; =>This Loop Header: Depth=1
-; CHECK-NEXT:    ; Child Loop BB0_8 Depth 2
-; CHECK-NEXT:    ; Child Loop BB0_20 Depth 2
+; CHECK-NEXT:    ; Child Loop BB0_7 Depth 2
+; CHECK-NEXT:    ; Child Loop BB0_19 Depth 2
 ; CHECK-NEXT:    v_add_nc_u32_e32 v0, s47, v44
 ; CHECK-NEXT:    s_lshl_b32 s4, s47, 5
 ; CHECK-NEXT:    s_add_i32 s46, s47, 1
@@ -163,20 +163,7 @@ define protected amdgpu_kernel void @kernel_round1(ptr addrspace(1) nocapture no
 ; CHECK-NEXT:    ; in Loop: Header=BB0_5 Depth=1
 ; CHECK-NEXT:    s_mov_b32 s53, 0
 ; CHECK-NEXT:    s_mov_b32 s56, 0
-; CHECK-NEXT:    s_branch .LBB0_8
-; CHECK-NEXT:  .LBB0_7: ; in Loop: Header=BB0_8 Depth=2
-; CHECK-NEXT:    s_or_b32 exec_lo, exec_lo, s57
-; CHECK-NEXT:    s_add_i32 s56, s56, 4
-; CHECK-NEXT:    s_add_i32 s4, s47, s56
-; CHECK-NEXT:    v_add_nc_u32_e32 v0, s56, v57
-; CHECK-NEXT:    s_add_i32 s5, s4, 5
-; CHECK-NEXT:    s_add_i32 s4, s4, 1
-; CHECK-NEXT:    v_cmp_ge_u32_e32 vcc_lo, s5, v42
-; CHECK-NEXT:    v_mov_b32_e32 v58, s4
-; CHECK-NEXT:    s_or_b32 s53, vcc_lo, s53
-; CHECK-NEXT:    s_andn2_b32 exec_lo, exec_lo, s53
-; CHECK-NEXT:    s_cbranch_execz .LBB0_16
-; CHECK-NEXT:  .LBB0_8: ; Parent Loop BB0_5 Depth=1
+; CHECK-NEXT:  .LBB0_7: ; Parent Loop BB0_5 Depth=1
 ; CHECK-NEXT:    ; => This Inner Loop Header: Depth=2
 ; CHECK-NEXT:    v_add_nc_u32_e32 v59, s56, v46
 ; CHECK-NEXT:    v_add_nc_u32_e32 v58, s56, v57
@@ -184,8 +171,8 @@ define protected amdgpu_kernel void @kernel_round1(ptr addrspace(1) nocapture no
 ; CHECK-NEXT:    s_waitcnt lgkmcnt(0)
 ; CHECK-NEXT:    v_cmp_eq_u16_sdwa s4, v56, v0 src0_sel:BYTE_0 src1_sel:DWORD
 ; CHECK-NEXT:    s_and_saveexec_b32 s57, s4
-; CHECK-NEXT:    s_cbranch_execz .LBB0_10
-; CHECK-NEXT:  ; %bb.9: ; in Loop: Header=BB0_8 Depth=2
+; CHECK-NEXT:    s_cbranch_execz .LBB0_9
+; CHECK-NEXT:  ; %bb.8: ; in Loop: Header=BB0_7 Depth=2
 ; CHECK-NEXT:    v_mov_b32_e32 v31, v40
 ; CHECK-NEXT:    v_mov_b32_e32 v0, 0x3c00
 ; CHECK-NEXT:    s_add_u32 s8, s34, 40
@@ -203,14 +190,14 @@ define protected amdgpu_kernel void @kernel_round1(ptr addrspace(1) nocapture no
 ; CHECK-NEXT:    s_swappc_b64 s[30:31], s[16:17]
 ; CHECK-NEXT:    v_lshlrev_b32_e32 v0, 2, v0
 ; CHECK-NEXT:    ds_write_b32 v0, v58
-; CHECK-NEXT:  .LBB0_10: ; in Loop: Header=BB0_8 Depth=2
+; CHECK-NEXT:  .LBB0_9: ; in Loop: Header=BB0_7 Depth=2
 ; CHECK-NEXT:    s_or_b32 exec_lo, exec_lo, s57
 ; CHECK-NEXT:    ds_read_u8 v0, v59 offset:1
 ; CHECK-NEXT:    s_waitcnt lgkmcnt(0)
 ; CHECK-NEXT:    v_cmp_eq_u16_sdwa s4, v56, v0 src0_sel:BYTE_0 src1_sel:DWORD
 ; CHECK-NEXT:    s_and_saveexec_b32 s57, s4
-; CHECK-NEXT:    s_cbranch_execz .LBB0_12
-; CHECK-NEXT:  ; %bb.11: ; in Loop: Header=BB0_8 Depth=2
+; CHECK-NEXT:    s_cbranch_execz .LBB0_11
+; CHECK-NEXT:  ; %bb.10: ; in Loop: Header=BB0_7 Depth=2
 ; CHECK-NEXT:    v_mov_b32_e32 v31, v40
 ; CHECK-NEXT:    v_mov_b32_e32 v0, 0x3c00
 ; CHECK-NEXT:    s_add_u32 s8, s34, 40
@@ -229,14 +216,14 @@ define protected amdgpu_kernel void @kernel_round1(ptr addrspace(1) nocapture no
 ; CHECK-NEXT:    s_swappc_...
[truncated]

// intrinsics connect it via data flow with the actual condition
// (even for the amdgcn_else intrinsic, via the matching amdgcn_if
// intrinsic), so isLikelyVarying still produces meaningful results.
if (isLikelyVarying(cast<BranchInst>(Term)->getCondition()))
Contributor

I don't understand the big picture here. isLikelyVarying just seems to be reimplementing divergence analysis by checking for things that either are or use sources of divergence. But don't we already know that the branch is divergent, because SIAnnotateControlFlow only inserts intrinsics like amdgcn_if in divergent control flow?

Member Author

The goal is to use a heuristic there to determine when branches are likely to be varying, for instance when they are based on (the lower bits of) the thread id. The current implementation here is very likely not the heuristic we want; at this point, the purpose of this PR is to gather initial feedback on where and how the heuristic is used before tuning it.

Contributor

OK. Then I guess I am just pointing out that your current heuristic could be simplified to return true :)

Member Author

Well, technically not quite, right? Uniformity analysis does more than just follow data dependences. Via control dependences (e.g. as a phi at the control flow join of a non-uniform branch), a Value could be non-uniform but this heuristic wouldn't trigger (but we'd probably want to be more restrictive than that as well).
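For illustration, a minimal IR sketch (not from the patch) of such a case: the phi's incoming values are constants, so the purely data-flow walk in isLikelyVarying never reaches a source of divergence, even though uniformity analysis would mark %r divergent because of the divergent branch that controls which value each lane gets.

define i32 @control_dependent_divergence() {
entry:
  %tid = call i32 @llvm.amdgcn.workitem.id.x()
  %cond = icmp ult i32 %tid, 16          ; divergent condition
  br i1 %cond, label %a, label %b
a:
  br label %join
b:
  br label %join
join:
  ; Uniform (constant) operands, but a divergent result.
  %r = phi i32 [ 0, %a ], [ 1, %b ]
  ret i32 %r
}

declare i32 @llvm.amdgcn.workitem.id.x()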

jayfoad (Contributor) commented Nov 25, 2024

I still don't really understand...

Are you trying to identify branches whose condition is divergent according to UA, but is likely to be uniform in practice?

How can branch weights tell you whether the branch is likely uniform or not? I guess if it's likely uniform-and-true or uniform-and-false then they can tell you that.

For what kind of branches do you want to influence the exec-skipping logic? Can you give an example?

arsenm (Contributor) commented Nov 25, 2024

Are you trying to identify branches whose condition is divergent according to UA, but is likely to be uniform in practice?

I think it's the opposite. It's divergent, and known to never be uniform in practice.

ritter-x2a (Member, Author)

I still don't really understand...

Are you trying to identify branches whose condition is divergent according to UA, but is likely to be uniform in practice?

How can branch weights tell you whether the branch is likely uniform or not? I guess if it's likely uniform-and-true or uniform-and-false then they can tell you that.

For what kind of branches do you want to influence the exec-skipping logic? Can you give an example?

There is an example in the added llvm/test/CodeGen/AMDGPU/conditional-mem-no-cbranch-execz.ll test (see also SWDEV-483228 for more discussion). When every wavefront diverges at a conditional, the execz branches that skip the then/else cases are unnecessary and harm performance. The goal here is to remove the execz branches when that is likely the case, since the performance gains are probably not worth the effort of implementing a full, sound "must-not-be-uniform" analysis for this.

ritter-x2a (Member, Author)

How can branch weights tell you whether the branch is likely uniform or not? I guess if it's likely uniform-and-true or uniform-and-false then they can tell you that.

This patch doesn't use branch weights to find anything out; it sets them to signal (to the later SIPreEmitPeephole pass) that both the 'then' and 'else' cases are likely to be executed, i.e., that it should remove the execz branches if possible.

arsenm (Contributor) left a comment

I feel like this deserves a separate representation from branch frequencies, but I'm not sure what it would be. And it would involve more threading work to get the information to the late backend pass.

bool AMDGPUAnnotateVaryingBranchWeightsImpl::run(Function &F) {
// If the workgroup has only a single thread, the condition cannot vary.
const auto WGSizes = ST.getFlatWorkGroupSizes(F);
if (WGSizes.first <= 1)
Contributor

Add test for this?

Contributor

I also suspect you want the product here, not just checking the first term

Member Author

@krzysz00 From what I understand, ST.getFlatWorkGroupSizes() (based on the amdgpu-flat-work-group-size attribute) returns a pair of minimum and maximum numbers of work items (flattened across the dimensions), so .first is already the minimum number of total work items in the workgroup.
I'm currently working on a patch that refines the heuristic and also adds a similar check here based on the product of the reqd_work_group_size dimensions.

Contributor

Can you use ST.isSingleLaneExecution instead?

Contributor

Ah, I misremembered which function you were calling

ritter-x2a (Member, Author) commented Nov 28, 2024

Can you use ST.isSingleLaneExecution instead?

Returning early if ST.isSingleLaneExecution(F) is true would mean annotating more cases: The intention of the current version was to annotate only if it's known that single lane execution cannot happen for F. The changed variant would additionally annotate cases where we don't know if single lane execution could happen.
I'd be open to that change; it seems likely that kernels without explicit work group bounds and for which performance matters use more than a single lane.

@@ -0,0 +1,66 @@
; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py UTC_ARGS: --version 5
; RUN: llc -mtriple=amdgcn-amd-amdhsa -mcpu=gfx942 -mattr=+sramecc,-xnack < %s | FileCheck %s

Contributor

Need some IR pass checks

Member Author

Added tests for the pass in annotate-likely-varying-branches.ll. I'll update them when changing to a different annotation mechanism instead of branch weights.

jayfoad (Contributor) commented Nov 26, 2024

I see. I think you really want to know "will this divergent condition be true for at least one lane", in which case the "execz" condition will not be satisfied so you can remove the s_cbranch_execz.

I agree that using normal branch weights for this feels wrong, since a condition like "laneid == 0" is mostly false (for most lanes) but it does satisfy the condition you're interested in.

I don't see how any heuristic based on sources of divergence can help you here. "Source of divergence" just means "not guaranteed to be uniform", and says nothing about whether the results in different lanes are actually likely to be different. I think perhaps you just need to pattern match specific cases like "laneid == constant" and "x == readfirstlane(x)".
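To make the "laneid == 0" point concrete, here is an illustrative sketch (not from the patch); computing the lane id via llvm.amdgcn.mbcnt.lo/hi is just one common way to write it. The condition is divergent but true for exactly one lane, so with full wavefronts exec is never zero at %then and the skip branch is redundant, even though plain branch weights would only say that %then is rarely taken.

define amdgpu_kernel void @lane_zero_only(ptr addrspace(1) %p) {
entry:
  %lo = call i32 @llvm.amdgcn.mbcnt.lo(i32 -1, i32 0)
  %lane = call i32 @llvm.amdgcn.mbcnt.hi(i32 -1, i32 %lo)
  %cond = icmp eq i32 %lane, 0           ; true for one lane per wavefront
  br i1 %cond, label %then, label %exit
then:
  store i32 1, ptr addrspace(1) %p
  br label %exit
exit:
  ret void
}

declare i32 @llvm.amdgcn.mbcnt.lo(i32, i32)
declare i32 @llvm.amdgcn.mbcnt.hi(i32, i32)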

ritter-x2a (Member, Author)

@jayfoad The idea is to be more optimistic than that, along the lines of "does the condition use bits that are varying in every wavefront". For instance, if required workgroup dimensions are known, we could consider only the workitem.id intrinsics for dimensions that vary within a wavefront, where the current PR checks for sources of divergence.

This means that execz branches that would be taken may be removed; the actual removal only happens if it doesn't affect correctness.

jayfoad (Contributor) commented Nov 26, 2024

Do you have realistic examples of cases where we do want the execz branch, and cases where we do not want the execz branch?

I am sceptical that you can effectively distinguish these categories with a generic analysis of the IR (but maybe I'm wrong). I think it's more likely that you would have to pattern-match some examples of one category, and assume that everything else is in the other category.

ritter-x2a (Member, Author)

@jayfoad
An example for where the execz branch should be eliminated would be a store in every even thread:

__global__ void foo(int *buf, size_t n) {
    unsigned tid = blockDim.x * blockIdx.x + threadIdx.x;
    if (tid % 2 == 0)
        buf[tid] = 0;
}

An example where the execz branch should not be eliminated would be a sequence of conditional accesses where the condition is loaded from memory (the current implementation does not respect that):

__global__ void bar(int *buf, size_t n, bool *lookup) {
    unsigned tid = blockDim.x * blockIdx.x + threadIdx.x;
    if (lookup[tid]) {
        buf[tid] = 0;
        // more memory accesses...
    }
}

Checking whether the branch condition is the result of bitwise and arithmetic operations that involve a workitem id (i.e., only continuing to check the operands of a User if it is such an instruction) would be a plausible heuristic for that, though it would benefit from benchmarking.
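For reference, a hypothetical, simplified IR rendering of the two conditions above (the blockDim/blockIdx math is omitted). A heuristic that only keeps walking through arithmetic and bitwise users of a workitem id would flag %cond.even but not %cond.lookup; the source-of-divergence walk in the current patch flags both, because the load's address also uses the thread id.

define void @condition_shapes(ptr addrspace(1) %lookup) {
  %tid = call i32 @llvm.amdgcn.workitem.id.x()

  ; foo: if (tid % 2 == 0) -- varies within every full wavefront
  %bit = and i32 %tid, 1
  %cond.even = icmp eq i32 %bit, 0

  ; bar: if (lookup[tid]) -- depends on loaded data, per-wave behavior unknown
  %gep = getelementptr i8, ptr addrspace(1) %lookup, i32 %tid
  %val = load i8, ptr addrspace(1) %gep
  %cond.lookup = icmp ne i8 %val, 0

  ; (branches on these conditions omitted for brevity)
  ret void
}

declare i32 @llvm.amdgcn.workitem.id.x()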

@krzysz00 might have more practical examples.

jayfoad (Contributor) commented Nov 26, 2024

An example for where the execz branch should be eliminated would be a store in every even thread:

__global__ void foo(int *buf, size_t n) {
    unsigned tid = blockDim.x * blockIdx.x + threadIdx.x;
    if (tid % 2 == 0)
        buf[tid] = 0;
}

Agreed.

An example where the execz branch should not be eliminated would be a sequence of conditional accesses where the condition is loaded from memory (the current implementation does not respect that):

__global__ void bar(int *buf, size_t n, bool *lookup) {
    unsigned tid = blockDim.x * blockIdx.x + threadIdx.x;
    if (lookup[tid]) {
        buf[tid] = 0;
        // more memory accesses...
    }
}

Not so sure about this. The compiler has no information about the contents of lookup[tid], so no idea how likely it is that the condition will be false for every lane, correct?

ritter-x2a (Member, Author)

Not so sure about this. The compiler has no information about the contents of lookup[tid], so no idea how likely it is that the condition will be false for every lane, correct?

Agreed. That's why I suggest keeping the handling for this case as it is in trunk and not eliminating the execz branch for it.

First round of reviewer feedback: do not match intrinsics to identify candidate branches, preserve alphabetic order of the pass registry, rename TTI->GCNTTI, update test
Make the heuristic more strict, add IR tests
ritter-x2a (Member, Author)

@jayfoad @arsenm I made the heuristic more selective in 8cf0cbf; it can now distinguish the example cases I mentioned above.

I haven't found a good alternative to using the branch weights yet since, AFAIK, normal IR metadata does not find its way to the backend. We could add a boolean isLikelyVarying parameter to the amdgcn_if and amdgcn_else intrinsics (in that case, I think moving the functionality from this pass into SIAnnotateControlFlow would make more sense).

krzysz00 (Contributor) left a comment

This feels like what I had in mind for the heuristic, but it seems like there's uncertainty there

//===----------------------------------------------------------------------===//
// Estimate if conditional branches for which SIAnnotateControlFlow introduced
// amdgcn_if or amdgcn_else intrinsics are likely to have different outcomes for
// the threads of each wavefront. If that is the case, BranchWeight metadata is
Contributor

Do we not usually call them lanes?

Member Author

Adjusted in 086ac46.

ritter-x2a (Member, Author)

This feels like what I had in mind for the heuristic, but it seems like there's uncertainty there

Thanks for the confirmation! I'm still working on getting performance numbers for a variant of this heuristic; I think the next steps on this PR depend on those numbers.

ritter-x2a added a commit to ritter-x2a/llvm-project that referenced this pull request Jan 21, 2025
Currently, we introduce branches to skip conditionally executed
instructions if the EXEC mask is zero and only eliminate them if the
scheduling model says that executing the skipped instructions is cheaper
than taking the branch instruction.

This patch adds a heuristic to SIAnnotateControlFlow to determine if the
lanes of a wavefront are likely to have dynamically varying values for
the branch condition.
This information is passed through new arguments/operands of the
amdgcn.if/else intrinsics and the SI_IF/ELSE pseudo instructions to
SILowerControlFlow, where the execz branch is inserted with
corresponding branch probabilities.
This causes SIPreEmitPeephole to eliminate the corresponding execz
branch if it is legal to do so.

This is an alternative to PR llvm#117567, using a simpler heuristic and
passing the LikelyVarying information through new arguments for the
amdgcn.if/else intrinsics and the SI_IF/ELSE pseudo instructions instead
of abusing branch weight metadata.

Most test changes are caused by the new arguments for the amdgcn.if/else
intrinsics and the SI_IF/ELSE pseudo instructions; the LikelyVarying
argument is set to false/0 in these existing tests.
New tests for the functionality are in
conditional-mem-no-cbranch-execz.ll and
annotate-likely-varying-branches.ll.

For SWDEV-483228.
ritter-x2a added a commit to ritter-x2a/llvm-project that referenced this pull request Jan 27, 2025
ritter-x2a (Member, Author)

Closing this one in favor of #123749.

ritter-x2a closed this Feb 6, 2025