[AMDGPU] Eliminate likely-spurious execz checks #117567


Closed
wants to merge 4 commits

Conversation

ritter-x2a (Member)

Currently, we introduce branches to skip conditionally executed instructions if the EXEC mask is zero and only eliminate them if the scheduling model says that executing the skipped instructions is cheaper than taking the branch instruction.

This patch adds branch weights to branches where the threads of a wavefront are likely (according to a heuristic) to have dynamically varying values for the branch condition. This causes SIPreEmitPeephole to eliminate the corresponding execz branch.

Currently, this is implemented as a new middle end pass with a rather optimistic heuristic, to gather initial feedback.
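For illustration, here is a rough IR sketch (not taken from the patch; the function name and the weight values are made up) of the shape this pass produces: SIAnnotateControlFlow has rewritten the branch to use llvm.amdgcn.if, and the new pass attaches !prof branch weights marking the guarded block as likely, which later lets SIPreEmitPeephole drop the s_cbranch_execz that would skip it.

; Sketch only: a kernel where the condition depends on the low bit of the
; thread id, so it varies within every wavefront.
define amdgpu_kernel void @cond_store(ptr addrspace(1) %out) {
entry:
  %tid = call i32 @llvm.amdgcn.workitem.id.x()
  %bit = and i32 %tid, 1
  %cond = icmp eq i32 %bit, 0
  ; Inserted by SIAnnotateControlFlow.
  %if = call { i1, i64 } @llvm.amdgcn.if.i64(i1 %cond)
  %guard = extractvalue { i1, i64 } %if, 0
  %mask = extractvalue { i1, i64 } %if, 1
  ; The new pass adds !prof, declaring the "true" successor likely.
  br i1 %guard, label %then, label %flow, !prof !0

then:
  %gep = getelementptr i32, ptr addrspace(1) %out, i32 %tid
  store i32 0, ptr addrspace(1) %gep
  br label %flow

flow:
  call void @llvm.amdgcn.end.cf.i64(i64 %mask)
  ret void
}

declare i32 @llvm.amdgcn.workitem.id.x()
declare { i1, i64 } @llvm.amdgcn.if.i64(i1)
declare void @llvm.amdgcn.end.cf.i64(i64)

!0 = !{!"branch_weights", i32 1000, i32 1} ; illustrative values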

llvmbot (Member) commented Nov 25, 2024

@llvm/pr-subscribers-backend-amdgpu

Author: Fabian Ritter (ritter-x2a)

Changes

Currently, we introduce branches to skip conditionally executed instructions if the EXEC mask is zero and only eliminate them if the scheduling model says that executing the skipped instructions is cheaper than taking the branch instruction.

This patch adds branch weights to branches where the threads of a wavefront are likely (according to a heuristic) to have dynamically varying values for the branch condition. This causes SIPreEmitPeephole to eliminate the corresponding execz branch.

Currently, this is implemented as a new middle end pass with a rather optimistic heuristic, to gather initial feedback.


Patch is 33.21 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/117567.diff

9 Files Affected:

  • (modified) llvm/lib/Target/AMDGPU/AMDGPU.h (+15)
  • (added) llvm/lib/Target/AMDGPU/AMDGPUAnnotateVaryingBranchWeights.cpp (+201)
  • (modified) llvm/lib/Target/AMDGPU/AMDGPUPassRegistry.def (+3)
  • (modified) llvm/lib/Target/AMDGPU/AMDGPUTargetMachine.cpp (+4)
  • (modified) llvm/lib/Target/AMDGPU/CMakeLists.txt (+1)
  • (added) llvm/test/CodeGen/AMDGPU/conditional-mem-no-cbranch-execz.ll (+66)
  • (modified) llvm/test/CodeGen/AMDGPU/llc-pipeline.ll (+5)
  • (modified) llvm/test/CodeGen/AMDGPU/machine-sink-temporal-divergence-swdev407790.ll (+84-92)
  • (modified) llvm/utils/gn/secondary/llvm/lib/Target/AMDGPU/BUILD.gn (+1)
diff --git a/llvm/lib/Target/AMDGPU/AMDGPU.h b/llvm/lib/Target/AMDGPU/AMDGPU.h
index 95d0ad0f9dc96a..e7914d1de4a8fb 100644
--- a/llvm/lib/Target/AMDGPU/AMDGPU.h
+++ b/llvm/lib/Target/AMDGPU/AMDGPU.h
@@ -380,6 +380,10 @@ FunctionPass *createAMDGPURewriteUndefForPHILegacyPass();
 void initializeAMDGPURewriteUndefForPHILegacyPass(PassRegistry &);
 extern char &AMDGPURewriteUndefForPHILegacyPassID;
 
+FunctionPass *createAMDGPUAnnotateVaryingBranchWeightsLegacyPass();
+void initializeAMDGPUAnnotateVaryingBranchWeightsLegacyPass(PassRegistry &);
+extern char &AMDGPUAnnotateVaryingBranchWeightsLegacyPassID;
+
 class AMDGPURewriteUndefForPHIPass
     : public PassInfoMixin<AMDGPURewriteUndefForPHIPass> {
 public:
@@ -397,6 +401,17 @@ class SIAnnotateControlFlowPass
   PreservedAnalyses run(Function &F, FunctionAnalysisManager &AM);
 };
 
+class AMDGPUAnnotateVaryingBranchWeightsPass
+    : public PassInfoMixin<AMDGPUAnnotateVaryingBranchWeightsPass> {
+private:
+  const AMDGPUTargetMachine &TM;
+
+public:
+  AMDGPUAnnotateVaryingBranchWeightsPass(const AMDGPUTargetMachine &TM)
+      : TM(TM) {}
+  PreservedAnalyses run(Function &F, FunctionAnalysisManager &AM);
+};
+
 void initializeSIAnnotateControlFlowLegacyPass(PassRegistry &);
 extern char &SIAnnotateControlFlowLegacyPassID;
 
diff --git a/llvm/lib/Target/AMDGPU/AMDGPUAnnotateVaryingBranchWeights.cpp b/llvm/lib/Target/AMDGPU/AMDGPUAnnotateVaryingBranchWeights.cpp
new file mode 100644
index 00000000000000..3c637290cbdbe3
--- /dev/null
+++ b/llvm/lib/Target/AMDGPU/AMDGPUAnnotateVaryingBranchWeights.cpp
@@ -0,0 +1,201 @@
+//===- AMDGPUAnnotateVaryingBranchWeights.cpp -----------------------------===//
+//
+// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
+// See https://llvm.org/LICENSE.txt for license information.
+// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
+//
+//===----------------------------------------------------------------------===//
+// Estimate if conditional branches for which SIAnnotateControlFlow introduced
+// amdgcn_if or amdgcn_else intrinsics are likely to have different outcomes for
+// the threads of each wavefront. If that is the case, BranchWeight metadata is
+// added to signal that "then" and "else" blocks are both likely to be executed.
+// This may introduce branch weights that would be self-contradictory in a
+// non-SIMT setting.
+//
+// A consequence of this is that SIPreEmitPeephole is more likely to eliminate
+// s_cbranch_execz instructions that were introduced to skip these blocks when
+// no thread in the wavefront is active for them.
+//
+// Should only run after SIAnnotateControlFlow.
+//===----------------------------------------------------------------------===//
+
+#include "AMDGPU.h"
+#include "AMDGPUTargetMachine.h"
+#include "GCNSubtarget.h"
+#include "llvm/Analysis/LazyValueInfo.h"
+#include "llvm/Analysis/TargetTransformInfo.h"
+#include "llvm/Analysis/ValueTracking.h"
+#include "llvm/CodeGen/TargetPassConfig.h"
+#include "llvm/IR/Analysis.h"
+#include "llvm/IR/Instructions.h"
+#include "llvm/IR/IntrinsicsAMDGPU.h"
+#include "llvm/IR/PatternMatch.h"
+#include "llvm/IR/ProfDataUtils.h"
+#include "llvm/InitializePasses.h"
+#include "llvm/Support/Casting.h"
+#include "llvm/Target/TargetMachine.h"
+
+using namespace llvm;
+
+#define DEBUG_TYPE "amdgpu-annotate-varying-branch-weights"
+
+namespace {
+
+class AMDGPUAnnotateVaryingBranchWeightsImpl {
+public:
+  AMDGPUAnnotateVaryingBranchWeightsImpl() = delete;
+  AMDGPUAnnotateVaryingBranchWeightsImpl(const GCNSubtarget &ST,
+                                         const TargetTransformInfo &TTI)
+      : ST(ST), TTI(TTI) {
+    // Determine weights that signal that a branch is very likely to be
+    // predicted correctly, i.e., whose ratio exceeds
+    // TTI.getPredictableBranchThreshold().
+    auto BranchProbThreshold = TTI.getPredictableBranchThreshold();
+    LikelyWeight = BranchProbThreshold.getNumerator();
+    UnlikelyWeight = BranchProbThreshold.getDenominator() - LikelyWeight;
+    if (UnlikelyWeight > 0)
+      --UnlikelyWeight;
+  }
+
+  bool run(Function &F);
+
+private:
+  const GCNSubtarget &ST;
+  const TargetTransformInfo &TTI;
+  uint32_t LikelyWeight;
+  uint32_t UnlikelyWeight;
+  ValueMap<const Value *, bool> LikelyVaryingCache;
+
+  /// Heuristically check if it is likely that a wavefront has dynamically
+  /// varying values for V.
+  bool isLikelyVarying(const Value *V);
+
+  /// Set branch weights that signal that the "true" successor of Term is the
+  /// likely destination, if no prior weights are present.
+  /// Return true if weights were set.
+  bool setTrueSuccessorLikely(BranchInst *Term);
+};
+
+class AMDGPUAnnotateVaryingBranchWeightsLegacy : public FunctionPass {
+public:
+  static char ID;
+  AMDGPUAnnotateVaryingBranchWeightsLegacy() : FunctionPass(ID) {
+    initializeAMDGPUAnnotateVaryingBranchWeightsLegacyPass(
+        *PassRegistry::getPassRegistry());
+  }
+
+  StringRef getPassName() const override {
+    return "AMDGPU Annotate Varying Branch Weights";
+  }
+
+  void getAnalysisUsage(AnalysisUsage &AU) const override {
+    AU.addRequired<TargetPassConfig>();
+    AU.setPreservesCFG();
+  }
+
+  bool runOnFunction(Function &F) override {
+    TargetPassConfig &TPC = getAnalysis<TargetPassConfig>();
+    const TargetMachine &TM = TPC.getTM<TargetMachine>();
+    const GCNSubtarget &ST = TM.getSubtarget<GCNSubtarget>(F);
+    const TargetTransformInfo &TTI = TM.getTargetTransformInfo(F);
+    return AMDGPUAnnotateVaryingBranchWeightsImpl(ST, TTI).run(F);
+  }
+};
+
+} // end anonymous namespace
+
+char AMDGPUAnnotateVaryingBranchWeightsLegacy::ID = 0;
+
+INITIALIZE_PASS_BEGIN(AMDGPUAnnotateVaryingBranchWeightsLegacy, DEBUG_TYPE,
+                      "Annotate Varying Branch Weights", false, false)
+INITIALIZE_PASS_END(AMDGPUAnnotateVaryingBranchWeightsLegacy, DEBUG_TYPE,
+                    "Annotate Varying Branch Weights", false, false)
+
+FunctionPass *llvm::createAMDGPUAnnotateVaryingBranchWeightsLegacyPass() {
+  return new AMDGPUAnnotateVaryingBranchWeightsLegacy();
+}
+
+PreservedAnalyses
+AMDGPUAnnotateVaryingBranchWeightsPass::run(Function &F,
+                                            FunctionAnalysisManager &AM) {
+  const GCNSubtarget &ST = TM.getSubtarget<GCNSubtarget>(F);
+  const TargetTransformInfo &TTI = TM.getTargetTransformInfo(F);
+  bool Changed = AMDGPUAnnotateVaryingBranchWeightsImpl(ST, TTI).run(F);
+
+  if (!Changed)
+    return PreservedAnalyses::all();
+
+  PreservedAnalyses PA;
+  PA.preserveSet<CFGAnalyses>();
+  return PA;
+}
+
+bool AMDGPUAnnotateVaryingBranchWeightsImpl::isLikelyVarying(const Value *V) {
+  // Check if V is a source of divergence or if it transitively uses one.
+  if (TTI.isSourceOfDivergence(V))
+    return true;
+
+  auto *U = dyn_cast<User>(V);
+  if (!U)
+    return false;
+
+  // Have we already checked V?
+  auto CacheEntry = LikelyVaryingCache.find(V);
+  if (CacheEntry != LikelyVaryingCache.end())
+    return CacheEntry->second;
+
+  // Does it use a likely varying Value?
+  bool Result = false;
+  for (const auto &Use : U->operands()) {
+    Result |= isLikelyVarying(Use);
+    if (Result)
+      break;
+  }
+
+  LikelyVaryingCache.insert({V, Result});
+  return Result;
+}
+
+bool AMDGPUAnnotateVaryingBranchWeightsImpl::setTrueSuccessorLikely(
+    BranchInst *Term) {
+  assert(Term->isConditional());
+
+  // Don't overwrite existing branch weights.
+  if (hasProfMD(*Term))
+    return false;
+
+  llvm::setBranchWeights(*Term, {LikelyWeight, UnlikelyWeight}, false);
+  LLVM_DEBUG(dbgs() << "Added branch weights: " << *Term << '\n');
+  return true;
+}
+
+bool AMDGPUAnnotateVaryingBranchWeightsImpl::run(Function &F) {
+  // If the workgroup has only a single thread, the condition cannot vary.
+  const auto WGSizes = ST.getFlatWorkGroupSizes(F);
+  if (WGSizes.first <= 1)
+    return false;
+
+  using namespace PatternMatch;
+
+  bool Changed = false;
+  for (auto &BB : F) {
+    auto *Term = BB.getTerminator();
+    // Look for conditional branches whose condition is an ExtractValueInst
+    // that extracts the return value of a call to the amdgcn_if or amdgcn_else
+    // intrinsic.
+    if (match(Term, m_Br(m_ExtractValue<0>(m_CombineOr(
+                             m_Intrinsic<Intrinsic::amdgcn_if>(),
+                             m_Intrinsic<Intrinsic::amdgcn_else>())),
+                         m_Value(), m_Value()))) {
+      // This condition is an artificial value resulting from the control
+      // flow intrinsic, not the actual branch condition. However, the
+      // intrinsics connect it via data flow with the actual condition
+      // (even for the amdgcn_else intrinsic, via the matching amdgcn_if
+      // intrinsic), so isLikelyVarying still produces meaningful results.
+      if (isLikelyVarying(cast<BranchInst>(Term)->getCondition()))
+        Changed |= setTrueSuccessorLikely(cast<BranchInst>(Term));
+    }
+  }
+
+  return Changed;
+}
diff --git a/llvm/lib/Target/AMDGPU/AMDGPUPassRegistry.def b/llvm/lib/Target/AMDGPU/AMDGPUPassRegistry.def
index 174a90f0aa419d..63a7b0a50c4455 100644
--- a/llvm/lib/Target/AMDGPU/AMDGPUPassRegistry.def
+++ b/llvm/lib/Target/AMDGPU/AMDGPUPassRegistry.def
@@ -65,6 +65,9 @@ FUNCTION_PASS("amdgpu-unify-divergent-exit-nodes",
               AMDGPUUnifyDivergentExitNodesPass())
 FUNCTION_PASS("amdgpu-usenative", AMDGPUUseNativeCallsPass())
 FUNCTION_PASS("si-annotate-control-flow", SIAnnotateControlFlowPass(*static_cast<const GCNTargetMachine *>(this)))
+FUNCTION_PASS("amdgpu-annotate-varying-branch-weights",
+              AMDGPUAnnotateVaryingBranchWeightsPass(
+                  *static_cast<const GCNTargetMachine *>(this)))
 #undef FUNCTION_PASS
 
 #ifndef FUNCTION_ANALYSIS
diff --git a/llvm/lib/Target/AMDGPU/AMDGPUTargetMachine.cpp b/llvm/lib/Target/AMDGPU/AMDGPUTargetMachine.cpp
index da18f2b20f1427..325db6ca9b7ace 100644
--- a/llvm/lib/Target/AMDGPU/AMDGPUTargetMachine.cpp
+++ b/llvm/lib/Target/AMDGPU/AMDGPUTargetMachine.cpp
@@ -501,6 +501,7 @@ extern "C" LLVM_EXTERNAL_VISIBILITY void LLVMInitializeAMDGPUTarget() {
   initializeAMDGPUReserveWWMRegsPass(*PR);
   initializeAMDGPURewriteOutArgumentsPass(*PR);
   initializeAMDGPURewriteUndefForPHILegacyPass(*PR);
+  initializeAMDGPUAnnotateVaryingBranchWeightsLegacyPass(*PR);
   initializeAMDGPUUnifyMetadataPass(*PR);
   initializeSIAnnotateControlFlowLegacyPass(*PR);
   initializeAMDGPUInsertDelayAluPass(*PR);
@@ -1315,6 +1316,7 @@ bool GCNPassConfig::addPreISel() {
   // analysis. This depends on stopping SIAnnotateControlFlow from making
   // control flow modifications.
   addPass(createAMDGPURewriteUndefForPHILegacyPass());
+  addPass(createAMDGPUAnnotateVaryingBranchWeightsLegacyPass());
 
   addPass(createLCSSAPass());
 
@@ -2003,6 +2005,8 @@ void AMDGPUCodeGenPassBuilder::addPreISel(AddIRPass &addPass) const {
   // control flow modifications.
   addPass(AMDGPURewriteUndefForPHIPass());
 
+  addPass(AMDGPUAnnotateVaryingBranchWeightsPass(TM));
+
   addPass(LCSSAPass());
 
   if (TM.getOptLevel() > CodeGenOptLevel::Less)
diff --git a/llvm/lib/Target/AMDGPU/CMakeLists.txt b/llvm/lib/Target/AMDGPU/CMakeLists.txt
index fed29c3e14aae2..7e85770c70d5ff 100644
--- a/llvm/lib/Target/AMDGPU/CMakeLists.txt
+++ b/llvm/lib/Target/AMDGPU/CMakeLists.txt
@@ -54,6 +54,7 @@ add_llvm_target(AMDGPUCodeGen
   AMDGPUCodeGenPrepare.cpp
   AMDGPUCombinerHelper.cpp
   AMDGPUCtorDtorLowering.cpp
+  AMDGPUAnnotateVaryingBranchWeights.cpp
   AMDGPUExportClustering.cpp
   AMDGPUFrameLowering.cpp
   AMDGPUGlobalISelDivergenceLowering.cpp
diff --git a/llvm/test/CodeGen/AMDGPU/conditional-mem-no-cbranch-execz.ll b/llvm/test/CodeGen/AMDGPU/conditional-mem-no-cbranch-execz.ll
new file mode 100644
index 00000000000000..f5f4c0a12eaeee
--- /dev/null
+++ b/llvm/test/CodeGen/AMDGPU/conditional-mem-no-cbranch-execz.ll
@@ -0,0 +1,66 @@
+; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py UTC_ARGS: --version 5
+; RUN: llc -mtriple=amdgcn-amd-amdhsa -mcpu=gfx942 -mattr=+sramecc,-xnack < %s | FileCheck %s
+
+; Check that simple conditional memory accesses that are guarded by likely
+; varying conditions are not lowered with an s_cbranch_execz to bypass them.
+
+declare i32 @llvm.amdgcn.workitem.id.x()
+declare i32 @llvm.amdgcn.workitem.id.y()
+
+define amdgpu_kernel void @cond_ops(ptr addrspace(1) inreg %x, ptr addrspace(1) inreg %y) #0 !reqd_work_group_size !0 {
+; CHECK-LABEL: cond_ops:
+; CHECK:         s_trap 2 ; Kernarg preload header. Trap with incompatible firmware that doesn't support preloading kernel arguments.
+; CHECK-NEXT:    .fill 63, 4, 0xbf800000 ; s_nop 0
+; CHECK-NEXT:  ; %bb.0: ; %entry
+; CHECK-NEXT:    v_and_b32_e32 v1, 0x3ff, v0
+; CHECK-NEXT:    v_bfe_u32 v0, v0, 10, 10
+; CHECK-NEXT:    v_lshl_or_b32 v5, v0, 6, v1
+; CHECK-NEXT:    v_lshrrev_b32_e32 v0, 4, v5
+; CHECK-NEXT:    v_cmp_gt_u32_e32 vcc, 15, v0
+; CHECK-NEXT:    v_mov_b32_e32 v0, 0
+; CHECK-NEXT:    v_lshlrev_b32_e32 v4, 4, v5
+; CHECK-NEXT:    v_mov_b32_e32 v1, 0
+; CHECK-NEXT:    v_mov_b32_e32 v2, 0
+; CHECK-NEXT:    v_mov_b32_e32 v3, 0
+; CHECK-NEXT:    s_and_saveexec_b64 s[0:1], vcc
+; CHECK-NEXT:  ; %bb.1: ; %do.load
+; CHECK-NEXT:    global_load_dwordx4 v[0:3], v4, s[8:9]
+; CHECK-NEXT:  ; %bb.2: ; %post.load
+; CHECK-NEXT:    s_or_b64 exec, exec, s[0:1]
+; CHECK-NEXT:    v_and_b32_e32 v5, 15, v5
+; CHECK-NEXT:    v_cmp_gt_u32_e32 vcc, 15, v5
+; CHECK-NEXT:    s_and_saveexec_b64 s[0:1], vcc
+; CHECK-NEXT:    s_cbranch_execz .LBB0_4
+; CHECK-NEXT:  ; %bb.3: ; %do.store
+; CHECK-NEXT:    s_waitcnt vmcnt(0)
+; CHECK-NEXT:    global_store_dwordx4 v4, v[0:3], s[10:11]
+; CHECK-NEXT:  .LBB0_4: ; %exit
+; CHECK-NEXT:    s_endpgm
+entry:
+  %tid.x = tail call range(i32 0, 64) i32 @llvm.amdgcn.workitem.id.x()
+  %tid.y = tail call range(i32 0, 4) i32 @llvm.amdgcn.workitem.id.y()
+  %tid.y.shift = shl nuw nsw i32 %tid.y, 6
+  %tid = or disjoint i32 %tid.x, %tid.y.shift
+  %k = lshr i32 %tid, 4
+  %j = and i32 %tid, 15
+  %load.cond = icmp ult i32 %k, 15
+  %tid.ext = zext nneg i32 %tid to i64
+  %my.x = getelementptr <4 x float>, ptr addrspace(1) %x, i64 %tid.ext
+  br i1 %load.cond, label %do.load, label %post.load
+do.load:
+  %loaded = load <4 x float>, ptr addrspace(1) %my.x
+  br label %post.load
+post.load:
+  %maybe.loaded = phi <4 x float> [ %loaded, %do.load ], [ zeroinitializer, %entry ]
+  %my.y = getelementptr <4 x float>, ptr addrspace(1) %y, i64 %tid.ext
+  %store.cond = icmp ult i32 %j, 15
+  br i1 %store.cond, label %do.store, label %exit
+do.store:
+  store <4 x float> %maybe.loaded, ptr addrspace(1) %my.y
+  br label %exit
+exit:
+  ret void
+}
+
+attributes #0 = {"uniform-work-group-size"="true" "amdgpu-flat-work-group-size"="256,256"}
+!0 = !{i32 64, i32 4, i32 1}
diff --git a/llvm/test/CodeGen/AMDGPU/llc-pipeline.ll b/llvm/test/CodeGen/AMDGPU/llc-pipeline.ll
index e77f4f69e265bb..65bb9a9652b17e 100644
--- a/llvm/test/CodeGen/AMDGPU/llc-pipeline.ll
+++ b/llvm/test/CodeGen/AMDGPU/llc-pipeline.ll
@@ -87,6 +87,7 @@
 ; GCN-O0-NEXT:        Cycle Info Analysis
 ; GCN-O0-NEXT:        Uniformity Analysis
 ; GCN-O0-NEXT:        AMDGPU Rewrite Undef for PHI
+; GCN-O0-NEXT:        AMDGPU Annotate Varying Branch Weights
 ; GCN-O0-NEXT:        LCSSA Verifier
 ; GCN-O0-NEXT:        Loop-Closed SSA Form Pass
 ; GCN-O0-NEXT:      DummyCGSCCPass
@@ -279,6 +280,7 @@
 ; GCN-O1-NEXT:        Cycle Info Analysis
 ; GCN-O1-NEXT:        Uniformity Analysis
 ; GCN-O1-NEXT:        AMDGPU Rewrite Undef for PHI
+; GCN-O1-NEXT:        AMDGPU Annotate Varying Branch Weights
 ; GCN-O1-NEXT:        LCSSA Verifier
 ; GCN-O1-NEXT:        Loop-Closed SSA Form Pass
 ; GCN-O1-NEXT:      DummyCGSCCPass
@@ -579,6 +581,7 @@
 ; GCN-O1-OPTS-NEXT:        Cycle Info Analysis
 ; GCN-O1-OPTS-NEXT:        Uniformity Analysis
 ; GCN-O1-OPTS-NEXT:        AMDGPU Rewrite Undef for PHI
+; GCN-O1-OPTS-NEXT:        AMDGPU Annotate Varying Branch Weights
 ; GCN-O1-OPTS-NEXT:        LCSSA Verifier
 ; GCN-O1-OPTS-NEXT:        Loop-Closed SSA Form Pass
 ; GCN-O1-OPTS-NEXT:      DummyCGSCCPass
@@ -891,6 +894,7 @@
 ; GCN-O2-NEXT:        Cycle Info Analysis
 ; GCN-O2-NEXT:        Uniformity Analysis
 ; GCN-O2-NEXT:        AMDGPU Rewrite Undef for PHI
+; GCN-O2-NEXT:        AMDGPU Annotate Varying Branch Weights
 ; GCN-O2-NEXT:        LCSSA Verifier
 ; GCN-O2-NEXT:        Loop-Closed SSA Form Pass
 ; GCN-O2-NEXT:      Analysis if a function is memory bound
@@ -1218,6 +1222,7 @@
 ; GCN-O3-NEXT:        Cycle Info Analysis
 ; GCN-O3-NEXT:        Uniformity Analysis
 ; GCN-O3-NEXT:        AMDGPU Rewrite Undef for PHI
+; GCN-O3-NEXT:        AMDGPU Annotate Varying Branch Weights
 ; GCN-O3-NEXT:        LCSSA Verifier
 ; GCN-O3-NEXT:        Loop-Closed SSA Form Pass
 ; GCN-O3-NEXT:      Analysis if a function is memory bound
diff --git a/llvm/test/CodeGen/AMDGPU/machine-sink-temporal-divergence-swdev407790.ll b/llvm/test/CodeGen/AMDGPU/machine-sink-temporal-divergence-swdev407790.ll
index c826980991f94f..5db6dd8d64d283 100644
--- a/llvm/test/CodeGen/AMDGPU/machine-sink-temporal-divergence-swdev407790.ll
+++ b/llvm/test/CodeGen/AMDGPU/machine-sink-temporal-divergence-swdev407790.ll
@@ -146,8 +146,8 @@ define protected amdgpu_kernel void @kernel_round1(ptr addrspace(1) nocapture no
 ; CHECK-NEXT:    v_mov_b32_e32 v47, 0
 ; CHECK-NEXT:    s_mov_b32 s47, 0
 ; CHECK-NEXT:  .LBB0_5: ; =>This Loop Header: Depth=1
-; CHECK-NEXT:    ; Child Loop BB0_8 Depth 2
-; CHECK-NEXT:    ; Child Loop BB0_20 Depth 2
+; CHECK-NEXT:    ; Child Loop BB0_7 Depth 2
+; CHECK-NEXT:    ; Child Loop BB0_19 Depth 2
 ; CHECK-NEXT:    v_add_nc_u32_e32 v0, s47, v44
 ; CHECK-NEXT:    s_lshl_b32 s4, s47, 5
 ; CHECK-NEXT:    s_add_i32 s46, s47, 1
@@ -163,20 +163,7 @@ define protected amdgpu_kernel void @kernel_round1(ptr addrspace(1) nocapture no
 ; CHECK-NEXT:    ; in Loop: Header=BB0_5 Depth=1
 ; CHECK-NEXT:    s_mov_b32 s53, 0
 ; CHECK-NEXT:    s_mov_b32 s56, 0
-; CHECK-NEXT:    s_branch .LBB0_8
-; CHECK-NEXT:  .LBB0_7: ; in Loop: Header=BB0_8 Depth=2
-; CHECK-NEXT:    s_or_b32 exec_lo, exec_lo, s57
-; CHECK-NEXT:    s_add_i32 s56, s56, 4
-; CHECK-NEXT:    s_add_i32 s4, s47, s56
-; CHECK-NEXT:    v_add_nc_u32_e32 v0, s56, v57
-; CHECK-NEXT:    s_add_i32 s5, s4, 5
-; CHECK-NEXT:    s_add_i32 s4, s4, 1
-; CHECK-NEXT:    v_cmp_ge_u32_e32 vcc_lo, s5, v42
-; CHECK-NEXT:    v_mov_b32_e32 v58, s4
-; CHECK-NEXT:    s_or_b32 s53, vcc_lo, s53
-; CHECK-NEXT:    s_andn2_b32 exec_lo, exec_lo, s53
-; CHECK-NEXT:    s_cbranch_execz .LBB0_16
-; CHECK-NEXT:  .LBB0_8: ; Parent Loop BB0_5 Depth=1
+; CHECK-NEXT:  .LBB0_7: ; Parent Loop BB0_5 Depth=1
 ; CHECK-NEXT:    ; => This Inner Loop Header: Depth=2
 ; CHECK-NEXT:    v_add_nc_u32_e32 v59, s56, v46
 ; CHECK-NEXT:    v_add_nc_u32_e32 v58, s56, v57
@@ -184,8 +171,8 @@ define protected amdgpu_kernel void @kernel_round1(ptr addrspace(1) nocapture no
 ; CHECK-NEXT:    s_waitcnt lgkmcnt(0)
 ; CHECK-NEXT:    v_cmp_eq_u16_sdwa s4, v56, v0 src0_sel:BYTE_0 src1_sel:DWORD
 ; CHECK-NEXT:    s_and_saveexec_b32 s57, s4
-; CHECK-NEXT:    s_cbranch_execz .LBB0_10
-; CHECK-NEXT:  ; %bb.9: ; in Loop: Header=BB0_8 Depth=2
+; CHECK-NEXT:    s_cbranch_execz .LBB0_9
+; CHECK-NEXT:  ; %bb.8: ; in Loop: Header=BB0_7 Depth=2
 ; CHECK-NEXT:    v_mov_b32_e32 v31, v40
 ; CHECK-NEXT:    v_mov_b32_e32 v0, 0x3c00
 ; CHECK-NEXT:    s_add_u32 s8, s34, 40
@@ -203,14 +190,14 @@ define protected amdgpu_kernel void @kernel_round1(ptr addrspace(1) nocapture no
 ; CHECK-NEXT:    s_swappc_b64 s[30:31], s[16:17]
 ; CHECK-NEXT:    v_lshlrev_b32_e32 v0, 2, v0
 ; CHECK-NEXT:    ds_write_b32 v0, v58
-; CHECK-NEXT:  .LBB0_10: ; in Loop: Header=BB0_8 Depth=2
+; CHECK-NEXT:  .LBB0_9: ; in Loop: Header=BB0_7 Depth=2
 ; CHECK-NEXT:    s_or_b32 exec_lo, exec_lo, s57
 ; CHECK-NEXT:    ds_read_u8 v0, v59 offset:1
 ; CHECK-NEXT:    s_waitcnt lgkmcnt(0)
 ; CHECK-NEXT:    v_cmp_eq_u16_sdwa s4, v56, v0 src0_sel:BYTE_0 src1_sel:DWORD
 ; CHECK-NEXT:    s_and_saveexec_b32 s57, s4
-; CHECK-NEXT:    s_cbranch_execz .LBB0_12
-; CHECK-NEXT:  ; %bb.11: ; in Loop: Header=BB0_8 Depth=2
+; CHECK-NEXT:    s_cbranch_execz .LBB0_11
+; CHECK-NEXT:  ; %bb.10: ; in Loop: Header=BB0_7 Depth=2
 ; CHECK-NEXT:    v_mov_b32_e32 v31, v40
 ; CHECK-NEXT:    v_mov_b32_e32 v0, 0x3c00
 ; CHECK-NEXT:    s_add_u32 s8, s34, 40
@@ -229,14 +216,14 @@ define protected amdgpu_kernel void @kernel_round1(ptr addrspace(1) nocapture no
 ; CHECK-NEXT:    s_swappc_...
[truncated]

// intrinsics connect it via data flow with the actual condition
// (even for the amdgcn_else intrinsic, via the matching amdgcn_if
// intrinsic), so isLikelyVarying still produces meaningful results.
if (isLikelyVarying(cast<BranchInst>(Term)->getCondition()))
Contributor

I don't understand the big picture here. isLikelyVarying just seems to be reimplementing divergence analysis by checking for things that either are or use sources of divergence. But don't we already know that the branch is divergent, because SIAnnotateControlFlow only inserts intrinsics like amdgcn_if in divergent control flow?

Member Author

The goal is to use a heuristic there to determine when branches are likely to be varying, for instance when they are based on (the lower bits of) the thread id. The current implementation here is very likely not the heuristic we want; at this point, the purpose of this PR is to gather initial feedback on where and how the heuristic is used before tuning it.

Contributor

OK. Then I guess I am just pointing out that your current heuristic could be simplified to return true :)

Member Author

Well, technically not quite, right? Uniformity analysis does more than just follow data dependences. Via control dependences (e.g. as a phi at the control flow join of a non-uniform branch), a Value could be non-uniform but this heuristic wouldn't trigger (but we'd probably want to be more restrictive than that as well).
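For illustration, a minimal IR sketch (not from the patch) of such a case: the phi's incoming values are constants, so the purely data-flow walk in isLikelyVarying never reaches a source of divergence, even though uniformity analysis would mark %r divergent because of the divergent branch that controls which value each lane gets.

define i32 @control_dependent_divergence() {
entry:
  %tid = call i32 @llvm.amdgcn.workitem.id.x()
  %cond = icmp ult i32 %tid, 16          ; divergent condition
  br i1 %cond, label %a, label %b
a:
  br label %join
b:
  br label %join
join:
  ; Uniform (constant) operands, but a divergent result.
  %r = phi i32 [ 0, %a ], [ 1, %b ]
  ret i32 %r
}

declare i32 @llvm.amdgcn.workitem.id.x()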

jayfoad (Contributor) commented Nov 25, 2024

I still don't really understand...

Are you trying to identify branches whose condition is divergent according to UA, but is likely to be uniform in practice?

How can branch weights tell you whether the branch is likely uniform or not? I guess if it's likely uniform-and-true or uniform-and-false then they can tell you that.

For what kind of branches do you want to influence the exec-skipping logic? Can you give an example?

arsenm (Contributor) commented Nov 25, 2024

Are you trying to identify branches whose condition is divergent according to UA, but is likely to be uniform in practice?

I think it's the opposite. It's divergent, and known to never be uniform in practice.

ritter-x2a (Member, Author)

I still don't really understand...

Are you trying to identify branches whose condition is divergent according to UA, but is likely to be uniform in practice?

How can branch weights tell you whether the branch is likely uniform or not? I guess if it's likely uniform-and-true or uniform-and-false then they can tell you that.

For what kind of branches do you want to influence the exec-skipping logic? Can you give an example?

There is an example in the added llvm/test/CodeGen/AMDGPU/conditional-mem-no-cbranch-execz.ll test (see also SWDEV-483228 for more discussion). When every wavefront diverges at a conditional, the execz branches that skip the then/else cases are unnecessary and harm performance. The goal here is to remove the execz branches when that is likely the case, since the performance gains are probably not worth the effort of implementing a full, sound "must-not-be-uniform" analysis for this.

ritter-x2a (Member, Author)

How can branch weights tell you whether the branch is likely uniform or not? I guess if it's likely uniform-and-true or uniform-and-false then they can tell you that.

This patch doesn't use branch weights to find anything out; it sets them to signal (to the later SIPreEmitPeephole pass) that both the 'then' and 'else' cases are likely to be executed, i.e., that it should remove the execz branches if possible.

arsenm (Contributor) left a comment

I feel like this deserves a separate representation from branch frequencies, but I'm not sure what it would be. And it would involve more threading work to get the information to the late backend pass.

bool AMDGPUAnnotateVaryingBranchWeightsImpl::run(Function &F) {
// If the workgroup has only a single thread, the condition cannot vary.
const auto WGSizes = ST.getFlatWorkGroupSizes(F);
if (WGSizes.first <= 1)
Contributor

Add test for this?

Contributor

I also suspect you want the product here, not just checking the first term

Member Author

@krzysz00 From what I understand, ST.getFlatWorkGroupSizes() (based on the amdgpu-flat-work-group-size attribute) returns a pair of minimum and maximum numbers of work items (flattened across the dimensions), so .first is already the minimum number of total work items in the workgroup.
I'm currently working on a patch that refines the heuristic and also adds a similar check here based on the product of the reqd_work_group_size dimensions.

Contributor

Can you use ST.isSingleLaneExecution instead?

Contributor

Ah, I misremembered which function you were calling

ritter-x2a (Member, Author) commented Nov 28, 2024

Can you use ST.isSingleLaneExecution instead?

Returning early if ST.isSingleLaneExecution(F) is true would mean annotating more cases: The intention of the current version was to annotate only if it's known that single lane execution cannot happen for F. The changed variant would additionally annotate cases where we don't know if single lane execution could happen.
I'd be open to that change; it seems likely that kernels without explicit work group bounds and for which performance matters use more than a single lane.

@@ -0,0 +1,66 @@
; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py UTC_ARGS: --version 5
; RUN: llc -mtriple=amdgcn-amd-amdhsa -mcpu=gfx942 -mattr=+sramecc,-xnack < %s | FileCheck %s

Contributor

Need some IR pass checks

Member Author

Added tests for the pass in annotate-likely-varying-branches.ll. I'll update them when changing to a different annotation mechanism instead of branch weights.

jayfoad (Contributor) commented Nov 26, 2024

I see. I think you really want to know "will this divergent condition be true for at least one lane", in which case the "execz" condition will not be satisfied so you can remove the s_cbranch_execz.

I agree that using normal branch weights for this feels wrong, since a condition like "laneid == 0" is mostly false (for most lanes) but it does satisfy the condition you're interested in.

I don't see how any heuristic based on sources of divergence can help you here. "Source of divergence" just means "not guaranteed to be uniform", and says nothing about whether the results in different lanes are actually likely to be different. I think perhaps you just need to pattern match specific cases like "laneid == constant" and "x == readfirstlane(x)".
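To make the "laneid == 0" point concrete, here is an illustrative sketch (not from the patch); computing the lane id via llvm.amdgcn.mbcnt.lo/hi is just one common way to write it. The condition is divergent but true for exactly one lane, so with full wavefronts exec is never zero at %then and the skip branch is redundant, even though plain branch weights would only say that %then is rarely taken.

define amdgpu_kernel void @lane_zero_only(ptr addrspace(1) %p) {
entry:
  %lo = call i32 @llvm.amdgcn.mbcnt.lo(i32 -1, i32 0)
  %lane = call i32 @llvm.amdgcn.mbcnt.hi(i32 -1, i32 %lo)
  %cond = icmp eq i32 %lane, 0           ; true for one lane per wavefront
  br i1 %cond, label %then, label %exit
then:
  store i32 1, ptr addrspace(1) %p
  br label %exit
exit:
  ret void
}

declare i32 @llvm.amdgcn.mbcnt.lo(i32, i32)
declare i32 @llvm.amdgcn.mbcnt.hi(i32, i32)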

ritter-x2a (Member, Author)

@jayfoad The idea is to be more optimistic than that, along the lines of "does the condition use bits that are varying in every wavefront". For instance, if required workgroup dimensions are known, we could consider only the workitem.id intrinsics for dimensions that vary within a wavefront, where the current PR checks for sources of divergence.

This means that execz branches that would be taken may be removed; the actual removal only happens if it doesn't affect correctness.

jayfoad (Contributor) commented Nov 26, 2024

Do you have realistic examples of cases where we do want the execz branch, and cases where we do not want the execz branch?

I am sceptical that you can effectively distinguish these categories with a generic analysis of the IR (but maybe I'm wrong). I think it's more likely that you would have to pattern-match some examples of one category, and assume that everything else is in the other category.

ritter-x2a (Member, Author)

@jayfoad
An example for where the execz branch should be eliminated would be a store in every even thread:

__global__ void foo(int *buf, size_t n) {
    unsigned tid = blockDim.x * blockIdx.x + threadIdx.x;
    if (tid % 2 == 0)
        buf[tid] = 0;
}

An example where the execz branch should not be eliminated would be a sequence of conditional accesses where the condition is loaded from memory (the current implementation does not respect that):

__global__ void bar(int *buf, size_t n, bool *lookup) {
    unsigned tid = blockDim.x * blockIdx.x + threadIdx.x;
    if (lookup[tid]) {
        buf[tid] = 0;
        // more memory accesses...
    }
}

Checking whether the branch condition is the result of bitwise and arithmetic operations that involve a workitem id (i.e., only continuing to check the operands of a User if it is such an instruction) would be a plausible heuristic for that, though it would benefit from benchmarking.
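For reference, a hypothetical, simplified IR rendering of the two conditions above (the blockDim/blockIdx math is omitted). A heuristic that only keeps walking through arithmetic and bitwise users of a workitem id would flag %cond.even but not %cond.lookup; the source-of-divergence walk in the current patch flags both, because the load's address also uses the thread id.

define void @condition_shapes(ptr addrspace(1) %lookup) {
  %tid = call i32 @llvm.amdgcn.workitem.id.x()

  ; foo: if (tid % 2 == 0) -- varies within every full wavefront
  %bit = and i32 %tid, 1
  %cond.even = icmp eq i32 %bit, 0

  ; bar: if (lookup[tid]) -- depends on loaded data, per-wave behavior unknown
  %gep = getelementptr i8, ptr addrspace(1) %lookup, i32 %tid
  %val = load i8, ptr addrspace(1) %gep
  %cond.lookup = icmp ne i8 %val, 0

  ; (branches on these conditions omitted for brevity)
  ret void
}

declare i32 @llvm.amdgcn.workitem.id.x()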

@krzysz00 might have more practical examples.

jayfoad (Contributor) commented Nov 26, 2024

An example for where the execz branch should be eliminated would be a store in every even thread:

__global__ void foo(int *buf, size_t n) {
    unsigned tid = blockDim.x * blockIdx.x + threadIdx.x;
    if (tid % 2 == 0)
        buf[tid] = 0;
}

Agreed.

An example where the execz branch should not be eliminated would be a sequence of conditional accesses where the condition is loaded from memory (the current implementation does not respect that):

__global__ void bar(int *buf, size_t n, bool *lookup) {
    unsigned tid = blockDim.x * blockIdx.x + threadIdx.x;
    if (lookup[tid]) {
        buf[tid] = 0;
        // more memory accesses...
    }
}

Not so sure about this. The compiler has no information about the contents of lookup[tid], so no idea how likely it is that the condition will be false for every lane, correct?

ritter-x2a (Member, Author)

Not so sure about this. The compiler has no information about the contents of lookup[tid], so no idea how likely it is that the condition will be false for every lane, correct?

Agreed. That's why I suggest keeping the handling for this case as it is in trunk and not eliminating the execz branch for it.

First round of reviewer feedback: do not match intrinsics to identify candidate branches, preserve alphabetic order of the pass registry, rename TTI->GCNTTI, update test
Make the heuristic more strict, add IR tests
ritter-x2a (Member, Author)

@jayfoad @arsenm I made the heuristic more selective in 8cf0cbf; it can now distinguish the example cases I mentioned above.

I haven't found a good alternative to using the branch weights yet since, AFAIK, normal IR metadata does not find its way to the backend. We could add a boolean isLikelyVarying parameter to the amdgcn_if and amdgcn_else intrinsics (in that case, I think moving the functionality from this pass into SIAnnotateControlFlow would make more sense).

krzysz00 (Contributor) left a comment

This feels like what I had in mind for the heuristic, but it seems like there's uncertainty there

//===----------------------------------------------------------------------===//
// Estimate if conditional branches for which SIAnnotateControlFlow introduced
// amdgcn_if or amdgcn_else intrinsics are likely to have different outcomes for
// the threads of each wavefront. If that is the case, BranchWeight metadata is
Contributor

Do we not usually call them lanes?

Member Author

Adjusted in 086ac46.

ritter-x2a (Member, Author)

This feels like what I had in mind for the heuristic, but it seems like there's uncertainty there

Thanks for the confirmation! I'm still working on getting performance numbers for a variant of this heuristic; I think the next steps on this PR depend on those numbers.

ritter-x2a added a commit to ritter-x2a/llvm-project that referenced this pull request Jan 21, 2025
Currently, we introduce branches to skip conditionally executed
instructions if the EXEC mask is zero and only eliminate them if the
scheduling model says that executing the skipped instructions is cheaper
than taking the branch instruction.

This patch adds a heuristic to SIAnnotateControlFlow to determine if the
lanes of a wavefront are likely to have dynamically varying values for
the branch condition.
This information is passed through new arguments/operands of the
amdgcn.if/else intrinsics and the SI_IF/ELSE pseudo instructions to
SILowerControlFlow, where the execz branch is inserted with
corresponding branch probabilities.
This causes SIPreEmitPeephole to eliminate the corresponding execz
branch if it is legal to do so.

This is an alternative to PR llvm#117567, using a simpler heuristic and
passing the LikelyVarying information through new arguments for the
amdgcn.if/else intrinsics and the SI_IF/ELSE pseudo instructions instead
of abusing branch weight metadata.

Most test changes are caused by the new arguments for the amdgcn.if/else
intrinsics and the SI_IF/ELSE pseudo instructions; the LikelyVarying
argument is set to false/0 in these existing tests.
New tests for the functionality are in
conditional-mem-no-cbranch-execz.ll and
annotate-likely-varying-branches.ll.

For SWDEV-483228.
ritter-x2a added a commit to ritter-x2a/llvm-project that referenced this pull request Jan 27, 2025
ritter-x2a (Member, Author)

Closing this one in favor of #123749.

ritter-x2a closed this Feb 6, 2025