[AMDGPU] Eliminate likely-spurious execz checks #117567
Conversation
Currently, we introduce branches to skip conditionally executed instructions if the EXEC mask is zero and only eliminate them if the scheduling model says that executing the skipped instructions is cheaper than taking the branch instruction.

This patch adds branch weights to branches where the threads of a wavefront are likely (according to a heuristic) to have dynamically varying values for the branch condition. This causes SIPreEmitPeephole to eliminate the corresponding execz branch.

Currently, this is implemented as a new middle end pass with a rather optimistic heuristic, to gather initial feedback.
@llvm/pr-subscribers-backend-amdgpu

Author: Fabian Ritter (ritter-x2a)

Changes

Currently, we introduce branches to skip conditionally executed instructions if the EXEC mask is zero and only eliminate them if the scheduling model says that executing the skipped instructions is cheaper than taking the branch instruction. This patch adds branch weights to branches where the threads of a wavefront are likely (according to a heuristic) to have dynamically varying values for the branch condition. This causes SIPreEmitPeephole to eliminate the corresponding execz branch. Currently, this is implemented as a new middle end pass with a rather optimistic heuristic, to gather initial feedback.

Patch is 33.21 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/117567.diff

9 Files Affected:
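For illustration, the IR shape this pass targets and the kind of metadata it attaches look roughly like the following minimal sketch (hypothetical, not taken from the patch: the function and value names are made up, the amdgcn.if signature shown assumes a wave64 target, and the weight values are placeholders for whatever TTI.getPredictableBranchThreshold() yields):

```llvm
declare { i1, i64 } @llvm.amdgcn.if.i64(i1)

define void @sketch(i1 %divergent.cond) {
entry:
  ; SIAnnotateControlFlow has already rewritten the divergent branch to go
  ; through amdgcn.if; the i1 result guards the "then" block.
  %if = call { i1, i64 } @llvm.amdgcn.if.i64(i1 %divergent.cond)
  %guard = extractvalue { i1, i64 } %if, 0
  ; The new pass attaches !prof weights marking the "then" side as likely,
  ; which later lets SIPreEmitPeephole drop the s_cbranch_execz skip.
  br i1 %guard, label %then, label %end, !prof !0

then:
  br label %end

end:
  ret void
}

!0 = !{!"branch_weights", i32 99, i32 0}
```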
diff --git a/llvm/lib/Target/AMDGPU/AMDGPU.h b/llvm/lib/Target/AMDGPU/AMDGPU.h
index 95d0ad0f9dc96a..e7914d1de4a8fb 100644
--- a/llvm/lib/Target/AMDGPU/AMDGPU.h
+++ b/llvm/lib/Target/AMDGPU/AMDGPU.h
@@ -380,6 +380,10 @@ FunctionPass *createAMDGPURewriteUndefForPHILegacyPass();
void initializeAMDGPURewriteUndefForPHILegacyPass(PassRegistry &);
extern char &AMDGPURewriteUndefForPHILegacyPassID;
+FunctionPass *createAMDGPUAnnotateVaryingBranchWeightsLegacyPass();
+void initializeAMDGPUAnnotateVaryingBranchWeightsLegacyPass(PassRegistry &);
+extern char &AMDGPUAnnotateVaryingBranchWeightsLegacyPassID;
+
class AMDGPURewriteUndefForPHIPass
: public PassInfoMixin<AMDGPURewriteUndefForPHIPass> {
public:
@@ -397,6 +401,17 @@ class SIAnnotateControlFlowPass
PreservedAnalyses run(Function &F, FunctionAnalysisManager &AM);
};
+class AMDGPUAnnotateVaryingBranchWeightsPass
+ : public PassInfoMixin<AMDGPUAnnotateVaryingBranchWeightsPass> {
+private:
+ const AMDGPUTargetMachine &TM;
+
+public:
+ AMDGPUAnnotateVaryingBranchWeightsPass(const AMDGPUTargetMachine &TM)
+ : TM(TM) {}
+ PreservedAnalyses run(Function &F, FunctionAnalysisManager &AM);
+};
+
void initializeSIAnnotateControlFlowLegacyPass(PassRegistry &);
extern char &SIAnnotateControlFlowLegacyPassID;
diff --git a/llvm/lib/Target/AMDGPU/AMDGPUAnnotateVaryingBranchWeights.cpp b/llvm/lib/Target/AMDGPU/AMDGPUAnnotateVaryingBranchWeights.cpp
new file mode 100644
index 00000000000000..3c637290cbdbe3
--- /dev/null
+++ b/llvm/lib/Target/AMDGPU/AMDGPUAnnotateVaryingBranchWeights.cpp
@@ -0,0 +1,201 @@
+//===- AMDGPUAnnotateVaryingBranchWeights.cpp -----------------------------===//
+//
+// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
+// See https://llvm.org/LICENSE.txt for license information.
+// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
+//
+//===----------------------------------------------------------------------===//
+// Estimate if conditional branches for which SIAnnotateControlFlow introduced
+// amdgcn_if or amdgcn_else intrinsics are likely to have different outcomes for
+// the threads of each wavefront. If that is the case, BranchWeight metadata is
+// added to signal that "then" and "else" blocks are both likely to be executed.
+// This may introduce branch weights that would be self-contradictory in a
+// non-SIMT setting.
+//
+// A consequence of this is that SIPreEmitPeephole is more likely to eliminate
+// s_cbranch_execz instructions that were introduced to skip these blocks when
+// no thread in the wavefront is active for them.
+//
+// Should only run after SIAnnotateControlFlow.
+//===----------------------------------------------------------------------===//
+
+#include "AMDGPU.h"
+#include "AMDGPUTargetMachine.h"
+#include "GCNSubtarget.h"
+#include "llvm/Analysis/LazyValueInfo.h"
+#include "llvm/Analysis/TargetTransformInfo.h"
+#include "llvm/Analysis/ValueTracking.h"
+#include "llvm/CodeGen/TargetPassConfig.h"
+#include "llvm/IR/Analysis.h"
+#include "llvm/IR/Instructions.h"
+#include "llvm/IR/IntrinsicsAMDGPU.h"
+#include "llvm/IR/PatternMatch.h"
+#include "llvm/IR/ProfDataUtils.h"
+#include "llvm/InitializePasses.h"
+#include "llvm/Support/Casting.h"
+#include "llvm/Target/TargetMachine.h"
+
+using namespace llvm;
+
+#define DEBUG_TYPE "amdgpu-annotate-varying-branch-weights"
+
+namespace {
+
+class AMDGPUAnnotateVaryingBranchWeightsImpl {
+public:
+ AMDGPUAnnotateVaryingBranchWeightsImpl() = delete;
+ AMDGPUAnnotateVaryingBranchWeightsImpl(const GCNSubtarget &ST,
+ const TargetTransformInfo &TTI)
+ : ST(ST), TTI(TTI) {
+ // Determine weights that signal that a branch is very likely to be
+ // predicted correctly, i.e., whose ratio exceeds
+ // TTI.getPredictableBranchThreshold().
+ auto BranchProbThreshold = TTI.getPredictableBranchThreshold();
+ LikelyWeight = BranchProbThreshold.getNumerator();
+ UnlikelyWeight = BranchProbThreshold.getDenominator() - LikelyWeight;
+ if (UnlikelyWeight > 0)
+ --UnlikelyWeight;
+ }
+
+ bool run(Function &F);
+
+private:
+ const GCNSubtarget &ST;
+ const TargetTransformInfo &TTI;
+ uint32_t LikelyWeight;
+ uint32_t UnlikelyWeight;
+ ValueMap<const Value *, bool> LikelyVaryingCache;
+
+ /// Heuristically check if it is likely that a wavefront has dynamically
+ /// varying values for V.
+ bool isLikelyVarying(const Value *V);
+
+ /// Set branch weights that signal that the "true" successor of Term is the
+ /// likely destination, if no prior weights are present.
+ /// Return true if weights were set.
+ bool setTrueSuccessorLikely(BranchInst *Term);
+};
+
+class AMDGPUAnnotateVaryingBranchWeightsLegacy : public FunctionPass {
+public:
+ static char ID;
+ AMDGPUAnnotateVaryingBranchWeightsLegacy() : FunctionPass(ID) {
+ initializeAMDGPUAnnotateVaryingBranchWeightsLegacyPass(
+ *PassRegistry::getPassRegistry());
+ }
+
+ StringRef getPassName() const override {
+ return "AMDGPU Annotate Varying Branch Weights";
+ }
+
+ void getAnalysisUsage(AnalysisUsage &AU) const override {
+ AU.addRequired<TargetPassConfig>();
+ AU.setPreservesCFG();
+ }
+
+ bool runOnFunction(Function &F) override {
+ TargetPassConfig &TPC = getAnalysis<TargetPassConfig>();
+ const TargetMachine &TM = TPC.getTM<TargetMachine>();
+ const GCNSubtarget &ST = TM.getSubtarget<GCNSubtarget>(F);
+ const TargetTransformInfo &TTI = TM.getTargetTransformInfo(F);
+ return AMDGPUAnnotateVaryingBranchWeightsImpl(ST, TTI).run(F);
+ }
+};
+
+} // end anonymous namespace
+
+char AMDGPUAnnotateVaryingBranchWeightsLegacy::ID = 0;
+
+INITIALIZE_PASS_BEGIN(AMDGPUAnnotateVaryingBranchWeightsLegacy, DEBUG_TYPE,
+ "Annotate Varying Branch Weights", false, false)
+INITIALIZE_PASS_END(AMDGPUAnnotateVaryingBranchWeightsLegacy, DEBUG_TYPE,
+ "Annotate Varying Branch Weights", false, false)
+
+FunctionPass *llvm::createAMDGPUAnnotateVaryingBranchWeightsLegacyPass() {
+ return new AMDGPUAnnotateVaryingBranchWeightsLegacy();
+}
+
+PreservedAnalyses
+AMDGPUAnnotateVaryingBranchWeightsPass::run(Function &F,
+ FunctionAnalysisManager &AM) {
+ const GCNSubtarget &ST = TM.getSubtarget<GCNSubtarget>(F);
+ const TargetTransformInfo &TTI = TM.getTargetTransformInfo(F);
+ bool Changed = AMDGPUAnnotateVaryingBranchWeightsImpl(ST, TTI).run(F);
+
+ if (!Changed)
+ return PreservedAnalyses::all();
+
+ PreservedAnalyses PA;
+ PA.preserveSet<CFGAnalyses>();
+ return PA;
+}
+
+bool AMDGPUAnnotateVaryingBranchWeightsImpl::isLikelyVarying(const Value *V) {
+ // Check if V is a source of divergence or if it transitively uses one.
+ if (TTI.isSourceOfDivergence(V))
+ return true;
+
+ auto *U = dyn_cast<User>(V);
+ if (!U)
+ return false;
+
+ // Have we already checked V?
+ auto CacheEntry = LikelyVaryingCache.find(V);
+ if (CacheEntry != LikelyVaryingCache.end())
+ return CacheEntry->second;
+
+ // Does it use a likely varying Value?
+ bool Result = false;
+ for (const auto &Use : U->operands()) {
+ Result |= isLikelyVarying(Use);
+ if (Result)
+ break;
+ }
+
+ LikelyVaryingCache.insert({V, Result});
+ return Result;
+}
+
+bool AMDGPUAnnotateVaryingBranchWeightsImpl::setTrueSuccessorLikely(
+ BranchInst *Term) {
+ assert(Term->isConditional());
+
+ // Don't overwrite existing branch weights.
+ if (hasProfMD(*Term))
+ return false;
+
+ llvm::setBranchWeights(*Term, {LikelyWeight, UnlikelyWeight}, false);
+ LLVM_DEBUG(dbgs() << "Added branch weights: " << *Term << '\n');
+ return true;
+}
+
+bool AMDGPUAnnotateVaryingBranchWeightsImpl::run(Function &F) {
+ // If the workgroup has only a single thread, the condition cannot vary.
+ const auto WGSizes = ST.getFlatWorkGroupSizes(F);
+ if (WGSizes.first <= 1)
+ return false;
+
+ using namespace PatternMatch;
+
+ bool Changed = false;
+ for (auto &BB : F) {
+ auto *Term = BB.getTerminator();
+ // Look for conditional branches whose condition is an ExtractValueInst
+ // that extracts the return value of a call to the amdgcn_if or amdgcn_else
+ // intrinsic.
+ if (match(Term, m_Br(m_ExtractValue<0>(m_CombineOr(
+ m_Intrinsic<Intrinsic::amdgcn_if>(),
+ m_Intrinsic<Intrinsic::amdgcn_else>())),
+ m_Value(), m_Value()))) {
+ // This condition is an artificial value resulting from the control
+ // flow intrinsic, not the actual branch condition. However, the
+ // intrinsics connect it via data flow with the actual condition
+ // (even for the amdgcn_else intrinsic, via the matching amdgcn_if
+ // intrinsic), so isLikelyVarying still produces meaningful results.
+ if (isLikelyVarying(cast<BranchInst>(Term)->getCondition()))
+ Changed |= setTrueSuccessorLikely(cast<BranchInst>(Term));
+ }
+ }
+
+ return Changed;
+}
diff --git a/llvm/lib/Target/AMDGPU/AMDGPUPassRegistry.def b/llvm/lib/Target/AMDGPU/AMDGPUPassRegistry.def
index 174a90f0aa419d..63a7b0a50c4455 100644
--- a/llvm/lib/Target/AMDGPU/AMDGPUPassRegistry.def
+++ b/llvm/lib/Target/AMDGPU/AMDGPUPassRegistry.def
@@ -65,6 +65,9 @@ FUNCTION_PASS("amdgpu-unify-divergent-exit-nodes",
AMDGPUUnifyDivergentExitNodesPass())
FUNCTION_PASS("amdgpu-usenative", AMDGPUUseNativeCallsPass())
FUNCTION_PASS("si-annotate-control-flow", SIAnnotateControlFlowPass(*static_cast<const GCNTargetMachine *>(this)))
+FUNCTION_PASS("amdgpu-annotate-varying-branch-weights",
+ AMDGPUAnnotateVaryingBranchWeightsPass(
+ *static_cast<const GCNTargetMachine *>(this)))
#undef FUNCTION_PASS
#ifndef FUNCTION_ANALYSIS
diff --git a/llvm/lib/Target/AMDGPU/AMDGPUTargetMachine.cpp b/llvm/lib/Target/AMDGPU/AMDGPUTargetMachine.cpp
index da18f2b20f1427..325db6ca9b7ace 100644
--- a/llvm/lib/Target/AMDGPU/AMDGPUTargetMachine.cpp
+++ b/llvm/lib/Target/AMDGPU/AMDGPUTargetMachine.cpp
@@ -501,6 +501,7 @@ extern "C" LLVM_EXTERNAL_VISIBILITY void LLVMInitializeAMDGPUTarget() {
initializeAMDGPUReserveWWMRegsPass(*PR);
initializeAMDGPURewriteOutArgumentsPass(*PR);
initializeAMDGPURewriteUndefForPHILegacyPass(*PR);
+ initializeAMDGPUAnnotateVaryingBranchWeightsLegacyPass(*PR);
initializeAMDGPUUnifyMetadataPass(*PR);
initializeSIAnnotateControlFlowLegacyPass(*PR);
initializeAMDGPUInsertDelayAluPass(*PR);
@@ -1315,6 +1316,7 @@ bool GCNPassConfig::addPreISel() {
// analysis. This depends on stopping SIAnnotateControlFlow from making
// control flow modifications.
addPass(createAMDGPURewriteUndefForPHILegacyPass());
+ addPass(createAMDGPUAnnotateVaryingBranchWeightsLegacyPass());
addPass(createLCSSAPass());
@@ -2003,6 +2005,8 @@ void AMDGPUCodeGenPassBuilder::addPreISel(AddIRPass &addPass) const {
// control flow modifications.
addPass(AMDGPURewriteUndefForPHIPass());
+ addPass(AMDGPUAnnotateVaryingBranchWeightsPass(TM));
+
addPass(LCSSAPass());
if (TM.getOptLevel() > CodeGenOptLevel::Less)
diff --git a/llvm/lib/Target/AMDGPU/CMakeLists.txt b/llvm/lib/Target/AMDGPU/CMakeLists.txt
index fed29c3e14aae2..7e85770c70d5ff 100644
--- a/llvm/lib/Target/AMDGPU/CMakeLists.txt
+++ b/llvm/lib/Target/AMDGPU/CMakeLists.txt
@@ -54,6 +54,7 @@ add_llvm_target(AMDGPUCodeGen
AMDGPUCodeGenPrepare.cpp
AMDGPUCombinerHelper.cpp
AMDGPUCtorDtorLowering.cpp
+ AMDGPUAnnotateVaryingBranchWeights.cpp
AMDGPUExportClustering.cpp
AMDGPUFrameLowering.cpp
AMDGPUGlobalISelDivergenceLowering.cpp
diff --git a/llvm/test/CodeGen/AMDGPU/conditional-mem-no-cbranch-execz.ll b/llvm/test/CodeGen/AMDGPU/conditional-mem-no-cbranch-execz.ll
new file mode 100644
index 00000000000000..f5f4c0a12eaeee
--- /dev/null
+++ b/llvm/test/CodeGen/AMDGPU/conditional-mem-no-cbranch-execz.ll
@@ -0,0 +1,66 @@
+; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py UTC_ARGS: --version 5
+; RUN: llc -mtriple=amdgcn-amd-amdhsa -mcpu=gfx942 -mattr=+sramecc,-xnack < %s | FileCheck %s
+
+; Check that simple conditional memory accesses that are guarded by likely
+; varying conditions are not lowered with an s_cbranch_execz to bypass them.
+
+declare i32 @llvm.amdgcn.workitem.id.x()
+declare i32 @llvm.amdgcn.workitem.id.y()
+
+define amdgpu_kernel void @cond_ops(ptr addrspace(1) inreg %x, ptr addrspace(1) inreg %y) #0 !reqd_work_group_size !0 {
+; CHECK-LABEL: cond_ops:
+; CHECK: s_trap 2 ; Kernarg preload header. Trap with incompatible firmware that doesn't support preloading kernel arguments.
+; CHECK-NEXT: .fill 63, 4, 0xbf800000 ; s_nop 0
+; CHECK-NEXT: ; %bb.0: ; %entry
+; CHECK-NEXT: v_and_b32_e32 v1, 0x3ff, v0
+; CHECK-NEXT: v_bfe_u32 v0, v0, 10, 10
+; CHECK-NEXT: v_lshl_or_b32 v5, v0, 6, v1
+; CHECK-NEXT: v_lshrrev_b32_e32 v0, 4, v5
+; CHECK-NEXT: v_cmp_gt_u32_e32 vcc, 15, v0
+; CHECK-NEXT: v_mov_b32_e32 v0, 0
+; CHECK-NEXT: v_lshlrev_b32_e32 v4, 4, v5
+; CHECK-NEXT: v_mov_b32_e32 v1, 0
+; CHECK-NEXT: v_mov_b32_e32 v2, 0
+; CHECK-NEXT: v_mov_b32_e32 v3, 0
+; CHECK-NEXT: s_and_saveexec_b64 s[0:1], vcc
+; CHECK-NEXT: ; %bb.1: ; %do.load
+; CHECK-NEXT: global_load_dwordx4 v[0:3], v4, s[8:9]
+; CHECK-NEXT: ; %bb.2: ; %post.load
+; CHECK-NEXT: s_or_b64 exec, exec, s[0:1]
+; CHECK-NEXT: v_and_b32_e32 v5, 15, v5
+; CHECK-NEXT: v_cmp_gt_u32_e32 vcc, 15, v5
+; CHECK-NEXT: s_and_saveexec_b64 s[0:1], vcc
+; CHECK-NEXT: s_cbranch_execz .LBB0_4
+; CHECK-NEXT: ; %bb.3: ; %do.store
+; CHECK-NEXT: s_waitcnt vmcnt(0)
+; CHECK-NEXT: global_store_dwordx4 v4, v[0:3], s[10:11]
+; CHECK-NEXT: .LBB0_4: ; %exit
+; CHECK-NEXT: s_endpgm
+entry:
+ %tid.x = tail call range(i32 0, 64) i32 @llvm.amdgcn.workitem.id.x()
+ %tid.y = tail call range(i32 0, 4) i32 @llvm.amdgcn.workitem.id.y()
+ %tid.y.shift = shl nuw nsw i32 %tid.y, 6
+ %tid = or disjoint i32 %tid.x, %tid.y.shift
+ %k = lshr i32 %tid, 4
+ %j = and i32 %tid, 15
+ %load.cond = icmp ult i32 %k, 15
+ %tid.ext = zext nneg i32 %tid to i64
+ %my.x = getelementptr <4 x float>, ptr addrspace(1) %x, i64 %tid.ext
+ br i1 %load.cond, label %do.load, label %post.load
+do.load:
+ %loaded = load <4 x float>, ptr addrspace(1) %my.x
+ br label %post.load
+post.load:
+ %maybe.loaded = phi <4 x float> [ %loaded, %do.load ], [ zeroinitializer, %entry ]
+ %my.y = getelementptr <4 x float>, ptr addrspace(1) %y, i64 %tid.ext
+ %store.cond = icmp ult i32 %j, 15
+ br i1 %store.cond, label %do.store, label %exit
+do.store:
+ store <4 x float> %maybe.loaded, ptr addrspace(1) %my.y
+ br label %exit
+exit:
+ ret void
+}
+
+attributes #0 = {"uniform-work-group-size"="true" "amdgpu-flat-work-group-size"="256,256"}
+!0 = !{i32 64, i32 4, i32 1}
diff --git a/llvm/test/CodeGen/AMDGPU/llc-pipeline.ll b/llvm/test/CodeGen/AMDGPU/llc-pipeline.ll
index e77f4f69e265bb..65bb9a9652b17e 100644
--- a/llvm/test/CodeGen/AMDGPU/llc-pipeline.ll
+++ b/llvm/test/CodeGen/AMDGPU/llc-pipeline.ll
@@ -87,6 +87,7 @@
; GCN-O0-NEXT: Cycle Info Analysis
; GCN-O0-NEXT: Uniformity Analysis
; GCN-O0-NEXT: AMDGPU Rewrite Undef for PHI
+; GCN-O0-NEXT: AMDGPU Annotate Varying Branch Weights
; GCN-O0-NEXT: LCSSA Verifier
; GCN-O0-NEXT: Loop-Closed SSA Form Pass
; GCN-O0-NEXT: DummyCGSCCPass
@@ -279,6 +280,7 @@
; GCN-O1-NEXT: Cycle Info Analysis
; GCN-O1-NEXT: Uniformity Analysis
; GCN-O1-NEXT: AMDGPU Rewrite Undef for PHI
+; GCN-O1-NEXT: AMDGPU Annotate Varying Branch Weights
; GCN-O1-NEXT: LCSSA Verifier
; GCN-O1-NEXT: Loop-Closed SSA Form Pass
; GCN-O1-NEXT: DummyCGSCCPass
@@ -579,6 +581,7 @@
; GCN-O1-OPTS-NEXT: Cycle Info Analysis
; GCN-O1-OPTS-NEXT: Uniformity Analysis
; GCN-O1-OPTS-NEXT: AMDGPU Rewrite Undef for PHI
+; GCN-O1-OPTS-NEXT: AMDGPU Annotate Varying Branch Weights
; GCN-O1-OPTS-NEXT: LCSSA Verifier
; GCN-O1-OPTS-NEXT: Loop-Closed SSA Form Pass
; GCN-O1-OPTS-NEXT: DummyCGSCCPass
@@ -891,6 +894,7 @@
; GCN-O2-NEXT: Cycle Info Analysis
; GCN-O2-NEXT: Uniformity Analysis
; GCN-O2-NEXT: AMDGPU Rewrite Undef for PHI
+; GCN-O2-NEXT: AMDGPU Annotate Varying Branch Weights
; GCN-O2-NEXT: LCSSA Verifier
; GCN-O2-NEXT: Loop-Closed SSA Form Pass
; GCN-O2-NEXT: Analysis if a function is memory bound
@@ -1218,6 +1222,7 @@
; GCN-O3-NEXT: Cycle Info Analysis
; GCN-O3-NEXT: Uniformity Analysis
; GCN-O3-NEXT: AMDGPU Rewrite Undef for PHI
+; GCN-O3-NEXT: AMDGPU Annotate Varying Branch Weights
; GCN-O3-NEXT: LCSSA Verifier
; GCN-O3-NEXT: Loop-Closed SSA Form Pass
; GCN-O3-NEXT: Analysis if a function is memory bound
diff --git a/llvm/test/CodeGen/AMDGPU/machine-sink-temporal-divergence-swdev407790.ll b/llvm/test/CodeGen/AMDGPU/machine-sink-temporal-divergence-swdev407790.ll
index c826980991f94f..5db6dd8d64d283 100644
--- a/llvm/test/CodeGen/AMDGPU/machine-sink-temporal-divergence-swdev407790.ll
+++ b/llvm/test/CodeGen/AMDGPU/machine-sink-temporal-divergence-swdev407790.ll
@@ -146,8 +146,8 @@ define protected amdgpu_kernel void @kernel_round1(ptr addrspace(1) nocapture no
; CHECK-NEXT: v_mov_b32_e32 v47, 0
; CHECK-NEXT: s_mov_b32 s47, 0
; CHECK-NEXT: .LBB0_5: ; =>This Loop Header: Depth=1
-; CHECK-NEXT: ; Child Loop BB0_8 Depth 2
-; CHECK-NEXT: ; Child Loop BB0_20 Depth 2
+; CHECK-NEXT: ; Child Loop BB0_7 Depth 2
+; CHECK-NEXT: ; Child Loop BB0_19 Depth 2
; CHECK-NEXT: v_add_nc_u32_e32 v0, s47, v44
; CHECK-NEXT: s_lshl_b32 s4, s47, 5
; CHECK-NEXT: s_add_i32 s46, s47, 1
@@ -163,20 +163,7 @@ define protected amdgpu_kernel void @kernel_round1(ptr addrspace(1) nocapture no
; CHECK-NEXT: ; in Loop: Header=BB0_5 Depth=1
; CHECK-NEXT: s_mov_b32 s53, 0
; CHECK-NEXT: s_mov_b32 s56, 0
-; CHECK-NEXT: s_branch .LBB0_8
-; CHECK-NEXT: .LBB0_7: ; in Loop: Header=BB0_8 Depth=2
-; CHECK-NEXT: s_or_b32 exec_lo, exec_lo, s57
-; CHECK-NEXT: s_add_i32 s56, s56, 4
-; CHECK-NEXT: s_add_i32 s4, s47, s56
-; CHECK-NEXT: v_add_nc_u32_e32 v0, s56, v57
-; CHECK-NEXT: s_add_i32 s5, s4, 5
-; CHECK-NEXT: s_add_i32 s4, s4, 1
-; CHECK-NEXT: v_cmp_ge_u32_e32 vcc_lo, s5, v42
-; CHECK-NEXT: v_mov_b32_e32 v58, s4
-; CHECK-NEXT: s_or_b32 s53, vcc_lo, s53
-; CHECK-NEXT: s_andn2_b32 exec_lo, exec_lo, s53
-; CHECK-NEXT: s_cbranch_execz .LBB0_16
-; CHECK-NEXT: .LBB0_8: ; Parent Loop BB0_5 Depth=1
+; CHECK-NEXT: .LBB0_7: ; Parent Loop BB0_5 Depth=1
; CHECK-NEXT: ; => This Inner Loop Header: Depth=2
; CHECK-NEXT: v_add_nc_u32_e32 v59, s56, v46
; CHECK-NEXT: v_add_nc_u32_e32 v58, s56, v57
@@ -184,8 +171,8 @@ define protected amdgpu_kernel void @kernel_round1(ptr addrspace(1) nocapture no
; CHECK-NEXT: s_waitcnt lgkmcnt(0)
; CHECK-NEXT: v_cmp_eq_u16_sdwa s4, v56, v0 src0_sel:BYTE_0 src1_sel:DWORD
; CHECK-NEXT: s_and_saveexec_b32 s57, s4
-; CHECK-NEXT: s_cbranch_execz .LBB0_10
-; CHECK-NEXT: ; %bb.9: ; in Loop: Header=BB0_8 Depth=2
+; CHECK-NEXT: s_cbranch_execz .LBB0_9
+; CHECK-NEXT: ; %bb.8: ; in Loop: Header=BB0_7 Depth=2
; CHECK-NEXT: v_mov_b32_e32 v31, v40
; CHECK-NEXT: v_mov_b32_e32 v0, 0x3c00
; CHECK-NEXT: s_add_u32 s8, s34, 40
@@ -203,14 +190,14 @@ define protected amdgpu_kernel void @kernel_round1(ptr addrspace(1) nocapture no
; CHECK-NEXT: s_swappc_b64 s[30:31], s[16:17]
; CHECK-NEXT: v_lshlrev_b32_e32 v0, 2, v0
; CHECK-NEXT: ds_write_b32 v0, v58
-; CHECK-NEXT: .LBB0_10: ; in Loop: Header=BB0_8 Depth=2
+; CHECK-NEXT: .LBB0_9: ; in Loop: Header=BB0_7 Depth=2
; CHECK-NEXT: s_or_b32 exec_lo, exec_lo, s57
; CHECK-NEXT: ds_read_u8 v0, v59 offset:1
; CHECK-NEXT: s_waitcnt lgkmcnt(0)
; CHECK-NEXT: v_cmp_eq_u16_sdwa s4, v56, v0 src0_sel:BYTE_0 src1_sel:DWORD
; CHECK-NEXT: s_and_saveexec_b32 s57, s4
-; CHECK-NEXT: s_cbranch_execz .LBB0_12
-; CHECK-NEXT: ; %bb.11: ; in Loop: Header=BB0_8 Depth=2
+; CHECK-NEXT: s_cbranch_execz .LBB0_11
+; CHECK-NEXT: ; %bb.10: ; in Loop: Header=BB0_7 Depth=2
; CHECK-NEXT: v_mov_b32_e32 v31, v40
; CHECK-NEXT: v_mov_b32_e32 v0, 0x3c00
; CHECK-NEXT: s_add_u32 s8, s34, 40
@@ -229,14 +216,14 @@ define protected amdgpu_kernel void @kernel_round1(ptr addrspace(1) nocapture no
; CHECK-NEXT: s_swappc_...
[truncated]
// intrinsics connect it via data flow with the actual condition
// (even for the amdgcn_else intrinsic, via the matching amdgcn_if
// intrinsic), so isLikelyVarying still produces meaningful results.
if (isLikelyVarying(cast<BranchInst>(Term)->getCondition()))
I don't understand the big picture here. isLikelyVarying just seems to be reimplementing divergence analysis, by checking for things that either are or use sources of divergence. But don't we already know that the branch is divergent, because SIAnnotateControlFlow only inserts intrinsics like amdgcn_if in divergent control flow?
The goal is to use a heuristic there to determine when branches are likely to be varying, for instance when they are based on (the lower bits of) the thread id. The current implementation here is very likely not the heuristic we want. At this point, the purpose of this PR is to gather initial feedback on where and how I use the heuristic before tuning it.
OK. Then I guess I am just pointing out that your current heuristic could be simplified to return true :)
Well, technically not quite, right? Uniformity analysis does more than just follow data dependences. Via control dependences (e.g. as a phi at the control flow join of a non-uniform branch), a Value could be non-uniform but this heuristic wouldn't trigger (but we'd probably want to be more restrictive than that as well).
I still don't really understand... Are you trying to identify branches whose condition is divergent according to UA, but is likely to be uniform in practice? How can branch weights tell you whether the branch is likely uniform or not? I guess if it's likely uniform-and-true or uniform-and-false then they can tell you that. For what kind of branches do you want to influence the exec-skipping logic? Can you give an example?
I think it's the opposite. It's divergent, and known never uniform in practice.
There is an example in the added llvm/test/CodeGen/AMDGPU/conditional-mem-no-cbranch-execz.ll test (see also SWDEV-483228 for more discussion). When every wavefront diverges while executing a conditional, the execz branches to skip the then/else cases are unnecessary and harm performance. The goal here is to remove the execz branches when that is likely the case (since the performance gains are probably not worth the effort of implementing a full, sound "must-not-be-uniform" analysis for this).
This patch doesn't use branch weights to find something out; it sets them to signal (to the later SIPreEmitPeephole pass) that it's likely that the "then" and "else" cases are executed, i.e., that it should remove the execz branches if possible.
I feel like this deserves separate representation from branch frequencies, but I'm not sure what it would be. And it would involve more threading work to get to the late backend pass
bool AMDGPUAnnotateVaryingBranchWeightsImpl::run(Function &F) {
  // If the workgroup has only a single thread, the condition cannot vary.
  const auto WGSizes = ST.getFlatWorkGroupSizes(F);
  if (WGSizes.first <= 1)
Add test for this?
I also suspect you want the product here, not just checking the first term
@krzysz00 From what I understand, ST.getFlatWorkGroupSizes() (based on the amdgpu-flat-work-group-size attribute) returns a pair of minimum and maximum numbers of work items (flattened across the dimensions), so .first is already the minimum number of total work items in the workgroup.
I'm currently working on a patch that refines the heuristic and also adds a similar check here based on the product of the reqd_work_group_size dimensions.
Can you use ST.isSingleLaneExecution instead?
Ah, I misremembered which function you were calling
> Can you use ST.isSingleLaneExecution instead?

Returning early if ST.isSingleLaneExecution(F) is true would mean annotating more cases: The intention of the current version was to annotate only if it's known that single-lane execution cannot happen for F. The changed variant would additionally annotate cases where we don't know whether single-lane execution could happen.
I'd be open to that change; it seems likely that kernels without explicit work group bounds and for which performance matters use more than a single lane.
@@ -0,0 +1,66 @@
; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py UTC_ARGS: --version 5
; RUN: llc -mtriple=amdgcn-amd-amdhsa -mcpu=gfx942 -mattr=+sramecc,-xnack < %s | FileCheck %s
Need some IR pass checks
Added tests for the pass in annotate-likely-varying-branches.ll. I'll update them when changing to a different annotation mechanism instead of branch weights.
I see. I think you really want to know "will this divergent condition be true for at least one lane", in which case the "execz" condition will not be satisfied so you can remove the s_cbranch_execz. I agree that using normal branch weights for this feels wrong, since a condition like "laneid == 0" is mostly false (for most lanes) but it does satisfy the condition you're interested in. I don't see how any heuristic based on sources of divergence can help you here. "Source of divergence" just means "not guaranteed to be uniform", and says nothing about whether the results in different lanes are actually likely to be different. I think perhaps you just need to pattern-match specific cases like "laneid == constant" and "x == readfirstlane(x)".
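For concreteness, a minimal IR sketch of those two patterns (hypothetical, not part of the patch; the lane-id idiom via mbcnt shown here assumes wave32, and the unsuffixed readfirstlane spelling is an assumption that may differ between LLVM versions):

```llvm
declare i32 @llvm.amdgcn.mbcnt.lo(i32, i32)
declare i32 @llvm.amdgcn.readfirstlane(i32)

; "laneid == constant": true in exactly one lane of each wave.
define i1 @lane_id_cmp() {
  %lane = call i32 @llvm.amdgcn.mbcnt.lo(i32 -1, i32 0)
  %cmp = icmp eq i32 %lane, 0
  ret i1 %cmp
}

; "x == readfirstlane(x)": true in every lane iff %x is wave-uniform.
define i1 @uniform_cmp(i32 %x) {
  %first = call i32 @llvm.amdgcn.readfirstlane(i32 %x)
  %cmp = icmp eq i32 %x, %first
  ret i1 %cmp
}
```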
@jayfoad The idea is to be more optimistic than that, along the lines of "does the condition use bits that are varying in every wavefront". For instance, if required workgroup dimensions are known, we could consider only the workitem.id intrinsics for dimensions that vary within a wavefront, where the current PR checks for sources of divergence. This means that execz branches that would be taken may be removed; the actual removal only happens if it doesn't affect correctness.
Do you have realistic examples of cases where we do want the execz branch, and cases where we do not want the execz branch? I am sceptical that you can effectively distinguish these categories with a generic analysis of the IR (but maybe I'm wrong). I think it's more likely that you would have to pattern-match some examples of one category, and assume that everything else is in the other category.
@jayfoad An example where the execz branch should be eliminated, because the condition varies within every wavefront:

__global__ void foo(int *buf, size_t n) {
  unsigned tid = blockDim.x * blockIdx.x + threadIdx.x;
  if (tid % 2 == 0)
    buf[tid] = 0;
}

An example where the execz branch should not be eliminated would be a sequence of conditional accesses where the condition is loaded from memory (the current implementation does not respect that):

__global__ void bar(int *buf, size_t n, bool *lookup) {
  unsigned tid = blockDim.x * blockIdx.x + threadIdx.x;
  if (lookup[tid]) {
    buf[tid] = 0;
    // more memory accesses...
  }
}

Checking if the branch condition is the result of bitwise and arithmetic operations that involve a workitem id (i.e., only continuing to check the operands of a User if it is such an instruction) would be a plausible heuristic for that (one that would benefit from benchmarking). @krzysz00 might have more practical examples.
Agreed.
Not so sure about this. The compiler has no information about the contents of lookup.
Agreed. That's why I suggest keeping the handling for this case as it is in trunk and not eliminating the execz branch for it.
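At the IR level, the difference between the two example cases above looks roughly like the following sketch (hypothetical code, not from the patch; the function and value names are made up):

```llvm
declare i32 @llvm.amdgcn.workitem.id.x()

; foo: the condition chains back to the workitem id through simple
; arithmetic, so the heuristic would treat it as likely varying and
; mark the guarded block as likely executed.
define i1 @tid_parity_cond() {
  %tid = call i32 @llvm.amdgcn.workitem.id.x()
  %bit = and i32 %tid, 1
  %cond = icmp eq i32 %bit, 0
  ret i1 %cond
}

; bar: the condition comes from a load; the heuristic would stop at the
; load and leave the execz branch in place.
define i1 @lookup_cond(ptr addrspace(1) %lookup, i64 %tid.ext) {
  %p = getelementptr i8, ptr addrspace(1) %lookup, i64 %tid.ext
  %v = load i8, ptr addrspace(1) %p
  %cond = icmp ne i8 %v, 0
  ret i1 %cond
}
```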
First round of reviewer feedback: do not match intrinsics to identify candidate branches, preserve alphabetic order of the pass registry, rename TTI->GCNTTI, update test
Make the heuristic more strict, add IR tests
@jayfoad @arsenm I made the heuristic more selective in 8cf0cbf; it can now distinguish the example cases I mentioned above. I haven't found a good alternative to using the branch weights yet, since AFAIK normal IR metadata does not find its way to the backend. We could add a boolean argument to the amdgcn.if/else intrinsics and the SI_IF/ELSE pseudo instructions instead.
This feels like what I had in mind for the heuristic, but it seems like there's uncertainty there
//===----------------------------------------------------------------------===//
// Estimate if conditional branches for which SIAnnotateControlFlow introduced
// amdgcn_if or amdgcn_else intrinsics are likely to have different outcomes for
// the threads of each wavefront. If that is the case, BranchWeight metadata is
Do we not usually call them lanes?
Adjusted in 086ac46.
Thanks for the confirmation! I'm currently still working on getting performance numbers for a variant of this heuristic; I think the further steps on this PR depend on those performance numbers.
Currently, we introduce branches to skip conditionally executed instructions if the EXEC mask is zero and only eliminate them if the scheduling model says that executing the skipped instructions is cheaper than taking the branch instruction.

This patch adds a heuristic to SIAnnotateControlFlow to determine if the lanes of a wavefront are likely to have dynamically varying values for the branch condition. This information is passed through new arguments/operands of the amdgcn.if/else intrinsics and the SI_IF/ELSE pseudo instructions to SILowerControlFlow, where the execz branch is inserted with corresponding branch probabilities. This causes SIPreEmitPeephole to eliminate the corresponding execz branch if it is legal to do so.

This is an alternative to PR llvm#117567, using a simpler heuristic and passing the LikelyVarying information through new arguments for the amdgcn.if/else intrinsics and the SI_IF/ELSE pseudo instructions instead of abusing branch weight metadata.

Most test changes are caused by the new arguments for the amdgcn.if/else intrinsics and the SI_IF/ELSE pseudo instructions; the LikelyVarying argument is set to false/0 in these existing tests. New tests for the functionality are in conditional-mem-no-cbranch-execz.ll and annotate-likely-varying-branches.ll.

For SWDEV-483228.
Closing this one in favor of #123749.