[AMDGPU][LTO] Introduce AMDGPUCloneModuleLDS #89683
Conversation
@llvm/pr-subscribers-lto @llvm/pr-subscribers-backend-amdgpu

Author: Anshil Gandhi (gandhi56)

Changes

The purpose of this pass is to ensure that the combined module contains as many LDS global variables as there are kernels that (indirectly) access them. As LDS variables behave like C++ static variables, it is important that each partition contains a unique copy of the variable on a per-kernel basis. This representation also prepares the combined module to eliminate cross-module dependencies of LDS variables.

This pass operates as follows:
1. Traverse the call graph from each kernel to determine the number of kernels calling each device function.
2. For each LDS global variable GV, determine the function F that defines it. Collect its caller functions. Clone F and GV, and finally insert a call/invoke instruction in each caller function.
Patch is 27.83 KiB, truncated to 20.00 KiB below; full version: https://github.com/llvm/llvm-project/pull/89683.diff

9 Files Affected:
diff --git a/llvm/lib/Target/AMDGPU/AMDGPU.h b/llvm/lib/Target/AMDGPU/AMDGPU.h
index 6016bd5187d887..f913833a25119b 100644
--- a/llvm/lib/Target/AMDGPU/AMDGPU.h
+++ b/llvm/lib/Target/AMDGPU/AMDGPU.h
@@ -149,6 +149,11 @@ struct AMDGPULowerBufferFatPointersPass
const TargetMachine &TM;
};
+struct AMDGPUCloneModuleLDSPass
+ : public PassInfoMixin<AMDGPUCloneModuleLDSPass> {
+ PreservedAnalyses run(Module &, ModuleAnalysisManager &);
+};
+
void initializeAMDGPURewriteOutArgumentsPass(PassRegistry &);
extern char &AMDGPURewriteOutArgumentsID;
diff --git a/llvm/lib/Target/AMDGPU/AMDGPUCloneModuleLDS.cpp b/llvm/lib/Target/AMDGPU/AMDGPUCloneModuleLDS.cpp
new file mode 100644
index 00000000000000..2ab931818ea1eb
--- /dev/null
+++ b/llvm/lib/Target/AMDGPU/AMDGPUCloneModuleLDS.cpp
@@ -0,0 +1,201 @@
+//===- AMDGPUCloneModuleLDS.cpp -------------------------------*- C++ -*-===//
+//
+// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
+// See https://llvm.org/LICENSE.txt for license information.
+// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
+//
+//===----------------------------------------------------------------------===//
+//
+// The purpose of this pass is to ensure that the combined module contains
+// as many LDS global variables as there are kernels that (indirectly) access
+// them. As LDS variables behave like C++ static variables, it is important that
+// each partition contains a unique copy of the variable on a per-kernel basis.
+// This representation also prepares the combined module to eliminate
+// cross-module dependencies of LDS variables.
+//
+// This pass operates as follows:
+// 1. Traverse the call graph from each kernel to determine the number of
+//    kernels calling each device function.
+// 2. For each LDS global variable GV, determine the function F that defines
+//    it. Collect its caller functions. Clone F and GV, and finally insert a
+//    call/invoke instruction in each caller function.
+//
+//===----------------------------------------------------------------------===//
+
+#include "AMDGPU.h"
+#include "llvm/ADT/DepthFirstIterator.h"
+#include "llvm/Analysis/CallGraph.h"
+#include "llvm/Passes/PassBuilder.h"
+#include "llvm/Support/ScopedPrinter.h"
+#include "llvm/Transforms/Utils/Cloning.h"
+
+using namespace llvm;
+
+#define DEBUG_TYPE "amdgpu-clone-module-lds"
+
+static cl::opt<unsigned int> MaxCountForClonedFunctions(
+ "clone-lds-functions-max-count", cl::init(16), cl::Hidden,
+ cl::desc("Specify a limit to the number of clones of a function"));
+
+/// Return the function that defines \p GV
+/// \param GV The global variable in question
+/// \return The function defining \p GV
+static Function *getFunctionDefiningGV(GlobalVariable &GV) {
+ SmallVector<User *> Worklist(GV.users());
+ while (!Worklist.empty()) {
+ User *U = Worklist.pop_back_val();
+ if (auto *Inst = dyn_cast<Instruction>(U))
+ return Inst->getFunction();
+ if (auto *Op = dyn_cast<Operator>(U))
+ append_range(Worklist, Op->users());
+ }
+ llvm_unreachable("GV must be used in a function.");
+}
+
+/// Replace all references to \p OldGV in \p NewF with \p NewGV
+/// \param OldGV The global variable to be replaced
+/// \param NewGV The global variable taking the place of \p OldGV
+/// \param NewF The function in which the replacement occurs
+static void replaceUsesOfWith(GlobalVariable *OldGV, GlobalVariable *NewGV,
+ Function *NewF) {
+  // ReplaceOperatorUses takes an instruction Inst that is assumed to
+  // reference OldGV through an operator, materializes an equivalent
+  // instruction for the OldGV operand, and updates Inst accordingly.
+ auto ReplaceOperatorUses = [&OldGV, &NewGV](Instruction *Inst) {
+ Inst->replaceUsesOfWith(OldGV, NewGV);
+ SmallVector<Value *, 8> Worklist(Inst->operands());
+ while (!Worklist.empty()) {
+ auto *V = Worklist.pop_back_val();
+ if (auto *I = dyn_cast<AddrSpaceCastOperator>(V)) {
+ auto *Cast = new AddrSpaceCastInst(NewGV, I->getType(), "", Inst);
+ Inst->replaceUsesOfWith(I, Cast);
+ } else if (auto *I = dyn_cast<GEPOperator>(V)) {
+ SmallVector<Value *, 8> Indices(I->indices());
+ auto *GEP = GetElementPtrInst::Create(NewGV->getValueType(), NewGV,
+ Indices, "", Inst);
+ Inst->replaceUsesOfWith(I, GEP);
+ }
+ }
+ };
+
+ // Collect all user instructions of OldGV using a Worklist algorithm.
+ // If a user is an operator, collect the instruction wrapping
+ // the operator.
+ SmallVector<Instruction *, 8> InstsToReplace;
+ SmallVector<User *, 8> UsersWorklist(OldGV->users());
+ while (!UsersWorklist.empty()) {
+ auto *U = UsersWorklist.pop_back_val();
+ if (auto *Inst = dyn_cast<Instruction>(U)) {
+ InstsToReplace.push_back(Inst);
+ } else if (auto *Op = dyn_cast<Operator>(U)) {
+ append_range(UsersWorklist, Op->users());
+ }
+ }
+
+  // Replace all occurrences of OldGV in NewF.
+ DenseSet<Instruction *> ReplacedInsts;
+ while (!InstsToReplace.empty()) {
+ auto *Inst = InstsToReplace.pop_back_val();
+ if (Inst->getFunction() != NewF || ReplacedInsts.contains(Inst))
+ continue;
+ ReplaceOperatorUses(Inst);
+ ReplacedInsts.insert(Inst);
+ }
+}
+
+PreservedAnalyses AMDGPUCloneModuleLDSPass::run(Module &M,
+ ModuleAnalysisManager &AM) {
+ if (MaxCountForClonedFunctions.getValue() == 1)
+ return PreservedAnalyses::all();
+
+ bool Changed = false;
+ auto &CG = AM.getResult<CallGraphAnalysis>(M);
+
+ // For each function in the call graph, determine the number
+ // of ancestor-caller kernels.
+ DenseMap<Function *, unsigned int> KernelRefsToFuncs;
+ for (auto &Fn : M) {
+ if (Fn.getCallingConv() != CallingConv::AMDGPU_KERNEL)
+ continue;
+    for (auto I = df_begin(CG[&Fn]), E = df_end(CG[&Fn]); I != E; ++I)
+ if (auto *F = I->getFunction())
+ KernelRefsToFuncs[F]++;
+ }
+
+ DenseMap<GlobalVariable *, Function *> GVToFnMap;
+ LLVMContext &Ctx = M.getContext();
+ IRBuilder<> IRB(Ctx);
+ for (auto &GV : M.globals()) {
+ if (GVToFnMap.contains(&GV) ||
+ GV.getType()->getPointerAddressSpace() != AMDGPUAS::LOCAL_ADDRESS ||
+ !GV.hasInitializer())
+ continue;
+
+ auto *OldF = getFunctionDefiningGV(GV);
+ GVToFnMap.insert({&GV, OldF});
+ LLVM_DEBUG(dbgs() << "Found LDS " << GV.getName() << " used in function "
+ << OldF->getName() << '\n');
+
+    // Collect all caller functions of OldF. Each of them must call its
+    // corresponding clone of OldF.
+ SmallVector<std::pair<Instruction *, SmallVector<Value *>>>
+ InstsCallingOldF;
+ for (auto &I : OldF->uses()) {
+ User *U = I.getUser();
+ SmallVector<Value *> Args;
+ if (auto *CI = dyn_cast<CallInst>(U)) {
+ append_range(Args, CI->args());
+ InstsCallingOldF.push_back({CI, Args});
+ } else if (auto *II = dyn_cast<InvokeInst>(U)) {
+ append_range(Args, II->args());
+ InstsCallingOldF.push_back({II, Args});
+ }
+ }
+
+ // Create as many clones of the function containing LDS global as
+ // there are kernels calling the function (including the function
+ // already defining the LDS global). Respectively, clone the
+ // LDS global and the call instructions to the function.
+ LLVM_DEBUG(dbgs() << "\tFunction is referenced by "
+ << KernelRefsToFuncs[OldF] << " kernels.\n");
+ for (unsigned int ID = 0;
+ ID + 1 < std::min(KernelRefsToFuncs[OldF],
+ MaxCountForClonedFunctions.getValue());
+ ++ID) {
+ // Clone function
+ ValueToValueMapTy VMap;
+ auto *NewF = CloneFunction(OldF, VMap);
+ NewF->setName(OldF->getName() + ".clone." + to_string(ID));
+ LLVM_DEBUG(dbgs() << "Inserting function clone with name "
+ << NewF->getName() << '\n');
+
+ // Clone LDS global variable
+ auto *NewGV = new GlobalVariable(
+ M, GV.getValueType(), GV.isConstant(), GlobalValue::InternalLinkage,
+ UndefValue::get(GV.getValueType()),
+ GV.getName() + ".clone." + to_string(ID), &GV,
+ GlobalValue::NotThreadLocal, AMDGPUAS::LOCAL_ADDRESS, false);
+ NewGV->copyAttributesFrom(&GV);
+ NewGV->copyMetadata(&GV, 0);
+ NewGV->setComdat(GV.getComdat());
+ replaceUsesOfWith(&GV, NewGV, NewF);
+ LLVM_DEBUG(dbgs() << "Inserting LDS clone with name " << NewGV->getName()
+ << "\n");
+
+ // Create a new CallInst to call the cloned function
+ for (auto [Inst, Args] : InstsCallingOldF) {
+ IRB.SetInsertPoint(Inst);
+        Instruction *I = nullptr;
+        if (isa<CallInst>(Inst))
+          I = IRB.CreateCall(NewF, Args,
+                             Inst->getName() + ".clone." + to_string(ID));
+        else if (auto *II = dyn_cast<InvokeInst>(Inst))
+          I = IRB.CreateInvoke(NewF, II->getNormalDest(), II->getUnwindDest(),
+                               Args, II->getName() + ".clone." + to_string(ID));
+ LLVM_DEBUG(dbgs() << "Inserting inst: " << *I << '\n');
+ }
+ Changed = true;
+ }
+ }
+ return Changed ? PreservedAnalyses::none() : PreservedAnalyses::all();
+}
diff --git a/llvm/lib/Target/AMDGPU/AMDGPUPassRegistry.def b/llvm/lib/Target/AMDGPU/AMDGPUPassRegistry.def
index 90f36fadf35903..eb4bf25fef628a 100644
--- a/llvm/lib/Target/AMDGPU/AMDGPUPassRegistry.def
+++ b/llvm/lib/Target/AMDGPU/AMDGPUPassRegistry.def
@@ -22,6 +22,7 @@ MODULE_PASS("amdgpu-lower-buffer-fat-pointers",
AMDGPULowerBufferFatPointersPass(*this))
MODULE_PASS("amdgpu-lower-ctor-dtor", AMDGPUCtorDtorLoweringPass())
MODULE_PASS("amdgpu-lower-module-lds", AMDGPULowerModuleLDSPass(*this))
+MODULE_PASS("amdgpu-clone-module-lds", AMDGPUCloneModuleLDSPass())
MODULE_PASS("amdgpu-printf-runtime-binding", AMDGPUPrintfRuntimeBindingPass())
MODULE_PASS("amdgpu-unify-metadata", AMDGPUUnifyMetadataPass())
#undef MODULE_PASS
diff --git a/llvm/lib/Target/AMDGPU/AMDGPUTargetMachine.cpp b/llvm/lib/Target/AMDGPU/AMDGPUTargetMachine.cpp
index 305a6c8c3b9262..09beabd3f9c55c 100644
--- a/llvm/lib/Target/AMDGPU/AMDGPUTargetMachine.cpp
+++ b/llvm/lib/Target/AMDGPU/AMDGPUTargetMachine.cpp
@@ -725,6 +725,7 @@ void AMDGPUTargetMachine::registerPassBuilderCallbacks(
// We want to support the -lto-partitions=N option as "best effort".
// For that, we need to lower LDS earlier in the pipeline before the
// module is partitioned for codegen.
+ PM.addPass(AMDGPUCloneModuleLDSPass());
if (EnableLowerModuleLDS)
PM.addPass(AMDGPULowerModuleLDSPass(*this));
});
diff --git a/llvm/lib/Target/AMDGPU/CMakeLists.txt b/llvm/lib/Target/AMDGPU/CMakeLists.txt
index 48325a0928f93d..fbf59e0422cb79 100644
--- a/llvm/lib/Target/AMDGPU/CMakeLists.txt
+++ b/llvm/lib/Target/AMDGPU/CMakeLists.txt
@@ -50,6 +50,7 @@ add_llvm_target(AMDGPUCodeGen
AMDGPUAtomicOptimizer.cpp
AMDGPUAttributor.cpp
AMDGPUCallLowering.cpp
+ AMDGPUCloneModuleLDS.cpp
AMDGPUCodeGenPrepare.cpp
AMDGPUCombinerHelper.cpp
AMDGPUCtorDtorLowering.cpp
diff --git a/llvm/test/tools/llvm-split/AMDGPU/clone-lds-function.ll b/llvm/test/tools/llvm-split/AMDGPU/clone-lds-function.ll
new file mode 100644
index 00000000000000..ed4a4434d74b04
--- /dev/null
+++ b/llvm/test/tools/llvm-split/AMDGPU/clone-lds-function.ll
@@ -0,0 +1,59 @@
+; RUN: opt -passes=amdgpu-clone-module-lds %s -S | FileCheck %s
+
+target triple = "amdgcn-amd-amdhsa"
+
+; In this example, the CloneModuleLDS pass creates two copies of LDS_GV
+; as two kernels call the same device function where LDS_GV is used.
+
+; CHECK: [[LDS_GV_CLONE:@.*\.clone\.0]] = internal unnamed_addr addrspace(3) global [64 x i32] undef, align 16
+; CHECK: [[LDS_GV:@.*]] = internal unnamed_addr addrspace(3) global [64 x i32] undef, align 16
+@lds_gv = internal unnamed_addr addrspace(3) global [64 x i32] undef, align 16
+
+define protected amdgpu_kernel void @kernel1(i32 %n) #3 {
+; CHECK-LABEL: define protected amdgpu_kernel void @kernel1(
+; CHECK-SAME: i32 [[N:%.*]]) {
+; CHECK-NEXT: entry:
+; CHECK-NEXT: [[CALL_CLONE_0:%.*]] = call i32 @lds_func.clone.0(i32 [[N]])
+; CHECK-NEXT: [[CALL:%.*]] = call i32 @lds_func(i32 [[N]])
+; CHECK-NEXT: ret void
+;
+entry:
+ %call = call i32 @lds_func(i32 %n)
+ ret void
+}
+
+define protected amdgpu_kernel void @kernel2(i32 %n) #3 {
+; CHECK-LABEL: define protected amdgpu_kernel void @kernel2(
+; CHECK-SAME: i32 [[N:%.*]]) {
+; CHECK-NEXT: entry:
+; CHECK-NEXT: [[CALL_CLONE_0:%.*]] = call i32 @lds_func.clone.0(i32 [[N]])
+; CHECK-NEXT: [[CALL:%.*]] = call i32 @lds_func(i32 [[N]])
+; CHECK-NEXT: ret void
+;
+entry:
+ %call = call i32 @lds_func(i32 %n)
+ ret void
+}
+
+
+define i32 @lds_func(i32 %x) {
+; CHECK-LABEL: define i32 @lds_func(
+; CHECK-SAME: i32 [[X:%.*]]) {
+; CHECK-NEXT: entry:
+; CHECK-NEXT: [[P:%.*]] = getelementptr inbounds [64 x i32], ptr addrspacecast (ptr addrspace(3) [[LDS_GV]] to ptr), i64 0, i64 0
+; CHECK-NEXT: store i32 [[X]], ptr [[P]], align 4
+; CHECK-NEXT: ret i32 [[X]]
+;
+entry:
+ %p = getelementptr inbounds [64 x i32], ptr addrspacecast (ptr addrspace(3) @lds_gv to ptr), i64 0, i64 0
+ store i32 %x, ptr %p
+ ret i32 %x
+}
+
+; CHECK-LABEL: define i32 @lds_func.clone.0(i32 %x) {
+; CHECK-NEXT: entry:
+; CHECK-NEXT: %0 = addrspacecast ptr addrspace(3) [[LDS_GV_CLONE]] to ptr
+; CHECK-NEXT: %p = getelementptr inbounds [64 x i32], ptr %0, i64 0, i64 0
+; CHECK-NEXT: store i32 %x, ptr %p, align 4
+; CHECK-NEXT: ret i32 %x
+
diff --git a/llvm/test/tools/llvm-split/AMDGPU/clone-lds-functions-ancestor-kernels.ll b/llvm/test/tools/llvm-split/AMDGPU/clone-lds-functions-ancestor-kernels.ll
new file mode 100644
index 00000000000000..1efb1d12db924c
--- /dev/null
+++ b/llvm/test/tools/llvm-split/AMDGPU/clone-lds-functions-ancestor-kernels.ll
@@ -0,0 +1,107 @@
+; NOTE: Assertions have been autogenerated by utils/update_test_checks.py UTC_ARGS: --version 4
+; RUN: opt -passes=amdgpu-clone-module-lds %s -S | FileCheck %s
+
+target triple = "amdgcn-amd-amdhsa"
+
+; Before transformation, After transformation,
+; K1 K2 K1 K2
+; | / | /
+; | / | /
+; A ==> A
+; | \ | \
+; | \ | \
+; B C B C
+; | | \
+; X X1 X2
+;
+; where X contains an LDS reference
+
+; CHECK: [[GV_CLONE:@.*]] = internal unnamed_addr addrspace(3) global [64 x i32] undef, align 16
+; CHECK: [[GV:@.*]] = internal unnamed_addr addrspace(3) global [64 x i32] undef, align 16
+@lds_gv = internal unnamed_addr addrspace(3) global [64 x i32] undef, align 16
+
+define protected amdgpu_kernel void @kernel1(i32 %n) #3 {
+; CHECK-LABEL: define protected amdgpu_kernel void @kernel1(
+; CHECK-SAME: i32 [[N:%.*]]) {
+; CHECK-NEXT: entry:
+; CHECK-NEXT: [[CALL:%.*]] = call i32 @A(i32 [[N]])
+; CHECK-NEXT: ret void
+;
+entry:
+ %call = call i32 @A(i32 %n)
+ ret void
+}
+
+define protected amdgpu_kernel void @kernel2(i32 %n) #3 {
+; CHECK-LABEL: define protected amdgpu_kernel void @kernel2(
+; CHECK-SAME: i32 [[N:%.*]]) {
+; CHECK-NEXT: entry:
+; CHECK-NEXT: [[CALL:%.*]] = call i32 @A(i32 [[N]])
+; CHECK-NEXT: ret void
+;
+entry:
+ %call = call i32 @A(i32 %n)
+ ret void
+}
+
+define void @A() {
+; CHECK-LABEL: define void @A() {
+; CHECK-NEXT: entry:
+; CHECK-NEXT: call void @B()
+; CHECK-NEXT: call void @C()
+; CHECK-NEXT: ret void
+;
+entry:
+ call void @B()
+ call void @C()
+ ret void
+}
+
+define i32 @B() {
+; CHECK-LABEL: define i32 @B() {
+; CHECK-NEXT: entry:
+; CHECK-NEXT: [[P:%.*]] = alloca i32, align 4
+; CHECK-NEXT: store i32 5, ptr [[P]], align 4
+; CHECK-NEXT: [[RET_CLONE_0:%.*]] = call i32 @X.clone.0(ptr [[P]])
+; CHECK-NEXT: [[RET:%.*]] = call i32 @X(ptr [[P]])
+; CHECK-NEXT: ret i32 [[RET]]
+;
+entry:
+ %p = alloca i32
+ store i32 5, ptr %p
+ %ret = call i32 @X(ptr %p)
+ ret i32 %ret
+}
+
+define void @C() {
+; CHECK-LABEL: define void @C() {
+; CHECK-NEXT: entry:
+; CHECK-NEXT: ret void
+;
+entry:
+ ret void
+}
+
+define i32 @X(ptr %x) {
+; CHECK-LABEL: define i32 @X(
+; CHECK-SAME: ptr [[X:%.*]]) {
+; CHECK-NEXT: entry:
+; CHECK-NEXT: [[P:%.*]] = getelementptr inbounds [64 x i32], ptr addrspacecast (ptr addrspace(3) [[GV]] to ptr), i64 0, i64 0
+; CHECK-NEXT: [[V:%.*]] = load i32, ptr [[X]], align 4
+; CHECK-NEXT: store i32 [[V]], ptr [[P]], align 4
+; CHECK-NEXT: ret i32 [[V]]
+;
+entry:
+ %p = getelementptr inbounds [64 x i32], ptr addrspacecast (ptr addrspace(3) @lds_gv to ptr), i64 0, i64 0
+ %v = load i32, ptr %x
+ store i32 %v, ptr %p
+ ret i32 %v
+}
+
+; CHECK-LABEL: define i32 @X.clone.0(ptr %x) {
+; CHECK-NEXT: entry:
+; CHECK-NEXT: %0 = addrspacecast ptr addrspace(3) [[GV_CLONE]] to ptr
+; CHECK-NEXT: %p = getelementptr inbounds [64 x i32], ptr %0, i64 0, i64 0
+; CHECK-NEXT: %v = load i32, ptr %x, align 4
+; CHECK-NEXT: store i32 %v, ptr %p, align 4
+; CHECK-NEXT: ret i32 %v
diff --git a/llvm/test/tools/llvm-split/AMDGPU/clone-lds-functions-successors.ll b/llvm/test/tools/llvm-split/AMDGPU/clone-lds-functions-successors.ll
new file mode 100644
index 00000000000000..7f27a630b7d850
--- /dev/null
+++ b/llvm/test/tools/llvm-split/AMDGPU/clone-lds-functions-successors.ll
@@ -0,0 +1,134 @@
+; NOTE: Assertions have been autogenerated by utils/update_test_checks.py UTC_ARGS: --version 4
+; RUN: opt -passes=amdgpu-clone-module-lds %s -S | FileCheck %s
+
+target triple = "amdgcn-amd-amdhsa"
+
+; Before transformation, After transformation,
+; K1 K2 K3 K1 K2 K3
+; | / | | / |
+; | / | | / |
+; A --------+ ==> A --------+
+; | |
+; | |
+; B B
+; | / | \
+; X X1 X2 X3
+; | \ | /
+; D \ | /
+; D
+; where X contains an LDS reference
+
+; CHECK: [[GV_CLONE_0:@.*]] = internal unnamed_addr addrspace(3) global [64 x i32] undef, align 16
+; CHECK: [[GV_CLONE_1:@.*]] = internal unnamed_addr addrspace(3) global [64 x i32] undef, align 16
+; CHECK: [[GV:@.*]] = internal unnamed_addr addrspace(3) global [64 x i32] undef, align 16
+@lds_gv = internal unnamed_addr addrspace(3) global [64 x i32] undef, align 16
+
+define protected amdgpu_kernel void @kernel1(i32 %n) {
+; CHECK-LABEL: define protected amdgpu_kernel void @kernel1(
+; CHECK-SAME: i32 [[N:%.*]]) {
+; CHECK-NEXT: entry:
+; CHECK-NEXT: [[CALL:%.*]] = call i32 @A(i32 [[N]])
+; CHECK-NEXT: ret void
+;
+entry:
+ %call = call i32 @A(i32 %n)
+ ret void
+}
+
+define protected amdgpu_kernel void @kernel2(i32 %n) {
+; CHECK-LABEL: define protected amdgpu_kernel void @kernel2(
+; CHECK-SAME: i32 [[N:%.*]]) {
+; CHECK-NEXT: entry:
+; CHECK-NEXT: [[CALL:%.*]] = call i32 @A(i32 [[N]])
+; CHECK-NEXT: ret void
+;
+entry:
+ %call = call i32 @A(i32 %n)
+ ret void
+}
+
+define protected amdgpu_kernel void @kernel3(i32 %n) {
+; CHECK-LABEL: define protected amdgpu_kernel void @kernel3(
+; CHECK-SAME: i32 [[N:%.*]]) {
+; CHECK-NEXT: entry:
+; CHECK-NEXT: [[CALL:%.*]] = call i32 @A(i32 [[N]])
+; CHECK-NEXT: ret void
+;
+entry:
+ %call = call i32 @A(i32 %n)
+ ret void
+}
+
+define void @A() {
+; CHECK-LABEL: define void @A() {
+; CHECK-NEXT: entry:
+; CHECK-NEXT: call void @B()
+; CHECK-NEXT: ret void
+;
+entry:
+ call void @B()
+ ret void
+}
+
+define i32 @B() {
+; CHECK-LABEL: define i32 @B() {
+; CHECK-NEXT: entry:
+; CHECK-NEXT: [[P:%.*]] = alloca i32, align 4
+; CHECK-NEXT: store i32 5, ptr [[P]], align 4
+; CHECK-NEXT: [[RET_CLONE_0:%.*]] = call i32 @X.clone.0(ptr [[P]])
+...
[truncated]
    if (auto *Op = dyn_cast<Operator>(U))
      append_range(Worklist, Op->users());
  }
  llvm_unreachable("GV must be used in a function.");
This can't be unreachable and you also need to handle initializer references (in particular llvm.used)
getFunctionDefiningGV is only called when GV is an LDS variable. Is it possible to define an intrinsic LDS global variable using @llvm.used or @llvm.global_ctors?
Yes. LDS globals aren't special; they are just globals.
isLDSVariableToLower(GV) returns false when GV is a global intrinsic. I don't think they need to be cloned, as they simply bundle a collection of GVs, which may need to be cloned if they are LDS.
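A non-asserting lookup, returning nullptr when the variable is only referenced from global initializers or aliases, would address the unreachable concern raised above. A minimal sketch (illustrative only, not part of the patch; the name findFunctionUsingGV is hypothetical):

#include "llvm/ADT/STLExtras.h"
#include "llvm/ADT/SmallVector.h"
#include "llvm/IR/Function.h"
#include "llvm/IR/GlobalVariable.h"
#include "llvm/IR/Instruction.h"
#include "llvm/IR/Operator.h"

using namespace llvm;

// Sketch: return a function that uses GV through an instruction, or nullptr
// if GV is only reachable from initializers/aliases (e.g. @llvm.used).
static Function *findFunctionUsingGV(GlobalVariable &GV) {
  SmallVector<User *, 8> Worklist(GV.users());
  while (!Worklist.empty()) {
    User *U = Worklist.pop_back_val();
    if (auto *Inst = dyn_cast<Instruction>(U))
      return Inst->getFunction();
    // ConstantExpr/Operator users may eventually reach an instruction.
    if (auto *Op = dyn_cast<Operator>(U))
      append_range(Worklist, Op->users());
    // Other users (GlobalVariable initializers, GlobalAlias) fall through;
    // the caller decides how to handle a nullptr result.
  }
  return nullptr;
}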
// Collect all caller functions of OldF. Each of them must call its
// corresponding clone of OldF.
SmallVector<std::pair<Instruction *, SmallVector<Value *>>>
    InstsCallingOldF;
I think the Cloning utilities should handle this with the VMap
InstsCallingOldF collects all call instructions to the function defining LDS. These instructions are later cloned within their parent basic blocks. I am not sure if we want to clone these instructions when the callee function is cloned?
I'm confused. If you're cloning the entire function, you must clone all of the instructions in the function?
Yes, all the instructions within the function are cloned upon calling CloneFunction. Calls/invokes to the function need to be handled separately, however, as they aren't cloned along with the function.
But this doesn't make sense, the instructions have to be cloned. And I would expect the value map to remap the references to the called function
If the function to be cloned is called F, are you saying that the call/invoke instructions to F should be cloned via VMap?
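One reading of the reviewer's suggestion, sketched here as an assumption rather than as the patch's actual approach: pre-seed the ValueToValueMapTy with the old-to-new global mapping before calling CloneFunction, so the cloning machinery rewrites every use of the global (including through ConstantExprs) inside the clone, making a manual replaceUsesOfWith walk unnecessary. The helper name cloneWithRemappedGV is hypothetical:

#include "llvm/IR/Function.h"
#include "llvm/IR/GlobalVariable.h"
#include "llvm/Transforms/Utils/Cloning.h"
#include "llvm/Transforms/Utils/ValueMapper.h"

using namespace llvm;

// Sketch: clone OldF such that every reference to OldGV in the cloned body
// is rewritten to NewGV by CloneFunction's value mapping.
static Function *cloneWithRemappedGV(Function *OldF, GlobalVariable *OldGV,
                                     GlobalVariable *NewGV) {
  ValueToValueMapTy VMap;
  // Entries seeded before the call are honored when operands are remapped.
  VMap[OldGV] = NewGV;
  return CloneFunction(OldF, VMap);
}

Call sites of OldF would still need to be redirected to the clone separately; the VMap only governs what happens inside the cloned function body.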
      InstsToReplace.push_back(Inst);
    } else if (auto *Op = dyn_cast<Operator>(U)) {
      append_range(UsersWorklist, Op->users());
    }
Need to handle all cases. This will miss aliases and global initializers.
Sorry, I am not sure what you mean by handling global initializers.
If by global initializers you mean global ctors, isLDSVariableToLower(GV) already skips cloning them as intended. I am handling GlobalAlias now in the latest revision of this patch.
I mean you can initialize a global variable to contain a pointer value (e.g. in llvm.used). Instructions / Operators are not the only possible users:
@llvm.used = appending global [1 x ptr] [ptr addrspacecast (ptr addrspace(3) @var1 to ptr)], section "llvm.metadata"
@lds = addrspace(3) global i32 0
@b = addrspace(1) global ptr addrspace(3) @lds
So in your example, if @var1 is an LDS variable which is cloned, every global intrinsic containing @var1.clone.0 should be cloned. Right?
For the special case of llvm.used, yes. I think we need to think harder about the general initializer case
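For the llvm.used case specifically, LLVM ships helpers that could be combined along these lines. The following is a sketch under the assumption that a clone NewGV was just created for OldGV; it is not code from the patch, and the name mirrorUsedEntry is hypothetical:

#include "llvm/ADT/STLExtras.h"
#include "llvm/ADT/SmallVector.h"
#include "llvm/IR/Module.h"
#include "llvm/Transforms/Utils/ModuleUtils.h"

using namespace llvm;

// Sketch: if the original LDS variable is listed in @llvm.used, keep the
// clone alive the same way. collectUsedGlobalVariables reads the existing
// array; appendToUsed rebuilds @llvm.used with the extra entry.
static void mirrorUsedEntry(Module &M, GlobalVariable *OldGV,
                            GlobalVariable *NewGV) {
  SmallVector<GlobalValue *, 8> Used;
  collectUsedGlobalVariables(M, Used, /*CompilerUsed=*/false);
  if (is_contained(Used, OldGV))
    appendToUsed(M, {NewGV});
}

The general case of an arbitrary global initializer referencing the variable (such as @b in the example above) has no single obvious answer, which is presumably why the thread leaves it open.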
Force-pushed from 1fc2186 to 032f3d9.
✅ With the latest revision this PR passed the C/C++ code formatter.
The purpose of this pass is to ensure that the combined module contains as many LDS global variables as there are kernels that (indirectly) access them. As LDS variables behave like C++ static variables, it is important that each partition contains a unique copy of the variable on a per-kernel basis. This representation also prepares the combined module to eliminate cross-module dependencies of LDS variables.

This pass operates as follows:
1. Traverse the call graph from each kernel to determine the number of kernels calling each device function.
2. For each LDS global variable GV, determine the function F that defines it. Collect its caller functions. Clone F and GV, and finally insert a call/invoke instruction in each caller function.

Change-Id: I998291a389ea3db10de9122f08fe55c981da6049
I'm not sure what the relation is between this and #89245.
Force-pushed from 032f3d9 to af7fb7a.
This pass runs before (AMDGPU)SplitModule. Since every kernel will have its own specific LDS global struct, SplitModule should be able to form self-contained modules with comparatively fewer cross-module LDS references.
Also includes some minor nits. Change-Id: I94dd1845d9896e797f25ae7968a0d9b055f89161
Change-Id: Ib6e0db62bf15d8486618b5d549fbef0a150bff15
Is this still necessary?
The concept may still be necessary. I might implement it in LDS lowering or elsewhere, at some point in the future. We can close this PR for now.
Cloning functions can improve LDS allocation in exchange for degrading code density. Seems moderately unlikely to be a win overall to me. Better if there are other reasons to clone the function. Cloning LDS variables without cloning functions, prior to the existing lowering pass, will not improve codegen and stands a decent chance of making it worse.