
[AMDGPU][LTO] Introduce AMDGPUCloneModuleLDS #89683


Closed
wants to merge 3 commits into from

Conversation

gandhi56 (Contributor)

The purpose of this pass is to ensure that the combined module contains as many LDS global variables as there are kernels that (indirectly) access them. As LDS variables behave like C++ static variables, it is important that each partition contains a unique copy of the variable on a per-kernel basis. This representation also prepares the combined module to eliminate cross-module dependencies of LDS variables.

This pass operates as follows:

  1. First, traverse the call graph from each kernel to determine the number of kernels calling each device function.
  2. For each LDS global variable GV, determine the function F that defines it. Collect its caller functions. Clone F and GV, and finally insert a call/invoke instruction in each caller function.

@gandhi56 gandhi56 requested a review from Pierre-vh April 22, 2024 22:19
@gandhi56 gandhi56 self-assigned this Apr 22, 2024
@llvmbot (Member)

llvmbot commented Apr 22, 2024

@llvm/pr-subscribers-lto

@llvm/pr-subscribers-backend-amdgpu

Author: Anshil Gandhi (gandhi56)

Changes


Patch is 27.83 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/89683.diff

9 Files Affected:

  • (modified) llvm/lib/Target/AMDGPU/AMDGPU.h (+5)
  • (added) llvm/lib/Target/AMDGPU/AMDGPUCloneModuleLDS.cpp (+201)
  • (modified) llvm/lib/Target/AMDGPU/AMDGPUPassRegistry.def (+1)
  • (modified) llvm/lib/Target/AMDGPU/AMDGPUTargetMachine.cpp (+1)
  • (modified) llvm/lib/Target/AMDGPU/CMakeLists.txt (+1)
  • (added) llvm/test/tools/llvm-split/AMDGPU/clone-lds-function.ll (+59)
  • (added) llvm/test/tools/llvm-split/AMDGPU/clone-lds-functions-ancestor-kernels.ll (+107)
  • (added) llvm/test/tools/llvm-split/AMDGPU/clone-lds-functions-successors.ll (+134)
  • (added) llvm/test/tools/llvm-split/AMDGPU/clone-lds-struct-insts.ll (+104)
diff --git a/llvm/lib/Target/AMDGPU/AMDGPU.h b/llvm/lib/Target/AMDGPU/AMDGPU.h
index 6016bd5187d887..f913833a25119b 100644
--- a/llvm/lib/Target/AMDGPU/AMDGPU.h
+++ b/llvm/lib/Target/AMDGPU/AMDGPU.h
@@ -149,6 +149,11 @@ struct AMDGPULowerBufferFatPointersPass
   const TargetMachine &TM;
 };
 
+struct AMDGPUCloneModuleLDSPass
+    : public PassInfoMixin<AMDGPUCloneModuleLDSPass> {
+  PreservedAnalyses run(Module &, ModuleAnalysisManager &);
+};
+
 void initializeAMDGPURewriteOutArgumentsPass(PassRegistry &);
 extern char &AMDGPURewriteOutArgumentsID;
 
diff --git a/llvm/lib/Target/AMDGPU/AMDGPUCloneModuleLDS.cpp b/llvm/lib/Target/AMDGPU/AMDGPUCloneModuleLDS.cpp
new file mode 100644
index 00000000000000..2ab931818ea1eb
--- /dev/null
+++ b/llvm/lib/Target/AMDGPU/AMDGPUCloneModuleLDS.cpp
@@ -0,0 +1,201 @@
+//===-- AMDGPUCloneModuleLDS.cpp ----------------------------------*- C++ -*-=//
+//
+// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
+// See https://llvm.org/LICENSE.txt for license information.
+// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
+//
+//===----------------------------------------------------------------------===//
+//
+// The purpose of this pass is to ensure that the combined module contains
+// as many LDS global variables as there are kernels that (indirectly) access
+// them. As LDS variables behave like C++ static variables, it is important that
+// each partition contains a unique copy of the variable on a per kernel basis.
+// This representation also prepares the combined module to eliminate
+// cross-module dependencies of LDS variables.
+//
+// This pass operates as follows:
+// 1. First, traverse the call graph from each kernel to determine the number
+//    of kernels calling each device function.
+// 2. For each LDS global variable GV, determine the function F that defines it.
+//    Collect its caller functions. Clone F and GV, and finally insert a
+//    call/invoke instruction in each caller function.
+//
+//===----------------------------------------------------------------------===//
+
+#include "AMDGPU.h"
+#include "llvm/ADT/DepthFirstIterator.h"
+#include "llvm/Analysis/CallGraph.h"
+#include "llvm/Passes/PassBuilder.h"
+#include "llvm/Support/ScopedPrinter.h"
+#include "llvm/Transforms/Utils/Cloning.h"
+
+using namespace llvm;
+
+#define DEBUG_TYPE "amdgpu-clone-module-lds"
+
+static cl::opt<unsigned int> MaxCountForClonedFunctions(
+    "clone-lds-functions-max-count", cl::init(16), cl::Hidden,
+    cl::desc("Specify a limit to the number of clones of a function"));
+
+/// Return the function that defines \p GV
+/// \param GV The global variable in question
+/// \return The function defining \p GV
+static Function *getFunctionDefiningGV(GlobalVariable &GV) {
+  SmallVector<User *> Worklist(GV.users());
+  while (!Worklist.empty()) {
+    User *U = Worklist.pop_back_val();
+    if (auto *Inst = dyn_cast<Instruction>(U))
+      return Inst->getFunction();
+    if (auto *Op = dyn_cast<Operator>(U))
+      append_range(Worklist, Op->users());
+  }
+  llvm_unreachable("GV must be used in a function.");
+};
+
+/// Replace all references to \p OldGV in \p NewF with \p NewGV
+/// \param OldGV The global variable to be replaced
+/// \param NewGV The global variable taking the place of \p OldGV
+/// \param NewF The function in which the replacement occurs
+static void replaceUsesOfWith(GlobalVariable *OldGV, GlobalVariable *NewGV,
+                              Function *NewF) {
+  // ReplaceOperatorUses takes an instruction Inst, which is assumed to
+  // contain OldGV as an operator operand, inserts an instruction
+  // corresponding to that operand, and updates Inst accordingly.
+  auto ReplaceOperatorUses = [&OldGV, &NewGV](Instruction *Inst) {
+    Inst->replaceUsesOfWith(OldGV, NewGV);
+    SmallVector<Value *, 8> Worklist(Inst->operands());
+    while (!Worklist.empty()) {
+      auto *V = Worklist.pop_back_val();
+      if (auto *I = dyn_cast<AddrSpaceCastOperator>(V)) {
+        auto *Cast = new AddrSpaceCastInst(NewGV, I->getType(), "", Inst);
+        Inst->replaceUsesOfWith(I, Cast);
+      } else if (auto *I = dyn_cast<GEPOperator>(V)) {
+        SmallVector<Value *, 8> Indices(I->indices());
+        auto *GEP = GetElementPtrInst::Create(NewGV->getValueType(), NewGV,
+                                              Indices, "", Inst);
+        Inst->replaceUsesOfWith(I, GEP);
+      }
+    }
+  };
+
+  // Collect all user instructions of OldGV using a Worklist algorithm.
+  // If a user is an operator, collect the instruction wrapping
+  // the operator.
+  SmallVector<Instruction *, 8> InstsToReplace;
+  SmallVector<User *, 8> UsersWorklist(OldGV->users());
+  while (!UsersWorklist.empty()) {
+    auto *U = UsersWorklist.pop_back_val();
+    if (auto *Inst = dyn_cast<Instruction>(U)) {
+      InstsToReplace.push_back(Inst);
+    } else if (auto *Op = dyn_cast<Operator>(U)) {
+      append_range(UsersWorklist, Op->users());
+    }
+  }
+
+  // Replace all occurrences of OldGV in NewF
+  DenseSet<Instruction *> ReplacedInsts;
+  while (!InstsToReplace.empty()) {
+    auto *Inst = InstsToReplace.pop_back_val();
+    if (Inst->getFunction() != NewF || ReplacedInsts.contains(Inst))
+      continue;
+    ReplaceOperatorUses(Inst);
+    ReplacedInsts.insert(Inst);
+  }
+};
+
+PreservedAnalyses AMDGPUCloneModuleLDSPass::run(Module &M,
+                                                ModuleAnalysisManager &AM) {
+  if (MaxCountForClonedFunctions.getValue() == 1)
+    return PreservedAnalyses::all();
+
+  bool Changed = false;
+  auto &CG = AM.getResult<CallGraphAnalysis>(M);
+
+  // For each function in the call graph, determine the number
+  // of ancestor-caller kernels.
+  DenseMap<Function *, unsigned int> KernelRefsToFuncs;
+  for (auto &Fn : M) {
+    if (Fn.getCallingConv() != CallingConv::AMDGPU_KERNEL)
+      continue;
+    for (auto I = df_begin(&CG), E = df_end(&CG); I != E; ++I)
+      if (auto *F = I->getFunction())
+        KernelRefsToFuncs[F]++;
+  }
+
+  DenseMap<GlobalVariable *, Function *> GVToFnMap;
+  LLVMContext &Ctx = M.getContext();
+  IRBuilder<> IRB(Ctx);
+  for (auto &GV : M.globals()) {
+    if (GVToFnMap.contains(&GV) ||
+        GV.getType()->getPointerAddressSpace() != AMDGPUAS::LOCAL_ADDRESS ||
+        !GV.hasInitializer())
+      continue;
+
+    auto *OldF = getFunctionDefiningGV(GV);
+    GVToFnMap.insert({&GV, OldF});
+    LLVM_DEBUG(dbgs() << "Found LDS " << GV.getName() << " used in function "
+                      << OldF->getName() << '\n');
+
+    // Collect all caller functions of OldF. Each of them must call its
+    // corresponding clone of OldF.
+    SmallVector<std::pair<Instruction *, SmallVector<Value *>>>
+        InstsCallingOldF;
+    for (auto &I : OldF->uses()) {
+      User *U = I.getUser();
+      SmallVector<Value *> Args;
+      if (auto *CI = dyn_cast<CallInst>(U)) {
+        append_range(Args, CI->args());
+        InstsCallingOldF.push_back({CI, Args});
+      } else if (auto *II = dyn_cast<InvokeInst>(U)) {
+        append_range(Args, II->args());
+        InstsCallingOldF.push_back({II, Args});
+      }
+    }
+
+    // Create as many clones of the function containing LDS global as
+    // there are kernels calling the function (including the function
+    // already defining the LDS global). Respectively, clone the
+    // LDS global and the call instructions to the function.
+    LLVM_DEBUG(dbgs() << "\tFunction is referenced by "
+                      << KernelRefsToFuncs[OldF] << " kernels.\n");
+    for (unsigned int ID = 0;
+         ID + 1 < std::min(KernelRefsToFuncs[OldF],
+                           MaxCountForClonedFunctions.getValue());
+         ++ID) {
+      // Clone function
+      ValueToValueMapTy VMap;
+      auto *NewF = CloneFunction(OldF, VMap);
+      NewF->setName(OldF->getName() + ".clone." + to_string(ID));
+      LLVM_DEBUG(dbgs() << "Inserting function clone with name "
+                        << NewF->getName() << '\n');
+
+      // Clone LDS global variable
+      auto *NewGV = new GlobalVariable(
+          M, GV.getValueType(), GV.isConstant(), GlobalValue::InternalLinkage,
+          UndefValue::get(GV.getValueType()),
+          GV.getName() + ".clone." + to_string(ID), &GV,
+          GlobalValue::NotThreadLocal, AMDGPUAS::LOCAL_ADDRESS, false);
+      NewGV->copyAttributesFrom(&GV);
+      NewGV->copyMetadata(&GV, 0);
+      NewGV->setComdat(GV.getComdat());
+      replaceUsesOfWith(&GV, NewGV, NewF);
+      LLVM_DEBUG(dbgs() << "Inserting LDS clone with name " << NewGV->getName()
+                        << "\n");
+
+      // Create a new CallInst to call the cloned function
+      for (auto [Inst, Args] : InstsCallingOldF) {
+        IRB.SetInsertPoint(Inst);
+        Instruction *I;
+        if (isa<CallInst>(Inst))
+          I = IRB.CreateCall(NewF, Args,
+                             Inst->getName() + ".clone." + to_string(ID));
+        else if (auto *II = dyn_cast<InvokeInst>(Inst))
+          I = IRB.CreateInvoke(NewF, II->getNormalDest(), II->getUnwindDest(),
+                               Args, II->getName() + ".clone" + to_string(ID));
+        LLVM_DEBUG(dbgs() << "Inserting inst: " << *I << '\n');
+      }
+      Changed = true;
+    }
+  }
+  return Changed ? PreservedAnalyses::none() : PreservedAnalyses::all();
+}
diff --git a/llvm/lib/Target/AMDGPU/AMDGPUPassRegistry.def b/llvm/lib/Target/AMDGPU/AMDGPUPassRegistry.def
index 90f36fadf35903..eb4bf25fef628a 100644
--- a/llvm/lib/Target/AMDGPU/AMDGPUPassRegistry.def
+++ b/llvm/lib/Target/AMDGPU/AMDGPUPassRegistry.def
@@ -22,6 +22,7 @@ MODULE_PASS("amdgpu-lower-buffer-fat-pointers",
             AMDGPULowerBufferFatPointersPass(*this))
 MODULE_PASS("amdgpu-lower-ctor-dtor", AMDGPUCtorDtorLoweringPass())
 MODULE_PASS("amdgpu-lower-module-lds", AMDGPULowerModuleLDSPass(*this))
+MODULE_PASS("amdgpu-clone-module-lds", AMDGPUCloneModuleLDSPass())
 MODULE_PASS("amdgpu-printf-runtime-binding", AMDGPUPrintfRuntimeBindingPass())
 MODULE_PASS("amdgpu-unify-metadata", AMDGPUUnifyMetadataPass())
 #undef MODULE_PASS
diff --git a/llvm/lib/Target/AMDGPU/AMDGPUTargetMachine.cpp b/llvm/lib/Target/AMDGPU/AMDGPUTargetMachine.cpp
index 305a6c8c3b9262..09beabd3f9c55c 100644
--- a/llvm/lib/Target/AMDGPU/AMDGPUTargetMachine.cpp
+++ b/llvm/lib/Target/AMDGPU/AMDGPUTargetMachine.cpp
@@ -725,6 +725,7 @@ void AMDGPUTargetMachine::registerPassBuilderCallbacks(
         // We want to support the -lto-partitions=N option as "best effort".
         // For that, we need to lower LDS earlier in the pipeline before the
         // module is partitioned for codegen.
+        PM.addPass(AMDGPUCloneModuleLDSPass());
         if (EnableLowerModuleLDS)
           PM.addPass(AMDGPULowerModuleLDSPass(*this));
       });
diff --git a/llvm/lib/Target/AMDGPU/CMakeLists.txt b/llvm/lib/Target/AMDGPU/CMakeLists.txt
index 48325a0928f93d..fbf59e0422cb79 100644
--- a/llvm/lib/Target/AMDGPU/CMakeLists.txt
+++ b/llvm/lib/Target/AMDGPU/CMakeLists.txt
@@ -50,6 +50,7 @@ add_llvm_target(AMDGPUCodeGen
   AMDGPUAtomicOptimizer.cpp
   AMDGPUAttributor.cpp
   AMDGPUCallLowering.cpp
+  AMDGPUCloneModuleLDS.cpp
   AMDGPUCodeGenPrepare.cpp
   AMDGPUCombinerHelper.cpp
   AMDGPUCtorDtorLowering.cpp
diff --git a/llvm/test/tools/llvm-split/AMDGPU/clone-lds-function.ll b/llvm/test/tools/llvm-split/AMDGPU/clone-lds-function.ll
new file mode 100644
index 00000000000000..ed4a4434d74b04
--- /dev/null
+++ b/llvm/test/tools/llvm-split/AMDGPU/clone-lds-function.ll
@@ -0,0 +1,59 @@
+; RUN: opt -passes=amdgpu-clone-module-lds %s -S | FileCheck %s
+
+target triple = "amdgcn-amd-amdhsa"
+
+; In this example, the CloneModuleLDS pass creates two copies of LDS_GV
+; as two kernels call the same device function where LDS_GV is used.
+
+; CHECK: [[LDS_GV_CLONE:@.*\.clone\.0]] = internal unnamed_addr addrspace(3) global [64 x i32] undef, align 16
+; CHECK: [[LDS_GV:@.*]] = internal unnamed_addr addrspace(3) global [64 x i32] undef, align 16
+@lds_gv = internal unnamed_addr addrspace(3) global [64 x i32] undef, align 16
+
+define protected amdgpu_kernel void @kernel1(i32 %n) #3 {
+; CHECK-LABEL: define protected amdgpu_kernel void @kernel1(
+; CHECK-SAME: i32 [[N:%.*]]) {
+; CHECK-NEXT:  entry:
+; CHECK-NEXT:    [[CALL_CLONE_0:%.*]] = call i32 @lds_func.clone.0(i32 [[N]])
+; CHECK-NEXT:    [[CALL:%.*]] = call i32 @lds_func(i32 [[N]])
+; CHECK-NEXT:    ret void
+;
+entry:
+  %call = call i32 @lds_func(i32 %n)
+  ret void
+}
+
+define protected amdgpu_kernel void @kernel2(i32 %n) #3 {
+; CHECK-LABEL: define protected amdgpu_kernel void @kernel2(
+; CHECK-SAME: i32 [[N:%.*]]) {
+; CHECK-NEXT:  entry:
+; CHECK-NEXT:    [[CALL_CLONE_0:%.*]] = call i32 @lds_func.clone.0(i32 [[N]])
+; CHECK-NEXT:    [[CALL:%.*]] = call i32 @lds_func(i32 [[N]])
+; CHECK-NEXT:    ret void
+;
+entry:
+  %call = call i32 @lds_func(i32 %n)
+  ret void
+}
+
+
+define i32 @lds_func(i32 %x) {
+; CHECK-LABEL: define i32 @lds_func(
+; CHECK-SAME: i32 [[X:%.*]]) {
+; CHECK-NEXT:  entry:
+; CHECK-NEXT:    [[P:%.*]] = getelementptr inbounds [64 x i32], ptr addrspacecast (ptr addrspace(3) [[LDS_GV]] to ptr), i64 0, i64 0
+; CHECK-NEXT:    store i32 [[X]], ptr [[P]], align 4
+; CHECK-NEXT:    ret i32 [[X]]
+;
+entry:
+  %p = getelementptr inbounds [64 x i32], ptr addrspacecast (ptr addrspace(3) @lds_gv to ptr), i64 0, i64 0
+  store i32 %x, ptr %p
+  ret i32 %x
+}
+
+; CHECK-LABEL: define i32 @lds_func.clone.0(i32 %x) {
+; CHECK-NEXT: entry:
+; CHECK-NEXT:   %0 = addrspacecast ptr addrspace(3) [[LDS_GV_CLONE]] to ptr
+; CHECK-NEXT:   %p = getelementptr inbounds [64 x i32], ptr %0, i64 0, i64 0
+; CHECK-NEXT:   store i32 %x, ptr %p, align 4
+; CHECK-NEXT:   ret i32 %x
+
diff --git a/llvm/test/tools/llvm-split/AMDGPU/clone-lds-functions-ancestor-kernels.ll b/llvm/test/tools/llvm-split/AMDGPU/clone-lds-functions-ancestor-kernels.ll
new file mode 100644
index 00000000000000..1efb1d12db924c
--- /dev/null
+++ b/llvm/test/tools/llvm-split/AMDGPU/clone-lds-functions-ancestor-kernels.ll
@@ -0,0 +1,107 @@
+; NOTE: Assertions have been autogenerated by utils/update_test_checks.py UTC_ARGS: --version 4
+; RUN: opt -passes=amdgpu-clone-module-lds %s -S | FileCheck %s
+
+target triple = "amdgcn-amd-amdhsa"
+
+; Before transformation,                    After transformation,
+;  K1  K2                                    K1  K2
+;  |  /                                      |  /
+;  | /                                       | /
+;  A                         ==>             A
+;  | \                                       | \
+;  |  \                                      |  \
+;  B   C                                     B   C
+;  |                                         | \
+;  X                                         X1 X2
+;
+; where X contains an LDS reference
+
+; CHECK: [[GV_CLONE:@.*]] = internal unnamed_addr addrspace(3) global [64 x i32] undef, align 16
+; CHECK: [[GV:@.*]] = internal unnamed_addr addrspace(3) global [64 x i32] undef, align 16
+@lds_gv = internal unnamed_addr addrspace(3) global [64 x i32] undef, align 16
+
+define protected amdgpu_kernel void @kernel1(i32 %n) #3 {
+; CHECK-LABEL: define protected amdgpu_kernel void @kernel1(
+; CHECK-SAME: i32 [[N:%.*]]) {
+; CHECK-NEXT:  entry:
+; CHECK-NEXT:    [[CALL:%.*]] = call i32 @A(i32 [[N]])
+; CHECK-NEXT:    ret void
+;
+entry:
+  %call = call i32 @A(i32 %n)
+  ret void
+}
+
+define protected amdgpu_kernel void @kernel2(i32 %n) #3 {
+; CHECK-LABEL: define protected amdgpu_kernel void @kernel2(
+; CHECK-SAME: i32 [[N:%.*]]) {
+; CHECK-NEXT:  entry:
+; CHECK-NEXT:    [[CALL:%.*]] = call i32 @A(i32 [[N]])
+; CHECK-NEXT:    ret void
+;
+entry:
+  %call = call i32 @A(i32 %n)
+  ret void
+}
+
+define void @A() {
+; CHECK-LABEL: define void @A() {
+; CHECK-NEXT:  entry:
+; CHECK-NEXT:    call void @B()
+; CHECK-NEXT:    call void @C()
+; CHECK-NEXT:    ret void
+;
+entry:
+  call void @B()
+  call void @C()
+  ret void
+}
+
+define i32 @B() {
+; CHECK-LABEL: define i32 @B() {
+; CHECK-NEXT:  entry:
+; CHECK-NEXT:    [[P:%.*]] = alloca i32, align 4
+; CHECK-NEXT:    store i32 5, ptr [[P]], align 4
+; CHECK-NEXT:    [[RET_CLONE_0:%.*]] = call i32 @X.clone.0(ptr [[P]])
+; CHECK-NEXT:    [[RET:%.*]] = call i32 @X(ptr [[P]])
+; CHECK-NEXT:    ret i32 [[RET]]
+;
+entry:
+  %p = alloca i32
+  store i32 5, ptr %p
+  %ret = call i32 @X(ptr %p)
+  ret i32 %ret
+}
+
+define void @C() {
+; CHECK-LABEL: define void @C() {
+; CHECK-NEXT:  entry:
+; CHECK-NEXT:    ret void
+;
+entry:
+  ret void
+}
+
+define i32 @X(ptr %x) {
+; CHECK-LABEL: define i32 @X(
+; CHECK-SAME: ptr [[X:%.*]]) {
+; CHECK-NEXT:  entry:
+; CHECK-NEXT:    [[P:%.*]] = getelementptr inbounds [64 x i32], ptr addrspacecast (ptr addrspace(3) [[GV]] to ptr), i64 0, i64 0
+; CHECK-NEXT:    [[V:%.*]] = load i32, ptr [[X]], align 4
+; CHECK-NEXT:    store i32 [[V]], ptr [[P]], align 4
+; CHECK-NEXT:    ret i32 [[V]]
+;
+entry:
+  %p = getelementptr inbounds [64 x i32], ptr addrspacecast (ptr addrspace(3) @lds_gv to ptr), i64 0, i64 0
+  %v = load i32, ptr %x
+  store i32 %v, ptr %p
+  ret i32 %v
+}
+
+; CHECK-LABEL: define i32 @X.clone.0(ptr %x) {
+; CHECK-NEXT: entry:
+; CHECK-NEXT:   %0 = addrspacecast ptr addrspace(3) [[GV_CLONE]] to ptr
+; CHECK-NEXT:   %p = getelementptr inbounds [64 x i32], ptr %0, i64 0, i64 0
+; CHECK-NEXT:   %v = load i32, ptr %x, align 4
+; CHECK-NEXT:   store i32 %v, ptr %p, align 4
+; CHECK-NEXT:   ret i32 %v
diff --git a/llvm/test/tools/llvm-split/AMDGPU/clone-lds-functions-successors.ll b/llvm/test/tools/llvm-split/AMDGPU/clone-lds-functions-successors.ll
new file mode 100644
index 00000000000000..7f27a630b7d850
--- /dev/null
+++ b/llvm/test/tools/llvm-split/AMDGPU/clone-lds-functions-successors.ll
@@ -0,0 +1,134 @@
+; NOTE: Assertions have been autogenerated by utils/update_test_checks.py UTC_ARGS: --version 4
+; RUN: opt -passes=amdgpu-clone-module-lds %s -S | FileCheck %s
+
+target triple = "amdgcn-amd-amdhsa"
+
+; Before transformation,                    After transformation,
+;  K1  K2    K3                              K1  K2    K3
+;  |  /      |                               |  /      |
+;  | /       |                               | /       |
+;  A --------+               ==>             A --------+
+;  |                                         |
+;  |                                         |
+;  B                                         B
+;  |                                       / | \
+;  X                                      X1 X2 X3
+;  |                                      \  |  /
+;  D                                       \ | /
+;                                            D
+; where X contains an LDS reference
+
+; CHECK: [[GV_CLONE_0:@.*]] = internal unnamed_addr addrspace(3) global [64 x i32] undef, align 16
+; CHECK: [[GV_CLONE_1:@.*]] = internal unnamed_addr addrspace(3) global [64 x i32] undef, align 16
+; CHECK: [[GV:@.*]] = internal unnamed_addr addrspace(3) global [64 x i32] undef, align 16
+@lds_gv = internal unnamed_addr addrspace(3) global [64 x i32] undef, align 16
+
+define protected amdgpu_kernel void @kernel1(i32 %n) {
+; CHECK-LABEL: define protected amdgpu_kernel void @kernel1(
+; CHECK-SAME: i32 [[N:%.*]]) {
+; CHECK-NEXT:  entry:
+; CHECK-NEXT:    [[CALL:%.*]] = call i32 @A(i32 [[N]])
+; CHECK-NEXT:    ret void
+;
+entry:
+  %call = call i32 @A(i32 %n)
+  ret void
+}
+
+define protected amdgpu_kernel void @kernel2(i32 %n) {
+; CHECK-LABEL: define protected amdgpu_kernel void @kernel2(
+; CHECK-SAME: i32 [[N:%.*]]) {
+; CHECK-NEXT:  entry:
+; CHECK-NEXT:    [[CALL:%.*]] = call i32 @A(i32 [[N]])
+; CHECK-NEXT:    ret void
+;
+entry:
+  %call = call i32 @A(i32 %n)
+  ret void
+}
+
+define protected amdgpu_kernel void @kernel3(i32 %n) {
+; CHECK-LABEL: define protected amdgpu_kernel void @kernel3(
+; CHECK-SAME: i32 [[N:%.*]]) {
+; CHECK-NEXT:  entry:
+; CHECK-NEXT:    [[CALL:%.*]] = call i32 @A(i32 [[N]])
+; CHECK-NEXT:    ret void
+;
+entry:
+  %call = call i32 @A(i32 %n)
+  ret void
+}
+
+define void @A() {
+; CHECK-LABEL: define void @A() {
+; CHECK-NEXT:  entry:
+; CHECK-NEXT:    call void @B()
+; CHECK-NEXT:    ret void
+;
+entry:
+  call void @B()
+  ret void
+}
+
+define i32 @B() {
+; CHECK-LABEL: define i32 @B() {
+; CHECK-NEXT:  entry:
+; CHECK-NEXT:    [[P:%.*]] = alloca i32, align 4
+; CHECK-NEXT:    store i32 5, ptr [[P]], align 4
+; CHECK-NEXT:    [[RET_CLONE_0:%.*]] = call i32 @X.clone.0(ptr [[P]])
+...
[truncated]

@gandhi56 gandhi56 added the LTO Link time optimization (regular/full LTO or ThinLTO) label Apr 22, 2024
@gandhi56 gandhi56 requested review from arsenm, scchan and jdoerfert April 22, 2024 22:20
if (auto *Op = dyn_cast<Operator>(U))
append_range(Worklist, Op->users());
}
llvm_unreachable("GV must be used in a function.");
Contributor:

This can't be unreachable and you also need to handle initializer references (in particular llvm.used)

Contributor Author:

getFunctionDefiningGV is only called when GV is an LDS variable. Is it possible to define an intrinsic LDS global variable using @llvm.used or @llvm.global_ctors?

Contributor:

Yes. LDS globals aren't special, they are just globals

Contributor Author:

isLDSVariableToLower(GV) returns false when GV is a global intrinsic. I don't think they need to be cloned as they simply bundle a collection of GVs - which may need to be cloned if they are LDS.

Comment on lines 139 to 142
// Collect all caller functions of OldF. Each of them must call its
// corresponding clone of OldF.
SmallVector<std::pair<Instruction *, SmallVector<Value *>>>
InstsCallingOldF;
Contributor:

I think the Cloning utilities should handle this with the VMap

Contributor Author:

InstsCallingOldF collects all call instructions to the function defining the LDS variable. These instructions are later cloned within their parent basic blocks. I am not sure we want to clone these instructions when the callee function is cloned.

Contributor:

I'm confused. If you're cloning the entire function, you must clone all of the instructions in the function?

Contributor Author:

Yes, all the instructions within the function are cloned upon calling CloneFunction. Calls/invokes to the function need to be handled separately, however, as they aren't cloned along with the function.

Contributor:

But this doesn't make sense, the instructions have to be cloned. And I would expect the value map to remap the references to the called function

Contributor Author:

If the function to be cloned is called F, are you saying that the call/invoke instructions to F should be cloned via VMap?

InstsToReplace.push_back(Inst);
} else if (auto *Op = dyn_cast<Operator>(U)) {
append_range(UsersWorklist, Op->users());
}
Contributor:

need to handle all cases. this will miss aliases and global initializers

Contributor Author:

Sorry, I am not sure what you mean by handling global initializers.

Contributor Author:

If by global initializers you mean global ctors, isLDSVariableToLower(GV) already skips cloning them as intended. I am handling GlobalAlias now in the latest revision of this patch.

Contributor:

I mean you can initialize a global variable to contain a pointer value (e.g. in llvm.used). Instructions / Operators are not the only possible users:

@llvm.used = appending global [1 x ptr] [ptr addrspacecast (ptr addrspace(3) @var1 to ptr)], section "llvm.metadata"
@lds = addrspace(3) global i32 0
@b = addrspace(1) global ptr addrspace(3) @lds

Contributor Author:

So in your example, if @var1 is an LDS which is cloned, every global intrinsic containing @var1.clone.0 should be cloned. Right?

Contributor:

For the special case of llvm.used, yes. I think we need to think harder about the general initializer case

@gandhi56 gandhi56 force-pushed the amdgpu-clone-module-lds branch from 1fc2186 to 032f3d9 on April 24, 2024 06:42

github-actions bot commented Apr 24, 2024

✅ With the latest revision this PR passed the C/C++ code formatter.

@arsenm (Contributor)

arsenm commented May 7, 2024

I'm not sure what the relation is between this and #89245

@gandhi56 gandhi56 force-pushed the amdgpu-clone-module-lds branch from 032f3d9 to af7fb7a on May 7, 2024 16:40
@gandhi56 (Contributor Author)

gandhi56 commented May 7, 2024

I'm not sure what the relation is between this and #89245

This pass runs before (AMDGPU)SplitModule. Since every kernel will have its own specific LDS global struct, SplitModule should be able to form self-contained modules with comparatively fewer cross-module LDS references.

gandhi56 added 2 commits May 9, 2024 06:59
Also includes some minor nits

@arsenm (Contributor)

arsenm commented Aug 16, 2024

Is this still necessary?

@gandhi56 (Contributor Author)

Is this still necessary?

The concept may still be necessary. I might implement it in LDS lowering or elsewhere, at some point in the future. We can close this PR for now.

@gandhi56 gandhi56 closed this Aug 21, 2024
@JonChesterfield (Collaborator)

Cloning functions can improve LDS allocation in exchange for degrading code density. Seems moderately unlikely to be a win overall to me. Better if there are other reasons to clone the function.

Cloning LDS variables without cloning functions, prior to the existing lowering pass, will not improve codegen and stands a decent chance of making it worse.

Labels: backend:AMDGPU; LTO (Link time optimization, regular/full LTO or ThinLTO)