
[AMDGPU][LTO] Introduce AMDGPUCloneModuleLDS #89683


Closed
wants to merge 3 commits into from

Conversation

gandhi56 (Contributor)

The purpose of this pass is to ensure that the combined module contains as many LDS global variables as there are kernels that (indirectly) access them. As LDS variables behave like C++ static variables, it is important that each partition contains a unique copy of the variable on a per-kernel basis. This representation also prepares the combined module to eliminate cross-module dependencies of LDS variables.

This pass operates as follows:

  1. First, traverse the call graph from each kernel to determine the number of kernels calling each device function.
  2. For each LDS global variable GV, determine the function F that defines it. Collect its caller functions. Clone F and GV, and finally insert a call/invoke instruction in each caller function.

@gandhi56 gandhi56 requested a review from Pierre-vh April 22, 2024 22:19
@gandhi56 gandhi56 self-assigned this Apr 22, 2024
@llvmbot (Member)

llvmbot commented Apr 22, 2024

@llvm/pr-subscribers-lto

@llvm/pr-subscribers-backend-amdgpu

Author: Anshil Gandhi (gandhi56)

Changes


Patch is 27.83 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/89683.diff

9 Files Affected:

  • (modified) llvm/lib/Target/AMDGPU/AMDGPU.h (+5)
  • (added) llvm/lib/Target/AMDGPU/AMDGPUCloneModuleLDS.cpp (+201)
  • (modified) llvm/lib/Target/AMDGPU/AMDGPUPassRegistry.def (+1)
  • (modified) llvm/lib/Target/AMDGPU/AMDGPUTargetMachine.cpp (+1)
  • (modified) llvm/lib/Target/AMDGPU/CMakeLists.txt (+1)
  • (added) llvm/test/tools/llvm-split/AMDGPU/clone-lds-function.ll (+59)
  • (added) llvm/test/tools/llvm-split/AMDGPU/clone-lds-functions-ancestor-kernels.ll (+107)
  • (added) llvm/test/tools/llvm-split/AMDGPU/clone-lds-functions-successors.ll (+134)
  • (added) llvm/test/tools/llvm-split/AMDGPU/clone-lds-struct-insts.ll (+104)
diff --git a/llvm/lib/Target/AMDGPU/AMDGPU.h b/llvm/lib/Target/AMDGPU/AMDGPU.h
index 6016bd5187d887..f913833a25119b 100644
--- a/llvm/lib/Target/AMDGPU/AMDGPU.h
+++ b/llvm/lib/Target/AMDGPU/AMDGPU.h
@@ -149,6 +149,11 @@ struct AMDGPULowerBufferFatPointersPass
   const TargetMachine &TM;
 };
 
+struct AMDGPUCloneModuleLDSPass
+    : public PassInfoMixin<AMDGPUCloneModuleLDSPass> {
+  PreservedAnalyses run(Module &, ModuleAnalysisManager &);
+};
+
 void initializeAMDGPURewriteOutArgumentsPass(PassRegistry &);
 extern char &AMDGPURewriteOutArgumentsID;
 
diff --git a/llvm/lib/Target/AMDGPU/AMDGPUCloneModuleLDS.cpp b/llvm/lib/Target/AMDGPU/AMDGPUCloneModuleLDS.cpp
new file mode 100644
index 00000000000000..2ab931818ea1eb
--- /dev/null
+++ b/llvm/lib/Target/AMDGPU/AMDGPUCloneModuleLDS.cpp
@@ -0,0 +1,201 @@
+//===-- AMDGPUCloneModuleLDS.cpp ----------------------------------*- C++ -*-=//
+//
+// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
+// See https://llvm.org/LICENSE.txt for license information.
+// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
+//
+//===----------------------------------------------------------------------===//
+//
+// The purpose of this pass is to ensure that the combined module contains
+// as many LDS global variables as there are kernels that (indirectly) access
+// them. As LDS variables behave like C++ static variables, it is important that
+// each partition contains a unique copy of the variable on a per kernel basis.
+// This representation also prepares the combined module to eliminate
+// cross-module dependencies of LDS variables.
+//
+// This pass operates as follows:
+// 1. First, traverse the call graph from each kernel to determine the number
+//    of kernels calling each device function.
+// 2. For each LDS global variable GV, determine the function F that defines it.
+//    Collect its caller functions. Clone F and GV, and finally insert a
+//    call/invoke instruction in each caller function.
+//
+//===----------------------------------------------------------------------===//
+
+#include "AMDGPU.h"
+#include "llvm/ADT/DepthFirstIterator.h"
+#include "llvm/Analysis/CallGraph.h"
+#include "llvm/Passes/PassBuilder.h"
+#include "llvm/Support/ScopedPrinter.h"
+#include "llvm/Transforms/Utils/Cloning.h"
+
+using namespace llvm;
+
+#define DEBUG_TYPE "amdgpu-clone-module-lds"
+
+static cl::opt<unsigned int> MaxCountForClonedFunctions(
+    "clone-lds-functions-max-count", cl::init(16), cl::Hidden,
+    cl::desc("Specify a limit to the number of clones of a function"));
+
+/// Return the function that defines \p GV
+/// \param GV The global variable in question
+/// \return The function defining \p GV
+static Function *getFunctionDefiningGV(GlobalVariable &GV) {
+  SmallVector<User *> Worklist(GV.users());
+  while (!Worklist.empty()) {
+    User *U = Worklist.pop_back_val();
+    if (auto *Inst = dyn_cast<Instruction>(U))
+      return Inst->getFunction();
+    if (auto *Op = dyn_cast<Operator>(U))
+      append_range(Worklist, Op->users());
+  }
+  llvm_unreachable("GV must be used in a function.");
+};
+
+/// Replace all references to \p OldGV in \p NewF with \p NewGV
+/// \param OldGV The global variable to be replaced
+/// \param NewGV The global variable taking the place of \p OldGV
+/// \param NewF The function in which the replacement occurs
+static void replaceUsesOfWith(GlobalVariable *OldGV, GlobalVariable *NewGV,
+                              Function *NewF) {
+  // ReplaceOperatorUses takes an instruction Inst, which is assumed to
+  // contain OldGV as an operator operand, inserts an instruction
+  // corresponding to that operand, and updates Inst accordingly.
+  auto ReplaceOperatorUses = [&OldGV, &NewGV](Instruction *Inst) {
+    Inst->replaceUsesOfWith(OldGV, NewGV);
+    SmallVector<Value *, 8> Worklist(Inst->operands());
+    while (!Worklist.empty()) {
+      auto *V = Worklist.pop_back_val();
+      if (auto *I = dyn_cast<AddrSpaceCastOperator>(V)) {
+        auto *Cast = new AddrSpaceCastInst(NewGV, I->getType(), "", Inst);
+        Inst->replaceUsesOfWith(I, Cast);
+      } else if (auto *I = dyn_cast<GEPOperator>(V)) {
+        SmallVector<Value *, 8> Indices(I->indices());
+        auto *GEP = GetElementPtrInst::Create(NewGV->getValueType(), NewGV,
+                                              Indices, "", Inst);
+        Inst->replaceUsesOfWith(I, GEP);
+      }
+    }
+  };
+
+  // Collect all user instructions of OldGV using a Worklist algorithm.
+  // If a user is an operator, collect the instruction wrapping
+  // the operator.
+  SmallVector<Instruction *, 8> InstsToReplace;
+  SmallVector<User *, 8> UsersWorklist(OldGV->users());
+  while (!UsersWorklist.empty()) {
+    auto *U = UsersWorklist.pop_back_val();
+    if (auto *Inst = dyn_cast<Instruction>(U)) {
+      InstsToReplace.push_back(Inst);
+    } else if (auto *Op = dyn_cast<Operator>(U)) {
+      append_range(UsersWorklist, Op->users());
+    }
+  }
+
+  // Replace all occurrences of OldGV in NewF
+  DenseSet<Instruction *> ReplacedInsts;
+  while (!InstsToReplace.empty()) {
+    auto *Inst = InstsToReplace.pop_back_val();
+    if (Inst->getFunction() != NewF || ReplacedInsts.contains(Inst))
+      continue;
+    ReplaceOperatorUses(Inst);
+    ReplacedInsts.insert(Inst);
+  }
+};
+
+PreservedAnalyses AMDGPUCloneModuleLDSPass::run(Module &M,
+                                                ModuleAnalysisManager &AM) {
+  if (MaxCountForClonedFunctions.getValue() == 1)
+    return PreservedAnalyses::all();
+
+  bool Changed = false;
+  auto &CG = AM.getResult<CallGraphAnalysis>(M);
+
+  // For each function in the call graph, determine the number
+  // of ancestor-caller kernels.
+  DenseMap<Function *, unsigned int> KernelRefsToFuncs;
+  for (auto &Fn : M) {
+    if (Fn.getCallingConv() != CallingConv::AMDGPU_KERNEL)
+      continue;
+    for (auto I = df_begin(&CG), E = df_end(&CG); I != E; ++I)
+      if (auto *F = I->getFunction())
+        KernelRefsToFuncs[F]++;
+  }
+
+  DenseMap<GlobalVariable *, Function *> GVToFnMap;
+  LLVMContext &Ctx = M.getContext();
+  IRBuilder<> IRB(Ctx);
+  for (auto &GV : M.globals()) {
+    if (GVToFnMap.contains(&GV) ||
+        GV.getType()->getPointerAddressSpace() != AMDGPUAS::LOCAL_ADDRESS ||
+        !GV.hasInitializer())
+      continue;
+
+    auto *OldF = getFunctionDefiningGV(GV);
+    GVToFnMap.insert({&GV, OldF});
+    LLVM_DEBUG(dbgs() << "Found LDS " << GV.getName() << " used in function "
+                      << OldF->getName() << '\n');
+
+    // Collect all caller functions of OldF. Each of them must call its
+    // corresponding clone of OldF.
+    SmallVector<std::pair<Instruction *, SmallVector<Value *>>>
+        InstsCallingOldF;
+    for (auto &I : OldF->uses()) {
+      User *U = I.getUser();
+      SmallVector<Value *> Args;
+      if (auto *CI = dyn_cast<CallInst>(U)) {
+        append_range(Args, CI->args());
+        InstsCallingOldF.push_back({CI, Args});
+      } else if (auto *II = dyn_cast<InvokeInst>(U)) {
+        append_range(Args, II->args());
+        InstsCallingOldF.push_back({II, Args});
+      }
+    }
+
+    // Create as many clones of the function containing LDS global as
+    // there are kernels calling the function (including the function
+    // already defining the LDS global). Respectively, clone the
+    // LDS global and the call instructions to the function.
+    LLVM_DEBUG(dbgs() << "\tFunction is referenced by "
+                      << KernelRefsToFuncs[OldF] << " kernels.\n");
+    for (unsigned int ID = 0;
+         ID + 1 < std::min(KernelRefsToFuncs[OldF],
+                           MaxCountForClonedFunctions.getValue());
+         ++ID) {
+      // Clone function
+      ValueToValueMapTy VMap;
+      auto *NewF = CloneFunction(OldF, VMap);
+      NewF->setName(OldF->getName() + ".clone." + to_string(ID));
+      LLVM_DEBUG(dbgs() << "Inserting function clone with name "
+                        << NewF->getName() << '\n');
+
+      // Clone LDS global variable
+      auto *NewGV = new GlobalVariable(
+          M, GV.getValueType(), GV.isConstant(), GlobalValue::InternalLinkage,
+          UndefValue::get(GV.getValueType()),
+          GV.getName() + ".clone." + to_string(ID), &GV,
+          GlobalValue::NotThreadLocal, AMDGPUAS::LOCAL_ADDRESS, false);
+      NewGV->copyAttributesFrom(&GV);
+      NewGV->copyMetadata(&GV, 0);
+      NewGV->setComdat(GV.getComdat());
+      replaceUsesOfWith(&GV, NewGV, NewF);
+      LLVM_DEBUG(dbgs() << "Inserting LDS clone with name " << NewGV->getName()
+                        << "\n");
+
+      // Create a new CallInst to call the cloned function
+      for (auto [Inst, Args] : InstsCallingOldF) {
+        IRB.SetInsertPoint(Inst);
+        Instruction *I;
+        if (isa<CallInst>(Inst))
+          I = IRB.CreateCall(NewF, Args,
+                             Inst->getName() + ".clone." + to_string(ID));
+        else if (auto *II = dyn_cast<InvokeInst>(Inst))
+          I = IRB.CreateInvoke(NewF, II->getNormalDest(), II->getUnwindDest(),
+                               Args, II->getName() + ".clone" + to_string(ID));
+        LLVM_DEBUG(dbgs() << "Inserting inst: " << *I << '\n');
+      }
+      Changed = true;
+    }
+  }
+  return Changed ? PreservedAnalyses::none() : PreservedAnalyses::all();
+}
diff --git a/llvm/lib/Target/AMDGPU/AMDGPUPassRegistry.def b/llvm/lib/Target/AMDGPU/AMDGPUPassRegistry.def
index 90f36fadf35903..eb4bf25fef628a 100644
--- a/llvm/lib/Target/AMDGPU/AMDGPUPassRegistry.def
+++ b/llvm/lib/Target/AMDGPU/AMDGPUPassRegistry.def
@@ -22,6 +22,7 @@ MODULE_PASS("amdgpu-lower-buffer-fat-pointers",
             AMDGPULowerBufferFatPointersPass(*this))
 MODULE_PASS("amdgpu-lower-ctor-dtor", AMDGPUCtorDtorLoweringPass())
 MODULE_PASS("amdgpu-lower-module-lds", AMDGPULowerModuleLDSPass(*this))
+MODULE_PASS("amdgpu-clone-module-lds", AMDGPUCloneModuleLDSPass())
 MODULE_PASS("amdgpu-printf-runtime-binding", AMDGPUPrintfRuntimeBindingPass())
 MODULE_PASS("amdgpu-unify-metadata", AMDGPUUnifyMetadataPass())
 #undef MODULE_PASS
diff --git a/llvm/lib/Target/AMDGPU/AMDGPUTargetMachine.cpp b/llvm/lib/Target/AMDGPU/AMDGPUTargetMachine.cpp
index 305a6c8c3b9262..09beabd3f9c55c 100644
--- a/llvm/lib/Target/AMDGPU/AMDGPUTargetMachine.cpp
+++ b/llvm/lib/Target/AMDGPU/AMDGPUTargetMachine.cpp
@@ -725,6 +725,7 @@ void AMDGPUTargetMachine::registerPassBuilderCallbacks(
         // We want to support the -lto-partitions=N option as "best effort".
         // For that, we need to lower LDS earlier in the pipeline before the
         // module is partitioned for codegen.
+        PM.addPass(AMDGPUCloneModuleLDSPass());
         if (EnableLowerModuleLDS)
           PM.addPass(AMDGPULowerModuleLDSPass(*this));
       });
diff --git a/llvm/lib/Target/AMDGPU/CMakeLists.txt b/llvm/lib/Target/AMDGPU/CMakeLists.txt
index 48325a0928f93d..fbf59e0422cb79 100644
--- a/llvm/lib/Target/AMDGPU/CMakeLists.txt
+++ b/llvm/lib/Target/AMDGPU/CMakeLists.txt
@@ -50,6 +50,7 @@ add_llvm_target(AMDGPUCodeGen
   AMDGPUAtomicOptimizer.cpp
   AMDGPUAttributor.cpp
   AMDGPUCallLowering.cpp
+  AMDGPUCloneModuleLDS.cpp
   AMDGPUCodeGenPrepare.cpp
   AMDGPUCombinerHelper.cpp
   AMDGPUCtorDtorLowering.cpp
diff --git a/llvm/test/tools/llvm-split/AMDGPU/clone-lds-function.ll b/llvm/test/tools/llvm-split/AMDGPU/clone-lds-function.ll
new file mode 100644
index 00000000000000..ed4a4434d74b04
--- /dev/null
+++ b/llvm/test/tools/llvm-split/AMDGPU/clone-lds-function.ll
@@ -0,0 +1,59 @@
+; RUN: opt -passes=amdgpu-clone-module-lds %s -S | FileCheck %s
+
+target triple = "amdgcn-amd-amdhsa"
+
+; In this example, the CloneModuleLDS pass creates two copies of LDS_GV
+; as two kernels call the same device function where LDS_GV is used.
+
+; CHECK: [[LDS_GV_CLONE:@.*\.clone\.0]] = internal unnamed_addr addrspace(3) global [64 x i32] undef, align 16
+; CHECK: [[LDS_GV:@.*]] = internal unnamed_addr addrspace(3) global [64 x i32] undef, align 16
+@lds_gv = internal unnamed_addr addrspace(3) global [64 x i32] undef, align 16
+
+define protected amdgpu_kernel void @kernel1(i32 %n) #3 {
+; CHECK-LABEL: define protected amdgpu_kernel void @kernel1(
+; CHECK-SAME: i32 [[N:%.*]]) {
+; CHECK-NEXT:  entry:
+; CHECK-NEXT:    [[CALL_CLONE_0:%.*]] = call i32 @lds_func.clone.0(i32 [[N]])
+; CHECK-NEXT:    [[CALL:%.*]] = call i32 @lds_func(i32 [[N]])
+; CHECK-NEXT:    ret void
+;
+entry:
+  %call = call i32 @lds_func(i32 %n)
+  ret void
+}
+
+define protected amdgpu_kernel void @kernel2(i32 %n) #3 {
+; CHECK-LABEL: define protected amdgpu_kernel void @kernel2(
+; CHECK-SAME: i32 [[N:%.*]]) {
+; CHECK-NEXT:  entry:
+; CHECK-NEXT:    [[CALL_CLONE_0:%.*]] = call i32 @lds_func.clone.0(i32 [[N]])
+; CHECK-NEXT:    [[CALL:%.*]] = call i32 @lds_func(i32 [[N]])
+; CHECK-NEXT:    ret void
+;
+entry:
+  %call = call i32 @lds_func(i32 %n)
+  ret void
+}
+
+
+define i32 @lds_func(i32 %x) {
+; CHECK-LABEL: define i32 @lds_func(
+; CHECK-SAME: i32 [[X:%.*]]) {
+; CHECK-NEXT:  entry:
+; CHECK-NEXT:    [[P:%.*]] = getelementptr inbounds [64 x i32], ptr addrspacecast (ptr addrspace(3) [[LDS_GV]] to ptr), i64 0, i64 0
+; CHECK-NEXT:    store i32 [[X]], ptr [[P]], align 4
+; CHECK-NEXT:    ret i32 [[X]]
+;
+entry:
+  %p = getelementptr inbounds [64 x i32], ptr addrspacecast (ptr addrspace(3) @lds_gv to ptr), i64 0, i64 0
+  store i32 %x, ptr %p
+  ret i32 %x
+}
+
+; CHECK-LABEL: define i32 @lds_func.clone.0(i32 %x) {
+; CHECK-NEXT: entry:
+; CHECK-NEXT:   %0 = addrspacecast ptr addrspace(3) [[LDS_GV_CLONE]] to ptr
+; CHECK-NEXT:   %p = getelementptr inbounds [64 x i32], ptr %0, i64 0, i64 0
+; CHECK-NEXT:   store i32 %x, ptr %p, align 4
+; CHECK-NEXT:   ret i32 %x
+
diff --git a/llvm/test/tools/llvm-split/AMDGPU/clone-lds-functions-ancestor-kernels.ll b/llvm/test/tools/llvm-split/AMDGPU/clone-lds-functions-ancestor-kernels.ll
new file mode 100644
index 00000000000000..1efb1d12db924c
--- /dev/null
+++ b/llvm/test/tools/llvm-split/AMDGPU/clone-lds-functions-ancestor-kernels.ll
@@ -0,0 +1,107 @@
+; NOTE: Assertions have been autogenerated by utils/update_test_checks.py UTC_ARGS: --version 4
+; RUN: opt -passes=amdgpu-clone-module-lds %s -S | FileCheck %s
+
+target triple = "amdgcn-amd-amdhsa"
+
+; Before transformation,                    After transformation,
+;  K1  K2                                    K1  K2
+;  |  /                                      |  /
+;  | /                                       | /
+;  A                         ==>             A
+;  | \                                       | \
+;  |  \                                      |  \
+;  B   C                                     B   C
+;  |                                         | \
+;  X                                         X1 X2
+;
+; where X contains an LDS reference
+
+; CHECK: [[GV_CLONE:@.*]] = internal unnamed_addr addrspace(3) global [64 x i32] undef, align 16
+; CHECK: [[GV:@.*]] = internal unnamed_addr addrspace(3) global [64 x i32] undef, align 16
+@lds_gv = internal unnamed_addr addrspace(3) global [64 x i32] undef, align 16
+
+define protected amdgpu_kernel void @kernel1(i32 %n) #3 {
+; CHECK-LABEL: define protected amdgpu_kernel void @kernel1(
+; CHECK-SAME: i32 [[N:%.*]]) {
+; CHECK-NEXT:  entry:
+; CHECK-NEXT:    [[CALL:%.*]] = call i32 @A(i32 [[N]])
+; CHECK-NEXT:    ret void
+;
+entry:
+  %call = call i32 @A(i32 %n)
+  ret void
+}
+
+define protected amdgpu_kernel void @kernel2(i32 %n) #3 {
+; CHECK-LABEL: define protected amdgpu_kernel void @kernel2(
+; CHECK-SAME: i32 [[N:%.*]]) {
+; CHECK-NEXT:  entry:
+; CHECK-NEXT:    [[CALL:%.*]] = call i32 @A(i32 [[N]])
+; CHECK-NEXT:    ret void
+;
+entry:
+  %call = call i32 @A(i32 %n)
+  ret void
+}
+
+define void @A() {
+; CHECK-LABEL: define void @A() {
+; CHECK-NEXT:  entry:
+; CHECK-NEXT:    call void @B()
+; CHECK-NEXT:    call void @C()
+; CHECK-NEXT:    ret void
+;
+entry:
+  call void @B()
+  call void @C()
+  ret void
+}
+
+define i32 @B() {
+; CHECK-LABEL: define i32 @B() {
+; CHECK-NEXT:  entry:
+; CHECK-NEXT:    [[P:%.*]] = alloca i32, align 4
+; CHECK-NEXT:    store i32 5, ptr [[P]], align 4
+; CHECK-NEXT:    [[RET_CLONE_0:%.*]] = call i32 @X.clone.0(ptr [[P]])
+; CHECK-NEXT:    [[RET:%.*]] = call i32 @X(ptr [[P]])
+; CHECK-NEXT:    ret i32 [[RET]]
+;
+entry:
+  %p = alloca i32
+  store i32 5, ptr %p
+  %ret = call i32 @X(ptr %p)
+  ret i32 %ret
+}
+
+define void @C() {
+; CHECK-LABEL: define void @C() {
+; CHECK-NEXT:  entry:
+; CHECK-NEXT:    ret void
+;
+entry:
+  ret void
+}
+
+define i32 @X(ptr %x) {
+; CHECK-LABEL: define i32 @X(
+; CHECK-SAME: ptr [[X:%.*]]) {
+; CHECK-NEXT:  entry:
+; CHECK-NEXT:    [[P:%.*]] = getelementptr inbounds [64 x i32], ptr addrspacecast (ptr addrspace(3) [[GV]] to ptr), i64 0, i64 0
+; CHECK-NEXT:    [[V:%.*]] = load i32, ptr [[X]], align 4
+; CHECK-NEXT:    store i32 [[V]], ptr [[P]], align 4
+; CHECK-NEXT:    ret i32 [[V]]
+;
+entry:
+  %p = getelementptr inbounds [64 x i32], ptr addrspacecast (ptr addrspace(3) @lds_gv to ptr), i64 0, i64 0
+  %v = load i32, ptr %x
+  store i32 %v, ptr %p
+  ret i32 %v
+}
+
+; CHECK-LABEL: define i32 @X.clone.0(ptr %x) {
+; CHECK-NEXT: entry:
+; CHECK-NEXT:   %0 = addrspacecast ptr addrspace(3) [[GV_CLONE]] to ptr
+; CHECK-NEXT:   %p = getelementptr inbounds [64 x i32], ptr %0, i64 0, i64 0
+; CHECK-NEXT:   %v = load i32, ptr %x, align 4
+; CHECK-NEXT:   store i32 %v, ptr %p, align 4
+; CHECK-NEXT:   ret i32 %v
diff --git a/llvm/test/tools/llvm-split/AMDGPU/clone-lds-functions-successors.ll b/llvm/test/tools/llvm-split/AMDGPU/clone-lds-functions-successors.ll
new file mode 100644
index 00000000000000..7f27a630b7d850
--- /dev/null
+++ b/llvm/test/tools/llvm-split/AMDGPU/clone-lds-functions-successors.ll
@@ -0,0 +1,134 @@
+; NOTE: Assertions have been autogenerated by utils/update_test_checks.py UTC_ARGS: --version 4
+; RUN: opt -passes=amdgpu-clone-module-lds %s -S | FileCheck %s
+
+target triple = "amdgcn-amd-amdhsa"
+
+; Before transformation,                    After transformation,
+;  K1  K2    K3                              K1  K2    K3
+;  |  /      |                               |  /      |
+;  | /       |                               | /       |
+;  A --------+               ==>             A --------+
+;  |                                         |
+;  |                                         |
+;  B                                         B
+;  |                                       / | \
+;  X                                      X1 X2 X3
+;  |                                      \  |  /
+;  D                                       \ | /
+;                                            D
+; where X contains an LDS reference
+
+; CHECK: [[GV_CLONE_0:@.*]] = internal unnamed_addr addrspace(3) global [64 x i32] undef, align 16
+; CHECK: [[GV_CLONE_1:@.*]] = internal unnamed_addr addrspace(3) global [64 x i32] undef, align 16
+; CHECK: [[GV:@.*]] = internal unnamed_addr addrspace(3) global [64 x i32] undef, align 16
+@lds_gv = internal unnamed_addr addrspace(3) global [64 x i32] undef, align 16
+
+define protected amdgpu_kernel void @kernel1(i32 %n) {
+; CHECK-LABEL: define protected amdgpu_kernel void @kernel1(
+; CHECK-SAME: i32 [[N:%.*]]) {
+; CHECK-NEXT:  entry:
+; CHECK-NEXT:    [[CALL:%.*]] = call i32 @A(i32 [[N]])
+; CHECK-NEXT:    ret void
+;
+entry:
+  %call = call i32 @A(i32 %n)
+  ret void
+}
+
+define protected amdgpu_kernel void @kernel2(i32 %n) {
+; CHECK-LABEL: define protected amdgpu_kernel void @kernel2(
+; CHECK-SAME: i32 [[N:%.*]]) {
+; CHECK-NEXT:  entry:
+; CHECK-NEXT:    [[CALL:%.*]] = call i32 @A(i32 [[N]])
+; CHECK-NEXT:    ret void
+;
+entry:
+  %call = call i32 @A(i32 %n)
+  ret void
+}
+
+define protected amdgpu_kernel void @kernel3(i32 %n) {
+; CHECK-LABEL: define protected amdgpu_kernel void @kernel3(
+; CHECK-SAME: i32 [[N:%.*]]) {
+; CHECK-NEXT:  entry:
+; CHECK-NEXT:    [[CALL:%.*]] = call i32 @A(i32 [[N]])
+; CHECK-NEXT:    ret void
+;
+entry:
+  %call = call i32 @A(i32 %n)
+  ret void
+}
+
+define void @A() {
+; CHECK-LABEL: define void @A() {
+; CHECK-NEXT:  entry:
+; CHECK-NEXT:    call void @B()
+; CHECK-NEXT:    ret void
+;
+entry:
+  call void @B()
+  ret void
+}
+
+define i32 @B() {
+; CHECK-LABEL: define i32 @B() {
+; CHECK-NEXT:  entry:
+; CHECK-NEXT:    [[P:%.*]] = alloca i32, align 4
+; CHECK-NEXT:    store i32 5, ptr [[P]], align 4
+; CHECK-NEXT:    [[RET_CLONE_0:%.*]] = call i32 @X.clone.0(ptr [[P]])
+...
[truncated]

@gandhi56 gandhi56 added the LTO Link time optimization (regular/full LTO or ThinLTO) label Apr 22, 2024
@gandhi56 gandhi56 requested review from arsenm, scchan and jdoerfert April 22, 2024 22:20
if (auto *Op = dyn_cast<Operator>(U))
append_range(Worklist, Op->users());
}
llvm_unreachable("GV must be used in a function.");
Contributor:

This can't be unreachable and you also need to handle initializer references (in particular llvm.used)

Contributor Author:

getFunctionDefiningGV is only called when GV is an LDS variable. Is it possible to define an intrinsic LDS global variable using @llvm.used or @llvm.global_ctors?

Contributor:

Yes. LDS globals aren't special, they are just globals

Contributor Author:

isLDSVariableToLower(GV) returns false when GV is a global intrinsic. I don't think they need to be cloned as they simply bundle a collection of GVs - which may need to be cloned if they are LDS.

Comment on lines 139 to 142
// Collect all caller functions of OldF. Each of them must call its
// corresponding clone of OldF.
SmallVector<std::pair<Instruction *, SmallVector<Value *>>>
InstsCallingOldF;
Contributor:

I think the Cloning utilities should handle this with the VMap

Contributor Author:

InstsCallingOldF collects all call instructions to the function defining the LDS variable. These instructions are later cloned within their parent basic blocks. I am not sure we want to clone these instructions when the callee function is cloned.

Contributor:

I'm confused. If you're cloning the entire function, you must clone all of the instructions in the function?

Contributor Author:

Yes, all the instructions within the function are cloned upon calling CloneFunction. Calls/invokes to the function need to be handled separately, however, as they aren't cloned along with the function.

Contributor:

But this doesn't make sense, the instructions have to be cloned. And I would expect the value map to remap the references to the called function

Contributor Author:

If the function to be cloned is called F, are you saying that the call/invoke instructions to F should be cloned via VMap?

InstsToReplace.push_back(Inst);
} else if (auto *Op = dyn_cast<Operator>(U)) {
append_range(UsersWorklist, Op->users());
}
Contributor:

need to handle all cases. this will miss aliases and global initializers

Contributor Author:

Sorry, I am not sure what you mean by handling global initializers.

Contributor Author:

If by global initializers you mean global ctors, isLDSVariableToLower(GV) already skips cloning them as intended. I am handling GlobalAlias now in the latest revision of this patch.

Contributor:

I mean you can initialize a global variable to contain a pointer value (e.g. in llvm.used). Instructions / Operators are not the only possible users:

@llvm.used = appending global [1 x ptr] [ptr addrspacecast (ptr addrspace(3) @var1 to ptr)], section "llvm.metadata"
@lds = addrspace(3) global i32 0
@b = addrspace(1) global ptr addrspace(3) @lds

Contributor Author:

So in your example, if @var1 is an LDS which is cloned, every global intrinsic containing @var1.clone.0 should be cloned. Right?

Contributor:

For the special case of llvm.used, yes. I think we need to think harder about the general initializer case

@gandhi56 gandhi56 force-pushed the amdgpu-clone-module-lds branch from 1fc2186 to 032f3d9 on April 24, 2024 06:42

github-actions bot commented Apr 24, 2024

✅ With the latest revision this PR passed the C/C++ code formatter.

@arsenm (Contributor)

arsenm commented May 7, 2024

I'm not sure what the relation is between this and #89245

@gandhi56 gandhi56 force-pushed the amdgpu-clone-module-lds branch from 032f3d9 to af7fb7a on May 7, 2024 16:40
@gandhi56 (Contributor Author)

gandhi56 commented May 7, 2024

I'm not sure what the relation is between this and #89245

This pass runs before (AMDGPU)SplitModule. Since every kernel will have its own specific LDS global struct, SplitModule should be able to form self-contained modules with comparatively fewer cross-module LDS references.

gandhi56 added 2 commits May 9, 2024 06:59
Also includes some minor nits

@arsenm (Contributor)

arsenm commented Aug 16, 2024

Is this still necessary?

@gandhi56 (Contributor Author)

Is this still necessary?

The concept may still be necessary. I might implement it in LDS lowering or elsewhere, at some point in the future. We can close this PR for now.

@gandhi56 gandhi56 closed this Aug 21, 2024
@JonChesterfield (Collaborator)

Cloning functions can improve LDS allocation in exchange for degrading code density. Seems moderately unlikely to be a win overall to me. Better if there are other reasons to clone the function.

Cloning LDS variables without cloning functions, prior to the existing lowering pass, will not improve codegen and stands a decent chance of making it worse.

Labels: backend:AMDGPU; LTO (Link time optimization, regular/full LTO or ThinLTO)