[mlir][gpu] Change GPU modules to globals #135478

Merged
chsigg merged 3 commits into llvm:main from piper_export_cl_746544976 on Apr 22, 2025

Conversation

chsigg
Contributor

@chsigg chsigg commented Apr 12, 2025

Load/unload GPU modules in global ctors/dtors instead of each time a kernel is launched.

Loading GPU modules is a heavy-weight operation and synchronizes the GPU context. Now that the modules are loaded ahead of time, asynchronously launched kernels can run concurrently, see https://discourse.llvm.org/t/how-to-lower-the-combination-of-async-gpu-ops-in-gpu-dialect.

The implementations of embedBinary() and launchKernel() use slightly different mechanics at the moment, but I prefer not to change the latter more than necessary as part of this PR. I will prepare a follow-up NFC for launchKernel() to align them again.

…en launching a kernel.

Loading GPU modules is a heavy-weight operation and synchronizes the GPU context. Now that the modules are loaded ahead of time, asynchronously launched kernels can run concurrently, see https://discourse.llvm.org/t/how-to-lower-the-combination-of-async-gpu-ops-in-gpu-dialect.
@llvmbot added the mlir:llvm, mlir:gpu, mlir, and bazel ("Peripheral" support tier, build system: utils/bazel) labels on Apr 12, 2025
@llvmbot
Member

llvmbot commented Apr 12, 2025

@llvm/pr-subscribers-mlir-execution-engine
@llvm/pr-subscribers-mlir-gpu

@llvm/pr-subscribers-mlir

Author: Christian Sigg (chsigg)

Changes

Load/unload GPU modules in global ctors/dtors instead of each time a kernel is launched.

Loading GPU modules is a heavy-weight operation and synchronizes the GPU context. Now that the modules are loaded ahead of time, asynchronously launched kernels can run concurrently, see https://discourse.llvm.org/t/how-to-lower-the-combination-of-async-gpu-ops-in-gpu-dialect.

The implementations of embedBinary() and launchKernel() use slightly different mechanics at the moment, but I prefer not to change the latter more than necessary as part of this PR. I will prepare a follow-up NFC for launchKernel() to align them again.


Patch is 25.00 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/135478.diff

4 Files Affected:

  • (modified) mlir/lib/Target/LLVMIR/Dialect/GPU/SelectObjectAttr.cpp (+128-130)
  • (added) mlir/test/Integration/GPU/CUDA/concurrent-kernels.mlir (+48)
  • (modified) mlir/test/Target/LLVMIR/gpu.mlir (+35-36)
  • (modified) utils/bazel/llvm-project-overlay/mlir/BUILD.bazel (+1)
diff --git a/mlir/lib/Target/LLVMIR/Dialect/GPU/SelectObjectAttr.cpp b/mlir/lib/Target/LLVMIR/Dialect/GPU/SelectObjectAttr.cpp
index 8d4a0bcf8adbf..d3216d9ad17eb 100644
--- a/mlir/lib/Target/LLVMIR/Dialect/GPU/SelectObjectAttr.cpp
+++ b/mlir/lib/Target/LLVMIR/Dialect/GPU/SelectObjectAttr.cpp
@@ -18,11 +18,13 @@
 #include "mlir/Target/LLVMIR/Export.h"
 #include "mlir/Target/LLVMIR/ModuleTranslation.h"
 
+#include "llvm/ADT/ScopeExit.h"
 #include "llvm/IR/Constants.h"
 #include "llvm/IR/IRBuilder.h"
 #include "llvm/IR/LLVMContext.h"
 #include "llvm/IR/Module.h"
 #include "llvm/Support/FormatVariadic.h"
+#include "llvm/Transforms/Utils/ModuleUtils.h"
 
 using namespace mlir;
 
@@ -31,9 +33,13 @@ namespace {
 class SelectObjectAttrImpl
     : public gpu::OffloadingLLVMTranslationAttrInterface::FallbackModel<
           SelectObjectAttrImpl> {
+  // Returns the selected object for embedding.
+  gpu::ObjectAttr getSelectedObject(gpu::BinaryOp op) const;
+
 public:
   // Translates a `gpu.binary`, embedding the binary into a host LLVM module as
-  // global binary string.
+  // global binary string which gets loaded/unloaded into a global module
+  // object through a global ctor/dtor.
   LogicalResult embedBinary(Attribute attribute, Operation *operation,
                             llvm::IRBuilderBase &builder,
                             LLVM::ModuleTranslation &moduleTranslation) const;
@@ -45,23 +51,9 @@ class SelectObjectAttrImpl
                              Operation *binaryOperation,
                              llvm::IRBuilderBase &builder,
                              LLVM::ModuleTranslation &moduleTranslation) const;
-
-  // Returns the selected object for embedding.
-  gpu::ObjectAttr getSelectedObject(gpu::BinaryOp op) const;
 };
-// Returns an identifier for the global string holding the binary.
-std::string getBinaryIdentifier(StringRef binaryName) {
-  return binaryName.str() + "_bin_cst";
-}
 } // namespace
 
-void mlir::gpu::registerOffloadingLLVMTranslationInterfaceExternalModels(
-    DialectRegistry &registry) {
-  registry.addExtension(+[](MLIRContext *ctx, gpu::GPUDialect *dialect) {
-    SelectObjectAttr::attachInterface<SelectObjectAttrImpl>(*ctx);
-  });
-}
-
 gpu::ObjectAttr
 SelectObjectAttrImpl::getSelectedObject(gpu::BinaryOp op) const {
   ArrayRef<Attribute> objects = op.getObjectsAttr().getValue();
@@ -96,6 +88,94 @@ SelectObjectAttrImpl::getSelectedObject(gpu::BinaryOp op) const {
   return mlir::dyn_cast<gpu::ObjectAttr>(objects[index]);
 }
 
+static Twine getModuleIdentifier(StringRef moduleName) {
+  return moduleName + "_module";
+}
+
+namespace llvm {
+static LogicalResult embedBinaryImpl(StringRef moduleName,
+                                     gpu::ObjectAttr object, Module &module) {
+
+  // Embed the object as a global string.
+  // Add null for assembly output for JIT paths that expect null-terminated
+  // strings.
+  bool addNull = (object.getFormat() == gpu::CompilationTarget::Assembly);
+  StringRef serializedStr = object.getObject().getValue();
+  Constant *serializedCst =
+      ConstantDataArray::getString(module.getContext(), serializedStr, addNull);
+  GlobalVariable *serializedObj =
+      new GlobalVariable(module, serializedCst->getType(), true,
+                         GlobalValue::LinkageTypes::InternalLinkage,
+                         serializedCst, moduleName + "_binary");
+  serializedObj->setAlignment(MaybeAlign(8));
+  serializedObj->setUnnamedAddr(GlobalValue::UnnamedAddr::None);
+
+  // Default JIT optimization level.
+  auto optLevel = APInt::getZero(32);
+
+  if (DictionaryAttr objectProps = object.getProperties()) {
+    if (auto section = dyn_cast_or_null<StringAttr>(
+            objectProps.get(gpu::elfSectionName))) {
+      serializedObj->setSection(section.getValue());
+    }
+    // Check if there's an optimization level embedded in the object.
+    if (auto optAttr = dyn_cast_or_null<IntegerAttr>(objectProps.get("O")))
+      optLevel = optAttr.getValue();
+  }
+
+  IRBuilder<> builder(module.getContext());
+  auto i32Ty = builder.getInt32Ty();
+  auto i64Ty = builder.getInt64Ty();
+  auto ptrTy = builder.getPtrTy(0);
+  auto voidTy = builder.getVoidTy();
+
+  // Embed the module as a global object.
+  auto *modulePtr = new GlobalVariable(
+      module, ptrTy, /*isConstant=*/false, GlobalValue::InternalLinkage,
+      /*Initializer=*/ConstantPointerNull::get(ptrTy),
+      getModuleIdentifier(moduleName));
+
+  auto *loadFn = Function::Create(FunctionType::get(voidTy, /*IsVarArg=*/false),
+                                  GlobalValue::InternalLinkage,
+                                  moduleName + "_load", module);
+  loadFn->setSection(".text.startup");
+  auto *loadBlock = BasicBlock::Create(module.getContext(), "entry", loadFn);
+  builder.SetInsertPoint(loadBlock);
+  Value *moduleObj = [&] {
+    if (object.getFormat() == gpu::CompilationTarget::Assembly) {
+      FunctionCallee moduleLoadFn = module.getOrInsertFunction(
+          "mgpuModuleLoadJIT", FunctionType::get(ptrTy, {ptrTy, i32Ty}, false));
+      Constant *optValue = ConstantInt::get(i32Ty, optLevel);
+      return builder.CreateCall(moduleLoadFn, {serializedObj, optValue});
+    } else {
+      FunctionCallee moduleLoadFn = module.getOrInsertFunction(
+          "mgpuModuleLoad", FunctionType::get(ptrTy, {ptrTy, i64Ty}, false));
+      Constant *binarySize =
+          ConstantInt::get(i64Ty, serializedStr.size() + (addNull ? 1 : 0));
+      return builder.CreateCall(moduleLoadFn, {serializedObj, binarySize});
+    }
+  }();
+  builder.CreateStore(moduleObj, modulePtr);
+  builder.CreateRetVoid();
+  appendToGlobalCtors(module, loadFn, /*Priority=*/123);
+
+  auto *unloadFn = Function::Create(
+      FunctionType::get(voidTy, /*IsVarArg=*/false),
+      GlobalValue::InternalLinkage, moduleName + "_unload", module);
+  unloadFn->setSection(".text.startup");
+  auto *unloadBlock =
+      BasicBlock::Create(module.getContext(), "entry", unloadFn);
+  builder.SetInsertPoint(unloadBlock);
+  FunctionCallee moduleUnloadFn = module.getOrInsertFunction(
+      "mgpuModuleUnload", FunctionType::get(voidTy, ptrTy, false));
+  builder.CreateCall(moduleUnloadFn, builder.CreateLoad(ptrTy, modulePtr));
+  builder.CreateRetVoid();
+  appendToGlobalDtors(module, unloadFn, /*Priority=*/123);
+
+  return success();
+}
+} // namespace llvm
+
 LogicalResult SelectObjectAttrImpl::embedBinary(
     Attribute attribute, Operation *operation, llvm::IRBuilderBase &builder,
     LLVM::ModuleTranslation &moduleTranslation) const {
@@ -113,29 +193,8 @@ LogicalResult SelectObjectAttrImpl::embedBinary(
   if (!object)
     return failure();
 
-  llvm::Module *module = moduleTranslation.getLLVMModule();
-
-  // Embed the object as a global string.
-  // Add null for assembly output for JIT paths that expect null-terminated
-  // strings.
-  bool addNull = (object.getFormat() == gpu::CompilationTarget::Assembly);
-  llvm::Constant *binary = llvm::ConstantDataArray::getString(
-      builder.getContext(), object.getObject().getValue(), addNull);
-  llvm::GlobalVariable *serializedObj =
-      new llvm::GlobalVariable(*module, binary->getType(), true,
-                               llvm::GlobalValue::LinkageTypes::InternalLinkage,
-                               binary, getBinaryIdentifier(op.getName()));
-
-  if (object.getProperties()) {
-    if (auto section = mlir::dyn_cast_or_null<mlir::StringAttr>(
-            object.getProperties().get(gpu::elfSectionName))) {
-      serializedObj->setSection(section.getValue());
-    }
-  }
-  serializedObj->setLinkage(llvm::GlobalValue::LinkageTypes::InternalLinkage);
-  serializedObj->setAlignment(llvm::MaybeAlign(8));
-  serializedObj->setUnnamedAddr(llvm::GlobalValue::UnnamedAddr::None);
-  return success();
+  return embedBinaryImpl(op.getName(), object,
+                         *moduleTranslation.getLLVMModule());
 }
 
 namespace llvm {
@@ -153,15 +212,6 @@ class LaunchKernel {
   // Get the module function callee.
   FunctionCallee getModuleFunctionFn();
 
-  // Get the module load callee.
-  FunctionCallee getModuleLoadFn();
-
-  // Get the module load JIT callee.
-  FunctionCallee getModuleLoadJITFn();
-
-  // Get the module unload callee.
-  FunctionCallee getModuleUnloadFn();
-
   // Get the stream create callee.
   FunctionCallee getStreamCreateFn();
 
@@ -261,24 +311,6 @@ llvm::FunctionCallee llvm::LaunchKernel::getModuleFunctionFn() {
       FunctionType::get(ptrTy, ArrayRef<Type *>({ptrTy, ptrTy}), false));
 }
 
-llvm::FunctionCallee llvm::LaunchKernel::getModuleLoadFn() {
-  return module.getOrInsertFunction(
-      "mgpuModuleLoad",
-      FunctionType::get(ptrTy, ArrayRef<Type *>({ptrTy, i64Ty}), false));
-}
-
-llvm::FunctionCallee llvm::LaunchKernel::getModuleLoadJITFn() {
-  return module.getOrInsertFunction(
-      "mgpuModuleLoadJIT",
-      FunctionType::get(ptrTy, ArrayRef<Type *>({ptrTy, i32Ty}), false));
-}
-
-llvm::FunctionCallee llvm::LaunchKernel::getModuleUnloadFn() {
-  return module.getOrInsertFunction(
-      "mgpuModuleUnload",
-      FunctionType::get(voidTy, ArrayRef<Type *>({ptrTy}), false));
-}
-
 llvm::FunctionCallee llvm::LaunchKernel::getStreamCreateFn() {
   return module.getOrInsertFunction("mgpuStreamCreate",
                                     FunctionType::get(ptrTy, false));
@@ -301,9 +333,9 @@ llvm::FunctionCallee llvm::LaunchKernel::getStreamSyncFn() {
 llvm::Value *llvm::LaunchKernel::getOrCreateFunctionName(StringRef moduleName,
                                                          StringRef kernelName) {
   std::string globalName =
-      std::string(formatv("{0}_{1}_kernel_name", moduleName, kernelName));
+      std::string(formatv("{0}_{1}_name", moduleName, kernelName));
 
-  if (GlobalVariable *gv = module.getGlobalVariable(globalName))
+  if (GlobalVariable *gv = module.getGlobalVariable(globalName, true))
     return gv;
 
   return builder.CreateGlobalString(kernelName, globalName);
@@ -346,16 +378,13 @@ llvm::LaunchKernel::createKernelArgArray(mlir::gpu::LaunchFuncOp op) {
 }
 
 // Emits LLVM IR to launch a kernel function:
-// %0 = call %binarygetter
-// %1 = call %moduleLoad(%0)
-// %2 = <see generateKernelNameConstant>
-// %3 = call %moduleGetFunction(%1, %2)
-// %4 = call %streamCreate()
-// %5 = <see generateParamsArray>
-// call %launchKernel(%3, <launchOp operands 0..5>, 0, %4, %5, nullptr)
-// call %streamSynchronize(%4)
-// call %streamDestroy(%4)
-// call %moduleUnload(%1)
+// %1 = load %global_module_object
+// %2 = call @mgpuModuleGetFunction(%1, %global_kernel_name)
+// %3 = call @mgpuStreamCreate()
+// %4 = <see createKernelArgArray()>
+// call @mgpuLaunchKernel(%2, ..., %3, %4, ...)
+// call @mgpuStreamSynchronize(%3)
+// call @mgpuStreamDestroy(%3)
 llvm::LogicalResult
 llvm::LaunchKernel::createKernelLaunch(mlir::gpu::LaunchFuncOp op,
                                        mlir::gpu::ObjectAttr object) {
@@ -385,58 +414,29 @@ llvm::LaunchKernel::createKernelLaunch(mlir::gpu::LaunchFuncOp op,
   // Create the argument array.
   Value *argArray = createKernelArgArray(op);
 
-  // Default JIT optimization level.
-  llvm::Constant *optV = llvm::ConstantInt::get(i32Ty, 0);
-  // Check if there's an optimization level embedded in the object.
-  DictionaryAttr objectProps = object.getProperties();
-  mlir::Attribute optAttr;
-  if (objectProps && (optAttr = objectProps.get("O"))) {
-    auto optLevel = dyn_cast<IntegerAttr>(optAttr);
-    if (!optLevel)
-      return op.emitError("the optimization level must be an integer");
-    optV = llvm::ConstantInt::get(i32Ty, optLevel.getValue());
-  }
-
-  // Load the kernel module.
-  StringRef moduleName = op.getKernelModuleName().getValue();
-  std::string binaryIdentifier = getBinaryIdentifier(moduleName);
-  Value *binary = module.getGlobalVariable(binaryIdentifier, true);
-  if (!binary)
-    return op.emitError() << "Couldn't find the binary: " << binaryIdentifier;
-
-  auto binaryVar = dyn_cast<llvm::GlobalVariable>(binary);
-  if (!binaryVar)
-    return op.emitError() << "Binary is not a global variable: "
-                          << binaryIdentifier;
-  llvm::Constant *binaryInit = binaryVar->getInitializer();
-  auto binaryDataSeq =
-      dyn_cast_if_present<llvm::ConstantDataSequential>(binaryInit);
-  if (!binaryDataSeq)
-    return op.emitError() << "Couldn't find binary data array: "
-                          << binaryIdentifier;
-  llvm::Constant *binarySize =
-      llvm::ConstantInt::get(i64Ty, binaryDataSeq->getNumElements() *
-                                        binaryDataSeq->getElementByteSize());
-
-  Value *moduleObject =
-      object.getFormat() == gpu::CompilationTarget::Assembly
-          ? builder.CreateCall(getModuleLoadJITFn(), {binary, optV})
-          : builder.CreateCall(getModuleLoadFn(), {binary, binarySize});
-
   // Load the kernel function.
-  Value *moduleFunction = builder.CreateCall(
-      getModuleFunctionFn(),
-      {moduleObject,
-       getOrCreateFunctionName(moduleName, op.getKernelName().getValue())});
+  StringRef moduleName = op.getKernelModuleName().getValue();
+  Twine moduleIdentifier = getModuleIdentifier(moduleName);
+  Value *modulePtr = module.getGlobalVariable(moduleIdentifier.str(), true);
+  if (!modulePtr)
+    return op.emitError() << "Couldn't find the binary: " << moduleIdentifier;
+  Value *moduleObj = builder.CreateLoad(ptrTy, modulePtr);
+  Value *functionName = getOrCreateFunctionName(moduleName, op.getKernelName());
+  Value *moduleFunction =
+      builder.CreateCall(getModuleFunctionFn(), {moduleObj, functionName});
 
   // Get the stream to use for execution. If there's no async object then create
   // a stream to make a synchronous kernel launch.
   Value *stream = nullptr;
-  bool handleStream = false;
+  // Sync & destroy the stream, for synchronous launches.
+  auto destroyStream = make_scope_exit([&]() {
+    builder.CreateCall(getStreamSyncFn(), {stream});
+    builder.CreateCall(getStreamDestroyFn(), {stream});
+  });
   if (mlir::Value asyncObject = op.getAsyncObject()) {
     stream = llvmValue(asyncObject);
+    destroyStream.release();
   } else {
-    handleStream = true;
     stream = builder.CreateCall(getStreamCreateFn(), {});
   }
 
@@ -462,14 +462,12 @@ llvm::LaunchKernel::createKernelLaunch(mlir::gpu::LaunchFuncOp op,
                                           argArray, nullPtr, paramsCount}));
   }
 
-  // Sync & destroy the stream, for synchronous launches.
-  if (handleStream) {
-    builder.CreateCall(getStreamSyncFn(), {stream});
-    builder.CreateCall(getStreamDestroyFn(), {stream});
-  }
-
-  // Unload the kernel module.
-  builder.CreateCall(getModuleUnloadFn(), {moduleObject});
-
   return success();
 }
+
+void mlir::gpu::registerOffloadingLLVMTranslationInterfaceExternalModels(
+    DialectRegistry &registry) {
+  registry.addExtension(+[](MLIRContext *ctx, gpu::GPUDialect *dialect) {
+    SelectObjectAttr::attachInterface<SelectObjectAttrImpl>(*ctx);
+  });
+}
diff --git a/mlir/test/Integration/GPU/CUDA/concurrent-kernels.mlir b/mlir/test/Integration/GPU/CUDA/concurrent-kernels.mlir
new file mode 100644
index 0000000000000..80cc6d6bf91dd
--- /dev/null
+++ b/mlir/test/Integration/GPU/CUDA/concurrent-kernels.mlir
@@ -0,0 +1,48 @@
+// Tests that we can run multiple kernels concurrently. Runs two kernels, which
+// increment a global atomic counter, then wait for the counter to reach 2.
+//
+// RUN: mlir-opt %s \
+// RUN: | mlir-opt -gpu-lower-to-nvvm-pipeline="cubin-format=%gpu_compilation_format" \
+// RUN: | mlir-runner \
+// RUN:   --shared-libs=%mlir_cuda_runtime \
+// RUN:   --shared-libs=%mlir_runner_utils \
+// RUN:   --entry-point-result=void
+
+module attributes {gpu.container_module} {
+    gpu.module @kernels {
+        gpu.func @kernel(%memref: memref<i32>) kernel {
+            %c0 = arith.constant 0 : i32
+            %c1 = arith.constant 1 : i32
+            %c2 = arith.constant 2 : i32
+            %block = memref.atomic_rmw addi %c1, %memref[] : (i32, memref<i32>) -> i32
+            scf.while: () -> () {
+                %value = memref.atomic_rmw addi %c0, %memref[] : (i32, memref<i32>) -> i32
+                %cond = arith.cmpi slt, %value, %c2 : i32
+                scf.condition(%cond)
+            } do {
+                scf.yield
+            }
+            gpu.return
+        }
+    }
+
+    func.func @main() {
+        %memref = gpu.alloc host_shared () : memref<i32>
+        %c0 = arith.constant 0 : i32
+        memref.store %c0, %memref[] : memref<i32>
+
+        %0 = gpu.wait async
+        %1 = gpu.wait async
+        %c1 = arith.constant 1 : index
+        %2 = gpu.launch_func async [%0] @kernels::@kernel
+            blocks in (%c1, %c1, %c1)
+            threads in (%c1, %c1, %c1)
+            args(%memref: memref<i32>)
+        %3 = gpu.launch_func async [%1] @kernels::@kernel
+            blocks in (%c1, %c1, %c1)
+            threads in (%c1, %c1, %c1)
+            args(%memref: memref<i32>)
+        gpu.wait [%2, %3]
+        return
+    }
+}
diff --git a/mlir/test/Target/LLVMIR/gpu.mlir b/mlir/test/Target/LLVMIR/gpu.mlir
index 6b7e7fcc71960..0d29a95b12266 100644
--- a/mlir/test/Target/LLVMIR/gpu.mlir
+++ b/mlir/test/Target/LLVMIR/gpu.mlir
@@ -3,8 +3,11 @@
 // Checking the translation of the `gpu.binary` & `gpu.launch_fun` ops.
 module attributes {gpu.container_module} {
   // CHECK: [[ARGS_TY:%.*]] = type { i32, i32 }
-  // CHECK: @kernel_module_bin_cst = internal constant [4 x i8] c"BLOB", align 8
-  // CHECK: @kernel_module_kernel_kernel_name = private unnamed_addr constant [7 x i8] c"kernel\00", align 1
+  // CHECK-DAG: @kernel_module_binary = internal constant [4 x i8] c"BLOB", align 8
+  // CHECK-DAG: kernel_module_module = internal global ptr null
+  // CHECK-DAG: @llvm.global_ctors = appending global {{.*}} @kernel_module_load
+  // CHECK-DAG: @llvm.global_dtors = appending global {{.*}} @kernel_module_unload
+  // CHECK-DAG: @kernel_module_kernel_name = private unnamed_addr constant [7 x i8] c"kernel\00", align 1
   gpu.binary @kernel_module  [#gpu.object<#nvvm.target, "BLOB">]
   llvm.func @foo() {
     // CHECK: [[ARGS:%.*]] = alloca %{{.*}}, align 8
@@ -17,26 +20,32 @@ module attributes {gpu.container_module} {
     // CHECK: store i32 32, ptr [[ARG1]], align 4
     // CHECK: %{{.*}} = getelementptr ptr, ptr [[ARGS_ARRAY]], i32 1
     // CHECK: store ptr [[ARG1]], ptr %{{.*}}, align 8
-    // CHECK: [[MODULE:%.*]] = call ptr @mgpuModuleLoad(ptr @kernel_module_bin_cst, i64 4)
-    // CHECK: [[FUNC:%.*]] = call ptr @mgpuModuleGetFunction(ptr [[MODULE]], ptr @kernel_module_kernel_kernel_name)
+    // CHECK: [[MODULE:%.*]] = load ptr, ptr @kernel_module_module
+    // CHECK: [[FUNC:%.*]] = call ptr @mgpuModuleGetFunction(ptr [[MODULE]], ptr @kernel_module_kernel_name)
     // CHECK: [[STREAM:%.*]] = call ptr @mgpuStreamCreate()
     // CHECK: call void @mgpuLaunchKernel(ptr [[FUNC]], i64 8, i64 8, i64 8, i64 8, i64 8, i64 8, i32 256, ptr [[STREAM]], ptr [[ARGS_ARRAY]], ptr null, i64 2)
     // CHECK: call void @mgpuStreamSynchronize(ptr [[STREAM]])
     // CHECK: call void @mgpuStreamDestroy(ptr [[STREAM]])
-    // CHECK: call void @mgpuModuleUnload(ptr [[MODULE]])
     %0 = llvm.mlir.constant(8 : index) : i64
     %1 = llvm.mlir.constant(32 : i32) : i32
     %2 = llvm.mlir.constant(256 : i32) : i32
     gpu.launch_func @kernel_module::@kernel blocks in (%0, %0, %0) threads in (%0, %0, %0) : i64 dynamic_shared_memory_size %2 args(%1 : i32, %1 : i32)
     llvm.return
   }
+  // CHECK: @kernel_module_load() section ".text.startup"
+  // CHECK: [[MODULE:%.*]] = call ptr @mgpuModuleLoad
+  // CHECK: store ptr [[MODULE]], ptr @kernel_module_module
+  //
+  // CHECK: @kernel_module_unload() section ".text.startup"
+  // CHECK: [[MODULE:%.*]] = load ptr, ptr @kernel_module_module
+  // CHECK: call void @mgpuModuleUnload(ptr [[MODULE]])
 }
 
 // -----
 
 // Checking the correct selection of the second object using an index as a selector.
 module {
-  // CHECK: @kernel_module_bin_cst = internal constant [1 x i8] c"1", align 8
+  // CHECK: @kernel_module_binary = internal constant [1 x i8] c"1", align 8
   gpu.binary @kernel_module <#gpu.select_object<1>> [#gpu.object<#nvvm.target, "0">, #gpu.object<#nvvm.target, "1">]
 }
 
@@ -44,7 +53,7 @@ module {
 
 // Checking the correct selection of the second object using a target as a selector.
 module {
-  // CHECK: @ker...
[truncated]

@llvmbot
Member

llvmbot commented Apr 12, 2025

@llvm/pr-subscribers-mlir-llvm

This allows concurrent execution of different kernels (different function or different module).

See https://discourse.llvm.org/t/how-to-lower-the-combination-of-async-gpu-ops-in-gpu-dialect/72796/17.
@joker-eph
Collaborator

Would it be possible to lazy load on first use?
Right now this would all happen eagerly on application startup and consume GPU memory even for kernels that wouldn't be used (if I understand the change correctly).

…CUDA_MODULE_LOADING=EAGER in test.

Format test file.
@chsigg
Contributor Author

chsigg commented Apr 13, 2025

Would it be possible to lazy load on first use?

Yes, I reverted the eager loading in the runtime. Instead, one can use the CUDA_MODULE_LOADING=EAGER environment variable to force eager loading.

@joker-eph
Collaborator

git grep CUDA_MODULE_LOADING does not find anything in the repo; do you have a pointer to what you have in mind here?

@chsigg
Contributor Author

chsigg commented Apr 14, 2025

do you have a pointer to what you have in mind here?

This environment variable controls lazy loading of kernels in the CUDA driver:
https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#lazy-loading

The documentation is not consistent about whether to use CUDA_MODULE_LOADING or CUDA_MODULE_DATA_LOADING (see e.g. here). I checked that only the former works, though: with two different kernels, the test fails without the environment variable set or with CUDA_MODULE_DATA_LOADING=EAGER, and passes with CUDA_MODULE_LOADING=EAGER.
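For completeness, eager loading can also be forced from the host process itself; a minimal sketch (not part of this PR, and assuming the variable is only read when CUDA is first initialized, so it must run before any CUDA call):

    // Sketch: force eager module loading for this process. Normally one would
    // just set CUDA_MODULE_LOADING=EAGER in the shell environment when
    // invoking mlir-runner; setenv is the POSIX equivalent in code.
    #include <cstdlib>

    int main() {
      setenv("CUDA_MODULE_LOADING", "EAGER", /*overwrite=*/1);
      // ... initialize CUDA and launch kernels after this point ...
      return 0;
    }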

@chsigg
Contributor Author

chsigg commented Apr 22, 2025

Friendly ping. I would appreciate a review. Thanks!

@joker-eph
Collaborator

joker-eph commented Apr 22, 2025

LG (sorry I missed your answer last week)

@chsigg
Contributor Author

chsigg commented Apr 22, 2025

No problem. Thanks a lot Mehdi.

@chsigg chsigg merged commit 7851b1b into llvm:main Apr 22, 2025
11 checks passed
@chsigg chsigg deleted the piper_export_cl_746544976 branch April 22, 2025 11:50
@clementval
Contributor

Could the constructor be added or not added based on a pass option? We are using the generated binary, but we rely on a different runtime to load the module, so in our case we would like to not generate it.

@joker-eph
Collaborator

I wonder if we could move the logic back into createKernelLaunch(), where it was until now, instead of doing it during the translation?
We'd still inject the mgpuModuleLoad into a global constructor, but we'd do it only if a gpu.launch is present. That way a simple translation of a kernel binary does not do that and preserves other runtime/lowering implementations.

@clementval
Contributor

I wonder if we could move the logic back into createKernelLaunch(), where it was until now, instead of doing it during the translation? We'd still inject the mgpuModuleLoad into a global constructor, but we'd do it only if a gpu.launch is present. That way a simple translation of a kernel binary does not do that and preserves other runtime/lowering implementations.

It would be nice to keep the translation of gpu.binary limited to the creation of the global. We can probably work around the current constructor generation in our case.

@chsigg
Contributor Author

chsigg commented Apr 23, 2025

I'm sorry this change is causing trouble for you. I'm not sure now: can you work with the current state, or would you prefer to move the module global + ctor/dtor creation back to createKernelLaunch?

@joker-eph
Collaborator

I think we should move it to createKernelLaunch because this is a layering problem: the choice of the runtime should be part of the lowering of the GPU dialect, which is meant to be customizable for other runtimes.
The move to the translation layer makes it no longer a customizable lowering step, but something that requires non-trivial workarounds (basically you need to inject an empty mgpuModuleLoad into the LLVM module to "cancel" this added code).
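For reference, that workaround would look roughly like the following C++ stubs; this is a minimal sketch (not part of this PR) assuming the wrapper signatures emitted by the patch (mgpuModuleLoad(ptr, i64) -> ptr, mgpuModuleLoadJIT(ptr, i32) -> ptr, mgpuModuleUnload(ptr) -> void). It only neutralizes the generated ctor/dtor; anything that actually launches kernels through the mgpu* path still needs the real wrappers.

    // No-op replacements for the module load/unload entry points called by the
    // generated global ctor/dtor. Linking these instead of the real runtime
    // wrappers "cancels" the automatic module loading.
    #include <cstdint>

    extern "C" void *mgpuModuleLoad(void * /*binary*/, uint64_t /*size*/) {
      return nullptr; // The module is loaded by a different runtime.
    }

    extern "C" void *mgpuModuleLoadJIT(void * /*binary*/, int32_t /*optLevel*/) {
      return nullptr; // Same for the JIT (assembly) path.
    }

    extern "C" void mgpuModuleUnload(void * /*module*/) {
      // Nothing was loaded here, so there is nothing to unload.
    }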

@Hardcode84
Contributor

Some drive-by comment: it would be nice to have a gpu.launch_func version which takes the function as an SSA value, plus dedicated gpu.load_module/gpu.get_func ops, in addition to the existing gpu.launch_func which takes a symbol ref. This way we can progressively lower things completely within MLIR, while keeping the runtime-call lowering straightforward and delegating kernel lazy-loading/caching decisions to the user.

@clementval
Contributor

I'm sorry this change is causing trouble for you. I'm not sure now: can you work with the current state, or would you prefer to move the module global + ctor/dtor creation back to createKernelLaunch?

We can live with it, and we already have a workaround, but I think that having this in createKernelLaunch makes more sense.

@chsigg
Contributor Author

chsigg commented Apr 23, 2025

I think we should move it to createKernelLaunch because this is a layering problem: the choice of the runtime should be part of the lowering of the GPU dialect, which is meant to be customizable for other runtimes.

Well, either way it's part of the translation layer of a gpu op, no?

The way I understand it, the offloadingHandler of the gpu.binary op is the intended customization point for how gpu.binary and gpu.launch_func are lowered to LLVM.

But yes, I can move this back to createKernelLaunch. It makes sense that lowering the gpu.binary itself should not automatically load that binary. But it's not ideal for it to be part of gpu.launch_func either, so...

I do like the idea of gpu.load_module / gpu.unload_module / gpu.get_func ops to make things more explicit. I'm wondering, though, if it is too CUDA-specific. The runtime wrappers for Vulkan do not seem to quite fit this interface.

I'm honestly a bit lost about how it would all play together and where we would lower to what, so that things can be easily customized and are not tied to one specific API or runtime. If people have time and energy to come up with a plan though, I'm happy to help with the coding.

@joker-eph
Collaborator

Well, either way it's part of the translation layer of a gpu op, no?

Ah you're correct, I thought that gpu.launch_func was still going through a lowering to the LLVM dialect instead of a direct translation.

Let me ping @fabianmcg on this whole discussion to get another perspective.

@fabianmcg
Contributor

fabianmcg commented Apr 23, 2025

Ok, I'll provide an overview of the current mechanism, then some rationale and how I thought it could be used:

Currently, convert-to-llvm only legalizes the args of gpu.launch_func (i.e. the args are updated, but the ops remain). It's only during translation that gpu.binary and gpu.launch_func ops fully expand.

The translation process is handled by OffloadingLLVMTranslationAttrInterface which is an inherent attr of gpu.binary:

      "::llvm::LogicalResult", "embedBinary",
      (ins "::mlir::Operation*":$binaryOp,
           "::llvm::IRBuilderBase&":$hostBuilder,
           "::mlir::LLVM::ModuleTranslation&":$hostModuleTranslation)
      "::llvm::LogicalResult", "launchKernel",
      (ins "::mlir::Operation*":$launchFunc, "::mlir::Operation*":$binaryOp,
           "::llvm::IRBuilderBase&":$hostBuilder,
           "::mlir::LLVM::ModuleTranslation&":$hostModuleTranslation)

The rationale was that users could customize the process either by adding new attributes implementing the interface, or by registering a different external interface model to an existing attribute. It would also avoid the pitfalls of modeling the process after a specific runtime like the CUDA runtime, the CUDA driver, or Vulkan...

Now, the idea behind using translation instead of lowerings is that LLVM is getting project offload. Also, at the time there was already a lot of infra in LLVM to handle certain offloading bits, like registering kernels. So it was decided not to reinvent the wheel, and that it would be better to use those when they became available. An example of this infra is the LLVM code used by clang for registering binaries with the CUDA runtime: https://github.com/llvm/llvm-project/blob/main/llvm/lib/Frontend/Offloading/OffloadWrapper.cpp#L305-L545

Now, an example of how I envisioned customization is shown in PR #78117. It adds support for loading/launching/registering kernels and binaries using the CUDA and HIP runtimes, avoiding the issues (synchronized kernel launches) that this PR tries to solve.
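To make that customization point concrete, here is a minimal sketch of registering a different external interface model on an existing offloading attribute. The names MyOffloadingHandlerImpl and registerMyOffloadingHandler are hypothetical; the method signatures follow the interface quoted above and the SelectObjectAttrImpl code in this PR, and the include set is abbreviated.

    // Hypothetical: an offloading handler that embeds the binary but leaves
    // loading/launching to an external runtime. Attaching it to the existing
    // #gpu.select_object attribute replaces the default translation behavior.
    #include "mlir/Dialect/GPU/IR/GPUDialect.h"
    #include "mlir/Target/LLVMIR/ModuleTranslation.h"
    #include "llvm/IR/IRBuilder.h"

    namespace {
    class MyOffloadingHandlerImpl
        : public mlir::gpu::OffloadingLLVMTranslationAttrInterface::FallbackModel<
              MyOffloadingHandlerImpl> {
    public:
      llvm::LogicalResult
      embedBinary(mlir::Attribute attribute, mlir::Operation *operation,
                  llvm::IRBuilderBase &builder,
                  mlir::LLVM::ModuleTranslation &moduleTranslation) const {
        // Emit only the serialized object as a global string; no ctor/dtor.
        return llvm::success();
      }

      llvm::LogicalResult
      launchKernel(mlir::Attribute attribute, mlir::Operation *launchFuncOperation,
                   mlir::Operation *binaryOperation, llvm::IRBuilderBase &builder,
                   mlir::LLVM::ModuleTranslation &moduleTranslation) const {
        // Lower gpu.launch_func to whatever runtime calls are appropriate.
        return llvm::success();
      }
    };
    } // namespace

    // Registration mirrors SelectObjectAttrImpl in this PR.
    void registerMyOffloadingHandler(mlir::DialectRegistry &registry) {
      registry.addExtension(+[](mlir::MLIRContext *ctx, mlir::gpu::GPUDialect *) {
        mlir::gpu::SelectObjectAttr::attachInterface<MyOffloadingHandlerImpl>(*ctx);
      });
    }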

@Hardcode84
Contributor

gpu.load_module / gpu.unload_module / gpu.get_func should fit all three vendors' compute APIs (CUDA, HIP, L0, OpenCL); not sure about Vulkan.

@fabianmcg
Contributor

gpu.load_module / gpu.unload_module / gpu.get_func should fit all three vendors' compute APIs (CUDA, HIP, L0, OpenCL)

More than vendors, it's not a 1-to-1 mapping between runtimes. For example, the CUDA and HIP drivers will do (see the driver-API sketch after these lists):

  1. load
  2. get_func
  3. launch
  4. unload

But the CUDA and HIP runtimes will do:

  1. register the binaries at startup and map them to a host global (usually a func stub)
  2. launch
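For concreteness, the driver-style sequence maps roughly onto the CUDA driver API as follows; this is a sketch, not code from this PR, with error handling, context setup, and real launch dimensions elided.

    // The load -> get_func -> launch -> unload shape that the mgpu* wrappers
    // (and the pre-existing per-launch lowering) follow, written against the
    // CUDA driver API directly.
    #include <cuda.h>

    void launchOnce(const void *gpuBinary, const char *kernelName,
                    CUstream stream, void **kernelArgs) {
      CUmodule module;
      cuModuleLoadData(&module, gpuBinary);               // 1. load
      CUfunction function;
      cuModuleGetFunction(&function, module, kernelName); // 2. get_func
      cuLaunchKernel(function, /*gridDim=*/1, 1, 1, /*blockDim=*/1, 1, 1,
                     /*sharedMemBytes=*/0, stream, kernelArgs,
                     /*extra=*/nullptr);                   // 3. launch
      cuStreamSynchronize(stream);
      cuModuleUnload(module);                              // 4. unload
    }

The runtime-style flavor instead registers the embedded binaries once in a startup constructor and launches through a host stub, which is why there is no per-launch load/unload step.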

clementval added a commit that referenced this pull request Apr 23, 2025
@chsigg
Contributor Author

chsigg commented Apr 24, 2025

Thanks Fabian for the explanations, I wasn't aware of the LLVM offloading project.

The existing gpu-to-llvm pass is already very specifically targeting the runtime wrappers, so I'm wondering if it would make sense to convert to (global) mgpuModuleLoad / mgpuLaunchKernel etc. calls there. Currently gpu-to-llvm runs just before gpu-module-to-binary in some places, but I think it would be fine to swap those.

After gpu-to-llvm, the gpu.binary op would be dead and translation to LLVM would not use the OffloadingLLVMTranslation functionality. Would you consider this a step backwards from where you are heading? Or is it fine to remove SelectObjectAttrImpl and expect that there will be an OffloadingLLVMTranslationAttrInterface implementation that fits LLVM offloading, or other implementations for users who do not run gpu-to-llvm?

@fabianmcg
Contributor

I'm not blocking or voicing opposition to any changes. I just have a few questions:

  • What would we gain from moving the code to gpu-to-llvm?
  • Is there something we can't do today that we would be able to do with that change?

Would you consider this a step backwards from where you are heading?

From my initial idea, yes, because the idea was to reuse existing LLVM infra and not duplicate code. Moreover, the ops have a clear low-level meaning and thus a direct translation, so I don't see any benefit in handling them during conversion.

I'll also say that once project offload reaches maturity, I'm going to propose migrating the existing GPU runtime wrappers to Offload's, which would fix issues like this.

@clementval
Contributor

clementval commented Apr 24, 2025

Moreover, the ops have a clear low-level meaning and thus a direct translation, so I don't see any benefit in handling them during conversion.

I agree with this statement, but for me the loading/unloading of the module should be delegated to another operation that has this clear meaning, so it would be more modular.

Or as an intermediate step we could have an attribute on the gpu.binary operation that indicates whether the translation should manage the loading/unloading or if it is left to another part of the pipeline.

@chsigg
Contributor Author

chsigg commented Apr 25, 2025

  • What would we gain from moving the code to gpu-to-llvm?

My motivation would be to keep the things that are tied to MLIR's runtime wrappers in one place. It feels slightly spread out at the moment, with most ops being lowered to mgpu* calls in gpu-to-llvm, except for gpu.binary and gpu.launch_func, which are handled during translation.

  • Is there something we can't do today that we would be able to do with that change?

Not that I'm aware of, or at least I don't have a need to change anything. I merely wanted to show that concurrent kernel execution is possible, which required loading modules ahead of time. But I don't want to leave things in a worse state than they were before. So I'm happy to help if you would like to move things in a particular direction, but I don't have an agenda myself or feel like I have enough background to formulate one.

Migrating to LLVM offloading sounds like a clear win to me, and hopefully any changes that we make now will make that migration easier. I have no idea how this is being used, but I see that there is GenericDeviceTy::loadBinary, so maybe a separate load/unload module op would be a good fit.

@fabianmcg
Contributor

the loading/unloading of the module should be delegated to another operation that has this clear meaning, so it would be more modular.

I think adding something like a gpu.get_binary op that returns a ptr to the embedded binary could provide that flexibility. That way, we also avoid the pitfalls of modeling against a particular runtime, and if a user needs more customization they can take the ptr and load it via a func call.

Or as an intermediate step we could have an attribute on the gpu.binary operation that indicates whether the translation should manage the loading/unloading or if it is left to another part of the pipeline.

Technically, gpu.binary already allows that: https://github.com/llvm/llvm-project/blob/main/mlir/include/mlir/Dialect/GPU/IR/GPUOps.td#L1467

It's possible to add an attr #cuf.offloading_handler implementing the interface, put it on the binary, and then translation would only do what the interface says, and not load a module, for example.

This is what #78117 does. It adds a new offloading attribute that, during translation, adds a global constructor registering the binaries at startup with the CUDA runtime, and then launch is just a call to a runtime function. It would be up to the user to use gpu.select_object or gpu.offload_embedding in gpu.binary.

I'm happy to help if you would like to move things in a particular direction

At the moment, I would argue it is better not to move things to gpu-to-llvm, because we would ultimately move them out again when offload comes around.
However, IMO a good cleanup would be merging load, get_func, and launch into mgpuLaunchKernel and handling everything inside that function, including lazily loading the modules and unloading at program shutdown.
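As an illustration of that suggested cleanup, here is a minimal sketch of a launch-side helper that lazily loads and caches modules keyed by the embedded binary. The helper name mgpuGetOrLoadKernel is hypothetical; the mgpuModuleLoad/mgpuModuleGetFunction signatures match the ones used in this PR. Unloading at shutdown could walk the cache from a static destructor or an atexit handler.

    // Hypothetical wrapper-side caching: the first launch of a module loads it,
    // later launches reuse the cached handle. Not the current mgpu* API.
    #include <cstdint>
    #include <mutex>
    #include <unordered_map>

    extern "C" void *mgpuModuleLoad(void *data, uint64_t size);
    extern "C" void *mgpuModuleGetFunction(void *module, const char *name);

    namespace {
    std::mutex cacheMutex;
    // Keyed by the address of the embedded binary blob; one entry per gpu.binary.
    std::unordered_map<const void *, void *> moduleCache;
    } // namespace

    extern "C" void *mgpuGetOrLoadKernel(void *binary, uint64_t size,
                                         const char *kernelName) {
      void *module;
      {
        std::lock_guard<std::mutex> lock(cacheMutex);
        auto it = moduleCache.find(binary);
        if (it == moduleCache.end())
          it = moduleCache.emplace(binary, mgpuModuleLoad(binary, size)).first;
        module = it->second;
      }
      return mgpuModuleGetFunction(module, kernelName);
    }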
