[mlir][gpu] Change GPU modules to globals #135478

Merged
chsigg merged 3 commits into llvm:main from piper_export_cl_746544976 on Apr 22, 2025

Conversation

chsigg
Contributor

@chsigg chsigg commented Apr 12, 2025

Load/unload GPU modules in global ctors/dtors instead of each time a kernel is launched.

Loading GPU modules is a heavy-weight operation and synchronizes the GPU context. Now that the modules are loaded ahead of time, asynchronously launched kernels can run concurrently, see https://discourse.llvm.org/t/how-to-lower-the-combination-of-async-gpu-ops-in-gpu-dialect.

The implementations of embedBinary() and launchKernel() use slightly different mechanics at the moment, but I prefer not to change the latter more than necessary as part of this PR. I will prepare a follow-up NFC for launchKernel() to align them again.

…en launching a kernel.

Loading GPU modules is a heavy-weight operation and synchronizes the GPU context. Now that the modules are loaded ahead of time, asynchronously launched kernels can run concurrently, see https://discourse.llvm.org/t/how-to-lower-the-combination-of-async-gpu-ops-in-gpu-dialect.
@llvmbot added the mlir:llvm, mlir:gpu, mlir, and bazel ("Peripheral" support tier, build system: utils/bazel) labels on Apr 12, 2025
@llvmbot
Member

llvmbot commented Apr 12, 2025

@llvm/pr-subscribers-mlir-execution-engine
@llvm/pr-subscribers-mlir-gpu

@llvm/pr-subscribers-mlir

Author: Christian Sigg (chsigg)

Changes

Load/unload GPU modules in global ctors/dtors instead of each time a kernel is launched.

Loading GPU modules is a heavy-weight operation and synchronizes the GPU context. Now that the modules are loaded ahead of time, asynchronously launched kernels can run concurrently, see https://discourse.llvm.org/t/how-to-lower-the-combination-of-async-gpu-ops-in-gpu-dialect.

The implementations of embedBinary() and launchKernel() use slightly different mechanics at the moment, but I prefer not to change the latter more than necessary as part of this PR. I will prepare a follow-up NFC for launchKernel() to align them again.


Patch is 25.00 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/135478.diff

4 Files Affected:

  • (modified) mlir/lib/Target/LLVMIR/Dialect/GPU/SelectObjectAttr.cpp (+128-130)
  • (added) mlir/test/Integration/GPU/CUDA/concurrent-kernels.mlir (+48)
  • (modified) mlir/test/Target/LLVMIR/gpu.mlir (+35-36)
  • (modified) utils/bazel/llvm-project-overlay/mlir/BUILD.bazel (+1)
diff --git a/mlir/lib/Target/LLVMIR/Dialect/GPU/SelectObjectAttr.cpp b/mlir/lib/Target/LLVMIR/Dialect/GPU/SelectObjectAttr.cpp
index 8d4a0bcf8adbf..d3216d9ad17eb 100644
--- a/mlir/lib/Target/LLVMIR/Dialect/GPU/SelectObjectAttr.cpp
+++ b/mlir/lib/Target/LLVMIR/Dialect/GPU/SelectObjectAttr.cpp
@@ -18,11 +18,13 @@
 #include "mlir/Target/LLVMIR/Export.h"
 #include "mlir/Target/LLVMIR/ModuleTranslation.h"
 
+#include "llvm/ADT/ScopeExit.h"
 #include "llvm/IR/Constants.h"
 #include "llvm/IR/IRBuilder.h"
 #include "llvm/IR/LLVMContext.h"
 #include "llvm/IR/Module.h"
 #include "llvm/Support/FormatVariadic.h"
+#include "llvm/Transforms/Utils/ModuleUtils.h"
 
 using namespace mlir;
 
@@ -31,9 +33,13 @@ namespace {
 class SelectObjectAttrImpl
     : public gpu::OffloadingLLVMTranslationAttrInterface::FallbackModel<
           SelectObjectAttrImpl> {
+  // Returns the selected object for embedding.
+  gpu::ObjectAttr getSelectedObject(gpu::BinaryOp op) const;
+
 public:
   // Translates a `gpu.binary`, embedding the binary into a host LLVM module as
-  // global binary string.
+  // global binary string which gets loaded/unloaded into a global module
+  // object through a global ctor/dtor.
   LogicalResult embedBinary(Attribute attribute, Operation *operation,
                             llvm::IRBuilderBase &builder,
                             LLVM::ModuleTranslation &moduleTranslation) const;
@@ -45,23 +51,9 @@ class SelectObjectAttrImpl
                              Operation *binaryOperation,
                              llvm::IRBuilderBase &builder,
                              LLVM::ModuleTranslation &moduleTranslation) const;
-
-  // Returns the selected object for embedding.
-  gpu::ObjectAttr getSelectedObject(gpu::BinaryOp op) const;
 };
-// Returns an identifier for the global string holding the binary.
-std::string getBinaryIdentifier(StringRef binaryName) {
-  return binaryName.str() + "_bin_cst";
-}
 } // namespace
 
-void mlir::gpu::registerOffloadingLLVMTranslationInterfaceExternalModels(
-    DialectRegistry &registry) {
-  registry.addExtension(+[](MLIRContext *ctx, gpu::GPUDialect *dialect) {
-    SelectObjectAttr::attachInterface<SelectObjectAttrImpl>(*ctx);
-  });
-}
-
 gpu::ObjectAttr
 SelectObjectAttrImpl::getSelectedObject(gpu::BinaryOp op) const {
   ArrayRef<Attribute> objects = op.getObjectsAttr().getValue();
@@ -96,6 +88,94 @@ SelectObjectAttrImpl::getSelectedObject(gpu::BinaryOp op) const {
   return mlir::dyn_cast<gpu::ObjectAttr>(objects[index]);
 }
 
+static Twine getModuleIdentifier(StringRef moduleName) {
+  return moduleName + "_module";
+}
+
+namespace llvm {
+static LogicalResult embedBinaryImpl(StringRef moduleName,
+                                     gpu::ObjectAttr object, Module &module) {
+
+  // Embed the object as a global string.
+  // Add null for assembly output for JIT paths that expect null-terminated
+  // strings.
+  bool addNull = (object.getFormat() == gpu::CompilationTarget::Assembly);
+  StringRef serializedStr = object.getObject().getValue();
+  Constant *serializedCst =
+      ConstantDataArray::getString(module.getContext(), serializedStr, addNull);
+  GlobalVariable *serializedObj =
+      new GlobalVariable(module, serializedCst->getType(), true,
+                         GlobalValue::LinkageTypes::InternalLinkage,
+                         serializedCst, moduleName + "_binary");
+  serializedObj->setAlignment(MaybeAlign(8));
+  serializedObj->setUnnamedAddr(GlobalValue::UnnamedAddr::None);
+
+  // Default JIT optimization level.
+  auto optLevel = APInt::getZero(32);
+
+  if (DictionaryAttr objectProps = object.getProperties()) {
+    if (auto section = dyn_cast_or_null<StringAttr>(
+            objectProps.get(gpu::elfSectionName))) {
+      serializedObj->setSection(section.getValue());
+    }
+    // Check if there's an optimization level embedded in the object.
+    if (auto optAttr = dyn_cast_or_null<IntegerAttr>(objectProps.get("O")))
+      optLevel = optAttr.getValue();
+  }
+
+  IRBuilder<> builder(module.getContext());
+  auto i32Ty = builder.getInt32Ty();
+  auto i64Ty = builder.getInt64Ty();
+  auto ptrTy = builder.getPtrTy(0);
+  auto voidTy = builder.getVoidTy();
+
+  // Embed the module as a global object.
+  auto *modulePtr = new GlobalVariable(
+      module, ptrTy, /*isConstant=*/false, GlobalValue::InternalLinkage,
+      /*Initializer=*/ConstantPointerNull::get(ptrTy),
+      getModuleIdentifier(moduleName));
+
+  auto *loadFn = Function::Create(FunctionType::get(voidTy, /*IsVarArg=*/false),
+                                  GlobalValue::InternalLinkage,
+                                  moduleName + "_load", module);
+  loadFn->setSection(".text.startup");
+  auto *loadBlock = BasicBlock::Create(module.getContext(), "entry", loadFn);
+  builder.SetInsertPoint(loadBlock);
+  Value *moduleObj = [&] {
+    if (object.getFormat() == gpu::CompilationTarget::Assembly) {
+      FunctionCallee moduleLoadFn = module.getOrInsertFunction(
+          "mgpuModuleLoadJIT", FunctionType::get(ptrTy, {ptrTy, i32Ty}, false));
+      Constant *optValue = ConstantInt::get(i32Ty, optLevel);
+      return builder.CreateCall(moduleLoadFn, {serializedObj, optValue});
+    } else {
+      FunctionCallee moduleLoadFn = module.getOrInsertFunction(
+          "mgpuModuleLoad", FunctionType::get(ptrTy, {ptrTy, i64Ty}, false));
+      Constant *binarySize =
+          ConstantInt::get(i64Ty, serializedStr.size() + (addNull ? 1 : 0));
+      return builder.CreateCall(moduleLoadFn, {serializedObj, binarySize});
+    }
+  }();
+  builder.CreateStore(moduleObj, modulePtr);
+  builder.CreateRetVoid();
+  appendToGlobalCtors(module, loadFn, /*Priority=*/123);
+
+  auto *unloadFn = Function::Create(
+      FunctionType::get(voidTy, /*IsVarArg=*/false),
+      GlobalValue::InternalLinkage, moduleName + "_unload", module);
+  unloadFn->setSection(".text.startup");
+  auto *unloadBlock =
+      BasicBlock::Create(module.getContext(), "entry", unloadFn);
+  builder.SetInsertPoint(unloadBlock);
+  FunctionCallee moduleUnloadFn = module.getOrInsertFunction(
+      "mgpuModuleUnload", FunctionType::get(voidTy, ptrTy, false));
+  builder.CreateCall(moduleUnloadFn, builder.CreateLoad(ptrTy, modulePtr));
+  builder.CreateRetVoid();
+  appendToGlobalDtors(module, unloadFn, /*Priority=*/123);
+
+  return success();
+}
+} // namespace llvm
+
 LogicalResult SelectObjectAttrImpl::embedBinary(
     Attribute attribute, Operation *operation, llvm::IRBuilderBase &builder,
     LLVM::ModuleTranslation &moduleTranslation) const {
@@ -113,29 +193,8 @@ LogicalResult SelectObjectAttrImpl::embedBinary(
   if (!object)
     return failure();
 
-  llvm::Module *module = moduleTranslation.getLLVMModule();
-
-  // Embed the object as a global string.
-  // Add null for assembly output for JIT paths that expect null-terminated
-  // strings.
-  bool addNull = (object.getFormat() == gpu::CompilationTarget::Assembly);
-  llvm::Constant *binary = llvm::ConstantDataArray::getString(
-      builder.getContext(), object.getObject().getValue(), addNull);
-  llvm::GlobalVariable *serializedObj =
-      new llvm::GlobalVariable(*module, binary->getType(), true,
-                               llvm::GlobalValue::LinkageTypes::InternalLinkage,
-                               binary, getBinaryIdentifier(op.getName()));
-
-  if (object.getProperties()) {
-    if (auto section = mlir::dyn_cast_or_null<mlir::StringAttr>(
-            object.getProperties().get(gpu::elfSectionName))) {
-      serializedObj->setSection(section.getValue());
-    }
-  }
-  serializedObj->setLinkage(llvm::GlobalValue::LinkageTypes::InternalLinkage);
-  serializedObj->setAlignment(llvm::MaybeAlign(8));
-  serializedObj->setUnnamedAddr(llvm::GlobalValue::UnnamedAddr::None);
-  return success();
+  return embedBinaryImpl(op.getName(), object,
+                         *moduleTranslation.getLLVMModule());
 }
 
 namespace llvm {
@@ -153,15 +212,6 @@ class LaunchKernel {
   // Get the module function callee.
   FunctionCallee getModuleFunctionFn();
 
-  // Get the module load callee.
-  FunctionCallee getModuleLoadFn();
-
-  // Get the module load JIT callee.
-  FunctionCallee getModuleLoadJITFn();
-
-  // Get the module unload callee.
-  FunctionCallee getModuleUnloadFn();
-
   // Get the stream create callee.
   FunctionCallee getStreamCreateFn();
 
@@ -261,24 +311,6 @@ llvm::FunctionCallee llvm::LaunchKernel::getModuleFunctionFn() {
       FunctionType::get(ptrTy, ArrayRef<Type *>({ptrTy, ptrTy}), false));
 }
 
-llvm::FunctionCallee llvm::LaunchKernel::getModuleLoadFn() {
-  return module.getOrInsertFunction(
-      "mgpuModuleLoad",
-      FunctionType::get(ptrTy, ArrayRef<Type *>({ptrTy, i64Ty}), false));
-}
-
-llvm::FunctionCallee llvm::LaunchKernel::getModuleLoadJITFn() {
-  return module.getOrInsertFunction(
-      "mgpuModuleLoadJIT",
-      FunctionType::get(ptrTy, ArrayRef<Type *>({ptrTy, i32Ty}), false));
-}
-
-llvm::FunctionCallee llvm::LaunchKernel::getModuleUnloadFn() {
-  return module.getOrInsertFunction(
-      "mgpuModuleUnload",
-      FunctionType::get(voidTy, ArrayRef<Type *>({ptrTy}), false));
-}
-
 llvm::FunctionCallee llvm::LaunchKernel::getStreamCreateFn() {
   return module.getOrInsertFunction("mgpuStreamCreate",
                                     FunctionType::get(ptrTy, false));
@@ -301,9 +333,9 @@ llvm::FunctionCallee llvm::LaunchKernel::getStreamSyncFn() {
 llvm::Value *llvm::LaunchKernel::getOrCreateFunctionName(StringRef moduleName,
                                                          StringRef kernelName) {
   std::string globalName =
-      std::string(formatv("{0}_{1}_kernel_name", moduleName, kernelName));
+      std::string(formatv("{0}_{1}_name", moduleName, kernelName));
 
-  if (GlobalVariable *gv = module.getGlobalVariable(globalName))
+  if (GlobalVariable *gv = module.getGlobalVariable(globalName, true))
     return gv;
 
   return builder.CreateGlobalString(kernelName, globalName);
@@ -346,16 +378,13 @@ llvm::LaunchKernel::createKernelArgArray(mlir::gpu::LaunchFuncOp op) {
 }
 
 // Emits LLVM IR to launch a kernel function:
-// %0 = call %binarygetter
-// %1 = call %moduleLoad(%0)
-// %2 = <see generateKernelNameConstant>
-// %3 = call %moduleGetFunction(%1, %2)
-// %4 = call %streamCreate()
-// %5 = <see generateParamsArray>
-// call %launchKernel(%3, <launchOp operands 0..5>, 0, %4, %5, nullptr)
-// call %streamSynchronize(%4)
-// call %streamDestroy(%4)
-// call %moduleUnload(%1)
+// %1 = load %global_module_object
+// %2 = call @mgpuModuleGetFunction(%1, %global_kernel_name)
+// %3 = call @mgpuStreamCreate()
+// %4 = <see createKernelArgArray()>
+// call @mgpuLaunchKernel(%2, ..., %3, %4, ...)
+// call @mgpuStreamSynchronize(%3)
+// call @mgpuStreamDestroy(%3)
 llvm::LogicalResult
 llvm::LaunchKernel::createKernelLaunch(mlir::gpu::LaunchFuncOp op,
                                        mlir::gpu::ObjectAttr object) {
@@ -385,58 +414,29 @@ llvm::LaunchKernel::createKernelLaunch(mlir::gpu::LaunchFuncOp op,
   // Create the argument array.
   Value *argArray = createKernelArgArray(op);
 
-  // Default JIT optimization level.
-  llvm::Constant *optV = llvm::ConstantInt::get(i32Ty, 0);
-  // Check if there's an optimization level embedded in the object.
-  DictionaryAttr objectProps = object.getProperties();
-  mlir::Attribute optAttr;
-  if (objectProps && (optAttr = objectProps.get("O"))) {
-    auto optLevel = dyn_cast<IntegerAttr>(optAttr);
-    if (!optLevel)
-      return op.emitError("the optimization level must be an integer");
-    optV = llvm::ConstantInt::get(i32Ty, optLevel.getValue());
-  }
-
-  // Load the kernel module.
-  StringRef moduleName = op.getKernelModuleName().getValue();
-  std::string binaryIdentifier = getBinaryIdentifier(moduleName);
-  Value *binary = module.getGlobalVariable(binaryIdentifier, true);
-  if (!binary)
-    return op.emitError() << "Couldn't find the binary: " << binaryIdentifier;
-
-  auto binaryVar = dyn_cast<llvm::GlobalVariable>(binary);
-  if (!binaryVar)
-    return op.emitError() << "Binary is not a global variable: "
-                          << binaryIdentifier;
-  llvm::Constant *binaryInit = binaryVar->getInitializer();
-  auto binaryDataSeq =
-      dyn_cast_if_present<llvm::ConstantDataSequential>(binaryInit);
-  if (!binaryDataSeq)
-    return op.emitError() << "Couldn't find binary data array: "
-                          << binaryIdentifier;
-  llvm::Constant *binarySize =
-      llvm::ConstantInt::get(i64Ty, binaryDataSeq->getNumElements() *
-                                        binaryDataSeq->getElementByteSize());
-
-  Value *moduleObject =
-      object.getFormat() == gpu::CompilationTarget::Assembly
-          ? builder.CreateCall(getModuleLoadJITFn(), {binary, optV})
-          : builder.CreateCall(getModuleLoadFn(), {binary, binarySize});
-
   // Load the kernel function.
-  Value *moduleFunction = builder.CreateCall(
-      getModuleFunctionFn(),
-      {moduleObject,
-       getOrCreateFunctionName(moduleName, op.getKernelName().getValue())});
+  StringRef moduleName = op.getKernelModuleName().getValue();
+  Twine moduleIdentifier = getModuleIdentifier(moduleName);
+  Value *modulePtr = module.getGlobalVariable(moduleIdentifier.str(), true);
+  if (!modulePtr)
+    return op.emitError() << "Couldn't find the binary: " << moduleIdentifier;
+  Value *moduleObj = builder.CreateLoad(ptrTy, modulePtr);
+  Value *functionName = getOrCreateFunctionName(moduleName, op.getKernelName());
+  Value *moduleFunction =
+      builder.CreateCall(getModuleFunctionFn(), {moduleObj, functionName});
 
   // Get the stream to use for execution. If there's no async object then create
   // a stream to make a synchronous kernel launch.
   Value *stream = nullptr;
-  bool handleStream = false;
+  // Sync & destroy the stream, for synchronous launches.
+  auto destroyStream = make_scope_exit([&]() {
+    builder.CreateCall(getStreamSyncFn(), {stream});
+    builder.CreateCall(getStreamDestroyFn(), {stream});
+  });
   if (mlir::Value asyncObject = op.getAsyncObject()) {
     stream = llvmValue(asyncObject);
+    destroyStream.release();
   } else {
-    handleStream = true;
     stream = builder.CreateCall(getStreamCreateFn(), {});
   }
 
@@ -462,14 +462,12 @@ llvm::LaunchKernel::createKernelLaunch(mlir::gpu::LaunchFuncOp op,
                                           argArray, nullPtr, paramsCount}));
   }
 
-  // Sync & destroy the stream, for synchronous launches.
-  if (handleStream) {
-    builder.CreateCall(getStreamSyncFn(), {stream});
-    builder.CreateCall(getStreamDestroyFn(), {stream});
-  }
-
-  // Unload the kernel module.
-  builder.CreateCall(getModuleUnloadFn(), {moduleObject});
-
   return success();
 }
+
+void mlir::gpu::registerOffloadingLLVMTranslationInterfaceExternalModels(
+    DialectRegistry &registry) {
+  registry.addExtension(+[](MLIRContext *ctx, gpu::GPUDialect *dialect) {
+    SelectObjectAttr::attachInterface<SelectObjectAttrImpl>(*ctx);
+  });
+}
diff --git a/mlir/test/Integration/GPU/CUDA/concurrent-kernels.mlir b/mlir/test/Integration/GPU/CUDA/concurrent-kernels.mlir
new file mode 100644
index 0000000000000..80cc6d6bf91dd
--- /dev/null
+++ b/mlir/test/Integration/GPU/CUDA/concurrent-kernels.mlir
@@ -0,0 +1,48 @@
+// Tests that we can run multiple kernels concurrently. Runs two kernels, which
+// increment a global atomic counter, then wait for the counter to reach 2.
+//
+// RUN: mlir-opt %s \
+// RUN: | mlir-opt -gpu-lower-to-nvvm-pipeline="cubin-format=%gpu_compilation_format" \
+// RUN: | mlir-runner \
+// RUN:   --shared-libs=%mlir_cuda_runtime \
+// RUN:   --shared-libs=%mlir_runner_utils \
+// RUN:   --entry-point-result=void
+
+module attributes {gpu.container_module} {
+    gpu.module @kernels {
+        gpu.func @kernel(%memref: memref<i32>) kernel {
+            %c0 = arith.constant 0 : i32
+            %c1 = arith.constant 1 : i32
+            %c2 = arith.constant 2 : i32
+            %block = memref.atomic_rmw addi %c1, %memref[] : (i32, memref<i32>) -> i32
+            scf.while: () -> () {
+                %value = memref.atomic_rmw addi %c0, %memref[] : (i32, memref<i32>) -> i32
+                %cond = arith.cmpi slt, %value, %c2 : i32
+                scf.condition(%cond)
+            } do {
+                scf.yield
+            }
+            gpu.return
+        }
+    }
+
+    func.func @main() {
+        %memref = gpu.alloc host_shared () : memref<i32>
+        %c0 = arith.constant 0 : i32
+        memref.store %c0, %memref[] : memref<i32>
+
+        %0 = gpu.wait async
+        %1 = gpu.wait async
+        %c1 = arith.constant 1 : index
+        %2 = gpu.launch_func async [%0] @kernels::@kernel
+            blocks in (%c1, %c1, %c1)
+            threads in (%c1, %c1, %c1)
+            args(%memref: memref<i32>)
+        %3 = gpu.launch_func async [%1] @kernels::@kernel
+            blocks in (%c1, %c1, %c1)
+            threads in (%c1, %c1, %c1)
+            args(%memref: memref<i32>)
+        gpu.wait [%2, %3]
+        return
+    }
+}
diff --git a/mlir/test/Target/LLVMIR/gpu.mlir b/mlir/test/Target/LLVMIR/gpu.mlir
index 6b7e7fcc71960..0d29a95b12266 100644
--- a/mlir/test/Target/LLVMIR/gpu.mlir
+++ b/mlir/test/Target/LLVMIR/gpu.mlir
@@ -3,8 +3,11 @@
 // Checking the translation of the `gpu.binary` & `gpu.launch_fun` ops.
 module attributes {gpu.container_module} {
   // CHECK: [[ARGS_TY:%.*]] = type { i32, i32 }
-  // CHECK: @kernel_module_bin_cst = internal constant [4 x i8] c"BLOB", align 8
-  // CHECK: @kernel_module_kernel_kernel_name = private unnamed_addr constant [7 x i8] c"kernel\00", align 1
+  // CHECK-DAG: @kernel_module_binary = internal constant [4 x i8] c"BLOB", align 8
+  // CHECK-DAG: kernel_module_module = internal global ptr null
+  // CHECK-DAG: @llvm.global_ctors = appending global {{.*}} @kernel_module_load
+  // CHECK-DAG: @llvm.global_dtors = appending global {{.*}} @kernel_module_unload
+  // CHECK-DAG: @kernel_module_kernel_name = private unnamed_addr constant [7 x i8] c"kernel\00", align 1
   gpu.binary @kernel_module  [#gpu.object<#nvvm.target, "BLOB">]
   llvm.func @foo() {
     // CHECK: [[ARGS:%.*]] = alloca %{{.*}}, align 8
@@ -17,26 +20,32 @@ module attributes {gpu.container_module} {
     // CHECK: store i32 32, ptr [[ARG1]], align 4
     // CHECK: %{{.*}} = getelementptr ptr, ptr [[ARGS_ARRAY]], i32 1
     // CHECK: store ptr [[ARG1]], ptr %{{.*}}, align 8
-    // CHECK: [[MODULE:%.*]] = call ptr @mgpuModuleLoad(ptr @kernel_module_bin_cst, i64 4)
-    // CHECK: [[FUNC:%.*]] = call ptr @mgpuModuleGetFunction(ptr [[MODULE]], ptr @kernel_module_kernel_kernel_name)
+    // CHECK: [[MODULE:%.*]] = load ptr, ptr @kernel_module_module
+    // CHECK: [[FUNC:%.*]] = call ptr @mgpuModuleGetFunction(ptr [[MODULE]], ptr @kernel_module_kernel_name)
     // CHECK: [[STREAM:%.*]] = call ptr @mgpuStreamCreate()
     // CHECK: call void @mgpuLaunchKernel(ptr [[FUNC]], i64 8, i64 8, i64 8, i64 8, i64 8, i64 8, i32 256, ptr [[STREAM]], ptr [[ARGS_ARRAY]], ptr null, i64 2)
     // CHECK: call void @mgpuStreamSynchronize(ptr [[STREAM]])
     // CHECK: call void @mgpuStreamDestroy(ptr [[STREAM]])
-    // CHECK: call void @mgpuModuleUnload(ptr [[MODULE]])
     %0 = llvm.mlir.constant(8 : index) : i64
     %1 = llvm.mlir.constant(32 : i32) : i32
     %2 = llvm.mlir.constant(256 : i32) : i32
     gpu.launch_func @kernel_module::@kernel blocks in (%0, %0, %0) threads in (%0, %0, %0) : i64 dynamic_shared_memory_size %2 args(%1 : i32, %1 : i32)
     llvm.return
   }
+  // CHECK: @kernel_module_load() section ".text.startup"
+  // CHECK: [[MODULE:%.*]] = call ptr @mgpuModuleLoad
+  // CHECK: store ptr [[MODULE]], ptr @kernel_module_module
+  //
+  // CHECK: @kernel_module_unload() section ".text.startup"
+  // CHECK: [[MODULE:%.*]] = load ptr, ptr @kernel_module_module
+  // CHECK: call void @mgpuModuleUnload(ptr [[MODULE]])
 }
 
 // -----
 
 // Checking the correct selection of the second object using an index as a selector.
 module {
-  // CHECK: @kernel_module_bin_cst = internal constant [1 x i8] c"1", align 8
+  // CHECK: @kernel_module_binary = internal constant [1 x i8] c"1", align 8
   gpu.binary @kernel_module <#gpu.select_object<1>> [#gpu.object<#nvvm.target, "0">, #gpu.object<#nvvm.target, "1">]
 }
 
@@ -44,7 +53,7 @@ module {
 
 // Checking the correct selection of the second object using a target as a selector.
 module {
-  // CHECK: @ker...
[truncated]

@llvmbot
Member

llvmbot commented Apr 12, 2025

@llvm/pr-subscribers-mlir-llvm

This allows concurrent execution of different kernels (different function or different module).

See https://discourse.llvm.org/t/how-to-lower-the-combination-of-async-gpu-ops-in-gpu-dialect/72796/17.
@joker-eph
Collaborator

Would it be possible to lazy load on first use?
Right now this would all happen eagerly on application startup and consume GPU memory even for kernels that wouldn't be used (if I understand the change correctly).

…CUDA_MODULE_LOADING=EAGER in test.

Format test file.
@chsigg
Contributor Author

chsigg commented Apr 13, 2025

Would it be possible to lazy load on first use?

Yes, I reverted the eager loading in the runtime. Instead, one can use the CUDA_MODULE_LOADING=EAGER environment variable to force eager loading.

@joker-eph
Collaborator

git grep CUDA_MODULE_LOADING does not find anything in the repo; do you have a pointer to what you have in mind here?

@chsigg
Contributor Author

chsigg commented Apr 14, 2025

do you have a pointer to what you have in mind here?

This environment variable controls lazy loading of kernels in the CUDA driver:
https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#lazy-loading

The documentation is not consistent about whether to use CUDA_MODULE_LOADING or CUDA_MODULE_DATA_LOADING (see e.g. here). I checked that only the former works, though: with two different kernels, the test fails without the environment variable set or with CUDA_MODULE_DATA_LOADING=EAGER, and passes with CUDA_MODULE_LOADING=EAGER.
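For completeness, eager loading can also be forced from the host process itself; a minimal sketch (not part of this PR, and assuming the variable is only read when CUDA is first initialized, so it must run before any CUDA call):

    // Sketch: force eager module loading for this process. Normally one would
    // just set CUDA_MODULE_LOADING=EAGER in the shell environment when
    // invoking mlir-runner; setenv is the POSIX equivalent in code.
    #include <cstdlib>

    int main() {
      setenv("CUDA_MODULE_LOADING", "EAGER", /*overwrite=*/1);
      // ... initialize CUDA and launch kernels after this point ...
      return 0;
    }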

@chsigg
Contributor Author

chsigg commented Apr 22, 2025

Friendly ping. I would appreciate a review. Thanks!

@joker-eph
Collaborator

joker-eph commented Apr 22, 2025

LG (sorry I missed your answer last week)

@chsigg
Contributor Author

chsigg commented Apr 22, 2025

No problem. Thanks a lot Mehdi.

@chsigg chsigg merged commit 7851b1b into llvm:main Apr 22, 2025
11 checks passed
@chsigg chsigg deleted the piper_export_cl_746544976 branch April 22, 2025 11:50
@clementval
Contributor

Could the constructor be added or not added based on a pass option? We are using the generated binary, but we rely on a different runtime to load the module, so in our case we would like to not generate it.

@joker-eph
Collaborator

I wonder if we could move the logic back into createKernelLaunch(), where it was until now, instead of doing it during the translation?
We'd still inject the mgpuModuleLoad into a global constructor, but we'd do it only if a gpu.launch is present. That way a simple translation of a kernel binary does not do that and preserves other runtime/lowering implementations.

@clementval
Contributor

I wonder if we could move the logic back into createKernelLaunch(), where it was until now, instead of doing it during the translation? We'd still inject the mgpuModuleLoad into a global constructor, but we'd do it only if a gpu.launch is present. That way a simple translation of a kernel binary does not do that and preserves other runtime/lowering implementations.

It would be nice to keep the translation of gpu.binary limited to the creation of the global. We can probably work around the current constructor generation in our case.

@chsigg
Contributor Author

chsigg commented Apr 23, 2025

I'm sorry this change is causing trouble for you. I'm not sure now: can you work with the current state, or would you prefer to move the module global + ctor/dtor creation back to createKernelLaunch?

@joker-eph
Collaborator

I think we should move it to createKernelLaunch because this is a layering problem: the choice of the runtime should be part of the lowering of the GPU dialect, which is meant to be customizable for other runtimes.
The move to the translation layer makes it no longer a customizable lowering step, but something that requires non-trivial workarounds (basically you need to inject an empty mgpuModuleLoad into the LLVM module to "cancel" this added code).
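For reference, that workaround would look roughly like the following C++ stubs; this is a minimal sketch (not part of this PR) assuming the wrapper signatures emitted by the patch (mgpuModuleLoad(ptr, i64) -> ptr, mgpuModuleLoadJIT(ptr, i32) -> ptr, mgpuModuleUnload(ptr) -> void). It only neutralizes the generated ctor/dtor; anything that actually launches kernels through the mgpu* path still needs the real wrappers.

    // No-op replacements for the module load/unload entry points called by the
    // generated global ctor/dtor. Linking these instead of the real runtime
    // wrappers "cancels" the automatic module loading.
    #include <cstdint>

    extern "C" void *mgpuModuleLoad(void * /*binary*/, uint64_t /*size*/) {
      return nullptr; // The module is loaded by a different runtime.
    }

    extern "C" void *mgpuModuleLoadJIT(void * /*binary*/, int32_t /*optLevel*/) {
      return nullptr; // Same for the JIT (assembly) path.
    }

    extern "C" void mgpuModuleUnload(void * /*module*/) {
      // Nothing was loaded here, so there is nothing to unload.
    }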

@Hardcode84
Contributor

Some drive-by comment: it would be nice to have a gpu.launch_func version which takes the function as an SSA value, plus dedicated gpu.load_module/gpu.get_func ops, in addition to the existing gpu.launch_func which takes a symbol ref. This way we can progressively lower things completely within MLIR, while keeping the runtime-call lowering straightforward and delegating kernel lazy-loading/caching decisions to the user.

@clementval
Contributor

I'm sorry this change is causing trouble for you. I'm not sure now: can you work with the current state, or would you prefer to move the module global + ctor/dtor creation back to createKernelLaunch?

We can live with it, and we already have a workaround, but I think that having this in createKernelLaunch makes more sense.

@chsigg
Contributor Author

chsigg commented Apr 23, 2025

I think we should move it to createKernelLaunch because this is a layering problem: the choice of the runtime should be part of the lowering of the GPU dialect, which is meant to be customizable for other runtimes.

Well, either way it's part of the translation layer of a gpu op, no?

The way I understand it, the offloadingHandler of the gpu.binary op is the intended customization point for how gpu.binary and gpu.launch_func are lowered to LLVM.

But yes, I can move this back to createKernelLaunch. It makes sense that lowering the gpu.binary itself should not automatically load that binary. But it's not ideal for it to be part of gpu.launch_func either, so...

I do like the idea of gpu.load_module / gpu.unload_module / gpu.get_func ops to make things more explicit. I'm wondering, though, if it is too CUDA-specific. The runtime wrappers for Vulkan do not seem to quite fit this interface.

I'm honestly a bit lost about how it would all play together and where we would lower to what, so that things can be easily customized and are not tied to one specific API or runtime. If people have time and energy to come up with a plan though, I'm happy to help with the coding.

@joker-eph
Collaborator

Well, either way it's part of the translation layer of a gpu op, no?

Ah you're correct, I thought that gpu.launch_func was still going through a lowering to the LLVM dialect instead of a direct translation.

Let me ping @fabianmcg on this whole discussion to get another perspective.

@fabianmcg
Contributor

fabianmcg commented Apr 23, 2025

Ok, I'll provide an overview of the current mechanism, then some rationale and how I thought it could be used:

Currently, convert-to-llvm only legalizes the args of gpu.launch_func (i.e. the args are updated, but the ops remain). It's only during translation that gpu.binary and gpu.launch_func ops fully expand.

The translation process is handled by OffloadingLLVMTranslationAttrInterface which is an inherent attr of gpu.binary:

      "::llvm::LogicalResult", "embedBinary",
      (ins "::mlir::Operation*":$binaryOp,
           "::llvm::IRBuilderBase&":$hostBuilder,
           "::mlir::LLVM::ModuleTranslation&":$hostModuleTranslation)
      "::llvm::LogicalResult", "launchKernel",
      (ins "::mlir::Operation*":$launchFunc, "::mlir::Operation*":$binaryOp,
           "::llvm::IRBuilderBase&":$hostBuilder,
           "::mlir::LLVM::ModuleTranslation&":$hostModuleTranslation)

The rationale was that users could customize the process either by adding new attributes implementing the interface, or by registering a different external interface model to an existing attribute. It would also avoid the pitfalls of modeling the process after a specific runtime like the CUDA runtime, the CUDA driver, or Vulkan...

Now, the idea behind using translation instead of lowerings is that LLVM is getting project offload. Also, at the time there was already a lot of infra in LLVM to handle certain offloading bits, like registering kernels. So it was decided not to reinvent the wheel, and that it would be better to use those when they became available. An example of this infra is the LLVM code used by clang for registering binaries with the CUDA runtime: https://github.com/llvm/llvm-project/blob/main/llvm/lib/Frontend/Offloading/OffloadWrapper.cpp#L305-L545

Now, an example of how I envisioned customization is shown in PR #78117. It adds support for loading/launching/registering kernels and binaries using the CUDA and HIP runtimes, avoiding the issues (synchronized kernel launches) that this PR tries to solve.
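To make that customization point concrete, here is a minimal sketch of registering a different external interface model on an existing offloading attribute. The names MyOffloadingHandlerImpl and registerMyOffloadingHandler are hypothetical; the method signatures follow the interface quoted above and the SelectObjectAttrImpl code in this PR, and the include set is abbreviated.

    // Hypothetical: an offloading handler that embeds the binary but leaves
    // loading/launching to an external runtime. Attaching it to the existing
    // #gpu.select_object attribute replaces the default translation behavior.
    #include "mlir/Dialect/GPU/IR/GPUDialect.h"
    #include "mlir/Target/LLVMIR/ModuleTranslation.h"
    #include "llvm/IR/IRBuilder.h"

    namespace {
    class MyOffloadingHandlerImpl
        : public mlir::gpu::OffloadingLLVMTranslationAttrInterface::FallbackModel<
              MyOffloadingHandlerImpl> {
    public:
      llvm::LogicalResult
      embedBinary(mlir::Attribute attribute, mlir::Operation *operation,
                  llvm::IRBuilderBase &builder,
                  mlir::LLVM::ModuleTranslation &moduleTranslation) const {
        // Emit only the serialized object as a global string; no ctor/dtor.
        return llvm::success();
      }

      llvm::LogicalResult
      launchKernel(mlir::Attribute attribute, mlir::Operation *launchFuncOperation,
                   mlir::Operation *binaryOperation, llvm::IRBuilderBase &builder,
                   mlir::LLVM::ModuleTranslation &moduleTranslation) const {
        // Lower gpu.launch_func to whatever runtime calls are appropriate.
        return llvm::success();
      }
    };
    } // namespace

    // Registration mirrors SelectObjectAttrImpl in this PR.
    void registerMyOffloadingHandler(mlir::DialectRegistry &registry) {
      registry.addExtension(+[](mlir::MLIRContext *ctx, mlir::gpu::GPUDialect *) {
        mlir::gpu::SelectObjectAttr::attachInterface<MyOffloadingHandlerImpl>(*ctx);
      });
    }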

@Hardcode84
Contributor

gpu.load_module / gpu.unload_module / gpu.get_func should fit all three vendors' compute APIs (CUDA, HIP, L0, OpenCL); not sure about Vulkan.

@fabianmcg
Contributor

gpu.load_module / gpu.unload_module / gpu.get_func should fit all three vendors' compute APIs (CUDA, HIP, L0, OpenCL)

More than vendors, it's not a 1-to-1 mapping between runtimes. For example, the CUDA and HIP drivers will do (see the driver-API sketch after these lists):

  1. load
  2. get_func
  3. launch
  4. unload

But the CUDA and HIP runtimes will do:

  1. register the binaries at startup and map them to a host global (usually a func stub)
  2. launch
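For concreteness, the driver-style sequence maps roughly onto the CUDA driver API as follows; this is a sketch, not code from this PR, with error handling, context setup, and real launch dimensions elided.

    // The load -> get_func -> launch -> unload shape that the mgpu* wrappers
    // (and the pre-existing per-launch lowering) follow, written against the
    // CUDA driver API directly.
    #include <cuda.h>

    void launchOnce(const void *gpuBinary, const char *kernelName,
                    CUstream stream, void **kernelArgs) {
      CUmodule module;
      cuModuleLoadData(&module, gpuBinary);               // 1. load
      CUfunction function;
      cuModuleGetFunction(&function, module, kernelName); // 2. get_func
      cuLaunchKernel(function, /*gridDim=*/1, 1, 1, /*blockDim=*/1, 1, 1,
                     /*sharedMemBytes=*/0, stream, kernelArgs,
                     /*extra=*/nullptr);                   // 3. launch
      cuStreamSynchronize(stream);
      cuModuleUnload(module);                              // 4. unload
    }

The runtime-style flavor instead registers the embedded binaries once in a startup constructor and launches through a host stub, which is why there is no per-launch load/unload step.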

clementval added a commit that referenced this pull request Apr 23, 2025
@chsigg
Contributor Author

chsigg commented Apr 24, 2025

Thanks Fabian for the explanations, I wasn't aware of the LLVM offloading project.

The existing gpu-to-llvm pass is already very specifically targeting the runtime wrappers, so I'm wondering if it would make sense to convert to (global) mgpuModuleLoad / mgpuLaunchKernel etc. calls there. Currently gpu-to-llvm runs just before gpu-module-to-binary in some places, but I think it would be fine to swap those.

After gpu-to-llvm, the gpu.binary op would be dead and translation to LLVM would not use the OffloadingLLVMTranslation functionality. Would you consider this a step backwards from where you are heading? Or is it fine to remove SelectObjectAttrImpl and expect that there will be an OffloadingLLVMTranslationAttrInterface implementation that fits LLVM offloading, or other implementations for users who do not run gpu-to-llvm?

@fabianmcg
Contributor

I'm not blocking or voicing opposition to any changes. I just have a few questions:

  • What would we gain from moving the code to gpu-to-llvm?
  • Is there something we can't do today that we would be able to do with that change?

Would you consider this a step backwards from where you are heading?

From my initial idea, yes, because the idea was to reuse existing LLVM infra and not duplicate code. Moreover, the ops have a clear low-level meaning and thus a direct translation, so I don't see any benefit in handling them during conversion.

I'll also say that once project offload reaches maturity, I'm going to propose migrating the existing GPU runtime wrappers to Offload's, which would fix issues like this.

@clementval
Contributor

clementval commented Apr 24, 2025

Moreover, the ops have a clear low-level meaning and thus a direct translation, so I don't see any benefit in handling them during conversion.

I agree with this statement, but for me the loading/unloading of the module should be delegated to another operation that has this clear meaning, so it would be more modular.

Or as an intermediate step we could have an attribute on the gpu.binary operation that indicates whether the translation should manage the loading/unloading or if it is left to another part of the pipeline.

@chsigg
Contributor Author

chsigg commented Apr 25, 2025

  • What would we gain from moving the code to gpu-to-llvm?

My motivation would be to keep the things that are tied to MLIR's runtime wrappers in one place. It feels slightly spread out at the moment, with most ops being lowered to mgpu* calls in gpu-to-llvm, except for gpu.binary and gpu.launch_func, which are handled during translation.

  • Is there something we can't do today that we would be able to do with that change?

Not that I'm aware of, or at least I don't have a need to change anything. I merely wanted to show that concurrent kernel execution is possible, which required loading modules ahead of time. But I don't want to leave things in a worse state than they were before. So I'm happy to help if you would like to move things in a particular direction, but I don't have an agenda myself or feel like I have enough background to formulate one.

Migrating to LLVM offloading sounds like a clear win to me, and hopefully any changes that we make now will make that migration easier. I have no idea how this is being used, but I see that there is GenericDeviceTy::loadBinary, so maybe a separate load/unload module op would be a good fit.

@fabianmcg
Contributor

the loading/unloading of the module should be delegated to another operation that has this clear meaning, so it would be more modular.

I think adding something like a gpu.get_binary op that returns a ptr to the embedded binary could provide that flexibility. That way, we also avoid the pitfalls of modeling against a particular runtime, and if a user needs more customization they can take the ptr and load it via a func call.

Or as an intermediate step we could have an attribute on the gpu.binary operation that indicates whether the translation should manage the loading/unloading or if it is left to another part of the pipeline.

Technically, gpu.binary already allows that: https://github.com/llvm/llvm-project/blob/main/mlir/include/mlir/Dialect/GPU/IR/GPUOps.td#L1467

It's possible to add an attr #cuf.offloading_handler implementing the interface, put it on the binary, and then translation would only do what the interface says, and not load a module, for example.

This is what #78117 does. It adds a new offloading attribute that, during translation, adds a global constructor registering the binaries at startup with the CUDA runtime, and then launch is just a call to a runtime function. It would be up to the user to use gpu.select_object or gpu.offload_embedding in gpu.binary.

I'm happy to help if you would like to move things in a particular direction

At the moment, I would argue it is better not to move things to gpu-to-llvm, because we would ultimately move them out again when offload comes around.
However, IMO a good cleanup would be merging load, get_func, and launch into mgpuLaunchKernel and handling everything inside that function, including lazily loading the modules and unloading at program shutdown.
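As an illustration of that suggested cleanup, here is a minimal sketch of a launch-side helper that lazily loads and caches modules keyed by the embedded binary. The helper name mgpuGetOrLoadKernel is hypothetical; the mgpuModuleLoad/mgpuModuleGetFunction signatures match the ones used in this PR. Unloading at shutdown could walk the cache from a static destructor or an atexit handler.

    // Hypothetical wrapper-side caching: the first launch of a module loads it,
    // later launches reuse the cached handle. Not the current mgpu* API.
    #include <cstdint>
    #include <mutex>
    #include <unordered_map>

    extern "C" void *mgpuModuleLoad(void *data, uint64_t size);
    extern "C" void *mgpuModuleGetFunction(void *module, const char *name);

    namespace {
    std::mutex cacheMutex;
    // Keyed by the address of the embedded binary blob; one entry per gpu.binary.
    std::unordered_map<const void *, void *> moduleCache;
    } // namespace

    extern "C" void *mgpuGetOrLoadKernel(void *binary, uint64_t size,
                                         const char *kernelName) {
      void *module;
      {
        std::lock_guard<std::mutex> lock(cacheMutex);
        auto it = moduleCache.find(binary);
        if (it == moduleCache.end())
          it = moduleCache.emplace(binary, mgpuModuleLoad(binary, size)).first;
        module = it->second;
      }
      return mgpuModuleGetFunction(module, kernelName);
    }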
