[Offload][AMDGPU] Correctly handle variable implicit argument sizes #142199

jhuber6 · 2025-05-30T18:49:25Z

Summary:
The size of the implicit argument struct can vary depending on
optimizations, it is not always the size as listed by the full struct.
Additionally, the implicit arguments are always aligned on a pointer
boundary. This patch updates the handling to use the correctly aligned
offset and only initialize the members if they are contained in the
reported size.

Additionally, we modify the alloc and free routines to allow
alloc(0) and free(nullptr) as these are mandated by the C standard
and allow us to easily handle cases where the user calls a kernel with
no arguments.

llvmbot · 2025-05-30T18:50:04Z

@llvm/pr-subscribers-backend-amdgpu

Author: Joseph Huber (jhuber6)

Changes

Summary:
The size of the implicit argument struct can vary depending on
optimizations, it is not always the size as listed by the full struct.
Additionally, the implicit arguments are always aligned on a pointer
boundary. This patch updates the handling to use the correctly aligned
offset and only initialize the members if they are contained in the
reported size.

Additionally, we modify the alloc and free routines to allow
alloc(0) and free(nullptr) as these are mandated by the C standard
and allow us to easily handle cases where the user calls a kernel with
no arguments.

Full diff: https://github.com/llvm/llvm-project/pull/142199.diff

2 Files Affected:

(modified) offload/plugins-nextgen/amdgpu/src/rtl.cpp (+37-20)
(modified) offload/plugins-nextgen/amdgpu/utils/UtilitiesRTL.h (+14-1)

diff --git a/offload/plugins-nextgen/amdgpu/src/rtl.cpp b/offload/plugins-nextgen/amdgpu/src/rtl.cpp
index 2733796611d9b..88f3b1ac5416c 100644
--- a/offload/plugins-nextgen/amdgpu/src/rtl.cpp
+++ b/offload/plugins-nextgen/amdgpu/src/rtl.cpp
@@ -420,7 +420,7 @@ struct AMDGPUMemoryManagerTy : public DeviceAllocatorTy {
     assert(PtrStorage && "Invalid pointer storage");
 
     *PtrStorage = MemoryManager->allocate(Size, nullptr);
-    if (*PtrStorage == nullptr)
+    if (Size && *PtrStorage == nullptr)
       return Plugin::error(ErrorCode::OUT_OF_RESOURCES,
                            "failure to allocate from AMDGPU memory manager");
 
@@ -429,7 +429,8 @@ struct AMDGPUMemoryManagerTy : public DeviceAllocatorTy {
 
   /// Release an allocation to be reused.
   Error deallocate(void *Ptr) {
-    assert(Ptr && "Invalid pointer");
+    if (!Ptr)
+      return Plugin::success();
 
     if (MemoryManager->free(Ptr))
       return Plugin::error(ErrorCode::UNKNOWN,
@@ -3365,7 +3366,7 @@ Error AMDGPUKernelTy::launchImpl(GenericDeviceTy &GenericDevice,
                                  KernelLaunchParamsTy LaunchParams,
                                  AsyncInfoWrapperTy &AsyncInfoWrapper) const {
   if (ArgsSize != LaunchParams.Size &&
-      ArgsSize != LaunchParams.Size + getImplicitArgsSize())
+      ArgsSize > LaunchParams.Size + getImplicitArgsSize())
     return Plugin::error(ErrorCode::INVALID_ARGUMENT,
                          "mismatch of kernel arguments size");
 
@@ -3401,23 +3402,39 @@ Error AMDGPUKernelTy::launchImpl(GenericDeviceTy &GenericDevice,
   if (auto Err = AMDGPUDevice.getStream(AsyncInfoWrapper, Stream))
     return Err;
 
-  hsa_utils::AMDGPUImplicitArgsTy *ImplArgs = nullptr;
-  if (ArgsSize == LaunchParams.Size + getImplicitArgsSize()) {
-    ImplArgs = reinterpret_cast<hsa_utils::AMDGPUImplicitArgsTy *>(
-        utils::advancePtr(AllArgs, LaunchParams.Size));
-
-    // Set the COV5+ implicit arguments to the appropriate values.
-    std::memset(ImplArgs, 0, getImplicitArgsSize());
-    ImplArgs->BlockCountX = NumBlocks[0];
-    ImplArgs->BlockCountY = NumBlocks[1];
-    ImplArgs->BlockCountZ = NumBlocks[2];
-    ImplArgs->GroupSizeX = NumThreads[0];
-    ImplArgs->GroupSizeY = NumThreads[1];
-    ImplArgs->GroupSizeZ = NumThreads[2];
-    ImplArgs->GridDims = NumBlocks[2] * NumThreads[2] > 1
-                             ? 3
-                             : 1 + (NumBlocks[1] * NumThreads[1] != 1);
-    ImplArgs->DynamicLdsSize = KernelArgs.DynCGroupMem;
+  uint64_t ImplArgsOffset = utils::roundUp(
+      LaunchParams.Size, alignof(hsa_utils::AMDGPUImplicitArgsTy));
+  if (ArgsSize > ImplArgsOffset) {
+    hsa_utils::AMDGPUImplicitArgsTy *ImplArgs =
+        reinterpret_cast<hsa_utils::AMDGPUImplicitArgsTy *>(
+            utils::advancePtr(AllArgs, ImplArgsOffset));
+
+    // Set the COV5+ implicit arguments to the appropriate values if present.
+    uint64_t ImplArgsSize = ArgsSize - ImplArgsOffset;
+    std::memset(ImplArgs, 0, ImplArgsSize);
+
+    using ImplArgsTy = hsa_utils::AMDGPUImplicitArgsTy;
+    hsa_utils::initImplArg(ImplArgs, &ImplArgsTy::BlockCountX, ImplArgsSize,
+                           NumBlocks[0]);
+    hsa_utils::initImplArg(ImplArgs, &ImplArgsTy::BlockCountY, ImplArgsSize,
+                           NumBlocks[1]);
+    hsa_utils::initImplArg(ImplArgs, &ImplArgsTy::BlockCountZ, ImplArgsSize,
+                           NumBlocks[2]);
+
+    hsa_utils::initImplArg(ImplArgs, &ImplArgsTy::GroupSizeX, ImplArgsSize,
+                           NumThreads[0]);
+    hsa_utils::initImplArg(ImplArgs, &ImplArgsTy::GroupSizeY, ImplArgsSize,
+                           NumThreads[1]);
+    hsa_utils::initImplArg(ImplArgs, &ImplArgsTy::GroupSizeZ, ImplArgsSize,
+                           NumThreads[2]);
+
+    hsa_utils::initImplArg(ImplArgs, &ImplArgsTy::GridDims, ImplArgsSize,
+                           NumBlocks[2] * NumThreads[2] > 1
+                               ? 3
+                               : 1 + (NumBlocks[1] * NumThreads[1] != 1));
+
+    hsa_utils::initImplArg(ImplArgs, &ImplArgsTy::DynamicLdsSize, ImplArgsSize,
+                           KernelArgs.DynCGroupMem);
   }
 
   // Push the kernel launch into the stream.
diff --git a/offload/plugins-nextgen/amdgpu/utils/UtilitiesRTL.h b/offload/plugins-nextgen/amdgpu/utils/UtilitiesRTL.h
index 609ead942dbb3..77c756e006029 100644
--- a/offload/plugins-nextgen/amdgpu/utils/UtilitiesRTL.h
+++ b/offload/plugins-nextgen/amdgpu/utils/UtilitiesRTL.h
@@ -13,6 +13,7 @@
 #include <cstdint>
 
 #include "Shared/Debug.h"
+#include "Shared/Utils.h"
 #include "Utils/ELF.h"
 
 #include "omptarget.h"
@@ -26,7 +27,7 @@ namespace plugin {
 namespace hsa_utils {
 
 // The implicit arguments of COV5 AMDGPU kernels.
-struct AMDGPUImplicitArgsTy {
+struct alignas(alignof(void *)) AMDGPUImplicitArgsTy {
   uint32_t BlockCountX;
   uint32_t BlockCountY;
   uint32_t BlockCountZ;
@@ -60,6 +61,18 @@ inline Error readAMDGPUMetaDataFromImage(
   return Err;
 }
 
+/// Initializes the HSA implicit argument if the struct size permits it. This is
+/// necessary because optimizations can modify the size of the struct if
+/// portions of it are unused.
+template <typename MemberTy, typename T>
+void initImplArg(AMDGPUImplicitArgsTy *Base,
+                 MemberTy AMDGPUImplicitArgsTy::*Member, size_t AvailableSize,
+                 T Value) {
+  uint64_t Offset = utils::getPtrDiff(&(Base->*Member), Base);
+  if (Offset + sizeof(MemberTy) <= AvailableSize)
+    Base->*Member = static_cast<MemberTy>(Value);
+}
+
 } // namespace hsa_utils
 } // namespace plugin
 } // namespace target

llvmbot · 2025-05-30T18:50:05Z

@llvm/pr-subscribers-offload

Author: Joseph Huber (jhuber6)

Changes

Summary:
The size of the implicit argument struct can vary depending on
optimizations, it is not always the size as listed by the full struct.
Additionally, the implicit arguments are always aligned on a pointer
boundary. This patch updates the handling to use the correctly aligned
offset and only initialize the members if they are contained in the
reported size.

Additionally, we modify the alloc and free routines to allow
alloc(0) and free(nullptr) as these are mandated by the C standard
and allow us to easily handle cases where the user calls a kernel with
no arguments.

Full diff: https://github.com/llvm/llvm-project/pull/142199.diff

2 Files Affected:

(modified) offload/plugins-nextgen/amdgpu/src/rtl.cpp (+37-20)
(modified) offload/plugins-nextgen/amdgpu/utils/UtilitiesRTL.h (+14-1)

diff --git a/offload/plugins-nextgen/amdgpu/src/rtl.cpp b/offload/plugins-nextgen/amdgpu/src/rtl.cpp
index 2733796611d9b..88f3b1ac5416c 100644
--- a/offload/plugins-nextgen/amdgpu/src/rtl.cpp
+++ b/offload/plugins-nextgen/amdgpu/src/rtl.cpp
@@ -420,7 +420,7 @@ struct AMDGPUMemoryManagerTy : public DeviceAllocatorTy {
     assert(PtrStorage && "Invalid pointer storage");
 
     *PtrStorage = MemoryManager->allocate(Size, nullptr);
-    if (*PtrStorage == nullptr)
+    if (Size && *PtrStorage == nullptr)
       return Plugin::error(ErrorCode::OUT_OF_RESOURCES,
                            "failure to allocate from AMDGPU memory manager");
 
@@ -429,7 +429,8 @@ struct AMDGPUMemoryManagerTy : public DeviceAllocatorTy {
 
   /// Release an allocation to be reused.
   Error deallocate(void *Ptr) {
-    assert(Ptr && "Invalid pointer");
+    if (!Ptr)
+      return Plugin::success();
 
     if (MemoryManager->free(Ptr))
       return Plugin::error(ErrorCode::UNKNOWN,
@@ -3365,7 +3366,7 @@ Error AMDGPUKernelTy::launchImpl(GenericDeviceTy &GenericDevice,
                                  KernelLaunchParamsTy LaunchParams,
                                  AsyncInfoWrapperTy &AsyncInfoWrapper) const {
   if (ArgsSize != LaunchParams.Size &&
-      ArgsSize != LaunchParams.Size + getImplicitArgsSize())
+      ArgsSize > LaunchParams.Size + getImplicitArgsSize())
     return Plugin::error(ErrorCode::INVALID_ARGUMENT,
                          "mismatch of kernel arguments size");
 
@@ -3401,23 +3402,39 @@ Error AMDGPUKernelTy::launchImpl(GenericDeviceTy &GenericDevice,
   if (auto Err = AMDGPUDevice.getStream(AsyncInfoWrapper, Stream))
     return Err;
 
-  hsa_utils::AMDGPUImplicitArgsTy *ImplArgs = nullptr;
-  if (ArgsSize == LaunchParams.Size + getImplicitArgsSize()) {
-    ImplArgs = reinterpret_cast<hsa_utils::AMDGPUImplicitArgsTy *>(
-        utils::advancePtr(AllArgs, LaunchParams.Size));
-
-    // Set the COV5+ implicit arguments to the appropriate values.
-    std::memset(ImplArgs, 0, getImplicitArgsSize());
-    ImplArgs->BlockCountX = NumBlocks[0];
-    ImplArgs->BlockCountY = NumBlocks[1];
-    ImplArgs->BlockCountZ = NumBlocks[2];
-    ImplArgs->GroupSizeX = NumThreads[0];
-    ImplArgs->GroupSizeY = NumThreads[1];
-    ImplArgs->GroupSizeZ = NumThreads[2];
-    ImplArgs->GridDims = NumBlocks[2] * NumThreads[2] > 1
-                             ? 3
-                             : 1 + (NumBlocks[1] * NumThreads[1] != 1);
-    ImplArgs->DynamicLdsSize = KernelArgs.DynCGroupMem;
+  uint64_t ImplArgsOffset = utils::roundUp(
+      LaunchParams.Size, alignof(hsa_utils::AMDGPUImplicitArgsTy));
+  if (ArgsSize > ImplArgsOffset) {
+    hsa_utils::AMDGPUImplicitArgsTy *ImplArgs =
+        reinterpret_cast<hsa_utils::AMDGPUImplicitArgsTy *>(
+            utils::advancePtr(AllArgs, ImplArgsOffset));
+
+    // Set the COV5+ implicit arguments to the appropriate values if present.
+    uint64_t ImplArgsSize = ArgsSize - ImplArgsOffset;
+    std::memset(ImplArgs, 0, ImplArgsSize);
+
+    using ImplArgsTy = hsa_utils::AMDGPUImplicitArgsTy;
+    hsa_utils::initImplArg(ImplArgs, &ImplArgsTy::BlockCountX, ImplArgsSize,
+                           NumBlocks[0]);
+    hsa_utils::initImplArg(ImplArgs, &ImplArgsTy::BlockCountY, ImplArgsSize,
+                           NumBlocks[1]);
+    hsa_utils::initImplArg(ImplArgs, &ImplArgsTy::BlockCountZ, ImplArgsSize,
+                           NumBlocks[2]);
+
+    hsa_utils::initImplArg(ImplArgs, &ImplArgsTy::GroupSizeX, ImplArgsSize,
+                           NumThreads[0]);
+    hsa_utils::initImplArg(ImplArgs, &ImplArgsTy::GroupSizeY, ImplArgsSize,
+                           NumThreads[1]);
+    hsa_utils::initImplArg(ImplArgs, &ImplArgsTy::GroupSizeZ, ImplArgsSize,
+                           NumThreads[2]);
+
+    hsa_utils::initImplArg(ImplArgs, &ImplArgsTy::GridDims, ImplArgsSize,
+                           NumBlocks[2] * NumThreads[2] > 1
+                               ? 3
+                               : 1 + (NumBlocks[1] * NumThreads[1] != 1));
+
+    hsa_utils::initImplArg(ImplArgs, &ImplArgsTy::DynamicLdsSize, ImplArgsSize,
+                           KernelArgs.DynCGroupMem);
   }
 
   // Push the kernel launch into the stream.
diff --git a/offload/plugins-nextgen/amdgpu/utils/UtilitiesRTL.h b/offload/plugins-nextgen/amdgpu/utils/UtilitiesRTL.h
index 609ead942dbb3..77c756e006029 100644
--- a/offload/plugins-nextgen/amdgpu/utils/UtilitiesRTL.h
+++ b/offload/plugins-nextgen/amdgpu/utils/UtilitiesRTL.h
@@ -13,6 +13,7 @@
 #include <cstdint>
 
 #include "Shared/Debug.h"
+#include "Shared/Utils.h"
 #include "Utils/ELF.h"
 
 #include "omptarget.h"
@@ -26,7 +27,7 @@ namespace plugin {
 namespace hsa_utils {
 
 // The implicit arguments of COV5 AMDGPU kernels.
-struct AMDGPUImplicitArgsTy {
+struct alignas(alignof(void *)) AMDGPUImplicitArgsTy {
   uint32_t BlockCountX;
   uint32_t BlockCountY;
   uint32_t BlockCountZ;
@@ -60,6 +61,18 @@ inline Error readAMDGPUMetaDataFromImage(
   return Err;
 }
 
+/// Initializes the HSA implicit argument if the struct size permits it. This is
+/// necessary because optimizations can modify the size of the struct if
+/// portions of it are unused.
+template <typename MemberTy, typename T>
+void initImplArg(AMDGPUImplicitArgsTy *Base,
+                 MemberTy AMDGPUImplicitArgsTy::*Member, size_t AvailableSize,
+                 T Value) {
+  uint64_t Offset = utils::getPtrDiff(&(Base->*Member), Base);
+  if (Offset + sizeof(MemberTy) <= AvailableSize)
+    Base->*Member = static_cast<MemberTy>(Value);
+}
+
 } // namespace hsa_utils
 } // namespace plugin
 } // namespace target

Summary: The size of the implicit argument struct can vary depending on optimizations, it is not always the size as listed by the full struct. Additionally, the implicit arguments are always aligned on a pointer boundary. This patch updates the handling to use the correctly aligned offset and only initialize the members if they are contained in the reported size. Additionally, we modify the `alloc` and `free` routines to allow `alloc(0)` and `free(nullptr)` as these are mandated by the C standard and allow us to easily handle cases where the user calls a kernel with no arguments.

shiltian

If the implicit argument is not full set, how can we guarantee the order and existence of which arguments?

Additionally, we modify the alloc and free routines to allow
alloc(0) and free(nullptr) as these are mandated by the C standard
and allow us to easily handle cases where the user calls a kernel with
no arguments.

A separate PR pls.

offload/plugins-nextgen/amdgpu/src/rtl.cpp

JonChesterfield · 2025-06-02T16:02:55Z

Hold on, what "optimisation" is changing this struct? It's exposed in the ABI, in the literal sense, so LLVM really should not be hacking around with it.

jhuber6 · 2025-06-02T16:04:36Z

Hold on, what "optimisation" is changing this struct? It's exposed in the ABI, in the literal sense, so LLVM really should not be hacking around with it.

You can observe it w/ optimizations and checking the notes, @arsenm would probably know explicitly where it is. It's probably find as long as the runtime is aware of this since it reports the actual size of the struct which lets you know how much of the implicit args it's using.

…lvm#142199) Summary: The size of the implicit argument struct can vary depending on optimizations, it is not always the size as listed by the full struct. Additionally, the implicit arguments are always aligned on a pointer boundary. This patch updates the handling to use the correctly aligned offset and only initialize the members if they are contained in the reported size. Additionally, we modify the `alloc` and `free` routines to allow `alloc(0)` and `free(nullptr)` as these are mandated by the C standard and allow us to easily handle cases where the user calls a kernel with no arguments.

jhuber6 requested review from arsenm, jdoerfert, JonChesterfield, jplehr and shiltian May 30, 2025 18:49

llvmbot added backend:AMDGPU offload labels May 30, 2025

jhuber6 mentioned this pull request May 30, 2025

[Offload] Allow setting null arguments in olLaunchKernel #141958

Merged

jhuber6 force-pushed the ImplArguments branch from ff6f893 to 21f93c3 Compare June 2, 2025 11:35

shiltian reviewed Jun 2, 2025

View reviewed changes

offload/plugins-nextgen/amdgpu/src/rtl.cpp Outdated Show resolved Hide resolved

comments

6a54314

shiltian approved these changes Jun 2, 2025

View reviewed changes

jhuber6 merged commit 5b8031a into llvm:main Jun 2, 2025
9 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

[Offload][AMDGPU] Correctly handle variable implicit argument sizes #142199

[Offload][AMDGPU] Correctly handle variable implicit argument sizes #142199

Uh oh!

jhuber6 commented May 30, 2025

Uh oh!

llvmbot commented May 30, 2025

Uh oh!

llvmbot commented May 30, 2025

Uh oh!

shiltian left a comment •

edited

Loading

Uh oh!

Uh oh!

Uh oh!

JonChesterfield commented Jun 2, 2025

Uh oh!

jhuber6 commented Jun 2, 2025

Uh oh!

Uh oh!

[Offload][AMDGPU] Correctly handle variable implicit argument sizes #142199

[Offload][AMDGPU] Correctly handle variable implicit argument sizes #142199

Uh oh!

Conversation

jhuber6 commented May 30, 2025

Uh oh!

llvmbot commented May 30, 2025

Uh oh!

llvmbot commented May 30, 2025

Uh oh!

shiltian left a comment • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

JonChesterfield commented Jun 2, 2025

Uh oh!

jhuber6 commented Jun 2, 2025

Uh oh!

Uh oh!

shiltian left a comment •

edited

Loading