
[Offload][OMPX] Add the runtime support for multi-dim grid and block #118042

Merged: 1 commit into main from users/shiltian/multi-dim-grid-runtime on Dec 6, 2024

Conversation

shiltian (Contributor)
No description provided.

@shiltian shiltian requested review from jhuber6, jdoerfert and saiislam and removed request for jhuber6 and jdoerfert November 29, 2024 02:08
shiltian (Contributor, Author): This stack of pull requests is managed by Graphite.

@shiltian shiltian requested a review from jdoerfert November 29, 2024 02:08
@shiltian shiltian added the openmp:libomptarget OpenMP offload runtime label Nov 29, 2024
@shiltian shiltian requested a review from jplehr November 29, 2024 02:08
Review comment on offload/CMakeLists.txt (outdated, resolved).
@shiltian shiltian force-pushed the users/shiltian/multi-dim-grid-runtime branch from 91376a4 to 6e81e16 on November 29, 2024 02:09

github-actions bot commented Nov 29, 2024

⚠️ C/C++ code formatter, clang-format found issues in your code. ⚠️

You can test this locally with the following command:
git-clang-format --diff 68e75eebec4cf5fc7eef7d9525b276c4ff5e1b17 01139c5ee50e2598a410cea8510978b2e843c951 --extensions c,h,cpp -- offload/test/offloading/ompx_bare_multi_dim.cpp offload/plugins-nextgen/amdgpu/src/rtl.cpp offload/plugins-nextgen/common/include/PluginInterface.h offload/plugins-nextgen/common/src/PluginInterface.cpp offload/plugins-nextgen/cuda/src/rtl.cpp offload/plugins-nextgen/host/src/rtl.cpp offload/src/interface.cpp offload/src/omptarget.cpp offload/test/api/omp_env_vars.c offload/test/offloading/info.c offload/test/offloading/ompx_bare.c offload/test/offloading/small_trip_count.c offload/test/offloading/small_trip_count_thread_limit.cpp
The diff from clang-format:
diff --git a/offload/test/api/omp_env_vars.c b/offload/test/api/omp_env_vars.c
index a8a574be8c..8b51d46db7 100644
--- a/offload/test/api/omp_env_vars.c
+++ b/offload/test/api/omp_env_vars.c
@@ -6,7 +6,8 @@
 #define N 256
 
 int main() {
-  // CHECK: Launching kernel [[KERNEL:.+_main_.+]] with [1,1,1] blocks and [1,1,1] threads
+  // CHECK: Launching kernel [[KERNEL:.+_main_.+]] with [1,1,1] blocks and
+  // [1,1,1] threads
 #pragma omp target teams
 #pragma omp parallel
   {}
diff --git a/offload/test/offloading/info.c b/offload/test/offloading/info.c
index d86644b871..c3dec8a140 100644
--- a/offload/test/offloading/info.c
+++ b/offload/test/offloading/info.c
@@ -25,7 +25,7 @@ int main() {
   int C[N];
   int val = 1;
 
-// clang-format off
+  // clang-format off
 // INFO: info: Entering OpenMP data region with being_mapper at info.c:{{[0-9]+}}:{{[0-9]+}} with 3 arguments:
 // INFO: info: alloc(A[0:64])[256]
 // INFO: info: tofrom(B[0:64])[256]
@@ -60,7 +60,7 @@ int main() {
 // INFO: info: OpenMP Host-Device pointer mappings after block at info.c:[[#%u,]]:[[#%u,]]:
 // INFO: info: Host Ptr  Target Ptr Size (B) DynRefCount HoldRefCount Declaration
 // INFO: info: [[#%#x,]] [[#%#x,]]  4        INF         0            global at unknown:0:0
-// clang-format on
+  // clang-format on
 #pragma omp target data map(alloc : A[0 : N])                                  \
     map(ompx_hold, tofrom : B[0 : N]) map(to : C[0 : N])
 #pragma omp target firstprivate(val)
diff --git a/offload/test/offloading/ompx_bare.c b/offload/test/offloading/ompx_bare.c
index 6a6ada9617..f3b8042749 100644
--- a/offload/test/offloading/ompx_bare.c
+++ b/offload/test/offloading/ompx_bare.c
@@ -15,7 +15,9 @@ int main(int argc, char *argv[]) {
   const int N = num_blocks * block_size;
   int *data = (int *)malloc(N * sizeof(int));
 
-  // CHECK: "PluginInterface" device 0 info: Launching kernel __omp_offloading_{{.*}} with [64,1,1] blocks and [64,1,1] threads in SPMD mode
+  // CHECK: "PluginInterface" device 0 info: Launching kernel
+  // __omp_offloading_{{.*}} with [64,1,1] blocks and [64,1,1] threads in SPMD
+  // mode
 
 #pragma omp target teams ompx_bare num_teams(num_blocks) thread_limit(block_size) map(from: data[0:N])
   {
diff --git a/offload/test/offloading/ompx_bare_multi_dim.cpp b/offload/test/offloading/ompx_bare_multi_dim.cpp
index d37278525f..38c379064b 100644
--- a/offload/test/offloading/ompx_bare_multi_dim.cpp
+++ b/offload/test/offloading/ompx_bare_multi_dim.cpp
@@ -1,13 +1,14 @@
 // RUN: %libomptarget-compilexx-generic
-// RUN: env LIBOMPTARGET_INFO=63 %libomptarget-run-generic 2>&1 | %fcheck-generic
-// REQUIRES: gpu
+// RUN: env LIBOMPTARGET_INFO=63 %libomptarget-run-generic 2>&1 |
+// %fcheck-generic REQUIRES: gpu
 
 #include <ompx.h>
 
 #include <cassert>
 #include <vector>
 
-// CHECK: "PluginInterface" device 0 info: Launching kernel __omp_offloading_{{.*}} with [2,4,6] blocks and [32,4,2] threads in SPMD mode
+// CHECK: "PluginInterface" device 0 info: Launching kernel
+// __omp_offloading_{{.*}} with [2,4,6] blocks and [32,4,2] threads in SPMD mode
 
 int main(int argc, char *argv[]) {
   int bs[3] = {32u, 4u, 2u};

@shiltian shiltian force-pushed the users/shiltian/multi-dim-grid-runtime branch from 6e81e16 to 008c218 on November 29, 2024 02:20
jplehr (Contributor) commented Nov 29, 2024

Out of curiosity: Have you looked at how this applies downstream?

shiltian (Contributor, Author) replied:
> Out of curiosity: Have you looked at how this applies downstream?

No, I didn't.

@shiltian shiltian force-pushed the users/shiltian/multi-dim-grid-runtime branch from 008c218 to ab2eb63 on November 30, 2024 01:28
@shiltian shiltian marked this pull request as ready for review November 30, 2024 01:29
llvmbot (Member) commented Nov 30, 2024

@llvm/pr-subscribers-offload

@llvm/pr-subscribers-backend-amdgpu

Author: Shilei Tian (shiltian)

Changes

Patch is 23.81 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/118042.diff

8 Files Affected:

  • (modified) offload/plugins-nextgen/amdgpu/src/rtl.cpp (+38-31)
  • (modified) offload/plugins-nextgen/common/include/PluginInterface.h (+12-11)
  • (modified) offload/plugins-nextgen/common/src/PluginInterface.cpp (+38-30)
  • (modified) offload/plugins-nextgen/cuda/src/rtl.cpp (+9-10)
  • (modified) offload/plugins-nextgen/host/src/rtl.cpp (+2-2)
  • (modified) offload/src/interface.cpp (+4-10)
  • (modified) offload/src/omptarget.cpp (-2)
  • (added) offload/test/offloading/ompx_bare_multi_dim.cpp (+54)
diff --git a/offload/plugins-nextgen/amdgpu/src/rtl.cpp b/offload/plugins-nextgen/amdgpu/src/rtl.cpp
index 6356fa0554a9c1..2920435cd8b5c8 100644
--- a/offload/plugins-nextgen/amdgpu/src/rtl.cpp
+++ b/offload/plugins-nextgen/amdgpu/src/rtl.cpp
@@ -559,15 +559,15 @@ struct AMDGPUKernelTy : public GenericKernelTy {
   }
 
   /// Launch the AMDGPU kernel function.
-  Error launchImpl(GenericDeviceTy &GenericDevice, uint32_t NumThreads,
-                   uint64_t NumBlocks, KernelArgsTy &KernelArgs,
+  Error launchImpl(GenericDeviceTy &GenericDevice, uint32_t NumThreads[3],
+                   uint32_t NumBlocks[3], KernelArgsTy &KernelArgs,
                    KernelLaunchParamsTy LaunchParams,
                    AsyncInfoWrapperTy &AsyncInfoWrapper) const override;
 
   /// Print more elaborate kernel launch info for AMDGPU
   Error printLaunchInfoDetails(GenericDeviceTy &GenericDevice,
-                               KernelArgsTy &KernelArgs, uint32_t NumThreads,
-                               uint64_t NumBlocks) const override;
+                               KernelArgsTy &KernelArgs, uint32_t NumThreads[3],
+                               uint32_t NumBlocks[3]) const override;
 
   /// Get group and private segment kernel size.
   uint32_t getGroupSize() const { return GroupSize; }
@@ -719,7 +719,7 @@ struct AMDGPUQueueTy {
   /// Push a kernel launch to the queue. The kernel launch requires an output
   /// signal and can define an optional input signal (nullptr if none).
   Error pushKernelLaunch(const AMDGPUKernelTy &Kernel, void *KernelArgs,
-                         uint32_t NumThreads, uint64_t NumBlocks,
+                         uint32_t NumThreads[3], uint32_t NumBlocks[3],
                          uint32_t GroupSize, uint64_t StackSize,
                          AMDGPUSignalTy *OutputSignal,
                          AMDGPUSignalTy *InputSignal) {
@@ -746,14 +746,18 @@ struct AMDGPUQueueTy {
     assert(Packet && "Invalid packet");
 
     // The first 32 bits of the packet are written after the other fields
-    uint16_t Setup = UINT16_C(1) << HSA_KERNEL_DISPATCH_PACKET_SETUP_DIMENSIONS;
-    Packet->workgroup_size_x = NumThreads;
-    Packet->workgroup_size_y = 1;
-    Packet->workgroup_size_z = 1;
+    uint16_t Dims = NumBlocks[2] * NumThreads[2] > 1
+                        ? 3
+                        : 1 + (NumBlocks[1] * NumThreads[1] != 1);
+    uint16_t Setup = UINT16_C(Dims)
+                     << HSA_KERNEL_DISPATCH_PACKET_SETUP_DIMENSIONS;
+    Packet->workgroup_size_x = NumThreads[0];
+    Packet->workgroup_size_y = NumThreads[1];
+    Packet->workgroup_size_z = NumThreads[2];
     Packet->reserved0 = 0;
-    Packet->grid_size_x = NumBlocks * NumThreads;
-    Packet->grid_size_y = 1;
-    Packet->grid_size_z = 1;
+    Packet->grid_size_x = NumBlocks[0] * NumThreads[0];
+    Packet->grid_size_y = NumBlocks[1] * NumThreads[1];
+    Packet->grid_size_z = NumBlocks[2] * NumThreads[2];
     Packet->private_segment_size =
         Kernel.usesDynamicStack() ? StackSize : Kernel.getPrivateSize();
     Packet->group_segment_size = GroupSize;
@@ -1240,7 +1244,7 @@ struct AMDGPUStreamTy {
   /// the kernel finalizes. Once the kernel is finished, the stream will release
   /// the kernel args buffer to the specified memory manager.
   Error pushKernelLaunch(const AMDGPUKernelTy &Kernel, void *KernelArgs,
-                         uint32_t NumThreads, uint64_t NumBlocks,
+                         uint32_t NumThreads[3], uint32_t NumBlocks[3],
                          uint32_t GroupSize, uint64_t StackSize,
                          AMDGPUMemoryManagerTy &MemoryManager) {
     if (Queue == nullptr)
@@ -2829,10 +2833,10 @@ struct AMDGPUDeviceTy : public GenericDeviceTy, AMDGenericDeviceTy {
     AsyncInfoWrapperTy AsyncInfoWrapper(*this, nullptr);
 
     KernelArgsTy KernelArgs = {};
-    if (auto Err =
-            AMDGPUKernel.launchImpl(*this, /*NumThread=*/1u,
-                                    /*NumBlocks=*/1ul, KernelArgs,
-                                    KernelLaunchParamsTy{}, AsyncInfoWrapper))
+    uint32_t NumBlocksAndThreads[3] = {1u, 1u, 1u};
+    if (auto Err = AMDGPUKernel.launchImpl(
+            *this, NumBlocksAndThreads, NumBlocksAndThreads, KernelArgs,
+            KernelLaunchParamsTy{}, AsyncInfoWrapper))
       return Err;
 
     Error Err = Plugin::success();
@@ -3330,7 +3334,7 @@ struct AMDGPUPluginTy final : public GenericPluginTy {
 };
 
 Error AMDGPUKernelTy::launchImpl(GenericDeviceTy &GenericDevice,
-                                 uint32_t NumThreads, uint64_t NumBlocks,
+                                 uint32_t NumThreads[3], uint32_t NumBlocks[3],
                                  KernelArgsTy &KernelArgs,
                                  KernelLaunchParamsTy LaunchParams,
                                  AsyncInfoWrapperTy &AsyncInfoWrapper) const {
@@ -3387,13 +3391,15 @@ Error AMDGPUKernelTy::launchImpl(GenericDeviceTy &GenericDevice,
   // Only COV5 implicitargs needs to be set. COV4 implicitargs are not used.
   if (ImplArgs &&
       getImplicitArgsSize() == sizeof(hsa_utils::AMDGPUImplicitArgsTy)) {
-    ImplArgs->BlockCountX = NumBlocks;
-    ImplArgs->BlockCountY = 1;
-    ImplArgs->BlockCountZ = 1;
-    ImplArgs->GroupSizeX = NumThreads;
-    ImplArgs->GroupSizeY = 1;
-    ImplArgs->GroupSizeZ = 1;
-    ImplArgs->GridDims = 1;
+    ImplArgs->BlockCountX = NumBlocks[0];
+    ImplArgs->BlockCountY = NumBlocks[1];
+    ImplArgs->BlockCountZ = NumBlocks[2];
+    ImplArgs->GroupSizeX = NumThreads[0];
+    ImplArgs->GroupSizeY = NumThreads[1];
+    ImplArgs->GroupSizeZ = NumThreads[2];
+    ImplArgs->GridDims = NumBlocks[2] * NumThreads[2] > 1
+                             ? 3
+                             : 1 + (NumBlocks[1] * NumThreads[1] != 1);
     ImplArgs->DynamicLdsSize = KernelArgs.DynCGroupMem;
   }
 
@@ -3404,8 +3410,8 @@ Error AMDGPUKernelTy::launchImpl(GenericDeviceTy &GenericDevice,
 
 Error AMDGPUKernelTy::printLaunchInfoDetails(GenericDeviceTy &GenericDevice,
                                              KernelArgsTy &KernelArgs,
-                                             uint32_t NumThreads,
-                                             uint64_t NumBlocks) const {
+                                             uint32_t NumThreads[3],
+                                             uint32_t NumBlocks[3]) const {
   // Only do all this when the output is requested
   if (!(getInfoLevel() & OMP_INFOTYPE_PLUGIN_KERNEL))
     return Plugin::success();
@@ -3442,12 +3448,13 @@ Error AMDGPUKernelTy::printLaunchInfoDetails(GenericDeviceTy &GenericDevice,
   // S/VGPR Spill Count: how many S/VGPRs are spilled by the kernel
   // Tripcount: loop tripcount for the kernel
   INFO(OMP_INFOTYPE_PLUGIN_KERNEL, GenericDevice.getDeviceId(),
-       "#Args: %d Teams x Thrds: %4lux%4u (MaxFlatWorkGroupSize: %u) LDS "
+       "#Args: %d Teams x Thrds: %4ux%4u (MaxFlatWorkGroupSize: %u) LDS "
        "Usage: %uB #SGPRs/VGPRs: %u/%u #SGPR/VGPR Spills: %u/%u Tripcount: "
        "%lu\n",
-       ArgNum, NumGroups, ThreadsPerGroup, MaxFlatWorkgroupSize,
-       GroupSegmentSize, SGPRCount, VGPRCount, SGPRSpillCount, VGPRSpillCount,
-       LoopTripCount);
+       ArgNum, NumGroups[0] * NumGroups[1] * NumGroups[2],
+       ThreadsPerGroup[0] * ThreadsPerGroup[1] * ThreadsPerGroup[2],
+       MaxFlatWorkgroupSize, GroupSegmentSize, SGPRCount, VGPRCount,
+       SGPRSpillCount, VGPRSpillCount, LoopTripCount);
 
   return Plugin::success();
 }
diff --git a/offload/plugins-nextgen/common/include/PluginInterface.h b/offload/plugins-nextgen/common/include/PluginInterface.h
index 41cc0f286a581f..be3467c3f7098f 100644
--- a/offload/plugins-nextgen/common/include/PluginInterface.h
+++ b/offload/plugins-nextgen/common/include/PluginInterface.h
@@ -265,8 +265,9 @@ struct GenericKernelTy {
   Error launch(GenericDeviceTy &GenericDevice, void **ArgPtrs,
                ptrdiff_t *ArgOffsets, KernelArgsTy &KernelArgs,
                AsyncInfoWrapperTy &AsyncInfoWrapper) const;
-  virtual Error launchImpl(GenericDeviceTy &GenericDevice, uint32_t NumThreads,
-                           uint64_t NumBlocks, KernelArgsTy &KernelArgs,
+  virtual Error launchImpl(GenericDeviceTy &GenericDevice,
+                           uint32_t NumThreads[3], uint32_t NumBlocks[3],
+                           KernelArgsTy &KernelArgs,
                            KernelLaunchParamsTy LaunchParams,
                            AsyncInfoWrapperTy &AsyncInfoWrapper) const = 0;
 
@@ -316,15 +317,15 @@ struct GenericKernelTy {
 
   /// Prints generic kernel launch information.
   Error printLaunchInfo(GenericDeviceTy &GenericDevice,
-                        KernelArgsTy &KernelArgs, uint32_t NumThreads,
-                        uint64_t NumBlocks) const;
+                        KernelArgsTy &KernelArgs, uint32_t NumThreads[3],
+                        uint32_t NumBlocks[3]) const;
 
   /// Prints plugin-specific kernel launch information after generic kernel
   /// launch information
   virtual Error printLaunchInfoDetails(GenericDeviceTy &GenericDevice,
                                        KernelArgsTy &KernelArgs,
-                                       uint32_t NumThreads,
-                                       uint64_t NumBlocks) const;
+                                       uint32_t NumThreads[3],
+                                       uint32_t NumBlocks[3]) const;
 
 private:
   /// Prepare the arguments before launching the kernel.
@@ -337,15 +338,15 @@ struct GenericKernelTy {
 
   /// Get the number of threads and blocks for the kernel based on the
   /// user-defined threads and block clauses.
-  uint32_t getNumThreads(GenericDeviceTy &GenericDevice,
-                         uint32_t ThreadLimitClause[3]) const;
+  void getNumThreads(GenericDeviceTy &GenericDevice,
+                     uint32_t ThreadLimitClause[3]) const;
 
   /// The number of threads \p NumThreads can be adjusted by this method.
   /// \p IsNumThreadsFromUser is true is \p NumThreads is defined by user via
   /// thread_limit clause.
-  uint64_t getNumBlocks(GenericDeviceTy &GenericDevice,
-                        uint32_t BlockLimitClause[3], uint64_t LoopTripCount,
-                        uint32_t &NumThreads, bool IsNumThreadsFromUser) const;
+  void getNumBlocks(GenericDeviceTy &GenericDevice,
+                    uint32_t BlockLimitClause[3], uint64_t LoopTripCount,
+                    uint32_t &NumThreads, bool IsNumThreadsFromUser) const;
 
   /// Indicate if the kernel works in Generic SPMD, Generic or SPMD mode.
   bool isGenericSPMDMode() const {
diff --git a/offload/plugins-nextgen/common/src/PluginInterface.cpp b/offload/plugins-nextgen/common/src/PluginInterface.cpp
index 25b815b7f96694..1abc521f47ef33 100644
--- a/offload/plugins-nextgen/common/src/PluginInterface.cpp
+++ b/offload/plugins-nextgen/common/src/PluginInterface.cpp
@@ -526,20 +526,21 @@ GenericKernelTy::getKernelLaunchEnvironment(
 
 Error GenericKernelTy::printLaunchInfo(GenericDeviceTy &GenericDevice,
                                        KernelArgsTy &KernelArgs,
-                                       uint32_t NumThreads,
-                                       uint64_t NumBlocks) const {
+                                       uint32_t NumThreads[3],
+                                       uint32_t NumBlocks[3]) const {
   INFO(OMP_INFOTYPE_PLUGIN_KERNEL, GenericDevice.getDeviceId(),
-       "Launching kernel %s with %" PRIu64
-       " blocks and %d threads in %s mode\n",
-       getName(), NumBlocks, NumThreads, getExecutionModeName());
+       "Launching kernel %s with [%u,%u,%u] blocks and [%u,%u,%u] threads in "
+       "%s mode\n",
+       getName(), NumBlocks[0], NumBlocks[1], NumBlocks[2], NumThreads[0],
+       NumThreads[1], NumThreads[2], getExecutionModeName());
   return printLaunchInfoDetails(GenericDevice, KernelArgs, NumThreads,
                                 NumBlocks);
 }
 
 Error GenericKernelTy::printLaunchInfoDetails(GenericDeviceTy &GenericDevice,
                                               KernelArgsTy &KernelArgs,
-                                              uint32_t NumThreads,
-                                              uint64_t NumBlocks) const {
+                                              uint32_t NumThreads[3],
+                                              uint32_t NumBlocks[3]) const {
   return Plugin::success();
 }
 
@@ -566,10 +567,14 @@ Error GenericKernelTy::launch(GenericDeviceTy &GenericDevice, void **ArgPtrs,
                     Args, Ptrs, *KernelLaunchEnvOrErr);
   }
 
-  uint32_t NumThreads = getNumThreads(GenericDevice, KernelArgs.ThreadLimit);
-  uint64_t NumBlocks =
-      getNumBlocks(GenericDevice, KernelArgs.NumTeams, KernelArgs.Tripcount,
-                   NumThreads, KernelArgs.ThreadLimit[0] > 0);
+  uint32_t NumThreads[3] = {KernelArgs.ThreadLimit[0],
+                            KernelArgs.ThreadLimit[1],
+                            KernelArgs.ThreadLimit[2]};
+  uint32_t NumBlocks[3] = {KernelArgs.NumTeams[0], KernelArgs.NumTeams[1],
+                           KernelArgs.NumTeams[2]};
+  getNumThreads(GenericDevice, NumThreads);
+  getNumBlocks(GenericDevice, NumBlocks, KernelArgs.Tripcount, NumThreads[0],
+               NumThreads[0] > 0);
 
   // Record the kernel description after we modified the argument count and num
   // blocks/threads.
@@ -578,7 +583,8 @@ Error GenericKernelTy::launch(GenericDeviceTy &GenericDevice, void **ArgPtrs,
     RecordReplay.saveImage(getName(), getImage());
     RecordReplay.saveKernelInput(getName(), getImage());
     RecordReplay.saveKernelDescr(getName(), LaunchParams, KernelArgs.NumArgs,
-                                 NumBlocks, NumThreads, KernelArgs.Tripcount);
+                                 NumBlocks[0], NumThreads[0],
+                                 KernelArgs.Tripcount);
   }
 
   if (auto Err =
@@ -616,38 +622,39 @@ KernelLaunchParamsTy GenericKernelTy::prepareArgs(
   return KernelLaunchParamsTy{sizeof(void *) * NumArgs, &Args[0], &Ptrs[0]};
 }
 
-uint32_t GenericKernelTy::getNumThreads(GenericDeviceTy &GenericDevice,
-                                        uint32_t ThreadLimitClause[3]) const {
+void GenericKernelTy::getNumThreads(GenericDeviceTy &GenericDevice,
+                                    uint32_t ThreadLimitClause[3]) const {
+  if (IsBareKernel)
+    return;
+
   assert(ThreadLimitClause[1] == 0 && ThreadLimitClause[2] == 0 &&
          "Multi dimensional launch not supported yet.");
 
-  if (IsBareKernel && ThreadLimitClause[0] > 0)
-    return ThreadLimitClause[0];
-
   if (ThreadLimitClause[0] > 0 && isGenericMode())
     ThreadLimitClause[0] += GenericDevice.getWarpSize();
 
-  return std::min(MaxNumThreads, (ThreadLimitClause[0] > 0)
-                                     ? ThreadLimitClause[0]
-                                     : PreferredNumThreads);
+  ThreadLimitClause[0] =
+      std::min(MaxNumThreads, (ThreadLimitClause[0] > 0) ? ThreadLimitClause[0]
+                                                         : PreferredNumThreads);
 }
 
-uint64_t GenericKernelTy::getNumBlocks(GenericDeviceTy &GenericDevice,
-                                       uint32_t NumTeamsClause[3],
-                                       uint64_t LoopTripCount,
-                                       uint32_t &NumThreads,
-                                       bool IsNumThreadsFromUser) const {
+void GenericKernelTy::getNumBlocks(GenericDeviceTy &GenericDevice,
+                                   uint32_t NumTeamsClause[3],
+                                   uint64_t LoopTripCount, uint32_t &NumThreads,
+                                   bool IsNumThreadsFromUser) const {
+  if (IsBareKernel)
+    return;
+
   assert(NumTeamsClause[1] == 0 && NumTeamsClause[2] == 0 &&
          "Multi dimensional launch not supported yet.");
 
-  if (IsBareKernel && NumTeamsClause[0] > 0)
-    return NumTeamsClause[0];
-
   if (NumTeamsClause[0] > 0) {
     // TODO: We need to honor any value and consequently allow more than the
     // block limit. For this we might need to start multiple kernels or let the
     // blocks start again until the requested number has been started.
-    return std::min(NumTeamsClause[0], GenericDevice.getBlockLimit());
+    NumTeamsClause[0] =
+        std::min(NumTeamsClause[0], GenericDevice.getBlockLimit());
+    return;
   }
 
   uint64_t DefaultNumBlocks = GenericDevice.getDefaultNumBlocks();
@@ -719,7 +726,8 @@ uint64_t GenericKernelTy::getNumBlocks(GenericDeviceTy &GenericDevice,
   // If the loops are long running we rather reuse blocks than spawn too many.
   if (GenericDevice.getReuseBlocksForHighTripCount())
     PreferredNumBlocks = std::min(TripCountNumBlocks, DefaultNumBlocks);
-  return std::min(PreferredNumBlocks, GenericDevice.getBlockLimit());
+  NumTeamsClause[0] =
+      std::min(PreferredNumBlocks, GenericDevice.getBlockLimit());
 }
 
 GenericDeviceTy::GenericDeviceTy(GenericPluginTy &Plugin, int32_t DeviceId,
diff --git a/offload/plugins-nextgen/cuda/src/rtl.cpp b/offload/plugins-nextgen/cuda/src/rtl.cpp
index 015c7775ba3513..1e3d79c9554c67 100644
--- a/offload/plugins-nextgen/cuda/src/rtl.cpp
+++ b/offload/plugins-nextgen/cuda/src/rtl.cpp
@@ -149,8 +149,8 @@ struct CUDAKernelTy : public GenericKernelTy {
   }
 
   /// Launch the CUDA kernel function.
-  Error launchImpl(GenericDeviceTy &GenericDevice, uint32_t NumThreads,
-                   uint64_t NumBlocks, KernelArgsTy &KernelArgs,
+  Error launchImpl(GenericDeviceTy &GenericDevice, uint32_t NumThreads[3],
+                   uint32_t NumBlocks[3], KernelArgsTy &KernelArgs,
                    KernelLaunchParamsTy LaunchParams,
                    AsyncInfoWrapperTy &AsyncInfoWrapper) const override;
 
@@ -1230,10 +1230,10 @@ struct CUDADeviceTy : public GenericDeviceTy {
     AsyncInfoWrapperTy AsyncInfoWrapper(*this, nullptr);
 
     KernelArgsTy KernelArgs = {};
-    if (auto Err =
-            CUDAKernel.launchImpl(*this, /*NumThread=*/1u,
-                                  /*NumBlocks=*/1ul, KernelArgs,
-                                  KernelLaunchParamsTy{}, AsyncInfoWrapper))
+    uint32_t NumBlocksAndThreads[3] = {1u, 1u, 1u};
+    if (auto Err = CUDAKernel.launchImpl(
+            *this, NumBlocksAndThreads, NumBlocksAndThreads, KernelArgs,
+            KernelLaunchParamsTy{}, AsyncInfoWrapper))
       return Err;
 
     Error Err = Plugin::success();
@@ -1276,7 +1276,7 @@ struct CUDADeviceTy : public GenericDeviceTy {
 };
 
 Error CUDAKernelTy::launchImpl(GenericDeviceTy &GenericDevice,
-                               uint32_t NumThreads, uint64_t NumBlocks,
+                               uint32_t NumThreads[3], uint32_t NumBlocks[3],
                                KernelArgsTy &KernelArgs,
                                KernelLaunchParamsTy LaunchParams,
                                AsyncInfoWrapperTy &AsyncInfoWrapper) const {
@@ -1294,9 +1294,8 @@ Error CUDAKernelTy::launchImpl(GenericDeviceTy &GenericDevice,
                     reinterpret_cast<void *>(&LaunchParams.Size),
                     CU_LAUNCH_PARAM_END};
 
-  CUresult Res = cuLaunchKernel(Func, NumBlocks, /*gridDimY=*/1,
-                                /*gridDimZ=*/1, NumThreads,
-                                /*blockDimY=*/1, /*blockDimZ=*/1,
+  CUresult Res = cuLaunchKernel(Func, NumBlocks[0], NumBlocks[1], NumBlocks[2],
+                                NumThreads[0], NumThreads[1], NumThreads[2],
                                 MaxDynCGroupMem, Stream, nullptr, Config);
   return Plugin::check(Res, "Error in cuLaunchKernel for '%s': %s", getName());
 }
diff --git a/offload/plugins-nextgen/host/src/rtl.cpp b/offload/plugins-nextgen/host/src/rtl.cpp
index 6f2e3d8604ec82..915c41e88c5828 100644
--- a/offload/plugins-nextgen/host/src/rtl.cpp
+++ b/offload/plugins-nextgen/host/src/rtl.cpp
@@ -89,8 +89,8 @@ struct GenELF64KernelTy : public GenericKernelTy {
   }
 
   /// Launch the kernel using the libffi.
-  Error launchImpl(GenericDeviceTy &GenericDevice, uint32_t NumThreads,
-                   uint64_t NumBlocks, KernelArgsTy &KernelArgs,
+  Error launchImpl(GenericDeviceTy &GenericDevice, uint32_t NumThreads[3],
+                   uint32_t NumBlocks[3], KernelArgsTy &KernelArgs,
                    KernelLaunchParamsTy LaunchParams,
                    AsyncInfoWrapperTy &AsyncInfoWrapper)...
[truncated]
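For orientation, here is a minimal sketch of the kind of launch the newly added ompx_bare_multi_dim.cpp test exercises, matching the [2,4,6] blocks and [32,4,2] threads in its CHECK line. Since the test body is truncated above, the multi-value num_teams/thread_limit spelling and the accessor names are assumptions taken from ompx.h, not a verbatim copy of the test.

#include <ompx.h>

int main() {
  // Launch a 3-D grid of [2,4,6] blocks with [32,4,2] threads per block in
  // bare (SPMD) mode; the three-value clause spelling is assumed here.
#pragma omp target teams ompx_bare num_teams(2, 4, 6) thread_limit(32, 4, 2)
  {
    // ompx.h exposes per-dimension IDs; exact names assumed from the header.
    int TId = ompx::thread_id_x();
    int BId = ompx::block_id_z();
    (void)TId;
    (void)BId;
  }
  return 0;
}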

@shiltian shiltian force-pushed the users/shiltian/multi-dim-grid-runtime branch 2 times, most recently from cca25b5 to f6b6ea5 on December 2, 2024 02:02
Comment on lines 575 to 577
// FIXME: This is a WA to "calibrate" the bad work done in the front end.
// Delete this ugly code after the front end emits proper values.
auto CorrectMultiDim = [](uint32_t (&Val)[3]) {
A reviewer (Contributor) commented:

Just directly fix this? Where is this coming from?

shiltian (Contributor, Author) replied:

This is from the front end. In regular OpenMP, 0 means the user didn't specify a value, so the runtime needs to choose one. The multi-dim kernel launch is purely an extension and hasn't been standardized yet. I'll do a follow-up patch that emits [0,1,1] for regular OpenMP when nothing is specified and [x,y,z] for OMPX.
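A minimal sketch of what that calibration workaround might look like, assuming it simply maps the front end's 0 placeholders in the upper dimensions back to 1. Only the lambda header appears in the quoted snippet above, so the body and the call sites are assumptions:

// FIXME: This is a WA to "calibrate" the bad work done in the front end.
// Delete this ugly code after the front end emits proper values.
auto CorrectMultiDim = [](uint32_t (&Val)[3]) {
  // 0 means "unspecified"; dimensions 1 and 2 must be at least 1 before the
  // values reach the plugin's launchImpl.
  if (Val[1] == 0)
    Val[1] = 1;
  if (Val[2] == 0)
    Val[2] = 1;
};
CorrectMultiDim(NumThreads); // assumed call sites on the launch arrays
CorrectMultiDim(NumBlocks);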

jplehr (Contributor) commented Dec 2, 2024

I ran this through our buildbot config and the new tests do not seem to pass. The 0-th dimension appears to be the culprit. Can you double-check that it works?
If you need any help or info, let me know.

# .---command stderr------------
# | llvm-project/offload/test/api/omp_env_vars.c:8:12: error: CHECK: expected string not found in input                                                                                                       
# |  // CHECK: Launching kernel [[KERNEL:.+_main_.+]] with [1,1,1] blocks and [1,1,1] threads
# |            ^
# | <stdin>:1:1: note: scanning from here
# | "PluginInterface" device 0 info: Launching kernel __omp_offloading_fd00_61c4b8b_main_l9 with [0,1,1] blocks and [0,1,1] threads in Generic mode                                                                                     
# | ^
# | <stdin>:1:55: note: possible intended match here
# | "PluginInterface" device 0 info: Launching kernel __omp_offloading_fd00_61c4b8b_main_l9 with [0,1,1] blocks and [0,1,1] threads in Generic mode                                                                                     
# |                                                       ^
# |
# | Input file: <stdin>
# | Check file: llvm-project/offload/test/api/omp_env_vars.c
# |
# | -dump-input=help explains the following input dump.
# |
# | Input was:
# | <<<<<<
# |            1: "PluginInterface" device 0 info: Launching kernel __omp_offloading_fd00_61c4b8b_main_l9 with [0,1,1] blocks and [0,1,1] threads in Generic mode                                                                       
# | check:8'0     X~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ error: no match found                                                
# | check:8'1                                                           ?                                                                                          possible intended match                                              
# | >>>>>>
# `-----------------------------
# error: command failed with exit status: 1

shiltian (Contributor, Author) commented Dec 2, 2024

Hmm, something is wrong in the AMD part. I tested it on a local NVIDIA GPU and it works. Will take a look.

shiltian (Contributor, Author) commented Dec 2, 2024

@jplehr I tested it but couldn't reproduce the failure locally.

➜  clang -fopenmp --offload-arch=gfx1030 llvm-project/offload/test/api/omp_env_vars.c -o main
➜  env OMP_NUM_TEAMS=1 OMP_TEAMS_THREAD_LIMIT=1 LIBOMPTARGET_INFO=16 ./main
"PluginInterface" device 0 info: Launching kernel __omp_offloading_10302_8ee073e_main_l9 with [1,1,1] blocks and [1,1,1] threads in SPMD mode
AMDGPU device 0 info: #Args: 1 Teams x Thrds:    1x   1 (MaxFlatWorkGroupSize: 256) LDS Usage: 56B #SGPRs/VGPRs: 39/32 #SGPR/VGPR Spills: 0/0 Tripcount: 0

@shiltian shiltian force-pushed the users/shiltian/multi-dim-grid-runtime branch from f6b6ea5 to 47a3a30 on December 3, 2024 02:21
jdoerfert (Member) left a comment:

LG, fix the test.
Consider not printing dim 2 and 3 if we run in OpenMP mode, but OK for now.

@shiltian shiltian force-pushed the users/shiltian/multi-dim-grid-runtime branch from 47a3a30 to 5908b81 on December 5, 2024 20:11
shiltian (Contributor, Author) commented Dec 5, 2024

@jplehr Is that the only failure you observed? Could you help check the latest change?

@shiltian shiltian force-pushed the users/shiltian/multi-dim-grid-runtime branch from 5908b81 to cd4c7f5 on December 5, 2024 20:19
jplehr (Contributor) commented Dec 5, 2024

> @jplehr Is that the only failure you observed? Could you help check the latest change?

Let me give this another shot. I'll let you know!

jplehr (Contributor) commented Dec 5, 2024

I see these two failures on a local test run

  libomptarget :: x86_64-unknown-linux-gnu :: api/omp_env_vars.c
  libomptarget :: x86_64-unknown-linux-gnu-LTO :: api/omp_env_vars.c

The error message is:

# .---command stderr------------
# | /work/janplehr/git/llvm-project/offload/test/api/omp_env_vars.c:8:12: error: CHECK: expected string not found in input
# |  // CHECK: Launching kernel [[KERNEL:.+_main_.+]] with [1,1,1] blocks and [1,1,1] threads
# |            ^
...
# | Input was:
# | <<<<<<
# |            1: "PluginInterface" device 0 info: Launching kernel __omp_offloading_fd00_61c4b8b_main_l9 with [0,1,1] blocks and [0,1,1] threads in Generic mode 
# | check:8'0     X~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ error: no match found
# | check:8'1                                                           ?                                                                                          possible intended match
# | >>>>>>

@shiltian shiltian force-pushed the users/shiltian/multi-dim-grid-runtime branch from cd4c7f5 to 01139c5 on December 5, 2024 22:43
shiltian (Contributor, Author) commented Dec 5, 2024

I ended up forcing api/omp_env_vars.c to use the GPU.
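In lit that is typically done with a REQUIRES line at the top of the test. A sketch of the assumed form of the fix in offload/test/api/omp_env_vars.c; the RUN lines are reconstructed from the local reproduction earlier in the thread and are not the verbatim test header:

// REQUIRES: gpu
// RUN: %libomptarget-compile-generic
// RUN: env OMP_NUM_TEAMS=1 OMP_TEAMS_THREAD_LIMIT=1 LIBOMPTARGET_INFO=16 \
// RUN:   %libomptarget-run-generic 2>&1 | %fcheck-generic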

jplehr (Contributor) left a comment:

Ran this through local testing and saw the issues resolved.

@shiltian shiltian merged commit 92376c3 into main Dec 6, 2024
5 of 6 checks passed
@shiltian shiltian deleted the users/shiltian/multi-dim-grid-runtime branch December 6, 2024 14:07
ronlieb (Contributor) commented Dec 8, 2024

> Out of curiosity: Have you looked at how this applies downstream?
>
> No, I didn't.

In hindsight, I wish you had.

jhuber6 (Contributor) commented Dec 8, 2024

> In hindsight, I wish you had.

This pain is very much self-inflicted, I'm afraid.

KernelArgs.ThreadLimit[2]};
uint32_t NumBlocks[3] = {KernelArgs.NumTeams[0], KernelArgs.NumTeams[1],
KernelArgs.NumTeams[2]};
if (!IsBareKernel) {
A reviewer (Contributor) asked:

Before this patch, IsBareKernel was not checked. Is this change required for the multi-dim support? Or is it unrelated change? Same question for other new uses of IsBareKernel.

shiltian (Contributor, Author) replied on Dec 8, 2024:

Before this patch the check was in the two getter functions.
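A condensed view of that move, reconstructed from the diff and the quoted context above; the exact final spelling, including the IsNumThreadsFromUser flag, is an assumption:

// Before: each getter early-returned for bare kernels, e.g. in getNumThreads:
//   if (IsBareKernel && ThreadLimitClause[0] > 0)
//     return ThreadLimitClause[0];
// After: launch() skips both getters entirely for bare kernels, so the
// user-provided [x,y,z] values pass through to launchImpl untouched.
if (!IsBareKernel) {
  getNumThreads(GenericDevice, NumThreads);
  getNumBlocks(GenericDevice, NumBlocks, KernelArgs.Tripcount, NumThreads[0],
               KernelArgs.ThreadLimit[0] > 0);
}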

searlmc1 pushed a commit to ROCm/llvm-project that referenced this pull request Dec 8, 2024
broxigarchen pushed a commit to broxigarchen/llvm-project that referenced this pull request Dec 10, 2024
chrsmcgrr pushed a commit to RooflineAI/llvm-project that referenced this pull request Dec 12, 2024
TIFitis pushed a commit to TIFitis/llvm-project that referenced this pull request Dec 18, 2024