
[Offload][OMPX] Add the runtime support for multi-dim grid and block #118042

Merged: 1 commit into main from users/shiltian/multi-dim-grid-runtime on Dec 6, 2024

Conversation

shiltian (Contributor)
No description provided.

@shiltian shiltian requested review from jhuber6, jdoerfert and saiislam and removed request for jhuber6 and jdoerfert November 29, 2024 02:08
shiltian (Contributor, Author): This stack of pull requests is managed by Graphite.

@shiltian shiltian requested a review from jdoerfert November 29, 2024 02:08
@shiltian shiltian added the openmp:libomptarget OpenMP offload runtime label Nov 29, 2024
@shiltian shiltian requested a review from jplehr November 29, 2024 02:08
Review comment on offload/CMakeLists.txt (outdated, resolved).
@shiltian shiltian force-pushed the users/shiltian/multi-dim-grid-runtime branch from 91376a4 to 6e81e16 on November 29, 2024 02:09

github-actions bot commented Nov 29, 2024

⚠️ C/C++ code formatter, clang-format found issues in your code. ⚠️

You can test this locally with the following command:
git-clang-format --diff 68e75eebec4cf5fc7eef7d9525b276c4ff5e1b17 01139c5ee50e2598a410cea8510978b2e843c951 --extensions c,h,cpp -- offload/test/offloading/ompx_bare_multi_dim.cpp offload/plugins-nextgen/amdgpu/src/rtl.cpp offload/plugins-nextgen/common/include/PluginInterface.h offload/plugins-nextgen/common/src/PluginInterface.cpp offload/plugins-nextgen/cuda/src/rtl.cpp offload/plugins-nextgen/host/src/rtl.cpp offload/src/interface.cpp offload/src/omptarget.cpp offload/test/api/omp_env_vars.c offload/test/offloading/info.c offload/test/offloading/ompx_bare.c offload/test/offloading/small_trip_count.c offload/test/offloading/small_trip_count_thread_limit.cpp
The diff from clang-format:
diff --git a/offload/test/api/omp_env_vars.c b/offload/test/api/omp_env_vars.c
index a8a574be8c..8b51d46db7 100644
--- a/offload/test/api/omp_env_vars.c
+++ b/offload/test/api/omp_env_vars.c
@@ -6,7 +6,8 @@
 #define N 256
 
 int main() {
-  // CHECK: Launching kernel [[KERNEL:.+_main_.+]] with [1,1,1] blocks and [1,1,1] threads
+  // CHECK: Launching kernel [[KERNEL:.+_main_.+]] with [1,1,1] blocks and
+  // [1,1,1] threads
 #pragma omp target teams
 #pragma omp parallel
   {}
diff --git a/offload/test/offloading/info.c b/offload/test/offloading/info.c
index d86644b871..c3dec8a140 100644
--- a/offload/test/offloading/info.c
+++ b/offload/test/offloading/info.c
@@ -25,7 +25,7 @@ int main() {
   int C[N];
   int val = 1;
 
-// clang-format off
+  // clang-format off
 // INFO: info: Entering OpenMP data region with being_mapper at info.c:{{[0-9]+}}:{{[0-9]+}} with 3 arguments:
 // INFO: info: alloc(A[0:64])[256]
 // INFO: info: tofrom(B[0:64])[256]
@@ -60,7 +60,7 @@ int main() {
 // INFO: info: OpenMP Host-Device pointer mappings after block at info.c:[[#%u,]]:[[#%u,]]:
 // INFO: info: Host Ptr  Target Ptr Size (B) DynRefCount HoldRefCount Declaration
 // INFO: info: [[#%#x,]] [[#%#x,]]  4        INF         0            global at unknown:0:0
-// clang-format on
+  // clang-format on
 #pragma omp target data map(alloc : A[0 : N])                                  \
     map(ompx_hold, tofrom : B[0 : N]) map(to : C[0 : N])
 #pragma omp target firstprivate(val)
diff --git a/offload/test/offloading/ompx_bare.c b/offload/test/offloading/ompx_bare.c
index 6a6ada9617..f3b8042749 100644
--- a/offload/test/offloading/ompx_bare.c
+++ b/offload/test/offloading/ompx_bare.c
@@ -15,7 +15,9 @@ int main(int argc, char *argv[]) {
   const int N = num_blocks * block_size;
   int *data = (int *)malloc(N * sizeof(int));
 
-  // CHECK: "PluginInterface" device 0 info: Launching kernel __omp_offloading_{{.*}} with [64,1,1] blocks and [64,1,1] threads in SPMD mode
+  // CHECK: "PluginInterface" device 0 info: Launching kernel
+  // __omp_offloading_{{.*}} with [64,1,1] blocks and [64,1,1] threads in SPMD
+  // mode
 
 #pragma omp target teams ompx_bare num_teams(num_blocks) thread_limit(block_size) map(from: data[0:N])
   {
diff --git a/offload/test/offloading/ompx_bare_multi_dim.cpp b/offload/test/offloading/ompx_bare_multi_dim.cpp
index d37278525f..38c379064b 100644
--- a/offload/test/offloading/ompx_bare_multi_dim.cpp
+++ b/offload/test/offloading/ompx_bare_multi_dim.cpp
@@ -1,13 +1,14 @@
 // RUN: %libomptarget-compilexx-generic
-// RUN: env LIBOMPTARGET_INFO=63 %libomptarget-run-generic 2>&1 | %fcheck-generic
-// REQUIRES: gpu
+// RUN: env LIBOMPTARGET_INFO=63 %libomptarget-run-generic 2>&1 |
+// %fcheck-generic REQUIRES: gpu
 
 #include <ompx.h>
 
 #include <cassert>
 #include <vector>
 
-// CHECK: "PluginInterface" device 0 info: Launching kernel __omp_offloading_{{.*}} with [2,4,6] blocks and [32,4,2] threads in SPMD mode
+// CHECK: "PluginInterface" device 0 info: Launching kernel
+// __omp_offloading_{{.*}} with [2,4,6] blocks and [32,4,2] threads in SPMD mode
 
 int main(int argc, char *argv[]) {
   int bs[3] = {32u, 4u, 2u};

@shiltian shiltian force-pushed the users/shiltian/multi-dim-grid-runtime branch from 6e81e16 to 008c218 on November 29, 2024 02:20
jplehr (Contributor) commented Nov 29, 2024

Out of curiosity: Have you looked at how this applies downstream?

shiltian (Contributor, Author) replied:
> Out of curiosity: Have you looked at how this applies downstream?

No, I didn't.

@shiltian shiltian force-pushed the users/shiltian/multi-dim-grid-runtime branch from 008c218 to ab2eb63 on November 30, 2024 01:28
@shiltian shiltian marked this pull request as ready for review November 30, 2024 01:29
llvmbot (Member) commented Nov 30, 2024

@llvm/pr-subscribers-offload

@llvm/pr-subscribers-backend-amdgpu

Author: Shilei Tian (shiltian)

Changes

Patch is 23.81 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/118042.diff

8 Files Affected:

  • (modified) offload/plugins-nextgen/amdgpu/src/rtl.cpp (+38-31)
  • (modified) offload/plugins-nextgen/common/include/PluginInterface.h (+12-11)
  • (modified) offload/plugins-nextgen/common/src/PluginInterface.cpp (+38-30)
  • (modified) offload/plugins-nextgen/cuda/src/rtl.cpp (+9-10)
  • (modified) offload/plugins-nextgen/host/src/rtl.cpp (+2-2)
  • (modified) offload/src/interface.cpp (+4-10)
  • (modified) offload/src/omptarget.cpp (-2)
  • (added) offload/test/offloading/ompx_bare_multi_dim.cpp (+54)
diff --git a/offload/plugins-nextgen/amdgpu/src/rtl.cpp b/offload/plugins-nextgen/amdgpu/src/rtl.cpp
index 6356fa0554a9c1..2920435cd8b5c8 100644
--- a/offload/plugins-nextgen/amdgpu/src/rtl.cpp
+++ b/offload/plugins-nextgen/amdgpu/src/rtl.cpp
@@ -559,15 +559,15 @@ struct AMDGPUKernelTy : public GenericKernelTy {
   }
 
   /// Launch the AMDGPU kernel function.
-  Error launchImpl(GenericDeviceTy &GenericDevice, uint32_t NumThreads,
-                   uint64_t NumBlocks, KernelArgsTy &KernelArgs,
+  Error launchImpl(GenericDeviceTy &GenericDevice, uint32_t NumThreads[3],
+                   uint32_t NumBlocks[3], KernelArgsTy &KernelArgs,
                    KernelLaunchParamsTy LaunchParams,
                    AsyncInfoWrapperTy &AsyncInfoWrapper) const override;
 
   /// Print more elaborate kernel launch info for AMDGPU
   Error printLaunchInfoDetails(GenericDeviceTy &GenericDevice,
-                               KernelArgsTy &KernelArgs, uint32_t NumThreads,
-                               uint64_t NumBlocks) const override;
+                               KernelArgsTy &KernelArgs, uint32_t NumThreads[3],
+                               uint32_t NumBlocks[3]) const override;
 
   /// Get group and private segment kernel size.
   uint32_t getGroupSize() const { return GroupSize; }
@@ -719,7 +719,7 @@ struct AMDGPUQueueTy {
   /// Push a kernel launch to the queue. The kernel launch requires an output
   /// signal and can define an optional input signal (nullptr if none).
   Error pushKernelLaunch(const AMDGPUKernelTy &Kernel, void *KernelArgs,
-                         uint32_t NumThreads, uint64_t NumBlocks,
+                         uint32_t NumThreads[3], uint32_t NumBlocks[3],
                          uint32_t GroupSize, uint64_t StackSize,
                          AMDGPUSignalTy *OutputSignal,
                          AMDGPUSignalTy *InputSignal) {
@@ -746,14 +746,18 @@ struct AMDGPUQueueTy {
     assert(Packet && "Invalid packet");
 
     // The first 32 bits of the packet are written after the other fields
-    uint16_t Setup = UINT16_C(1) << HSA_KERNEL_DISPATCH_PACKET_SETUP_DIMENSIONS;
-    Packet->workgroup_size_x = NumThreads;
-    Packet->workgroup_size_y = 1;
-    Packet->workgroup_size_z = 1;
+    uint16_t Dims = NumBlocks[2] * NumThreads[2] > 1
+                        ? 3
+                        : 1 + (NumBlocks[1] * NumThreads[1] != 1);
+    uint16_t Setup = UINT16_C(Dims)
+                     << HSA_KERNEL_DISPATCH_PACKET_SETUP_DIMENSIONS;
+    Packet->workgroup_size_x = NumThreads[0];
+    Packet->workgroup_size_y = NumThreads[1];
+    Packet->workgroup_size_z = NumThreads[2];
     Packet->reserved0 = 0;
-    Packet->grid_size_x = NumBlocks * NumThreads;
-    Packet->grid_size_y = 1;
-    Packet->grid_size_z = 1;
+    Packet->grid_size_x = NumBlocks[0] * NumThreads[0];
+    Packet->grid_size_y = NumBlocks[1] * NumThreads[1];
+    Packet->grid_size_z = NumBlocks[2] * NumThreads[2];
     Packet->private_segment_size =
         Kernel.usesDynamicStack() ? StackSize : Kernel.getPrivateSize();
     Packet->group_segment_size = GroupSize;
@@ -1240,7 +1244,7 @@ struct AMDGPUStreamTy {
   /// the kernel finalizes. Once the kernel is finished, the stream will release
   /// the kernel args buffer to the specified memory manager.
   Error pushKernelLaunch(const AMDGPUKernelTy &Kernel, void *KernelArgs,
-                         uint32_t NumThreads, uint64_t NumBlocks,
+                         uint32_t NumThreads[3], uint32_t NumBlocks[3],
                          uint32_t GroupSize, uint64_t StackSize,
                          AMDGPUMemoryManagerTy &MemoryManager) {
     if (Queue == nullptr)
@@ -2829,10 +2833,10 @@ struct AMDGPUDeviceTy : public GenericDeviceTy, AMDGenericDeviceTy {
     AsyncInfoWrapperTy AsyncInfoWrapper(*this, nullptr);
 
     KernelArgsTy KernelArgs = {};
-    if (auto Err =
-            AMDGPUKernel.launchImpl(*this, /*NumThread=*/1u,
-                                    /*NumBlocks=*/1ul, KernelArgs,
-                                    KernelLaunchParamsTy{}, AsyncInfoWrapper))
+    uint32_t NumBlocksAndThreads[3] = {1u, 1u, 1u};
+    if (auto Err = AMDGPUKernel.launchImpl(
+            *this, NumBlocksAndThreads, NumBlocksAndThreads, KernelArgs,
+            KernelLaunchParamsTy{}, AsyncInfoWrapper))
       return Err;
 
     Error Err = Plugin::success();
@@ -3330,7 +3334,7 @@ struct AMDGPUPluginTy final : public GenericPluginTy {
 };
 
 Error AMDGPUKernelTy::launchImpl(GenericDeviceTy &GenericDevice,
-                                 uint32_t NumThreads, uint64_t NumBlocks,
+                                 uint32_t NumThreads[3], uint32_t NumBlocks[3],
                                  KernelArgsTy &KernelArgs,
                                  KernelLaunchParamsTy LaunchParams,
                                  AsyncInfoWrapperTy &AsyncInfoWrapper) const {
@@ -3387,13 +3391,15 @@ Error AMDGPUKernelTy::launchImpl(GenericDeviceTy &GenericDevice,
   // Only COV5 implicitargs needs to be set. COV4 implicitargs are not used.
   if (ImplArgs &&
       getImplicitArgsSize() == sizeof(hsa_utils::AMDGPUImplicitArgsTy)) {
-    ImplArgs->BlockCountX = NumBlocks;
-    ImplArgs->BlockCountY = 1;
-    ImplArgs->BlockCountZ = 1;
-    ImplArgs->GroupSizeX = NumThreads;
-    ImplArgs->GroupSizeY = 1;
-    ImplArgs->GroupSizeZ = 1;
-    ImplArgs->GridDims = 1;
+    ImplArgs->BlockCountX = NumBlocks[0];
+    ImplArgs->BlockCountY = NumBlocks[1];
+    ImplArgs->BlockCountZ = NumBlocks[2];
+    ImplArgs->GroupSizeX = NumThreads[0];
+    ImplArgs->GroupSizeY = NumThreads[1];
+    ImplArgs->GroupSizeZ = NumThreads[2];
+    ImplArgs->GridDims = NumBlocks[2] * NumThreads[2] > 1
+                             ? 3
+                             : 1 + (NumBlocks[1] * NumThreads[1] != 1);
     ImplArgs->DynamicLdsSize = KernelArgs.DynCGroupMem;
   }
 
@@ -3404,8 +3410,8 @@ Error AMDGPUKernelTy::launchImpl(GenericDeviceTy &GenericDevice,
 
 Error AMDGPUKernelTy::printLaunchInfoDetails(GenericDeviceTy &GenericDevice,
                                              KernelArgsTy &KernelArgs,
-                                             uint32_t NumThreads,
-                                             uint64_t NumBlocks) const {
+                                             uint32_t NumThreads[3],
+                                             uint32_t NumBlocks[3]) const {
   // Only do all this when the output is requested
   if (!(getInfoLevel() & OMP_INFOTYPE_PLUGIN_KERNEL))
     return Plugin::success();
@@ -3442,12 +3448,13 @@ Error AMDGPUKernelTy::printLaunchInfoDetails(GenericDeviceTy &GenericDevice,
   // S/VGPR Spill Count: how many S/VGPRs are spilled by the kernel
   // Tripcount: loop tripcount for the kernel
   INFO(OMP_INFOTYPE_PLUGIN_KERNEL, GenericDevice.getDeviceId(),
-       "#Args: %d Teams x Thrds: %4lux%4u (MaxFlatWorkGroupSize: %u) LDS "
+       "#Args: %d Teams x Thrds: %4ux%4u (MaxFlatWorkGroupSize: %u) LDS "
        "Usage: %uB #SGPRs/VGPRs: %u/%u #SGPR/VGPR Spills: %u/%u Tripcount: "
        "%lu\n",
-       ArgNum, NumGroups, ThreadsPerGroup, MaxFlatWorkgroupSize,
-       GroupSegmentSize, SGPRCount, VGPRCount, SGPRSpillCount, VGPRSpillCount,
-       LoopTripCount);
+       ArgNum, NumGroups[0] * NumGroups[1] * NumGroups[2],
+       ThreadsPerGroup[0] * ThreadsPerGroup[1] * ThreadsPerGroup[2],
+       MaxFlatWorkgroupSize, GroupSegmentSize, SGPRCount, VGPRCount,
+       SGPRSpillCount, VGPRSpillCount, LoopTripCount);
 
   return Plugin::success();
 }
diff --git a/offload/plugins-nextgen/common/include/PluginInterface.h b/offload/plugins-nextgen/common/include/PluginInterface.h
index 41cc0f286a581f..be3467c3f7098f 100644
--- a/offload/plugins-nextgen/common/include/PluginInterface.h
+++ b/offload/plugins-nextgen/common/include/PluginInterface.h
@@ -265,8 +265,9 @@ struct GenericKernelTy {
   Error launch(GenericDeviceTy &GenericDevice, void **ArgPtrs,
                ptrdiff_t *ArgOffsets, KernelArgsTy &KernelArgs,
                AsyncInfoWrapperTy &AsyncInfoWrapper) const;
-  virtual Error launchImpl(GenericDeviceTy &GenericDevice, uint32_t NumThreads,
-                           uint64_t NumBlocks, KernelArgsTy &KernelArgs,
+  virtual Error launchImpl(GenericDeviceTy &GenericDevice,
+                           uint32_t NumThreads[3], uint32_t NumBlocks[3],
+                           KernelArgsTy &KernelArgs,
                            KernelLaunchParamsTy LaunchParams,
                            AsyncInfoWrapperTy &AsyncInfoWrapper) const = 0;
 
@@ -316,15 +317,15 @@ struct GenericKernelTy {
 
   /// Prints generic kernel launch information.
   Error printLaunchInfo(GenericDeviceTy &GenericDevice,
-                        KernelArgsTy &KernelArgs, uint32_t NumThreads,
-                        uint64_t NumBlocks) const;
+                        KernelArgsTy &KernelArgs, uint32_t NumThreads[3],
+                        uint32_t NumBlocks[3]) const;
 
   /// Prints plugin-specific kernel launch information after generic kernel
   /// launch information
   virtual Error printLaunchInfoDetails(GenericDeviceTy &GenericDevice,
                                        KernelArgsTy &KernelArgs,
-                                       uint32_t NumThreads,
-                                       uint64_t NumBlocks) const;
+                                       uint32_t NumThreads[3],
+                                       uint32_t NumBlocks[3]) const;
 
 private:
   /// Prepare the arguments before launching the kernel.
@@ -337,15 +338,15 @@ struct GenericKernelTy {
 
   /// Get the number of threads and blocks for the kernel based on the
   /// user-defined threads and block clauses.
-  uint32_t getNumThreads(GenericDeviceTy &GenericDevice,
-                         uint32_t ThreadLimitClause[3]) const;
+  void getNumThreads(GenericDeviceTy &GenericDevice,
+                     uint32_t ThreadLimitClause[3]) const;
 
   /// The number of threads \p NumThreads can be adjusted by this method.
   /// \p IsNumThreadsFromUser is true is \p NumThreads is defined by user via
   /// thread_limit clause.
-  uint64_t getNumBlocks(GenericDeviceTy &GenericDevice,
-                        uint32_t BlockLimitClause[3], uint64_t LoopTripCount,
-                        uint32_t &NumThreads, bool IsNumThreadsFromUser) const;
+  void getNumBlocks(GenericDeviceTy &GenericDevice,
+                    uint32_t BlockLimitClause[3], uint64_t LoopTripCount,
+                    uint32_t &NumThreads, bool IsNumThreadsFromUser) const;
 
   /// Indicate if the kernel works in Generic SPMD, Generic or SPMD mode.
   bool isGenericSPMDMode() const {
diff --git a/offload/plugins-nextgen/common/src/PluginInterface.cpp b/offload/plugins-nextgen/common/src/PluginInterface.cpp
index 25b815b7f96694..1abc521f47ef33 100644
--- a/offload/plugins-nextgen/common/src/PluginInterface.cpp
+++ b/offload/plugins-nextgen/common/src/PluginInterface.cpp
@@ -526,20 +526,21 @@ GenericKernelTy::getKernelLaunchEnvironment(
 
 Error GenericKernelTy::printLaunchInfo(GenericDeviceTy &GenericDevice,
                                        KernelArgsTy &KernelArgs,
-                                       uint32_t NumThreads,
-                                       uint64_t NumBlocks) const {
+                                       uint32_t NumThreads[3],
+                                       uint32_t NumBlocks[3]) const {
   INFO(OMP_INFOTYPE_PLUGIN_KERNEL, GenericDevice.getDeviceId(),
-       "Launching kernel %s with %" PRIu64
-       " blocks and %d threads in %s mode\n",
-       getName(), NumBlocks, NumThreads, getExecutionModeName());
+       "Launching kernel %s with [%u,%u,%u] blocks and [%u,%u,%u] threads in "
+       "%s mode\n",
+       getName(), NumBlocks[0], NumBlocks[1], NumBlocks[2], NumThreads[0],
+       NumThreads[1], NumThreads[2], getExecutionModeName());
   return printLaunchInfoDetails(GenericDevice, KernelArgs, NumThreads,
                                 NumBlocks);
 }
 
 Error GenericKernelTy::printLaunchInfoDetails(GenericDeviceTy &GenericDevice,
                                               KernelArgsTy &KernelArgs,
-                                              uint32_t NumThreads,
-                                              uint64_t NumBlocks) const {
+                                              uint32_t NumThreads[3],
+                                              uint32_t NumBlocks[3]) const {
   return Plugin::success();
 }
 
@@ -566,10 +567,14 @@ Error GenericKernelTy::launch(GenericDeviceTy &GenericDevice, void **ArgPtrs,
                     Args, Ptrs, *KernelLaunchEnvOrErr);
   }
 
-  uint32_t NumThreads = getNumThreads(GenericDevice, KernelArgs.ThreadLimit);
-  uint64_t NumBlocks =
-      getNumBlocks(GenericDevice, KernelArgs.NumTeams, KernelArgs.Tripcount,
-                   NumThreads, KernelArgs.ThreadLimit[0] > 0);
+  uint32_t NumThreads[3] = {KernelArgs.ThreadLimit[0],
+                            KernelArgs.ThreadLimit[1],
+                            KernelArgs.ThreadLimit[2]};
+  uint32_t NumBlocks[3] = {KernelArgs.NumTeams[0], KernelArgs.NumTeams[1],
+                           KernelArgs.NumTeams[2]};
+  getNumThreads(GenericDevice, NumThreads);
+  getNumBlocks(GenericDevice, NumBlocks, KernelArgs.Tripcount, NumThreads[0],
+               NumThreads[0] > 0);
 
   // Record the kernel description after we modified the argument count and num
   // blocks/threads.
@@ -578,7 +583,8 @@ Error GenericKernelTy::launch(GenericDeviceTy &GenericDevice, void **ArgPtrs,
     RecordReplay.saveImage(getName(), getImage());
     RecordReplay.saveKernelInput(getName(), getImage());
     RecordReplay.saveKernelDescr(getName(), LaunchParams, KernelArgs.NumArgs,
-                                 NumBlocks, NumThreads, KernelArgs.Tripcount);
+                                 NumBlocks[0], NumThreads[0],
+                                 KernelArgs.Tripcount);
   }
 
   if (auto Err =
@@ -616,38 +622,39 @@ KernelLaunchParamsTy GenericKernelTy::prepareArgs(
   return KernelLaunchParamsTy{sizeof(void *) * NumArgs, &Args[0], &Ptrs[0]};
 }
 
-uint32_t GenericKernelTy::getNumThreads(GenericDeviceTy &GenericDevice,
-                                        uint32_t ThreadLimitClause[3]) const {
+void GenericKernelTy::getNumThreads(GenericDeviceTy &GenericDevice,
+                                    uint32_t ThreadLimitClause[3]) const {
+  if (IsBareKernel)
+    return;
+
   assert(ThreadLimitClause[1] == 0 && ThreadLimitClause[2] == 0 &&
          "Multi dimensional launch not supported yet.");
 
-  if (IsBareKernel && ThreadLimitClause[0] > 0)
-    return ThreadLimitClause[0];
-
   if (ThreadLimitClause[0] > 0 && isGenericMode())
     ThreadLimitClause[0] += GenericDevice.getWarpSize();
 
-  return std::min(MaxNumThreads, (ThreadLimitClause[0] > 0)
-                                     ? ThreadLimitClause[0]
-                                     : PreferredNumThreads);
+  ThreadLimitClause[0] =
+      std::min(MaxNumThreads, (ThreadLimitClause[0] > 0) ? ThreadLimitClause[0]
+                                                         : PreferredNumThreads);
 }
 
-uint64_t GenericKernelTy::getNumBlocks(GenericDeviceTy &GenericDevice,
-                                       uint32_t NumTeamsClause[3],
-                                       uint64_t LoopTripCount,
-                                       uint32_t &NumThreads,
-                                       bool IsNumThreadsFromUser) const {
+void GenericKernelTy::getNumBlocks(GenericDeviceTy &GenericDevice,
+                                   uint32_t NumTeamsClause[3],
+                                   uint64_t LoopTripCount, uint32_t &NumThreads,
+                                   bool IsNumThreadsFromUser) const {
+  if (IsBareKernel)
+    return;
+
   assert(NumTeamsClause[1] == 0 && NumTeamsClause[2] == 0 &&
          "Multi dimensional launch not supported yet.");
 
-  if (IsBareKernel && NumTeamsClause[0] > 0)
-    return NumTeamsClause[0];
-
   if (NumTeamsClause[0] > 0) {
     // TODO: We need to honor any value and consequently allow more than the
     // block limit. For this we might need to start multiple kernels or let the
     // blocks start again until the requested number has been started.
-    return std::min(NumTeamsClause[0], GenericDevice.getBlockLimit());
+    NumTeamsClause[0] =
+        std::min(NumTeamsClause[0], GenericDevice.getBlockLimit());
+    return;
   }
 
   uint64_t DefaultNumBlocks = GenericDevice.getDefaultNumBlocks();
@@ -719,7 +726,8 @@ uint64_t GenericKernelTy::getNumBlocks(GenericDeviceTy &GenericDevice,
   // If the loops are long running we rather reuse blocks than spawn too many.
   if (GenericDevice.getReuseBlocksForHighTripCount())
     PreferredNumBlocks = std::min(TripCountNumBlocks, DefaultNumBlocks);
-  return std::min(PreferredNumBlocks, GenericDevice.getBlockLimit());
+  NumTeamsClause[0] =
+      std::min(PreferredNumBlocks, GenericDevice.getBlockLimit());
 }
 
 GenericDeviceTy::GenericDeviceTy(GenericPluginTy &Plugin, int32_t DeviceId,
diff --git a/offload/plugins-nextgen/cuda/src/rtl.cpp b/offload/plugins-nextgen/cuda/src/rtl.cpp
index 015c7775ba3513..1e3d79c9554c67 100644
--- a/offload/plugins-nextgen/cuda/src/rtl.cpp
+++ b/offload/plugins-nextgen/cuda/src/rtl.cpp
@@ -149,8 +149,8 @@ struct CUDAKernelTy : public GenericKernelTy {
   }
 
   /// Launch the CUDA kernel function.
-  Error launchImpl(GenericDeviceTy &GenericDevice, uint32_t NumThreads,
-                   uint64_t NumBlocks, KernelArgsTy &KernelArgs,
+  Error launchImpl(GenericDeviceTy &GenericDevice, uint32_t NumThreads[3],
+                   uint32_t NumBlocks[3], KernelArgsTy &KernelArgs,
                    KernelLaunchParamsTy LaunchParams,
                    AsyncInfoWrapperTy &AsyncInfoWrapper) const override;
 
@@ -1230,10 +1230,10 @@ struct CUDADeviceTy : public GenericDeviceTy {
     AsyncInfoWrapperTy AsyncInfoWrapper(*this, nullptr);
 
     KernelArgsTy KernelArgs = {};
-    if (auto Err =
-            CUDAKernel.launchImpl(*this, /*NumThread=*/1u,
-                                  /*NumBlocks=*/1ul, KernelArgs,
-                                  KernelLaunchParamsTy{}, AsyncInfoWrapper))
+    uint32_t NumBlocksAndThreads[3] = {1u, 1u, 1u};
+    if (auto Err = CUDAKernel.launchImpl(
+            *this, NumBlocksAndThreads, NumBlocksAndThreads, KernelArgs,
+            KernelLaunchParamsTy{}, AsyncInfoWrapper))
       return Err;
 
     Error Err = Plugin::success();
@@ -1276,7 +1276,7 @@ struct CUDADeviceTy : public GenericDeviceTy {
 };
 
 Error CUDAKernelTy::launchImpl(GenericDeviceTy &GenericDevice,
-                               uint32_t NumThreads, uint64_t NumBlocks,
+                               uint32_t NumThreads[3], uint32_t NumBlocks[3],
                                KernelArgsTy &KernelArgs,
                                KernelLaunchParamsTy LaunchParams,
                                AsyncInfoWrapperTy &AsyncInfoWrapper) const {
@@ -1294,9 +1294,8 @@ Error CUDAKernelTy::launchImpl(GenericDeviceTy &GenericDevice,
                     reinterpret_cast<void *>(&LaunchParams.Size),
                     CU_LAUNCH_PARAM_END};
 
-  CUresult Res = cuLaunchKernel(Func, NumBlocks, /*gridDimY=*/1,
-                                /*gridDimZ=*/1, NumThreads,
-                                /*blockDimY=*/1, /*blockDimZ=*/1,
+  CUresult Res = cuLaunchKernel(Func, NumBlocks[0], NumBlocks[1], NumBlocks[2],
+                                NumThreads[0], NumThreads[1], NumThreads[2],
                                 MaxDynCGroupMem, Stream, nullptr, Config);
   return Plugin::check(Res, "Error in cuLaunchKernel for '%s': %s", getName());
 }
diff --git a/offload/plugins-nextgen/host/src/rtl.cpp b/offload/plugins-nextgen/host/src/rtl.cpp
index 6f2e3d8604ec82..915c41e88c5828 100644
--- a/offload/plugins-nextgen/host/src/rtl.cpp
+++ b/offload/plugins-nextgen/host/src/rtl.cpp
@@ -89,8 +89,8 @@ struct GenELF64KernelTy : public GenericKernelTy {
   }
 
   /// Launch the kernel using the libffi.
-  Error launchImpl(GenericDeviceTy &GenericDevice, uint32_t NumThreads,
-                   uint64_t NumBlocks, KernelArgsTy &KernelArgs,
+  Error launchImpl(GenericDeviceTy &GenericDevice, uint32_t NumThreads[3],
+                   uint32_t NumBlocks[3], KernelArgsTy &KernelArgs,
                    KernelLaunchParamsTy LaunchParams,
                    AsyncInfoWrapperTy &AsyncInfoWrapper)...
[truncated]
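For orientation, here is a minimal sketch of the kind of launch the newly added ompx_bare_multi_dim.cpp test exercises, matching the [2,4,6] blocks and [32,4,2] threads in its CHECK line. Since the test body is truncated above, the multi-value num_teams/thread_limit spelling and the accessor names are assumptions taken from ompx.h, not a verbatim copy of the test.

#include <ompx.h>

int main() {
  // Launch a 3-D grid of [2,4,6] blocks with [32,4,2] threads per block in
  // bare (SPMD) mode; the three-value clause spelling is assumed here.
#pragma omp target teams ompx_bare num_teams(2, 4, 6) thread_limit(32, 4, 2)
  {
    // ompx.h exposes per-dimension IDs; exact names assumed from the header.
    int TId = ompx::thread_id_x();
    int BId = ompx::block_id_z();
    (void)TId;
    (void)BId;
  }
  return 0;
}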

@shiltian shiltian force-pushed the users/shiltian/multi-dim-grid-runtime branch 2 times, most recently from cca25b5 to f6b6ea5 on December 2, 2024 02:02
Comment on lines 575 to 577
// FIXME: This is a WA to "calibrate" the bad work done in the front end.
// Delete this ugly code after the front end emits proper values.
auto CorrectMultiDim = [](uint32_t (&Val)[3]) {
A reviewer (Contributor) commented:

Just directly fix this? Where is this coming from?

shiltian (Contributor, Author) replied:

This is from the front end. In regular OpenMP, 0 means the user didn't specify a value, so the runtime needs to choose one. The multi-dim kernel launch is purely an extension and hasn't been standardized yet. I'll do a follow-up patch that emits [0,1,1] for regular OpenMP when nothing is specified and [x,y,z] for OMPX.
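A minimal sketch of what that calibration workaround might look like, assuming it simply maps the front end's 0 placeholders in the upper dimensions back to 1. Only the lambda header appears in the quoted snippet above, so the body and the call sites are assumptions:

// FIXME: This is a WA to "calibrate" the bad work done in the front end.
// Delete this ugly code after the front end emits proper values.
auto CorrectMultiDim = [](uint32_t (&Val)[3]) {
  // 0 means "unspecified"; dimensions 1 and 2 must be at least 1 before the
  // values reach the plugin's launchImpl.
  if (Val[1] == 0)
    Val[1] = 1;
  if (Val[2] == 0)
    Val[2] = 1;
};
CorrectMultiDim(NumThreads); // assumed call sites on the launch arrays
CorrectMultiDim(NumBlocks);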

jplehr (Contributor) commented Dec 2, 2024

I ran this through our buildbot config and the new tests do not seem to pass. The 0-th dimension appears to be the culprit. Can you double-check that it works?
If you need any help or info, let me know.

# .---command stderr------------
# | llvm-project/offload/test/api/omp_env_vars.c:8:12: error: CHECK: expected string not found in input                                                                                                       
# |  // CHECK: Launching kernel [[KERNEL:.+_main_.+]] with [1,1,1] blocks and [1,1,1] threads
# |            ^
# | <stdin>:1:1: note: scanning from here
# | "PluginInterface" device 0 info: Launching kernel __omp_offloading_fd00_61c4b8b_main_l9 with [0,1,1] blocks and [0,1,1] threads in Generic mode                                                                                     
# | ^
# | <stdin>:1:55: note: possible intended match here
# | "PluginInterface" device 0 info: Launching kernel __omp_offloading_fd00_61c4b8b_main_l9 with [0,1,1] blocks and [0,1,1] threads in Generic mode                                                                                     
# |                                                       ^
# |
# | Input file: <stdin>
# | Check file: llvm-project/offload/test/api/omp_env_vars.c
# |
# | -dump-input=help explains the following input dump.
# |
# | Input was:
# | <<<<<<
# |            1: "PluginInterface" device 0 info: Launching kernel __omp_offloading_fd00_61c4b8b_main_l9 with [0,1,1] blocks and [0,1,1] threads in Generic mode                                                                       
# | check:8'0     X~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ error: no match found                                                
# | check:8'1                                                           ?                                                                                          possible intended match                                              
# | >>>>>>
# `-----------------------------
# error: command failed with exit status: 1

shiltian (Contributor, Author) commented Dec 2, 2024

Hmm, something is wrong in the AMD part. I tested it on a local NVIDIA GPU and it works. Will take a look.

shiltian (Contributor, Author) commented Dec 2, 2024

@jplehr I tested it but couldn't reproduce the failure locally.

➜  clang -fopenmp --offload-arch=gfx1030 llvm-project/offload/test/api/omp_env_vars.c -o main
➜  env OMP_NUM_TEAMS=1 OMP_TEAMS_THREAD_LIMIT=1 LIBOMPTARGET_INFO=16 ./main
"PluginInterface" device 0 info: Launching kernel __omp_offloading_10302_8ee073e_main_l9 with [1,1,1] blocks and [1,1,1] threads in SPMD mode
AMDGPU device 0 info: #Args: 1 Teams x Thrds:    1x   1 (MaxFlatWorkGroupSize: 256) LDS Usage: 56B #SGPRs/VGPRs: 39/32 #SGPR/VGPR Spills: 0/0 Tripcount: 0

@shiltian shiltian force-pushed the users/shiltian/multi-dim-grid-runtime branch from f6b6ea5 to 47a3a30 on December 3, 2024 02:21
jdoerfert (Member) left a comment:

LG, fix the test.
Consider not printing dim 2 and 3 if we run in OpenMP mode, but OK for now.

@shiltian shiltian force-pushed the users/shiltian/multi-dim-grid-runtime branch from 47a3a30 to 5908b81 on December 5, 2024 20:11
shiltian (Contributor, Author) commented Dec 5, 2024

@jplehr Is that the only failure you observed? Could you help check the latest change?

@shiltian shiltian force-pushed the users/shiltian/multi-dim-grid-runtime branch from 5908b81 to cd4c7f5 on December 5, 2024 20:19
jplehr (Contributor) commented Dec 5, 2024

> @jplehr Is that the only failure you observed? Could you help check the latest change?

Let me give this another shot. I'll let you know!

jplehr (Contributor) commented Dec 5, 2024

I see these two failures on a local test run

  libomptarget :: x86_64-unknown-linux-gnu :: api/omp_env_vars.c
  libomptarget :: x86_64-unknown-linux-gnu-LTO :: api/omp_env_vars.c

The error message is:

# .---command stderr------------
# | /work/janplehr/git/llvm-project/offload/test/api/omp_env_vars.c:8:12: error: CHECK: expected string not found in input
# |  // CHECK: Launching kernel [[KERNEL:.+_main_.+]] with [1,1,1] blocks and [1,1,1] threads
# |            ^
...
# | Input was:
# | <<<<<<
# |            1: "PluginInterface" device 0 info: Launching kernel __omp_offloading_fd00_61c4b8b_main_l9 with [0,1,1] blocks and [0,1,1] threads in Generic mode 
# | check:8'0     X~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ error: no match found
# | check:8'1                                                           ?                                                                                          possible intended match
# | >>>>>>

@shiltian shiltian force-pushed the users/shiltian/multi-dim-grid-runtime branch from cd4c7f5 to 01139c5 on December 5, 2024 22:43
shiltian (Contributor, Author) commented Dec 5, 2024

I ended up forcing api/omp_env_vars.c to use the GPU.
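In lit that is typically done with a REQUIRES line at the top of the test. A sketch of the assumed form of the fix in offload/test/api/omp_env_vars.c; the RUN lines are reconstructed from the local reproduction earlier in the thread and are not the verbatim test header:

// REQUIRES: gpu
// RUN: %libomptarget-compile-generic
// RUN: env OMP_NUM_TEAMS=1 OMP_TEAMS_THREAD_LIMIT=1 LIBOMPTARGET_INFO=16 \
// RUN:   %libomptarget-run-generic 2>&1 | %fcheck-generic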

jplehr (Contributor) left a comment:

Ran this through local testing and saw the issues resolved.

@shiltian shiltian merged commit 92376c3 into main Dec 6, 2024
5 of 6 checks passed
@shiltian shiltian deleted the users/shiltian/multi-dim-grid-runtime branch December 6, 2024 14:07
ronlieb (Contributor) commented Dec 8, 2024

> Out of curiosity: Have you looked at how this applies downstream?
>
> No, I didn't.

In hindsight, I wish you had.

jhuber6 (Contributor) commented Dec 8, 2024

> In hindsight, I wish you had.

This pain is very much self-inflicted, I'm afraid.

KernelArgs.ThreadLimit[2]};
uint32_t NumBlocks[3] = {KernelArgs.NumTeams[0], KernelArgs.NumTeams[1],
KernelArgs.NumTeams[2]};
if (!IsBareKernel) {
A reviewer (Contributor) asked:

Before this patch, IsBareKernel was not checked. Is this change required for the multi-dim support? Or is it unrelated change? Same question for other new uses of IsBareKernel.

shiltian (Contributor, Author) replied on Dec 8, 2024:

Before this patch the check was in the two getter functions.
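A condensed view of that move, reconstructed from the diff and the quoted context above; the exact final spelling, including the IsNumThreadsFromUser flag, is an assumption:

// Before: each getter early-returned for bare kernels, e.g. in getNumThreads:
//   if (IsBareKernel && ThreadLimitClause[0] > 0)
//     return ThreadLimitClause[0];
// After: launch() skips both getters entirely for bare kernels, so the
// user-provided [x,y,z] values pass through to launchImpl untouched.
if (!IsBareKernel) {
  getNumThreads(GenericDevice, NumThreads);
  getNumBlocks(GenericDevice, NumBlocks, KernelArgs.Tripcount, NumThreads[0],
               KernelArgs.ThreadLimit[0] > 0);
}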

searlmc1 pushed a commit to ROCm/llvm-project that referenced this pull request Dec 8, 2024
broxigarchen pushed a commit to broxigarchen/llvm-project that referenced this pull request Dec 10, 2024
chrsmcgrr pushed a commit to RooflineAI/llvm-project that referenced this pull request Dec 12, 2024
TIFitis pushed a commit to TIFitis/llvm-project that referenced this pull request Dec 18, 2024