[Offload][OMPX] Add the runtime support for multi-dim grid and block #118042
Conversation
This stack of pull requests is managed by Graphite.
Force-pushed 91376a4 to 6e81e16.
You can test this locally with the following command:

git-clang-format --diff 68e75eebec4cf5fc7eef7d9525b276c4ff5e1b17 01139c5ee50e2598a410cea8510978b2e843c951 --extensions c,h,cpp -- offload/test/offloading/ompx_bare_multi_dim.cpp offload/plugins-nextgen/amdgpu/src/rtl.cpp offload/plugins-nextgen/common/include/PluginInterface.h offload/plugins-nextgen/common/src/PluginInterface.cpp offload/plugins-nextgen/cuda/src/rtl.cpp offload/plugins-nextgen/host/src/rtl.cpp offload/src/interface.cpp offload/src/omptarget.cpp offload/test/api/omp_env_vars.c offload/test/offloading/info.c offload/test/offloading/ompx_bare.c offload/test/offloading/small_trip_count.c offload/test/offloading/small_trip_count_thread_limit.cpp

View the diff from clang-format here.

diff --git a/offload/test/api/omp_env_vars.c b/offload/test/api/omp_env_vars.c
index a8a574be8c..8b51d46db7 100644
--- a/offload/test/api/omp_env_vars.c
+++ b/offload/test/api/omp_env_vars.c
@@ -6,7 +6,8 @@
#define N 256
int main() {
- // CHECK: Launching kernel [[KERNEL:.+_main_.+]] with [1,1,1] blocks and [1,1,1] threads
+ // CHECK: Launching kernel [[KERNEL:.+_main_.+]] with [1,1,1] blocks and
+ // [1,1,1] threads
#pragma omp target teams
#pragma omp parallel
{}
diff --git a/offload/test/offloading/info.c b/offload/test/offloading/info.c
index d86644b871..c3dec8a140 100644
--- a/offload/test/offloading/info.c
+++ b/offload/test/offloading/info.c
@@ -25,7 +25,7 @@ int main() {
int C[N];
int val = 1;
-// clang-format off
+ // clang-format off
// INFO: info: Entering OpenMP data region with being_mapper at info.c:{{[0-9]+}}:{{[0-9]+}} with 3 arguments:
// INFO: info: alloc(A[0:64])[256]
// INFO: info: tofrom(B[0:64])[256]
@@ -60,7 +60,7 @@ int main() {
// INFO: info: OpenMP Host-Device pointer mappings after block at info.c:[[#%u,]]:[[#%u,]]:
// INFO: info: Host Ptr Target Ptr Size (B) DynRefCount HoldRefCount Declaration
// INFO: info: [[#%#x,]] [[#%#x,]] 4 INF 0 global at unknown:0:0
-// clang-format on
+ // clang-format on
#pragma omp target data map(alloc : A[0 : N]) \
map(ompx_hold, tofrom : B[0 : N]) map(to : C[0 : N])
#pragma omp target firstprivate(val)
diff --git a/offload/test/offloading/ompx_bare.c b/offload/test/offloading/ompx_bare.c
index 6a6ada9617..f3b8042749 100644
--- a/offload/test/offloading/ompx_bare.c
+++ b/offload/test/offloading/ompx_bare.c
@@ -15,7 +15,9 @@ int main(int argc, char *argv[]) {
const int N = num_blocks * block_size;
int *data = (int *)malloc(N * sizeof(int));
- // CHECK: "PluginInterface" device 0 info: Launching kernel __omp_offloading_{{.*}} with [64,1,1] blocks and [64,1,1] threads in SPMD mode
+ // CHECK: "PluginInterface" device 0 info: Launching kernel
+ // __omp_offloading_{{.*}} with [64,1,1] blocks and [64,1,1] threads in SPMD
+ // mode
#pragma omp target teams ompx_bare num_teams(num_blocks) thread_limit(block_size) map(from: data[0:N])
{
diff --git a/offload/test/offloading/ompx_bare_multi_dim.cpp b/offload/test/offloading/ompx_bare_multi_dim.cpp
index d37278525f..38c379064b 100644
--- a/offload/test/offloading/ompx_bare_multi_dim.cpp
+++ b/offload/test/offloading/ompx_bare_multi_dim.cpp
@@ -1,13 +1,14 @@
// RUN: %libomptarget-compilexx-generic
-// RUN: env LIBOMPTARGET_INFO=63 %libomptarget-run-generic 2>&1 | %fcheck-generic
-// REQUIRES: gpu
+// RUN: env LIBOMPTARGET_INFO=63 %libomptarget-run-generic 2>&1 |
+// %fcheck-generic REQUIRES: gpu
#include <ompx.h>
#include <cassert>
#include <vector>
-// CHECK: "PluginInterface" device 0 info: Launching kernel __omp_offloading_{{.*}} with [2,4,6] blocks and [32,4,2] threads in SPMD mode
+// CHECK: "PluginInterface" device 0 info: Launching kernel
+// __omp_offloading_{{.*}} with [2,4,6] blocks and [32,4,2] threads in SPMD mode
int main(int argc, char *argv[]) {
int bs[3] = {32u, 4u, 2u};
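As a side note on usage: the new ompx_bare_multi_dim.cpp test launches a bare kernel over a [2,4,6] grid of [32,4,2]-thread blocks. A minimal self-contained sketch of that kind of launch (assuming the ompx_* helpers from <ompx.h> and the multi-dim num_teams/thread_limit extension this PR wires up; this is not the verbatim test) could look like:

```cpp
// Hypothetical usage sketch: launch a bare kernel on a [2,4,6] grid of
// [32,4,2]-thread blocks; each block's first thread marks its linearized
// block id. The ompx_* helper spellings are assumptions based on <ompx.h>.
#include <ompx.h>

#include <cstdio>

int main() {
  const int GridX = 2, GridY = 4, GridZ = 6;
  int seen[2 * 4 * 6] = {0};
#pragma omp target teams ompx_bare num_teams(2, 4, 6) thread_limit(32, 4, 2) map(tofrom : seen)
  {
    // Linearize the 3-D block id: z-major over the [GridX, GridY] plane.
    int Bid = ompx_block_id_z() * GridX * GridY + ompx_block_id_y() * GridX +
              ompx_block_id_x();
    if (ompx_thread_id_x() == 0 && ompx_thread_id_y() == 0 &&
        ompx_thread_id_z() == 0)
      seen[Bid] = 1;
  }
  for (int I = 0; I < GridX * GridY * GridZ; ++I)
    if (!seen[I])
      printf("missing block %d\n", I);
  return 0;
}
```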
Force-pushed 6e81e16 to 008c218.
Out of curiosity: Have you looked at how this applies downstream?

No, I didn't.
Force-pushed 008c218 to ab2eb63.
@llvm/pr-subscribers-offload @llvm/pr-subscribers-backend-amdgpu

Author: Shilei Tian (shiltian)

Changes

Patch is 23.81 KiB, truncated to 20.00 KiB below; full version: https://github.com/llvm/llvm-project/pull/118042.diff

8 Files Affected:
diff --git a/offload/plugins-nextgen/amdgpu/src/rtl.cpp b/offload/plugins-nextgen/amdgpu/src/rtl.cpp
index 6356fa0554a9c1..2920435cd8b5c8 100644
--- a/offload/plugins-nextgen/amdgpu/src/rtl.cpp
+++ b/offload/plugins-nextgen/amdgpu/src/rtl.cpp
@@ -559,15 +559,15 @@ struct AMDGPUKernelTy : public GenericKernelTy {
}
/// Launch the AMDGPU kernel function.
- Error launchImpl(GenericDeviceTy &GenericDevice, uint32_t NumThreads,
- uint64_t NumBlocks, KernelArgsTy &KernelArgs,
+ Error launchImpl(GenericDeviceTy &GenericDevice, uint32_t NumThreads[3],
+ uint32_t NumBlocks[3], KernelArgsTy &KernelArgs,
KernelLaunchParamsTy LaunchParams,
AsyncInfoWrapperTy &AsyncInfoWrapper) const override;
/// Print more elaborate kernel launch info for AMDGPU
Error printLaunchInfoDetails(GenericDeviceTy &GenericDevice,
- KernelArgsTy &KernelArgs, uint32_t NumThreads,
- uint64_t NumBlocks) const override;
+ KernelArgsTy &KernelArgs, uint32_t NumThreads[3],
+ uint32_t NumBlocks[3]) const override;
/// Get group and private segment kernel size.
uint32_t getGroupSize() const { return GroupSize; }
@@ -719,7 +719,7 @@ struct AMDGPUQueueTy {
/// Push a kernel launch to the queue. The kernel launch requires an output
/// signal and can define an optional input signal (nullptr if none).
Error pushKernelLaunch(const AMDGPUKernelTy &Kernel, void *KernelArgs,
- uint32_t NumThreads, uint64_t NumBlocks,
+ uint32_t NumThreads[3], uint32_t NumBlocks[3],
uint32_t GroupSize, uint64_t StackSize,
AMDGPUSignalTy *OutputSignal,
AMDGPUSignalTy *InputSignal) {
@@ -746,14 +746,18 @@ struct AMDGPUQueueTy {
assert(Packet && "Invalid packet");
// The first 32 bits of the packet are written after the other fields
- uint16_t Setup = UINT16_C(1) << HSA_KERNEL_DISPATCH_PACKET_SETUP_DIMENSIONS;
- Packet->workgroup_size_x = NumThreads;
- Packet->workgroup_size_y = 1;
- Packet->workgroup_size_z = 1;
+ uint16_t Dims = NumBlocks[2] * NumThreads[2] > 1
+ ? 3
+ : 1 + (NumBlocks[1] * NumThreads[1] != 1);
+ uint16_t Setup = UINT16_C(Dims)
+ << HSA_KERNEL_DISPATCH_PACKET_SETUP_DIMENSIONS;
+ Packet->workgroup_size_x = NumThreads[0];
+ Packet->workgroup_size_y = NumThreads[1];
+ Packet->workgroup_size_z = NumThreads[2];
Packet->reserved0 = 0;
- Packet->grid_size_x = NumBlocks * NumThreads;
- Packet->grid_size_y = 1;
- Packet->grid_size_z = 1;
+ Packet->grid_size_x = NumBlocks[0] * NumThreads[0];
+ Packet->grid_size_y = NumBlocks[1] * NumThreads[1];
+ Packet->grid_size_z = NumBlocks[2] * NumThreads[2];
Packet->private_segment_size =
Kernel.usesDynamicStack() ? StackSize : Kernel.getPrivateSize();
Packet->group_segment_size = GroupSize;
@@ -1240,7 +1244,7 @@ struct AMDGPUStreamTy {
/// the kernel finalizes. Once the kernel is finished, the stream will release
/// the kernel args buffer to the specified memory manager.
Error pushKernelLaunch(const AMDGPUKernelTy &Kernel, void *KernelArgs,
- uint32_t NumThreads, uint64_t NumBlocks,
+ uint32_t NumThreads[3], uint32_t NumBlocks[3],
uint32_t GroupSize, uint64_t StackSize,
AMDGPUMemoryManagerTy &MemoryManager) {
if (Queue == nullptr)
@@ -2829,10 +2833,10 @@ struct AMDGPUDeviceTy : public GenericDeviceTy, AMDGenericDeviceTy {
AsyncInfoWrapperTy AsyncInfoWrapper(*this, nullptr);
KernelArgsTy KernelArgs = {};
- if (auto Err =
- AMDGPUKernel.launchImpl(*this, /*NumThread=*/1u,
- /*NumBlocks=*/1ul, KernelArgs,
- KernelLaunchParamsTy{}, AsyncInfoWrapper))
+ uint32_t NumBlocksAndThreads[3] = {1u, 1u, 1u};
+ if (auto Err = AMDGPUKernel.launchImpl(
+ *this, NumBlocksAndThreads, NumBlocksAndThreads, KernelArgs,
+ KernelLaunchParamsTy{}, AsyncInfoWrapper))
return Err;
Error Err = Plugin::success();
@@ -3330,7 +3334,7 @@ struct AMDGPUPluginTy final : public GenericPluginTy {
};
Error AMDGPUKernelTy::launchImpl(GenericDeviceTy &GenericDevice,
- uint32_t NumThreads, uint64_t NumBlocks,
+ uint32_t NumThreads[3], uint32_t NumBlocks[3],
KernelArgsTy &KernelArgs,
KernelLaunchParamsTy LaunchParams,
AsyncInfoWrapperTy &AsyncInfoWrapper) const {
@@ -3387,13 +3391,15 @@ Error AMDGPUKernelTy::launchImpl(GenericDeviceTy &GenericDevice,
// Only COV5 implicitargs needs to be set. COV4 implicitargs are not used.
if (ImplArgs &&
getImplicitArgsSize() == sizeof(hsa_utils::AMDGPUImplicitArgsTy)) {
- ImplArgs->BlockCountX = NumBlocks;
- ImplArgs->BlockCountY = 1;
- ImplArgs->BlockCountZ = 1;
- ImplArgs->GroupSizeX = NumThreads;
- ImplArgs->GroupSizeY = 1;
- ImplArgs->GroupSizeZ = 1;
- ImplArgs->GridDims = 1;
+ ImplArgs->BlockCountX = NumBlocks[0];
+ ImplArgs->BlockCountY = NumBlocks[1];
+ ImplArgs->BlockCountZ = NumBlocks[2];
+ ImplArgs->GroupSizeX = NumThreads[0];
+ ImplArgs->GroupSizeY = NumThreads[1];
+ ImplArgs->GroupSizeZ = NumThreads[2];
+ ImplArgs->GridDims = NumBlocks[2] * NumThreads[2] > 1
+ ? 3
+ : 1 + (NumBlocks[1] * NumThreads[1] != 1);
ImplArgs->DynamicLdsSize = KernelArgs.DynCGroupMem;
}
@@ -3404,8 +3410,8 @@ Error AMDGPUKernelTy::launchImpl(GenericDeviceTy &GenericDevice,
Error AMDGPUKernelTy::printLaunchInfoDetails(GenericDeviceTy &GenericDevice,
KernelArgsTy &KernelArgs,
- uint32_t NumThreads,
- uint64_t NumBlocks) const {
+ uint32_t NumThreads[3],
+ uint32_t NumBlocks[3]) const {
// Only do all this when the output is requested
if (!(getInfoLevel() & OMP_INFOTYPE_PLUGIN_KERNEL))
return Plugin::success();
@@ -3442,12 +3448,13 @@ Error AMDGPUKernelTy::printLaunchInfoDetails(GenericDeviceTy &GenericDevice,
// S/VGPR Spill Count: how many S/VGPRs are spilled by the kernel
// Tripcount: loop tripcount for the kernel
INFO(OMP_INFOTYPE_PLUGIN_KERNEL, GenericDevice.getDeviceId(),
- "#Args: %d Teams x Thrds: %4lux%4u (MaxFlatWorkGroupSize: %u) LDS "
+ "#Args: %d Teams x Thrds: %4ux%4u (MaxFlatWorkGroupSize: %u) LDS "
"Usage: %uB #SGPRs/VGPRs: %u/%u #SGPR/VGPR Spills: %u/%u Tripcount: "
"%lu\n",
- ArgNum, NumGroups, ThreadsPerGroup, MaxFlatWorkgroupSize,
- GroupSegmentSize, SGPRCount, VGPRCount, SGPRSpillCount, VGPRSpillCount,
- LoopTripCount);
+ ArgNum, NumGroups[0] * NumGroups[1] * NumGroups[2],
+ ThreadsPerGroup[0] * ThreadsPerGroup[1] * ThreadsPerGroup[2],
+ MaxFlatWorkgroupSize, GroupSegmentSize, SGPRCount, VGPRCount,
+ SGPRSpillCount, VGPRSpillCount, LoopTripCount);
return Plugin::success();
}
diff --git a/offload/plugins-nextgen/common/include/PluginInterface.h b/offload/plugins-nextgen/common/include/PluginInterface.h
index 41cc0f286a581f..be3467c3f7098f 100644
--- a/offload/plugins-nextgen/common/include/PluginInterface.h
+++ b/offload/plugins-nextgen/common/include/PluginInterface.h
@@ -265,8 +265,9 @@ struct GenericKernelTy {
Error launch(GenericDeviceTy &GenericDevice, void **ArgPtrs,
ptrdiff_t *ArgOffsets, KernelArgsTy &KernelArgs,
AsyncInfoWrapperTy &AsyncInfoWrapper) const;
- virtual Error launchImpl(GenericDeviceTy &GenericDevice, uint32_t NumThreads,
- uint64_t NumBlocks, KernelArgsTy &KernelArgs,
+ virtual Error launchImpl(GenericDeviceTy &GenericDevice,
+ uint32_t NumThreads[3], uint32_t NumBlocks[3],
+ KernelArgsTy &KernelArgs,
KernelLaunchParamsTy LaunchParams,
AsyncInfoWrapperTy &AsyncInfoWrapper) const = 0;
@@ -316,15 +317,15 @@ struct GenericKernelTy {
/// Prints generic kernel launch information.
Error printLaunchInfo(GenericDeviceTy &GenericDevice,
- KernelArgsTy &KernelArgs, uint32_t NumThreads,
- uint64_t NumBlocks) const;
+ KernelArgsTy &KernelArgs, uint32_t NumThreads[3],
+ uint32_t NumBlocks[3]) const;
/// Prints plugin-specific kernel launch information after generic kernel
/// launch information
virtual Error printLaunchInfoDetails(GenericDeviceTy &GenericDevice,
KernelArgsTy &KernelArgs,
- uint32_t NumThreads,
- uint64_t NumBlocks) const;
+ uint32_t NumThreads[3],
+ uint32_t NumBlocks[3]) const;
private:
/// Prepare the arguments before launching the kernel.
@@ -337,15 +338,15 @@ struct GenericKernelTy {
/// Get the number of threads and blocks for the kernel based on the
/// user-defined threads and block clauses.
- uint32_t getNumThreads(GenericDeviceTy &GenericDevice,
- uint32_t ThreadLimitClause[3]) const;
+ void getNumThreads(GenericDeviceTy &GenericDevice,
+ uint32_t ThreadLimitClause[3]) const;
/// The number of threads \p NumThreads can be adjusted by this method.
/// \p IsNumThreadsFromUser is true is \p NumThreads is defined by user via
/// thread_limit clause.
- uint64_t getNumBlocks(GenericDeviceTy &GenericDevice,
- uint32_t BlockLimitClause[3], uint64_t LoopTripCount,
- uint32_t &NumThreads, bool IsNumThreadsFromUser) const;
+ void getNumBlocks(GenericDeviceTy &GenericDevice,
+ uint32_t BlockLimitClause[3], uint64_t LoopTripCount,
+ uint32_t &NumThreads, bool IsNumThreadsFromUser) const;
/// Indicate if the kernel works in Generic SPMD, Generic or SPMD mode.
bool isGenericSPMDMode() const {
diff --git a/offload/plugins-nextgen/common/src/PluginInterface.cpp b/offload/plugins-nextgen/common/src/PluginInterface.cpp
index 25b815b7f96694..1abc521f47ef33 100644
--- a/offload/plugins-nextgen/common/src/PluginInterface.cpp
+++ b/offload/plugins-nextgen/common/src/PluginInterface.cpp
@@ -526,20 +526,21 @@ GenericKernelTy::getKernelLaunchEnvironment(
Error GenericKernelTy::printLaunchInfo(GenericDeviceTy &GenericDevice,
KernelArgsTy &KernelArgs,
- uint32_t NumThreads,
- uint64_t NumBlocks) const {
+ uint32_t NumThreads[3],
+ uint32_t NumBlocks[3]) const {
INFO(OMP_INFOTYPE_PLUGIN_KERNEL, GenericDevice.getDeviceId(),
- "Launching kernel %s with %" PRIu64
- " blocks and %d threads in %s mode\n",
- getName(), NumBlocks, NumThreads, getExecutionModeName());
+ "Launching kernel %s with [%u,%u,%u] blocks and [%u,%u,%u] threads in "
+ "%s mode\n",
+ getName(), NumBlocks[0], NumBlocks[1], NumBlocks[2], NumThreads[0],
+ NumThreads[1], NumThreads[2], getExecutionModeName());
return printLaunchInfoDetails(GenericDevice, KernelArgs, NumThreads,
NumBlocks);
}
Error GenericKernelTy::printLaunchInfoDetails(GenericDeviceTy &GenericDevice,
KernelArgsTy &KernelArgs,
- uint32_t NumThreads,
- uint64_t NumBlocks) const {
+ uint32_t NumThreads[3],
+ uint32_t NumBlocks[3]) const {
return Plugin::success();
}
@@ -566,10 +567,14 @@ Error GenericKernelTy::launch(GenericDeviceTy &GenericDevice, void **ArgPtrs,
Args, Ptrs, *KernelLaunchEnvOrErr);
}
- uint32_t NumThreads = getNumThreads(GenericDevice, KernelArgs.ThreadLimit);
- uint64_t NumBlocks =
- getNumBlocks(GenericDevice, KernelArgs.NumTeams, KernelArgs.Tripcount,
- NumThreads, KernelArgs.ThreadLimit[0] > 0);
+ uint32_t NumThreads[3] = {KernelArgs.ThreadLimit[0],
+ KernelArgs.ThreadLimit[1],
+ KernelArgs.ThreadLimit[2]};
+ uint32_t NumBlocks[3] = {KernelArgs.NumTeams[0], KernelArgs.NumTeams[1],
+ KernelArgs.NumTeams[2]};
+ getNumThreads(GenericDevice, NumThreads);
+ getNumBlocks(GenericDevice, NumBlocks, KernelArgs.Tripcount, NumThreads[0],
+ NumThreads[0] > 0);
// Record the kernel description after we modified the argument count and num
// blocks/threads.
@@ -578,7 +583,8 @@ Error GenericKernelTy::launch(GenericDeviceTy &GenericDevice, void **ArgPtrs,
RecordReplay.saveImage(getName(), getImage());
RecordReplay.saveKernelInput(getName(), getImage());
RecordReplay.saveKernelDescr(getName(), LaunchParams, KernelArgs.NumArgs,
- NumBlocks, NumThreads, KernelArgs.Tripcount);
+ NumBlocks[0], NumThreads[0],
+ KernelArgs.Tripcount);
}
if (auto Err =
@@ -616,38 +622,39 @@ KernelLaunchParamsTy GenericKernelTy::prepareArgs(
return KernelLaunchParamsTy{sizeof(void *) * NumArgs, &Args[0], &Ptrs[0]};
}
-uint32_t GenericKernelTy::getNumThreads(GenericDeviceTy &GenericDevice,
- uint32_t ThreadLimitClause[3]) const {
+void GenericKernelTy::getNumThreads(GenericDeviceTy &GenericDevice,
+ uint32_t ThreadLimitClause[3]) const {
+ if (IsBareKernel)
+ return;
+
assert(ThreadLimitClause[1] == 0 && ThreadLimitClause[2] == 0 &&
"Multi dimensional launch not supported yet.");
- if (IsBareKernel && ThreadLimitClause[0] > 0)
- return ThreadLimitClause[0];
-
if (ThreadLimitClause[0] > 0 && isGenericMode())
ThreadLimitClause[0] += GenericDevice.getWarpSize();
- return std::min(MaxNumThreads, (ThreadLimitClause[0] > 0)
- ? ThreadLimitClause[0]
- : PreferredNumThreads);
+ ThreadLimitClause[0] =
+ std::min(MaxNumThreads, (ThreadLimitClause[0] > 0) ? ThreadLimitClause[0]
+ : PreferredNumThreads);
}
-uint64_t GenericKernelTy::getNumBlocks(GenericDeviceTy &GenericDevice,
- uint32_t NumTeamsClause[3],
- uint64_t LoopTripCount,
- uint32_t &NumThreads,
- bool IsNumThreadsFromUser) const {
+void GenericKernelTy::getNumBlocks(GenericDeviceTy &GenericDevice,
+ uint32_t NumTeamsClause[3],
+ uint64_t LoopTripCount, uint32_t &NumThreads,
+ bool IsNumThreadsFromUser) const {
+ if (IsBareKernel)
+ return;
+
assert(NumTeamsClause[1] == 0 && NumTeamsClause[2] == 0 &&
"Multi dimensional launch not supported yet.");
- if (IsBareKernel && NumTeamsClause[0] > 0)
- return NumTeamsClause[0];
-
if (NumTeamsClause[0] > 0) {
// TODO: We need to honor any value and consequently allow more than the
// block limit. For this we might need to start multiple kernels or let the
// blocks start again until the requested number has been started.
- return std::min(NumTeamsClause[0], GenericDevice.getBlockLimit());
+ NumTeamsClause[0] =
+ std::min(NumTeamsClause[0], GenericDevice.getBlockLimit());
+ return;
}
uint64_t DefaultNumBlocks = GenericDevice.getDefaultNumBlocks();
@@ -719,7 +726,8 @@ uint64_t GenericKernelTy::getNumBlocks(GenericDeviceTy &GenericDevice,
// If the loops are long running we rather reuse blocks than spawn too many.
if (GenericDevice.getReuseBlocksForHighTripCount())
PreferredNumBlocks = std::min(TripCountNumBlocks, DefaultNumBlocks);
- return std::min(PreferredNumBlocks, GenericDevice.getBlockLimit());
+ NumTeamsClause[0] =
+ std::min(PreferredNumBlocks, GenericDevice.getBlockLimit());
}
GenericDeviceTy::GenericDeviceTy(GenericPluginTy &Plugin, int32_t DeviceId,
diff --git a/offload/plugins-nextgen/cuda/src/rtl.cpp b/offload/plugins-nextgen/cuda/src/rtl.cpp
index 015c7775ba3513..1e3d79c9554c67 100644
--- a/offload/plugins-nextgen/cuda/src/rtl.cpp
+++ b/offload/plugins-nextgen/cuda/src/rtl.cpp
@@ -149,8 +149,8 @@ struct CUDAKernelTy : public GenericKernelTy {
}
/// Launch the CUDA kernel function.
- Error launchImpl(GenericDeviceTy &GenericDevice, uint32_t NumThreads,
- uint64_t NumBlocks, KernelArgsTy &KernelArgs,
+ Error launchImpl(GenericDeviceTy &GenericDevice, uint32_t NumThreads[3],
+ uint32_t NumBlocks[3], KernelArgsTy &KernelArgs,
KernelLaunchParamsTy LaunchParams,
AsyncInfoWrapperTy &AsyncInfoWrapper) const override;
@@ -1230,10 +1230,10 @@ struct CUDADeviceTy : public GenericDeviceTy {
AsyncInfoWrapperTy AsyncInfoWrapper(*this, nullptr);
KernelArgsTy KernelArgs = {};
- if (auto Err =
- CUDAKernel.launchImpl(*this, /*NumThread=*/1u,
- /*NumBlocks=*/1ul, KernelArgs,
- KernelLaunchParamsTy{}, AsyncInfoWrapper))
+ uint32_t NumBlocksAndThreads[3] = {1u, 1u, 1u};
+ if (auto Err = CUDAKernel.launchImpl(
+ *this, NumBlocksAndThreads, NumBlocksAndThreads, KernelArgs,
+ KernelLaunchParamsTy{}, AsyncInfoWrapper))
return Err;
Error Err = Plugin::success();
@@ -1276,7 +1276,7 @@ struct CUDADeviceTy : public GenericDeviceTy {
};
Error CUDAKernelTy::launchImpl(GenericDeviceTy &GenericDevice,
- uint32_t NumThreads, uint64_t NumBlocks,
+ uint32_t NumThreads[3], uint32_t NumBlocks[3],
KernelArgsTy &KernelArgs,
KernelLaunchParamsTy LaunchParams,
AsyncInfoWrapperTy &AsyncInfoWrapper) const {
@@ -1294,9 +1294,8 @@ Error CUDAKernelTy::launchImpl(GenericDeviceTy &GenericDevice,
reinterpret_cast<void *>(&LaunchParams.Size),
CU_LAUNCH_PARAM_END};
- CUresult Res = cuLaunchKernel(Func, NumBlocks, /*gridDimY=*/1,
- /*gridDimZ=*/1, NumThreads,
- /*blockDimY=*/1, /*blockDimZ=*/1,
+ CUresult Res = cuLaunchKernel(Func, NumBlocks[0], NumBlocks[1], NumBlocks[2],
+ NumThreads[0], NumThreads[1], NumThreads[2],
MaxDynCGroupMem, Stream, nullptr, Config);
return Plugin::check(Res, "Error in cuLaunchKernel for '%s': %s", getName());
}
diff --git a/offload/plugins-nextgen/host/src/rtl.cpp b/offload/plugins-nextgen/host/src/rtl.cpp
index 6f2e3d8604ec82..915c41e88c5828 100644
--- a/offload/plugins-nextgen/host/src/rtl.cpp
+++ b/offload/plugins-nextgen/host/src/rtl.cpp
@@ -89,8 +89,8 @@ struct GenELF64KernelTy : public GenericKernelTy {
}
/// Launch the kernel using the libffi.
- Error launchImpl(GenericDeviceTy &GenericDevice, uint32_t NumThreads,
- uint64_t NumBlocks, KernelArgsTy &KernelArgs,
+ Error launchImpl(GenericDeviceTy &GenericDevice, uint32_t NumThreads[3],
+ uint32_t NumBlocks[3], KernelArgsTy &KernelArgs,
KernelLaunchParamsTy LaunchParams,
AsyncInfoWrapperTy &AsyncInfoWrapper)...
[truncated]
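One detail worth pulling out of the diff: both the HSA dispatch packet setup and the COV5 implicit arguments compute the dispatch dimensionality with the same expression. Extracted into a standalone helper (a sketch for readability, not code from the patch), the rule reads:

```cpp
#include <cstdint>

// Dimensionality rule used twice in the patch above: report 3 dims if the z
// extent is non-trivial (even when y is trivial, since dimensions are
// ordered), else 2 if the y extent is non-trivial, else 1.
static uint16_t getGridDims(const uint32_t NumBlocks[3],
                            const uint32_t NumThreads[3]) {
  if (NumBlocks[2] * NumThreads[2] > 1)
    return 3;
  return 1 + (NumBlocks[1] * NumThreads[1] != 1);
}
```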
Force-pushed cca25b5 to f6b6ea5.
// FIXME: This is a WA to "calibrate" the bad work done in the front end.
// Delete this ugly code after the front end emits proper values.
auto CorrectMultiDim = [](uint32_t (&Val)[3]) {
Just directly fix this? Where is this coming from?
This is from the front end. In regular OpenMP, 0 means the user didn't specify a value, so the runtime needs to choose one. The multi-dim kernel launch is purely an extension and has not been standardized yet. I'll do a follow-up patch that makes the front end emit [0,1,1] for regular OpenMP if nothing is specified and [x,y,z] for OMPX.
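The lambda body is not quoted above; given the semantics described here, a plausible sketch of the workaround (hypothetical, not the committed code) is:

```cpp
// Hypothetical sketch of the "calibration" workaround: the front end emits 0
// for unspecified dimensions, but only dimension 0 may legitimately stay 0
// (meaning "let the runtime choose"); trailing dimensions default to 1.
auto CorrectMultiDim = [](uint32_t (&Val)[3]) {
  if (Val[1] == 0)
    Val[1] = 1;
  if (Val[2] == 0)
    Val[2] = 1;
};
CorrectMultiDim(NumThreads); // NumThreads/NumBlocks as set up in launch()
CorrectMultiDim(NumBlocks);
```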
I ran this through our buildbot config and the new tests do not seem to pass. The 0-th dimension appears to be the culprit. Can you double-check that it works?
Hmm, something is wrong in the AMD part. I tested it on a local NVIDIA GPU and it works. Will take a look.
@jplehr I tested it but couldn't reproduce the failure locally.
Force-pushed f6b6ea5 to 47a3a30.
LG, fix the test. Consider not printing dims 2 and 3 if we run in OpenMP mode, but OK for now.
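A sketch of what that suggestion could look like against printLaunchInfo (hypothetical; the PR keeps the unconditional three-component format for now):

```cpp
// Hypothetical: fall back to the old single-dimension message when the
// launch is effectively 1-D, and print the [x,y,z] form only otherwise.
if (NumBlocks[1] * NumBlocks[2] == 1 && NumThreads[1] * NumThreads[2] == 1)
  INFO(OMP_INFOTYPE_PLUGIN_KERNEL, GenericDevice.getDeviceId(),
       "Launching kernel %s with %u blocks and %u threads in %s mode\n",
       getName(), NumBlocks[0], NumThreads[0], getExecutionModeName());
else
  INFO(OMP_INFOTYPE_PLUGIN_KERNEL, GenericDevice.getDeviceId(),
       "Launching kernel %s with [%u,%u,%u] blocks and [%u,%u,%u] threads "
       "in %s mode\n",
       getName(), NumBlocks[0], NumBlocks[1], NumBlocks[2], NumThreads[0],
       NumThreads[1], NumThreads[2], getExecutionModeName());
```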
Force-pushed 47a3a30 to 5908b81.
@jplehr Is that the only failure you observed? Could you help check the latest change?
Force-pushed 5908b81 to cd4c7f5.
Let me give this another shot. I'll let you know!
I see these two failures on a local test run. The error message is:
Force-pushed cd4c7f5 to 01139c5.
I ended up with forcing
Ran this through local testing and saw the issues resolved.
In hindsight, I wish you had

This pain is very much self-inflicted, I'm afraid.
                         KernelArgs.ThreadLimit[2]};
uint32_t NumBlocks[3] = {KernelArgs.NumTeams[0], KernelArgs.NumTeams[1],
                         KernelArgs.NumTeams[2]};
if (!IsBareKernel) {
Before this patch, IsBareKernel was not checked. Is this change required for the multi-dim support, or is it an unrelated change? Same question for the other new uses of IsBareKernel.
Before this patch the check was in the two getter functions.