Skip to content

[SYCL][CUDA][HIP] Device info query for maximum registers per block and targetted exception on out-of-registers for CUDA. #9106

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
Show file tree
Hide file tree
Changes from all commits
Commits
Show all changes
22 commits
Select commit Hold shift + click to select a range
8f867f4
[SYCL][CUDA][HIP] Add device info query for maximum registers per blo…
GeorgeWeb Apr 18, 2023
ca1ceed
[SYCL][CUDA] Improve error propagation and handling for out of launch…
GeorgeWeb Apr 18, 2023
33943d9
Apply missed git clang-format.
GeorgeWeb Apr 18, 2023
33adc69
Fix wrong value return type (size_t -> uint32_t) for the HIP query - …
GeorgeWeb Apr 19, 2023
93a4fa1
Update the out of registers test for CUDA to use more registers and e…
GeorgeWeb Apr 19, 2023
3e0a80d
Add the missing sum part of the vadd and sum kernel for the out of re…
GeorgeWeb Apr 20, 2023
f3964ef
Add new symbols to ABI dumps for Windows
GeorgeWeb Apr 20, 2023
7977029
Merge remote-tracking branch 'upstream/sycl' into georgi/sycl-cuda-ma…
GeorgeWeb Apr 21, 2023
54bef4f
Add extension documentation and Feature Macro to enable it.
GeorgeWeb Apr 28, 2023
38ad146
Merge upstream/sycl and fix conflicts
GeorgeWeb Apr 28, 2023
322599f
Use the DeviceImplPtr changes in device_info
GeorgeWeb Apr 28, 2023
512bb6f
Bump PI header minor version and add change notes
GeorgeWeb Apr 28, 2023
0885d18
Update sycl ABI symbols for Linux and Windows
GeorgeWeb May 1, 2023
aa749c4
Apply clang-format to PI_H
GeorgeWeb May 1, 2023
b797b97
Update sycl/source/detail/error_handling/error_handling.cpp
GeorgeWeb May 1, 2023
ec15e8f
Update the usage example in the extension documentation.
GeorgeWeb May 2, 2023
c499af7
Remove unnecessary section from the extension doc
GeorgeWeb May 5, 2023
66683fc
Remove leftover subsection from extension doc.
GeorgeWeb May 9, 2023
b41ece9
Merge remote-tracking branch 'upstream/sycl' into georgi/sycl-cuda-ma…
GeorgeWeb May 9, 2023
9b0dd50
Merge remote-tracking branch 'upstream/sycl' into georgi/sycl-cuda-ma…
GeorgeWeb May 11, 2023
aaef3c8
Update related e2e tests run commands.
GeorgeWeb May 12, 2023
8173dbd
Update sycl/doc/extensions/experimental/sycl_ext_codeplay_max_registe…
GeorgeWeb May 12, 2023
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
Original file line number Diff line number Diff line change
@@ -0,0 +1,49 @@
# sycl_ext_codeplay_max_registers_per_work_group_query

## Notice

This document describes an **experimental** API that applications can use to try
out a new feature. Future versions of this API may change in ways that are
incompatible with this experimental version.


## Introduction

This extension adds a new device information descriptor that provides the ability to query a device for the maximum number of registers available per work-group.

OpenCL never offered such query due to the nature of being a very platform specific one - which is why it is also absent from SYCL. Now that SYCL supports back-ends where the register usage is a limiting resource factor of the possible maximum work-group size for a kernel, having the ability to query that limit is important for writing safe and portable code.

## Feature test macro

As encouraged by the SYCL specification, a feature-test macro, `SYCL_EXT_CODEPLAY_MAX_REGISTERS_PER_WORK_GROUP_QUERY`, is provided to determine whether this extension is implemented.

## New device descriptor

| Device descriptor | Return type | Description |
| ------------------------------------------------------ | ----------- | ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| ext::codeplay::experimental::info::device::max_registers_per_work_group |  unsigned int | Returns the maximum number of registers available for use per work-group based on the capability of the device. |

### Note

## Examples

```c++
sycl::device gpu = sycl::device{sycl::gpu_selector_v};
std::cout << gpu.get_info<sycl::info::device::name>() << '\n';

#ifdef SYCL_EXT_CODEPLAY_MAX_REGISTERS_PER_WORK_GROUP_QUERY
unsigned int registers_per_group = gpu.get_info<sycl::ext::codeplay::experimental::info::device::max_registers_per_work_group>();
std::cout << "Max registers per work-group: " << registers_per_group << '\n';
#endif
```

Ouputs to the console:

Executed using the CUDA back-end on NVIDIA.

```
NVIDIA ...
Max registers per work-group: 65536
```

- See: [CUDA Toolkit Documentation](https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#compute-capabilities)
6 changes: 5 additions & 1 deletion sycl/include/sycl/detail/pi.h
Original file line number Diff line number Diff line change
Expand Up @@ -90,9 +90,11 @@
// native handles.
// 12.29 Support PI_EXT_PLATFORM_INFO_BACKEND query in piPlatformGetInfo
// 12.30 Added PI_EXT_INTEL_DEVICE_INFO_MEM_CHANNEL_SUPPORT device info query.
// 12.31 Added PI_EXT_CODEPLAY_DEVICE_INFO_MAX_REGISTERS_PER_WORK_GROUP device
// info query.

#define _PI_H_VERSION_MAJOR 12
#define _PI_H_VERSION_MINOR 30
#define _PI_H_VERSION_MINOR 31

#define _PI_STRING_HELPER(a) #a
#define _PI_CONCAT(a, b) _PI_STRING_HELPER(a.b)
Expand Down Expand Up @@ -350,6 +352,8 @@ typedef enum {
PI_EXT_DEVICE_INFO_ATOMIC_FENCE_ORDER_CAPABILITIES = 0x20006,
PI_EXT_DEVICE_INFO_ATOMIC_FENCE_SCOPE_CAPABILITIES = 0x20007,
PI_EXT_INTEL_DEVICE_INFO_MEM_CHANNEL_SUPPORT = 0x20008,
// The number of max registers per block (device specific)
PI_EXT_CODEPLAY_DEVICE_INFO_MAX_REGISTERS_PER_WORK_GROUP = 0x20009,
} _pi_device_info;

typedef enum {
Expand Down
3 changes: 3 additions & 0 deletions sycl/include/sycl/info/ext_codeplay_device_traits.def
Original file line number Diff line number Diff line change
Expand Up @@ -3,6 +3,9 @@
#define __SYCL_PARAM_TRAITS_TEMPLATE_SPEC __SYCL_PARAM_TRAITS_SPEC
#endif
__SYCL_PARAM_TRAITS_SPEC(ext::codeplay::experimental,device, supports_fusion, bool, PI_EXT_CODEPLAY_DEVICE_INFO_SUPPORTS_FUSION)
__SYCL_PARAM_TRAITS_SPEC(
ext::codeplay::experimental, device, max_registers_per_work_group, uint32_t,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Where is this extension documented?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've now added sycl/doc/extensions/experimental/sycl_ext_codeplay_max_registers_per_work_group_query.md. I have seen another extension for num_regs having just an inline documentation where it was added in the PI header, but thought I'd add a full doc for sycl_ext_codeplay_max_registers_per_work_group with a MACRO.
Does that sound okay to you? And if so, and I am missing something, please let me know. Thank you! :)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you! Yes, I am okay with that and I think I see the one you are referring to. That one might have slipped through the cracks, but the general rule is that public interfaces like that should either be specification or extension interfaces.

@AerialMantis - On a related note, are there any particular rhyme or reason to when we use codeplay in the extension features vs. intel and/or oneapi?

PI_EXT_CODEPLAY_DEVICE_INFO_MAX_REGISTERS_PER_WORK_GROUP)
#ifdef __SYCL_PARAM_TRAITS_TEMPLATE_SPEC_NEEDS_UNDEF
#undef __SYCL_PARAM_TRAITS_TEMPLATE_SPEC
#undef __SYCL_PARAM_TRAITS_TEMPLATE_SPEC_NEEDS_UNDEF
Expand Down
44 changes: 44 additions & 0 deletions sycl/plugins/cuda/pi_cuda.cpp
Original file line number Diff line number Diff line change
Expand Up @@ -425,6 +425,27 @@ bool getMaxRegistersJitOptionValue(const std::string &build_options,
return true;
}

// Helper to verify out-of-registers case (exceeded block max registers).
// If the kernel requires a number of registers for the entire thread
// block exceeds the hardware limitations, then the cuLaunchKernel call
// will fail to launch with CUDA_ERROR_LAUNCH_OUT_OF_RESOURCES error.
bool hasExceededMaxRegistersPerBlock(pi_device device, pi_kernel kernel,
size_t blockSize) {
assert(device);
assert(kernel);

int maxRegsPerBlock{0};
PI_CHECK_ERROR(cuDeviceGetAttribute(
&maxRegsPerBlock, CU_DEVICE_ATTRIBUTE_MAX_REGISTERS_PER_BLOCK,
device->get()));

int regsPerThread{0};
PI_CHECK_ERROR(cuFuncGetAttribute(&regsPerThread, CU_FUNC_ATTRIBUTE_NUM_REGS,
kernel->get()));

return blockSize * regsPerThread > size_t(maxRegsPerBlock);
};

} // anonymous namespace

/// ------ Error handling, matching OpenCL plugin semantics.
Expand Down Expand Up @@ -2111,6 +2132,21 @@ pi_result cuda_piDeviceGetInfo(pi_device device, pi_device_info param_name,
false);
}

case PI_EXT_CODEPLAY_DEVICE_INFO_MAX_REGISTERS_PER_WORK_GROUP: {
// Maximum number of 32-bit registers available to a thread block.
// Note: This number is shared by all thread blocks simultaneously resident
// on a multiprocessor.
int max_registers{-1};
PI_CHECK_ERROR(cuDeviceGetAttribute(
&max_registers, CU_DEVICE_ATTRIBUTE_MAX_REGISTERS_PER_BLOCK,
device->get()));

sycl::detail::pi::assertion(max_registers >= 0);

return getInfo(param_value_size, param_value, param_value_size_ret,
static_cast<uint32_t>(max_registers));
}

// TODO: Investigate if this information is available on CUDA.
case PI_DEVICE_INFO_PCI_ADDRESS:
case PI_DEVICE_INFO_GPU_EU_COUNT:
Expand Down Expand Up @@ -3218,10 +3254,18 @@ pi_result cuda_piEnqueueKernelLaunch(
return PI_SUCCESS;
};

size_t kernelLocalWorkGroupSize = 0;
for (size_t dim = 0; dim < work_dim; dim++) {
auto err = isValid(dim);
if (err != PI_SUCCESS)
return err;
// If no error then sum the total local work size per dim.
kernelLocalWorkGroupSize += local_work_size[dim];
}

if (hasExceededMaxRegistersPerBlock(command_queue->device_, kernel,
kernelLocalWorkGroupSize)) {
return PI_ERROR_INVALID_WORK_GROUP_SIZE;
}
} else {
guessLocalWorkSize(command_queue->device_, threadsPerBlock,
Expand Down
16 changes: 16 additions & 0 deletions sycl/plugins/hip/pi_hip.cpp
Original file line number Diff line number Diff line change
Expand Up @@ -1970,6 +1970,22 @@ pi_result hip_piDeviceGetInfo(pi_device device, pi_device_info param_name,
false);
}

case PI_EXT_CODEPLAY_DEVICE_INFO_MAX_REGISTERS_PER_WORK_GROUP: {
// Maximum number of 32-bit registers available to a thread block.
// Note: This number is shared by all thread blocks simultaneously resident
// on a multiprocessor.
int max_registers{-1};
sycl::detail::pi::assertion(
hipDeviceGetAttribute(&max_registers,
hipDeviceAttributeMaxRegistersPerBlock,
device->get()) == hipSuccess);

sycl::detail::pi::assertion(max_registers >= 0);

return getInfo(param_value_size, param_value, param_value_size_ret,
static_cast<uint32_t>(max_registers));
}

// TODO: Investigate if this information is available on HIP.
case PI_DEVICE_INFO_PCI_ADDRESS:
case PI_DEVICE_INFO_GPU_EU_COUNT:
Expand Down
24 changes: 24 additions & 0 deletions sycl/source/detail/device_info.hpp
Original file line number Diff line number Diff line change
Expand Up @@ -780,6 +780,22 @@ struct get_device_info_impl<
}
};

// Specialization for max registers per work-group
template <>
struct get_device_info_impl<
uint32_t,
ext::codeplay::experimental::info::device::max_registers_per_work_group> {
static uint32_t get(const DeviceImplPtr &Dev) {
uint32_t maxRegsPerWG;
Dev->getPlugin().call<PiApiKind::piDeviceGetInfo>(
Dev->getHandleRef(),
PiInfoCode<ext::codeplay::experimental::info::device::
max_registers_per_work_group>::value,
sizeof(maxRegsPerWG), &maxRegsPerWG, nullptr);
return maxRegsPerWG;
}
};

template <typename Param>
typename Param::return_type get_device_info(const DeviceImplPtr &Dev) {
static_assert(is_device_info_desc<Param>::value,
Expand Down Expand Up @@ -1660,6 +1676,14 @@ inline bool get_device_info_host<
return false;
}

template <>
inline uint32_t get_device_info_host<
ext::codeplay::experimental::info::device::max_registers_per_work_group>() {
throw runtime_error("Obtaining the maximum number of available registers per "
"work-group is not supported on HOST device",
PI_ERROR_INVALID_DEVICE);
}

} // namespace detail
} // __SYCL_INLINE_VER_NAMESPACE(_V1)
} // namespace sycl
43 changes: 43 additions & 0 deletions sycl/source/detail/error_handling/error_handling.cpp
Original file line number Diff line number Diff line change
Expand Up @@ -53,6 +53,7 @@ void handleInvalidWorkGroupSize(const device_impl &DeviceImpl, pi_kernel Kernel,
bool IsOpenCLV1x = false; // Backend is OpenCL 1.x
bool IsOpenCLVGE20 = false; // Backend is Greater or Equal to OpenCL 2.0
bool IsLevelZero = false; // Backend is any OneAPI Level 0 version
bool IsCuda = false; // Backend is CUDA
auto Backend = Platform.get_backend();
if (Backend == sycl::backend::opencl) {
std::string VersionString =
Expand All @@ -63,6 +64,8 @@ void handleInvalidWorkGroupSize(const device_impl &DeviceImpl, pi_kernel Kernel,
(VersionString.find("2.") == 0) || (VersionString.find("3.") == 0);
} else if (Backend == sycl::backend::ext_oneapi_level_zero) {
IsLevelZero = true;
} else if (Backend == sycl::backend::ext_oneapi_cuda) {
IsCuda = true;
}

size_t CompileWGSize[3] = {0};
Expand Down Expand Up @@ -237,6 +240,46 @@ void handleInvalidWorkGroupSize(const device_impl &DeviceImpl, pi_kernel Kernel,
// else unknown. fallback (below)
}
}
} else if (IsCuda) {
// CUDA:
// PI_ERROR_INVALID_WORK_GROUP_SIZE is returned when the kernel registers
// required for the launch config exceeds the maximum number of registers
// per block (PI_EXT_CODEPLAY_DEVICE_INFO_MAX_REGISTERS_PER_WORK_GROUP).
// This is if local_work_size[0] * ... * local_work_size[work_dim - 1]
// multiplied by PI_KERNEL_GROUP_INFO_NUM_REGS is greater than the value
// of PI_KERNEL_MAX_NUM_REGISTERS_PER_BLOCK. See Table 15: Technical
// Specifications per Compute Capability, for limitations.
const size_t TotalNumberOfWIs =
NDRDesc.LocalSize[0] * NDRDesc.LocalSize[1] * NDRDesc.LocalSize[2];

uint32_t NumRegisters = 0;
Plugin.call<PiApiKind::piKernelGetGroupInfo>(
Kernel, Device, PI_KERNEL_GROUP_INFO_NUM_REGS, sizeof(NumRegisters),
&NumRegisters, nullptr);

uint32_t MaxRegistersPerBlock =
DeviceImpl.get_info<ext::codeplay::experimental::info::device::
max_registers_per_work_group>();

const bool HasExceededAvailableRegisters =
TotalNumberOfWIs * NumRegisters > MaxRegistersPerBlock;

if (HasExceededAvailableRegisters) {
std::string message(
"Exceeded the number of registers available on the hardware.\n");
throw sycl::nd_range_error(
// Additional information which can be helpful to the user.
message.append(
"\tThe number registers per work-group cannot exceed " +
std::to_string(MaxRegistersPerBlock) +
" for this kernel on this device.\n"
"\tThe kernel uses " +
std::to_string(NumRegisters) +
" registers per work-item for a total of " +
std::to_string(TotalNumberOfWIs) +
" work-items per work-group.\n"),
PI_ERROR_INVALID_WORK_GROUP_SIZE);
}
} else {
// TODO: Decide what checks (if any) we need for the other backends
}
Expand Down
1 change: 1 addition & 0 deletions sycl/source/feature_test.hpp.in
Original file line number Diff line number Diff line change
Expand Up @@ -81,6 +81,7 @@ __SYCL_INLINE_VER_NAMESPACE(_V1) {
#define SYCL_EXT_CODEPLAY_KERNEL_FUSION 1
#endif
#define SYCL_EXT_INTEL_CACHE_CONFIG 1
#define SYCL_EXT_CODEPLAY_MAX_REGISTERS_PER_WORK_GROUP_QUERY 1

#ifndef __has_include
#define __has_include(x) 0
Expand Down
24 changes: 24 additions & 0 deletions sycl/test-e2e/Basic/max_registers_per_work_group_query.cpp
Original file line number Diff line number Diff line change
@@ -0,0 +1,24 @@
// REQUIRES: cuda || hip
// RUN: %{build} -o %t.out
// RUN: %{run} %t.out

#include <sycl/sycl.hpp>

int main() {
sycl::queue q;
sycl::device dev = q.get_device();

#if !defined(SYCL_EXT_CODEPLAY_MAX_REGISTERS_PER_WORK_GROUP_QUERY)
#error SYCL_EXT_CODEPLAY_MAX_REGISTERS_PER_WORK_GROUP_QUERY is not defined!
#endif

auto max_regs_per_wg =
dev.get_info<sycl::ext::codeplay::experimental::info::device::
max_registers_per_work_group>();
std::cout << "Max register per work-group: " << max_regs_per_wg << std::endl;

assert(max_regs_per_wg > 0);

std::cout << "Passed!" << std::endl;
return 0;
}
Loading