Description
Describe the bug
A SYCL context can be constructed with either a single device or multiple devices, as long as all of those devices belong to the same platform. However, the CUDA backend currently doesn't support the multi-device option.
This is due to a limitation in the implementation of the context in the PI plugin for CUDA. It derives from two things: the CUDA programming model, in which a CUDA context can only be associated with a single CUDA device, and a decision in the initial implementation of the CUDA backend to map a SYCL context 1:1 to a CUDA context.
This prevents the multi-device context use case that other DPC++ backends support, which could lead to users having to configure contexts differently depending on the backend.
To Reproduce
You can reproduce this by constructing a context from multiple devices of the same platform when targeting the CUDA backend, for example:
```cpp
#include <CL/sycl.hpp>

int main() {
  auto platforms = cl::sycl::platform::get_platforms();
  if (platforms.size() > 0) {
    auto devices = platforms[0].get_devices();
    if (devices.size() > 1) {
      // This constructor, if reached, will fail on the CUDA backend.
      auto multiDeviceContext = cl::sycl::context(devices);
    }
  }
}
```
Proposed solution
Note that this idea is still a work in progress, but I wanted to share what I have so far to get some feedback on it.
The proposed solution here would be to alter the implementation of the PI CUDA context so that it contains multiple CUDA contexts, each corresponding to one CUDA device. This would allow the SYCL context to represent multiple devices, as expected.
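As a rough sketch of what that could look like, assuming a handful of hypothetical names (`_pi_context_sketch`, `get_cuda_context`) rather than the actual PI data structures:

```cpp
#include <cuda.h>

#include <cstddef>
#include <vector>

// Hypothetical sketch only: a PI CUDA context that owns one CUDA driver
// context per device in the SYCL context, rather than a single CUcontext.
struct _pi_context_sketch {
  std::vector<CUdevice> devices_;        // devices in the SYCL context
  std::vector<CUcontext> cuda_contexts_; // cuda_contexts_[i] belongs to devices_[i]

  // Any context-scoped operation would first resolve the CUDA context that
  // corresponds to the target device.
  CUcontext get_cuda_context(CUdevice device) const {
    for (std::size_t i = 0; i < devices_.size(); ++i)
      if (devices_[i] == device)
        return cuda_contexts_[i];
    return nullptr; // device is not part of this context
  }
};
```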
However, the caveat to this is that the PI CUDA context would now hold multiple CUDA contexts and devices, which means that at any point in the DPC++ SYCL runtime where a context-specific operation needs to be performed, it would be necessary to determine which CUDA context should be used, and that requires knowledge of the target device.
This means that certain parts of the DPC++ SYCL runtime may need to be altered to ensure that wherever a context is needed, the corresponding device is also accessible. I am still investigating what specific changes would be needed and whether they would cause any significant problems, but I have an initial high-level assessment of potential problem areas.
- Buffers - generally memory allocations and copy operations are derived from command groups, which have knowledge of the device; however, there may be cases where the device associated with a dependency is not available.
- USM - generally USM operations such as `malloc_*` are associated with a context and a device, either directly or via a queue; however, the `free` function only takes a context, so the device on which the memory was allocated may not be known (see the sketch after this list).
- Events - generally an event is associated with a context and doesn't have any knowledge of the device it came from.
- Kernel bundles - kernel bundles are also associated with a context, but they are additionally associated with a set of devices, so while this might require some changes, I suspect it shouldn't cause an issue.
- Interop - interop with the CUDA backend would need to change: the native object for a context would become a vector of CUDA contexts, and the implicit relationship between those contexts and the devices they are associated with would need to be documented.
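To illustrate the USM point above, here is a minimal sketch using only the standard USM API; nothing in it is CUDA-specific, it just shows that the deallocation call carries no device information:

```cpp
#include <CL/sycl.hpp>

int main() {
  // The allocation knows the device, because the queue is bound to one.
  cl::sycl::queue q;
  float *ptr = cl::sycl::malloc_device<float>(1024, q);

  // The deallocation only sees the context. With one CUDA context per
  // device inside the PI context, the plugin would have to work out which
  // underlying CUDA context owns this pointer by some other means (for
  // example by querying the pointer's attributes).
  cl::sycl::free(ptr, q.get_context());
}
```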
There may be other areas to consider, but this is what I have identified so far. Some of these problem areas may also require minor modifications to the SYCL specification; I suspect and hope that won't be necessary, though it's something to consider.
Another potential problem is that the changes described above may break an underlying assumption in the DPC++ SYCL runtime (as I understand it, please correct me if I'm wrong) that if a memory object is in the same context, no explicit memory movement is required. A possible solution to this is to introduce a PI plugin API for moving data between two devices in the same context. For most backends this could be a no-op; for OpenCL it could be an opportunity to use `clEnqueueMigrateMemObjects`, and for the CUDA backend it would perform peer-to-peer copies between the contexts (as implemented in #4332).
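To make that more concrete, a hypothetical PI-style migration entry point and its CUDA mapping might look roughly like the sketch below; the function name, its parameters, and the overall shape are assumptions for illustration, not an existing API:

```cpp
#include <cuda.h>

#include <cstddef>

// Hypothetical sketch: migrate data between two devices of the same PI
// context. For most backends this could be a no-op, and for OpenCL it could
// map onto clEnqueueMigrateMemObjects; for the CUDA backend it could map
// onto a peer-to-peer copy between the two per-device CUDA contexts.
CUresult migrate_between_cuda_contexts(CUdeviceptr dst, CUcontext dst_ctx,
                                       CUdeviceptr src, CUcontext src_ctx,
                                       std::size_t bytes, CUstream stream) {
  // cuMemcpyPeerAsync copies between allocations owned by different CUDA
  // contexts, enqueued on the given stream.
  return cuMemcpyPeerAsync(dst, dst_ctx, src, src_ctx, bytes, stream);
}
```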