Host task lacks proper synchronization capabilities when used for interoperability

From SYCL specification: 4.10. Host tasks:
> A [host task](https://registry.khronos.org/SYCL/specs/sycl-2020/html/sycl-2020.html#host-task) can optionally be used to interoperate with the [native backend objects](https://registry.khronos.org/SYCL/specs/sycl-2020/html/sycl-2020.html#native-backend-object) associated with the [queue](https://registry.khronos.org/SYCL/specs/sycl-2020/html/sycl-2020.html#queue) executing the [host task](https://registry.khronos.org/SYCL/specs/sycl-2020/html/sycl-2020.html#host-task), the [context](https://registry.khronos.org/SYCL/specs/sycl-2020/html/sycl-2020.html#context) that the [queue](https://registry.khronos.org/SYCL/specs/sycl-2020/html/sycl-2020.html#queue) is associated with, the [device](https://registry.khronos.org/SYCL/specs/sycl-2020/html/sycl-2020.html#device) that the [queue](https://registry.khronos.org/SYCL/specs/sycl-2020/html/sycl-2020.html#queue) is associated with and the [accessors](https://registry.khronos.org/SYCL/specs/sycl-2020/html/sycl-2020.html#accessor) that have been captured in the callable, via an optional interop_handle parameter.

> This allows [host tasks](https://registry.khronos.org/SYCL/specs/sycl-2020/html/sycl-2020.html#host-task) to be used for two purposes: either as a task which can perform arbitrary C++ code within the scheduling of the [SYCL runtime](https://registry.khronos.org/SYCL/specs/sycl-2020/html/sycl-2020.html#sycl-runtime) **or as a task which can perform interoperability at a point within the scheduling of the** [**SYCL runtime**](https://registry.khronos.org/SYCL/specs/sycl-2020/html/sycl-2020.html#sycl-runtime).

So one of the main purposes of the host task is to provide users with the capabilities to embed a 3rd party library call into the SYCL programming model. The problem is that the embedding doesn't work properly.

Example of the problem:
```cpp
    // CUDA backend.
    auto e = q.submit([&](sycl::handler &cgh) {
        cgh.host_task([=](const sycl::interop_handle &ih) {
            cudnnAddTensor(...);
        });
    });
    e.wait(); // the host task is guarantied to be completed but `cudnnAddTensor` is not.
```
The out of order queue for CUDA and HIP backends is implemented via using multiple CUDA/HIP streams inside a single SYCL queue. In the very beginning the SYCL queue always contained a single CUDA/HIP stream therefore the lack of synchronization was not a big problem at that point as the operations were always executed in order.

After implementing the out of order via multiple streams the problem becomes severe and requires a proper solution otherwise the host task becomes nearly unusable for one of the main purposes - interoperability. On top of that, the profiling capabilities do not work as well.
As an option, for CUDA (the same can be used for HIP as well), the problem could be solved via using `cudaEvent_t`. While submitting the host task SYCL RT could use `cudaEventRecord` to capture the content of the stream after executing the host task so that when the submitted operation within the host task is executed the `cudaEvent_t` can notify us about it. The `cudaEvent_t` can be wrapped in SYCL event and returned to the users. I understand that it's probably hard to automatically identify when SYCL RT should use `cudaEventRecord` so as an option SYCL can provide some API so that users could let the SYCL RT know about it.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Host task lacks proper synchronization capabilities when used for interoperability #11284

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Host task lacks proper synchronization capabilities when used for interoperability #11284

Description

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions