Skip to content

Host task lacks proper synchronization capabilities when used for interoperability #11284

Open
@densamoilov

Description

@densamoilov

From SYCL specification: 4.10. Host tasks:

A host task can optionally be used to interoperate with the native backend objects associated with the queue executing the host task, the context that the queue is associated with, the device that the queue is associated with and the accessors that have been captured in the callable, via an optional interop_handle parameter.

This allows host tasks to be used for two purposes: either as a task which can perform arbitrary C++ code within the scheduling of the SYCL runtime or as a task which can perform interoperability at a point within the scheduling of the SYCL runtime.

So one of the main purposes of the host task is to provide users with the capabilities to embed a 3rd party library call into the SYCL programming model. The problem is that the embedding doesn't work properly.

Example of the problem:

    // CUDA backend.
    auto e = q.submit([&](sycl::handler &cgh) {
        cgh.host_task([=](const sycl::interop_handle &ih) {
            cudnnAddTensor(...);
        });
    });
    e.wait(); // the host task is guarantied to be completed but `cudnnAddTensor` is not.

The out of order queue for CUDA and HIP backends is implemented via using multiple CUDA/HIP streams inside a single SYCL queue. In the very beginning the SYCL queue always contained a single CUDA/HIP stream therefore the lack of synchronization was not a big problem at that point as the operations were always executed in order.

After implementing the out of order via multiple streams the problem becomes severe and requires a proper solution otherwise the host task becomes nearly unusable for one of the main purposes - interoperability. On top of that, the profiling capabilities do not work as well.
As an option, for CUDA (the same can be used for HIP as well), the problem could be solved via using cudaEvent_t. While submitting the host task SYCL RT could use cudaEventRecord to capture the content of the stream after executing the host task so that when the submitted operation within the host task is executed the cudaEvent_t can notify us about it. The cudaEvent_t can be wrapped in SYCL event and returned to the users. I understand that it's probably hard to automatically identify when SYCL RT should use cudaEventRecord so as an option SYCL can provide some API so that users could let the SYCL RT know about it.

Metadata

Metadata

Assignees

No one assigned

    Labels

    enhancementNew feature or request

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions