
gpu: nvidia: Added support for native host task extension #2121

Open · wants to merge 1 commit into main
Conversation

ShanoToni (Contributor)

Description

Adding support for the ext_codeplay_enqueue_native_command extension (LINK to the documentation).

Apart from improved dependency handling for native commands, the native host task has lower execution overhead, improving performance.
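
For illustration, here is a minimal sketch of the two submission paths; the helper name, arguments, and feature-test macro are assumptions for the example, not code from this PR. With the extension, the callable only enqueues asynchronous work on the CUDA stream backing the queue and the SYCL runtime tracks its completion, whereas the plain host task runs on a host thread and has to block until the native work finishes.

    // Minimal sketch, not PR code: contrast a plain SYCL host task with the
    // ext_codeplay_enqueue_native_command path for submitting native CUDA work.
    #include <sycl/sycl.hpp>
    #include <cuda_runtime.h>

    void enqueue_native_memset(sycl::queue &q, void *dev_ptr, size_t bytes) {
        q.submit([&](sycl::handler &cgh) {
    #ifdef SYCL_EXT_CODEPLAY_ENQUEUE_NATIVE_COMMAND
            // Native command: only enqueue asynchronous work on the stream
            // backing this queue; the runtime handles the dependencies.
            cgh.ext_codeplay_enqueue_native_command([=](sycl::interop_handle ih) {
                cudaStream_t stream
                        = ih.get_native_queue<sycl::backend::ext_oneapi_cuda>();
                cudaMemsetAsync(dev_ptr, 0, bytes, stream);
            });
    #else
            // Plain host task: runs on a host thread, so the native work must
            // be synchronized before the task returns to keep ordering correct.
            cgh.host_task([=](sycl::interop_handle ih) {
                cudaStream_t stream
                        = ih.get_native_queue<sycl::backend::ext_oneapi_cuda>();
                cudaMemsetAsync(dev_ptr, 0, bytes, stream);
                cudaStreamSynchronize(stream);
            });
    #endif
        });
    }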

Example:

# old host task
    total perf: min(ms):38.256 avg(ms):97.7698
# native host task
    total perf: min(ms):33.4952 avg(ms):33.7441

Results were generated on an A100 with the command:
--eltwise --engine=gpu --mode=P --skip-impl=dpcpp:ref:any --batch=tests/benchdnn/inputs/eltwise/test_eltwise_gpu

Checklist

General

  • [x] Do all unit and benchdnn tests (make test and make test_benchdnn_*) pass locally for each commit?
  • [x] Have you formatted the code using clang-format?

Performance improvements

  • [x] Have you submitted performance data that demonstrates performance improvements?

@ShanoToni ShanoToni requested a review from a team as a code owner September 25, 2024 12:20
@github-actions github-actions bot added the platform:gpu-nvidia Codeowner: @oneapi-src/onednn-gpu-nvidia label Sep 25, 2024
src/gpu/nvidia/sycl_cuda_compat.hpp: review threads (resolved/outdated)
@mgouicem (Contributor)

make test
disable device_cpu
enable device_gpu
enable thr_cuda
enable arch_rtx

@ShanoToni ShanoToni force-pushed the native_host_task branch 2 times, most recently from a7518a0 to bb561af Compare September 25, 2024 14:08
@@ -35,9 +35,13 @@ T get_native_mem(const interop_handle &ih, U acc) {
ih.get_native_mem<::sycl::backend::ext_oneapi_cuda>(acc));
}

template <typename T>
void host_task(::sycl::handler &cgh, const T &task) {
template <typename HandlerT, typename FnT>
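
The hunk above generalizes the host_task wrapper from a single template parameter over ::sycl::handler to a two-parameter template. As a minimal sketch of how such a wrapper could dispatch between the plain host task and the native-command extension (the feature-test macro and the fallback logic are assumptions, not necessarily the PR's exact code):

    // Minimal sketch of a two-parameter wrapper; macro name and dispatch
    // logic are assumptions, not necessarily what the PR implements.
    #include <utility>

    template <typename HandlerT, typename FnT>
    void host_task(HandlerT &cgh, FnT &&task) {
    #ifdef SYCL_EXT_CODEPLAY_ENQUEUE_NATIVE_COMMAND
        // Native path: the callable must only enqueue asynchronous work on the
        // stream associated with this handler's queue (see the spec quote below).
        cgh.ext_codeplay_enqueue_native_command(std::forward<FnT>(task));
    #else
        // Fallback: a regular SYCL host task running on a host thread.
        cgh.host_task(std::forward<FnT>(task));
    #endif
    }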
Contributor

The spec says:

The call to interopCallable must not submit any synchronous tasks to the native backend object, and it must not block waiting for any tasks to complete. The call also must not add tasks to backend objects that underly any other queue, aside from the queue that is associated with this handler. If it does any of these things, the behavior is undefined.

And we have a few places where we do explicitly synchronize, here, for example.

It seems that it may become an issue when ext_codeplay_enqueue_native_command is used.

@densamoilov (Contributor) Sep 26, 2024

In validation, the following case hung and the process was killed by timeout: --mode-modifier=P --bnorm --engine=gpu --inplace=true ic23n"bnorm_ci_0d:1", so there does indeed seem to be an issue. Given that we enable the new host task extension conditionally, we should also introduce a function that performs the synchronization under the same condition.

ShanoToni (Contributor, Author)

I have made some changes and added a utility, sync_device, which synchronizes only when the default host_task is used.
I also wanted to ask: what is the reason for the additional interop_task? If my understanding is correct, the current situation is q.submit{ host_task { q.submit { memset/device sync } } }. I do not think the internal submit will wait on any prior events, as that is handled by the external submit, and no dependencies will come out of the internal submit either: in the host_task case we synchronize the device, while enqueue_native_command resolves any dependencies itself.

I have removed the internal interop task and I am not seeing failures without it. Let me know if this change does not make sense or is wrong, and I can revert it.
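
For reference, a minimal sketch of what such a sync_device-style utility could look like, assuming the same compile-time feature-test macro as in the earlier sketches; the actual utility in the PR may express the condition differently (for example at runtime per device).

    // Minimal sketch of a sync_device-style helper; the gating condition and
    // synchronization call are assumptions, not necessarily the PR's code.
    #include <cuda_runtime.h>

    inline void sync_device() {
    #ifndef SYCL_EXT_CODEPLAY_ENQUEUE_NATIVE_COMMAND
        // Plain host_task path: block until the asynchronously enqueued native
        // work has finished, since nothing else orders it for later commands.
        cudaDeviceSynchronize();
    #endif
        // Native-command path: no explicit wait; the SYCL runtime tracks the
        // enqueued native work through the command's dependencies.
    }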

Contributor

I don't know why there was the additional interop_task and I can't think of any reason why we may need it if we block the entire device anyway.
