
gpu: nvidia: Added support for native host task extension #2121

Open · wants to merge 1 commit into main
Conversation

ShanoToni (Contributor)

Description

Adding support for the ext_codeplay_enqueue_native_command extension (LINK to the documentation).

Apart from improved dependency handling for native commands, the native host task has lower execution overhead, improving performance.
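
For illustration, here is a minimal sketch of the two submission paths; the helper name, arguments, and feature-test macro are assumptions for the example, not code from this PR. With the extension, the callable only enqueues asynchronous work on the CUDA stream backing the queue and the SYCL runtime tracks its completion, whereas the plain host task runs on a host thread and has to block until the native work finishes.

    // Minimal sketch, not PR code: contrast a plain SYCL host task with the
    // ext_codeplay_enqueue_native_command path for submitting native CUDA work.
    #include <sycl/sycl.hpp>
    #include <cuda_runtime.h>

    void enqueue_native_memset(sycl::queue &q, void *dev_ptr, size_t bytes) {
        q.submit([&](sycl::handler &cgh) {
    #ifdef SYCL_EXT_CODEPLAY_ENQUEUE_NATIVE_COMMAND
            // Native command: only enqueue asynchronous work on the stream
            // backing this queue; the runtime handles the dependencies.
            cgh.ext_codeplay_enqueue_native_command([=](sycl::interop_handle ih) {
                cudaStream_t stream
                        = ih.get_native_queue<sycl::backend::ext_oneapi_cuda>();
                cudaMemsetAsync(dev_ptr, 0, bytes, stream);
            });
    #else
            // Plain host task: runs on a host thread, so the native work must
            // be synchronized before the task returns to keep ordering correct.
            cgh.host_task([=](sycl::interop_handle ih) {
                cudaStream_t stream
                        = ih.get_native_queue<sycl::backend::ext_oneapi_cuda>();
                cudaMemsetAsync(dev_ptr, 0, bytes, stream);
                cudaStreamSynchronize(stream);
            });
    #endif
        });
    }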

Example:

# old host task
    total perf: min(ms):38.256 avg(ms):97.7698
# native host task
    total perf: min(ms):33.4952 avg(ms):33.7441

Results were generated on an A100 with the command:
--eltwise --engine=gpu --mode=P --skip-impl=dpcpp:ref:any --batch=tests/benchdnn/inputs/eltwise/test_eltwise_gpu

Checklist

General

  • [x] Do all unit and benchdnn tests (make test and make test_benchdnn_*) pass locally for each commit?
  • [x] Have you formatted the code using clang-format?

Performance improvements

  • [x] Have you submitted performance data that demonstrates performance improvements?

@ShanoToni ShanoToni requested a review from a team as a code owner September 25, 2024 12:20
@github-actions github-actions bot added the platform:gpu-nvidia Codeowner: @oneapi-src/onednn-gpu-nvidia label Sep 25, 2024
src/gpu/nvidia/sycl_cuda_compat.hpp: review threads (resolved/outdated)
@mgouicem (Contributor)

make test
disable device_cpu
enable device_gpu
enable thr_cuda
enable arch_rtx

@ShanoToni ShanoToni force-pushed the native_host_task branch 2 times, most recently from a7518a0 to bb561af Compare September 25, 2024 14:08
@@ -35,9 +35,13 @@ T get_native_mem(const interop_handle &ih, U acc) {
ih.get_native_mem<::sycl::backend::ext_oneapi_cuda>(acc));
}

template <typename T>
void host_task(::sycl::handler &cgh, const T &task) {
template <typename HandlerT, typename FnT>
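
The hunk above generalizes the host_task wrapper from a single template parameter over ::sycl::handler to a two-parameter template. As a minimal sketch of how such a wrapper could dispatch between the plain host task and the native-command extension (the feature-test macro and the fallback logic are assumptions, not necessarily the PR's exact code):

    // Minimal sketch of a two-parameter wrapper; macro name and dispatch
    // logic are assumptions, not necessarily what the PR implements.
    #include <utility>

    template <typename HandlerT, typename FnT>
    void host_task(HandlerT &cgh, FnT &&task) {
    #ifdef SYCL_EXT_CODEPLAY_ENQUEUE_NATIVE_COMMAND
        // Native path: the callable must only enqueue asynchronous work on the
        // stream associated with this handler's queue (see the spec quote below).
        cgh.ext_codeplay_enqueue_native_command(std::forward<FnT>(task));
    #else
        // Fallback: a regular SYCL host task running on a host thread.
        cgh.host_task(std::forward<FnT>(task));
    #endif
    }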
Contributor

The spec says:

The call to interopCallable must not submit any synchronous tasks to the native backend object, and it must not block waiting for any tasks to complete. The call also must not add tasks to backend objects that underly any other queue, aside from the queue that is associated with this handler. If it does any of these things, the behavior is undefined.

And we have a few places where we do explicitly synchronize, here, for example.

It seems that it may become an issue when ext_codeplay_enqueue_native_command is used.

@densamoilov (Contributor) Sep 26, 2024

In validation, the following case hung and the process was killed by timeout: --mode-modifier=P --bnorm --engine=gpu --inplace=true ic23n"bnorm_ci_0d:1", so there does indeed seem to be an issue. Given that we enable the new host task extension conditionally, we should also introduce a function that performs the synchronization under the same condition.

ShanoToni (Contributor, Author)

I have made some changes and added a utility, sync_device, which synchronizes only when the default host_task is used.
I also wanted to ask: what is the reason for the additional interop_task? If my understanding is correct, the current situation is q.submit{ host_task { q.submit { memset/device sync } } }. I do not think the internal submit will wait on any prior events, as that is handled by the external submit, and no dependencies will come out of the internal submit either: in the host_task case we synchronize the device, while enqueue_native_command resolves any dependencies itself.

I have removed the internal interop task and I am not seeing failures without it. Let me know if this change does not make sense or is wrong, and I can revert it.
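
For reference, a minimal sketch of what such a sync_device-style utility could look like, assuming the same compile-time feature-test macro as in the earlier sketches; the actual utility in the PR may express the condition differently (for example at runtime per device).

    // Minimal sketch of a sync_device-style helper; the gating condition and
    // synchronization call are assumptions, not necessarily the PR's code.
    #include <cuda_runtime.h>

    inline void sync_device() {
    #ifndef SYCL_EXT_CODEPLAY_ENQUEUE_NATIVE_COMMAND
        // Plain host_task path: block until the asynchronously enqueued native
        // work has finished, since nothing else orders it for later commands.
        cudaDeviceSynchronize();
    #endif
        // Native-command path: no explicit wait; the SYCL runtime tracks the
        // enqueued native work through the command's dependencies.
    }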

Contributor

I don't know why there was the additional interop_task and I can't think of any reason why we may need it if we block the entire device anyway.
