gpu: nvidia: Added support for native host task extension #2121
base: main
Conversation
Force-pushed from a7518a0 to bb561af
@@ -35,9 +35,13 @@ T get_native_mem(const interop_handle &ih, U acc) {
            ih.get_native_mem<::sycl::backend::ext_oneapi_cuda>(acc));
}

template <typename T>
void host_task(::sycl::handler &cgh, const T &task) {
template <typename HandlerT, typename FnT>
The spec says:
The call to interopCallable must not submit any synchronous tasks to the native backend object, and it must not block waiting for any tasks to complete. The call also must not add tasks to backend objects that underly any other queue, aside from the queue that is associated with this handler. If it does any of these things, the behavior is undefined.
And we have a few places where we do explicitly synchronize, here, for example.
It seems that it may become an issue when ext_codeplay_enqueue_native_command is used.
In validation the following case hung and the process was killed by timeout: --mode-modifier=P --bnorm --engine=gpu --inplace=true ic23n"bnorm_ci_0d:1", so there does indeed seem to be an issue. Given that we enable the new host task extension conditionally, we should also introduce a function that performs the synchronization under the same condition.
I have made some changes and added a utility sync_device, which synchronizes only when the default host_task path is used.
I also wanted to ask: what is the reason for the additional interop_task? If my understanding is correct, the current situation is q.submit{ host_task { q.submit { memset/device sync } } }.
In the current state I do not think the internal submit will wait on any prior events, as that is handled by the external submit, and no dependencies will come out of the internal submit either, since we synchronize the device in the host_task case, while the enqueue_native_command path would resolve any dependencies itself.
I have removed the internal interop task, and I am not seeing failures without it. Let me know if this change does not make sense or is wrong, and I can revert it.
I don't know why the additional interop_task was there, and I can't think of any reason we would need it if we block the entire device anyway.
Force-pushed from bb561af to 856db1d
Description
Adding support for the extension ext_codeplay_enqueue_native_command (LINK to documentation). Apart from improved dependency handling for native commands, the native command path has lower execution overhead, improving performance.
Example:
Results generated running on A100 with command
--eltwise --engine=gpu --mode=P --skip-impl=dpcpp:ref:any --batch=tests/benchdnn/inputs/eltwise/test_eltwise_gpu
Checklist
General
Do all unit and benchdnn tests (make test and make test_benchdnn_*) pass locally for each commit?
Performance improvements