Add NVTX range in CUDA GPU kernel call of program #1986

Open
wants to merge 1 commit into
base: main

Conversation

philip-paul-mueller
Collaborator

These changes were made by Ioannis Magkanaris; I only opened the PR.
It adds NVTX ranges around the kernel call generated by DaCe, which makes it easy to distinguish that call from other CUDA activity, such as CuPy.

Contributor

alexnick83 left a comment

That's a great idea, and NVTX ranges can also be profiled with nsys. However, would it perhaps be more appropriate to do it through the Instrumentation API? I believe we already have everything in place with GPUEventProvider. The push should happen in the on_scope_entry method and the pop in the on_scope_exit method.

@tbennun @phschaad since you have worked on that file the most, do you have a better suggestion?
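
The push/pop placement suggested in this comment can be sketched as follows. This is only an illustration of the pattern: `CodeStream`, `NVTXRangeProvider`, `generate_kernel_call`, and the simplified `on_scope_entry`/`on_scope_exit` signatures are hypothetical stand-ins, not DaCe's actual classes or method signatures. The only real API names used are the NVTX C functions `nvtxRangePushA` and `nvtxRangePop`, which appear in the emitted code strings.

```python
# Illustrative stand-ins only: these are NOT DaCe's real classes or signatures.


class CodeStream:
    """Hypothetical stand-in for a code-generation output stream."""

    def __init__(self):
        self.lines = []

    def write(self, code):
        self.lines.append(code)


class NVTXRangeProvider:
    """Hypothetical provider that brackets a scope with an NVTX range."""

    def on_scope_entry(self, scope_name, stream):
        # nvtxRangePushA opens a named range on the calling thread.
        stream.write(f'nvtxRangePushA("{scope_name}");')

    def on_scope_exit(self, scope_name, stream):
        # nvtxRangePop closes the most recently pushed range.
        stream.write('nvtxRangePop();')


def generate_kernel_call(scope_name, provider=None):
    """Emit a placeholder kernel launch, optionally wrapped by the provider."""
    stream = CodeStream()
    if provider is not None:
        provider.on_scope_entry(scope_name, stream)
    # Placeholder launch line; the real generated code would differ.
    stream.write(f'{scope_name}<<<grid, block>>>(args);')
    if provider is not None:
        provider.on_scope_exit(scope_name, stream)
    return stream.lines
```

Because the range is only emitted when a provider is attached, uninstrumented code pays no cost, which is the main appeal of routing this through instrumentation rather than hard-coding it.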

@tbennun
Collaborator

tbennun commented Apr 28, 2025

I agree. This should be possible to implement more cleanly with the instrumentation API, by instrumenting the SDFG (or a state, a group of maps, etc.) with, e.g., a new instrumentation type called GPU_Region. As part of the instrumentation, it would also not be enabled by default, which matters because the calls might add overhead for very short microsecond-scale SDFGs.

In fact, this could even be implemented in Python with the SDFG call hooks. Here is an example of how to do it in CuPy, which should be even more portable towards AMD GPUs:
https://docs.cupy.dev/en/latest/reference/generated/cupy.cuda.nvtx.RangePush.html
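
A minimal sketch of the Python-side approach, assuming CuPy is installed: `cupy.cuda.nvtx.RangePush` and `cupy.cuda.nvtx.RangePop` are the CuPy functions from the link above, while the `nvtx_range` helper and its `push`/`pop` parameters are hypothetical names introduced here for illustration. Wiring the helper into DaCe's SDFG call hooks is deliberately left out, since that registration API is not shown in this thread.

```python
from contextlib import contextmanager


@contextmanager
def nvtx_range(name, push=None, pop=None):
    """Wrap a block in an NVTX range (hypothetical helper).

    push/pop default to CuPy's NVTX bindings; they can be overridden,
    for example to exercise the helper on a machine without a GPU.
    """
    if push is None or pop is None:
        # Imported lazily so the module loads even without CuPy.
        from cupy.cuda import nvtx
        push = push or nvtx.RangePush
        pop = pop or nvtx.RangePop
    push(name)
    try:
        yield
    finally:
        # Always close the range, even if the wrapped call raises.
        pop()
```

A call to a compiled SDFG could then be wrapped as `with nvtx_range("dace_sdfg"): compiled_sdfg(...)`, where `compiled_sdfg` is a placeholder for whatever callable is being profiled.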

@philip-paul-mueller
Collaborator Author

Okay, we will look in the direction of the GPUEventProvider.
However, I am against using a hook in Python, as this will surely add far more overhead to "[...] very short microsecond-scale SDFGs [...]".

4 participants