Change default gpu metric backend #2501
Conversation
Sorry, mixed with some other commits. Will update it later.
Force-pushed from 4a9920d to 2dd6ca9.
Resolved.
How about we make `torch` …
Done in 4398557.
userbenchmark/triton/run.py (Outdated)
```python
parser.add_argument(
    "--metrics-gpu-backend",
    choices=["default", "nvml"],
    default="default",
```
Let's change the default mode name to `torch` for readability.
Renamed in 8b48eea.
run.py (Outdated)
```diff
@@ -477,18 +477,17 @@ def main() -> None:
     )
     parser.add_argument(
         "--metrics-gpu-backend",
-        choices=["dcgm", "default"],
+        choices=["dcgm", "default", "nvml"],
```
Same here, let's change the default backend name to `torch`.
Renamed in 8b48eea.
@FindHao has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.
The current GPU memory metric backends include dcgm and nvml. Both report values from the hardware and should be accurate. This PR adds a native PyTorch way to collect GPU memory usage, based on `torch.cuda.max_memory_allocated()`. The benefits are lower overhead and accurate results on a shared GPU server where other users run GPU processes, since we don't implement per-process filtering for the other two backends. Use `--metrics-gpu-backend torch` to select this backend.
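The idea behind the `torch` backend can be sketched as below. This is a minimal illustration, not the PR's actual implementation; the helper name `measure_peak_gpu_memory` is hypothetical. Because `torch.cuda.max_memory_allocated()` only counts allocations made through PyTorch's caching allocator in the current process, other users' processes on a shared GPU do not inflate the number, unlike the dcgm/nvml hardware counters.

```python
import torch

def measure_peak_gpu_memory(fn):
    """Run fn and return the peak GPU memory (bytes) allocated by this
    process through PyTorch, or None if no CUDA device is available.

    Hypothetical sketch of the torch-backend idea: reset the peak
    counter, run the workload, then read the per-process peak.
    """
    if not torch.cuda.is_available():
        return None  # no GPU in this environment
    torch.cuda.reset_peak_memory_stats()
    fn()
    torch.cuda.synchronize()
    return torch.cuda.max_memory_allocated()

# Example workload (hypothetical): allocate a tensor on the GPU.
peak_bytes = measure_peak_gpu_memory(
    lambda: torch.empty(1024, 1024, device="cuda")
    if torch.cuda.is_available() else None
)
```

On a machine without a GPU this returns `None`; with a GPU it returns a non-negative byte count reflecting only this process's PyTorch allocations.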