
Change default gpu metric backend #2501

Merged · 10 commits

Conversation

@FindHao (Member) commented Oct 10, 2024

The current GPU memory metric backends are dcgm and nvml. Both report from hardware counters and should be accurate. This PR adds a native PyTorch way to collect GPU memory usage via torch.cuda.max_memory_allocated(). The benefits are lower overhead and accurate numbers on a shared GPU server when there are multiple GPU processes from other users, because we don't implement a process filter for the other two backends.

Use --metrics-gpu-backend torch to set the backend.
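The torch backend described above relies on PyTorch's own allocator statistics. A minimal sketch of that measurement (the helper name is mine, not from the PR):

```python
import torch

def peak_gpu_memory_bytes(fn) -> int:
    """Run fn and return the peak GPU memory allocated by THIS process,
    in bytes, as tracked by PyTorch's caching allocator. The counter is
    process-local, so other users' processes on a shared GPU do not
    inflate the number (unlike device-wide dcgm/nvml readings)."""
    torch.cuda.reset_peak_memory_stats()  # clear the previous peak
    fn()
    torch.cuda.synchronize()  # make sure the work actually finished
    return torch.cuda.max_memory_allocated()
```

For example, `peak_gpu_memory_bytes(lambda: torch.ones(1024, 1024, device="cuda"))` would report roughly 4 MiB for the fp32 tensor, plus allocator rounding.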

@FindHao (Member, Author) commented Oct 10, 2024

Sorry, mixed with some other commits. Will update it later.

@FindHao (Member, Author) commented Oct 10, 2024

> Sorry, mixed with some other commits. Will update it later.

Resolved.

@xuzhao9 (Contributor) commented Oct 10, 2024

How about we make the torch backend the default, to be consistent with the PT2 benchmark runner?

@FindHao (Member, Author) commented Oct 10, 2024

> How about we make the torch backend the default, to be consistent with the PT2 benchmark runner?

done in 4398557
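The change agreed on here (torch as the default backend) can be sketched with argparse. The flag name and choices follow the discussion; the surrounding parser is illustrative, not the benchmark runner's actual code:

```python
import argparse

# Illustrative parser: only the --metrics-gpu-backend flag is from the PR.
parser = argparse.ArgumentParser(description="benchmark runner (sketch)")
parser.add_argument(
    "--metrics-gpu-backend",
    choices=["torch", "dcgm", "nvml"],
    default="torch",  # torch-native collection becomes the default
    help="Backend used to collect GPU memory metrics.",
)

args = parser.parse_args([])  # no CLI args -> defaults apply
print(args.metrics_gpu_backend)  # prints "torch"
```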

@FindHao FindHao changed the title Add new metric backend torch Change default gpu metric backend Oct 10, 2024
parser.add_argument(
    "--metrics-gpu-backend",
    choices=["default", "nvml"],
    default="default",

Contributor: Let's change the default mode name to torch for readability.

Member Author: renamed in 8b48eea

run.py Outdated
@@ -477,18 +477,17 @@ def main() -> None:
     )
     parser.add_argument(
         "--metrics-gpu-backend",
-        choices=["dcgm", "default"],
+        choices=["dcgm", "default", "nvml"],
Contributor: Same here, let's change the default backend name to torch.

Member Author: renamed in 8b48eea
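A hypothetical dispatch over the backend flag might look like the sketch below. The function name is mine and the bodies only illustrate the trade-off discussed above (process-local torch counter vs. device-wide NVML reading); this is not the PR's actual implementation:

```python
def collect_gpu_memory_mib(backend: str) -> float:
    """Return a GPU memory reading in MiB for the chosen backend.
    Illustrative only: 'torch' reports this process's peak allocation,
    'nvml' reports device-wide current usage (includes other users'
    processes on a shared GPU, since there is no process filter)."""
    if backend == "torch":
        import torch
        return torch.cuda.max_memory_allocated() / 2**20
    if backend == "nvml":
        import pynvml  # NVIDIA Management Library bindings
        pynvml.nvmlInit()
        handle = pynvml.nvmlDeviceGetHandleByIndex(0)
        return pynvml.nvmlDeviceGetMemoryInfo(handle).used / 2**20
    raise ValueError(f"unsupported --metrics-gpu-backend: {backend}")
```

The lazy imports keep each backend optional: a machine without pynvml can still use the torch path.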

@facebook-github-bot (Contributor)

@FindHao has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.


@facebook-github-bot (Contributor)

@FindHao merged this pull request in c396191.

3 participants