Skip to content

Add AMD Support#173

Open
bethune-bryant wants to merge 20 commits into
wookayin:masterfrom
bethune-bryant:brnelson/add_amd_spport
Open

Add AMD Support#173
bethune-bryant wants to merge 20 commits into
wookayin:masterfrom
bethune-bryant:brnelson/add_amd_spport

Conversation

@bethune-bryant
Copy link
Copy Markdown
Contributor

@bethune-bryant bethune-bryant commented Jul 29, 2024

Fixes #137

Design

To do this I duplicate the pynvml interface already used by gpustat in a wrapper around rocmi and dynamically import the correct library based on what hardware is present.

Current Status

The base functionality is currently working:
image

Remaining Tasks

  • Basic Functionality
  • Testing
  • Documentation

@bethune-bryant bethune-bryant changed the title WIP - AMD Support Add AMD Support Aug 8, 2024
@bethune-bryant bethune-bryant marked this pull request as ready for review August 8, 2024 14:37
@bethune-bryant
Copy link
Copy Markdown
Contributor Author

@wookayin
Before I start working on documentation and testing, would you mind taking a look at this PR?
Do you agree with the overall design, or is there something you would like changed?
Do you have any concerns?

@wookayin wookayin self-assigned this Aug 8, 2024
Copy link
Copy Markdown
Collaborator

@Stonesjtu Stonesjtu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you add some mocking tests for ROCM devices?

Comment thread setup.py
Comment thread gpustat/util.py
@bethune-bryant
Copy link
Copy Markdown
Contributor Author

Can you add some mocking tests for ROCM devices?

I'm not super familiar with mockito, but I've started looking into this.

Copy link
Copy Markdown
Collaborator

@Stonesjtu Stonesjtu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM.

for the testing part, we can mock a ROCML based NVML library call like NVMLGetFanSpeed to return constant values.

Comment thread gpustat/core.py
gpu_stat = InvalidGPU(index, "((Unknown Error))", e)
except N.NVMLError_GpuIsLost as e:
gpu_stat = InvalidGPU(index, "((GPU is lost))", e)
except Exception as e:
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we raise the N.NVMLError_Unknown Error for consistency?

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ps: we can catch NVMLError instead of Base Exception, since you may ignore some python native errors

Comment thread gpustat/rocml.py
super().__init__(self.message)


class NVMLError_Unknown(Exception):
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should these NVMLError_xxx inherit NVMLError?

Comment thread gpustat/rocml.py
except (ImportError, SyntaxError, RuntimeError) as e:
_rocmi = sys.modules.get("rocmi", None)

raise ImportError(
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we make this a dedicated NVMLError subclass?

Comment thread gpustat/util.py
@metalcycling
Copy link
Copy Markdown

Will this be merged at some point?

@remi-or
Copy link
Copy Markdown

remi-or commented Sep 25, 2025

Hey everyone! I would be very happy to use the same package for amd and nvidia, what are the blockers that need to be addressed for this to get merged? Happy to make contributions.
cc. @wookayin as it looks like you are the core contributor 🙂

@wookayin
Copy link
Copy Markdown
Owner

Hi, I am very sorry that I've been inactive in this PR as I don't have any machines with an AMD graphics card (neither local or remote) where I can give a test. But if some of you can help out testing out the feature, I'd be so grateful and happy to have this merged sooner than later. I might also need to try setting up AWS G4ad instances soon.

@remi-or
Copy link
Copy Markdown

remi-or commented Sep 25, 2025

I have access to a node with AMD GPUs, so I would be happy to test it out. Though I see no testing script was added in this PR, so do you have any test in mind? Happy if this gets merged soon as well 😄

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

AMD support

6 participants