Add AMD Support #173
@wookayin
Stonesjtu
left a comment
Can you add some mocking tests for ROCM devices?
I'm not super familiar with mockito, but I've started looking into this.
Stonesjtu
left a comment
LGTM.
For the testing part, we can mock a ROCML-based NVML library call such as NVMLGetFanSpeed so that it returns constant values.
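A minimal sketch of that mocking idea, using only `unittest.mock` from the standard library. The `fake_nvml` namespace and `read_fan_speed` helper are hypothetical stand-ins, not code from this PR; `nvmlDeviceGetFanSpeed` is the pynvml-style name assumed for the fan-speed call.

```python
import types
from unittest import mock

# Hypothetical stand-in for the ROCm-backed wrapper module; the real module
# and its function names in the PR may differ.
fake_nvml = types.SimpleNamespace(
    nvmlDeviceGetFanSpeed=lambda handle: 0,
)


def read_fan_speed(nvml, handle):
    """Code under test: ask the backend for the fan speed (percent)."""
    return nvml.nvmlDeviceGetFanSpeed(handle)


def test_fan_speed_is_mocked():
    # Replace the library call with a mock that returns a constant value.
    with mock.patch.object(
        fake_nvml, "nvmlDeviceGetFanSpeed", return_value=42
    ) as fan_mock:
        assert read_fan_speed(fake_nvml, "gpu0") == 42
        fan_mock.assert_called_once_with("gpu0")


test_fan_speed_is_mocked()
```

The same pattern should carry over to the other query functions, so a test suite would not need real AMD hardware.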
    gpu_stat = InvalidGPU(index, "((Unknown Error))", e)
except N.NVMLError_GpuIsLost as e:
    gpu_stat = InvalidGPU(index, "((GPU is lost))", e)
except Exception as e:
Should we raise N.NVMLError_Unknown here for consistency?
P.S. We can catch NVMLError instead of the base Exception; otherwise some native Python errors may be silently ignored.
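To illustrate the difference, here is a small sketch. `NVMLError`, `query`, and `safe_query` are stand-ins defined only for this example, not the PR's code. Catching the narrow class turns driver-level failures into a placeholder result, while an unrelated Python error still propagates.

```python
class NVMLError(Exception):
    """Stand-in for the library's base error class (illustrative only)."""


def query(fail_with=None):
    """Pretend to query a GPU; raise the given exception if one is passed."""
    if fail_with is not None:
        raise fail_with
    return "ok"


def safe_query(fail_with=None):
    try:
        return query(fail_with)
    except NVMLError as e:
        # Only driver-level failures are converted into a placeholder.
        return f"((Unknown Error)): {e}"


print(safe_query())
print(safe_query(NVMLError("driver failure")))
try:
    safe_query(TypeError("bug in our own code"))
except TypeError as e:
    # A native Python error is not swallowed by the narrow except clause.
    print("propagated:", e)
```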
        super().__init__(self.message)

class NVMLError_Unknown(Exception):
Should these NVMLError_xxx classes inherit from NVMLError?
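A sketch of what that hierarchy could look like. The class bodies and the default message are assumptions for illustration, not the PR's actual definitions. With a shared base class, a single `except NVMLError` clause covers every specific error, and more specific handlers can still come first.

```python
class NVMLError(Exception):
    """Base class for all NVML-style errors (illustrative stand-in)."""

    def __init__(self, message="NVML error"):
        self.message = message
        super().__init__(self.message)


class NVMLError_Unknown(NVMLError):
    pass


class NVMLError_GpuIsLost(NVMLError):
    pass


def classify(exc):
    """Map an error to the placeholder text shown for an invalid GPU."""
    try:
        raise exc
    except NVMLError_GpuIsLost:
        return "((GPU is lost))"
    except NVMLError:
        # Any other subclass falls through to the generic handler.
        return "((Unknown Error))"
```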
except (ImportError, SyntaxError, RuntimeError) as e:
    _rocmi = sys.modules.get("rocmi", None)

    raise ImportError(
Should we make this a dedicated NVMLError subclass?
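One possible shape for this, as a hedged sketch: the import failure is re-raised as a dedicated subclass while the original exception is kept as the cause. The names `NVMLError_LibraryNotFound` and `load_backend` are chosen for illustration and are not taken from the PR.

```python
import importlib


class NVMLError(Exception):
    """Illustrative base error class."""


class NVMLError_LibraryNotFound(NVMLError):
    """Raised when the backend library cannot be imported (assumed name)."""


def load_backend(module_name):
    try:
        return importlib.import_module(module_name)
    except (ImportError, SyntaxError, RuntimeError) as e:
        # Chain the original error so the root cause stays visible.
        raise NVMLError_LibraryNotFound(
            f"could not import {module_name!r}"
        ) from e
```

Callers that already handle NVMLError would then also handle a missing backend without a separate `except ImportError` branch.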
Will this be merged at some point?
Hey everyone! I would be very happy to use the same package for |
Hi, I am very sorry that I've been inactive on this PR; I don't have any machine with an AMD graphics card (neither local nor remote) where I can test it. But if some of you can help test the feature, I'd be grateful and happy to have this merged sooner rather than later. I might also need to try setting up AWS G4ad instances soon.
I have access to a node with AMD GPUs, so I would be happy to test it out. I see no testing script was added in this PR, though, so do you have any test in mind? Happy if this gets merged soon as well 😄
Fixes #137
Design
To do this I duplicate the pynvml interface already used by gpustat in a wrapper around rocmi and dynamically import the correct library based on what hardware is present.
Current Status
The base functionality is currently working:

Remaining Tasks