[Train] Ray Train fails for AMD multi-GPU: Invalid Device Ordinal #49260
Comments
I'm getting the same error with Ray 2.40.0, torch 2.5.1, and ROCm 6.2. Single-GPU runs work fine, but the bug hits every multi-GPU TorchTrainer job on AMD. The same code seems to run fine on Nvidia GPUs.
This error stems from torch trying to set a device based on CUDA_VISIBLE_DEVICES, while ROCR_VISIBLE_DEVICES exposes only a single device (i.e., device masking has already been done). A temporary fix is to set
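(The snippet the commenter posted did not survive in this copy of the thread. Below is a minimal sketch of one plausible workaround along those lines, assuming Ray's documented `RAY_EXPERIMENTAL_NOSET_ROCR_VISIBLE_DEVICES` switch; whether this is the exact variable the commenter meant is not preserved here.)

```python
# Hypothetical workaround sketch; the original comment's snippet is missing.
# With RAY_EXPERIMENTAL_NOSET_ROCR_VISIBLE_DEVICES set, Ray stops masking GPUs
# via ROCR_VISIBLE_DEVICES, so all devices stay visible in each worker and the
# ordinal torch derives from CUDA_VISIBLE_DEVICES is in range again.
import os

# Must be set before Ray starts (and on every node of a multi-node cluster).
os.environ["RAY_EXPERIMENTAL_NOSET_ROCR_VISIBLE_DEVICES"] = "1"

import ray

ray.init()
```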
If my kids allow me, I will open a PR with a better fix tonight, unless @hongpeng-guo beats me to it.
I noticed this as well, but unfortunately it's not enough to make the above training script run for me. I no longer hit the invalid device ordinal, but instead get a much less informative error:
I'm not sure yet whether the above change is actually a solution, or whether this is an entirely separate problem, but it at least still appears to be CUDA-device related.
@amorinConnor That looks like a separate problem, unrelated to GPU indexing/visibility/communication.
@AVSuni Thank you so much for helping on this problem. Feel free to start a PR; I can help with it later. 👍
@AVSuni Have you confirmed this works on an AMD multi-GPU setup? For me, the code fails when wrapping the model in DistributedDataParallel.
Closing this issue for now. Solved by #49346.
What happened + What you expected to happen
I am trying to run some basic torch training code on an AMD machine with 4 GPUs. For multi-GPU training (num_workers > 1 below), Ray fails with the following error:
I verified some test code after getting help on the Ray forums: https://discuss.ray.io/t/torchtrainer-fails-rocm-multi-gpu-invalid-device-ordinal/21041
Versions / Dependencies
ray 2.40.0
torch 2.5.0+rocm6.1
Python 3.9.2
Red Hat 10.3.1
Note: this also fails on ROCm 6.2.
Reproduction script
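(The reporter's script is not preserved in this copy of the issue. Below is a minimal sketch of the kind of TorchTrainer run described above; the model, sizes, and hyperparameters are illustrative.)

```python
# Hypothetical minimal repro; the reporter's actual script is not preserved here.
import torch
import torch.nn as nn

import ray.train.torch
from ray.train import ScalingConfig
from ray.train.torch import TorchTrainer


def train_loop_per_worker(config):
    device = ray.train.torch.get_device()
    model = nn.Linear(8, 1)
    # prepare_model moves the model to the worker's GPU and wraps it in DDP;
    # per the discussion above, this is where the device error surfaces on ROCm.
    model = ray.train.torch.prepare_model(model)
    optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
    for _ in range(10):
        x = torch.randn(32, 8, device=device)
        loss = model(x).mean()
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()


# Single-GPU (num_workers=1) works; num_workers > 1 hits the invalid device ordinal.
trainer = TorchTrainer(
    train_loop_per_worker,
    scaling_config=ScalingConfig(num_workers=2, use_gpu=True),
)
trainer.fit()
```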
Issue Severity
High: It blocks me from completing my task.