training gets frozen while using multiple-GPUs #29
Comments
Problems solved for now! In case people might encounter a similar issue, […]
Hi @NathanYanJing. Your […]
Hi @wpeebles, thanks for your reply! Yes, I agree that using torch.distributed is always the better choice. Unfortunately, the problem has somehow come back: training now hangs at the DataLoader step. I am guessing this is an NCCL / NVIDIA driver version issue. Would you mind sharing your NCCL and CUDA versions?
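For anyone comparing environments, the relevant versions can be printed directly from PyTorch. This is just a small reporting snippet, not part of the repo:

```python
# Print the versions that matter when debugging NCCL/CUDA hangs.
import torch

print("PyTorch:", torch.__version__)
print("CUDA (build):", torch.version.cuda)         # CUDA toolkit PyTorch was compiled against
print("cuDNN:", torch.backends.cudnn.version())
print("NCCL:", torch.cuda.nccl.version())          # e.g. (2, 14, 3)
print("Visible GPUs:", torch.cuda.device_count())
```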
Hi, I encountered a similar issue to yours, with errors at `model = DDP(model, device_ids=[rank])` and at `dist.all_reduce(avg_loss, op=dist.ReduceOp.SUM)`.
After following your method and changing the import to `from torch.nn import DataParallel as DDP`, I was able to resolve the first error, but the second one still persists. Do you have any suggestions for this kind of situation? Thanks for your reply!
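For context on why the second error persists: `dist.all_reduce` needs a process group created by `torch.distributed.init_process_group`, while `nn.DataParallel` is a single-process wrapper that never creates one, so swapping it in for DDP avoids the wrapper error but leaves every collective call without a backend. Below is a minimal single-node sketch of the standard setup (the `Linear` model and loss are placeholders, not this repo's training code):

```python
# Minimal single-node DDP sketch (placeholder model/loss, not the repo's training script).
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    dist.init_process_group("nccl")               # reads RANK/WORLD_SIZE set by torchrun
    rank = dist.get_rank()
    device = rank % torch.cuda.device_count()
    torch.cuda.set_device(device)

    model = torch.nn.Linear(10, 10).to(device)    # placeholder model
    model = DDP(model, device_ids=[device])

    x = torch.randn(8, 10, device=device)
    loss = model(x).mean()
    loss.backward()

    # Collectives only work once the process group above has been initialized.
    avg_loss = loss.detach().clone()
    dist.all_reduce(avg_loss, op=dist.ReduceOp.SUM)
    avg_loss /= dist.get_world_size()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```

Launched as `torchrun --nproc_per_node=<num_gpus> script.py`, torchrun sets the rank and world-size environment variables that `init_process_group` reads.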
Super cool and amazing work!
I am writing to ask for your assistance with an issue I am encountering while training a model using A6000 GPUs. I am using the following command to run my code:
The problem I am experiencing is that training appears to freeze for a long period of time after creating the experiment directory. On occasion, it also throws the following error:
I have not experienced this problem when training with 1, 2, or 3 nodes.
I apologize for my lack of experience in this area, but could you please provide any insights or guidance to help me resolve this issue? Thank you for your assistance.
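As a general diagnostic (not an official fix from the maintainers), NCCL startup hangs like this are often investigated by enabling NCCL's debug logging and temporarily disabling the peer-to-peer and InfiniBand transports. The environment variables below are standard NCCL settings; they can be set before `dist.init_process_group` runs, or equivalently exported in the shell before launching:

```python
# Diagnostic sketch only: standard NCCL environment variables, set before any
# process group is created so the NCCL library picks them up.
import os

os.environ["NCCL_DEBUG"] = "INFO"       # print NCCL initialization and transport logs
os.environ["NCCL_P2P_DISABLE"] = "1"    # rule out GPU peer-to-peer (PCIe/NVLink) transfer hangs
os.environ["NCCL_IB_DISABLE"] = "1"     # rule out InfiniBand transport on a single machine
```

If the hang goes away with `NCCL_P2P_DISABLE=1`, the culprit is likely GPU peer-to-peer communication (driver or PCIe/NVLink topology) rather than the training code itself.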