[Kept for Feedback] Multi-GPU & New models #1
Thanks for your nice work and congratulations on your good results! I have several questions. Best.
@TiankaiHang Hi! Thanks for your interest in our work!
I'd love to do that, but I have never had access to a machine with more than 1 GPU in my entire life... So if you or anyone else could make a pull request to support this, that would be really nice...
I believe a better model would lead to better results. But training V3/V3+ requires at least double the compute budget, which is why I did not train them. Also, because the V2 results are still important for comparison against prior art, choosing V3/V3+ back then would have required at least 3x the compute budget. I just do not have the cards. Some additional info: on ResNet backbones, my experience tells me that V3+ could be worse than V3. For background: pytorch/vision#2689
Thanks for your kind reply~ :-) Best.
You're welcome.
I will update it for multi-GPU after I reproduce it. Maybe next week; I don't have enough time right now.
That's great to hear! Go for it!
I would suggest checking out https://github.com/huggingface/accelerate, which should make it relatively easy to deploy any model in a distributed setting.
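For reference, a minimal sketch of what an Accelerate-based loop looks like; the toy model and data below stand in for this repo's actual network and loaders, so none of these names come from the codebase:

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset
from accelerate import Accelerator

accelerator = Accelerator()  # picks up single-GPU, multi-GPU, or CPU automatically

model = nn.Linear(10, 2)  # toy stand-in for the real network
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
dataset = TensorDataset(torch.randn(64, 10), torch.randint(0, 2, (64,)))
loader = DataLoader(dataset, batch_size=8)
criterion = nn.CrossEntropyLoss()

# prepare() moves everything to the right device, wraps the model in DDP,
# and shards the dataloader when launched with multiple processes
model, optimizer, loader = accelerator.prepare(model, optimizer, loader)

model.train()
for inputs, targets in loader:
    optimizer.zero_grad()
    loss = criterion(model(inputs), targets)
    accelerator.backward(loss)  # replaces loss.backward()
    optimizer.step()
```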
@lorenmt Good point! Thanks a lot!
Yeah, thank you for sharing, and I am still working on this project. I'm sorry to say that I haven't updated it for multi-GPU until now. Something changed: I reproduced this project for another job, so the multi-GPU code no longer matches. I'm trying to add multi-GPU support to this project following Accelerate today. Unfortunately, I have not solved the bug below.

RuntimeError: Expected to have finished reduction in the prior iteration before starting a new one. This error indicates that your module has parameters that were not used in producing loss. You can enable unused parameter detection by (1) passing the keyword argument find_unused_parameters=True to torch.nn.parallel.DistributedDataParallel.
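For context, the fix this error message points to is a flag on the DDP constructor. Here is a minimal plain-PyTorch sketch; the toy model and the env-var launch convention are illustrative assumptions, not this repo's code:

```python
import os
import torch
import torch.distributed as dist
from torch import nn
from torch.nn.parallel import DistributedDataParallel as DDP

# assumes a launcher that exports RANK/WORLD_SIZE/LOCAL_RANK (torchrun does;
# the older torch.distributed.launch needs --use_env to set LOCAL_RANK)
dist.init_process_group(backend="nccl")
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

model = nn.Linear(10, 2).cuda()  # toy stand-in for the real network
model = DDP(
    model,
    device_ids=[local_rank],
    find_unused_parameters=True,  # tolerate params that get no gradient in a pass
)
```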
@jinhuan-hit Thanks a lot! I still don't have the hardware to debug multi-GPU for now, but hopefully I'll be able to debug this month or the next.
I have already checked the network, but found nothing. Looking forward to hearing your good results!
Yes, I think you're right. I also did not find redundant layers. I'll also try to investigate this when I get the cards.
Have you tried to set find_unused_parameters=True when wrapping the model in DDP?
Yeah, you are right! I wrapped the network in DDP with find_unused_parameters=True myself, but it didn't work. However, when I added find_unused_parameters=True to the prepare function of the accelerator in the Accelerate package, the job runs well. Unfortunately, I'm sorry to say that I have not verified the result yet. The package versions I used: torch==1.4.0, torchvision==0.5.0, accelerate==0.1.0.
Also, I changed main.py following https://github.com/huggingface/accelerate
Then it should work. Best wishes.
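For anyone reading this later: newer releases of Accelerate let you pass the flag through a kwargs handler instead of editing prepare() (as far as I can tell, this handler does not exist in 0.1.0). A sketch:

```python
from accelerate import Accelerator, DistributedDataParallelKwargs

# forwards find_unused_parameters=True to the DDP wrapper that prepare() creates
ddp_kwargs = DistributedDataParallelKwargs(find_unused_parameters=True)
accelerator = Accelerator(kwargs_handlers=[ddp_kwargs])
# model, optimizer, loader = accelerator.prepare(model, optimizer, loader)
```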
@jinhuan-hit If the results are similar to the single-card results under mixed precision, maybe you'd like to send a pull request for this?
Yeah, I'm checking the results now. If they look OK, I'd like to send a PR.
Thanks a lot! If using Accelerate requires a PyTorch version update that involves code changes, please make the version update and the multi-GPU support two separate PRs if possible (one PR is also fine).
I use PyTorch 1.4.0 because of Accelerate. Now I'm training in fp32 and it works well without any code modification.
I have checked the result and it looks normal!
Great! I'll formulate a draft PR for comments.
Thanks for everyone's help! DDP is now supported. Please report bugs if you've found any.