Issues with num_gpus #6
Empirically, I found that it is necessary to halve the learning rate in order to train correctly on a 2-GPU machine instead of a 4-GPU machine. I hereby provide the training log with 2 GPUs and 0.5x learning rate, whose loss values match the officially released training log.
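For reference, this adjustment is just the standard linear scaling rule: keep the per-GPU batch size fixed, shrink the global batch with the number of GPUs, and scale BASE_LR by the same factor. A minimal sketch of that computation against a Detectron2-style config (the config path and the 4-GPU reference are my assumptions, not values taken from this repo):

```python
from detectron2.config import get_cfg

def scale_cfg_for_gpus(config_file, num_gpus, ref_gpus=4):
    """Linear scaling rule sketch: keep images-per-GPU constant and scale
    the LR by the same factor as the global batch size, assuming the
    released config was tuned for ref_gpus GPUs."""
    cfg = get_cfg()
    cfg.merge_from_file(config_file)  # hypothetical path, e.g. a copy of the repo's fsod config
    ims_per_gpu = cfg.SOLVER.IMS_PER_BATCH // ref_gpus   # per-GPU images in the reference run
    cfg.SOLVER.IMS_PER_BATCH = ims_per_gpu * num_gpus    # smaller global batch on fewer GPUs
    cfg.SOLVER.BASE_LR *= num_gpus / ref_gpus            # e.g. 0.5x LR when going from 4 to 2 GPUs
    return cfg
```

With ref_gpus=4 and num_gpus=2 this reproduces the 0.5x learning rate mentioned above.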
It's weird. I use Detectron2's default GPU settings for training. Thank you for your advice!
[UPDATE] fsod_train_log.txt
With 4 GPUs, I reran the default settings in all.sh and the AP is correct (11.27 on the 20 VOC categories).
However, when I try to use another machine equipped with 2 GPUs, loss_cls behaves strangely and the AP at the end of training is near 0.
I hereby provide my training log for debugging:
fsod_train_log.txt
Comparing the logs from the 2-GPU machine and the 4-GPU machine, loss_cls diverges before iteration 2999, as can be seen in the attached log excerpts (2 GPUs vs. 4 GPUs).
From your code, I do not see anything that depends on num-gpus. Could it be that some extra code needed by Detectron2 is missing when num-gpus changes?
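For what it's worth, one place such an adjustment could live is the setup() function of a Detectron2 train_net.py-style script; as far as I can tell, the launcher only changes the number of worker processes and nothing in the default training loop rescales the learning rate. A rough sketch, assuming the released config was produced with 4 GPUs (the REFERENCE_GPUS constant and the scaling itself are my assumptions, not something taken from this repo):

```python
from detectron2.config import get_cfg
from detectron2.engine import default_setup

REFERENCE_GPUS = 4  # number of GPUs the released config/log was produced with (assumption)

def setup(args):
    """Build the config, rescaling batch size and LR when training on
    fewer (or more) GPUs than the reference run (sketch)."""
    cfg = get_cfg()
    cfg.merge_from_file(args.config_file)
    cfg.merge_from_list(args.opts)
    if args.num_gpus != REFERENCE_GPUS:
        scale = args.num_gpus / REFERENCE_GPUS
        cfg.SOLVER.IMS_PER_BATCH = int(cfg.SOLVER.IMS_PER_BATCH * scale)
        cfg.SOLVER.BASE_LR *= scale
    cfg.freeze()
    default_setup(cfg, args)
    return cfg
```

Here args is what detectron2.engine.default_argument_parser() returns, so args.num_gpus comes from the same --num-gpus flag used when launching training.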