Issues with num_gpus #6

Open

MARMOTatZJU opened this issue Aug 15, 2020 · 3 comments


MARMOTatZJU commented Aug 15, 2020

With 4 GPUs, I reran the default settings in all.sh and the AP is correct (11.27 AP on the 20 VOC categories).

However, when I try another machine equipped with 2 GPUs, the loss_cls becomes strange and the AP at the end of training is near 0.

I attach my training log for debugging:
fsod_train_log.txt

Comparing the logs from the 2-GPU machine and the 4-GPU machine, the loss_cls has already diverged by iteration 2999, as can be seen below:

2 GPUs

[08/13 11:32:08 d2.utils.events]: eta: 2 days, 10:28:56  iter: 2999  total_loss: 0.935  loss_cls: 0.566  loss_box_reg: 0.255  loss_rpn_cls: 0.061  loss_rpn_loc: 0.015  time: 1.8013  data_time: 0.0312  lr: 0.004000  max_mem: 7442M

4 GPUs

[08/07 11:07:45 d2.utils.events]: eta: 1 day, 5:50:52  iter: 2999  total_loss: 0.811  loss_cls: 0.476  loss_box_reg: 0.226  loss_rpn_cls: 0.077  loss_rpn_loc: 0.019  time: 0.9198  data_time: 0.0164  lr: 0.004000  max_mem: 4173M

From your code, I do not see anything related to num-gpus. Maybe some extra handling is needed on the Detectron2 side when num-gpus changes?

@MARMOTatZJU
Author

Empirically, I found that it is necessary to halve the learning rate in order to train correctly on a 2-GPU machine instead of a 4-GPU machine.

Here is the training log with 2 GPUs and a 0.5x learning rate, whose loss values match the officially released training log:

fsod_train_log.txt
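
For reference, this behaviour is consistent with the linear scaling rule commonly used in Detectron2-style training: if the per-GPU batch size stays fixed, going from 4 GPUs to 2 GPUs halves the effective batch size, so the learning rate should be halved as well (0.004 → 0.002). A minimal sketch of that rule, assuming the 4-GPU run above is the reference setup; the helper name is mine, not from this repo:

```python
def scale_lr(base_lr: float, ref_num_gpus: int, num_gpus: int) -> float:
    """Linear scaling rule: scale the learning rate in proportion to the
    effective batch size, which here is assumed to be proportional to the
    GPU count (per-GPU batch size unchanged)."""
    return base_lr * num_gpus / ref_num_gpus

# With the BASE_LR of 0.004 seen in the logs above:
print(scale_lr(0.004, ref_num_gpus=4, num_gpus=2))  # 0.002
```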

@fanq15
Owner

fanq15 commented Aug 17, 2020

It's weird. I use detectron2's default GPU settings for training. Thank you for your advice!

@MARMOTatZJU
Author

[UPDATE]
@fanq15 Halving the learning rate along with num_gpus reproduced the expected result. Here are the training logs with 2 GPUs & 0.5x SOLVER.BASE_LR for debugging:

fsod_train_log.txt
fsod_finetune_train_log.txt
fsod_finetune_test_log.txt
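
For anyone else hitting this, here is a sketch of how the override can be applied through detectron2's config API; the reference GPU count (4) and the assumption that the effective batch size scales with num_gpus come from the runs above, not from anything stated in this repo:

```python
from detectron2.config import get_cfg

# Plain detectron2 config; the FSOD code may register extra keys on top of
# this, so treat it as a pattern rather than a drop-in script.
cfg = get_cfg()

num_gpus = 2       # GPUs actually available
ref_num_gpus = 4   # GPU count the released settings were tuned for (assumption)

# Linear scaling rule: with the 0.004 base LR from the released 4-GPU run,
# this gives 0.002 for the 2-GPU machine.
cfg.SOLVER.BASE_LR = 0.004 * num_gpus / ref_num_gpus
print(cfg.SOLVER.BASE_LR)  # 0.002
```

If the training script follows detectron2's default argument parser, the same override can also be passed on the command line as trailing key/value options, e.g. `SOLVER.BASE_LR 0.002`.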
