Issues with num_gpus #6

Open

MARMOTatZJU opened this issue Aug 15, 2020 · 3 comments


MARMOTatZJU commented Aug 15, 2020

With 4 GPUs, I reran the default settings in all.sh and the AP is correct (11.27 AP on the 20 VOC categories).

However, when I try another machine equipped with 2 GPUs, the loss_cls becomes strange and the AP at the end of training is near 0.

I attach my training log for debugging:
fsod_train_log.txt

Comparing the logs from the 2-GPU machine and the 4-GPU machine, the loss_cls has already diverged by iteration 2999, as can be seen below:

2 GPUs

[08/13 11:32:08 d2.utils.events]: eta: 2 days, 10:28:56  iter: 2999  total_loss: 0.935  loss_cls: 0.566  loss_box_reg: 0.255  loss_rpn_cls: 0.061  loss_rpn_loc: 0.015  time: 1.8013  data_time: 0.0312  lr: 0.004000  max_mem: 7442M

4 GPUs

[08/07 11:07:45 d2.utils.events]: eta: 1 day, 5:50:52  iter: 2999  total_loss: 0.811  loss_cls: 0.476  loss_box_reg: 0.226  loss_rpn_cls: 0.077  loss_rpn_loc: 0.019  time: 0.9198  data_time: 0.0164  lr: 0.004000  max_mem: 4173M

From your code, I do not see anything related to num-gpus. Maybe some extra handling is needed on the Detectron2 side when num-gpus changes?

@MARMOTatZJU
Author

Empirically, I found that it is necessary to halve the learning rate in order to train correctly on a 2-GPU machine instead of a 4-GPU machine.

Here is the training log with 2 GPUs and a 0.5x learning rate, whose loss values match the officially released training log:

fsod_train_log.txt
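
For reference, this behaviour is consistent with the linear scaling rule commonly used in Detectron2-style training: if the per-GPU batch size stays fixed, going from 4 GPUs to 2 GPUs halves the effective batch size, so the learning rate should be halved as well (0.004 → 0.002). A minimal sketch of that rule, assuming the 4-GPU run above is the reference setup; the helper name is mine, not from this repo:

```python
def scale_lr(base_lr: float, ref_num_gpus: int, num_gpus: int) -> float:
    """Linear scaling rule: scale the learning rate in proportion to the
    effective batch size, which here is assumed to be proportional to the
    GPU count (per-GPU batch size unchanged)."""
    return base_lr * num_gpus / ref_num_gpus

# With the BASE_LR of 0.004 seen in the logs above:
print(scale_lr(0.004, ref_num_gpus=4, num_gpus=2))  # 0.002
```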

@fanq15
Owner

fanq15 commented Aug 17, 2020

It's weird. I use detectron2's default GPU settings for training. Thank you for your advice!

@MARMOTatZJU
Author

[UPDATE]
@fanq15 Halving the learning rate along with num_gpus reproduced the expected result. Here are the training logs with 2 GPUs & 0.5x SOLVER.BASE_LR for debugging:

fsod_train_log.txt
fsod_finetune_train_log.txt
fsod_finetune_test_log.txt
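
For anyone else hitting this, here is a sketch of how the override can be applied through detectron2's config API; the reference GPU count (4) and the assumption that the effective batch size scales with num_gpus come from the runs above, not from anything stated in this repo:

```python
from detectron2.config import get_cfg

# Plain detectron2 config; the FSOD code may register extra keys on top of
# this, so treat it as a pattern rather than a drop-in script.
cfg = get_cfg()

num_gpus = 2       # GPUs actually available
ref_num_gpus = 4   # GPU count the released settings were tuned for (assumption)

# Linear scaling rule: with the 0.004 base LR from the released 4-GPU run,
# this gives 0.002 for the 2-GPU machine.
cfg.SOLVER.BASE_LR = 0.004 * num_gpus / ref_num_gpus
print(cfg.SOLVER.BASE_LR)  # 0.002
```

If the training script follows detectron2's default argument parser, the same override can also be passed on the command line as trailing key/value options, e.g. `SOLVER.BASE_LR 0.002`.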
