
Change default value of eps in FrozenBatchNorm to match BatchNorm #2599

Closed
juyunsang opened this issue Aug 21, 2020 · 18 comments · Fixed by #2933

Comments


juyunsang commented Aug 21, 2020

❓ Questions and Help

Hello
A "Loss is nan" error occurs when I train Faster R-CNN with a resnext101 backbone.
My code is as follows:

from torchvision.models.detection.backbone_utils import resnet_fpn_backbone
from torchvision.models.detection.faster_rcnn import FasterRCNN, FastRCNNPredictor

backbone = resnet_fpn_backbone('resnext101_32x8d', pretrained=True)
model = FasterRCNN(backbone, num_classes)
in_features = model.roi_heads.box_predictor.cls_score.in_features
model.roi_heads.box_predictor = FastRCNNPredictor(in_features, num_classes)

error message

Epoch: [0]  [   0/7208]  eta: 1:27:42  lr: 0.000040  loss: 40613806080.0000 (40613806080.0000)  loss_box_reg: 7979147264.0000 (7979147264.0000)  loss_classifier: 11993160704.0000 (11993160704.0000)  loss_objectness: 9486380032.0000 (9486380032.0000)  loss_rpn_box_reg: 11155118080.0000 (11155118080.0000)  time: 0.7301  data: 0.4106  max mem: 1241
Loss is nan, stopping training

When I change the backbone to resnet50 or resnet152, no error occurs.

Please note that this issue tracker is not a help form and this issue will be closed.

We have a set of listed resources available on the website. Our primary means of support is our discussion forum:

Contributor

pmeier commented Aug 21, 2020

Hi @juyunsang

as our template states:

Please note that this issue tracker is not a help form and this issue will be closed. [...] Our primary means of support is our discussion forum.


Without knowing your data it's hard to tell what is going wrong. I'm assuming your data is not corrupt, since the other models work. Thus, this might be a hyper-parameter problem. The loss in the first step seems fairly large (~40e9). As a first step, I would reduce the learning rate and see if that already solves the problem.
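As an illustration only (the setup below assumes an SGD optimizer in the style of the torchvision detection reference scripts; the values are hypothetical, not a recommendation for your dataset):

import torch

params = [p for p in model.parameters() if p.requires_grad]
# try a learning rate an order of magnitude lower than the current one first
optimizer = torch.optim.SGD(params, lr=0.0005, momentum=0.9, weight_decay=1e-4)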

pmeier closed this as completed Aug 21, 2020
Member

fmassa commented Aug 21, 2020

@juyunsang note that we do not provide pre-trained weights for detection models with the resnext101 backbone, which might explain the issue you are facing. You might be finetuning a detection model with ResNet50 pre-trained on COCO, while training it from scratch with ResNeXt101.

Author

juyunsang commented Aug 22, 2020

@fmassa
Thank you for the reply.
I don't understand this part of your comment:

You might be finetuning a detection model with ResNet50 pre-trained on COCO, while training it from scratch with ResNeXt101

Are you saying that I should use the code below?

model = torchvision.models.detection.fasterrcnn_resnet50_fpn(pretrained=True)
in_features = model.roi_heads.box_predictor.cls_score.in_features
model.roi_heads.box_predictor = FastRCNNPredictor(in_features, num_classes)

Can you show me the code for how I can use the resnext101 backbone?

Member

fmassa commented Aug 24, 2020

@juyunsang My understanding was that you were using

model = torchvision.models.detection.fasterrcnn_resnet50_fpn(pretrained=True)
in_features = model.roi_heads.box_predictor.cls_score.in_features
model.roi_heads.box_predictor = FastRCNNPredictor(in_features, num_classes)

for ResNet50, and

backbone = resnet_fpn_backbone('resnext101_32x8d', pretrained=True)
model = FasterRCNN(backbone, num_classes)
in_features = model.roi_heads.box_predictor.cls_score.in_features
model.roi_heads.box_predictor = FastRCNNPredictor(in_features, num_classes)

for ResNeXt101. Is that the case?

@juyunsang
Author

@fmassa
Yes!
It works well with the code below.

backbone = resnet_fpn_backbone('resnet50', pretrained=True)
model = FasterRCNN(backbone, num_classes)
in_features = model.roi_heads.box_predictor.cls_score.in_features
model.roi_heads.box_predictor = FastRCNNPredictor(in_features, num_classes)

However, changing the backbone name passed to resnet_fpn_backbone to resnext101_32x8d results in a "Loss is nan" error.

fmassa reopened this Aug 25, 2020
Member

fmassa commented Aug 25, 2020

Thanks for confirming.

My first bet would be that it's an issue with the FPN, because we forgot to run the weight initialization, as discussed in #2326.

Could you check if implementing that fix could solve the issue for you?

@juyunsang
Author

@fmassa
Thank you for the reply.
I changed self.children() to self.modules(), but the error still occurs.

# initialize parameters now to avoid modifying the initialization of top_blocks
for m in self.children():
    if isinstance(m, nn.Conv2d):
        nn.init.kaiming_uniform_(m.weight, a=1)
        nn.init.constant_(m.bias, 0)

Member

fmassa commented Aug 31, 2020

Ok, I think I know what's going on. The resnext implementation might have weights which are zero in the batch norm, and we might need to set eps in

eps: float = 0.,
to be 1e-5 to match the default values in PyTorch.

This can be done by changing the norm_layer in

def resnet_fpn_backbone(backbone_name, pretrained, norm_layer=misc_nn_ops.FrozenBatchNorm2d, trainable_layers=3):
to be a lambda with eps set to 1e-5.
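A minimal sketch of that workaround, assuming resnet_fpn_backbone simply forwards norm_layer (a callable taking the channel count) to the ResNet constructor:

from torchvision.models.detection.backbone_utils import resnet_fpn_backbone
from torchvision.ops.misc import FrozenBatchNorm2d

# use a frozen BN whose eps matches BatchNorm2d's default (1e-5) instead of 0
backbone = resnet_fpn_backbone(
    'resnext101_32x8d',
    pretrained=True,
    norm_layer=lambda channels: FrozenBatchNorm2d(channels, eps=1e-5),
)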

Can you try this out and report back?

The same thing happened to @szagoruyko with WideResNets. Changing the default value of eps in FrozenBatchNorm should be considered, but it will change the results for pre-trained models, so we should check how much it affects performance.
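For context on why eps matters here: torchvision's FrozenBatchNorm2d scales each channel by weight / sqrt(running_var + eps), so a channel whose stored running_var is zero blows up to inf (and then NaN) when eps is 0. A small numerical illustration, not tied to any particular checkpoint:

import torch

running_var = torch.tensor([0.0, 0.25])  # first channel has a zero running variance
weight = torch.ones(2)

scale_eps0 = weight * (running_var + 0.0).rsqrt()   # tensor([inf, 2.])
scale_eps5 = weight * (running_var + 1e-5).rsqrt()  # tensor([~316.23, ~2.00]), finite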


dkadish commented Sep 24, 2020

I was having a somewhat similar issue and this seems to fix it.
I was trying to load the weights from a pre-trained torchvision.models.resnet50 model into a torchvision.models.detection.backbone_utils.resnet_fpn_backbone for use in a FasterRCNN model. Renaming the state dict keys (dict([('.'.join(['body'] + k.split('.')[1:]),v) for k,v in checkpoint["state_dict"].items()])) and loading it with strict=False worked, but I was getting Loss is nan, stopping training messages during training.
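Spelled out, that key-renaming step might look like the following (hypothetical; it assumes the checkpoint keys carry a single wrapper prefix that is replaced by the body. prefix the FPN backbone expects):

# e.g. 'model.layer1.0.conv1.weight' -> 'body.layer1.0.conv1.weight'
renamed = {
    '.'.join(['body'] + k.split('.')[1:]): v
    for k, v in checkpoint["state_dict"].items()
}
# strict=False because the FPN layers have no counterpart in the classification checkpoint
backbone.load_state_dict(renamed, strict=False)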

Using functools.partial as below created a model that got through training.

from functools import partial
from torchvision.models.detection.backbone_utils import resnet_fpn_backbone
from torchvision.ops.misc import FrozenBatchNorm2d

FBN = partial(FrozenBatchNorm2d, eps=1E-5)
backbone = resnet_fpn_backbone('resnet50', pretrained=True, norm_layer=FBN, trainable_layers=3).cuda()

Member

fmassa commented Sep 28, 2020

I think it might be time to think about changing the default in FrozenBatchNorm2d to more closely align with what BatchNorm in PyTorch does; we just need to check that this change doesn't affect the performance of the currently trained models in any way.
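A quick sanity check of that impact could look like the sketch below (illustrative only; the real validation would be re-running the COCO evaluation, and mutating eps on the frozen layers this way assumes eps is stored as a plain attribute):

import torch
import torchvision
from torchvision.ops.misc import FrozenBatchNorm2d

model = torchvision.models.detection.fasterrcnn_resnet50_fpn(pretrained=True).eval()
x = [torch.rand(3, 480, 640)]

with torch.no_grad():
    out_before = model(x)

for m in model.modules():
    if isinstance(m, FrozenBatchNorm2d):
        m.eps = 1e-5  # hypothetically switch to the proposed default

with torch.no_grad():
    out_after = model(x)

print(out_before[0]['scores'][:5])
print(out_after[0]['scores'][:5])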

fmassa self-assigned this Oct 10, 2020
fmassa changed the title from "resnext101 backbone faster rcnn train occur loss is Nan" to "Change default value of eps in FrozenBatchNorm to match BatchNorm" Oct 10, 2020
fmassa assigned datumbox and unassigned fmassa Oct 21, 2020
Contributor

frgfm commented Oct 21, 2020

Hi @fmassa, I just opened #2852 to tackle this!

Member

fmassa commented Oct 21, 2020

Hi @frgfm

We still need to make sure that the current pre-trained models still give correct results with the new value.

@datumbox is going to be working on ensuring that this is the case.

Contributor

frgfm commented Oct 21, 2020

No worries @fmassa! Should I split the PR (one part adding eps to __repr__ to avoid silent differences, and the other changing the default eps value)?
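For reference, surfacing eps in the representation could be as small as the sketch below (not necessarily the exact diff in #2852):

# inside torchvision.ops.misc.FrozenBatchNorm2d
def __repr__(self):
    # include eps so two otherwise-identical frozen layers with different eps print differently
    return f"{self.__class__.__name__}({self.weight.shape[0]}, eps={self.eps})"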

Member

fmassa commented Oct 21, 2020

Yes please, if you could only make the __repr__ changes in your PR it would be great.

@datumbox will be taking care of switching the default eps value in a follow-up PR

Contributor

frgfm commented Oct 21, 2020

@fmassa done!

Contributor

frgfm commented Dec 9, 2020

Just FYI @datumbox @fmassa, this is all the more beneficial when you're trying to use RCNN models in torch.float16.
I just tried on my end, and when the running_var of some BN layers gets converted to half, it underflows to zero. With eps=0 the model then yields many NaNs, while with this change it yields a valid output :)
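A tiny illustration of that underflow, assuming the frozen statistics are simply cast to half precision:

import torch

running_var = torch.tensor([1e-9])   # small but non-zero in float32
rv_half = running_var.half()         # underflows to tensor([0.], dtype=torch.float16)

print((rv_half + 0.0).rsqrt())       # inf, which turns into NaNs downstream
print((rv_half + 1e-5).rsqrt())      # ~316, stays finite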

Member

fmassa commented Dec 10, 2020

@frgfm would you say that we should make it a BC-breaking change and revert the back-compatibility fix in #2940, if the benefits of having a non-zero eps outweigh the downsides of breaking backwards compatibility?

Contributor

frgfm commented Dec 10, 2020

@fmassa I would consider the following points:

  • It prevents underflow, and thus NaNs, in FrozenBN's forward pass in float16.
  • Training an RCNN model with a pretrained backbone will start from a closer reproduction of the backbone's state from its image-classification training.
  • One downside: pretrained RCNN models will have slightly different (likely worse) performance in float32.

Seeing the results of #2933, I'm less concerned than I used to be about this last inconvenience, so I would argue the benefits do outweigh the BC downsides. But I may be missing other aspects I'm not aware of 🤷‍♂️
