Change default value of eps in FrozenBatchNorm to match BatchNorm #2599
Hi @juyunsang, as our template states:
Without knowing your data it's hard to tell what is going wrong. I'm assuming your data is not corrupt, since the other models are working. Thus, this might be a hyper-parameter problem. The loss in the first step seems fairly large (~ …)
@juyunsang note that we do not provide pre-trained weights for detection models with the resnext101 backbone, which might explain the issue you are facing. You might be fine-tuning a detection model with ResNet50 pre-trained on COCO, while training it from scratch with ResNeXt101.
@fmassa
Can you show me the code for how I can use the resnext101 backbone?
@juyunsang My understanding was that you were using
```python
model = torchvision.models.detection.fasterrcnn_resnet50_fpn(pretrained=True)
in_features = model.roi_heads.box_predictor.cls_score.in_features
model.roi_heads.box_predictor = FastRCNNPredictor(in_features, num_classes)
```
for ResNet50, and
```python
backbone = resnet_fpn_backbone('resnext101_32x8d', pretrained=True)
model = FasterRCNN(backbone, num_classes)
in_features = model.roi_heads.box_predictor.cls_score.in_features
model.roi_heads.box_predictor = FastRCNNPredictor(in_features, num_classes)
```
for ResNeXt101. Is that the case?
@fmassa
However, changing the backbone argument of the resnet_fpn_backbone function to resnext101_32x8d results in a "Loss is NaN" error.
Thanks for confirming. My first bet would be that it's an issue with the FPN, because we forgot to run the weight initialization, as discussed in #2326. Could you check whether implementing that fix solves the issue for you?
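For context, the FPN weight-initialization fix discussed in #2326 amounts to initializing the FPN's convolutions explicitly rather than relying on nn.Conv2d defaults. A minimal sketch of that idea (the helper name `init_fpn_convs` is assumed for illustration, this is not the exact torchvision code):

```python
import torch
import torch.nn as nn

def init_fpn_convs(fpn: nn.Module) -> None:
    # Re-initialize every conv in the FPN: Kaiming-uniform weights
    # (with a=1, as in the torchvision fix) and zeroed biases.
    for m in fpn.modules():
        if isinstance(m, nn.Conv2d):
            nn.init.kaiming_uniform_(m.weight, a=1)
            if m.bias is not None:
                nn.init.zeros_(m.bias)
```

Calling such a helper on the FeaturePyramidNetwork right after construction would replace the default initialization of the lateral and output convs.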
@fmassa vision/torchvision/ops/feature_pyramid_network.py, lines 59 to 63 in c2e8a00
Ok, I think I know what's going on. The resnext implementation might have weights which are zero in the batch norm, and we might need to set the default eps in FrozenBatchNorm2d (vision/torchvision/ops/misc.py, line 54 in 497744b) to 1e-5 to match the default values in PyTorch.
Can you try this out and report back? The same thing happened to @szagoruyko with WideResNets. Changing the default value of eps in FrozenBatchNorm should be taken into account, but it will change the results for pre-trained models, so we should check how much it affects performance.
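To see why eps matters here: frozen batch norm scales activations by 1/sqrt(running_var + eps), so a channel whose stored variance is exactly zero blows up when eps is 0. A toy illustration of the failure mode (a sketch of the math, not the torchvision implementation):

```python
import torch

def frozen_bn(x, weight, bias, running_mean, running_var, eps):
    # Frozen BN is a pure affine transform; no statistics are updated.
    scale = weight * (running_var + eps).rsqrt()
    shift = bias - running_mean * scale
    return x * scale + shift

x = torch.ones(3)
w = torch.ones(3)
b = torch.zeros(3)
mean = torch.zeros(3)
var = torch.tensor([1.0, 0.5, 0.0])  # last channel has zero variance

bad = frozen_bn(x, w, b, mean, var, eps=0.0)
good = frozen_bn(x, w, b, mean, var, eps=1e-5)
print(torch.isfinite(bad).all().item())   # → False: zero-variance channel produces inf/NaN
print(torch.isfinite(good).all().item())  # → True: eps keeps the rsqrt finite
```

Once a NaN like this enters the backbone features, every downstream loss term becomes NaN, which matches the reported symptom.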
I was having a somewhat similar issue and this seems to fix it. Using …
I think it might be time to think about changing the default in FrozenBatchNorm.
No worries @fmassa! Should I split the PR (one adding eps to …)?
Yes please, if you could make only the first part of the change. @datumbox will be taking care of switching the default eps value in a follow-up PR.
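The first part of that split, exposing eps as a constructor argument while keeping the old default, might look roughly like this (a sketch under the assumption that the old behaviour corresponds to eps=0.0; not the merged torchvision code):

```python
import torch
import torch.nn as nn

class FrozenBatchNorm2d(nn.Module):
    """BatchNorm2d with frozen statistics and affine parameters.

    eps is a constructor argument: 0.0 preserves the previous behaviour,
    while passing 1e-5 matches nn.BatchNorm2d's default.
    """

    def __init__(self, num_features: int, eps: float = 0.0):
        super().__init__()
        self.eps = eps
        # Buffers, not Parameters: these are never updated by the optimizer.
        self.register_buffer("weight", torch.ones(num_features))
        self.register_buffer("bias", torch.zeros(num_features))
        self.register_buffer("running_mean", torch.zeros(num_features))
        self.register_buffer("running_var", torch.ones(num_features))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Reshape to (1, C, 1, 1) so the affine transform broadcasts over N, H, W.
        w = self.weight.reshape(1, -1, 1, 1)
        b = self.bias.reshape(1, -1, 1, 1)
        rv = self.running_var.reshape(1, -1, 1, 1)
        rm = self.running_mean.reshape(1, -1, 1, 1)
        scale = w * (rv + self.eps).rsqrt()
        return x * scale + (b - rm * scale)
```

The follow-up PR mentioned above would then only need to flip the default from 0.0 to 1e-5.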
@fmassa done!
Just FYI @datumbox @fmassa, this is all the more beneficial when you're trying to use RCNN models in …
@fmassa I would consider the following points: …
Seeing the results of #2933, I'm less concerned than I used to be about this last inconvenience. So I would argue the benefits do outweigh the BC downsides. But I may be missing other aspects I'm not aware of 🤷‍♂️
❓ Questions and Help
Hello,
A "Loss is nan" error occurs when I train Faster R-CNN with a resnext101 backbone.
My code is as follows:
Error message:
When I change the backbone to resnet50 or resnet152, no error occurs.
Please note that this issue tracker is not a help forum and this issue will be closed.
We have a set of listed resources available on the website. Our primary means of support is our discussion forum: