[Feature Request] Add MobileNet v3 to torchvision #1676

Closed
EliasVansteenkiste opened this issue Dec 17, 2019 · 9 comments · Fixed by #3182 or #3252

Comments

@EliasVansteenkiste

EliasVansteenkiste commented Dec 17, 2019

A new version (V3) of MobileNet has been out for a while now:
"Searching for MobileNetV3" on ArXiv

Public pytorch implementations are already available here:
https://github.com/d-li14/mobilenetv3.pytorch
https://github.com/kuan-wang/pytorch-mobilenet-v3

However, they don't reach the accuracies reported in the paper, whereas the following implementation seems to be on par with it:

https://github.com/rwightman/pytorch-image-models/blob/master/timm/models/mobilenetv3.py
https://github.com/rwightman/pytorch-image-models/blob/master/results/results-all.csv

Could MobileNetV3 be added to torchvision?

@fmassa
Member

fmassa commented Dec 19, 2019

I think having a MobileNetV3 or FBNet as more accurate mobile models would be useful to have in torchvision. I'm not sure yet which variant would be most relevant though, and that might require some more research.

@frgfm
Contributor

frgfm commented Dec 27, 2019

Happy to take a look at this one @fmassa as I have been using the implementation of rwightman for some time now.

Are there any specific requirements for PR regarding new model implementations? (I suppose, having the reference script and the trained weights available and at least on par with the original paper)

@1e100
Contributor

1e100 commented Dec 29, 2019

@fmassa I have my implementation of MobileNet V3 (including segmentation variants with dilated convs), and I have used it in practical applications (note to folks: I've just uploaded new checkpoints and models, older ones had bugs).

The problem is I'm not able to train it to the same accuracy as in the paper. The best I got is 74.78% top-1 for MobileNet V3 Large (vs 75.2% in the paper, a deficit of 0.42 points). MobileNet V3 Small did essentially reach the paper number: 67.36% (paper: 67.4%). EMA doesn't really seem to help the small model, but does help the large one.
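For readers unfamiliar with the EMA mentioned above, here is a minimal sketch of weight averaging in PyTorch. It is not the code from the linked repository; the class name `WeightEMA` and the decay value of 0.999 are assumptions chosen for illustration.

```python
import copy

import torch
from torch import nn


class WeightEMA:
    """Hypothetical helper: exponential moving average of a model's weights."""

    def __init__(self, model: nn.Module, decay: float = 0.999):
        self.decay = decay  # assumed value, tune per model
        # The averaged copy is only used for evaluation, never for backprop.
        self.shadow = copy.deepcopy(model).eval()
        for p in self.shadow.parameters():
            p.requires_grad_(False)

    @torch.no_grad()
    def update(self, model: nn.Module) -> None:
        src = model.state_dict()
        # state_dict() tensors share storage with the module, so in-place
        # updates below modify the shadow model directly.
        for name, avg in self.shadow.state_dict().items():
            if avg.dtype.is_floating_point:
                # avg = decay * avg + (1 - decay) * current
                avg.mul_(self.decay).add_(src[name].detach(), alpha=1.0 - self.decay)
            else:
                # Integer buffers (e.g. BatchNorm's num_batches_tracked) are copied.
                avg.copy_(src[name])
```

In a training loop, `update(model)` would be called after each optimizer step, and `ema.shadow` would be the model that gets evaluated and checkpointed.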

Frankly I'm not sure what else I could try. I've even implemented it in Keras before the official implementation was released. Code and checkpoints can be found here: https://github.com/1e100/mobilenet_v3

@1e100
Contributor

1e100 commented Dec 29, 2019

It's also somewhat tricky, BTW, to extract features for object detection or segmentation out of this arch (which is what I use it for). Paper suggests that they should be extracted immediately after the expansion conv that goes into the subsequent downsampling dwconv, which happens to be inside the block, and not at the block boundary. In TF that's not an issue, you can just get an output tensor directly, but in PyTorch this is problematic, since there's no static graph that one could look up nodes in.

The way people typically use such nets for backbones is they break up the Sequential into chunks, and feed the output of one chunk into another, and also into the head. This is not really doable here.

So I ended up retaining a var on each forward() in blocks where there's downsampling.
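A minimal sketch of that "retain a variable on forward()" approach, assuming a simplified block (no SE, no residual connection). The class and attribute names here are hypothetical and not taken from the linked repository.

```python
import torch
from torch import nn


class InvertedResidual(nn.Module):
    """Simplified block: 1x1 expand -> 3x3 depthwise -> 1x1 project."""

    def __init__(self, in_ch: int, exp_ch: int, out_ch: int, stride: int):
        super().__init__()
        self.expand = nn.Sequential(
            nn.Conv2d(in_ch, exp_ch, 1, bias=False),
            nn.BatchNorm2d(exp_ch),
            nn.Hardswish(),
        )
        self.depthwise = nn.Sequential(
            nn.Conv2d(exp_ch, exp_ch, 3, stride=stride, padding=1,
                      groups=exp_ch, bias=False),
            nn.BatchNorm2d(exp_ch),
            nn.Hardswish(),
        )
        self.project = nn.Sequential(
            nn.Conv2d(exp_ch, out_ch, 1, bias=False),
            nn.BatchNorm2d(out_ch),
        )
        self.downsamples = stride > 1
        self.expanded = None  # refreshed on every forward() of a downsampling block

    def forward(self, x):
        x = self.expand(x)
        if self.downsamples:
            # Keep the full-resolution expanded tensor for the detection/segmentation head.
            self.expanded = x
        return self.project(self.depthwise(x))
```

After a forward pass through the backbone, the head reads `block.expanded` from each downsampling block. The retained tensor keeps its autograd graph, so gradients from the head still flow back into the backbone.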

Having said all of that, if anything is to be added here I'd consider adding EfficientNet instead, in order to eventually add EfficientDet, which has achieved state of the art detection results across the full range of efficiencies, even beating out Retina. It's also scalable with a few parameters.

https://arxiv.org/pdf/1911.09070.pdf

It's almost certainly a bear to train, though.

@rwightman
Contributor

As mentioned in #980, I'm open to adding my impl and TF ported or PyTorch trained (no weird padding) weights I have so far with guidance from @fmassa as to what is wanted.

I reproduced MobileNetV3 training for an early interpretation of the paper that had a few details wrong (head conv bias, rounding of SE channels, SE act fn). I've since fixed my implementation so it exactly matches the official TF Slim release that came out a few months back. I'm pretty sure I can reproduce good training results with that, but I wouldn't be able to reproduce them here without adding to the training code and using my RMSProp variant.
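To illustrate why channel rounding is an easy detail to get wrong, here is a sketch of the rounding helper widely used in MobileNet code. The divisor of 8 and the reduction ratio of 4 in the usage lines are assumptions for illustration, not a statement of what any particular implementation in this thread does.

```python
def make_divisible(value: float, divisor: int = 8, min_value=None) -> int:
    """Round `value` to the nearest multiple of `divisor`, at least `min_value`."""
    if min_value is None:
        min_value = divisor
    rounded = max(min_value, int(value + divisor / 2) // divisor * divisor)
    # Do not let rounding shrink the channel count by more than 10%.
    if rounded < 0.9 * value:
        rounded += divisor
    return rounded


# Squeeze channels for an SE block on 72 expanded channels, assumed ratio 4:
naive = 72 // 4                     # plain integer division
rounded = make_divisible(72 / 4)    # rounded to a multiple of 8
```

Here the two rules give 18 and 24 squeeze channels, so a checkpoint trained under one rule will not load into a model built with the other.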

@rwightman
Contributor

@1e100 I've implemented a hook based feature extractor for these networks, see comment in #980. Goal is to get it hooked up in an EfficientDet and integrated with other obj/seg frameworks at some point.
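A rough sketch of what a hook-based extractor can look like, using `register_forward_hook`. This is not the implementation referenced in #980; the `HookedFeatures` wrapper and its method names are hypothetical.

```python
import torch
from torch import nn


class HookedFeatures:
    """Hypothetical wrapper: captures outputs of named submodules via forward hooks."""

    def __init__(self, model: nn.Module, layer_names):
        self.model = model
        self.features = {}
        modules = dict(model.named_modules())
        self.handles = [
            modules[name].register_forward_hook(self._make_hook(name))
            for name in layer_names
        ]

    def _make_hook(self, name):
        def hook(module, inputs, output):
            # Store the output of the hooked module under its qualified name.
            self.features[name] = output
        return hook

    def __call__(self, x):
        self.features.clear()
        out = self.model(x)
        return out, dict(self.features)

    def remove(self):
        # Detach all hooks so the wrapped model behaves normally again.
        for handle in self.handles:
            handle.remove()
```

Because hooks attach to any named submodule, this can tap the expansion conv inside a block without splitting the model's `Sequential` into chunks.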

@1e100
Contributor

1e100 commented Dec 30, 2019

That'd be a very worthwhile addition IMO, especially if coupled with training know-how. These architectures are pretty drastically overfitted to the specific implementations of optimizers, regularization, augmentation, etc. All of which is different in PyTorch by default.

@1e100
Contributor

1e100 commented Jan 13, 2020

Another tidbit for someone interested in practical applications of MNV3: it looks like disabling biases on the convolutions in the SE block improves detection mAP a bit. That is probably why the detection model Google released does not have biases there. I discovered this accidentally, and then confirmed it by viewing the "official" detection model in Netron.

Classifiers benefit quite heavily from those biases though, to the tune of 1 point of top1. I'm not sure where that leaves someone who would like to use a pretrained checkpoint to build a detection model. The "best" checkpoint for that won't be the "best" classifier.
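To make the checkpoint mismatch concrete, here is a sketch of an SE block with a switch for the conv biases. The class name, attribute names, and the ReLU / hard-sigmoid activations are assumptions for illustration, not the code from any implementation discussed here.

```python
import torch
from torch import nn


class SqueezeExcite(nn.Module):
    """Hypothetical SE block whose 1x1 conv biases can be turned off."""

    def __init__(self, channels: int, squeeze_channels: int, bias: bool = True):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.fc1 = nn.Conv2d(channels, squeeze_channels, 1, bias=bias)
        self.fc2 = nn.Conv2d(squeeze_channels, channels, 1, bias=bias)
        self.act = nn.ReLU()
        self.gate = nn.Hardsigmoid()

    def forward(self, x):
        # Per-channel gate computed from globally pooled features.
        scale = self.gate(self.fc2(self.act(self.fc1(self.pool(x)))))
        return x * scale


# Classifier-style block (with biases) vs. detection-style block (without).
se_cls = SqueezeExcite(16, 8, bias=True)
se_det = SqueezeExcite(16, 8, bias=False)

# Loading classifier weights into the bias-free block drops the bias tensors.
result = se_det.load_state_dict(se_cls.state_dict(), strict=False)
print(result.unexpected_keys)
```

With `strict=False` the load succeeds and the bias tensors are reported as unexpected keys, so a detection model initialized this way starts from weights that were trained with biases it no longer has.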

@fmassa
Member

fmassa commented Jan 14, 2020

Hi all,

Sorry for the delay in replying, was on holidays and then was busy with a few other things.

I think a hook-based implementation for detection, as mentioned by @rwightman, would be a good way of doing it, but that should be done independently of adding the classification model.
Another thing that could be done in the future is what I've proposed in pytorch/pytorch#21064 (not yet available in PyTorch though).

@rwightman how different would your training implementation for reproducing MobileNetV3 results be from what we currently have in the torchvision reference scripts?
