mixer_b16_224 with miil pretraining #651
Conversation
@mrT23 It looks like you have used the official Mixer-B/16 model pretrained on ImageNet-1k, as your reproduced numbers closely match what we also have in our results for this model. Moreover, 76.5% accuracy corresponds to our ImageNet-1k model, and we have not yet published a pretrained ImageNet-21k model that was additionally finetuned on ImageNet. So, of course, there is no way to reproduce the 80.6% reported in the paper without finetuning our published ImageNet-21k checkpoint on ImageNet-1k. I would appreciate it if you could double-check whether you used the correct ImageNet-21k weights (https://console.cloud.google.com/storage/browser/mixer_models/imagenet21k) for the transfer learning experiments and, if not, recompute the numbers accordingly.
@akolesnikoff I will add more details: I used the official pretrained weights from Google's ImageNet-21k release: --model=mixer_b16_224_in21k (https://github.com/rwightman/pytorch-image-models/releases/download/v0.1-vitjx/jx_mixer_b16_224_in21k-617b3de2.pth). Using exactly the same script for finetuning on ImageNet-1k, initialization from the official 21k weights gave me a score of 79.7, while initialization from the miil weights gave me a score of 82.0 (extra KD training raised the score further to 82.3). On other, smaller datasets (Pascal VOC and Food-251, for example), I found it very hard to do transfer learning from the official 21k weights. Training was unstable and often "collapsed" in the middle, which is quite rare with modern code. When switching to the miil pretraining, I didn't encounter these problems at all. Both the miil and the official pretrained weights are publicly available, and you are welcome to compare them yourself on transfer learning and validate my results.
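For context, a minimal sketch of the kind of finetuning setup being compared here, using the timm API (this is not the actual training script referenced above; the optimizer and hyperparameters are placeholders):

```python
# Illustrative only: load the official ImageNet-21k Mixer weights via timm and
# reset the classifier head for ImageNet-1k finetuning. Optimizer and
# hyperparameters below are placeholders, not the settings used above.
import timm
import torch

model = timm.create_model('mixer_b16_224_in21k', pretrained=True, num_classes=1000)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4, weight_decay=0.05)
# ...standard supervised finetuning loop on ImageNet-1k would go here...
```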
Thanks for the clarifications. The instability of the ImageNet-21k results comes as a big surprise to me, because we have run a huge number of transfer learning experiments and I've never encountered numerically unstable behavior. Also, for adaptation on CIFAR-100 we consistently get ~91% accuracy (while you report 85.5). In the end, I still highly suspect that there is some bug and that you should be able to reproduce much higher numbers. I may have time to look into this myself after the NeurIPS deadline. BTW, I think your checkpoints are great and I'm glad that you've submitted them. I am only worried that the official checkpoint numbers may be misrepresented due to a subtle bug.
@akolesnikoff In general, my current experience with all-MLP models (not only Mixer) is that they are harder to transfer. Other people in my workplace reached the same conclusion independently, and there are also references to this in the literature. I think you might have nailed exactly the hyperparameters needed for Mixer transfer learning; once you deviate from them, even in a small way, you will see stability problems. p.s. 1
@mrT23 @akolesnikoff one thought here, since it can be challenging to do an apples-to-apples comparison when crossing frameworks: is it possible there is some difference in the weight init of the classifier layer for transfer between the original JAX models and PyTorch? I found a few small details off in parts of the ViT model init after digging into the details of the Flax inits for some layers...
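As an aside, if the transfer-time head init is the suspect, one way to rule it out is to make it explicit on the PyTorch side before finetuning. A sketch; whether this matches what the original Flax transfer code does is exactly the open question above:

```python
# Sketch: explicitly zero-initialize the newly created classification head so
# that any difference between the Flax and PyTorch default inits is removed
# from the comparison. Whether this matches the original JAX transfer setup is
# the question being raised above.
import timm
import torch.nn as nn

model = timm.create_model('mixer_b16_224_in21k', pretrained=True, num_classes=100)
head = model.get_classifier()  # freshly initialized nn.Linear for the new task
if isinstance(head, nn.Linear):
    nn.init.zeros_(head.weight)
    nn.init.zeros_(head.bias)
```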
@rwightman I am not using the timm code for training, just for initializing the MLP models, so training CIFAR-100 with timm on transfer learning and comparing the different initializations can offer us another sanity check. Will try to get to that.
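A sketch of that sanity check, using the model names from this PR (the finetuning loop itself is omitted):

```python
# Build the same architecture from the two pretrained initializations and
# finetune both on CIFAR-100 with an identical schedule, then compare
# validation accuracy. Training loop omitted.
import timm

official_21k = timm.create_model('mixer_b16_224_in21k', pretrained=True, num_classes=100)
miil_21k = timm.create_model('mixer_b16_224_miil_in21k', pretrained=True, num_classes=100)
# ...finetune both with the same hyperparameters and compare results...
```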
@akolesnikoff some final notes:
(Sorry for the radio silence: I was super busy with NeurIPS and then on vacation.) Thanks, this is interesting. Somehow our pretrained checkpoints seem not to be particularly compatible with Adam, at least for the setup you are using. I will investigate this when I have time.
@akolesnikoff good luck to all of us on NIPS (and ICCV) :-) |
mixer_b16_224 with miil pretraining
This pull request introduces two new pretrained variants of the mixer_b16 model:
--model=mixer_b16_224_miil
--model=mixer_b16_224_miil_in21k
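Once merged, the new entries should be usable like any other timm model. A minimal usage sketch, assuming the pretrained weights are wired up the same way as the other timm Mixer entries:

```python
# Minimal usage sketch for the two new entries added by this PR.
import timm

# ImageNet-1k model finetuned from the miil ImageNet-21k weights
model_1k = timm.create_model('mixer_b16_224_miil', pretrained=True)

# miil ImageNet-21k pretrained backbone, e.g. for transfer to a 100-class task
model_21k = timm.create_model('mixer_b16_224_miil_in21k', pretrained=True, num_classes=100)
```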
Currently, the pretrained mixer_b16_224 has an accuracy of 76.5%. (The paper says that with their ImageNet-21k pretraining it can go up to 80.6%, although I could not reproduce that; a finetuning run gave me only 79.7%.) The mixer_b16_224_miil variant, which uses miil ImageNet-21k pretraining, has an accuracy of 82.3%.
In addition, in my testing mixer_b16_224_in21k is unstable in transfer learning. With mixer_b16_224_miil_in21k, transfer learning is far more stable and gives higher scores:

(I think this is true in general - MLP models are gaining a reputation of being unstable and hard to transfer, but a lot of this stems from the pretraining quality, not from the architecture itself)