
mixer_b16_224 with miil pretraining #651


Merged: 4 commits into huggingface:master on May 20, 2021

Conversation

@mrT23 (Contributor) commented May 20, 2021

This pull request introduces 2 new pretraining options for the mixer_b16 model:
--model=mixer_b16_224_miil
--model=mixer_b16_224_miil_in21k
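
For clarity, a minimal usage sketch of the new entries (assuming a timm build that includes this PR; the variable names are just for illustration):

```python
import timm

# miil ImageNet-21K pretraining, finetuned on ImageNet-1K (1000-class head)
model_1k = timm.create_model('mixer_b16_224_miil', pretrained=True)

# miil ImageNet-21K pretrained weights, intended as a starting point for transfer learning
model_21k = timm.create_model('mixer_b16_224_miil_in21k', pretrained=True)
```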

Currently, the pretrained mixer_b16_224 has an accuracy of 76.5%. (The article says that with their ImageNet-21K pretraining it can reach 80.6%, although I could not reproduce that; a finetuning run gave me only 79.7%.) The mixer_b16_224_miil model, which uses miil ImageNet-21K pretraining, has an accuracy of 82.3%.

In addition, in my testing mixer_b16_224_in21k is unstable in transfer learning. With mixer_b16_224_miil_in21k, transfer learning is far more stable and gives higher scores:
[image: transfer-learning comparison results]
(I think this is true in general: MLP models are gaining a reputation for being unstable and hard to transfer, but a lot of this stems from the pretraining quality, not from the architecture itself.)

@rwightman merged commit b4ebf92 into huggingface:master on May 20, 2021
@akolesnikoff commented May 20, 2021

@mrT23 It looks like you have used the official mixer-B/16 model pretrained on ImageNet-1k, as your reproduced numbers closely match what we also have in our results for this model. Moreover, 76.5% accuracy corresponds to our ImageNet-1k model, and we have not yet published a pretrained ImageNet-21k model that was additionally finetuned on ImageNet. So, of course, there is no way to reproduce the 80.6% reported in the paper without finetuning our published ImageNet-21k checkpoint on ImageNet-1k.

I would appreciate it if you could double-check whether you used the correct ImageNet-21k weights (https://console.cloud.google.com/storage/browser/mixer_models/imagenet21k) for the transfer learning experiments and, if not, recompute the numbers accordingly.

@mrT23 (Contributor, Author) commented May 21, 2021

@akolesnikoff I will add more details:

I used the official pretrained weights from Google's ImageNet-21K: --model=mixer_b16_224_in21k (https://github.com/rwightman/pytorch-image-models/releases/download/v0.1-vitjx/jx_mixer_b16_224_in21k-617b3de2.pth).
I compared these weights to weights from miil pretraining on ImageNet-21K (--model=mixer_b16_224_miil_in21k).

When using exactly the same script for finetuning on ImageNet-1K, initialization from the official 21K pretrained weights gave me a score of 79.7, and initialization from the miil weights gave me a score of 82.0 (extra KD training raised the score further to 82.3).
Probably with longer training and dedicated tricks I could come closer to the reported score (80.6), but the relative comparison still strongly favors miil pretraining.
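
For reference, a rough sketch of the two initializations being compared (this is not the actual finetuning script; num_classes=1000 simply attaches a fresh ImageNet-1K head to each 21K-pretrained backbone):

```python
import timm

# Official Google ImageNet-21K pretraining
official_21k = timm.create_model('mixer_b16_224_in21k', pretrained=True, num_classes=1000)

# miil ImageNet-21K pretraining
miil_21k = timm.create_model('mixer_b16_224_miil_in21k', pretrained=True, num_classes=1000)

# ...the rest of the finetuning pipeline (data, optimizer, schedule) is kept identical for both runs.
```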

On other, smaller datasets (pascal-voc and food-251, for example), I found it very hard to do transfer learning from the official 21K weights. Training was unstable and often "collapsed" in the middle, which is quite rare with modern code. When switching to miil pretraining, I did not encounter these problems at all.

Both the miil and the official pretrained weights are publicly available, and you are welcome to compare them yourself on transfer learning and validate my results.

@akolesnikoff

Thanks for the clarifications. The instability of the ImageNet-21k results comes as a big surprise to me, because we have run a huge number of transfer learning experiments and I have never encountered numerically unstable behavior. And for adaptation on CIFAR-100 we consistently get ~91% accuracy (while you report 85.5%).

In the end, I still highly suspect that there is some bug and you should be able to reproduce much higher numbers. I may have time to look into this myself after the NeurIPS deadline.

BTW, I think your checkpoints are great and I am glad that you've submitted them. I am only worried that the official checkpoint numbers may be misrepresented due to a subtle bug.

@mrT23 (Contributor, Author) commented May 21, 2021

@akolesnikoff
Thanks for the response.

In general, my current experience with all-MLP models (not only mixer) is that they are harder to transfer. Other people at my workplace reached the same conclusion independently, and there are also references to this in the literature:
https://arxiv.org/pdf/2104.02057.pdf
https://arxiv.org/pdf/2004.08249.pdf
https://arxiv.org/abs/2006.04884

I think you might have nailed exactly the hyper-parameters needed for mixer model transfer learning. Once you deviate from these hyper-parameters, even in a small way, you will see stability problems.
An example:
with adamw and wd=1e-4, miil pretraining transfers well, while the official pretraining is unstable.
Only when using a higher wd was I able to get the mixer model from the official pretraining to converge.
So in my tests, miil pretraining was less sensitive to wd values in transfer learning.
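
To make the sensitivity concrete, here is a sketch of the kind of setup described above (the learning rate and the 20-class head are my own placeholders, not values from this thread):

```python
import timm
import torch

# 21K-pretrained backbone with a small transfer-learning head (e.g. pascal-voc-sized)
model = timm.create_model('mixer_b16_224_miil_in21k', pretrained=True, num_classes=20)

# wd=1e-4 transfers well from the miil weights; with the official in21k weights the same
# setting was unstable in my runs, and only a higher wd (e.g. 1e-2) converged.
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4, weight_decay=1e-4)
```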

P.S. 1
All my runs are comparative, so even if my hyper-parameters are not as good as yours (and this is probably the case), the relative comparison is still quite valid.
P.S. 2
On ViT-B-16 I was able to get 94.2% on CIFAR-100, higher than the accuracy reported in the article :-)

@rwightman (Collaborator) commented May 21, 2021

@mrT23 @akolesnikoff One thought here, since it can be challenging to do an apples-to-apples comparison when crossing frameworks: is it possible there is any difference in the weight init of the classifier layer for transfer between the original JAX models and PyTorch? I found a few small details off in parts of the ViT model init after digging into the details of the Flax inits for some layers...
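
For illustration only, one way such an init mismatch could be probed: if I recall correctly, the upstream Flax code zero-initializes the classification head, while a freshly attached torch.nn.Linear uses PyTorch's default uniform init. A sketch of forcing a zero head init before transfer (an assumption to test, not a confirmed bug):

```python
import timm
import torch.nn as nn

# 21K-pretrained backbone with a fresh 100-class head (e.g. for CIFAR-100 transfer)
model = timm.create_model('mixer_b16_224_in21k', pretrained=True, num_classes=100)

# Force a Flax-style zero init on the new classifier head before finetuning
nn.init.zeros_(model.head.weight)
nn.init.zeros_(model.head.bias)
```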

@mrT23 (Contributor, Author) commented May 22, 2021

@rwightman
I don't know. One thing I validated is that I can reproduce the ImageNet-1K score (76.5), both with regular inference and with a small finetuning of 5 epochs, when initializing from --model=mixer_b16_224.
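
A minimal sketch of that inference sanity check (the validation path and batch size are assumptions; timm's own validate.py script does the same thing more thoroughly):

```python
import timm
import torch
from timm.data import resolve_data_config, create_transform
from torchvision.datasets import ImageFolder
from torch.utils.data import DataLoader

model = timm.create_model('mixer_b16_224', pretrained=True).eval().cuda()
cfg = resolve_data_config({}, model=model)
transform = create_transform(**cfg)  # eval transform matching the model's pretrained config

dataset = ImageFolder('/path/to/imagenet/val', transform=transform)  # hypothetical path
loader = DataLoader(dataset, batch_size=64, num_workers=8)

correct = total = 0
with torch.no_grad():
    for images, targets in loader:
        preds = model(images.cuda()).argmax(dim=1).cpu()
        correct += (preds == targets).sum().item()
        total += targets.numel()
print(f'top-1: {100.0 * correct / total:.1f}%')  # expected around 76.5 for these weights
```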

I am not using the timm code for training, just for initializing MLP models, so training CIFAR-100 with timm for transfer learning and comparing different initializations could offer another sanity check. I will try to get to that.

@mrT23 (Contributor, Author) commented May 23, 2021

@akolesnikoff
The pull request results (from a private repo) can be fully reproduced using the timm package. See the reproduction code for CIFAR-100 here.
[image: CIFAR-100 reproduction results]

Some final notes:

  1. Using SGD, the official pretraining reaches the article's results (~91.0%), so there is no problem with the weight loading @rwightman
  2. miil pretraining is more stable across different optimizers and learning rates than the official pretraining, with no big score drop (see the sketch below).
  3. Even with the optimal hyper-parameters (SGD), miil pretraining provides better results.
  4. For other datasets, the instability and sensitivity to the transfer learning parameters is even more pronounced (pascal-voc, for example).
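
For what it's worth, a rough sketch of the comparison setup behind these notes (the hyper-parameter values here are placeholders, not the exact ones used for the numbers above):

```python
import timm
import torch

def make_setup(model_name, use_sgd):
    """Create a 21K-pretrained Mixer with a CIFAR-100 head and an optimizer."""
    model = timm.create_model(model_name, pretrained=True, num_classes=100)
    if use_sgd:
        opt = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9, weight_decay=1e-4)
    else:
        opt = torch.optim.AdamW(model.parameters(), lr=1e-4, weight_decay=1e-4)
    return model, opt

# Note 1: with SGD, the official weights recover the article's ~91% on CIFAR-100.
official_sgd = make_setup('mixer_b16_224_in21k', use_sgd=True)
# Note 2: switching optimizers mainly hurts the official weights, not the miil ones.
miil_adamw = make_setup('mixer_b16_224_miil_in21k', use_sgd=False)
```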

@akolesnikoff

(Sorry for the radio silence: I was super busy with NeurIPS and then on vacation.)

Thanks, this is interesting. Somehow our pre-trained checkpoints do not seem to be really compatible with Adam, at least for the setup you are using. I will investigate this when I have time.

@mrT23 (Contributor, Author) commented Jun 10, 2021

@akolesnikoff
Sure.
If your tests show a different trend and you think my results don't fully reflect the pretraining quality, please let me know.

Good luck to all of us with NeurIPS (and ICCV) :-)
