
mixer_b16_224 with miil pretraining #651


Merged: 4 commits into huggingface:master on May 20, 2021

Conversation

@mrT23 (Contributor) commented May 20, 2021

This pull request introduces 2 new pretraining options for the mixer_b16 model:
--model=mixer_b16_224_miil
--model=mixer_b16_224_miil_in21k
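
For clarity, a minimal usage sketch of the new entries (assuming a timm build that includes this PR; the variable names are just for illustration):

```python
import timm

# miil ImageNet-21K pretraining, finetuned on ImageNet-1K (1000-class head)
model_1k = timm.create_model('mixer_b16_224_miil', pretrained=True)

# miil ImageNet-21K pretrained weights, intended as a starting point for transfer learning
model_21k = timm.create_model('mixer_b16_224_miil_in21k', pretrained=True)
```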

Currently, the pretrained mixer_b16_224 has an accuracy of 76.5%. (The article says that with their ImageNet-21K pretraining it can reach 80.6%, although I could not reproduce that; a finetuning run gave me only 79.7%.) The mixer_b16_224_miil model, which uses miil ImageNet-21K pretraining, has an accuracy of 82.3%.

In addition, in my testing mixer_b16_224_in21k is unstable in transfer learning. With mixer_b16_224_miil_in21k, transfer learning is far more stable and gives higher scores:
[image: transfer-learning comparison results]
(I think this is true in general: MLP models are gaining a reputation for being unstable and hard to transfer, but a lot of this stems from the pretraining quality, not from the architecture itself.)

@rwightman merged commit b4ebf92 into huggingface:master on May 20, 2021
@akolesnikoff commented May 20, 2021

@mrT23 It looks like you have used the official mixer-B/16 model pretrained on ImageNet-1k, as your reproduced numbers closely match what we also have in our results for this model. Moreover, 76.5% accuracy corresponds to our ImageNet-1k model, and we have not yet published a pretrained ImageNet-21k model that was additionally finetuned on ImageNet. So, of course, there is no way to reproduce the 80.6% reported in the paper without finetuning our published ImageNet-21k checkpoint on ImageNet-1k.

I would appreciate it if you could double-check whether you used the correct ImageNet-21k weights (https://console.cloud.google.com/storage/browser/mixer_models/imagenet21k) for the transfer learning experiments and, if not, recompute the numbers accordingly.

@mrT23 (Contributor, Author) commented May 21, 2021

@akolesnikoff I will add more details:

I used the official pretrained weights from Google's ImageNet-21K: --model=mixer_b16_224_in21k (https://github.com/rwightman/pytorch-image-models/releases/download/v0.1-vitjx/jx_mixer_b16_224_in21k-617b3de2.pth).
I compared these weights to weights from miil pretraining on ImageNet-21K (--model=mixer_b16_224_miil_in21k).

When using exactly the same script for finetuning on ImageNet-1K, initialization from the official 21K pretrained weights gave me a score of 79.7, and initialization from the miil weights gave me a score of 82.0 (extra KD training raised the score further to 82.3).
Probably with longer training and dedicated tricks I could come closer to the reported score (80.6), but the relative comparison still strongly favors miil pretraining.
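
For reference, a rough sketch of the two initializations being compared (this is not the actual finetuning script; num_classes=1000 simply attaches a fresh ImageNet-1K head to each 21K-pretrained backbone):

```python
import timm

# Official Google ImageNet-21K pretraining
official_21k = timm.create_model('mixer_b16_224_in21k', pretrained=True, num_classes=1000)

# miil ImageNet-21K pretraining
miil_21k = timm.create_model('mixer_b16_224_miil_in21k', pretrained=True, num_classes=1000)

# ...the rest of the finetuning pipeline (data, optimizer, schedule) is kept identical for both runs.
```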

On other, smaller datasets (pascal-voc and food-251, for example), I found it very hard to do transfer learning from the official 21K weights. Training was unstable and often "collapsed" in the middle, which is quite rare with modern code. When switching to miil pretraining, I did not encounter these problems at all.

Both the miil and the official pretrained weights are publicly available, and you are welcome to compare them yourself on transfer learning and validate my results.

@akolesnikoff

Thanks for the clarifications. The instability of the ImageNet-21k results comes as a big surprise to me, because we have run a huge number of transfer learning experiments and I have never encountered numerically unstable behavior. And for adaptation on CIFAR-100 we consistently get ~91% accuracy (while you report 85.5%).

In the end, I still highly suspect that there is some bug and you should be able to reproduce much higher numbers. I may have time to look into this myself after the NeurIPS deadline.

BTW, I think your checkpoints are great and I am glad that you've submitted them. I am only worried that the official checkpoint numbers may be misrepresented due to a subtle bug.

@mrT23 (Contributor, Author) commented May 21, 2021

@akolesnikoff
Thanks for the response.

In general, my current experience with all-MLP models (not only mixer) is that they are harder to transfer. Other people at my workplace reached the same conclusion independently, and there are also references to this in the literature:
https://arxiv.org/pdf/2104.02057.pdf
https://arxiv.org/pdf/2004.08249.pdf
https://arxiv.org/abs/2006.04884

I think you might have nailed exactly the hyper-parameters needed for mixer model transfer learning. Once you deviate from these hyper-parameters, even in a small way, you will see stability problems.
An example:
with adamw and wd=1e-4, miil pretraining transfers well, while the official pretraining is unstable.
Only when using a higher wd was I able to get the mixer model from the official pretraining to converge.
So in my tests, miil pretraining was less sensitive to wd values in transfer learning.
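
To make the sensitivity concrete, here is a sketch of the kind of setup described above (the learning rate and the 20-class head are my own placeholders, not values from this thread):

```python
import timm
import torch

# 21K-pretrained backbone with a small transfer-learning head (e.g. pascal-voc-sized)
model = timm.create_model('mixer_b16_224_miil_in21k', pretrained=True, num_classes=20)

# wd=1e-4 transfers well from the miil weights; with the official in21k weights the same
# setting was unstable in my runs, and only a higher wd (e.g. 1e-2) converged.
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4, weight_decay=1e-4)
```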

P.S. 1
All my runs are comparative, so even if my hyper-parameters are not as good as yours (and this is probably the case), the relative comparison is still quite valid.
P.S. 2
On ViT-B-16 I was able to get 94.2% on CIFAR-100, higher than the accuracy reported in the article :-)

@rwightman (Collaborator) commented May 21, 2021

@mrT23 @akolesnikoff One thought here, since it can be challenging to do an apples-to-apples comparison when crossing frameworks: is it possible there is any difference in the weight init of the classifier layer for transfer between the original JAX models and PyTorch? I found a few small details off in parts of the ViT model init after digging into the details of the Flax inits for some layers...
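
For illustration only, one way such an init mismatch could be probed: if I recall correctly, the upstream Flax code zero-initializes the classification head, while a freshly attached torch.nn.Linear uses PyTorch's default uniform init. A sketch of forcing a zero head init before transfer (an assumption to test, not a confirmed bug):

```python
import timm
import torch.nn as nn

# 21K-pretrained backbone with a fresh 100-class head (e.g. for CIFAR-100 transfer)
model = timm.create_model('mixer_b16_224_in21k', pretrained=True, num_classes=100)

# Force a Flax-style zero init on the new classifier head before finetuning
nn.init.zeros_(model.head.weight)
nn.init.zeros_(model.head.bias)
```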

@mrT23 (Contributor, Author) commented May 22, 2021

@rwightman
I don't know. One thing I validated is that I can reproduce the ImageNet-1K score (76.5), both with regular inference and with a small finetuning of 5 epochs, when initializing from --model=mixer_b16_224.
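
A minimal sketch of that inference sanity check (the validation path and batch size are assumptions; timm's own validate.py script does the same thing more thoroughly):

```python
import timm
import torch
from timm.data import resolve_data_config, create_transform
from torchvision.datasets import ImageFolder
from torch.utils.data import DataLoader

model = timm.create_model('mixer_b16_224', pretrained=True).eval().cuda()
cfg = resolve_data_config({}, model=model)
transform = create_transform(**cfg)  # eval transform matching the model's pretrained config

dataset = ImageFolder('/path/to/imagenet/val', transform=transform)  # hypothetical path
loader = DataLoader(dataset, batch_size=64, num_workers=8)

correct = total = 0
with torch.no_grad():
    for images, targets in loader:
        preds = model(images.cuda()).argmax(dim=1).cpu()
        correct += (preds == targets).sum().item()
        total += targets.numel()
print(f'top-1: {100.0 * correct / total:.1f}%')  # expected around 76.5 for these weights
```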

I am not using the timm code for training, just for initializing MLP models, so training CIFAR-100 with timm for transfer learning and comparing different initializations could offer another sanity check. I will try to get to that.

@mrT23 (Contributor, Author) commented May 23, 2021

@akolesnikoff
The pull request results (from a private repo) can be fully reproduced using the timm package. See the reproduction code for CIFAR-100 here.
[image: CIFAR-100 reproduction results]

Some final notes:

  1. Using SGD, the official pretraining reaches the article's results (~91.0%), so there is no problem with the weight loading @rwightman
  2. miil pretraining is more stable across different optimizers and learning rates than the official pretraining, with no big score drop (see the sketch below).
  3. Even with the optimal hyper-parameters (SGD), miil pretraining provides better results.
  4. For other datasets, the instability and sensitivity to the transfer learning parameters is even more pronounced (pascal-voc, for example).
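
For what it's worth, a rough sketch of the comparison setup behind these notes (the hyper-parameter values here are placeholders, not the exact ones used for the numbers above):

```python
import timm
import torch

def make_setup(model_name, use_sgd):
    """Create a 21K-pretrained Mixer with a CIFAR-100 head and an optimizer."""
    model = timm.create_model(model_name, pretrained=True, num_classes=100)
    if use_sgd:
        opt = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9, weight_decay=1e-4)
    else:
        opt = torch.optim.AdamW(model.parameters(), lr=1e-4, weight_decay=1e-4)
    return model, opt

# Note 1: with SGD, the official weights recover the article's ~91% on CIFAR-100.
official_sgd = make_setup('mixer_b16_224_in21k', use_sgd=True)
# Note 2: switching optimizers mainly hurts the official weights, not the miil ones.
miil_adamw = make_setup('mixer_b16_224_miil_in21k', use_sgd=False)
```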

@akolesnikoff

(Sorry for the radio silence: I was super busy with NeurIPS and then on vacation.)

Thanks, this is interesting. Somehow our pre-trained checkpoints do not seem to be really compatible with Adam, at least for the setup you are using. I will investigate this when I have time.

@mrT23 (Contributor, Author) commented Jun 10, 2021

@akolesnikoff
Sure.
If your tests show a different trend and you think my results don't fully reflect the pretraining quality, please let me know.

Good luck to all of us with NeurIPS (and ICCV) :-)
