Official Implementation of our paper "Hard Patches Mining for Masked Image Modeling", in CVPR 2023.
by Haochen Wang, Kaiyou Song, Junsong Fan, Yuxi Wang, Jin Xie, and Zhaoxiang Zhang
🔔 🔔 🔔 An extension of this paper is now available on [arXiv], where we successfully adapted HPM to masked video modeling benchmarks with almost no modifications! The code will be released soon.
- This repo is a modification of the MAE repo. Installation and preparation follow that repo.
- This repo is based on `timm==0.3.2`, for which a fix is needed to work with PyTorch 1.8.1+.
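The fix itself is not spelled out here. As a hedged sketch, the commonly reported incompatibility is that `timm==0.3.2` imports `container_abcs` from `torch._six`, which newer PyTorch versions no longer provide. The helper below is hypothetical (the name `patch_helpers_source` and the exact import strings are assumptions, not part of this repo); it only illustrates the kind of one-line substitution typically applied to timm's `helpers.py`. Check the MAE repo's instructions before changing an installed package.

```python
# Hypothetical helper: rewrites the import line that is commonly reported
# to break timm==0.3.2 on newer PyTorch. The import strings are assumptions.
OLD_IMPORT = "from torch._six import container_abcs"
NEW_IMPORT = "import collections.abc as container_abcs"


def patch_helpers_source(source: str) -> str:
    """Return `source` with the outdated import replaced, if it is present."""
    return source.replace(OLD_IMPORT, NEW_IMPORT)


if __name__ == "__main__":
    before = "from torch._six import container_abcs\nfrom itertools import repeat\n"
    print(patch_helpers_source(before))
```

The function works on a string rather than a file so it can be tried safely before touching anything inside `site-packages`.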
This repo is the official implementation of Hard Patches Mining for Masked Image Modeling. It includes code and models for the following tasks:
ImageNet-1K Pretrain: See PRETRAIN.md.
ImageNet-1K Finetune: See FINETUNE.md.
Abstract. Masked image modeling (MIM) has attracted much research attention due to its promising potential for learning scalable visual representations. In typical approaches, models usually focus on predicting specific contents of masked patches, and their performances are highly related to pre-defined mask strategies. Intuitively, this procedure can be considered as training a student (the model) on solving given problems (predict masked patches). However, we argue that the model should not only focus on solving given problems, but also stand in the shoes of a teacher to produce a more challenging problem by itself. To this end, we propose Hard Patches Mining (HPM), a brand-new framework for MIM pre-training. We observe that the reconstruction loss can naturally be the metric of the difficulty of the pre-training task. Therefore, we introduce an auxiliary loss predictor, predicting patch-wise losses first and deciding where to mask next. It adopts a relative relationship learning strategy to prevent overfitting to exact reconstruction loss values. Experiments under various settings demonstrate the effectiveness of HPM in constructing masked images. Furthermore, we empirically find that solely introducing the loss prediction objective leads to powerful representations, verifying the efficacy of the ability to be aware of where is hard to reconstruct.
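The two ideas in the abstract, masking where the predicted loss is high and learning relative rather than exact loss values, can be illustrated with a minimal pure-Python sketch. This is not the code in this repo: the function names, the `mask_ratio` default, and the pairwise sigmoid cross-entropy form of the ranking objective are assumptions chosen for illustration. See PRETRAIN.md and the paper for the actual implementation.

```python
import math


def hard_patch_mask(pred_losses, mask_ratio=0.75):
    """Return a 0/1 mask hiding the patches with the highest predicted loss.

    `mask_ratio` is an illustrative default, not a value taken from this repo.
    """
    num_patches = len(pred_losses)
    num_masked = int(num_patches * mask_ratio)
    # Patch indices ordered from hardest (largest predicted loss) to easiest.
    order = sorted(range(num_patches), key=lambda i: pred_losses[i], reverse=True)
    mask = [0] * num_patches
    for i in order[:num_masked]:
        mask[i] = 1
    return mask


def relative_loss(pred_losses, true_losses):
    """Pairwise ranking objective (an assumed form of 'relative relationship' learning).

    For every ordered pair (i, j) with different true losses, the target is 1
    if patch i was truly harder than patch j, and the prediction is
    sigmoid(pred_i - pred_j). Only the ordering of `true_losses` matters,
    not their exact values.
    """
    total, count = 0.0, 0
    n = len(pred_losses)
    for i in range(n):
        for j in range(n):
            if i == j or true_losses[i] == true_losses[j]:
                continue
            target = 1.0 if true_losses[i] > true_losses[j] else 0.0
            prob = 1.0 / (1.0 + math.exp(-(pred_losses[i] - pred_losses[j])))
            prob = min(max(prob, 1e-7), 1.0 - 1e-7)  # avoid log(0)
            total -= target * math.log(prob) + (1.0 - target) * math.log(1.0 - prob)
            count += 1
    return total / count if count else 0.0


if __name__ == "__main__":
    predicted = [0.1, 0.9, 0.5, 0.3]
    print(hard_patch_mask(predicted, mask_ratio=0.5))
    print(relative_loss(predicted, true_losses=[1.0, 4.0, 3.0, 2.0]))
```

Because the objective depends only on which patch of each pair is harder, rescaling the true reconstruction losses leaves it unchanged, which is one way to read the abstract's point about not overfitting to exact loss values.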
Method | Model | Pretrain Epochs | Top-1 Acc. | Checkpoint | mIoU |
---|---|---|---|---|---|
MAE | ViT-B/16 | 200 | 82.2 | | 40.5 |
HPM | ViT-B/16 | 200 | 83.0 (+0.8) | | 42.1 (+1.6) |
MAE | ViT-B/16 | 1600 | 83.6 | | 48.1 |
HPM | ViT-B/16 | 800 | 84.2 (+0.6) | [Google Drive] | 48.5 (+0.4) |
MAE | ViT-L/16 | 1600 | 85.1 | | 53.6 |
HPM | ViT-L/16 | 800 | 85.8 (+0.7) | [Google Drive] | 54.6 (+1.0) |
The pretraining and finetuning of our project are based on DeiT, MAE and UM-MAE. The linear probing is based on MAE. The kNN classification is based on DINO. Thanks for their wonderful work.
For object detection and semantic segmentation, please refer to Detectron2 and MMSegmentation, respectively. The configurations for detection and segmentation can be found here and here, respectively.
This project is released under the Apache License 2.0. See LICENSE for details.
@inproceedings{wang2023hard,
author = {Wang, Haochen and Song, Kaiyou and Fan, Junsong and Wang, Yuxi and Xie, Jin and Zhang, Zhaoxiang},
booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
title = {Hard Patches Mining for Masked Image Modeling},
year = {2023},
}