EfficientViT is a new family of models for efficient high-resolution vision, especially semantic segmentation. Its core building block is a lightweight multi-scale attention module that achieves a global receptive field and multi-scale learning using only hardware-efficient operations.
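The global receptive field comes from replacing softmax attention with ReLU-based linear attention, which can be computed in linear time. Below is a minimal sketch of that primitive; the function name and tensor shapes are illustrative, not the repository's implementation:

```python
import torch
import torch.nn.functional as F

def relu_linear_attention(q, k, v, eps=1e-6):
    """Global attention in O(N) time and memory.

    Replaces softmax(QK^T)V with ReLU(Q) (ReLU(K)^T V) / normalizer,
    so the N x N attention map is never materialized.
    q, k: (batch, heads, N, dim); v: (batch, heads, N, dim_v).
    """
    q = F.relu(q)
    k = F.relu(k)
    # Associativity: compute K^T V first -> (batch, heads, dim, dim_v)
    kv = torch.einsum("bhnd,bhne->bhde", k, v)
    out = torch.einsum("bhnd,bhde->bhne", q, kv)
    # Normalizer mimics attention weights summing to 1 per query.
    z = torch.einsum("bhnd,bhd->bhn", q, k.sum(dim=2)) + eps
    return out / z.unsqueeze(-1)
```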
Comparisons with prior state-of-the-art semantic segmentation models, along with EfficientViT's image classification results, are reported in the tables below.
```bash
conda create -n efficientvit python=3.8.5
conda activate efficientvit
conda install pytorch=1.13.1 torchvision=0.14.1 pytorch-cuda=11.7 -c pytorch -c nvidia
pip install tqdm opencv-python
```
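To sanity-check the environment (this one-liner assumes a CUDA 11.7-capable driver is installed):

```bash
python -c "import torch; print(torch.__version__, torch.cuda.is_available())"
```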
- ImageNet: https://www.image-net.org/
- Cityscapes: https://www.cityscapes-dataset.com/
- ADE20K: https://groups.csail.mit.edu/vision/datasets/ADE20K/
Mobile latency is measured on a Qualcomm Snapdragon 8 Gen 1 with TensorFlow Lite, fp32, batch size 1.
Model | Resolution | ImageNet Top-1 Acc (%) | ImageNet Top-5 Acc (%) | Params | MACs | Mobile Latency | Checkpoint |
---|---|---|---|---|---|---|---|
EfficientViT-B1 | 224 | 79.4 | 94.3 | 9.1M | 0.52G | 19ms | link |
EfficientViT-B1 | 256 | 79.9 | 94.7 | 9.1M | 0.68G | 24ms | link |
EfficientViT-B1 | 288 | 80.4 | 95.0 | 9.1M | 0.86G | 31ms | link |
EfficientViT-B2 | 224 | 82.1 | 95.8 | 24M | 1.6G | 55ms | link |
EfficientViT-B2 | 256 | 82.7 | 96.1 | 24M | 2.1G | 72ms | link |
EfficientViT-B2 | 288 | 83.1 | 96.3 | 24M | 2.6G | 92ms | link |
EfficientViT-B3 | 224 | 83.5 | 96.4 | 49M | 4.0G | 140ms | link |
EfficientViT-B3 | 256 | 83.8 | 96.5 | 49M | 5.2G | 180ms | link |
EfficientViT-B3 | 288 | 84.2 | 96.7 | 49M | 6.5G | 228ms | link |
Model | Resolution | Cityscapes mIoU (%) | Params | MACs | Mobile Latency | Checkpoint |
---|---|---|---|---|---|---|
EfficientViT-B0 | 960x1920 | 75.5 | 0.7M | 3.9G | 0.20s | link |
EfficientViT-B1 | 896x1792 | 80.1 | 4.8M | 19G | 0.82s | link |
EfficientViT-B2 | 1024x2048 | 82.1 | 15M | 74G | 3.1s | link |
EfficientViT-B3 | 1184x2368 | 83.2 | 40M | 240G | 10s | link |
Model | Resolution | ADE20K mIoU (%) | Params | MACs | Mobile Latency | Checkpoint |
---|---|---|---|---|---|---|
EfficientViT-B1 | 480 | 42.7 | 4.8M | 2.7G | 0.10s | link |
EfficientViT-B2 | 416 | 45.1 | 15M | 6.0G | 0.21s | link |
EfficientViT-B3 | 512 | 49.0 | 39M | 22G | 0.8s | link |
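For reference, the batch-1 TensorFlow Lite protocol noted above can be reproduced off-device with a small harness like the following. It assumes a model already converted to a hypothetical `model.tflite`; desktop numbers will not match the Snapdragon results in the tables.

```python
import time
import numpy as np
import tensorflow as tf

# Load the converted model and bind input/output buffers.
interpreter = tf.lite.Interpreter(model_path="model.tflite")
interpreter.allocate_tensors()
inp = interpreter.get_input_details()[0]
interpreter.set_tensor(inp["index"], np.random.rand(*inp["shape"]).astype(np.float32))

# Warm up, then time batch-1 inference.
for _ in range(10):
    interpreter.invoke()
start = time.perf_counter()
for _ in range(50):
    interpreter.invoke()
print(f"{(time.perf_counter() - start) / 50 * 1000:.1f} ms / image")
```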
```python
from models.cls_model_zoo import create_cls_model

# EfficientViT-B3 classifier, loading the 288x288 ImageNet checkpoint.
model = create_cls_model(
    name="b3",
    pretrained=True,
    weight_url="assets/checkpoints/cls/b3-r288.pt",
)
```
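A minimal inference sketch for the classifier above, assuming standard ImageNet preprocessing; the normalization constants and the example image path are assumptions, not taken from the repo:

```python
import torch
from PIL import Image
from torchvision import transforms

# Standard ImageNet preprocessing at the checkpoint's 288x288 resolution.
preprocess = transforms.Compose([
    transforms.Resize(288),
    transforms.CenterCrop(288),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

model.eval()
with torch.no_grad():
    x = preprocess(Image.open("example.jpg").convert("RGB")).unsqueeze(0)
    pred = model(x).argmax(dim=-1).item()  # predicted ImageNet class index
```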
```python
from models.seg_model_zoo import create_seg_model

# EfficientViT-B3 segmentation model with the Cityscapes checkpoint (1184x2368).
model = create_seg_model(
    name="b3",
    dataset="cityscapes",
    pretrained=True,
    weight_url="assets/checkpoints/seg/cityscapes/b3-r1184.pt",
)
```
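A minimal inference sketch for the segmentation model above; the output stride and tensor shapes are assumptions, so the upsampling step may be unnecessary depending on the head:

```python
import torch
import torch.nn.functional as F

model.eval()
with torch.no_grad():
    # Dummy normalized input at the Cityscapes resolution from the table above.
    x = torch.randn(1, 3, 1184, 2368)
    logits = model(x)  # (1, num_classes, h, w)
    # If the head predicts at a reduced stride, upsample before the argmax.
    logits = F.interpolate(logits, size=x.shape[-2:], mode="bilinear", align_corners=False)
    pred = logits.argmax(dim=1)  # (1, H, W) per-pixel class indices
```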
```python
from models.seg_model_zoo import create_seg_model

# Same API for ADE20K: switch the dataset and the matching checkpoint (512x512).
model = create_seg_model(
    name="b3",
    dataset="ade20k",
    pretrained=True,
    weight_url="assets/checkpoints/seg/ade20k/b3-r512.pt",
)
```
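As a quick sanity check after loading, the parameter count should roughly match the 39M reported in the ADE20K table above:

```python
n_params = sum(p.numel() for p in model.parameters())
print(f"{n_params / 1e6:.1f}M parameters")
```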
Please run `eval_cls_model.py` or `eval_seg_model.py` to evaluate our models; example invocations are sketched below.
Examples: classification, segmentation
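For example (the segmentation command matches the demo below; the classification flags are an assumption, so check each script's `--help` for the exact interface):

```bash
python eval_seg_model.py --dataset cityscapes --crop_size 1184 --model b3-r1184
python eval_cls_model.py --model b3-r288  # flag names assumed; see --help
```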
Please run `eval_seg_model.py` to visualize the outputs of our semantic segmentation models. Example:

```bash
python eval_seg_model.py --dataset cityscapes --crop_size 1184 --model b3-r1184 --save_path demo/cityscapes/b3-r1184/
```
Han Cai: hancai@mit.edu
If EfficientViT is useful or relevant to your research, please kindly recognize our contributions by citing our paper:
```bibtex
@article{cai2022efficientvit,
  title={Efficientvit: Enhanced linear attention for high-resolution low-computation visual recognition},
  author={Cai, Han and Gan, Chuang and Han, Song},
  journal={arXiv preprint arXiv:2205.14756},
  year={2022}
}
```