Model Zoo

Note

  • ⚠️ Because of a bug, the current video models were fine-tuned without layer decay. Layer decay may help improve performance, as in MAE. We have fixed the bug but do not plan to retrain the models. We applied layer decay to VideoMamba-M, but it did not help.
  • For all pretraining and fine-tuning, we adopt sparse/uniform sampling.
  • #Frame $=$ #input_frame $\times$ #crop $\times$ #clip
  • #input_frame is the number of frames fed to the model per inference.
  • #crop is the number of spatial crops (e.g., 3 for left/right/center).
  • #clip is the number of temporal clips (e.g., 4 means repeatedly sampling four clips with different start indices).
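The #Frame notation used in the tables can be unpacked with a few lines of Python. This is a minimal sketch for illustration only: `FrameSetting` and `parse_frame_setting` are hypothetical helper names and are not part of the VideoMamba codebase.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FrameSetting:
    """One '#Frame' table entry, e.g. '8x3x4' (hypothetical helper)."""

    input_frames: int  # frames fed to the model per inference
    crops: int         # spatial crops (e.g., 3 for left/right/center)
    clips: int         # temporal clips with different start indices

    @property
    def views(self) -> int:
        # Number of forward passes per video: one per (crop, clip) pair.
        return self.crops * self.clips

    @property
    def total_frames(self) -> int:
        # #Frame = #input_frame x #crop x #clip
        return self.input_frames * self.crops * self.clips


def parse_frame_setting(text: str) -> FrameSetting:
    """Parse a string such as '64x3x10' into its three factors."""
    parts = text.lower().split("x")
    if len(parts) != 3:
        raise ValueError(f"expected '<frames>x<crops>x<clips>', got {text!r}")
    input_frames, crops, clips = (int(p) for p in parts)
    return FrameSetting(input_frames, crops, clips)


setting = parse_frame_setting("8x3x4")
print(setting.views, setting.total_frames)  # 12 96
```

For example, `8x3x4` means 8 input frames per inference, evaluated over 3 spatial crops and 4 temporal clips, which gives 12 views and 96 frames in total per video.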

Masked Pretraining

| Model | Setting | Checkpoint | Shell |
| --- | --- | --- | --- |
| VideoMamba-M | K400 800e | aliyun, 🤗HF | run.sh |
| VideoMamba-M | SthSthV2 200e | aliyun, 🤗HF | run.sh |

Short-term Video Understanding

K400

| Model | Pretraining | Resolution | #Frame | Top-1 | Checkpoint | Shell |
| --- | --- | --- | --- | --- | --- | --- |
| VideoMamba-Ti | ImageNet-1K | 224 | 8x3x4 | 76.9 | aliyun, 🤗HF | run.sh |
| VideoMamba-Ti | ImageNet-1K | 224 | 16x3x4 | 78.1 | aliyun, 🤗HF | run.sh |
| VideoMamba-Ti | ImageNet-1K | 224 | 32x3x4 | 78.8 | aliyun, 🤗HF | run.sh |
| VideoMamba-Ti | ImageNet-1K | 224 | 64x3x4 | 79.6 | aliyun, 🤗HF | run.sh |
| VideoMamba-Ti | ImageNet-1K | 384 | 64x3x4 | 80.3 | aliyun, 🤗HF | run.sh |
| VideoMamba-S | ImageNet-1K | 224 | 8x3x4 | 79.3 | aliyun, 🤗HF | run.sh |
| VideoMamba-S | ImageNet-1K | 224 | 16x3x4 | 80.8 | aliyun, 🤗HF | run.sh |
| VideoMamba-S | ImageNet-1K | 224 | 32x3x4 | 81.5 | aliyun, 🤗HF | run.sh |
| VideoMamba-S | ImageNet-1K | 224 | 64x3x4 | 81.8 | aliyun, 🤗HF | run.sh |
| VideoMamba-S | ImageNet-1K | 384 | 64x3x4 | 82.7 | aliyun, 🤗HF | run.sh |
| VideoMamba-M | ImageNet-1K | 224 | 8x3x4 | 80.6 | aliyun, 🤗HF | run.sh |
| VideoMamba-M | ImageNet-1K | 224 | 16x3x4 | 81.9 | aliyun, 🤗HF | run.sh |
| VideoMamba-M | ImageNet-1K | 224 | 32x3x4 | 82.4 | aliyun, 🤗HF | run.sh |
| VideoMamba-M | ImageNet-1K | 224 | 64x3x4 | 82.8 | aliyun, 🤗HF | run.sh |
| VideoMamba-M | ImageNet-1K | 384 | 64x3x4 | 83.3 | aliyun, 🤗HF | run.sh |
| VideoMamba-M | MASK | 224 | 8x3x4 | 82.0 | aliyun, 🤗HF | run.sh |
| VideoMamba-M | MASK | 224 | 16x3x4 | 83.4 | aliyun, 🤗HF | run.sh |
| VideoMamba-M | MASK | 224 | 32x3x4 | 83.9 | aliyun, 🤗HF | run.sh |
| VideoMamba-M | MASK | 224 | 64x3x4 | 84.3 | aliyun, 🤗HF | run.sh |
| VideoMamba-M | MASK | 384 | 64x3x4 | 85.0 | aliyun, 🤗HF | run.sh |

SthSthV2

| Model | Pretraining | Resolution | #Frame | Top-1 | Checkpoint | Shell |
| --- | --- | --- | --- | --- | --- | --- |
| VideoMamba-Ti | ImageNet-1K | 224 | 8x3x4 | 65.1 | aliyun, 🤗HF | run.sh |
| VideoMamba-Ti | ImageNet-1K | 224 | 16x3x4 | 66.0 | aliyun, 🤗HF | run.sh |
| VideoMamba-Ti | ImageNet-1K | 288 | 16x3x4 | 66.2 | aliyun, 🤗HF | run.sh |
| VideoMamba-S | ImageNet-1K | 224 | 8x3x4 | 66.6 | aliyun, 🤗HF | run.sh |
| VideoMamba-S | ImageNet-1K | 224 | 16x3x4 | 67.7 | aliyun, 🤗HF | run.sh |
| VideoMamba-S | ImageNet-1K | 288 | 16x3x4 | 68.1 | aliyun, 🤗HF | run.sh |
| VideoMamba-M | ImageNet-1K | 224 | 8x3x4 | 67.3 | aliyun, 🤗HF | run.sh |
| VideoMamba-M | ImageNet-1K | 224 | 16x3x4 | 68.3 | aliyun, 🤗HF | run.sh |
| VideoMamba-M | ImageNet-1K | 288 | 16x3x4 | 68.4 | aliyun, 🤗HF | run.sh |
| VideoMamba-M | MASK | 224 | 8x3x4 | 70.2 | aliyun, 🤗HF | run.sh |
| VideoMamba-M | MASK | 224 | 16x3x4 | 71.0 | aliyun, 🤗HF | run.sh |
| VideoMamba-M | MASK | 288 | 16x3x4 | 71.4 | aliyun, 🤗HF | run.sh |

Long-term Video Understanding

Breakfast

| Model | Pretraining | Resolution | #Frame | Top-1 | Checkpoint | Shell |
| --- | --- | --- | --- | --- | --- | --- |
| VideoMamba-Ti | K400 | 224 | 32x3x4 | 94.3 | aliyun, 🤗HF | run.sh |
| VideoMamba-Ti | K400 | 224 | 64x3x4 | 94.3 | aliyun, 🤗HF | run.sh |
| VideoMamba-S | K400 | 224 | 32x3x4 | 95.3 | aliyun, 🤗HF | run.sh |
| VideoMamba-S | K400 | 224 | 64x3x4 | 97.4 | aliyun, 🤗HF | run.sh |
| VideoMamba-M | K400 | 224 | 32x3x4 | 94.8 | aliyun, 🤗HF | run.sh |
| VideoMamba-M | K400 | 224 | 64x3x4 | 95.8 | aliyun, 🤗HF | run.sh |
| VideoMamba-M | MASK+K400 | 224 | 32x3x4 | 97.9 | aliyun, 🤗HF | run.sh |
| VideoMamba-M | MASK+K400 | 224 | 64x3x4 | 96.9 | aliyun, 🤗HF | run.sh |

COIN

| Model | Pretraining | Resolution | #Frame | Top-1 | Checkpoint | Shell |
| --- | --- | --- | --- | --- | --- | --- |
| VideoMamba-Ti | K400 | 224 | 32x3x10 | 86.2 | aliyun, 🤗HF | run.sh |
| VideoMamba-Ti | K400 | 224 | 64x3x10 | 87.0 | aliyun, 🤗HF | run.sh |
| VideoMamba-S | K400 | 224 | 32x3x10 | 88.4 | aliyun, 🤗HF | run.sh |
| VideoMamba-S | K400 | 224 | 64x3x10 | 88.7 | aliyun, 🤗HF | run.sh |
| VideoMamba-M | K400 | 224 | 32x3x10 | 88.3 | aliyun, 🤗HF | run.sh |
| VideoMamba-M | K400 | 224 | 64x3x10 | 89.5 | aliyun, 🤗HF | run.sh |
| VideoMamba-M | MASK+K400 | 224 | 32x3x10 | 89.6 | aliyun, 🤗HF | run.sh |
| VideoMamba-M | MASK+K400 | 224 | 64x3x10 | 90.4 | aliyun, 🤗HF | run.sh |

LVU

For LVU, we originally sampled frames sparsely from the raw videos, but the results were unstable because of the limited number of videos. We later found that ViS4mer uses trimmed clips obtained with a sliding window, which may improve the results. We will also provide the related sliding-window dataset. Stay tuned!
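To illustrate the general idea of trimming a long video into clips with a sliding window, here is a minimal sketch. It is not the ViS4mer preprocessing and not the pipeline used for the dataset mentioned above: the function name `sliding_window_clips` and the `window` and `stride` parameters are assumptions chosen for illustration.

```python
from typing import List, Tuple


def sliding_window_clips(
    num_frames: int, window: int, stride: int
) -> List[Tuple[int, int]]:
    """Return (start, end) frame ranges, end exclusive (hypothetical helper).

    A window of `window` frames slides over the video in steps of `stride`.
    Videos shorter than one window yield a single clip covering all frames.
    """
    if window <= 0 or stride <= 0:
        raise ValueError("window and stride must be positive")
    if num_frames <= 0:
        return []
    if num_frames < window:
        return [(0, num_frames)]
    return [
        (start, start + window)
        for start in range(0, num_frames - window + 1, stride)
    ]


# A 10-frame video, 4-frame window, stride of 3 frames.
print(sliding_window_clips(10, 4, 3))  # [(0, 4), (3, 7), (6, 10)]
```

Each trimmed clip becomes its own training sample, so a small set of long videos yields many more samples than sparse sampling over whole videos.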