⚠️ Due to a bug, the current video models were fine-tuned without layer-wise learning-rate decay, which may further improve performance as in MAE. We have fixed the bug but do not plan to retrain the models; we applied the fix to VideoMamba-M and it did not help.

- For all pretraining and fine-tuning, we adopt sparse/uniform sampling.
- #Frame $=$ #input_frame $\times$ #crop $\times$ #clip
- #input_frame means how many frames are input to the model per inference.
- #crop means the number of spatial crops (e.g., 3 for left/right/center).
- #clip means the number of temporal clips (e.g., 4 means repeatedly sampling four clips with different start indices).
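As a rough illustration (not the repository's actual data-loading code), sparse/uniform sampling and the #Frame accounting above can be sketched as follows; the helper name and the offset scheme are assumptions for the sketch:

```python
import numpy as np

def uniform_sample_indices(num_total, num_input, clip_idx, num_clips):
    """Sparse/uniform sampling sketch: split the video into `num_input`
    equal segments and pick one frame per segment; different `clip_idx`
    values shift the in-segment offset so each temporal clip sees
    different frames."""
    seg_size = num_total / num_input
    # offset within each segment depends on which clip we are sampling
    offset = seg_size * (clip_idx + 0.5) / num_clips
    return np.minimum(
        (np.arange(num_input) * seg_size + offset).astype(int),
        num_total - 1,
    )

# #Frame = #input_frame x #crop x #clip, e.g. 8x3x4 -> 96 frames per video
num_input, num_crop, num_clip = 8, 3, 4
total_frames = num_input * num_crop * num_clip  # 96 frames in total

# sample the first temporal clip from a 300-frame video
indices = uniform_sample_indices(num_total=300, num_input=num_input,
                                 clip_idx=0, num_clips=num_clip)
```

At test time, the final prediction is typically the average of the logits over all crop/clip views.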
Model | Setting | Checkpoint | Shell |
---|---|---|---|
VideoMamba-M | K400 800e | aliyun, 🤗HF | run.sh |
VideoMamba-M | SthSthV2 200e | aliyun, 🤗HF | run.sh |
Model | Pretraining | Resolution | #Frame | Top-1 | Checkpoint | Shell |
---|---|---|---|---|---|---|
VideoMamba-Ti | ImageNet-1K | 224 | 8x3x4 | 76.9 | aliyun, 🤗HF | run.sh |
VideoMamba-Ti | ImageNet-1K | 224 | 16x3x4 | 78.1 | aliyun, 🤗HF | run.sh |
VideoMamba-Ti | ImageNet-1K | 224 | 32x3x4 | 78.8 | aliyun, 🤗HF | run.sh |
VideoMamba-Ti | ImageNet-1K | 224 | 64x3x4 | 79.6 | aliyun, 🤗HF | run.sh |
VideoMamba-Ti | ImageNet-1K | 384 | 64x3x4 | 80.3 | aliyun, 🤗HF | run.sh |
VideoMamba-S | ImageNet-1K | 224 | 8x3x4 | 79.3 | aliyun, 🤗HF | run.sh |
VideoMamba-S | ImageNet-1K | 224 | 16x3x4 | 80.8 | aliyun, 🤗HF | run.sh |
VideoMamba-S | ImageNet-1K | 224 | 32x3x4 | 81.5 | aliyun, 🤗HF | run.sh |
VideoMamba-S | ImageNet-1K | 224 | 64x3x4 | 81.8 | aliyun, 🤗HF | run.sh |
VideoMamba-S | ImageNet-1K | 384 | 64x3x4 | 82.7 | aliyun, 🤗HF | run.sh |
VideoMamba-M | ImageNet-1K | 224 | 8x3x4 | 80.6 | aliyun, 🤗HF | run.sh |
VideoMamba-M | ImageNet-1K | 224 | 16x3x4 | 81.9 | aliyun, 🤗HF | run.sh |
VideoMamba-M | ImageNet-1K | 224 | 32x3x4 | 82.4 | aliyun, 🤗HF | run.sh |
VideoMamba-M | ImageNet-1K | 224 | 64x3x4 | 82.8 | aliyun, 🤗HF | run.sh |
VideoMamba-M | ImageNet-1K | 384 | 64x3x4 | 83.3 | aliyun, 🤗HF | run.sh |
VideoMamba-M | MASK | 224 | 8x3x4 | 82.0 | aliyun, 🤗HF | run.sh |
VideoMamba-M | MASK | 224 | 16x3x4 | 83.4 | aliyun, 🤗HF | run.sh |
VideoMamba-M | MASK | 224 | 32x3x4 | 83.9 | aliyun, 🤗HF | run.sh |
VideoMamba-M | MASK | 224 | 64x3x4 | 84.3 | aliyun, 🤗HF | run.sh |
VideoMamba-M | MASK | 384 | 64x3x4 | 85.0 | aliyun, 🤗HF | run.sh |
Model | Pretraining | Resolution | #Frame | Top-1 | Checkpoint | Shell |
---|---|---|---|---|---|---|
VideoMamba-Ti | ImageNet-1K | 224 | 8x3x4 | 65.1 | aliyun, 🤗HF | run.sh |
VideoMamba-Ti | ImageNet-1K | 224 | 16x3x4 | 66.0 | aliyun, 🤗HF | run.sh |
VideoMamba-Ti | ImageNet-1K | 288 | 16x3x4 | 66.2 | aliyun, 🤗HF | run.sh |
VideoMamba-S | ImageNet-1K | 224 | 8x3x4 | 66.6 | aliyun, 🤗HF | run.sh |
VideoMamba-S | ImageNet-1K | 224 | 16x3x4 | 67.7 | aliyun, 🤗HF | run.sh |
VideoMamba-S | ImageNet-1K | 288 | 16x3x4 | 68.1 | aliyun, 🤗HF | run.sh |
VideoMamba-M | ImageNet-1K | 224 | 8x3x4 | 67.3 | aliyun, 🤗HF | run.sh |
VideoMamba-M | ImageNet-1K | 224 | 16x3x4 | 68.3 | aliyun, 🤗HF | run.sh |
VideoMamba-M | ImageNet-1K | 288 | 16x3x4 | 68.4 | aliyun, 🤗HF | run.sh |
VideoMamba-M | MASK | 224 | 8x3x4 | 70.2 | aliyun, 🤗HF | run.sh |
VideoMamba-M | MASK | 224 | 16x3x4 | 71.0 | aliyun, 🤗HF | run.sh |
VideoMamba-M | MASK | 288 | 16x3x4 | 71.4 | aliyun, 🤗HF | run.sh |
Model | Pretraining | Resolution | #Frame | Top-1 | Checkpoint | Shell |
---|---|---|---|---|---|---|
VideoMamba-Ti | K400 | 224 | 32x3x4 | 94.3 | aliyun, 🤗HF | run.sh |
VideoMamba-Ti | K400 | 224 | 64x3x4 | 94.3 | aliyun, 🤗HF | run.sh |
VideoMamba-S | K400 | 224 | 32x3x4 | 95.3 | aliyun, 🤗HF | run.sh |
VideoMamba-S | K400 | 224 | 64x3x4 | 97.4 | aliyun, 🤗HF | run.sh |
VideoMamba-M | K400 | 224 | 32x3x4 | 94.8 | aliyun, 🤗HF | run.sh |
VideoMamba-M | K400 | 224 | 64x3x4 | 95.8 | aliyun, 🤗HF | run.sh |
VideoMamba-M | MASK+K400 | 224 | 32x3x4 | 97.9 | aliyun, 🤗HF | run.sh |
VideoMamba-M | MASK+K400 | 224 | 64x3x4 | 96.9 | aliyun, 🤗HF | run.sh |
Model | Pretraining | Resolution | #Frame | Top-1 | Checkpoint | Shell |
---|---|---|---|---|---|---|
VideoMamba-Ti | K400 | 224 | 32x3x10 | 86.2 | aliyun, 🤗HF | run.sh |
VideoMamba-Ti | K400 | 224 | 64x3x10 | 87.0 | aliyun, 🤗HF | run.sh |
VideoMamba-S | K400 | 224 | 32x3x10 | 88.4 | aliyun, 🤗HF | run.sh |
VideoMamba-S | K400 | 224 | 64x3x10 | 88.7 | aliyun, 🤗HF | run.sh |
VideoMamba-M | K400 | 224 | 32x3x10 | 88.3 | aliyun, 🤗HF | run.sh |
VideoMamba-M | K400 | 224 | 64x3x10 | 89.5 | aliyun, 🤗HF | run.sh |
VideoMamba-M | MASK+K400 | 224 | 32x3x10 | 89.6 | aliyun, 🤗HF | run.sh |
VideoMamba-M | MASK+K400 | 224 | 64x3x10 | 90.4 | aliyun, 🤗HF | run.sh |
For LVU, we originally sampled frames from the raw videos sparsely, but the results were not stable due to the limited number of videos. However, we found that ViS4mer uses trimmed clips with a sliding window, which may improve the results. We will also provide the related dataset processed with a sliding window. Stay tuned!
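A minimal sketch of the sliding-window alternative (a hypothetical helper, not the ViS4mer or repository code): fixed-length trimmed clips are taken at a regular stride over the frame axis, instead of one sparse sample per video.

```python
def sliding_window_clips(num_total, clip_len, stride):
    """Return start indices of fixed-length clips taken with a sliding
    window over the video; the last window is clamped so it stays
    inside the frame range."""
    if num_total <= clip_len:
        return [0]
    starts = list(range(0, num_total - clip_len + 1, stride))
    if starts[-1] != num_total - clip_len:  # cover the tail of the video
        starts.append(num_total - clip_len)
    return starts

# e.g. a 300-frame video, 64-frame clips, stride 32
clips = sliding_window_clips(300, 64, 32)
```

Overlapping windows trade extra compute for denser temporal coverage, which can stabilize results on small long-video datasets.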