Syed Talal Wasim*, Muhammad Uzair Khattak*, Muzammal Naseer, Salman Khan, Mubarak Shah, Fahad Shahbaz Khan
*Joint first authors
Abstract: Recent video recognition models utilize Transformer models for long-range spatio-temporal context modeling. Video transformer designs are based on self-attention, which can model global context at a high computational cost. In comparison, convolutional designs for videos offer an efficient alternative but lack long-range dependency modeling. Towards achieving the best of both designs, this work proposes Video-FocalNet, an effective and efficient architecture for video recognition that models both local and global contexts. Video-FocalNet is based on a spatio-temporal focal modulation architecture that reverses the interaction and aggregation steps of self-attention for better efficiency. Further, the aggregation step and the interaction step are both implemented using efficient convolution and element-wise multiplication operations that are computationally less expensive than their self-attention counterparts on video representations. We extensively explore the design space of focal modulation-based spatio-temporal context modeling and demonstrate our parallel spatial and temporal encoding design to be the optimal choice. Video-FocalNets perform favorably against state-of-the-art transformer-based models for video recognition on three large-scale datasets (Kinetics-400, Kinetics-600, and SS-v2) at a lower computational cost.
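To make the reversed interaction/aggregation idea concrete, below is a minimal PyTorch sketch of a spatio-temporal focal modulation layer: one projection yields the query, a shared context, and per-level gates; parallel spatial (2D depth-wise convolutions per frame) and temporal (1D depth-wise convolutions per spatial location) branches hierarchically aggregate the context; and the query is modulated element-wise. The class name, tensor layout, gating arrangement, and the choice to sum the two modulators are illustrative assumptions, not the released implementation.

```python
# Illustrative sketch only -- shapes, gating, and the summed combination of the
# spatial and temporal modulators are assumptions, not the official code.
import torch
import torch.nn as nn

class SpatioTemporalFocalModulation(nn.Module):
    def __init__(self, dim, focal_level=2, kernels=(3, 5)):
        super().__init__()
        assert len(kernels) == focal_level
        self.focal_level = focal_level
        # One linear projection produces the query, the context, and the gates.
        self.f = nn.Linear(dim, 2 * dim + 2 * (focal_level + 1))
        # Hierarchical aggregation via depth-wise convolutions (spatial and temporal).
        self.spatial_layers = nn.ModuleList(
            nn.Sequential(nn.Conv2d(dim, dim, k, padding=k // 2, groups=dim), nn.GELU())
            for k in kernels)
        self.temporal_layers = nn.ModuleList(
            nn.Sequential(nn.Conv1d(dim, dim, k, padding=k // 2, groups=dim), nn.GELU())
            for k in kernels)
        self.h_s = nn.Conv2d(dim, dim, 1)  # spatial modulator projection
        self.h_t = nn.Conv1d(dim, dim, 1)  # temporal modulator projection
        self.proj = nn.Linear(dim, dim)

    def forward(self, x):  # x: (B, T, H, W, C)
        B, T, H, W, C = x.shape
        q, ctx, gates_s, gates_t = torch.split(
            self.f(x), [C, C, self.focal_level + 1, self.focal_level + 1], dim=-1)

        # Spatial branch: aggregate context within each frame.
        ctx_s = ctx.permute(0, 1, 4, 2, 3).reshape(B * T, C, H, W)
        g_s = gates_s.permute(0, 1, 4, 2, 3).reshape(B * T, -1, H, W)
        agg_s = 0
        for l, layer in enumerate(self.spatial_layers):
            ctx_s = layer(ctx_s)
            agg_s = agg_s + ctx_s * g_s[:, l:l + 1]
        agg_s = agg_s + ctx_s.mean((2, 3), keepdim=True) * g_s[:, -1:]  # global level
        mod_s = self.h_s(agg_s).reshape(B, T, C, H, W).permute(0, 1, 3, 4, 2)

        # Temporal branch: aggregate context across frames at each spatial location.
        ctx_t = ctx.permute(0, 2, 3, 4, 1).reshape(B * H * W, C, T)
        g_t = gates_t.permute(0, 2, 3, 4, 1).reshape(B * H * W, -1, T)
        agg_t = 0
        for l, layer in enumerate(self.temporal_layers):
            ctx_t = layer(ctx_t)
            agg_t = agg_t + ctx_t * g_t[:, l:l + 1]
        agg_t = agg_t + ctx_t.mean(2, keepdim=True) * g_t[:, -1:]  # global level
        mod_t = self.h_t(agg_t).reshape(B, H, W, C, T).permute(0, 4, 1, 2, 3)

        # Interaction: element-wise modulation of the query (no attention matrix).
        return self.proj(q * (mod_s + mod_t))
```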
- News
- Overview
- Visualization
- Environment Setup
- Dataset Preparation
- Model Zoo
- Evaluation
- Training
- Citation
- Acknowledgements
- (July 13, 2023) Training and evaluation code for Video-FocalNets, along with pretrained models, is released.
(a) The overall architecture of Video-FocalNets: a four-stage design, with each stage comprising a patch embedding layer and a number of Video-FocalNet blocks. (b) A single Video-FocalNet block: the layout follows a standard transformer block, with self-attention replaced by Spatio-Temporal Focal Modulation.
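Following the caption above, a Video-FocalNet block can be pictured as a pre-norm transformer block with the attention sub-layer swapped for the spatio-temporal focal modulation sketched earlier. The snippet below is a simplified illustration only (drop path, layer scaling, and other training details omitted; it reuses the hypothetical `SpatioTemporalFocalModulation` class from the sketch above).

```python
# Simplified block sketch, assuming the standard pre-norm transformer layout.
import torch.nn as nn

class VideoFocalNetBlock(nn.Module):
    """Modulation sub-layer + MLP sub-layer, both with residual connections."""
    def __init__(self, dim, mlp_ratio=4.0, focal_level=2, kernels=(3, 5)):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        # SpatioTemporalFocalModulation is the illustrative class sketched above.
        self.modulation = SpatioTemporalFocalModulation(dim, focal_level, kernels)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, int(dim * mlp_ratio)),
            nn.GELU(),
            nn.Linear(int(dim * mlp_ratio), dim),
        )

    def forward(self, x):  # x: (B, T, H, W, C)
        x = x + self.modulation(self.norm1(x))
        x = x + self.mlp(self.norm2(x))
        return x
```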
Please follow INSTALL.md for installation.
Please follow DATA.md for data preparation.
Kinetics-400

Model | Depth | Dim | Kernels | Top-1 (%) | Download |
---|---|---|---|---|---|
Video-FocalNet-T | [2,2,6,2] | 96 | [3,5] | 79.8 | ckpt |
Video-FocalNet-S | [2,2,18,2] | 96 | [3,5] | 81.4 | ckpt |
Video-FocalNet-B | [2,2,18,2] | 128 | [3,5] | 83.6 | ckpt |
Kinetics-600

Model | Depth | Dim | Kernels | Top-1 (%) | Download |
---|---|---|---|---|---|
Video-FocalNet-B | [2,2,18,2] | 128 | [3,5] | 86.7 | ckpt |
Something-Something-v2

Model | Depth | Dim | Kernels | Top-1 (%) | Download |
---|---|---|---|---|---|
Video-FocalNet-B | [2,2,18,2] | 128 | [3,5] | 71.1 | ckpt |
Diving-48

Model | Depth | Dim | Kernels | Top-1 (%) | Download |
---|---|---|---|---|---|
Video-FocalNet-B | [2,2,18,2] | 128 | [3,5] | 90.8 | ckpt |
ActivityNet-1.3

Model | Depth | Dim | Kernels | Top-1 (%) | Download |
---|---|---|---|---|---|
Video-FocalNet-B | [2,2,18,2] | 128 | [3,5] | 89.8 | ckpt |
To evaluate pre-trained Video-FocalNets on your dataset:
python -m torch.distributed.launch --nproc_per_node <num-of-gpus-to-use> main.py --eval \
--cfg <config-file> --resume <checkpoint> \
--opts DATA.NUM_FRAMES 8 DATA.BATCH_SIZE 8 TEST.NUM_CLIP 4 TEST.NUM_CROP 3 DATA.ROOT path/to/root DATA.TRAIN_FILE train.csv DATA.VAL_FILE val.csv
For example, to evaluate the Video-FocalNet-B model with a single GPU on Kinetics-400:
python -m torch.distributed.launch --nproc_per_node 1 main.py --eval \
--cfg configs/kinetics400/video_focalnet_base.yaml --resume video-focalnet_base_k400.pth \
--opts DATA.NUM_FRAMES 8 DATA.BATCH_SIZE 8 TEST.NUM_CLIP 4 TEST.NUM_CROP 3 DATA.ROOT path/to/root DATA.TRAIN_FILE train.csv DATA.VAL_FILE val.csv
Alternatively, the `DATA.ROOT`, `DATA.TRAIN_FILE`, and `DATA.VAL_FILE` paths can be set directly in the config files provided in the `configs` directory.
Based on our experience and sanity checks, expect a random variation of about +/-0.3% top-1 accuracy when testing on different machines.
To train a Video-FocalNet on a video dataset from scratch, run:
python -m torch.distributed.launch --nproc_per_node <num-of-gpus-to-use> main.py \
--cfg <config-file> --batch-size <batch-size-per-gpu> --output <output-directory> \
--opts DATA.ROOT path/to/root DATA.TRAIN_FILE train.csv DATA.VAL_FILE val.csv
Alternatively, the `DATA.ROOT`, `DATA.TRAIN_FILE`, and `DATA.VAL_FILE` paths can be set directly in the config files provided in the `configs` directory. We also provide bash scripts to train Video-FocalNets on various datasets in the `scripts` directory.
Additionally, `TRAIN.PRETRAINED_PATH` can be set (either in the config file or in the bash script) to provide a pretrained model for weight initialization. To initialize from ImageNet-1K weights, please refer to the FocalNets repository and download FocalNet-T-SRF, FocalNet-S-SRF, or FocalNet-B-SRF to initialize Video-FocalNet-T, Video-FocalNet-S, or Video-FocalNet-B, respectively. Alternatively, one of the provided pretrained Video-FocalNet models can be used to initialize the weights.
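As a rough illustration of what such initialization involves, the sketch below loads an image-pretrained FocalNet checkpoint and copies over only the parameters whose names and shapes match the video model, leaving any temporal-specific layers at their random initialization. The helper name and the checkpoint key layout are assumptions; the repository's own loading logic may differ.

```python
# Hypothetical helper: partial initialization from an image-pretrained checkpoint.
# Not the repository's actual loading code; the checkpoint layout may differ.
import torch

def init_from_image_pretrained(video_model, ckpt_path):
    ckpt = torch.load(ckpt_path, map_location="cpu")
    state = ckpt.get("model", ckpt)  # weights are often nested under a "model" key
    target = video_model.state_dict()
    # Copy only tensors whose names and shapes match the video model.
    matched = {k: v for k, v in state.items()
               if k in target and v.shape == target[k].shape}
    missing, unexpected = video_model.load_state_dict(matched, strict=False)
    print(f"loaded {len(matched)} tensors; {len(missing)} parameters keep random init")
    return video_model
```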
If you find our work, this repository, or pretrained models useful, please consider giving a star ⭐ and citation.
@InProceedings{Wasim_2023_ICCV,
author = {Wasim, Syed Talal and Khattak, Muhammad Uzair and Naseer, Muzammal and Khan, Salman and Shah, Mubarak and Khan, Fahad Shahbaz},
title = {Video-FocalNets: Spatio-Temporal Focal Modulation for Video Action Recognition},
booktitle = {Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)},
year = {2023},
}
If you have any questions, please create an issue on this repository or contact us at syed.wasim@mbzuai.ac.ae or uzair.khattak@mbzuai.ac.ae.
Our code is based on the FocalNets, XCLIP, and UniFormer repositories. We thank the authors for releasing their code. If you use our model, please consider citing these works as well.