In this repository, we provide an implementation of "MS-TCT: Multi-Scale Temporal ConvTransformer for Action Detection" on the Charades dataset (localization setting, i.e., Charades_v1_localize). To train and evaluate MS-TCT, follow the steps below. For MultiTHUMOS, you can follow the training process here.
Like previous works (e.g., TGM, PDAN), MS-TCT is built on top of pre-trained I3D features, so features must be extracted before training the network.
- Please download the Charades dataset (24 fps version) from this link.
- Follow this repository to extract the snippet-level I3D features.
Please satisfy the following dependencies to train MS-TCT correctly:
- pytorch 1.9
- python 3.8
- timm 0.4.12
- pickle5
- scikit-learn
- numpy
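As a quick sanity check before training, the versions listed above can be compared with what is installed. The following is a minimal sketch, not part of the repository: the `EXPECTED` table and the `check_dependencies` helper are hypothetical names, and it assumes the PyPI distribution name `torch` for PyTorch.

```python
from importlib import metadata

# Hypothetical table built from the list above. Keys are PyPI distribution
# names ("torch" is assumed for PyTorch); None means no version is pinned.
EXPECTED = {
    "torch": "1.9",
    "timm": "0.4.12",
    "pickle5": None,
    "scikit-learn": None,
    "numpy": None,
}


def check_dependencies(expected=EXPECTED):
    """Return {name: (installed_version_or_None, ok)} for each package."""
    report = {}
    for name, wanted in expected.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            installed = None
        # A package passes if it is installed and, when a version is pinned,
        # the installed version equals it or extends it (e.g. 1.9 -> 1.9.0).
        ok = installed is not None and (
            wanted is None
            or installed == wanted
            or installed.startswith(wanted + ".")
        )
        report[name] = (installed, ok)
    return report


if __name__ == "__main__":
    for name, (installed, ok) in check_dependencies().items():
        print(f"{name}: installed={installed} ok={ok}")
```

The check only reports mismatches; it does not install or modify anything.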
- In train.py, set rgb_root to the path of the extracted features.
- Run ./run_MSTCT_Charades.sh to train on Charades-RGB. The best logits will be saved automatically in ./save_logit.
- Run python Evaluation.py -pkl_path /best_logit_path/ to evaluate the model with the per-frame mAP and the action-conditional metrics.
- The network implementation is in the ./MSTCT/ folder.
- RGB and optical flow follow the same training process. The two modalities can be combined at the logit level (i.e., late fusion) to obtain two-stream performance. Note that the paper mainly focuses on the RGB-only results.
- In practice, we trained MS-TCT on a Tesla V100 GPU to reduce computation time. However, since MS-TCT is not large, a GTX 1080 Ti should be sufficient to run the network.
- For the evaluation metrics, the standard per-frame mAP follows Superevent and the action-conditional metrics follow MLAD.
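To illustrate the late-fusion remark, the sketch below adds RGB and optical-flow logits and scores the result with a simplified per-frame mAP. It is an illustration under assumptions, not the code in Evaluation.py: the function names are hypothetical, the logits are assumed to be arrays of shape (frames, classes) with matching binary labels, and the average precision here is the plain mean of precision at each positive frame, which may differ in detail from the Superevent protocol.

```python
import numpy as np


def late_fuse(rgb_logits, flow_logits):
    """Add two per-frame logit arrays of shape (T, C) (logit-level fusion)."""
    rgb_logits = np.asarray(rgb_logits, dtype=np.float64)
    flow_logits = np.asarray(flow_logits, dtype=np.float64)
    if rgb_logits.shape != flow_logits.shape:
        raise ValueError("RGB and flow logits must have the same shape")
    return rgb_logits + flow_logits


def average_precision(scores, labels):
    """AP for one class: mean precision at the rank of each positive frame."""
    scores = np.asarray(scores, dtype=np.float64)
    labels = np.asarray(labels)
    order = np.argsort(-scores, kind="stable")  # highest score first
    hits = labels[order] > 0
    if not hits.any():
        return float("nan")  # class has no positive frame
    precision = np.cumsum(hits) / np.arange(1, len(hits) + 1)
    return float(precision[hits].mean())


def per_frame_map(scores, labels):
    """Mean AP over classes, skipping classes without positive frames."""
    scores = np.asarray(scores, dtype=np.float64)
    labels = np.asarray(labels)
    aps = [
        average_precision(scores[:, c], labels[:, c])
        for c in range(scores.shape[1])
    ]
    return float(np.nanmean(aps))


# Toy example: 3 frames, 2 classes (values are made up for illustration).
rgb = [[2.0, -1.0], [-1.0, 0.5], [0.5, 1.0]]
flow = [[1.0, -2.0], [-0.5, 1.5], [-1.0, 0.0]]
labels = [[1, 0], [0, 1], [0, 1]]
fused = late_fuse(rgb, flow)
print(per_frame_map(fused, labels))
```

Because only the ranking of frames matters for average precision, summing and averaging the two logit arrays give the same score.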
If you find our repo or paper useful, please cite us as:
@inproceedings{dai2022mstct,
title={{MS-TCT: Multi-Scale Temporal ConvTransformer for Action Detection}},
author={Dai, Rui and Das, Srijan and Kahatapitiya, Kumara and Ryoo, Michael and Bremond, Francois},
booktitle={CVPR},
year={2022}
}
Contact: rui.dai@inria.fr