This repository is mainly built upon PyTorch and PyTorch-Lightning. We wish to maintain a collection of scalable video transformer benchmarks and discuss the training recipes for training a large video transformer model.
Currently, we implement TimeSformer, ViViT and MaskFeat, and we have pre-trained `TimeSformer-B`, `ViViT-B` and `MaskFeat` on Kinetics-400/600, but we still cannot guarantee the performance reported in the papers. However, we have found some relevant hyper-parameters which may help reach the target performance.
- We have fixed several known issues and can now build scripts to pretrain `MViT-B` with `MaskFeat` or finetune `MViT-B`/`TimeSformer-B`/`ViViT-B` on K400.
- We have reimplemented the HOG extraction and HOG prediction of MaskFeat, which are now more efficient for pretraining.
- Note that if you want to train `TimeSformer-B` or `ViViT-B` with the current repo, you need to carefully adjust the learning rate and weight decay for better performance. For example, you can choose 0.005 for the peak learning rate and 0.0001 for the weight decay by default (see the sketch below).
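For reference, here is a minimal sketch of those defaults (peak learning rate 0.005, weight decay 0.0001) with a cosine schedule in plain PyTorch. The repo itself sets these values through the command-line flags shown in the training section; the tiny `nn.Linear` model and the momentum value are only stand-in assumptions.

```python
import torch
import torch.nn as nn

# Stand-in model; in practice this would be TimeSformer-B or ViViT-B.
model = nn.Linear(768, 400)

# Suggested defaults from the note above: peak lr 0.005, weight decay 0.0001.
# The momentum value is an assumption, not taken from the repo.
optimizer = torch.optim.SGD(
    model.parameters(), lr=0.005, momentum=0.9, weight_decay=1e-4
)

# Cosine decay of the learning rate over 30 finetuning epochs, matching the
# finetune commands further below.
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=30)
```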
In order to share the basic divided space-time attention module across different video transformers, we make the following changes:
- We split the position embedding of shape `R^(n_t*n_h*n_w x d)` mentioned in the ViViT paper into a spatial embedding of shape `R^(n_h*n_w x d)` and a temporal embedding of shape `R^(n_t x d)` to stay consistent with TimeSformer (sketched right after this list).
- To make it clear whether the `class_token` enters the module's forward computation, we only compute the interaction between the `class_token` and the `query` when the current layer is the last layer (except the `FFN`) of each transformer block.
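A minimal sketch of this factorized position embedding, assuming the patch tokens are arranged as `(batch, n_t, n_h*n_w, d)`; the class name and tensor layout are illustrative rather than the repo's exact code.

```python
import torch
import torch.nn as nn

class FactorizedPosEmbed(nn.Module):
    """Separate spatial (n_h*n_w x d) and temporal (n_t x d) position tables
    instead of one joint (n_t*n_h*n_w x d) table."""

    def __init__(self, num_patches: int, num_frames: int, dim: int):
        super().__init__()
        self.pos_embed_spatial = nn.Parameter(torch.zeros(1, num_patches, dim))
        self.pos_embed_temporal = nn.Parameter(torch.zeros(1, num_frames, dim))
        nn.init.trunc_normal_(self.pos_embed_spatial, std=0.02)
        nn.init.trunc_normal_(self.pos_embed_temporal, std=0.02)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, n_t, n_h*n_w, dim) patch tokens, class token excluded.
        x = x + self.pos_embed_spatial.unsqueeze(1)   # broadcast over time
        x = x + self.pos_embed_temporal.unsqueeze(2)  # broadcast over space
        return x
```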
- Tokenization: the token embedding filter can be either `Conv2D` or `Conv3D`. When initializing the `Conv3D` filters from `Conv2D` weights, the 2D filters can be replicated along the temporal dimension and averaged, or placed only at the central temporal position `t/2` with zeros at the other temporal positions (see the sketch after this list).
- Temporal `MSA` module weights: one can choose to copy the weights from the spatial `MSA` module or initialize all weights with zeros.
- Initialization from the `MAE` pre-trained model provided by ZhiLiang, where the `class_token`, which does not appear in the `MAE` pre-trained model, is initialized from a truncated normal distribution.
- Initialization from the `ViT` pre-trained model, which can be found here.
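As a rough illustration of the two inflation choices in the tokenization item, here is a small helper; the function name is hypothetical and the repo's own initialization code may differ in details.

```python
import torch

def inflate_conv2d_weight(w2d: torch.Tensor, t: int, mode: str = "center") -> torch.Tensor:
    """Inflate a Conv2D weight (out_c, in_c, kh, kw) to a Conv3D weight (out_c, in_c, t, kh, kw).

    mode="average": replicate the 2D filter along time and divide by t, keeping activation scale.
    mode="center":  zeros everywhere except the central temporal position t // 2.
    """
    out_c, in_c, kh, kw = w2d.shape
    if mode == "average":
        return w2d.unsqueeze(2).repeat(1, 1, t, 1, 1) / t
    if mode == "center":
        w3d = torch.zeros(out_c, in_c, t, kh, kw, dtype=w2d.dtype)
        w3d[:, :, t // 2] = w2d
        return w3d
    raise ValueError(f"unknown mode: {mode}")
```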
- [√] add more `TimeSformer` and `ViViT` variants with pre-trained weights.
  - A larger version and other operation types.
- [√] add `linear prob` and `finetune recipe`.
  - Make it possible to transfer the pre-trained model to downstream tasks.
- [ ] add more scalable Video Transformer benchmarks.
  - We will mainly focus on data-efficient models.
- [ ] add more robust objective functions.
  - Pre-train the model with the dominant self-supervised methods, e.g. Masked Image Modeling.
```shell
pip install -r requirements.txt
```
```shell
# path to Kinetics400 train set and val set
TRAIN_DATA_PATH='/path/to/Kinetics400/train_list.txt'
VAL_DATA_PATH='/path/to/Kinetics400/val_list.txt'
# path to root directory
ROOT_DIR='/path/to/work_space'
# path to pretrain weights
PRETRAIN_WEIGHTS='/path/to/weights'

# pretrain mvit using maskfeat
python model_pretrain.py \
	-lr 8e-4 -epoch 300 -batch_size 16 -num_workers 8 -frame_interval 4 -num_frames 16 -num_class 400 \
	-root_dir $ROOT_DIR -train_data_path $TRAIN_DATA_PATH

# finetune mvit with maskfeat pretrain weights
python model_pretrain.py \
	-lr 0.005 -epoch 200 -batch_size 8 -num_workers 4 -num_frames 16 -frame_interval 4 -num_class 400 \
	-arch 'mvit' -optim_type 'adamw' -lr_schedule 'cosine' -objective 'supervised' -mixup True \
	-auto_augment 'rand_aug' -root_dir $ROOT_DIR -train_data_path $TRAIN_DATA_PATH \
	-val_data_path $VAL_DATA_PATH -pretrain_pth $PRETRAIN_WEIGHTS

# finetune timesformer with imagenet pretrain weights
python model_pretrain.py \
	-lr 0.005 -epoch 30 -batch_size 8 -num_workers 4 -num_frames 8 -frame_interval 32 -num_class 400 \
	-arch 'timesformer' -attention_type 'divided_space_time' -optim_type 'sgd' -lr_schedule 'cosine' \
	-objective 'supervised' -root_dir $ROOT_DIR -train_data_path $TRAIN_DATA_PATH \
	-val_data_path $VAL_DATA_PATH -pretrain_pth $PRETRAIN_WEIGHTS -weights_from 'imagenet'

# finetune vivit with imagenet pretrain weights
python model_pretrain.py \
	-lr 0.005 -epoch 30 -batch_size 8 -num_workers 4 -num_frames 16 -frame_interval 16 -num_class 400 \
	-arch 'vivit' -attention_type 'fact_encoder' -optim_type 'sgd' -lr_schedule 'cosine' \
	-objective 'supervised' -root_dir $ROOT_DIR -train_data_path $TRAIN_DATA_PATH \
	-val_data_path $VAL_DATA_PATH -pretrain_pth $PRETRAIN_WEIGHTS -weights_from 'imagenet'
```
The minimal folder structure will look like the following.
```
root_dir
├── results
│   ├── experiment_tag
│   │   ├── ckpt
│   │   ├── log
```
name | weights from | dataset | epochs | num frames | spatial crop | top1_acc | top5_acc | weight | log |
---|---|---|---|---|---|---|---|---|---|
TimeSformer-B | ImageNet-21K | K600 | 15e | 8 | 224 | 78.4 | 93.6 | Google drive or BaiduYun (code: yr4j) | log |
ViViT-B | ImageNet-21K | K400 | 30e | 16 | 224 | 75.2 | 91.5 | Google drive | |
MaskFeat | from scratch | K400 | 100e | 16 | 224 | - | - | Google drive | |
For each column, we show the masked input (left), the HOG prediction (middle) and the original video frame (right).
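As a reference for how such HOG targets can be produced, here is a minimal sketch using `skimage` (one of this repo's dependencies). The 9-bin, 8x8-cell setting follows the common MaskFeat configuration; the repo's own re-implementation may differ in normalization and channel handling.

```python
import numpy as np
from skimage.feature import hog

def hog_target(frame: np.ndarray) -> np.ndarray:
    """Per-cell HOG features for one RGB frame of shape (H, W, 3)."""
    feat = hog(
        frame,
        orientations=9,          # 9 orientation bins per histogram
        pixels_per_cell=(8, 8),  # one histogram per 8x8 pixel cell
        cells_per_block=(1, 1),
        channel_axis=-1,
        feature_vector=False,
    )
    # (n_cells_h, n_cells_w, 1, 1, 9) -> (n_cells_h, n_cells_w, 9)
    return feat.reshape(feat.shape[0], feat.shape[1], -1)
```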
Here, we show the extracted attention map of a random frame sampled from the demo video.
operation | top1_acc | top5_acc | top1_acc (three crop) |
---|---|---|---|
base | 68.2 | 87.6 | - |
+ frame_interval 4 -> 16 (span more time) | 72.9 (+4.7) | 91.0 (+3.4) | - |
+ RandomCrop, flip (overcome overfit) | 75.7 (+2.8) | 92.5 (+1.5) | - |
+ batch size 16 -> 8 (more iterations) | 75.8 (+0.1) | 92.4 (-0.1) | - |
+ frame_interval 16 -> 24 (span more time) | 77.7 (+1.9) | 93.3 (+0.9) | 78.4 |
+ frame_interval 24 -> 32 (span more time) | 78.4 (+0.7) | 94.0 (+0.7) | 79.1 |
Tips: `frame_interval` and data augmentation matter a lot for the validation accuracy.
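To make the "span more time" comments concrete: a clip of `num_frames` frames sampled with stride `frame_interval` covers `num_frames * frame_interval` source frames, e.g. 8 x 32 = 256 frames, roughly 8.5 s at 30 fps. The sampler below is only an illustration of this, not the repo's exact implementation.

```python
import numpy as np

def sample_clip_indices(num_frames: int, frame_interval: int,
                        total_frames: int, rng=None) -> np.ndarray:
    """Sample one clip of evenly strided frame indices from a video."""
    rng = rng or np.random.default_rng()
    span = num_frames * frame_interval            # temporal footprint of the clip
    start = rng.integers(0, max(total_frames - span, 0) + 1)
    indices = start + np.arange(num_frames) * frame_interval
    return np.clip(indices, 0, total_frames - 1)  # guard against short videos
```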
operation | epoch_time |
---|---|
base (start with DDP) | 9h+ |
+ speed up training recipes | 1h+ |
+ switch from `get_batch first` to `sample_Indice first` | 0.5h |
+ batch size 16 -> 8 | 33.32m |
+ num_workers 8 -> 4 | 35.52m |
+ frame_interval 16 -> 24 | 44.35m |
Tips: increasing the `frame_interval` noticeably increases the epoch time.
1. `speed up training recipes`:
    - More GPU devices.
    - `pin_memory=True`.
    - Avoid CPU->GPU device transfers (such as `.item()`, `.numpy()`, `.cpu()` operations on tensors, or logging to disk).
2. `get_batch first` means that we first read all frames through the video reader and then take the target slice of frames, which largely slows down the data-loading speed (see the sketch below).
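Below is a small illustration of the two strategies with decord, together with the pin-memory / non-blocking transfer pattern from the speed-up recipes. The video path and the frame numbers are placeholders, not values taken from this repo.

```python
import numpy as np
import torch
from decord import VideoReader

vr = VideoReader('demo.mp4')                       # placeholder path
num_frames, frame_interval = 16, 4
indices = (np.arange(num_frames) * frame_interval).tolist()

# "get_batch first": decode every frame, then slice -- slow and memory hungry.
# all_frames = vr.get_batch(list(range(len(vr)))).asnumpy()
# clip = all_frames[indices]

# "sample indices first": compute the indices, then decode only those frames.
clip = vr.get_batch(indices).asnumpy()             # (16, H, W, 3), uint8
clip = torch.from_numpy(clip).permute(3, 0, 1, 2)  # (3, 16, H, W)

# With DataLoader(pin_memory=True), page-locked batches allow asynchronous
# host->GPU copies; also avoid .item()/.cpu() in the training step, since
# each such call forces a GPU synchronization.
if torch.cuda.is_available():
    clip = clip.cuda(non_blocking=True)
```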
This repo is built on top of PyTorch-Lightning, pytorchvideo, skimage, decord and kornia. I also learned many code designs from MMAction2. I thank the authors for releasing their code.
I look forward to hearing your ideas about this repo. Please feel free to report them in the issues, or even better, submit a pull request.
And your star is my motivation, thank u~