ActBERT

Introduction

ActBERT was proposed by Baidu at CVPR 2020 for multimodal pretraining. It leverages global action information to catalyze mutual interactions between linguistic texts and local regional objects. The method introduces a TaNgled Transformer block (TNT) to encode three sources of information: global actions, local regional objects, and linguistic descriptions. In the paper, ActBERT significantly outperforms the prior state of the art on five downstream video-and-language tasks: text-video clip retrieval, video captioning, video question answering, action segmentation, and action step localization.

Data

Please refer to the HowTo100M data download and preparation doc: HowTo100M-data

Please refer to the MSR-VTT data download and preparation doc: MSR-VTT-data

Train

Train on HowTo100M

Download the pretrained model

Please download bert-base-uncased as the pretrained model:

wget https://videotag.bj.bcebos.com/PaddleVideo-release2.2/bert-base-uncased.pdparams

Then add its path to MODEL.framework.backbone.pretrained in the config file:

MODEL:
    framework: "ActBert"
    backbone:
        name: "BertForMultiModalPreTraining"
        pretrained: your weight path
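
As a minimal sketch of the edit above, the following Python snippet mirrors the config fragment as a nested dict and fills in the `pretrained` field. The helper name `set_pretrained` and the local path are hypothetical and not part of PaddleVideo; the nesting follows the fragment as printed, where `backbone` sits directly under `MODEL`.

```python
import copy

# Nested dict mirroring the config fragment above
# (assumption: only these keys matter for this illustration).
BASE_CONFIG = {
    "MODEL": {
        "framework": "ActBert",
        "backbone": {
            "name": "BertForMultiModalPreTraining",
            "pretrained": None,
        },
    }
}


def set_pretrained(config, weight_path):
    """Return a copy of config with MODEL -> backbone -> pretrained set."""
    if not weight_path:
        raise ValueError("weight_path must be a non-empty string")
    updated = copy.deepcopy(config)  # leave the caller's config untouched
    updated["MODEL"]["backbone"]["pretrained"] = weight_path
    return updated


# Hypothetical local path to the file fetched with wget above.
cfg = set_pretrained(BASE_CONFIG, "data/bert-base-uncased.pdparams")
print(cfg["MODEL"]["backbone"]["pretrained"])
```

In practice the same change is made by editing the YAML config file directly; the dict form is only meant to make the key nesting explicit.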

Training on a small subset of the data is also supported; the provided config file is for reference only.

Start training

Train ActBERT on HowTo100M with the following command:

python3.7 -B -m paddle.distributed.launch --gpus="0,1,2,3,4,5,6,7"  --log_dir=log_actbert  main.py  --validate -c configs/multimodal/actbert/actbert.yaml

Automatic mixed precision (AMP) can speed up training:

export FLAGS_conv_workspace_size_limit=800 #MB
export FLAGS_cudnn_exhaustive_search=1
export FLAGS_cudnn_batchnorm_spatial_persistent=1

python3.7 -B -m paddle.distributed.launch --gpus="0,1,2,3,4,5,6,7"  --log_dir=log_actbert  main.py  --amp --validate -c configs/multimodal/actbert/actbert.yaml
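
The two training commands above differ only in the `--amp` flag and the exported environment variables. The sketch below assembles the same command as an argument list, which may be convenient when launching from a script. The function name `build_train_command` and the `AMP_ENV` dict are hypothetical helpers, not part of PaddleVideo; the flags and values are copied from the commands above.

```python
import shlex


def build_train_command(gpu_ids, use_amp=False,
                        config="configs/multimodal/actbert/actbert.yaml",
                        log_dir="log_actbert"):
    """Assemble the distributed training command shown above as a list."""
    gpu_ids = list(gpu_ids)
    if not gpu_ids:
        raise ValueError("gpu_ids must contain at least one GPU index")
    cmd = [
        "python3.7", "-B", "-m", "paddle.distributed.launch",
        "--gpus=" + ",".join(str(g) for g in gpu_ids),
        "--log_dir=" + log_dir,
        "main.py",
    ]
    if use_amp:
        cmd.append("--amp")  # only difference between the two commands
    cmd += ["--validate", "-c", config]
    return cmd


# Environment variables exported in the AMP example above.
AMP_ENV = {
    "FLAGS_conv_workspace_size_limit": "800",  # MB
    "FLAGS_cudnn_exhaustive_search": "1",
    "FLAGS_cudnn_batchnorm_spatial_persistent": "1",
}

cmd = build_train_command(range(8), use_amp=True)
print(shlex.join(cmd))
```

The resulting list could be passed to `subprocess.run(cmd, env={**os.environ, **AMP_ENV})`, assuming the script is started from the repository root where `main.py` lives.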

Test

Evaluation is performed on a downstream task, text-video clip retrieval on the MSR-VTT dataset. Test accuracy can be obtained with the following command:

python3.7 main.py --test -c configs/multimodal/actbert/actbert_msrvtt.yaml -w Actbert.pdparams

Metrics on MSR-VTT:

R@1	R@5	R@10	Median R	Mean R	checkpoints
8.6	31.2	45.5	13.0	28.5	ActBERT.pdparams
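
For readers unfamiliar with these retrieval metrics, here is a minimal sketch of how they are commonly defined: R@K is the percentage of text queries whose correct video appears in the top K results, and Median R and Mean R are the median and mean rank of the correct video. This is not the PaddleVideo evaluation code. The function name `retrieval_metrics` and the toy similarity matrix are made up for illustration, and the sketch assumes the correct video for query i is video i.

```python
from statistics import mean, median


def retrieval_metrics(similarity):
    """Compute R@1, R@5, R@10 (percent), median rank, and mean rank.

    similarity[i][j] is the score between text query i and video j.
    Assumption: the correct video for query i is video i.
    """
    if not similarity:
        raise ValueError("similarity must contain at least one query")
    ranks = []
    for i, row in enumerate(similarity):
        target = row[i]
        # Rank = 1 + number of videos scored strictly higher than the target.
        ranks.append(1 + sum(1 for score in row if score > target))
    n = len(ranks)
    metrics = {
        f"R@{k}": 100.0 * sum(r <= k for r in ranks) / n for k in (1, 5, 10)
    }
    metrics["Median R"] = float(median(ranks))
    metrics["Mean R"] = float(mean(ranks))
    return metrics


# Toy 3x3 example: queries 0 and 2 rank their video first, query 1 second.
toy = [
    [0.9, 0.1, 0.2],
    [0.8, 0.3, 0.1],
    [0.2, 0.1, 0.7],
]
m = retrieval_metrics(toy)
print(m)
```

Higher is better for R@K, while lower is better for Median R and Mean R.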

Reference

ActBERT: Learning Global-Local Video-Text Representations, Linchao Zhu, Yi Yang
