This repository contains the source code for the CVPR 2024 paper *Dysen-VDM: Empowering Dynamics-aware Text-to-Video Diffusion with LLMs*.
```bash
conda create -n dysen_vdm python=3.8.5
conda activate dysen_vdm
pip install -r requirements.txt
```
Put all the data under the `dataset/` folder.

- Pre-training corpus
  - WebVid
    - WebVid is a large-scale dataset of videos with textual descriptions, whose videos are diverse and rich in content.
    - It contains 10.7M video-caption pairs; we use only 3M text-video pairs for pre-training the VDM.
    - Download the dataset from the official website and save it under `dataset/webvid`.
- Text-to-video in-domain data
  - UCF-101
    - Composed of diverse human actions: 101 classes, where each class label denotes a specific movement.
    - Download the dataset from the official website and save it under `dataset/ucf101`.
  - MSR-VTT
    - MSR-VTT (Microsoft Research Video to Text) is a large-scale text-video pair dataset. It consists of 10,000 video clips from 20 categories, and each video clip is annotated with 20 English sentences by Amazon Mechanical Turk workers.
    - Download the dataset from the official website and save it under `dataset/msrvtt`.
  - ActivityNet
    - Each video in ActivityNet is paired with descriptions covering multiple actions (at least 3), allowing multiple complex events to be described.
    - Download the dataset from the official website and save it under `dataset/activityNet`.
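After downloading, the `dataset/` folder should look roughly like the tree below. The subfolder names come from the paths above; the contents of each subfolder depend on the respective dataset release:

```
dataset/
├── webvid/        # WebVid text-video pairs (pre-training corpus)
├── ucf101/        # UCF-101 action videos
├── msrvtt/        # MSR-VTT video clips and captions
└── activityNet/   # ActivityNet videos and descriptions
```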
We first pre-train the Dysen-VDM system. The pre-training uses the `dataset/webvid` text-video pair data.
```bash
bash shellscripts/train_vdm_autoencoder.sh
```
- Properly set up `PROJ_ROOT`, `DATADIR`, `EXPERIMENT_NAME`, and `CONFIG`, where `EXPERIMENT_NAME=webvid`; see the sketch below.
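A minimal sketch of the setup, assuming the script reads these as environment variables (if they are instead defined inside the script, edit them there); all paths are placeholders:

```bash
# Placeholder paths -- adjust to your local setup.
export PROJ_ROOT="/path/to/Dysen-VDM"             # repository root
export DATADIR="${PROJ_ROOT}/dataset/webvid"      # WebVid pre-training data
export EXPERIMENT_NAME="webvid"                   # as required above
export CONFIG="/path/to/autoencoder_config.yaml"  # placeholder config file

bash shellscripts/train_vdm_autoencoder.sh
```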
```bash
bash shellscripts/run_train_vdm.sh
```
- Properly set up `PROJ_ROOT`, `DATADIR`, `AEPATH`, `EXPERIMENT_NAME`, and `CONFIG`, where `EXPERIMENT_NAME=webvid`; see the sketch below.
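As above, a sketch with placeholder paths; `AEPATH` is assumed to point at the autoencoder checkpoint produced by the previous step:

```bash
# Placeholder paths -- adjust to your local setup.
export PROJ_ROOT="/path/to/Dysen-VDM"
export DATADIR="${PROJ_ROOT}/dataset/webvid"
export AEPATH="/path/to/autoencoder.ckpt"   # checkpoint from the previous step
export EXPERIMENT_NAME="webvid"
export CONFIG="/path/to/vdm_config.yaml"    # placeholder config file

bash shellscripts/run_train_vdm.sh
```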
This step uses the gold DSG (dynamic scene graph) of each video to update the recurrent graph Transformer in the 3D-UNet; parse the DSG annotations in advance with the tools in `dysen/DSG`.

```bash
bash shellscripts/run_train_dysen_vdm.sh
```
- Properly set up `EXPERIMENT_NAME`, `RESUME`, `DATADIR`, `CKPT_PATH`, and `VDM_MODEL`, where `EXPERIMENT_NAME=webvid`.
- The in-context learning (ICL) process within dysen is optimized with reinforcement learning (RL). If using RL for the Imagination Rationality optimization, the gold DSG of each video is needed: parse the DSG annotations in advance with the tools in `dysen/DSG`. A sketch of the full setup follows below.
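A sketch of this step under the same assumptions (environment variables, placeholder paths); `VDM_MODEL` is assumed to point at the VDM checkpoint from the previous step, and the DSG-parsing invocation is illustrative only:

```bash
# Placeholder paths -- adjust to your local setup.
export EXPERIMENT_NAME="webvid"
export RESUME="/path/to/resume.ckpt"          # checkpoint to resume from
export DATADIR="/path/to/Dysen-VDM/dataset/webvid"
export CKPT_PATH="/path/to/save/checkpoints"  # where new checkpoints are written
export VDM_MODEL="/path/to/vdm.ckpt"          # VDM checkpoint from the previous step

# If optimizing Imagination Rationality with RL, parse the gold DSG
# annotations first with the tools in dysen/DSG; the exact entry point
# lives in that folder (the command here is only a placeholder):
# python dysen/DSG/<parser>.py --input <captions> --output <dsg_dir>

bash shellscripts/run_train_dysen_vdm.sh
```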
We further update Dysen-VDM on the in-domain training set:
```bash
bash shellscripts/run_train_dysen_vdm.sh
```
- Properly set up `EXPERIMENT_NAME`, `RESUME`, `DATADIR`, `CKPT_PATH`, and `VDM_MODEL`, where `EXPERIMENT_NAME` is one of `activityNet`, `msrvtt`, or `ucf101`; see the sketch below.
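One way to run all three in-domain updates, sketched with placeholder paths and the same environment-variable assumption:

```bash
# Placeholder paths -- adjust to your local setup.
for name in activityNet msrvtt ucf101; do
  export EXPERIMENT_NAME="$name"
  export DATADIR="/path/to/Dysen-VDM/dataset/$name"
  export RESUME="/path/to/resume.ckpt"
  export CKPT_PATH="/path/to/save/checkpoints/$name"
  export VDM_MODEL="/path/to/pretrained_dysen_vdm.ckpt"  # from the pre-training stage
  bash shellscripts/run_train_dysen_vdm.sh
done
```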
Measure the performance of Dysen-VDM on the above datasets:

```bash
bash shellscripts/run_eval_dysen_vdm.sh
```
- Properly set up `DATACONFIG`, `PREDCITPATH`, `GOLDPATH`, `EXPERIMENT_NAME`, and `RESDIR`; see the sketch below.
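A sketch with placeholder paths; `PREDCITPATH` keeps the spelling used by the script, and is assumed to point at the generated videos while `GOLDPATH` points at the ground-truth ones:

```bash
# Placeholder paths -- adjust to your local setup.
export DATACONFIG="/path/to/eval_data_config.yaml"
export PREDCITPATH="/path/to/generated_videos"  # model predictions
export GOLDPATH="/path/to/gold_videos"          # ground-truth videos
export EXPERIMENT_NAME="msrvtt"                 # or activityNet / ucf101
export RESDIR="/path/to/results"                # where metrics are written

bash shellscripts/run_eval_dysen_vdm.sh
```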
Text-to-video generation with the well-trained Dysen-VDM:

```bash
bash shellscripts/run_sample_vdm_text2video.sh
```
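The sampling script's exact interface is not documented here, so the variables below are illustrative guesses; check `shellscripts/run_sample_vdm_text2video.sh` for what it actually expects:

```bash
# Illustrative only -- inspect the script for its real variables.
export CKPT_PATH="/path/to/trained_dysen_vdm.ckpt"
export PROMPT="a dog running across a grassy field"

bash shellscripts/run_sample_vdm_text2video.sh
```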
For any questions or feedback, feel free to contact Hao Fei.
If you find Dysen-VDM useful in your research or applications, please kindly cite:
```bibtex
@inproceedings{fei2024dysen,
  title={Dysen-VDM: Empowering Dynamics-aware Text-to-Video Diffusion with LLMs},
  author={Hao Fei and Shengqiong Wu and Wei Ji and Hanwang Zhang and Tat-Seng Chua},
  booktitle={Proceedings of the CVPR},
  pages={961--970},
  year={2024}
}
```
This repository is released under the BSD 3-Clause License. Dysen-VDM is a research project intended for non-commercial use only. The code must not be used for any illegal, harmful, violent, racist, or sexual purposes, and engaging in any activity that may violate these guidelines is strictly prohibited. Any potential commercial use of this code must be approved by the authors.