OpenVid-1M: A Large-Scale High-Quality Dataset for Text-to-video Generation

OpenVid-1M is a high-quality text-to-video dataset designed for research use, featuring high aesthetics, clarity, and resolution. It can be used for direct training or as a quality-tuning complement to other video datasets, and it also supports other video generation tasks (video super-resolution, frame interpolation, etc.).

We carefully curate 1 million high-quality video clips with expressive captions to advance text-to-video research; 0.4 million of these clips are at 1080p resolution (termed OpenVidHD-0.4M).

OpenVid-1M is cited, discussed, or used in several recent works, including the video diffusion models Goku, MarDini, Allegro, T2V-Turbo-V2, Pyramid Flow, and SnapGen-V; the autoregressive long-video generation model ARLON; the visual understanding and generation model VILA-U; the 3D/4D generation models GenXD and DimensionX; the video VAE model IV-VAE; the frame interpolation model Framer; and the large multimodal model InternVL 2.5.

News 🚀🚀🚀

  • [2025.02.28] 🤗 Thanks @Binglei, OpenVid-1M-mapping was developed to correlate the video names in the CSV files with their file paths in the unzipped archives. It is particularly useful if you only need a portion of OpenVid-1M and prefer not to download the entire collection (see the sketch after this list).
  • [2025.01.23] 🏆 OpenVid-1M is accepted by ICLR 2025!!!
  • [2024.12.01] 🚀 The OpenVid-1M dataset was downloaded over 79,000 times on Hugging Face last month, placing it in the top 1% of all video datasets (as of Nov. 2024)!!
  • [2024.07.01] 🔥 Our paper, code, model and OpenVid-1M dataset are released!
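
Below is a minimal, hypothetical sketch of how such a mapping could be used to download only a subset of clips. The mapping file name (OpenVid-1M-mapping.csv) and its column names (video, zip_part) are assumptions for illustration, not the actual schema of the mapping release; adapt them after inspecting the files.

# Hypothetical sketch: join the caption CSV with the name-to-path mapping to find
# which archive parts a chosen subset of videos lives in. Column names are assumed.
import pandas as pd

captions = pd.read_csv("OpenVid-1M.csv")                 # official caption file
mapping = pd.read_csv("OpenVid-1M-mapping.csv")          # assumed mapping file name
subset = captions.sample(1000, random_state=0)           # pick whatever subset you need
subset = subset.merge(mapping, on="video", how="left")   # "video" join key is assumed
print(subset["zip_part"].dropna().unique())              # archive parts to download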

Preparation

Environment

# create and activate the environment
conda create -n openvid python=3.10
conda activate openvid
# install PyTorch and build tools
pip install torch torchvision
pip install packaging ninja
# install FlashAttention, NVIDIA Apex (with CUDA extensions), and xFormers
pip install flash-attn --no-build-isolation
pip install -v --disable-pip-version-check --no-cache-dir --no-build-isolation --config-settings "--build-option=--cpp_ext" --config-settings "--build-option=--cuda_ext" git+https://github.com/NVIDIA/apex.git
pip install -U xformers --index-url https://download.pytorch.org/whl/cu121
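
After installation, a quick sanity check such as the following (an optional sketch, assuming a CUDA-capable GPU) confirms that PyTorch sees the GPU and that the compiled extensions import cleanly:

# Optional sanity check; run inside the "openvid" conda environment.
import torch
print(torch.__version__, "CUDA available:", torch.cuda.is_available())
import flash_attn, xformers, apex   # each raises ImportError if its build failed
print(flash_attn.__version__, xformers.__version__)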

Dataset

  1. Download the OpenVid-1M dataset.
# downloading the full dataset takes a long time.
python download_scripts/download_OpenVid.py
  2. Put the OpenVid-1M dataset in the ./dataset folder as follows:
dataset
└─ OpenVid-1M
    └─ data
        └─ train
            └─ OpenVid-1M.csv
            └─ OpenVidHD.csv
    └─ video
        └─ ---_iRTHryQ_13_0to241.mp4
        └─ ---agFLYkbY_7_0to303.mp4
        └─ --0ETtekpw0_2_18to486.mp4
        └─ ...
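
Once the files are laid out as above, captions can be paired with the clip paths. The snippet below is a minimal sketch; the CSV column names (video, caption) are assumptions, so check the actual header of OpenVid-1M.csv first.

# Minimal sketch: pair each caption with its video file under dataset/OpenVid-1M.
# The column names "video" and "caption" are assumed; verify them in the CSV header.
import csv, os

root = "dataset/OpenVid-1M"
with open(os.path.join(root, "data", "train", "OpenVid-1M.csv"), newline="") as f:
    for row in csv.DictReader(f):
        clip = os.path.join(root, "video", row["video"])
        if os.path.exists(clip):          # some clips may not be downloaded yet
            print(clip, row["caption"][:80])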

Model Weights

Model              | Data       | Pretrained Weight | Steps | Batch Size | URL
-------------------|------------|-------------------|-------|------------|----
STDiT-16×1024×1024 | OpenVidHQ  | STDiT-16×512×512  | 16k   | 32×4       | 🔗
STDiT-16×512×512   | OpenVid-1M | STDiT-16×256×256  | 20k   | 32×8       | 🔗
MVDiT-16×512×512   | OpenVid-1M | MVDiT-16×256×256  | 20k   | 32×4       | 🔗

Our models' weights are partially initialized from PixArt-α.

Inference

# MVDiT, 16x512x512
torchrun --standalone --nproc_per_node 1 scripts/inference.py --config configs/mvdit/inference/16x512x512.py --ckpt-path MVDiT-16x512x512.pt
# STDiT, 16x512x512
torchrun --standalone --nproc_per_node 1 scripts/inference.py --config configs/stdit/inference/16x512x512.py --ckpt-path STDiT-16x512x512.pt
# STDiT, 16x1024x1024
torchrun --standalone --nproc_per_node 1 scripts/inference.py --config configs/stdit/inference/16x1024x1024.py --ckpt-path STDiT-16x1024x1024.pt

Training

# MVDiT, 16x256x256, 72k Steps
torchrun --nnodes=1 --nproc_per_node=1 scripts/train.py --config configs/mvdit/train/16x256x256.py
# MVDiT, 16x512x512, 20k Steps
torchrun --nnodes=1 --nproc_per_node=1 scripts/train.py --config configs/mvdit/train/16x512x512.py

# STDiT, 16x256x256, 72k Steps
torchrun --nnodes=1 --nproc_per_node=1 scripts/train.py --config configs/stdit/train/16x256x256.py
# STDiT, 16x512x512, 20k Steps
torchrun --nnodes=1 --nproc_per_node=1 scripts/train.py --config configs/stdit/train/16x512x512.py
# STDiT, 16x1024x1024, 16k Steps
torchrun --nnodes=1 --nproc_per_node=1 scripts/train.py --config configs/stdit/train/16x1024x1024.py

Training order: 16×256×256 $\rightarrow$ 16×512×512 $\rightarrow$ 16×1024×1024.

References

Part of the code is based upon: Open-Sora. Thanks for their great work!

Citation

@article{nan2024openvid,
  title={OpenVid-1M: A Large-Scale High-Quality Dataset for Text-to-video Generation},
  author={Nan, Kepan and Xie, Rui and Zhou, Penghao and Fan, Tiehan and Yang, Zhenheng and Chen, Zhijie and Li, Xiang and Yang, Jian and Tai, Ying},
  journal={arXiv preprint arXiv:2407.02371},
  year={2024}
}
