Open-Sora Plan

💪 Goal

This project aims to create a simple and scalable repo to reproduce Sora (OpenAI, but we prefer to call it "ClosedAI") and to build knowledge about Video-VQVAE (VideoGPT) + DiT at scale. However, since we have limited resources, we sincerely hope the open-source community can contribute to this project. Pull requests are welcome!

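This project hopes to reproduce Sora through the power of the open-source community. It was jointly initiated by the Peking University-Tuzhan (北大-兔展) AIGC Joint Lab. With our currently limited resources we have only set up the basic architecture and cannot run a full training. We hope the open-source community will help us add modules step by step and gather resources for training. The current version is still far from the goal and needs continuous improvement and rapid iteration. Pull requests are welcome!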

Project stages:

Primary

Set up the codebase and train an unconditional model on a landscape dataset.
Train models that boost resolution and duration.

Extensions

Conduct text2video experiments on a landscape dataset.
Train the 1080p model on a video2text dataset.
Control model with more conditions.

📰 News

[2024.03.07] We support training with 128 frames (when sample rate = 3, which is about 13 seconds) of 256x256, or 64 frames (which is about 6 seconds) of 512x512.

[2024.03.05] See our latest todo; pull requests are welcome.

[2024.03.04] We reorganized and modularized our code to make it easier to contribute to the project. To contribute, please see the Repo structure.

[2024.03.03] We opened some discussions to clarify several issues.

[2024.03.01] Training code is available now! Learn more on our project page. Please feel free to watch 👀 this repository for the latest updates.
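The clip lengths quoted in the [2024.03.07] entry can be sanity-checked with a little arithmetic. The sketch below is illustrative only: `clip_duration_seconds` is a hypothetical helper, and the 30 fps source frame rate is an assumption (the entry does not state it), as is reusing sample rate 3 for the 64-frame case.

```python
def clip_duration_seconds(num_frames: int, sample_rate: int, source_fps: float) -> float:
    """Seconds of source video spanned by `num_frames` frames taken every `sample_rate` frames."""
    if num_frames <= 0 or sample_rate <= 0 or source_fps <= 0:
        raise ValueError("num_frames, sample_rate and source_fps must be positive")
    return num_frames * sample_rate / source_fps


ASSUMED_FPS = 30.0  # assumption: the source frame rate is not given in this README

# 128 frames at sample rate 3 -> roughly the "about 13 seconds" quoted above
print(clip_duration_seconds(128, 3, ASSUMED_FPS))  # 12.8
# 64 frames, assuming the same sample rate -> roughly "about 6 seconds"
print(clip_duration_seconds(64, 3, ASSUMED_FPS))   # 6.4
```

If your videos use a different frame rate, pass it as `source_fps` to see how much footage a training clip actually covers.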

✊ Todo

Set up the codebase and train an unconditional model on a landscape dataset

Fix typos & Update readme. 🤝 Thanks to @mio2333, @CreamyLong, @chg0901
Set up the repo structure.
Add docker file. ⌛ [WIP]
Enable type hints for functions. 🙏 [Need your contribution]
Add Video-VQGAN model, which is borrowed from VideoGPT.
Support training with variable aspect ratios, resolutions, and durations on DiT.
Support dynamic mask input inspired by FiT.
Add class-conditioning on embeddings.
Incorporate Latte as the main codebase.
Add VAE model, which is borrowed from Stable Diffusion.
Combine dynamic mask input with the VAE.
Add VQVAE from VQGAN. 🙏 [Need your contribution]
Make the codebase ready for cluster training. Add SLURM scripts. 🙏 [Need your contribution]
Refactor VideoGPT. 🤝 Thanks to @qqingzheng, @luo3300612
Add sampling script.
Incorporate SiT. 🤝 Thanks to @khan-yin
Add evaluation scripts (FVD, CLIP score). 🙏 [Need your contribution]

Train models that boost resolution and duration

Add Positional Interpolation (PI) to support out-of-domain sizes. 🙏 [Need your contribution]
Add 2D RoPE to improve generalization ability, as in FiT. 🙏 [Need your contribution]
Extract features offline.
Add frame interpolation model. 🤝 Thanks to @yunyangge
Add super resolution model. 🤝 Thanks to @Linzy19
Add accelerate to automatically manage training.
Joint training with images. 🙏 [Need your contribution]
Incorporate NaViT. 🙏 [Need your contribution]
Add FreeNoise support for training-free longer video generation. 🙏 [Need your contribution]

Conduct text2video experiments on a landscape dataset

Finish data loading, pre-processing utils. ⌛ [WIP]
Add CLIP and T5 support. ⌛ [WIP]
Add text2image training script. ⌛ [WIP]
Add prompt captioner. 🙏 [Need your contribution] 🚀 [Require more computation]

Train the 1080p model on a video2text dataset

Look for a suitable dataset; discussion and recommendations are welcome. 🙏 [Need your contribution]
Finish data loading and pre-processing utils. ⌛ [WIP]
Support memory friendly training.
- Add flash-attention2 from pytorch.
- Add xformers.
- Support mixed precision training.
- Add gradient checkpointing.
- Support for ReBased and Ring attention. 🤝 Thanks to @kabachuha
- Train using the deepspeed engine. 🙏 [Need your contribution]
- Integrate with Colossal-AI for cheaper, faster, and more efficient training. 🙏 [Need your contribution]
Train with a text condition. Here we could conduct different experiments:
- Train with T5 conditioning. 🚀 [Require more computation]
- Train with CLIP conditioning. 🚀 [Require more computation]
- Train with CLIP + T5 conditioning (probably costly during training and experiments). 🚀 [Require more computation]

Control model with more conditions

Load pretrained weights from PixArt-α. ⌛ [WIP]
Incorporate ControlNet. 🙏 [Need your contribution]

📂 Repo structure (WIP)

├── README.md
├── docs
│   ├── Data.md                    -> Datasets description.
│   ├── Contribution_Guidelines.md -> Contribution guidelines description.
├── scripts                        -> All scripts.
├── opensora
│   ├── dataset
│   ├── models
│   │   ├── ae                     -> Compress videos to latents
│   │   │   ├── imagebase
│   │   │   │   ├── vae
│   │   │   │   └── vqvae
│   │   │   └── videobase
│   │   │       ├── vae
│   │   │       └── vqvae
│   │   ├── captioner
│   │   ├── diffusion              -> Denoise latents
│   │   │   ├── diffusion         
│   │   │   ├── dit
│   │   │   ├── latte
│   │   │   └── unet
│   │   ├── frame_interpolation
│   │   └── super_resolution
│   ├── sample
│   ├── train                      -> Training code
│   └── utils

🛠️ Requirements and Installation

The requirements are as follows.

Python >= 3.8
CUDA Version >= 11.7
Install required packages:

git clone https://github.com/PKU-YuanGroup/Open-Sora-Plan
cd Open-Sora-Plan
conda create -n opensora python=3.8 -y
conda activate opensora
pip install -e .
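
Before installing, it can help to confirm that your environment meets the minimums listed above. The following is a minimal sketch, not part of the repository: `parse_version` and `meets_minimum` are hypothetical helpers, and the CUDA version string is assumed to be supplied by you (for example from `torch.version.cuda` once PyTorch is installed).

```python
import sys
from typing import Tuple

MIN_PYTHON = (3, 8)   # from the requirements above
MIN_CUDA = (11, 7)    # from the requirements above


def parse_version(text: str) -> Tuple[int, ...]:
    """Turn a dotted version string such as '11.7' into a tuple of ints."""
    return tuple(int(part) for part in text.strip().split("."))


def meets_minimum(found: Tuple[int, ...], minimum: Tuple[int, ...]) -> bool:
    """Compare only as many components as the minimum specifies, padding with zeros."""
    padded = tuple(found) + (0,) * max(0, len(minimum) - len(found))
    return padded[: len(minimum)] >= minimum


# Check the running interpreter.
print("Python OK:", meets_minimum(tuple(sys.version_info[:2]), MIN_PYTHON))

# Assumption: you obtained this string yourself, e.g. from torch.version.cuda.
cuda_version = "11.8"  # placeholder value for illustration
print("CUDA OK:", meets_minimum(parse_version(cuda_version), MIN_CUDA))
```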

🗝️ Usage

Datasets

Refer to Data.md

Video-VQVAE (VideoGPT)

Training

To train VQVAE, run the script:

scripts/train_vqvae.sh

You can modify the training parameters within the script. For general training parameters, please refer to transformers.TrainingArguments. The remaining parameters are explained below:

VQ-VAE Specific Settings

--embedding_dim: number of dimensions for codebook embeddings
--n_codes 2048: number of codes in the codebook
--n_hiddens 240: number of hidden features in the residual blocks
--n_res_layers 4: number of residual blocks
--downsample "4,4,4": T H W downsampling stride of the encoder

Dataset Settings

--data_path <path>: path to an hdf5 file or a folder containing train and test folders with subdirectories of videos
--resolution 128: spatial resolution to train on
--sequence_length 16: temporal resolution, or video clip length
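
To see how the settings above interact, the sketch below estimates the size of the latent grid the encoder would produce for one clip. It is an illustration under stated assumptions, not the repository's code: `parse_downsample` and `latent_grid` are hypothetical helpers, and plain integer division is assumed (the real encoder's padding may differ).

```python
from typing import Tuple


def parse_downsample(text: str) -> Tuple[int, int, int]:
    """Parse a 'T,H,W' stride string such as "4,4,4" into three positive ints."""
    parts = tuple(int(p) for p in text.split(","))
    if len(parts) != 3 or any(p <= 0 for p in parts):
        raise ValueError("expected three positive integers, e.g. '4,4,4'")
    return parts[0], parts[1], parts[2]


def latent_grid(sequence_length: int, resolution: int,
                downsample: Tuple[int, int, int]) -> Tuple[int, int, int]:
    """Latent (T, H, W) size, assuming each axis is divided by its stride."""
    t, h, w = downsample
    return sequence_length // t, resolution // h, resolution // w


# Values taken from the example flags above.
grid = latent_grid(sequence_length=16, resolution=128,
                   downsample=parse_downsample("4,4,4"))
print(grid)                          # (4, 32, 32)
print(grid[0] * grid[1] * grid[2])   # 4096 latent positions per clip
```

With `--n_codes 2048`, each of those latent positions would hold one codebook index in the range 0 to 2047.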

Reconstructing

python examples/rec_video.py --video-path "assets/origin_video_0.mp4" --rec-path "rec_video_0.mp4" --num-frames 500 --sample-rate 1

python examples/rec_video.py --video-path "assets/origin_video_1.mp4" --rec-path "rec_video_1.mp4" --resolution 196 --num-frames 600 --sample-rate 1

We present four reconstructed videos in this demonstration, arranged from left to right as follows:

3s 596x336	10s 256x256	18s 196x196	24s 168x96

Others

Please refer to the document VQVAE.

VideoDiT (DiT)

Training

sh scripts/train.sh

Sampling

sh scripts/sample.sh

🤝 How to Contribute to the Open-Sora Plan Community

We greatly appreciate your contributions to the Open-Sora Plan open-source community and your help in making it even better!

For more details, please refer to the Contribution Guidelines.

👍 Acknowledgement

Latte: The main codebase we built upon, and a wonderful video generation model.
DiT: Scalable Diffusion Models with Transformers.
VideoGPT: Video Generation using VQ-VAE and Transformers.
FiT: Flexible Vision Transformer for Diffusion Model.
Positional Interpolation: Extending Context Window of Large Language Models via Positional Interpolation.

🔒 License

The service is a research preview intended for non-commercial use only. See LICENSE for details.
