Open-Sora Plan

[Project Page] [Chinese Homepage]


💪 Goal

This project aims to create a simple and scalable repo to reproduce Sora (OpenAI, but we prefer to call it "ClosedAI") and build knowledge about Video-VQVAE (VideoGPT) + DiT at scale. However, since we have limited resources, we sincerely hope the whole open-source community can contribute to this project. Pull requests are welcome!

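This project hopes to reproduce Sora through the power of the open-source community. It was jointly initiated by the PKU-Rabbitpre AIGC Joint Lab. With limited resources, we have so far only built the basic framework and cannot run full training. We hope the open-source community will help us gradually add modules and raise resources for training. The current version is still far from the goal and needs continued improvement and rapid iteration. Pull requests are welcome!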

Project stages:

  • Primary
  1. Set up the codebase and train an unconditional model on a landscape dataset.
  2. Train models that boost resolution and duration.
  • Extensions
  1. Conduct text2video experiments on a landscape dataset.
  2. Train the 1080p model on video2text dataset.
  3. Control the model with more conditions.

📰 News

[2024.03.10] 🚀🚀🚀 This repo supports training with a latent size of 225×90×90 (t×h×w), which means we can train on 1 minute of 1080p video at 30 FPS (with 2× frame interpolation and 2× super resolution) under class conditioning.

[2024.03.08] We support text-conditioned training with 16 frames at 512x512. The code is mainly borrowed from Latte.

[2024.03.07] We support training with 128 frames of 256x256 (about 13 seconds at a sample rate of 3), or 64 frames of 512x512 (about 6 seconds).

[2024.03.05] See our latest todo; pull requests are welcome.

[2024.03.04] We reorganized and modularized our code to make it easier to contribute to the project. To contribute, please see the Repo structure.

[2024.03.03] We opened some discussions to clarify several issues.

[2024.03.01] Training code is available now! Learn more on our project page. Please feel free to watch 👀 this repository for the latest updates.
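
The durations quoted in the news items above follow from simple frame arithmetic. The sketch below is a rough check rather than code from this repo. It assumes a 30 FPS source video throughout, a sample rate of 3 for both the 128-frame and 64-frame cases, and, for the 225-frame latent, a temporal downsampling stride of 4 in the autoencoder (an assumption the news item does not state). The function names are hypothetical.

```python
def clip_seconds(num_frames, sample_rate, fps=30.0):
    """Seconds of source video covered when keeping one frame every `sample_rate` frames."""
    return num_frames * sample_rate / fps


def latent_seconds(latent_frames, temporal_stride=4, interpolation=2, fps=30.0):
    """Seconds of output video for a latent clip.

    temporal_stride: assumed autoencoder downsampling along time.
    interpolation:   frame interpolation factor applied after decoding.
    """
    return latent_frames * temporal_stride * interpolation / fps


print(clip_seconds(128, 3))   # 12.8 -> "about 13 seconds"
print(clip_seconds(64, 3))    # 6.4  -> "about 6 seconds"
print(latent_seconds(225))    # 60.0 -> 1 minute
```

Only the temporal axis is checked here. How a 90×90 latent maps to 1080p depends on the autoencoder and patch settings, which the news item does not spell out.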

✊ Todo

Set up the codebase and train an unconditional model on a landscape dataset

  • Fix typos & Update readme. 🤝 Thanks to @mio2333, @CreamyLong, @chg0901, @Nyx-177, @HowardLi1984, @sennnnn, @Jason-fan20
  • Set up the environment. 🤝 Thanks to @nameless1117
  • Add docker file. ⌛ [WIP] 🤝 Thanks to @Mon-ius, @SimonLeeGit
  • Enable type hints for functions. @RuslanPeresy, 🙏 [Need your contribution]
  • Resume from checkpoint.
  • Add Video-VQGAN model, which is borrowed from VideoGPT.
  • Support training on DiT with variable aspect ratios, resolutions, and durations.
  • Support Dynamic mask input inspired by FiT.
  • Add class-conditioning on embeddings.
  • Incorporate Latte as the main codebase.
  • Add VAE model, which is borrowed from Stable Diffusion.
  • Joint dynamic mask input with VAE.
  • Add VQVAE from VQGAN. 🙏 [Need your contribution]
  • Make the codebase ready for the cluster training. Add SLURM scripts. 🙏 [Need your contribution]
  • Refactor VideoGPT. 🤝 Thanks to @qqingzheng, @luo3300612, @sennnnn
  • Add sampling script.
  • Add DDP sampling script. ⌛ [WIP]
  • Incorporate SiT. 🤝 Thanks to @khan-yin
  • Add evaluation scripts (FVD, CLIP score). 🤝 Thanks to @rain305f

Train models that boost resolution and duration

  • Add PI to support out-of-domain size. 🤝 Thanks to @jpthu17
  • Add 2D RoPE to improve generalization ability as FiT. 🤝 Thanks to @jpthu17
  • Compress KV according to PixArt-sigma. ⌛ [WIP]
  • Train a low-dimensional Video-AE, whether it is VAE or VQVAE. ⌛ [WIP] 🚀 [Require more computation]
  • Extract offline feature.
  • Train with offline feature.
  • Add frame interpolation model. 🤝 Thanks to @yunyangge
  • Add super resolution model. 🤝 Thanks to @Linzy19
  • Add accelerate to automatically manage training.
  • Joint training with images. 🙏 [Need your contribution]
  • Incorporate NaViT. 🙏 [Need your contribution]
  • Add FreeNoise support for training-free longer video generation. 🙏 [Need your contribution]

Conduct text2video experiments on a landscape dataset

  • Finish data loading, pre-processing utils.
  • Add T5 support.
  • Add CLIP support. 🙏 [Need your contribution]
  • Add text2image training script.
  • Add prompt captioner.
    • Collect training data.
      • Need video-text pairs with poor captions. 🙏 [Need your contribution]
      • Extract multi-frame descriptions by large image-language models. 🤝 Thanks to @HowardLi1984
      • Extract video description by large video-language models. 🙏 [Need your contribution]
      • Integrate captions to get a dense caption by using a large language model, such as GPT-4. 🤝 Thanks to @HowardLi1984
    • Train a captioner to refine captions. 🚀 [Require more computation]

Train the 1080p model on video2text dataset

  • Looking for a suitable dataset, welcome to discuss and recommend. 🙏 [Need your contribution]
  • Add synthetic video created by game engines or 3D representations. 🙏 [Need your contribution]
  • Finish data loading, and pre-processing utils. ⌛ [WIP]
  • Support memory friendly training.
    • Add flash-attention2 from pytorch.
    • Add xformers. 🤝 Thanks to @jialin-zhao
    • Support mixed precision training.
    • Add gradient checkpoint.
    • Support for ReBased and Ring attention. 🤝 Thanks to @kabachuha
    • Train using the deepspeed engine. 🤝 Thanks to @sennnnn
    • Integrate with Colossal-AI for cheaper, faster, and more efficient training. 🙏 [Need your contribution]
  • Train with a text condition. Here we could conduct different experiments:
    • Train with T5 conditioning. 🚀 [Require more computation]
    • Train with CLIP conditioning. 🚀 [Require more computation]
    • Train with CLIP + T5 conditioning (probably costly during training and experiments). 🚀 [Require more computation]

Control the model with more conditions

  • Load pretrained weights from PixArt-α. ⌛ [WIP]
  • Incorporate ControlNet. 🙏 [Need your contribution]

📂 Repo structure (WIP)

├── README.md
├── docs
│   ├── Data.md                    -> Datasets description.
│   ├── Contribution_Guidelines.md -> Contribution guidelines description.
├── scripts                        -> All scripts.
├── opensora
│   ├── dataset
│   ├── models
│   │   ├── ae                     -> Compress videos to latents
│   │   │   ├── imagebase
│   │   │   │   ├── vae
│   │   │   │   └── vqvae
│   │   │   └── videobase
│   │   │       ├── vae
│   │   │       └── vqvae
│   │   ├── captioner
│   │   ├── diffusion              -> Denoise latents
│   │   │   ├── diffusion         
│   │   │   ├── dit
│   │   │   ├── latte
│   │   │   └── unet
│   │   ├── frame_interpolation
│   │   └── super_resolution
│   ├── sample
│   ├── train                      -> Training code
│   └── utils

🛠️ Requirements and Installation

  1. Clone this repository and navigate to the Open-Sora-Plan folder
git clone https://github.com/PKU-YuanGroup/Open-Sora-Plan
cd Open-Sora-Plan
  2. Install required packages
conda create -n opensora python=3.8 -y
conda activate opensora
pip install -e .
  3. Install additional packages for training
pip install -e ".[train]"
pip install flash-attn --no-build-isolation
  4. Install optional requirements, such as those for static type checking
pip install -e '.[dev]'
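
After the installation steps above, a quick check like the following can confirm that the expected modules are visible in the active environment. This is a minimal sketch, not a script shipped with the repo. The module names are assumptions: `opensora` is the package directory shown in the repo structure, `torch` is assumed to be a dependency, and `flash_attn` is the import name of the `flash-attn` package. `missing_modules` is a hypothetical helper.

```python
import importlib.util
import sys

# Assumed module names; adjust to match what the install actually provides.
REQUIRED = ["opensora", "torch"]
OPTIONAL = ["flash_attn"]  # only needed for the training extras


def missing_modules(names):
    """Return the top-level module names that cannot be found in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


print("Python", sys.version.split()[0])
print("missing required:", missing_modules(REQUIRED) or "none")
print("missing optional:", missing_modules(OPTIONAL) or "none")
```

`find_spec` only locates a module without importing it, so the check stays fast and does not initialize any GPU libraries.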

🗝️ Usage

Datasets

Refer to Data.md

Evaluation

Refer to the document EVAL.md.

Video-VQVAE (VideoGPT)

Training

To train VQVAE, run the script:

scripts/train_vqvae.sh

You can modify the training parameters within the script. For the general training parameters, please refer to transformers.TrainingArguments. The other parameters are explained below:

VQ-VAE Specific Settings
  • --embedding_dim: number of dimensions for the codebook embeddings
  • --n_codes 2048: number of codes in the codebook
  • --n_hiddens 240: number of hidden features in the residual blocks
  • --n_res_layers 4: number of residual blocks
  • --downsample "4,4,4": T H W downsampling stride of the encoder
Dataset Settings
  • --data_path <path>: path to an hdf5 file or a folder containing train and test folders with subdirectories of videos
  • --resolution 128: spatial resolution to train on
  • --sequence_length 16: temporal resolution, or video clip length
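
To see how the settings above interact, the following sketch computes the latent grid the encoder would produce for a clip. It assumes each input dimension is simply divided by its stride in `--downsample`, which is the usual behaviour for strided encoders but is not confirmed here for this codebase. `latent_shape` is a hypothetical helper.

```python
def latent_shape(sequence_length, resolution, downsample="4,4,4"):
    """Latent (T, H, W) grid, assuming each dimension is divided by its stride."""
    strides = tuple(int(s) for s in downsample.split(","))
    if len(strides) != 3:
        raise ValueError("downsample must have three comma-separated strides: T,H,W")
    dims = (sequence_length, resolution, resolution)
    for dim, stride in zip(dims, strides):
        if dim % stride != 0:
            raise ValueError(f"{dim} is not divisible by stride {stride}")
    return tuple(dim // stride for dim, stride in zip(dims, strides))


shape = latent_shape(sequence_length=16, resolution=128, downsample="4,4,4")
num_tokens = shape[0] * shape[1] * shape[2]
print(shape)        # (4, 32, 32)
print(num_tokens)   # 4096 codebook indices per clip
```

With `--n_codes 2048`, each of those positions would hold one index into a 2048-entry codebook.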

Reconstructing

python examples/rec_video.py --video-path "assets/origin_video_0.mp4" --rec-path "rec_video_0.mp4" --num-frames 500 --sample-rate 1
python examples/rec_video.py --video-path "assets/origin_video_1.mp4" --rec-path "rec_video_1.mp4" --resolution 196 --num-frames 600 --sample-rate 1

We present four reconstructed videos in this demonstration, arranged from left to right as follows:

3s 596x336 | 10s 256x256 | 18s 196x196 | 24s 168x96
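
When reconstructing several videos, the commands shown above can be generated from a small job list instead of being typed by hand. This is a sketch under assumptions: it only uses the flags that appear in the commands above, and `rec_video_command` and `jobs` are hypothetical names. It prints the commands rather than running them.

```python
import shlex


def rec_video_command(video_path, rec_path, num_frames, sample_rate=1, resolution=None):
    """Build the argument list for one examples/rec_video.py call."""
    cmd = ["python", "examples/rec_video.py",
           "--video-path", video_path,
           "--rec-path", rec_path]
    if resolution is not None:
        # Only pass --resolution when it is explicitly requested.
        cmd += ["--resolution", str(resolution)]
    cmd += ["--num-frames", str(num_frames), "--sample-rate", str(sample_rate)]
    return cmd


jobs = [
    dict(video_path="assets/origin_video_0.mp4", rec_path="rec_video_0.mp4", num_frames=500),
    dict(video_path="assets/origin_video_1.mp4", rec_path="rec_video_1.mp4",
         num_frames=600, resolution=196),
]

commands = [rec_video_command(**job) for job in jobs]
for cmd in commands:
    # To execute instead of print, pass cmd to subprocess.run(cmd, check=True).
    print(shlex.join(cmd))
```

Building each command as a list avoids shell quoting problems if a video path contains spaces.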

Others

Please refer to the document VQVAE.

VideoDiT (DiT)

Training

sh scripts/train.sh

Sampling

sh scripts/sample.sh

🤝 How to Contribute to the Open-Sora Plan Community

We greatly appreciate your contributions to the Open-Sora Plan open-source community and your help in making it even better than it is now!

For more details, please refer to the Contribution Guidelines.

👍 Acknowledgement

  • Latte: The main codebase we built upon; it is a wonderful video generation model.
  • DiT: Scalable Diffusion Models with Transformers.
  • VideoGPT: Video Generation using VQ-VAE and Transformers.
  • FiT: Flexible Vision Transformer for Diffusion Model.
  • Positional Interpolation: Extending Context Window of Large Language Models via Positional Interpolation.

🔒 License

  • The service is a research preview intended for non-commercial use only. See LICENSE for details.