This project aims to reproduce Sora (OpenAI's T2V model), but we have only limited resources. We deeply hope the whole open-source community can contribute to this project.

LJQCN101/Open-Sora-Plan

Open-Sora Plan

[Project Page] [Chinese Homepage]

💪 Goal

This project aims to create a simple and scalable repo to reproduce Sora (OpenAI, but we prefer to call it "ClosedAI") and to build knowledge about Video-VQVAE (VideoGPT) + DiT at scale. However, since we have limited resources, we deeply hope the whole open-source community can contribute to this project. Pull requests are welcome!

This project hopes to reproduce Sora through the power of the open-source community. It was jointly initiated by the Peking University-Tuzhan AIGC Joint Lab. With our currently limited resources we have only built the basic architecture and cannot run full training. We hope to add modules step by step and raise training resources through the open-source community. The current version is still far from the goal and needs continuous improvement and fast iteration. Pull requests are welcome!

Project stages:

  • Primary
  1. Set up the codebase and train an unconditional model on a landscape dataset.
  2. Train models that boost resolution and duration.
  • Extensions
  1. Conduct text2video experiments on landscape dataset.
  2. Train the 1080p model on video2text dataset.
  3. Control model with more conditions.

📰 News

[2024.03.07] We support training with 128 frames (when sample rate = 3, which is about 13 seconds) of 256x256, or 64 frames (which is about 6 seconds) of 512x512.
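
The durations quoted above follow from simple arithmetic: with a sample rate of r, a clip of n training frames spans roughly n × r frames of the source video. Below is a minimal sketch, assuming a source frame rate of 30 fps and the same sample rate of 3 for both settings; neither is stated in this README, and `clip_seconds` is a hypothetical helper, not part of the repo.

```python
def clip_seconds(num_frames: int, sample_rate: int, source_fps: float = 30.0) -> float:
    """Approximate span, in seconds, of a clip that keeps every `sample_rate`-th frame."""
    # Assumption: source videos run at `source_fps` (30 fps is a guess, not from the repo).
    return num_frames * sample_rate / source_fps


print(clip_seconds(128, 3))  # 128 frames at 256x256
print(clip_seconds(64, 3))   # 64 frames at 512x512
```

Under these assumptions the two settings come out at about 12.8 and 6.4 seconds, in line with the "about 13 seconds" and "about 6 seconds" figures.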

[2024.03.05] See our latest todo, pull requests are welcome.

[2024.03.04] We reorganized and modularized our code to make it easier to contribute to the project. To contribute, please see the Repo structure.

[2024.03.03] We opened some discussions to clarify several issues.

[2024.03.01] Training code is available now! Learn more on our project page. Please feel free to watch 👀 this repository for the latest updates.

✊ Todo

Set up the codebase and train an unconditional model on a landscape dataset

  • Fix typos & update README. 🤝 Thanks to @mio2333, @CreamyLong, @chg0901
  • Set up repo structure.
  • Add Dockerfile. ⌛ [WIP]
  • Enable type hints for functions. 🙏 [Need your contribution]
  • Add Video-VQGAN model, which is borrowed from VideoGPT.
  • Support training with variable aspect ratios, resolutions, and durations on DiT.
  • Support dynamic mask input inspired by FiT.
  • Add class-conditioning on embeddings.
  • Incorporate Latte as the main codebase.
  • Add VAE model, which is borrowed from Stable Diffusion.
  • Joint dynamic mask input with VAE.
  • Add VQVAE from VQGAN. 🙏 [Need your contribution]
  • Make the codebase ready for cluster training. Add SLURM scripts. 🙏 [Need your contribution]
  • Refactor VideoGPT. 🤝 Thanks to @qqingzheng, @luo3300612
  • Add sampling script.
  • Incorporate SiT. 🤝 Thanks to @khan-yin
  • Add evaluation scripts (FVD, CLIP score). 🙏 [Need your contribution]

Train models that boost resolution and duration

  • Add PI (Positional Interpolation) to support out-of-domain sizes. 🙏 [Need your contribution]
  • Add 2D RoPE to improve generalization ability, as in FiT. 🙏 [Need your contribution]
  • Extract offline features.
  • Add frame interpolation model. 🤝 Thanks to @yunyangge
  • Add super resolution model. 🤝 Thanks to @Linzy19
  • Add accelerate to automatically manage training.
  • Joint training with images. 🙏 [Need your contribution]
  • Incorporate NaViT. 🙏 [Need your contribution]
  • Add FreeNoise support for training-free longer video generation. 🙏 [Need your contribution]

Conduct text2video experiments on landscape dataset.

  • Finish data loading and pre-processing utils. ⌛ [WIP]
  • Add CLIP and T5 support. ⌛ [WIP]
  • Add text2image training script. ⌛ [WIP]
  • Add prompt captioner. 🙏 [Need your contribution] 🚀 [Require more computation]

Train the 1080p model on video2text dataset

  • Looking for a suitable dataset; discussion and recommendations are welcome. 🙏 [Need your contribution]
  • Finish data loading and pre-processing utils. ⌛ [WIP]
  • Support memory-friendly training.
    • Add flash-attention2 from PyTorch.
    • Add xformers.
    • Support mixed precision training.
    • Add gradient checkpointing.
    • Support for ReBased and Ring attention. 🤝 Thanks to @kabachuha
    • Train using the DeepSpeed engine. 🙏 [Need your contribution]
    • Integrate with Colossal-AI for cheaper, faster, and more efficient training. 🙏 [Need your contribution]
  • Train with a text condition. Here we could conduct different experiments:
    • Train with T5 conditioning. 🚀 [Require more computation]
    • Train with CLIP conditioning. 🚀 [Require more computation]
    • Train with CLIP + T5 conditioning (probably costly during training and experiments). 🚀 [Require more computation]

Control model with more conditions

  • Load pretrained weights from PixArt-α. ⌛ [WIP]
  • Incorporate ControlNet. 🙏 [Need your contribution]

📂 Repo structure (WIP)

├── README.md
├── docs
│   ├── Data.md                    -> Datasets description.
│   ├── Contribution_Guidelines.md -> Contribution guidelines description.
├── scripts                        -> All scripts.
├── opensora
│   ├── dataset
│   ├── models
│   │   ├── ae                     -> Compress videos to latents
│   │   │   ├── imagebase
│   │   │   │   ├── vae
│   │   │   │   └── vqvae
│   │   │   └── videobase
│   │   │       ├── vae
│   │   │       └── vqvae
│   │   ├── captioner
│   │   ├── diffusion              -> Denoise latents
│   │   │   ├── diffusion
│   │   │   ├── dit
│   │   │   ├── latte
│   │   │   └── unet
│   │   ├── frame_interpolation
│   │   └── super_resolution
│   ├── sample
│   ├── train                      -> Training code
│   └── utils

🛠️ Requirements and Installation

The requirements are as follows.

  • Python >= 3.8
  • CUDA Version >= 11.7
  • Install required packages:
git clone https://github.com/PKU-YuanGroup/Open-Sora-Plan
cd Open-Sora-Plan
conda create -n opensora python=3.8 -y
conda activate opensora
pip install -e .
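
Before installing, it can help to confirm that the active interpreter meets the Python >= 3.8 requirement listed above. The following is a minimal sketch; `python_ok` is a hypothetical helper, not part of the repo, and it does not check the CUDA version.

```python
import sys

MIN_PYTHON = (3, 8)  # minimum version taken from the requirements list above


def python_ok(version=sys.version_info, minimum=MIN_PYTHON) -> bool:
    """Return True if `version` (major, minor, ...) is at least `minimum`."""
    return tuple(version[:2]) >= minimum


if __name__ == "__main__":
    status = "OK" if python_ok() else "too old"
    print(f"Python {sys.version_info.major}.{sys.version_info.minor}: {status}")
```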

🏗️ Usage

Datasets

Refer to Data.md

Video-VQVAE (VideoGPT)

Training

To train VQVAE, run the script:

scripts/train_vqvae.sh

You can modify the training parameters within the script. For training parameters, please refer to transformers.TrainingArguments. Other parameters are explained as follows:

VQ-VAE Specific Settings
  • --embedding_dim: number of dimensions for codebook embeddings
  • --n_codes 2048: number of codes in the codebook
  • --n_hiddens 240: number of hidden features in the residual blocks
  • --n_res_layers 4: number of residual blocks
  • --downsample "4,4,4": T H W downsampling stride of the encoder
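
To make the roles of --n_codes and --embedding_dim concrete: a VQ-VAE codebook holds n_codes vectors of length embedding_dim, and each encoder output is replaced by the index of its nearest code. Below is a toy, pure-Python sketch of that lookup using squared Euclidean distance. The tiny codebook and the helper `nearest_code` are illustrative assumptions, not the repo's implementation, which operates on tensors.

```python
from typing import List, Sequence


def nearest_code(vector: Sequence[float], codebook: List[Sequence[float]]) -> int:
    """Return the index of the codebook entry closest to `vector`."""

    def sq_dist(a: Sequence[float], b: Sequence[float]) -> float:
        # Squared Euclidean distance between two equal-length vectors.
        return sum((x - y) ** 2 for x, y in zip(a, b))

    return min(range(len(codebook)), key=lambda i: sq_dist(vector, codebook[i]))


# Toy codebook: n_codes = 3, embedding_dim = 2 (real settings are far larger).
codebook = [[0.0, 0.0], [1.0, 1.0], [-1.0, 2.0]]
print(nearest_code([0.9, 1.2], codebook))  # index of the closest code
```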
Dataset Settings
  • --data_path <path>: path to an hdf5 file or a folder containing train and test folders with subdirectories of videos
  • --resolution 128: spatial resolution to train on
  • --sequence_length 16: temporal resolution, or video clip length
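
The settings above jointly determine the size of the latent grid the encoder produces. Here is a minimal sketch, assuming each axis shrinks by exactly its stride (floor division) and that clips are square; the actual encoder's padding may give slightly different sizes. `latent_shape` is a hypothetical helper, not part of the repo.

```python
from typing import Tuple


def latent_shape(sequence_length: int, resolution: int, downsample: str) -> Tuple[int, int, int]:
    """Compute the (T, H, W) latent grid for a square clip and a "T,H,W" stride string."""
    # Parse the stride string in the same "T,H,W" order used by --downsample.
    t_stride, h_stride, w_stride = (int(s) for s in downsample.split(","))
    return (sequence_length // t_stride, resolution // h_stride, resolution // w_stride)


# Defaults listed above: 16 frames at 128x128 with stride 4 on every axis.
print(latent_shape(16, 128, "4,4,4"))  # (4, 32, 32)
```

With the listed defaults, a 16 x 128 x 128 clip maps to a 4 x 32 x 32 grid, i.e. 4096 codebook indices per clip.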

Reconstructing

python examples/rec_video.py --video-path "assets/origin_video_0.mp4" --rec-path "rec_video_0.mp4" --num-frames 500 --sample-rate 1
python examples/rec_video.py --video-path "assets/origin_video_1.mp4" --rec-path "rec_video_1.mp4" --resolution 196 --num-frames 600 --sample-rate 1
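
The --num-frames and --sample-rate flags together control which frames of the input are reconstructed. Below is a minimal sketch of one plausible reading, assuming the script starts at frame 0, keeps every sample-rate-th frame, and stops at the end of the video. This is an assumption about the behavior, not taken from examples/rec_video.py, and `sampled_indices` is a hypothetical helper.

```python
from typing import List


def sampled_indices(total_frames: int, num_frames: int, sample_rate: int) -> List[int]:
    """Indices of the frames kept when taking every `sample_rate`-th frame from frame 0."""
    # Never read past the end of the video, even if num_frames * sample_rate exceeds it.
    span = min(num_frames * sample_rate, total_frames)
    return list(range(0, span, sample_rate))


# A 720-frame video with --num-frames 500 --sample-rate 1 keeps frames 0..499.
print(len(sampled_indices(720, 500, 1)))
# A short 10-frame video with sample rate 3 keeps fewer frames than requested.
print(sampled_indices(10, 4, 3))
```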

We present four reconstructed videos in this demonstration, arranged from left to right as follows:

  • 3s 596x336
  • 10s 256x256
  • 18s 196x196
  • 24s 168x96

Others

Please refer to the document VQVAE.

VideoDiT (DiT)

Training

sh scripts/train.sh

Sampling

sh scripts/sample.sh

🤝 How to Contribute to the Open-Sora Plan Community

We greatly appreciate your contributions to the Open-Sora Plan open-source community and your help in making it even better than it is now!

For more details, please refer to the Contribution Guidelines

👍 Acknowledgement

  • Latte: The main codebase we built upon; it is a wonderful video generation model.
  • DiT: Scalable Diffusion Models with Transformers.
  • VideoGPT: Video Generation using VQ-VAE and Transformers.
  • FiT: Flexible Vision Transformer for Diffusion Model.
  • Positional Interpolation: Extending Context Window of Large Language Models via Positional Interpolation.

🔒 License

  • The service is a research preview intended for non-commercial use only. See LICENSE for details.
