This project aims to create a simple and scalable repo to reproduce Sora (OpenAI, but we prefer to call it "ClosedAI") and to build knowledge about Video-VQVAE (VideoGPT) + DiT at scale. However, since our resources are limited, we sincerely hope the entire open-source community will contribute to this project. Pull requests are welcome!!!
This project hopes to reproduce Sora through the power of the open-source community. It is co-initiated by the PKU-Rabbitpre AIGC Joint Lab. With our currently limited resources we have only built the basic architecture and cannot yet run full training; we hope to gradually add modules and raise the resources needed for training through the open-source community. The current version is still far from the goal and needs continuous improvement and rapid iteration. Pull requests are welcome!!!
Project stages:
- Primary
  - Set up the codebase and train an unconditional model on a landscape dataset.
  - Train models that boost resolution and duration.
- Extensions
  - Conduct text2video experiments on a landscape dataset.
  - Train the 1080p model on a video2text dataset.
  - Control the model with more conditions.
[2024.03.07] We support training with 128 frames of 256x256 (about 13 seconds at sample rate 3) or 64 frames of 512x512 (about 6 seconds).
[2024.03.05] See our latest todo; pull requests are welcome.
[2024.03.04] We re-organized and modularized our code to make it easier to contribute to the project. To contribute, please see the Repo structure.
[2024.03.03] We opened some discussions to clarify several issues.
[2024.03.01] Training code is available now! Learn more on our project page. Please feel free to watch this repository for the latest updates.
- Fix typos & Update readme. Thanks to @mio2333, @CreamyLong, @chg0901
- Setup repo-structure.
- Add docker file. [WIP]
- Enable type hints for functions. [Need your contribution]
- Add Video-VQGAN model, which is borrowed from VideoGPT.
- Support variable aspect ratios, resolutions, durations training on DiT.
- Support Dynamic mask input inspired by FiT.
- Add class-conditioning on embeddings.
- Incorporate Latte as the main codebase.
- Add VAE model, which is borrowed from Stable Diffusion.
- Joint dynamic mask input with VAE.
- Add VQVAE from VQGAN. [Need your contribution]
- Make the codebase ready for cluster training. Add SLURM scripts. [Need your contribution]
- Refactor VideoGPT. Thanks to @qqingzheng, @luo3300612
- Add sampling script.
- Incorporate SiT. Thanks to @khan-yin
- Add evaluation scripts (FVD, CLIP score). [Need your contribution]
- Add positional interpolation (PI) to support out-of-domain sizes. [Need your contribution]
- Add 2D RoPE to improve generalization ability, as in FiT. [Need your contribution]
- Extract offline feature.
- Add frame interpolation model. Thanks to @yunyangge
- Add super resolution model. Thanks to @Linzy19
- Add accelerate to automatically manage training.
- Joint training with images. [Need your contribution]
- Incorporate NaViT. [Need your contribution]
- Add FreeNoise support for training-free longer video generation. [Need your contribution]
- Finish data loading and pre-processing utils. [WIP]
- Add CLIP and T5 support. [WIP]
- Add text2image training script. [WIP]
- Add prompt captioner. [Need your contribution] [Require more computation]
- Looking for a suitable dataset; discussion and recommendations are welcome. [Need your contribution]
- Finish data loading and pre-processing utils. [WIP]
- Support memory friendly training.
- Add flash-attention2 from pytorch.
- Add xformers.
- Support mixed precision training.
- Add gradient checkpoint.
- Support for ReBased and Ring attention. Thanks to @kabachuha
- Train using the DeepSpeed engine. [Need your contribution]
- Integrate with Colossal-AI for cheaper, faster, and more efficient training. [Need your contribution]
- Train with a text condition (see the sketch after this list). Here we could conduct different experiments:
  - Train with T5 conditioning. [Require more computation]
  - Train with CLIP conditioning. [Require more computation]
  - Train with CLIP + T5 conditioning (probably costly during training and experiments). [Require more computation]
- Load pretrained weights from PixArt-α. [WIP]
- Incorporate ControlNet. [Need your contribution]
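As a rough illustration of the T5/CLIP conditioning experiments listed above, here is a minimal sketch using Hugging Face transformers. The checkpoint names and the way the embeddings would be injected into the DiT/Latte blocks are assumptions for illustration, not the project's actual implementation.

```python
# Hedged sketch of text conditioning; checkpoint names are illustrative choices.
import torch
from transformers import AutoTokenizer, T5EncoderModel, CLIPTextModel

prompt = "a drone shot of waves crashing against rocky cliffs"

# T5 conditioning: per-token encoder states, typically fed to the model via cross-attention.
t5_tok = AutoTokenizer.from_pretrained("google/flan-t5-xl")
t5_enc = T5EncoderModel.from_pretrained("google/flan-t5-xl").eval()
with torch.no_grad():
    t5_emb = t5_enc(**t5_tok(prompt, return_tensors="pt")).last_hidden_state
print(t5_emb.shape)  # (1, num_tokens, 2048)

# CLIP conditioning: per-token states or the pooled sentence embedding.
clip_tok = AutoTokenizer.from_pretrained("openai/clip-vit-large-patch14")
clip_enc = CLIPTextModel.from_pretrained("openai/clip-vit-large-patch14").eval()
with torch.no_grad():
    clip_emb = clip_enc(**clip_tok(prompt, return_tensors="pt")).last_hidden_state
print(clip_emb.shape)  # (1, num_tokens, 768)

# "CLIP + T5" would project both to a common width and concatenate along the token
# axis before cross-attention, which is why it is marked as costlier above.
```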
├── README.md
├── docs
│   ├── Data.md -> Datasets description.
│   ├── Contribution_Guidelines.md -> Contribution guidelines description.
├── scripts -> All scripts.
├── opensora
│   ├── dataset
│   ├── models
│   │   ├── ae -> Compress videos to latents
│   │   │   ├── imagebase
│   │   │   │   ├── vae
│   │   │   │   ├── vqvae
│   │   │   ├── videobase
│   │   │   │   ├── vae
│   │   │   │   ├── vqvae
│   │   ├── captioner
│   │   ├── diffusion -> Denoise latents
│   │   │   ├── diffusion
│   │   │   ├── dit
│   │   │   ├── latte
│   │   │   ├── unet
│   │   ├── frame_interpolation
│   │   ├── super_resolution
│   ├── sample
│   ├── train -> Training code
│   ├── utils
The requirements are as follows.
- Python >= 3.8
- CUDA Version >= 11.7
- Install required packages:
git clone https://github.com/PKU-YuanGroup/Open-Sora-Plan
cd Open-Sora-Plan
conda create -n opensora python=3.8 -y
conda activate opensora
pip install -e .
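A quick, generic sanity check that the environment is usable (this is plain PyTorch, not a script shipped with the repo):

```python
# Generic post-install check; not part of the repository.
import torch

print(torch.__version__)          # built against CUDA >= 11.7, as required above
print(torch.cuda.is_available())  # should print True on a GPU machine
if torch.cuda.is_available():
    print(torch.cuda.get_device_name(0))
```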
Refer to Data.md
To train VQVAE, run the script:
scripts/train_vqvae.sh
You can modify the training parameters within the script. For the standard training arguments, please refer to transformers.TrainingArguments. Other parameters are explained as follows:
- `--embedding_dim`: number of dimensions for codebook embeddings
- `--n_codes 2048`: number of codes in the codebook
- `--n_hiddens 240`: number of hidden features in the residual blocks
- `--n_res_layers 4`: number of residual blocks
- `--downsample "4,4,4"`: T H W downsampling stride of the encoder
- `--data_path <path>`: path to an `hdf5` file or a folder containing `train` and `test` folders with subdirectories of videos
- `--resolution 128`: spatial resolution to train on
- `--sequence_length 16`: temporal resolution, or video clip length
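To make these defaults concrete, the sketch below shows how the downsampling strides relate a training clip to its grid of latent codes. This is arithmetic inferred from the flags above, not code from the repo, so verify the shapes against the actual VQ-VAE implementation.

```python
# Illustrative arithmetic only; verify against the actual VQ-VAE code.
sequence_length, resolution = 16, 128    # --sequence_length, --resolution
t_stride, h_stride, w_stride = 4, 4, 4   # --downsample "4,4,4" (T, H, W)

latent_shape = (
    sequence_length // t_stride,  # temporal positions
    resolution // h_stride,       # latent height
    resolution // w_stride,       # latent width
)
print(latent_shape)  # (4, 32, 32): each position indexes one of the 2048 codes
                     # (--n_codes), each with --embedding_dim dimensions.
```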
To reconstruct videos with a trained model, run, for example:
python examples/rec_video.py --video-path "assets/origin_video_0.mp4" --rec-path "rec_video_0.mp4" --num-frames 500 --sample-rate 1
python examples/rec_video.py --video-path "assets/origin_video_1.mp4" --rec-path "rec_video_1.mp4" --resolution 196 --num-frames 600 --sample-rate 1
We present four reconstructed videos in this demonstration, arranged from left to right as follows:
| 3s 596x336 | 10s 256x256 | 18s 196x196 | 24s 168x96 |
|---|---|---|---|
Please refer to the document VQVAE.
sh scripts/train.sh
sh scripts/sample.sh
We greatly appreciate your contributions to the Open-Sora Plan open-source community and helping us make it even better than it is now!
For more details, please refer to the Contribution Guidelines.
- Latte: The main codebase we built upon; a wonderful video generation model.
- DiT: Scalable Diffusion Models with Transformers.
- VideoGPT: Video Generation using VQ-VAE and Transformers.
- FiT: Flexible Vision Transformer for Diffusion Model.
- Positional Interpolation: Extending Context Window of Large Language Models via Positional Interpolation.
- The service is a research preview intended for non-commercial use only. See LICENSE for details.