CoCoCo: Improving Text-Guided Video Inpainting for Better Consistency, Controllability and Compatibility
Bojia Zi1, Shihao Zhao2, Xianbiao Qi*5, Jianan Wang4, Yukai Shi3, Qianyu Chen1, Bin Liang1, Rong Xiao5, Kam-Fai Wong1, Lei Zhang4
* denotes the corresponding author.
This is the inference code for our paper CoCoCo.
Original | The ocean, the waves ... | The ocean, the waves ... |
Original | The river with ice ... | The river with ice ... |
Original | Meteor streaking in the sky ... | Meteor streaking in the sky ... |
- Consistent text-guided video inpainting
  - By using damped attention, we obtain decent inpainted visual content.
- Higher text controllability
  - We achieve better text controllability.
- Personalized video inpainting
  - We develop a training-free method that implements personalized video inpainting by leveraging personalized T2Is.
- Gradio Demo using SAM2
  - We use SAM2 to create Video Inpaint Anything.
- Infinite Video Inpainting
  - By using a sliding window, you can inpaint videos of any length.
- Controllable Video Inpainting
  - By composing with ControlNet, we find that we can inpaint controllable content in the given masked region.
- More inpainting tricks will be released soon...
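The sliding-window idea behind "Infinite Video Inpainting" can be illustrated with a small index generator. This is a minimal sketch and not the CoCoCo implementation: the function name and the `window` and `stride` parameters are hypothetical, and how overlapping windows are blended is not covered here.

```python
from typing import Iterator, List


def sliding_windows(num_frames: int, window: int, stride: int) -> Iterator[List[int]]:
    """Yield lists of frame indices that together cover every frame.

    `window` and `stride` are hypothetical parameters, not CoCoCo options.
    Consecutive windows overlap by (window - stride) frames.
    """
    if window <= 0 or not 0 < stride <= window:
        raise ValueError("need window > 0 and 0 < stride <= window")
    if num_frames <= 0:
        return
    start = 0
    while True:
        end = min(start + window, num_frames)
        # Shift the last window back so it keeps the full length when possible.
        begin = max(0, end - window)
        yield list(range(begin, end))
        if end == num_frames:
            break
        start += stride


if __name__ == "__main__":
    for chunk in sliding_windows(num_frames=10, window=4, stride=3):
        print(chunk)
```

Each chunk could then be inpainted on its own, with the shared frames used to keep neighbouring chunks consistent.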
Before installing the dependencies, check the following requirements to avoid installation failures.
- You have a GPU with at least 24 GB of GPU memory.
- Your CUDA toolkit (nvcc) version is greater than 12.0.
- Your PyTorch version is greater than 2.4.
- Your gcc version is greater than 9.4.
- Your diffusers version is 0.11.1.
- Your gradio version is 3.40.0.
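The Python package versions in this list can be checked before installing anything else. The script below is a minimal sketch that is not part of the repository: the helper names are illustrative, the PyTorch bound is treated as inclusive (an assumption), and the GPU memory, nvcc and gcc requirements are not checked.

```python
import re
from importlib import metadata
from typing import Optional, Tuple


def parse_version(text: str) -> Tuple[int, ...]:
    """Turn '2.4.1+cu121' into (2, 4, 1); non-numeric suffixes are ignored."""
    numbers = re.match(r"\d+(?:\.\d+)*", text.strip())
    if numbers is None:
        raise ValueError(f"cannot parse version: {text!r}")
    return tuple(int(part) for part in numbers.group(0).split("."))


def installed_version(package: str) -> Optional[str]:
    """Return the installed version string of a pip package, or None."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def satisfies(found: Optional[str], required: str, exact: bool) -> bool:
    """Compare versions; exact=True demands equality, otherwise found >= required."""
    if found is None:
        return False
    have, need = parse_version(found), parse_version(required)
    # Pad with zeros so that '2.4' and '2.4.0' compare as equal.
    width = max(len(have), len(need))
    have += (0,) * (width - len(have))
    need += (0,) * (width - len(need))
    return have == need if exact else have >= need


# pip package name -> (required version, must match exactly)
REQUIREMENTS = {
    "torch": ("2.4", False),
    "diffusers": ("0.11.1", True),
    "gradio": ("3.40.0", True),
}

if __name__ == "__main__":
    for package, (required, exact) in REQUIREMENTS.items():
        found = installed_version(package)
        status = "ok" if satisfies(found, required, exact) else "CHECK"
        relation = "==" if exact else ">="
        print(f"{package}: found {found}, need {relation} {required} -> {status}")
```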
Once your environment meets these requirements, install the dependencies with pip.
# Install the CoCoCo dependencies
pip3 install -r requirements.txt
# Compile the SAM2
pip3 install -e .
If everything goes well, you can move on to the next steps.
Note that our method requires the parameters of both SD1.5 inpainting and CoCoCo.
- The pretrained image inpainting model (Stable Diffusion Inpainting).
- The CoCoCo checkpoints.
- Warning: runwayml has deleted its models and weights, so the image inpainting model must be downloaded from another URL.
- After downloading, put the two models in separate folders. The image inpainting folder should contain scheduler, tokenizer, text_encoder, vae and unet; the CoCoCo folder should contain model_0.pth to model_3.pth.
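A quick way to confirm the folder layout before running inference is sketched below. It is not part of the repository: the function name is illustrative, and it assumes the CoCoCo checkpoints are named `model_0.pth` through `model_3.pth`.

```python
from pathlib import Path
from typing import List

# Subfolders expected inside the Stable Diffusion inpainting folder.
SD_SUBFOLDERS = ["scheduler", "tokenizer", "text_encoder", "vae", "unet"]
# Assumed checkpoint names inside the CoCoCo folder.
COCOCO_FILES = [f"model_{i}.pth" for i in range(4)]


def missing_entries(sd_folder: str, cococo_folder: str) -> List[str]:
    """Return the paths that are expected but not present on disk."""
    sd, cococo = Path(sd_folder), Path(cococo_folder)
    missing = [str(sd / name) for name in SD_SUBFOLDERS if not (sd / name).is_dir()]
    missing += [str(cococo / name) for name in COCOCO_FILES if not (cococo / name).is_file()]
    return missing


if __name__ == "__main__":
    # Example folder names taken from the inference command's comments.
    problems = missing_entries("./stable-diffusion-v1-5-inpainting", "./cococo_weights")
    print("layout looks complete" if not problems else "missing:\n" + "\n".join(problems))
```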
You can obtain masks with GroundingDINO or Track-Anything, or draw the masks yourself.
We release a Gradio demo that uses SAM2 to implement Video Inpainting Anything. Try our demo!
Running the following command produces the video inpainting results.
# --guidance_scale: the CFG scale; higher values give stronger text controllability
# --video_path: folder that stores the video and masks as images.npy and masks.npy
# --model_path: path to the CoCoCo weights, e.g. ./cococo_weights
# --pretrain_model_path: path to the pretrained Stable Diffusion inpainting model, e.g. ./stable-diffusion-v1-5-inpainting
# --sub_folder: subfolder of the pretrained inpainting model that holds the unet checkpoints
python3 valid_code_release.py --config ./configs/code_release.yaml \
  --prompt "Trees. Snow mountains. best quality." \
  --negative_prompt "worst quality. bad quality." \
  --guidance_scale 10 \
  --video_path ./images/ \
  --model_path [cococo_folder_name] \
  --pretrain_model_path [sd_folder_name] \
  --sub_folder unet
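The same call can also be assembled from Python, which leaves room for a comment next to each option. The sketch below only uses the flags listed in the command above; `build_command` and `OPTIONS` are illustrative names that do not exist in the repository, and the two folder paths are the example values from the comments.

```python
import shlex
import subprocess
import sys
from typing import Dict, List


def build_command(script: str, options: Dict[str, str]) -> List[str]:
    """Return an argv list: interpreter, script, then --key value pairs."""
    argv = [sys.executable, script]
    for key, value in options.items():
        argv += [f"--{key}", str(value)]
    return argv


OPTIONS = {
    "config": "./configs/code_release.yaml",
    "prompt": "Trees. Snow mountains. best quality.",
    "negative_prompt": "worst quality. bad quality.",
    "guidance_scale": "10",            # CFG scale
    "video_path": "./images/",         # holds images.npy and masks.npy
    "model_path": "./cococo_weights",  # example CoCoCo weights folder
    "pretrain_model_path": "./stable-diffusion-v1-5-inpainting",
    "sub_folder": "unet",
}

if __name__ == "__main__":
    command = build_command("valid_code_release.py", OPTIONS)
    print(shlex.join(command))  # show the exact command line
    if "--run" in sys.argv[1:]:
        # Only launch inference when explicitly asked to.
        subprocess.run(command, check=True)
```

Passing the arguments as a list avoids shell quoting problems with prompts that contain spaces or punctuation.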
We provide a way for users to compose their own personalized video inpainting model from personalized T2Is WITHOUT TRAINING. There are three steps in total:
- Convert the open-source model to PyTorch weights.
- Transform the personalized image diffusion model into a personalized inpainting diffusion model. Subtract the SD1.5 weights from the weights of the personalized image diffusion model, and add the difference to the inpainting model. Surprisingly, this yields a personalized image inpainting model, and it works well :)
- Add the weights of the personalized inpainting model to our CoCoCo.
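The weight arithmetic in the second step can be sketched on toy data. This is an illustration under stated assumptions, not the repository's conversion code: weights are modelled as plain dictionaries of flattened float lists, the function names and the `scale` parameter are hypothetical, and parameters whose sizes differ between models (for instance the first convolution of an inpainting UNet, which takes extra input channels) are simply left untouched.

```python
from typing import Dict, List

Weights = Dict[str, List[float]]  # parameter name -> flattened values


def task_vector(personalized: Weights, base: Weights) -> Weights:
    """Delta that the personalized model adds on top of the base model."""
    return {
        name: [p - b for p, b in zip(values, base[name])]
        for name, values in personalized.items()
        if name in base and len(base[name]) == len(values)
    }


def apply_task_vector(target: Weights, delta: Weights, scale: float = 1.0) -> Weights:
    """Add scale * delta to every parameter of target that has a matching delta."""
    merged: Weights = {}
    for name, values in target.items():
        update = delta.get(name)
        if update is None or len(update) != len(values):
            merged[name] = list(values)  # sizes differ: keep the target weights
        else:
            merged[name] = [v + scale * d for v, d in zip(values, update)]
    return merged


if __name__ == "__main__":
    sd15 = {"w": [1.0, 2.0], "conv_in": [0.0] * 4}
    personalized = {"w": [1.5, 2.5], "conv_in": [0.25] * 4}
    inpainting = {"w": [1.0, 3.0], "conv_in": [0.0] * 9}
    delta = task_vector(personalized, sd15)
    print(apply_task_vector(inpainting, delta)["w"])
```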
- For models that use different keys, use the following script to process the open-source T2I model.
For example, the keys of epiCRealism differ from those of Stable Diffusion:
model.diffusion_model.input_blocks.1.1.norm.bias
model.diffusion_model.input_blocks.1.1.norm.weight
Therefore, we developed a tool that converts this type of model to weight deltas.
# --tensor_path: the safetensor path
# --unet_path: path to the SD1.5 unet weights, e.g. stable-diffusion-v1-5-inpainting/unet/diffusion_pytorch_model.bin
# --text_encoder_path: the text encoder path, e.g. stable-diffusion-v1-5-inpainting/text_encoder/pytorch_model.bin
# --vae_path: the vae path, e.g. stable-diffusion-v1-5-inpainting/vae/diffusion_pytorch_model.bin
# --source_path, --target_path: the path where you put the preliminary files, e.g. ./resources
# --target_prefix: the converted filename prefix
cd task_vector
python3 convert.py \
  --tensor_path [safetensor_path] \
  --unet_path [unet_path] \
  --text_encoder_path [text_encoder_path] \
  --vae_path [vae_path] \
  --source_path ./resources \
  --target_path ./resources \
  --target_prefix [prefix]
- For models that use the same keys and were trained with LoRA.
For example, the Ghibli LoRA:
lora_unet_up_blocks_3_resnets_0_conv1.lora_down.weight
lora_unet_up_blocks_3_resnets_0_conv1.lora_up.weight
# --tensor_path: the safetensor path
# --unet_path: path to the SD1.5 unet weights, e.g. stable-diffusion-v1-5-inpainting/unet/diffusion_pytorch_model.bin
# --text_encoder_path: the text encoder path, e.g. stable-diffusion-v1-5-inpainting/text_encoder/pytorch_model.bin
# --vae_path: the vae path, e.g. stable-diffusion-v1-5-inpainting/vae/diffusion_pytorch_model.bin
# --regulation_path: keep the default ./lora.json; please don't change it
# --target_prefix: the converted filename prefix
python3 convert_lora.py \
  --tensor_path [tensor_path] \
  --unet_path [unet_path] \
  --text_encoder_path [text_encoder_path] \
  --vae_path [vae_path] \
  --regulation_path ./lora.json \
  --target_prefix [target_prefix]
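Which of the two scripts applies depends on the key names stored in the checkpoint. The helper below is a hypothetical convenience, not part of the repository: it only looks at the two key prefixes shown in the examples above and returns "unknown" for anything else.

```python
from typing import Iterable


def suggest_converter(keys: Iterable[str]) -> str:
    """Guess the conversion script from checkpoint key prefixes.

    Only the two prefixes quoted in this README are recognised; this is a
    heuristic sketch, not logic taken from convert.py or convert_lora.py.
    """
    key_list = list(keys)
    if any(key.startswith("lora_unet_") for key in key_list):
        return "convert_lora.py"
    if any(key.startswith("model.diffusion_model.") for key in key_list):
        return "convert.py"
    return "unknown"


if __name__ == "__main__":
    print(suggest_converter(["model.diffusion_model.input_blocks.1.1.norm.bias"]))
    print(suggest_converter(["lora_unet_up_blocks_3_resnets_0_conv1.lora_down.weight"]))
```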
- You can use a customized T2I or LoRA to create visual content in the masks.
# --guidance_scale: keep 10 as the default
# --video_path: folder that stores the video as images.npy
# --masks_path: folder that stores the masks as masks.npy
# --model_path: path to the CoCoCo weights
# --pretrain_model_path: path to SD1.5 Inpainting, e.g. ./stable-diffusion-v1-5-inpainting
# --sub_folder: subfolder of the pretrained inpainting model that holds the unet checkpoints
# --unet_lora_path, --text_lora_path, --vae_lora_path: the LoRA weights for the unet, text_encoder and vae
# --beta_unet, --beta_text, --beta_vae: the hyper-parameter $beta$ for the unet, text encoder and vae LoRA weights
# --unet_model_path: path to the SD1.5 unet weights, e.g. stable-diffusion-v1-5-inpainting/unet/diffusion_pytorch_model.bin
# --text_model_path: the text encoder path, e.g. stable-diffusion-v1-5-inpainting/text_encoder/pytorch_model.bin
# --vae_model_path: the vae path, e.g. stable-diffusion-v1-5-inpainting/vae/diffusion_pytorch_model.bin
python3 valid_code_release_with_T2I_LoRA.py \
  --config ./configs/code_release.yaml \
  --guidance_scale 10 \
  --video_path ./images \
  --masks_path ./images \
  --model_path [model_path] \
  --pretrain_model_path [pretrain_model_path] \
  --sub_folder unet \
  --unet_lora_path [unet_lora_path] \
  --beta_unet 0.75 \
  --text_lora_path [text_lora_path] \
  --beta_text 0.75 \
  --vae_lora_path [vae_lora_path] \
  --beta_vae 0.75 \
  --unet_model_path [unet_model_path] \
  --text_model_path [text_model_path] \
  --vae_model_path [vae_model_path] \
  --prompt [prompt] \
  --negative_prompt [negative_prompt]
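One common way to read a LoRA strength such as `beta` is as a scale on the low-rank update that gets added to a base weight matrix. The sketch below shows that arithmetic on tiny matrices. It is an assumption about what the hyper-parameter means, not code from this repository; the function names are illustrative and the usual alpha/rank scaling of LoRA is left out.

```python
from typing import List

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain matrix product of a (m x k) and b (k x n)."""
    inner = len(b)
    if any(len(row) != inner for row in a):
        raise ValueError("inner dimensions do not match")
    cols = len(b[0])
    return [[sum(row[k] * b[k][j] for k in range(inner)) for j in range(cols)] for row in a]


def merge_lora(weight: Matrix, up: Matrix, down: Matrix, beta: float) -> Matrix:
    """Return weight + beta * (up @ down), with up (out x rank) and down (rank x in)."""
    delta = matmul(up, down)
    if len(delta) != len(weight) or len(delta[0]) != len(weight[0]):
        raise ValueError("LoRA update does not match the weight shape")
    return [
        [w + beta * d for w, d in zip(w_row, d_row)]
        for w_row, d_row in zip(weight, delta)
    ]


if __name__ == "__main__":
    weight = [[1.0, 0.0], [0.0, 1.0]]
    up = [[1.0], [2.0]]   # stands in for a lora_up.weight tensor
    down = [[0.5, 0.0]]   # stands in for a lora_down.weight tensor
    print(merge_lora(weight, up, down, beta=0.75))
```

Under this reading, a smaller `beta` keeps the result closer to the original weights and a larger one applies more of the personalized style.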
- Try our demo with the original CoCoCo
# --model_path: path to the CoCoCo weights
# --pretrain_model_path: the pretrained image inpainting model path, e.g. ./stable-diffusion-v1-5-inpainting
# --sub_folder: set unet as the default
CUDA_VISIBLE_DEVICES=0,1 python3 app.py \
  --config ./configs/code_release.yaml \
  --model_path [model_path] \
  --pretrain_model_path [pretrain_model_path] \
  --sub_folder [sub_folder]
- Try our demo with LoRA and checkpoint
- Using our conversion code, we obtained some personalized image inpainting models and LoRAs, which you can download below:
- Run the Gradio demo with LoRA.
# --unet_lora_path, --text_lora_path, --vae_lora_path: the LoRA weights for the unet, text_encoder and vae
# --beta_unet, --beta_text, --beta_vae: the hyper-parameter $beta$ for the unet, text_encoder and vae LoRA weights
# --text_model_path: the text encoder path, e.g. stable-diffusion-v1-5-inpainting/text_encoder/pytorch_model.bin
# --unet_model_path: path to the SD1.5 unet weights, e.g. stable-diffusion-v1-5-inpainting/unet/diffusion_pytorch_model.bin
# --vae_model_path: the vae path, e.g. stable-diffusion-v1-5-inpainting/vae/diffusion_pytorch_model.bin
# --model_path: the CoCoCo weights
# --pretrain_model_path: the pretrained image inpainting model path, e.g. ./stable-diffusion-v1-5-inpainting
# --sub_folder: the default is unet
CUDA_VISIBLE_DEVICES=0,1 python3 app_with_T2I_LoRA.py \
  --config ./configs/code_release.yaml \
  --unet_lora_path [unet_lora_path] \
  --text_lora_path [text_lora_path] \
  --vae_lora_path [vae_lora_path] \
  --beta_unet 0.75 \
  --beta_text 0.75 \
  --beta_vae 0.75 \
  --text_model_path [text_model_path] \
  --unet_model_path [unet_model_path] \
  --vae_model_path [vae_model_path] \
  --model_path [model_path] \
  --pretrain_model_path [pretrain_model_path] \
  --sub_folder [sub_folder]
[1]. We will use a larger dataset with high-quality videos to produce a more powerful video inpainting model soon.
[2]. The training code is under preparation.
@article{Zi2024CoCoCo,
title={CoCoCo: Improving Text-Guided Video Inpainting for Better Consistency, Controllability and Compatibility},
author={Bojia Zi and Shihao Zhao and Xianbiao Qi and Jianan Wang and Yukai Shi and Qianyu Chen and Bin Liang and Kam-Fai Wong and Lei Zhang},
journal={ArXiv},
year={2024},
volume={abs/2403.12035},
url={https://arxiv.org/abs/2403.12035}
}
This code is based on AnimateDiff, Segment-Anything-2 and ProPainter.