Kaihua Chen*, Tarasha Khurana*, Deva Ramanan
This repository contains the official implementation of CogNVS.
- Release CogNVS inference pipeline and checkpoints
- Release self-supervised data generation code
- Release CogNVS test-time finetuning code
- Release evaluation code on Kubric-4D, ParallelDomain-4D, and Dycheck
- Train a better CogNVS inpainting checkpoint with more data, once more compute is available
Clone the repository and set up the environment:
git clone https://github.com/Kaihua-Chen/cog-nvs
cd cog-nvs
conda create --name cognvs python=3.11
conda activate cognvs
pip install -r cognvs_requirements.txt
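A quick, optional sanity check that PyTorch is installed and sees your GPU:
python -c "import torch; print(torch.__version__, torch.cuda.is_available())"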
- CogVideoX base model
Download the original CogVideoX-5b-I2V checkpoints from: https://huggingface.co/zai-org/CogVideoX-5b-I2V
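One way to fetch it, mirroring the git-lfs steps used for the CogNVS checkpoint below, so that it lands at checkpoints/CogVideoX-5b-I2V as expected by demo.py:
mkdir -p checkpoints
cd checkpoints
git lfs install
git clone https://huggingface.co/zai-org/CogVideoX-5b-I2V
cd ..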
- CogNVS inpainting checkpoint
We provide CogNVS inpainting checkpoints, which can be used for further test-time finetuning on your target sequences:
mkdir -p checkpoints
cd checkpoints
git lfs install
git clone https://huggingface.co/kaihuac/cognvs_ckpt_inpaint
cd ..
- (Optional) Test-time finetuned checkpoints
Please refer to Step 3 "Self-supervised Data Pair Generation" to generate training pairs and then follow Step 4 "Test-time Finetuning" to finetune our inpainting checkpoints on your target sequence.
We also provide checkpoints already finetuned on our demo_data. If you want to skip test-time finetuning, download them (~20GB each) from: Link
You can run inference in three ways:
- Use the CogNVS inpainting checkpoint directly (not recommended; only for a quick test, as quality is usually lower)
- Download and use our provided test-time finetuned checkpoints
- Perform your own test-time finetuning (following instructions in later sections) and run inference afterward
Example using a test-time finetuned checkpoint:
python demo.py \
--model_path "checkpoints/CogVideoX-5b-I2V" \
--cognvs_ckpt_path "checkpoints/cognvs_ckpt_finetuned_davis_bear/my_checkpoint-200_transformer" \
--data_path "demo_data/davis_bear" \
--mp4_name "example_eval_render.mp4"
Here, mp4_name is the name of the input video; it can also be a glob pattern such as eval_render*.mp4.
The output will be saved to:
demo_data/davis_bear/outputs/
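If you instead take the quick-test route with the base inpainting checkpoint, the call is the same except for --cognvs_ckpt_path. The exact weight folder inside the cloned cognvs_ckpt_inpaint repo is an assumption here, so adjust it to the actual layout:
python demo.py \
--model_path "checkpoints/CogVideoX-5b-I2V" \
--cognvs_ckpt_path "checkpoints/cognvs_ckpt_inpaint" \
--data_path "demo_data/davis_bear" \
--mp4_name "example_eval_render.mp4"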
- Sequence folder structure
sequence_name/
├─ gt_rgb.mp4
└─ cam_info/
   ├─ megasam_depth.npy
   ├─ megasam_intrinsics.npy (optional)
   └─ megasam_c2ws.npy (optional)
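A quick way to verify that a sequence folder matches this layout (a sketch using the demo_data/davis_bear paths; the printed depth shape depends on your depth estimator):
ls demo_data/davis_bear demo_data/davis_bear/cam_info
python -c "import numpy as np; print(np.load('demo_data/davis_bear/cam_info/megasam_depth.npy').shape)"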
- Generate training pairs
python data_gen.py \
--device "cuda:0" \
--data_path "demo_data/davis_bear" \
--mode "train" \
--intrinsics_file "cam_info/megasam_intrinsics.npy" \
--extrinsics_file "cam_info/megasam_c2ws.npy"
(intrinsics_file and extrinsics_file are optional; the pipeline still works if you only provide the depth file from MegaSAM, DepthCrafter, etc.)
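Following the note above, a minimal depth-only run would simply drop the two camera flags:
python data_gen.py \
--device "cuda:0" \
--data_path "demo_data/davis_bear" \
--mode "train"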
- Generate evaluation pairs
python data_gen.py \
--device "cuda:0" \
--data_path "demo_data/davis_bear" \
--mode "eval"
Evaluation renders will be created from predefined trajectories in the trajs/ folder. You can customize trajectories by editing those .txt files.
After generating training pairs, edit the config files and run test-time finetuning:
- Edit finetune/finetune_cognvs.sh (example values are sketched after this list):
  - model_path: path to the CogVideoX-5b-I2V checkpoint
  - transformer_id: path to our CogNVS inpainting checkpoint
  - output_dir: path to save the finetuned checkpoint
  - base_dir_input: sequence folder with the training pairs
  Optional parameters:
  - train_epochs: number of training epochs
  - checkpointing_steps: interval (in steps) at which checkpoints are saved
  - checkpointing_limit: maximum number of checkpoints to keep
  - do_validation: set to True to enable validation (slower)
  - validation_steps: interval (in steps) at which validation runs
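For example, the edited values might look like the following (the exact assignment syntax inside finetune_cognvs.sh and the inpainting-checkpoint path are assumptions; adjust them to your setup):
model_path="checkpoints/CogVideoX-5b-I2V"              # base CogVideoX-5b-I2V weights
transformer_id="checkpoints/cognvs_ckpt_inpaint"       # CogNVS inpainting checkpoint
output_dir="checkpoints/cognvs_ckpt_finetuned_bear"    # where finetuned checkpoints are written
base_dir_input="demo_data/davis_bear"                  # sequence folder with generated training pairs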
- Edit finetune/accelerate_config.yaml:
  - gpu_ids: GPU ids to use for training
  - num_processes: must match the number of GPU ids
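For instance, to train on two GPUs the relevant fields might read (an illustrative excerpt, not the full config):
gpu_ids: "0,1"
num_processes: 2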
- Start finetuning:
cd finetune
sh finetune_cognvs.sh
- Process finetuned checkpoints
Place the following files from the toolbox/ folder into the checkpoints/ directory:
- config.json
- diffusion_pytorch_model.safetensors.index.json
- process_ckpts.sh
The structure should be:
checkpoints/
├── config.json
├── diffusion_pytorch_model.safetensors.index.json
├── process_ckpts.sh
└── cognvs_ckpt_finetuned_bear/
    └── checkpoint-200/
Edit process_ckpts.sh to match your checkpoint step:
CHECKPOINT_DIR="checkpoint-200"
Then run:
cd checkpoints
sh process_ckpts.sh
This processing step can take ~20 min or longer, depending on your system performance.
- Go back to Section 2 (Inference) and run on evaluation renders
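For example, inference on the evaluation renders with the processed checkpoint might look like the following (the finetuned-checkpoint path and render pattern are assumptions based on the examples above):
cd ..  # back to the repository root
python demo.py \
--model_path "checkpoints/CogVideoX-5b-I2V" \
--cognvs_ckpt_path "checkpoints/cognvs_ckpt_finetuned_bear/my_checkpoint-200_transformer" \
--data_path "demo_data/davis_bear" \
--mp4_name "eval_render*.mp4"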
Our work builds on CogVideoX and uses DeepSpeed ZeRO-2 for memory-efficient finetuning. Video depth estimation adopts MegaSAM or DepthCrafter. Concurrent research includes ViewCrafter, GEN3C, CAT4D, TrajectoryCrafter, ReCamMaster, etc. We thank the authors for their contributions.
If you find this work helpful, please cite:
@inproceedings{chen2025cognvs,
title = {Reconstruct, Inpaint, Test-Time Finetune: Dynamic Novel-view Synthesis from Monocular Videos},
author = {Chen, Kaihua and Khurana, Tarasha and Ramanan, Deva},
booktitle = {Advances in Neural Information Processing Systems (NeurIPS)},
year = {2025}
}