Duolikun Danier, Fan Zhang, David Bull
We observe that most existing learning-based VFI models are trained to minimise the L1/L2/VGG loss between their outputs and the ground-truth frames. However, it was shown in previous works that these metrics do not correlate well with the perceptual quality of VFI. On the other hand, generative models, especially diffusion models, are showing remarkable results in generating visual content with high perceptual quality. In this work, we leverage the high-fidelity image/video generation capabilities of latent diffusion models to perform generative VFI.
See environment.yaml for the required packages. Simple installation:
conda env create -f environment.yaml
The pre-trained model can be downloaded from here; its corresponding config file is configs/ldm/ldmvfi-vqflow-f32-c256-concat_max.yaml.
[Vimeo-90K] | [BVI-DVC quintuplets]
[Middlebury] | [UCF101] | [DAVIS] | [SNU-FILM]
To use evaluate.py and the files in ldm/data/, the dataset folder names should be lower-case and structured as follows.
└──── <data directory>/
    ├──── middlebury_others/
    |   ├──── input/
    |   |   ├──── Beanbags/
    |   |   ├──── ...
    |   |   └──── Walking/
    |   └──── gt/
    |       ├──── Beanbags/
    |       ├──── ...
    |       └──── Walking/
    ├──── ucf101/
    |   ├──── 0/
    |   ├──── ...
    |   └──── 99/
    ├──── davis90/
    |   ├──── bear/
    |   ├──── ...
    |   └──── walking/
    ├──── snufilm/
    |   ├──── test-easy.txt
    |   ├──── ...
    |   └──── data/SNU-FILM/test/...
    ├──── bvidvc/quintuplets
    |   ├──── 00000/
    |   ├──── ...
    |   └──── 17599/
    └──── vimeo_septuplet/
        ├──── sequences/
        ├──── sep_testlist.txt
        └──── sep_trainlist.txt
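Before running evaluation or training, it can help to confirm that the data directory matches the layout above. The following is a minimal sketch of such a check, not part of this repository: the helper names `missing_entries` and `non_lowercase_dirs` are hypothetical, and the list of expected paths is copied from the tree above.

```python
from pathlib import Path

# Relative paths copied from the directory tree above. The helpers below are
# hypothetical and are not provided by the repository.
EXPECTED_ENTRIES = [
    "middlebury_others/input",
    "middlebury_others/gt",
    "ucf101",
    "davis90",
    "snufilm",
    "bvidvc/quintuplets",
    "vimeo_septuplet/sequences",
    "vimeo_septuplet/sep_testlist.txt",
    "vimeo_septuplet/sep_trainlist.txt",
]


def missing_entries(data_dir, expected=EXPECTED_ENTRIES):
    """Return the expected relative paths that do not exist under data_dir."""
    root = Path(data_dir)
    return [rel for rel in expected if not (root / rel).exists()]


def non_lowercase_dirs(data_dir):
    """Return top-level folder names that are not entirely lower-case."""
    root = Path(data_dir)
    return sorted(
        p.name for p in root.iterdir() if p.is_dir() and p.name != p.name.lower()
    )
```

If only some of the datasets are needed, pass a shorter `expected` list, e.g. `missing_entries("<path/to/data/dir>", ["middlebury_others/input", "middlebury_others/gt"])`.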
To evaluate LDMVFI (with DDIM sampler), for example, on the Middlebury dataset, using PSNR/SSIM/LPIPS, run the following command.
python evaluate.py \
--config configs/ldm/ldmvfi-vqflow-f32-c256-concat_max.yaml \
--ckpt <path/to/ldmvfi-vqflow-f32-c256-concat_max.ckpt> \
--dataset Middlebury_others \
--metrics PSNR SSIM LPIPS \
--data_dir <path/to/data/dir> \
--out_dir eval_results/ldmvfi-vqflow-f32-c256-concat_max/ \
--use_ddim
This will create the directory eval_results/ldmvfi-vqflow-f32-c256-concat_max/Middlebury_others/ and store the interpolated frames, together with a results.txt file, in that directory. For other test sets, replace Middlebury_others with the corresponding class name defined in ldm/data/testsets.py (e.g. Ucf101_triplet).
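To evaluate several test sets in one go, the evaluation command above can be assembled programmatically. The sketch below only builds the argument list shown above; `build_eval_command` is a hypothetical helper, not something the repository provides, and the only dataset names used are the two that appear in this README.

```python
import shlex

CONFIG = "configs/ldm/ldmvfi-vqflow-f32-c256-concat_max.yaml"
EXP_NAME = "ldmvfi-vqflow-f32-c256-concat_max"


def build_eval_command(dataset, ckpt, data_dir, metrics=("PSNR", "SSIM", "LPIPS")):
    """Assemble the evaluate.py argument list for one test set (hypothetical helper)."""
    return [
        "python", "evaluate.py",
        "--config", CONFIG,
        "--ckpt", ckpt,
        "--dataset", dataset,
        "--metrics", *metrics,
        "--data_dir", data_dir,
        "--out_dir", f"eval_results/{EXP_NAME}/",
        "--use_ddim",
    ]


if __name__ == "__main__":
    # Only these two class names appear in this README; the others are
    # defined in ldm/data/testsets.py.
    for name in ["Middlebury_others", "Ucf101_triplet"]:
        cmd = build_eval_command(name, "path/to/model.ckpt", "path/to/data/dir")
        print(shlex.join(cmd))
        # To actually run it: subprocess.run(cmd, check=True)
```

Building a list rather than a single string avoids shell-quoting problems when the checkpoint or data paths contain spaces.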
To evaluate the model on the perceptual video metric FloLPIPS, first evaluate the image metrics using the command above (so that the interpolated frames are saved in eval_results/ldmvfi-vqflow-f32-c256-concat_max), then run the following command.
python evaluate_vqm.py \
--exp ldmvfi-vqflow-f32-c256-concat_max \
--dataset Middlebury_others \
--metrics FloLPIPS \
--data_dir <path/to/data/dir> \
--out_dir eval_results/ldmvfi-vqflow-f32-c256-concat_max/
This will read the interpolated frames previously stored in eval_results/ldmvfi-vqflow-f32-c256-concat_max/Middlebury_others/, then write the evaluation results to results_vqm.txt in the same folder.
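Because evaluate_vqm.py relies on frames saved by the earlier image-metric run, a quick check that the output folder exists and is non-empty can save a failed run. This is a minimal sketch under the folder convention described above; `result_paths` and `ready_for_vqm` are hypothetical names, and the check makes no assumption about how the frames inside the folder are named.

```python
from pathlib import Path


def result_paths(exp, dataset, root="eval_results"):
    """Locations described in this README for one experiment and test set."""
    out = Path(root) / exp / dataset
    return {
        "frames_dir": out,                         # interpolated frames are stored here
        "image_metrics": out / "results.txt",      # written by evaluate.py
        "video_metrics": out / "results_vqm.txt",  # written by evaluate_vqm.py
    }


def ready_for_vqm(exp, dataset, root="eval_results"):
    """True if the output folder exists and contains at least one entry."""
    frames_dir = result_paths(exp, dataset, root)["frames_dir"]
    return frames_dir.is_dir() and any(frames_dir.iterdir())
```

For example, `ready_for_vqm("ldmvfi-vqflow-f32-c256-concat_max", "Middlebury_others")` returns False until the image-metric evaluation has been run from the same working directory.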
To interpolate a video (in .yuv format), use the following command.
python interpolate_yuv.py \
--net LDMVFI \
--config configs/ldm/ldmvfi-vqflow-f32-c256-concat_max.yaml \
--ckpt <path/to/ldmvfi-vqflow-f32-c256-concat_max.ckpt> \
--input_yuv <path/to/input/yuv> \
--size <spatial res of video, e.g. 1920x1080> \
--out_fps <output fps, should be 2 x original fps> \
--out_dir <desired/output/dir> \
--use_ddim
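Raw .yuv files carry no header, so the --size and --out_fps values must be supplied correctly by hand. The sketch below shows one way to sanity-check them from the file size. It assumes 8-bit planar YUV 4:2:0 input, which this README does not confirm, and it assumes one new frame is inserted between each consecutive pair of input frames, which is consistent with the 2x output frame rate noted above. All function names are hypothetical.

```python
def parse_size(size):
    """Parse a 'WIDTHxHEIGHT' string such as '1920x1080'."""
    width, height = size.lower().split("x")
    return int(width), int(height)


def yuv420_frame_bytes(width, height):
    """Bytes per frame for 8-bit planar YUV 4:2:0 (an assumed input layout)."""
    if width % 2 or height % 2:
        raise ValueError("4:2:0 subsampling expects even width and height")
    luma = width * height
    chroma = 2 * (width // 2) * (height // 2)  # U and V planes at quarter resolution
    return luma + chroma


def interpolation_plan(file_bytes, size, in_fps):
    """Derive the frame count and the --out_fps value for 2x interpolation."""
    width, height = parse_size(size)
    frame_bytes = yuv420_frame_bytes(width, height)
    n_frames, remainder = divmod(file_bytes, frame_bytes)
    if remainder:
        raise ValueError("file size is not a whole number of frames; check --size")
    return {
        "input_frames": n_frames,
        "out_fps": 2 * in_fps,  # the README asks for 2 x original fps
        "frames_to_insert": max(n_frames - 1, 0),  # assumes one frame per consecutive pair
    }
```

In practice, `file_bytes` would come from `os.path.getsize(<path/to/input/yuv>)`. A ValueError here usually means that the resolution passed to --size, or the assumed pixel format, does not match the file.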
LDMVFI is trained in two stages: the VQ-FIGAN and the denoising U-Net are trained separately. The first command below trains the VQ-FIGAN, and the second trains the denoising U-Net.
python main.py --base configs/autoencoder/vqflow-f32.yaml -t --gpus 0,
python main.py --base configs/ldm/ldmvfi-vqflow-f32-c256-concat_max.yaml -t --gpus 0,
These will create a logs/ folder, within which a corresponding directory is created for each experiment. The training logs include checkpoints, images and TensorBoard logs.
To resume from a checkpoint file, use the --resume argument of main.py to specify the checkpoint.
@article{danier2023ldmvfi,
title={LDMVFI: Video Frame Interpolation with Latent Diffusion Models},
author={Danier, Duolikun and Zhang, Fan and Bull, David},
journal={arXiv preprint arXiv:2303.09508},
year={2023}
}
Our code is adapted from the original latent-diffusion repository. We thank the authors for sharing their code.