Here you can find information about training and evaluating Diffused Heads. If you want to test our model on CREMA, please switch back to the `main` branch.
Note: no checkpoints or datasets are provided here. The code was only roughly cleaned and may contain bugs; please open an issue to discuss any problem you run into. We apologize for the delay in publishing the code.
The CREMA checkpoint can be downloaded here. No LRW checkpoint will be provided due to license restrictions.
Our model works best on videos with consistent face alignment. To prepare your videos, please use the face processor; you can experiment with different offset values.
Precompute audio embeddings for your dataset. We used the audio encoder from SDA; you can reuse part of the demo code from the `main` branch, where a scripted checkpoint is provided. You are free to use any suitable audio encoder instead, and a better (and easier) choice may be Whisper Large. Remember to change the audio embedding dimension in the config file if needed.
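Whichever encoder you choose, the embeddings generally need to be aligned one-per-video-frame. A minimal sketch of slicing a waveform into per-frame windows before encoding (the function name and windowing scheme are our assumptions, not the repo's actual preprocessing):

```python
import numpy as np

def frame_audio(waveform: np.ndarray, sample_rate: int, fps: float) -> np.ndarray:
    """Slice a mono waveform into one fixed-size window per video frame.

    Hypothetical helper: the actual Diffused Heads preprocessing may differ
    (e.g. overlapping windows or encoder-specific feature extraction).
    """
    samples_per_frame = int(round(sample_rate / fps))
    n_frames = len(waveform) // samples_per_frame
    # Drop trailing samples that do not fill a whole frame window.
    usable = waveform[: n_frames * samples_per_frame]
    return usable.reshape(n_frames, samples_per_frame)

# Example: 1 second of 16 kHz audio at 25 fps -> 25 windows of 640 samples.
chunks = frame_audio(np.zeros(16000, dtype=np.float32), 16000, 25.0)
print(chunks.shape)  # (25, 640)
```

Each window can then be passed through the encoder of your choice, and the resulting per-frame embeddings saved next to the video clip.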
The provided dataset class works on a predefined `file_list.txt` containing relative paths to video clips. Examples can be found in `datasets/`. The data folder should contain the subfolders `audio/` and `video/` with separate audio and video files.
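A file list for this layout can be generated with a short script. A sketch, assuming clips live under `video/` as `.mp4` files (the exact path pattern and list format the repo expects may differ):

```python
from pathlib import Path

def write_file_list(data_root: str, out_path: str) -> int:
    """Write relative paths of all clips under video/ to a file list.

    Sketch only: check the examples in datasets/ for the exact format
    the Diffused Heads dataset class expects.
    """
    root = Path(data_root)
    clips = sorted(p.relative_to(root) for p in (root / "video").rglob("*.mp4"))
    with open(out_path, "w") as f:
        for clip in clips:
            f.write(f"{clip.as_posix()}\n")
    return len(clips)

# Usage:
# write_file_list("data", "datasets/file_list.txt")
```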
To train the model, specify paths and parameters in `./configs/config.yaml`, then run:

```shell
python train.py
```
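For orientation, such a config typically collects dataset paths and training hyperparameters. The keys below are illustrative placeholders only, not the actual schema of `./configs/config.yaml`:

```yaml
# Hypothetical sketch -- check ./configs/config.yaml for the real key names.
data_root: /path/to/data          # folder containing audio/ and video/
file_list: datasets/file_list.txt
audio_emb_dim: 512                # must match your audio encoder's output size
batch_size: 16
lr: 1.0e-4
```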
To generate multiple test videos, specify paths and parameters in `./configs/config_gen_test.yaml`, then run:

```shell
python generate.py
```
To generate a video from any image/video and audio, specify paths and parameters in `./configs/config_gen_custom.yaml`, then run:

```shell
python custom_video.py
```
The test splits we used for CREMA and LRW can be found in `datasets/`.
Metrics used:
- FVD: Laughing Matters repo
- FID: torchmetrics
- Blinks/s and blink duration: https://github.com/DinoMan/blink-detector
- OFM and F-MSE: `./smoothness_eval.py`
- AV offset and AV confidence: https://github.com/joonson/syncnet_python
- WER: a pretrained lipreading model that we cannot share; you can use any available one.
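As a rough illustration of what the frame-wise smoothness metrics measure, an F-MSE-style score can be computed as the mean squared difference between consecutive frames. This is a simplified sketch, not the exact definition used in `./smoothness_eval.py`:

```python
import numpy as np

def frame_mse(frames: np.ndarray) -> float:
    """Mean squared difference between consecutive frames.

    Simplified sketch of an F-MSE-style smoothness score; the repo's
    ./smoothness_eval.py may define it differently.
    frames: (T, H, W, C) array of floats in [0, 1].
    """
    diffs = frames[1:] - frames[:-1]
    return float(np.mean(diffs ** 2))

# A perfectly static clip scores 0; lower generally means smoother motion.
static = np.zeros((10, 4, 4, 3), dtype=np.float32)
print(frame_mse(static))  # 0.0
```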
Our code supports W&B logging; the relevant code is left commented out in the main scripts.
```bibtex
@inproceedings{stypulkowski2024diffused,
  title={Diffused heads: Diffusion models beat GANs on talking-face generation},
  author={Stypu{\l}kowski, Micha{\l} and Vougioukas, Konstantinos and He, Sen and Zi{\k{e}}ba, Maciej and Petridis, Stavros and Pantic, Maja},
  booktitle={Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision},
  pages={5091--5100},
  year={2024}
}
```
This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.