Generate realistic talking faces for any human speech and face identity.
[Paper] | [Project Page] | [Demonstration Video]
A new, improved work that can produce significantly more accurate and natural results on moving talking face videos is available here: https://github.com/Rudrabha/Wav2Lip
Code without the MATLAB dependency is now available in the fully_pythonic branch. Note that the models in the two branches are not identical, and either one may perform better than the other depending on the case. The model used at the time of the paper's publication is the one with the MATLAB dependency, and it is the one that has been extensively tested. Feel free to experiment with the fully_pythonic branch if you want to avoid the MATLAB dependency.
A Google Colab notebook is also available for the fully_pythonic branch. [Credits: Kirill]
- Can handle in-the-wild face poses and expressions.
- Can handle speech in any language and is robust to background noise.
- Paste faces back into the original video with minimal/no artefacts; this can potentially correct lip sync errors in dubbed movies!
- Complete multi-gpu training code, pre-trained models available.
- Fast inference code to generate results from the pre-trained models
- Python >= 3.5
- ffmpeg:
sudo apt-get install ffmpeg
- MATLAB R2016a (for audio preprocessing; this dependency will be removed in later versions)
- Install necessary packages using
pip install -r requirements.txt
- Install keras-contrib
pip install git+https://www.github.com/keras-team/keras-contrib.git
Download checkpoints of the following models into the logs/ folder:
- CNN Face detection using dlib: Link
- LipGAN Google Drive
LipGAN takes speech features in the form of MFCCs, so the input audio file must be preprocessed to obtain them. We use the create_mat.m script to create a .mat file for a given audio file.
cd matlab
matlab -nodesktop
>> create_mat(input_wav_or_mp4_file, path_to_output.mat) % replace with file paths
>> exit
cd ..
Here, we are given an audio input (as .mat MFCC features) and a video of an identity speaking something entirely different. LipGAN synthesizes the correct lip motion for the given audio and overlays it on the given video of the speaking identity (Examples #1 and #2 in the above image).
python batch_inference.py --checkpoint_path <saved_checkpoint> --face <random_input_video> --fps <fps_of_input_video> --audio <guiding_audio_wav_file> --mat <mat_file_from_above> --results_dir <folder_to_save_generated_video>
The generated result_voice.mp4 will contain the input video lip-synced with the given input audio. Note that the FPS parameter defaults to 25; make sure you set the FPS correctly for your own input video.
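If you are unsure of the frame rate of your input video, it can be read with ffprobe, which ships with ffmpeg. The following is a minimal sketch and not part of this repository; the helper names `parse_frame_rate` and `probe_fps` are hypothetical, and it assumes `ffprobe` is on your PATH.

```python
import subprocess
from fractions import Fraction


def parse_frame_rate(text):
    """Convert an ffprobe rate string such as '25/1' or '30000/1001' to a float."""
    return float(Fraction(text.strip()))


def probe_fps(video_path):
    """Ask ffprobe for the frame rate of the first video stream (assumes ffprobe is installed)."""
    cmd = [
        "ffprobe", "-v", "error",
        "-select_streams", "v:0",
        "-show_entries", "stream=r_frame_rate",
        "-of", "default=noprint_wrappers=1:nokey=1",
        video_path,
    ]
    # stdout=PIPE with universal_newlines keeps this compatible with Python 3.5.
    result = subprocess.run(cmd, stdout=subprocess.PIPE,
                            universal_newlines=True, check=True)
    return parse_frame_rate(result.stdout.splitlines()[0])
```

The value returned by `probe_fps` for your video can then be passed to `--fps` in the command above.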
Refer to example #3 in the above picture. Given an audio, LipGAN generates a correct mouth shape (viseme) at each time-step and overlays it on the input image. The sequence of generated mouth shapes yields a talking face video.
python batch_inference.py --checkpoint_path <saved_checkpoint> --face <random_input_face> --audio <guiding_audio_wav_file> --mat <mat_file_from_above> --results_dir <folder_to_save_generated_video>
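To see why the frame rate matters for alignment, the sketch below maps a video frame index to a window of MFCC time-steps. It is an illustration only, not the repository's actual code: the function name and the values `mfcc_rate=100.0` (MFCC steps per second) and `window=20` are assumptions chosen for the arithmetic.

```python
def mfcc_window_for_frame(frame_idx, fps=25.0, mfcc_rate=100.0, window=20):
    """Return (start, end) MFCC step indices for one video frame.

    mfcc_rate and window are illustrative assumptions, not values
    taken from the LipGAN code.
    """
    # Frame i is shown at time i / fps seconds; convert that time to an MFCC step.
    start = int(round(frame_idx * mfcc_rate / fps))
    return start, start + window


if __name__ == "__main__":
    # One second into the audio is frame 25 at 25 FPS but frame 30 at 30 FPS.
    print(mfcc_window_for_frame(25, fps=25.0))
    print(mfcc_window_for_frame(30, fps=30.0))
```

Under these assumptions, a wrong FPS value shifts every window by a growing amount, so the generated mouth shapes drift away from the audio over time.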
Please use the --pads argument to correct for inaccurate face detections such as not covering the chin region correctly. This can improve the results further.
python batch_inference.py --help
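As a rough picture of what padding a face detection does, the sketch below grows a bounding box and clamps it to the frame. It is an assumption-laden illustration, not the repository's code: the helper `pad_box`, the box format (x1, y1, x2, y2), and the pad order (top, bottom, left, right) in pixels are all assumed here, so check the `--help` output for the actual semantics of `--pads`.

```python
def pad_box(box, pads, frame_w, frame_h):
    """Grow a face box (x1, y1, x2, y2) by pads and clamp it to the frame.

    pads is assumed to be (top, bottom, left, right) in pixels.
    """
    x1, y1, x2, y2 = box
    top, bottom, left, right = pads
    return (
        max(0, x1 - left),
        max(0, y1 - top),
        min(frame_w, x2 + right),
        min(frame_h, y2 + bottom),
    )


if __name__ == "__main__":
    # Extend the box 20 pixels downward so that it covers the chin.
    print(pad_box((100, 80, 200, 180), (0, 20, 0, 0), frame_w=640, frame_h=360))
```

Extending only the bottom edge, as in the usage line, is the kind of correction that helps when the detector cuts off the chin.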
We illustrate the training pipeline using the LRS2 dataset. Adapting for other datasets would involve small modifications to the code.
We need to do two things: (i) Save the MFCC features from the audio and (ii) extract and save the facial crops of each frame in the video.
data_root (mvlrs_v1)
├── main, pretrain (we use only main folder in this work)
| ├── list of folders
| │ ├── five-digit numbered video IDs ending with (.mp4)
We use MATLAB to save the MFCC files for all the videos present in the dataset. Refer to the fully_pythonic branch if you do not want to use MATLAB.
# Please copy the appropriate LRS2 train split's filelist.txt to the filelists/ folder. The example below is shown for LRS2.
cd matlab
matlab -nodesktop
>> preprocess_mat('../filelists/train.txt', 'mvlrs_v1/main/') % replace with appropriate file paths for other datasets.
>> exit
cd ..
We preprocess the video files by detecting faces using a face detector from dlib.
# Please copy the appropriate LRS2 split's filelist.txt to the filelists/ folder. Example below is shown for LRS2.
python preprocess.py --split [train|pretrain|val] --videos_data_root mvlrs_v1/ --final_data_root <folder_to_store_preprocessed_files>
### More options while preprocessing (like number of workers, image size etc.)
python preprocess.py --help
data_root (mvlrs_v1)
├── main, pretrain (we use only main folder in this work)
| ├── list of folders
| │ ├── folders with five-digit video IDs
| │ | ├── 0.jpg, 1.jpg .... (extracted face crops of each frame)
| │ | ├── 0.npz, 1.npz .... (mfcc features corresponding to each frame)
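Before training, it can be worth checking that every face crop has a matching MFCC file. The sketch below walks the layout shown above and reports frames that have a .jpg without a .npz, or the reverse. It is not part of the repository; the helper names `unpaired_frames` and `check_dataset` are hypothetical, and it assumes the `<split>/<folder>/<video_id>/` nesting from the tree.

```python
from pathlib import Path


def unpaired_frames(video_dir):
    """Return frame names that have a .jpg but no .npz, or the reverse."""
    video_dir = Path(video_dir)
    jpgs = {p.stem for p in video_dir.glob("*.jpg")}
    npzs = {p.stem for p in video_dir.glob("*.npz")}
    # Symmetric difference: names present in exactly one of the two sets.
    return sorted(jpgs ^ npzs,
                  key=lambda s: (not s.isdigit(), int(s) if s.isdigit() else 0, s))


def check_dataset(data_root, split="main"):
    """Map each video folder under data_root/split/<folder>/<video_id>/ to its unpaired frames."""
    problems = {}
    for video_dir in sorted(Path(data_root, split).glob("*/*")):
        if video_dir.is_dir():
            missing = unpaired_frames(video_dir)
            if missing:
                problems[str(video_dir)] = missing
    return problems
```

An empty dictionary from `check_dataset` means every frame in the split is paired.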
As training LipGAN is computationally intensive, you can just train the generator alone for quick, decent results.
python train_unet.py --data_root <path_to_preprocessed_dataset>
### Extensive set of training options available. Please run and refer to:
python train_unet.py --help
python train.py --data_root <path_to_preprocessed_dataset>
### Extensive set of training options available. Please run and refer to:
python train.py --help
The software is licensed under the MIT License. Please cite the following paper if you use this code:
@inproceedings{KR:2019:TAF:3343031.3351066,
author = {K R, Prajwal and Mukhopadhyay, Rudrabha and Philip, Jerin and Jha, Abhishek and Namboodiri, Vinay and Jawahar, C V},
title = {Towards Automatic Face-to-Face Translation},
booktitle = {Proceedings of the 27th ACM International Conference on Multimedia},
series = {MM '19},
year = {2019},
isbn = {978-1-4503-6889-6},
location = {Nice, France},
pages = {1428--1436},
numpages = {9},
url = {http://doi.acm.org/10.1145/3343031.3351066},
doi = {10.1145/3343031.3351066},
acmid = {3351066},
publisher = {ACM},
address = {New York, NY, USA},
keywords = {cross-language talking face generation, lip synthesis, neural machine translation, speech to speech translation, translation systems, voice transfer},
}
Part of the MATLAB code is taken from an implementation of Talking Face Generation. We thank the authors for releasing their code.