This repository is the official PyTorch implementation of our paper: Emotionally Enhanced Talking Face Generation. We introduce a multimodal framework that generates lip-synced videos for arbitrary identities, languages, and emotions. The framework comes with a user-friendly web interface that offers a real-time experience for talking face generation with emotions.
📑 Original Paper | 📰 Project Page | 🌀 Demo | ⚡ Live Testing |
---|---|---|---|
Paper | Project Page | Demo Video | Interactive Demo |
All results from this open-source code or our demo website should be used for research/academic/personal purposes only.
- ffmpeg: `sudo apt-get install ffmpeg`
- Install necessary packages using `pip install -r requirements.txt`.
- The face detection pre-trained model should be downloaded to `face_detection/detection/sfd/s3fd.pth`. Alternative link if the above does not work.
Download the data from this repo.
python convertFPS.py -i <raw_video_folder> -o <folder_to_save_25fps_videos>
python preprocess_crema-d.py --data_root <folder_of_25fps_videos> --preprocessed_root preprocessed_dataset/
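The first command resamples every raw clip to 25 fps before preprocessing. As a rough illustration only, and not the contents of `convertFPS.py`, the sketch below shows one way such a step could be written around `ffmpeg`. The helper names `build_fps_command` and `convert_folder` are hypothetical, and the choice to convert every file in the folder is an assumption made for the example.

```python
import subprocess
from pathlib import Path
from typing import List


def build_fps_command(src: Path, dst: Path, fps: int = 25) -> List[str]:
    """Return an ffmpeg command that re-encodes `src` to `dst` at `fps` frames per second."""
    # -y overwrites an existing output file; -r placed before the output sets its frame rate.
    return ["ffmpeg", "-y", "-i", str(src), "-r", str(fps), str(dst)]


def convert_folder(in_dir: str, out_dir: str, fps: int = 25) -> None:
    """Convert every regular file in `in_dir`, writing same-named files to `out_dir`."""
    out_path = Path(out_dir)
    out_path.mkdir(parents=True, exist_ok=True)
    for src in sorted(Path(in_dir).iterdir()):
        if src.is_file():
            # check=True raises CalledProcessError if ffmpeg fails on a clip.
            subprocess.run(build_fps_command(src, out_path / src.name, fps), check=True)
```

Calling `convert_folder("<raw_video_folder>", "<folder_to_save_25fps_videos>")` would then mirror the command above, provided `ffmpeg` is on the `PATH`.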
There are three major steps: (i) train the expert lip-sync discriminator, (ii) train the emotion discriminator, and (iii) train the EmoGen model.
python color_syncnet_train.py --data_root preprocessed_dataset/ --checkpoint_dir <folder_to_save_checkpoints>
python emotion_disc_train.py -i preprocessed_dataset/ -o <folder_to_save_checkpoints>
python train.py --data_root preprocessed_dataset/ --checkpoint_dir <folder_to_save_checkpoints> --syncnet_checkpoint_path <path_to_expert_disc_checkpoint> --emotion_disc_path <path_to_emotion_disc_checkpoint>
You can also set additional, less commonly used hyperparameters at the bottom of the `hparams.py` file.
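The three stages depend on each other: stages (i) and (ii) each produce a checkpoint that stage (iii) takes as input. The sketch below is a minimal way to keep that ordering explicit by assembling the three commands in one place. It uses only the flags shown above; the function name `build_training_commands` and all directory and checkpoint paths in the demo are hypothetical placeholders, since the actual checkpoint file names depend on your run.

```python
import shlex
from typing import List


def build_training_commands(
    data_root: str,
    sync_dir: str,
    emo_dir: str,
    gen_dir: str,
    sync_ckpt: str,
    emo_ckpt: str,
) -> List[List[str]]:
    """Return the three training commands in the order they should be run."""
    return [
        # (i) expert lip-sync discriminator
        ["python", "color_syncnet_train.py",
         "--data_root", data_root, "--checkpoint_dir", sync_dir],
        # (ii) emotion discriminator
        ["python", "emotion_disc_train.py", "-i", data_root, "-o", emo_dir],
        # (iii) generator, which consumes one checkpoint from each earlier stage
        ["python", "train.py",
         "--data_root", data_root, "--checkpoint_dir", gen_dir,
         "--syncnet_checkpoint_path", sync_ckpt,
         "--emotion_disc_path", emo_ckpt],
    ]


if __name__ == "__main__":
    # All paths below are placeholders chosen for illustration.
    commands = build_training_commands(
        "preprocessed_dataset/",
        "checkpoints/syncnet", "checkpoints/emotion", "checkpoints/emogen",
        "checkpoints/syncnet/best.pth", "checkpoints/emotion/best.pth",
    )
    for command in commands:
        print(shlex.join(command))
```

The script only prints the commands rather than running them, because the checkpoints passed to stage (iii) have to be chosen after stages (i) and (ii) finish.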
Model description | Link to the model |
---|---|
Emogen (PL+DA) | Link |
Emogen (PRE) | Link |
Comment out these code lines before running inference: line1 and line2.
python inference.py --checkpoint_path <ckpt> --face <video.mp4> --audio <an-audio-source> --emotion <categorical emotion>
The result is saved (by default) in `results/{emotion}.mp4`. You can change the output path with an argument, as with several other available options. The audio source can be any file supported by FFmpeg that contains audio data: `*.wav`, `*.mp3`, or even a video file, from which the code will automatically extract the audio. Choose the categorical emotion from this list: [HAP, SAD, FEA, ANG, DIS, NEU].
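Since `--emotion` accepts only the six codes listed above, it can help to validate the value before launching a run. The following is a small sketch of that idea, not part of the repository: `build_inference_command` and `commands_for_all_emotions` are hypothetical helpers, and normalising the input to upper case is an assumption made for convenience.

```python
from typing import List

# The six categorical emotion codes listed above.
EMOTIONS = ("HAP", "SAD", "FEA", "ANG", "DIS", "NEU")


def build_inference_command(checkpoint: str, face: str, audio: str, emotion: str) -> List[str]:
    """Validate the emotion code and return the matching inference command."""
    code = emotion.strip().upper()  # assumption: accept "hap" as well as "HAP"
    if code not in EMOTIONS:
        raise ValueError(f"Unknown emotion {emotion!r}; choose one of {', '.join(EMOTIONS)}")
    return [
        "python", "inference.py",
        "--checkpoint_path", checkpoint,
        "--face", face,
        "--audio", audio,
        "--emotion", code,
    ]


def commands_for_all_emotions(checkpoint: str, face: str, audio: str) -> List[List[str]]:
    """Return one inference command per supported emotion, e.g. for a side-by-side comparison."""
    return [build_inference_command(checkpoint, face, audio, code) for code in EMOTIONS]
```

Each returned list can be passed to `subprocess.run(..., check=True)`. With the default output location described above, each emotion would be written to its own `results/{emotion}.mp4` file.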
- Experiment with the `--pads` argument to adjust the detected face bounding box. This often leads to improved results. You might need to increase the bottom padding to include the chin region, e.g. `--pads 0 20 0 0`.
- If you see the mouth position dislocated or some weird artifacts such as two mouths, it may be because of over-smoothing the face detections. Use the `--nosmooth` argument and try again.
- Experiment with the `--resize_factor` argument to get a lower-resolution video. Why? The models are trained on faces at a lower resolution. You might get better, visually pleasing results for 720p videos than for 1080p videos (in many cases, the latter works well too).
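To make the effect of `--pads` and `--nosmooth` more concrete, here is a minimal sketch of what padding and temporal smoothing of face boxes typically look like. It assumes a Wav2Lip-style convention, with pads given as (top, bottom, left, right) and smoothing done as a moving average over neighbouring frames; the function names, the window size of 5, and the exact arithmetic are assumptions for illustration and may differ from this repository's code.

```python
from typing import List, Sequence, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2) in pixels


def apply_pads(box: Box, frame_shape: Tuple[int, int], pads: Sequence[int]) -> Box:
    """Grow a detected box by `pads` = (top, bottom, left, right), clamped to the frame."""
    x1, y1, x2, y2 = box
    height, width = frame_shape
    top, bottom, left, right = pads
    return (
        max(0, x1 - left),
        max(0, y1 - top),
        min(width, x2 + right),
        min(height, y2 + bottom),  # a larger bottom pad pulls the chin into the crop
    )


def smooth_boxes(boxes: List[Box], window: int = 5) -> List[Box]:
    """Replace each box with the mean of a `window`-frame slice starting at that frame."""
    count = len(boxes)
    smoothed = []
    for i in range(count):
        # Near the end of the clip, reuse the last full window instead of a short one.
        start = i if i + window <= count else max(0, count - window)
        chunk = boxes[start:start + window]
        smoothed.append(tuple(sum(b[k] for b in chunk) / len(chunk) for k in range(4)))
    return smoothed
```

Under these assumptions, averaging helps when detections jitter slightly, but when the head moves quickly the averaged box no longer matches the face in the current frame, which is one plausible source of a dislocated or doubled mouth.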
Please check the `evaluation/` folder for the instructions.
This repository can only be used for personal/research/non-commercial purposes. Please cite the following paper if you use this repository:
@misc{goyal2023emotionally,
title={Emotionally Enhanced Talking Face Generation},
author={Sahil Goyal and Shagun Uppal and Sarthak Bhagat and Yi Yu and Yifang Yin and Rajiv Ratn Shah},
year={2023},
eprint={2303.11548},
archivePrefix={arXiv},
primaryClass={cs.CV}
}
The code structure is inspired by Wav2Lip. We thank the authors for the wonderful code. The code for face detection has been taken from the face_alignment repository. We thank the authors for releasing their code and models. The demo website was developed by @ddhroov10 and @SakshatMali.