Official PyTorch implementation for the paper:
CodeTalker: Speech-Driven 3D Facial Animation with Discrete Motion Prior, CVPR 2023.
Jinbo Xing, Menghan Xia, Yuechen Zhang, Xiaodong Cun, Jue Wang, Tien-Tsin Wong
We propose CodeTalker by casting speech-driven facial animation as a code query task in a finite proxy space of the learned codebook. Given the raw audio and a 3D neutral face template, our CodeTalker can produce vivid and realistic 3D facial motions with subtle expressions and accurate lip movements.
- 2023.06.16 Provide a Colab online demo.
- 2023.04.03 Release code and model weights!
- Linux
- Python 3.6+
- PyTorch 1.9.1
- CUDA 11.1 (GPU with at least 11GB VRAM)
Other necessary packages:
pip install -r requirements.txt
- ffmpeg
- MPI-IS/mesh
IMPORTANT: To allow replicated padding for ConvTranspose1d, modify the site-packages/torch/nn/modules/conv.py file of your PyTorch installation by commenting out the self.padding_mode != 'zeros' line.
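If you are unsure which conv.py your environment actually loads, a small helper like the one below may help locate it. This is only a sketch: `find_module_file` is a hypothetical name (not part of CodeTalker or PyTorch), and it only reports the file path; the edit itself still has to be made by hand.

```python
import importlib.util
from pathlib import Path
from typing import Optional


def find_module_file(module_name: str) -> Optional[Path]:
    """Return the source file of an importable module, or None if not found.

    Hypothetical helper for illustration. Note that resolving a dotted name
    imports its parent packages as a side effect.
    """
    try:
        spec = importlib.util.find_spec(module_name)
    except (ImportError, ValueError):
        # Raised when a parent package (e.g. torch) is not installed.
        return None
    if spec is None or spec.origin is None:
        return None
    return Path(spec.origin)


# Example (prints None if PyTorch is not installed in this environment):
# print(find_module_file("torch.nn.modules.conv"))
```

Running the commented example inside the environment you train in should print the path of the file to edit.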
Request the VOCASET data from https://voca.is.tue.mpg.de/. Place the downloaded files data_verts.npy, raw_audio_fixed.pkl, templates.pkl and subj_seq_to_idx.pkl in the folder vocaset/. Download "FLAME_sample.ply" from voca and put it in vocaset/. Read the vertices/audio data and convert them to .npy/.wav files stored in vocaset/vertices_npy and vocaset/wav:
cd vocaset
python process_voca_data.py
Follow BIWI/README.md to preprocess the BIWI dataset, then put the .npy/.wav files into BIWI/vertices_npy and BIWI/wav, and templates.pkl into BIWI/.
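After preprocessing, it can be worth checking that every vertex sequence has a matching audio file. The sketch below assumes that paired files share the same file stem (e.g., `x.npy` and `x.wav`); that naming convention and the helper name `find_unpaired` are assumptions for illustration, not something the repository provides.

```python
from pathlib import Path
from typing import Dict, List


def find_unpaired(dataset_dir: str) -> Dict[str, List[str]]:
    """List sequence names that exist in only one of the two data folders.

    Assumes <dataset_dir>/vertices_npy/*.npy and <dataset_dir>/wav/*.wav
    are paired by identical file stems (an assumption of this sketch).
    """
    root = Path(dataset_dir)
    npy_stems = {p.stem for p in (root / "vertices_npy").glob("*.npy")}
    wav_stems = {p.stem for p in (root / "wav").glob("*.wav")}
    return {
        "missing_wav": sorted(npy_stems - wav_stems),  # vertices without audio
        "missing_npy": sorted(wav_stems - npy_stems),  # audio without vertices
    }


# Example: report = find_unpaired("vocaset")  # or "BIWI"
```

Two empty lists in the result mean every sequence is paired.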
Download the pretrained models biwi_stage1.pth.tar & biwi_stage2.pth.tar and vocaset_stage1.pth.tar & vocaset_stage2.pth.tar, and put them under the BIWI and VOCASET folders, respectively. Given the audio signal,
- to animate a mesh in FLAME topology, run:
sh scripts/demo.sh vocaset
- to animate a mesh in BIWI topology, run:
sh scripts/demo.sh BIWI
This script will automatically generate the rendered videos in the demo/output folder. You can also put your own test audio file (.wav format) under the demo/wav folder and specify the arguments in the DEMO section of config/<dataset>/demo.yaml accordingly (e.g., demo_wav_path, condition, subject, etc.).
The training/testing operation shares a similar command:
sh scripts/<train.sh|test.sh> <exp_name> config/<vocaset|BIWI>/<stage1|stage2>.yaml <vocaset|BIWI> <s1|s2>
Replace <exp_name> with your own experiment name and <vocaset|BIWI> with the name of your target dataset, i.e., vocaset or BIWI. Change exp_dir in both scripts/train.sh and scripts/test.sh if needed. The default commands below use vocaset as an example.
sh scripts/train.sh CodeTalker_s1 config/vocaset/stage1.yaml vocaset s1
Make sure the paths of pre-trained models are correct, i.e., vqvae_pretrained_path and wav2vec2model_path in config/<vocaset|BIWI>/stage2.yaml.
sh scripts/train.sh CodeTalker_s2 config/vocaset/stage2.yaml vocaset s2
sh scripts/test.sh CodeTalker_s2 config/vocaset/stage2.yaml vocaset s2
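When launching several runs, the command pattern above can also be assembled programmatically. The following is a minimal sketch; `build_command` is a hypothetical helper, and it assumes the config files follow the config/<dataset>/stage<N>.yaml naming used in the examples above.

```python
from typing import List

DATASETS = ("vocaset", "BIWI")
STAGES = ("s1", "s2")
MODES = ("train", "test")


def build_command(mode: str, exp_name: str, dataset: str, stage: str) -> List[str]:
    """Build the argument list for scripts/train.sh or scripts/test.sh.

    Hypothetical helper: it only mirrors the documented command pattern and
    does not check that the scripts or config files exist.
    """
    if mode not in MODES:
        raise ValueError(f"mode must be one of {MODES}, got {mode!r}")
    if dataset not in DATASETS:
        raise ValueError(f"dataset must be one of {DATASETS}, got {dataset!r}")
    if stage not in STAGES:
        raise ValueError(f"stage must be one of {STAGES}, got {stage!r}")
    config = f"config/{dataset}/stage{stage[1]}.yaml"  # s1 -> stage1.yaml
    return ["sh", f"scripts/{mode}.sh", exp_name, config, dataset, stage]


# Example: the stage-1 training command shown above.
cmd = build_command("train", "CodeTalker_s1", "vocaset", "s1")
print(" ".join(cmd))
```

The returned list can be passed directly to `subprocess.run(cmd, check=True)` from the repository root.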
Modify the paths in scripts/render.sh and run:
sh scripts/render.sh
We provide the reference code for Lip Vertex Error & Upper-face Dynamics Deviation. Remember to change the paths in scripts/cal_metric.sh, and run:
sh scripts/cal_metric.sh
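For orientation, Lip Vertex Error is commonly described as the maximal L2 error over the lip vertices in each frame, averaged over all frames. The sketch below implements that description with plain Python sequences so it runs without extra dependencies. It is not the repository's script, which may differ in details (for example, by using squared distances or NumPy arrays). The function name and the `lip_idx` argument are hypothetical; the actual lip vertex indices depend on the mesh topology.

```python
import math
from typing import Sequence

Vertex = Sequence[float]    # (x, y, z)
Frame = Sequence[Vertex]    # all vertices of one frame


def _l2(p: Vertex, q: Vertex) -> float:
    """Euclidean distance between two points of equal dimension."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def lip_vertex_error(pred: Sequence[Frame], gt: Sequence[Frame],
                     lip_idx: Sequence[int]) -> float:
    """Mean over frames of the max L2 error among the given lip vertices.

    `lip_idx` is an assumed input: the indices of lip vertices in the mesh.
    """
    if len(pred) != len(gt):
        raise ValueError("pred and gt must have the same number of frames")
    if len(pred) == 0 or len(lip_idx) == 0:
        raise ValueError("need at least one frame and one lip vertex")
    per_frame_max = [
        max(_l2(p[v], g[v]) for v in lip_idx)
        for p, g in zip(pred, gt)
    ]
    return sum(per_frame_max) / len(per_frame_max)
```

In practice the same computation would usually be vectorized over arrays of shape (frames, vertices, 3).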
- Create the dataset directory <dataset_dir> in the CodeTalker directory.
- Place your vertices data (.npy files) and audio data (.wav files) in the <dataset_dir>/vertices_npy and <dataset_dir>/wav folders, respectively.
- Save the templates of all subjects to a templates.pkl file and put it in <dataset_dir>, as done for the BIWI and vocaset datasets. Export an arbitrary template to .ply format and put it in <dataset_dir>/.
- Create the corresponding config files in config/<dataset_dir> and modify the arguments in the config files.
- Check all the code segments related to dataset information.
- Follow the training/testing/visualization pipeline as done for the BIWI and vocaset datasets.
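The directory layout from the first steps can be scaffolded with a few lines of Python. This is a sketch under stated assumptions: `scaffold_dataset` is a hypothetical helper, and templates.pkl is assumed to be a pickled dict mapping each subject name to its template vertices. In practice the values would typically be arrays of shape (V, 3); plain nested lists are used here only to keep the example dependency-free. Check the existing BIWI or vocaset templates.pkl for the exact format expected by the data loader.

```python
import pickle
from pathlib import Path
from typing import Dict, List


def scaffold_dataset(dataset_dir: str,
                     templates: Dict[str, List[List[float]]]) -> Path:
    """Create <dataset_dir>/vertices_npy, <dataset_dir>/wav and templates.pkl.

    `templates` maps subject name -> template vertices (assumed format).
    """
    root = Path(dataset_dir)
    (root / "vertices_npy").mkdir(parents=True, exist_ok=True)
    (root / "wav").mkdir(parents=True, exist_ok=True)
    with open(root / "templates.pkl", "wb") as f:
        pickle.dump(templates, f)
    return root


# Example with a toy one-triangle "template" for a single subject:
# scaffold_dataset("my_dataset", {"subject_01": [[0, 0, 0], [1, 0, 0], [0, 1, 0]]})
```

The .npy/.wav files, the exported .ply template, and the config files still need to be added as described in the steps above.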
If you find the code useful for your work, please star this repo and consider citing:
@inproceedings{xing2023codetalker,
title={Codetalker: Speech-driven 3d facial animation with discrete motion prior},
author={Xing, Jinbo and Xia, Menghan and Zhang, Yuechen and Cun, Xiaodong and Wang, Jue and Wong, Tien-Tsin},
booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
pages={12780--12790},
year={2023}
}
- Although our codebase allows for multi-GPU training, we did not test it and simply hardcoded the training batch size to one. You may need to change the data_loader accordingly.
We borrow heavily from the code of FaceFormer, Learn2Listen, and VOCA. Thanks to the authors for sharing their code, and to huggingface-transformers for the wav2vec2 implementation. We also gratefully acknowledge the ETHZ-CVL for providing the B3D(AC)2 dataset and MPI-IS for releasing the VOCASET dataset. Any third-party packages are owned by their respective authors and must be used under their respective licenses.
- StyleHEAT: One-Shot High-Resolution Editable Talking Face Generation via Pre-trained StyleGAN (ECCV 2022)
- SadTalker: Learning Realistic 3D Motion Coefficients for Stylized Audio-Driven Single Image Talking Face Animation (CVPR 2023)
- MetaPortrait: Identity-Preserving Talking Head Generation with Fast Personalized Adaptation (CVPR 2023)
- DPE: Disentanglement of Pose and Expression for General Video Portrait Editing (CVPR 2023)
- MMFace4D: A Large-Scale Multi-Modal 4D Face Dataset for Audio-Driven 3D Face Animation (arXiv 2023)
