Qili Wang 1
Dajiang Wu 1
Zihang Xu 2
Junshi Huang 1
Jun Lv 1
1 JD.Com, Inc. 2 The University of Hong Kong
Demo videos:
- eng01_female_xinwen01.mp4
- eng03_female_xinwen03.mp4
- eng04_male_xinwen04.mp4
- ch02_female_eng03_female.mp4
- ch03_male_eng04_male.mp4
- ch04_female_eng05_female.mp4
- Tested GPUs: V100, A800
- Tested Python Version: 3.8.19
Create a conda environment and install the required packages with pip:
conda create -n joygen python=3.8.19 ffmpeg
conda activate joygen
pip install -r requirements.txt
Install the Nvdiffrast library:
git clone https://github.com/NVlabs/nvdiffrast
cd nvdiffrast
pip install .
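Optionally, verify the setup before moving on; this quick check only assumes that PyTorch and nvdiffrast were installed by the steps above:
# Optional sanity check: PyTorch should see a CUDA device and nvdiffrast should import cleanly
python -c "import torch; print(torch.__version__, torch.cuda.is_available())"
python -c "import nvdiffrast.torch as dr; print('nvdiffrast imported')"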
The pretrained models (Download link) should be organized as follows:
./pretrained_models/
├── BFM
│ ├── 01_MorphableModel.mat
│ ├── BFM_exp_idx.mat
│ ├── BFM_front_idx.mat
│ ├── BFM_model_front.mat
│ ├── Exp_Pca.bin
│ ├── facemodel_info.mat
│ ├── index_mp468_from_mesh35709.npy
│ ├── select_vertex_id.mat
│ ├── similarity_Lm3D_all.mat
│ └── std_exp.txt
├── audio2motion
│ ├── 240210_real3dportrait_orig
│ │ └── audio2secc_vae
│ │ ├── config.yaml
│ │ └── model_ckpt_steps_400000.ckpt
│ └── hubert
│ ├── config.json
│ ├── preprocessor_config.json
│ ├── pytorch_model.bin
│ ├── special_tokens_map.json
│ ├── tokenizer_config.json
│ └── vocab.json
├── joygen
│ ├── config.json
│ └── diffusion_pytorch_model.safetensors
├── dwpose
│ ├── default_runtime.py
│ ├── dw-ll_ucoco_384.pth
│ └── rtmpose-l_8xb32-270e_coco-ubody-wholebody-384x288.py
├── face-parse-bisent
│ ├── 79999_iter.pth
│ └── resnet18-5c106cde.pth
├── face_recon_feat0.2_augment
│ ├── epoch_20.pth
│ ├── loss_log.txt
│ ├── test_opt.txt
│ └── train_opt.txt
├── sd-vae-ft-mse
│ ├── README.md
│ ├── config.json
│ ├── diffusion_pytorch_model.bin
│ └── diffusion_pytorch_model.safetensors
└── whisper
└── tiny.pt
Or you can download them separately:
- audio2motion
- hubert
- BFM
- joygen
- dwpose
- face_recon_feat0.2_augment
- face-parse-bisent
- sd-vae-ft-mse
- whisper
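After downloading, you can quickly check that the key checkpoints are in place; the three files below are just representative entries from the tree above:
# Sanity check: report whether a few representative checkpoints exist
for f in pretrained_models/joygen/diffusion_pytorch_model.safetensors \
         pretrained_models/BFM/01_MorphableModel.mat \
         pretrained_models/whisper/tiny.pt; do
  [ -f "$f" ] && echo "OK       $f" || echo "MISSING  $f"
done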
We provide the video URLs (Download link).
Run the inference script:
bash scripts/inference_pipeline.sh args1 args2 args3
- args1: the driving audio file
- args2: the input video file
- args3: the directory for the results
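For example, using the demo files that also appear in the step-by-step commands below (the result directory name here is arbitrary):
bash scripts/inference_pipeline.sh ./demo/xinwen_5s.mp3 ./demo/example_5s.mp4 ./results/pipeline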
Alternatively, run the inference pipeline step by step:
- Obtain a sequence of facial expression coefficients from the audio.
python inference_audio2motion.py \
--a2m_ckpt ./pretrained_models/audio2motion/240210_real3dportrait_orig/audio2secc_vae \
--hubert_path ./pretrained_models/audio2motion/hubert \
--drv_aud ./demo/xinwen_5s.mp3 \
--seed 0 \
--result_dir ./results/a2m \
--exp_file xinwen_5s.npy
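To sanity-check the output, you can load the saved coefficient file; this only assumes it is a NumPy .npy file, while the exact array layout is determined by the script:
# Inspect the generated expression-coefficient file (layout depends on inference_audio2motion.py)
python -c "import numpy as np; a = np.load('./results/a2m/xinwen_5s.npy', allow_pickle=True); print(type(a), getattr(a, 'shape', None))"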
- Render the depth map frame by frame using the new expression coefficients.
python -u inference_edit_expression.py \
--name face_recon_feat0.2_augment \
--epoch=20 \
--use_opengl False \
--checkpoints_dir ./pretrained_models \
--bfm_folder ./pretrained_models/BFM \
--infer_video_path ./demo/example_5s.mp4 \
--infer_exp_coeff_path ./results/a2m/xinwen_5s.npy \
--infer_result_dir ./results/edit_expression
- Generate the facial animation based on the audio features and the facial depth map.
CUDA_VISIBLE_DEVICES=0 python -u inference_joygen.py \
--unet_model_path pretrained_models/joygen \
--vae_model_path pretrained_models/sd-vae-ft-mse \
--intermediate_dir ./results/edit_expression \
--audio_path demo/xinwen_5s.mp3 \
--video_path demo/example_5s.mp4 \
--enable_pose_driven \
--result_dir results/talk \
--img_size 256 \
--gpu_id 0
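To take a quick look at what was generated, assuming the step writes .mp4 files directly into results/talk (adjust the pattern if the output layout differs):
# List the generated videos and print their durations
for v in results/talk/*.mp4; do
  echo "$v"
  ffprobe -v error -show_entries format=duration -of default=noprint_wrappers=1 "$v"
done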
Preprocess the training videos (--video_dir is the directory containing the video files):
python -u preprocess_dataset.py \
--checkpoints_dir ./pretrained_models \
--name face_recon_feat0.2_augment \
--epoch=20 \
--use_opengl False \
--bfm_folder ./pretrained_models/BFM \
--video_dir ./demo \
--result_dir ./results/preprocessed_dataset
Check the preprocessed data and generate a list file for training.
python -u preprocess_dataset_extra.py data_dir
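For example, pointing it at the directory produced by the preprocessing step above:
python -u preprocess_dataset_extra.py ./results/preprocessed_dataset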
Modify the config.yaml file according to your requirements (e.g., the dataset section), then launch training:
accelerate launch --main_process_port 29501 --config_file config/accelerate_config.yaml train_joygen.py
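If you are training on a single GPU, a minimal variant of the same command (--num_processes is a standard accelerate option and overrides the value in the config file):
CUDA_VISIBLE_DEVICES=0 accelerate launch --num_processes 1 --main_process_port 29501 --config_file config/accelerate_config.yaml train_joygen.py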
We would like to thank the contributors to Deep3DFaceRecon_pytorch, Real3DPortrait, and MuseTalk for their open research and exploration.
@misc{wang2025joygenaudiodriven3ddepthaware,
title={JoyGen: Audio-Driven 3D Depth-Aware Talking-Face Video Editing},
author={Qili Wang and Dajiang Wu and Zihang Xu and Junshi Huang and Jun Lv},
year={2025},
eprint={2501.01798},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2501.01798},
}