Yiguo Jiang¹ · Xiaodong Cun📪² · Yong Zhang³ · Yudian Zheng¹ · Fan Tang⁴ · Chi-Man Pun📪¹
¹ University of Macau · ² GVC Lab, Great Bay University · ³ Meituan · ⁴ ICT-CAS
We introduce EmoCAST, a novel diffusion-based emotional talking-head framework that animates an in-the-wild portrait from a single reference image, driving audio, and an emotive text description.
- Create conda environment

  ```bash
  conda create -n emocast python=3.10
  conda activate emocast
  ```
- Install packages with pip

  ```bash
  pip install -r requirements.txt
  pip install .
  ```
- Download Pretrained Models

  Download the models below into the `./pretrained_models/` folder (a convenience download sketch follows the directory layout below).

  | Model | Download Link |
  | --- | --- |
  | audio_separator | https://huggingface.co/huangjackson/Kim_Vocal_2 |
  | insightface | https://github.com/deepinsight/insightface/tree/master/python-package#model-zoo |
  | face landmarker | https://storage.googleapis.com/mediapipe-models/face_landmarker/face_landmarker/float16/1/face_landmarker.task |
  | motion module | https://github.com/guoyww/AnimateDiff/blob/main/README.md#202309-animatediff-v2 |
  | sd-vae-ft-mse | https://huggingface.co/stabilityai/sd-vae-ft-mse |
  | StableDiffusion V1.5 | https://huggingface.co/stable-diffusion-v1-5/stable-diffusion-v1-5 |
  | wav2vec | https://huggingface.co/facebook/wav2vec2-base-960h |
  | EmoCAST | https://huggingface.co/Jmaomao/EmoCAST |

  These pretrained models should be organized as follows:
  ```text
  ./pretrained_models/
  |-- audio_separator/
  |   |-- download_checks.json
  |   |-- mdx_model_data.json
  |   |-- vr_model_data.json
  |   |-- Kim_Vocal_2.onnx
  |-- face_analysis/
  |   |-- models/
  |       |-- face_landmarker_v2_with_blendshapes.task
  |       |-- 1k3d68.onnx
  |       |-- 2d106det.onnx
  |       |-- genderage.onnx
  |       |-- glintr100.onnx
  |       |-- scrfd_10g_bnkps.onnx
  |-- motion_module/
  |   |-- mm_sd_v15_v2.ckpt
  |-- sd-vae-ft-mse/
  |   |-- config.json
  |   |-- diffusion_pytorch_model.safetensors
  |-- stable-diffusion-v1-5/
  |   |-- unet/
  |       |-- config.json
  |       |-- diffusion_pytorch_model.safetensors
  |-- wav2vec/
  |   |-- wav2vec2-base-960h/
  |       |-- config.json
  |       |-- feature_extractor_config.json
  |       |-- model.safetensors
  |       |-- preprocessor_config.json
  |       |-- special_tokens_map.json
  |       |-- tokenizer_config.json
  |       |-- vocab.json
  |-- emocast/
  |   |-- net.pth
  ```
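  The Hugging Face-hosted entries in the table can be fetched programmatically. Below is a minimal convenience sketch (not part of the official tooling), assuming the repo-to-folder mapping shown in the layout above; the insightface, MediaPipe face landmarker, and AnimateDiff motion-module assets come from the other links in the table and still need to be placed manually.

  ```python
  # Hedged convenience sketch: download the Hugging Face-hosted weights into the
  # layout above and spot-check the expected files. Requires `pip install huggingface_hub`.
  from pathlib import Path
  from huggingface_hub import snapshot_download

  ROOT = Path("./pretrained_models")

  # local subfolder -> Hugging Face repo id (taken from the table above)
  HF_REPOS = {
      "audio_separator": "huangjackson/Kim_Vocal_2",
      "sd-vae-ft-mse": "stabilityai/sd-vae-ft-mse",
      "stable-diffusion-v1-5": "stable-diffusion-v1-5/stable-diffusion-v1-5",  # large; only unet/ is used here
      "wav2vec/wav2vec2-base-960h": "facebook/wav2vec2-base-960h",
      "emocast": "Jmaomao/EmoCAST",
  }

  for subdir, repo_id in HF_REPOS.items():
      snapshot_download(repo_id=repo_id, local_dir=ROOT / subdir)

  # Spot-check a few files from the expected layout (names follow the tree above).
  expected = [
      "audio_separator/Kim_Vocal_2.onnx",
      "face_analysis/models/face_landmarker_v2_with_blendshapes.task",  # manual download
      "motion_module/mm_sd_v15_v2.ckpt",                                # manual download
      "sd-vae-ft-mse/diffusion_pytorch_model.safetensors",
      "stable-diffusion-v1-5/unet/diffusion_pytorch_model.safetensors",
      "wav2vec/wav2vec2-base-960h/model.safetensors",
      "emocast/net.pth",
  ]
  missing = [p for p in expected if not (ROOT / p).exists()]
  print("All expected files found." if not missing else f"Missing: {missing}")
  ```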
- Prepare Inference Data

  Prepare the driving audio, reference image, and emotive text prompt as input (a small preprocessing sketch follows these requirements):
  - The driving audio should be in `.wav` format.
  - The reference image should be cropped to a square, with a forward-facing face as the primary focus.
  - The emotive text prompt should describe a specific talking scene, for example:
    - "The portrait is experiencing chronic illness or pain."
    - "A person is talking with happy emotion."
    - "The portrait is watching a horror movie with jump scares."
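  A small preprocessing sketch (not from the official repo) that puts inputs into the expected form. It assumes ffmpeg is on PATH for the audio conversion and Pillow for the image; the 16 kHz mono setting and 512 px size are assumptions to check against `scripts/inference.py`, and the simple center crop only works when the face is already roughly centered.

  ```python
  # Hedged helper sketch: convert driving audio to .wav and center-crop the
  # reference image to a square. Requires ffmpeg on PATH and Pillow installed.
  import subprocess
  from PIL import Image

  def audio_to_wav(src: str, dst: str = "driving_audio.wav") -> str:
      """Convert any ffmpeg-readable audio file to mono 16 kHz WAV (assumed settings)."""
      subprocess.run(["ffmpeg", "-y", "-i", src, "-ar", "16000", "-ac", "1", dst], check=True)
      return dst

  def center_square_crop(src: str, dst: str = "source_image.png", size: int = 512) -> str:
      """Center-crop to a square and resize; assumes the face is roughly centered."""
      img = Image.open(src).convert("RGB")
      side = min(img.size)
      left, top = (img.width - side) // 2, (img.height - side) // 2
      img.crop((left, top, left + side, top + side)).resize((size, size)).save(dst)
      return dst

  if __name__ == "__main__":
      audio_to_wav("my_audio.mp3")           # hypothetical input file
      center_square_crop("my_portrait.jpg")  # hypothetical input file
  ```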
- Run Inference

  We tested inference on a single RTX 4090 (24 GB).

  To run the inference script, set `--driving_audio` and `--source_image` to the correct paths and provide `--prompt_emo`. The generated videos will be saved to the location given by `--outputs`.

  ```bash
  bash inference.sh
  ```
  For more options, refer to `scripts/inference.py` (a minimal direct-invocation sketch follows).
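  As an alternative to `inference.sh`, the script can be invoked directly. The sketch below only uses the option names mentioned above; the paths are placeholders, and the exact argument names and any additional options should be verified against `scripts/inference.py`.

  ```python
  # Hedged sketch: call the inference script directly with the options named in
  # this README (paths are placeholders; verify argument names in scripts/inference.py).
  import subprocess

  cmd = [
      "python", "scripts/inference.py",
      "--driving_audio", "examples/driving_audio.wav",   # placeholder path
      "--source_image", "examples/source_image.png",     # placeholder path
      "--prompt_emo", "A person is talking with happy emotion.",
  ]
  subprocess.run(cmd, check=True)
  ```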
If you find our work helpful for your research, please cite:
```bibtex
@misc{jiang2025emocastemotionaltalkingportrait,
      title={EmoCAST: Emotional Talking Portrait via Emotive Text Description},
      author={Yiguo Jiang and Xiaodong Cun and Yong Zhang and Yudian Zheng and Fan Tang and Chi-Man Pun},
      year={2025},
      eprint={2508.20615},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2508.20615},
}
```
Thanks to the authors of hallo for their open research and exploration.


