Yiguo Jiang¹ · Xiaodong Cun📪² · Yong Zhang³ · Yudian Zheng¹ · Fan Tang⁴ · Chi-Man Pun📪¹
¹ University of Macau · ² GVC Lab, Great Bay University · ³ Meituan · ⁴ ICT-CAS
We introduce EmoCAST, a novel diffusion-based emotional talking-head framework that animates an in-the-wild portrait from a single reference image, driving audio, and an emotive text description.
- Create conda environment

  ```bash
  conda create -n emocast python=3.10
  conda activate emocast
  ```
- Install packages with pip

  ```bash
  pip install -r requirements.txt
  pip install .
  ```
- Download Pretrained Models

  Download the models below into the `./pretrained_models/` folder (a convenience download sketch follows the directory layout below).

  | Model | Download Link |
  | --- | --- |
  | audio_separator | https://huggingface.co/huangjackson/Kim_Vocal_2 |
  | insightface | https://github.com/deepinsight/insightface/tree/master/python-package#model-zoo |
  | face landmarker | https://storage.googleapis.com/mediapipe-models/face_landmarker/face_landmarker/float16/1/face_landmarker.task |
  | motion module | https://github.com/guoyww/AnimateDiff/blob/main/README.md#202309-animatediff-v2 |
  | sd-vae-ft-mse | https://huggingface.co/stabilityai/sd-vae-ft-mse |
  | StableDiffusion V1.5 | https://huggingface.co/stable-diffusion-v1-5/stable-diffusion-v1-5 |
  | wav2vec | https://huggingface.co/facebook/wav2vec2-base-960h |
  | EmoCAST | https://huggingface.co/Jmaomao/EmoCAST |

  These pretrained models should be organized as follows:
  ```text
  ./pretrained_models/
  |-- audio_separator/
  |   |-- download_checks.json
  |   |-- mdx_model_data.json
  |   |-- vr_model_data.json
  |   |-- Kim_Vocal_2.onnx
  |-- face_analysis/
  |   |-- models/
  |       |-- face_landmarker_v2_with_blendshapes.task
  |       |-- 1k3d68.onnx
  |       |-- 2d106det.onnx
  |       |-- genderage.onnx
  |       |-- glintr100.onnx
  |       |-- scrfd_10g_bnkps.onnx
  |-- motion_module/
  |   |-- mm_sd_v15_v2.ckpt
  |-- sd-vae-ft-mse/
  |   |-- config.json
  |   |-- diffusion_pytorch_model.safetensors
  |-- stable-diffusion-v1-5/
  |   |-- unet/
  |       |-- config.json
  |       |-- diffusion_pytorch_model.safetensors
  |-- wav2vec/
  |   |-- wav2vec2-base-960h/
  |       |-- config.json
  |       |-- feature_extractor_config.json
  |       |-- model.safetensors
  |       |-- preprocessor_config.json
  |       |-- special_tokens_map.json
  |       |-- tokenizer_config.json
  |       |-- vocab.json
  |-- emocast/
  |   |-- net.pth
  ```
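  The Hugging Face-hosted entries in the table can be fetched programmatically. Below is a minimal convenience sketch (not part of the official tooling), assuming the repo-to-folder mapping shown in the layout above; the insightface, MediaPipe face landmarker, and AnimateDiff motion-module assets come from the other links in the table and still need to be placed manually.

  ```python
  # Hedged convenience sketch: download the Hugging Face-hosted weights into the
  # layout above and spot-check the expected files. Requires `pip install huggingface_hub`.
  from pathlib import Path
  from huggingface_hub import snapshot_download

  ROOT = Path("./pretrained_models")

  # local subfolder -> Hugging Face repo id (taken from the table above)
  HF_REPOS = {
      "audio_separator": "huangjackson/Kim_Vocal_2",
      "sd-vae-ft-mse": "stabilityai/sd-vae-ft-mse",
      "stable-diffusion-v1-5": "stable-diffusion-v1-5/stable-diffusion-v1-5",  # large; only unet/ is used here
      "wav2vec/wav2vec2-base-960h": "facebook/wav2vec2-base-960h",
      "emocast": "Jmaomao/EmoCAST",
  }

  for subdir, repo_id in HF_REPOS.items():
      snapshot_download(repo_id=repo_id, local_dir=ROOT / subdir)

  # Spot-check a few files from the expected layout (names follow the tree above).
  expected = [
      "audio_separator/Kim_Vocal_2.onnx",
      "face_analysis/models/face_landmarker_v2_with_blendshapes.task",  # manual download
      "motion_module/mm_sd_v15_v2.ckpt",                                # manual download
      "sd-vae-ft-mse/diffusion_pytorch_model.safetensors",
      "stable-diffusion-v1-5/unet/diffusion_pytorch_model.safetensors",
      "wav2vec/wav2vec2-base-960h/model.safetensors",
      "emocast/net.pth",
  ]
  missing = [p for p in expected if not (ROOT / p).exists()]
  print("All expected files found." if not missing else f"Missing: {missing}")
  ```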
- Prepare Inference Data

  Prepare the driving audio, reference image, and emotive text prompt as input (a small preprocessing sketch follows these requirements):
  - The driving audio should be in `.wav` format.
  - The reference image should be cropped to a square, with a forward-facing face as the primary focus.
  - The emotive text prompt should describe a specific talking scene, for example:
    - "The portrait is experiencing chronic illness or pain."
    - "A person is talking with happy emotion."
    - "The portrait is watching a horror movie with jump scares."
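  A small preprocessing sketch (not from the official repo) that puts inputs into the expected form. It assumes ffmpeg is on PATH for the audio conversion and Pillow for the image; the 16 kHz mono setting and 512 px size are assumptions to check against `scripts/inference.py`, and the simple center crop only works when the face is already roughly centered.

  ```python
  # Hedged helper sketch: convert driving audio to .wav and center-crop the
  # reference image to a square. Requires ffmpeg on PATH and Pillow installed.
  import subprocess
  from PIL import Image

  def audio_to_wav(src: str, dst: str = "driving_audio.wav") -> str:
      """Convert any ffmpeg-readable audio file to mono 16 kHz WAV (assumed settings)."""
      subprocess.run(["ffmpeg", "-y", "-i", src, "-ar", "16000", "-ac", "1", dst], check=True)
      return dst

  def center_square_crop(src: str, dst: str = "source_image.png", size: int = 512) -> str:
      """Center-crop to a square and resize; assumes the face is roughly centered."""
      img = Image.open(src).convert("RGB")
      side = min(img.size)
      left, top = (img.width - side) // 2, (img.height - side) // 2
      img.crop((left, top, left + side, top + side)).resize((size, size)).save(dst)
      return dst

  if __name__ == "__main__":
      audio_to_wav("my_audio.mp3")           # hypothetical input file
      center_square_crop("my_portrait.jpg")  # hypothetical input file
  ```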
- Run Inference

  We tested inference on a single RTX 4090 (24 GB).

  To run the inference script, set `--driving_audio` and `--source_image` to the correct paths and provide `--prompt_emo`. The generated videos will be saved to the location given by `--outputs`.

  ```bash
  bash inference.sh
  ```
  For more options, refer to `scripts/inference.py` (a minimal direct-invocation sketch follows).
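  As an alternative to `inference.sh`, the script can be invoked directly. The sketch below only uses the option names mentioned above; the paths are placeholders, and the exact argument names and any additional options should be verified against `scripts/inference.py`.

  ```python
  # Hedged sketch: call the inference script directly with the options named in
  # this README (paths are placeholders; verify argument names in scripts/inference.py).
  import subprocess

  cmd = [
      "python", "scripts/inference.py",
      "--driving_audio", "examples/driving_audio.wav",   # placeholder path
      "--source_image", "examples/source_image.png",     # placeholder path
      "--prompt_emo", "A person is talking with happy emotion.",
  ]
  subprocess.run(cmd, check=True)
  ```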
If you find our work helpful for your research, please cite:
```bibtex
@misc{jiang2025emocastemotionaltalkingportrait,
      title={EmoCAST: Emotional Talking Portrait via Emotive Text Description},
      author={Yiguo Jiang and Xiaodong Cun and Yong Zhang and Yudian Zheng and Fan Tang and Chi-Man Pun},
      year={2025},
      eprint={2508.20615},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2508.20615},
}
```
Thanks to the authors of hallo for their open research and exploration.


