DeepAudio-V1: Towards Multi-Modal Multi-Stage End-to-End Video to Speech and Audio Generation

Haomin Zhang, Chang Liu, Junjie Zheng, Zihao Chen, Chaofan Ding, Xinhan Di

AI Lab, Giant Network, and University of Trento

Results

Demo videos: v2c_1.mp4 and v2c_2.mp4

For more results, please visit https://acappemin.github.io/DeepAudio-V1.github.io.

Installation

1. Create a conda environment

conda create -n v2as python=3.10
conda activate v2as

2. Install the F5-TTS base package

cd ./F5-TTS
pip install -e .

3. Additional requirements

pip install -r requirements.txt
conda install cudnn
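
After installation, a quick sanity check can confirm that PyTorch sees the GPU and cuDNN from inside the v2as environment. This is a minimal sketch; it assumes PyTorch was installed by the F5-TTS install and requirements.txt, which this repo does not state explicitly.

import torch

# Minimal environment sanity check (assumes PyTorch is present in the v2as env)
print("torch version:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())
print("cuDNN available:", torch.backends.cudnn.is_available())
if torch.cuda.is_available():
    print("GPU:", torch.cuda.get_device_name(0))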

Pretrained models

The models are available at https://huggingface.co/lshzhm/DeepAudio-V1/tree/main. See MODELS.md for more details.
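
If you prefer to fetch the checkpoints programmatically rather than through the Hugging Face web UI, a minimal sketch using the huggingface_hub package is shown below; the target directory name is an illustrative choice, not part of this repo's documented workflow.

from huggingface_hub import snapshot_download

# Download all files from the DeepAudio-V1 model repo to a local folder
# ("./pretrained" is an arbitrary, illustrative destination).
local_dir = snapshot_download(
    repo_id="lshzhm/DeepAudio-V1",
    local_dir="./pretrained",
)
print("Models downloaded to:", local_dir)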

Inference

1. V2A (video-to-audio) inference

bash v2a.sh

2. V2S (video-to-speech) inference

bash v2s.sh

3. TTS (text-to-speech) inference

bash tts.sh

Evaluation

bash eval_v2c.sh

Acknowledgement

  • MMAudio for the video-to-audio backbone and pretrained models
  • F5-TTS for the text-to-speech and video-to-speech backbone
  • V2C for the animated movie benchmark
  • Wav2Vec2-Emotion for emotion recognition in the EMO-SIM evaluation
  • WavLM-SV for speaker verification in the SPK-SIM evaluation
  • Whisper for speech recognition in the WER evaluation (see the sketch after this list)
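
For reference, WER evaluation of this kind typically transcribes the generated speech with Whisper and compares the transcript against the ground-truth text. The sketch below illustrates that idea using the openai-whisper and jiwer packages; it is not the exact procedure implemented by eval_v2c.sh, and the file path and reference text are placeholders.

import whisper
import jiwer

# Transcribe a generated speech clip with Whisper, then compute the word error
# rate against the reference transcript (both inputs are placeholders).
model = whisper.load_model("base")
hypothesis = model.transcribe("generated_speech.wav")["text"]
reference = "the reference transcript for this clip"
print("WER:", jiwer.wer(reference.lower(), hypothesis.lower()))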
