Haomin Zhang, Chang Liu, Junjie Zheng, Zihao Chen, Chaofan Ding, Xinhan Di
AI Lab, Giant Network, and University of Trento
Demo videos: `v2c_1.mp4`, `v2c_2.mp4`
For more results, please visit https://acappemin.github.io/DeepAudio-V1.github.io.
Installation

1. Create a conda environment

```bash
conda create -n v2as python=3.10
conda activate v2as
```
2. Install F5-TTS from the bundled source

```bash
cd ./F5-TTS
pip install -e .
```
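To confirm the editable install landed in the environment, an optional import check (the `f5_tts` package name follows the upstream F5-TTS source layout; adjust if your copy differs):

```bash
# Optional sanity check: the editable install should make f5_tts importable.
python -c "import f5_tts; print('F5-TTS importable from', f5_tts.__file__)"
```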
3. Install the additional requirements

```bash
pip install -r requirements.txt
conda install cudnn
```
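Before moving on, it is worth verifying that the GPU stack is visible from the new environment. A minimal check, assuming PyTorch was pulled in by `requirements.txt`:

```bash
# Sanity check: confirm PyTorch sees the GPU and a cuDNN build is loaded.
python -c "import torch; print(torch.__version__, 'CUDA:', torch.cuda.is_available())"
python -c "import torch; print('cuDNN:', torch.backends.cudnn.version())"
```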
Pretrained models
The models are available at https://huggingface.co/lshzhm/DeepAudio-V1/tree/main. See MODELS.md for more details.
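One way to fetch the checkpoints is the Hugging Face CLI; a sketch, assuming you want them under a local `./ckpts` directory (the inference scripts may expect a different layout, so check MODELS.md first):

```bash
# Download the DeepAudio-V1 checkpoints from the Hugging Face Hub.
# The ./ckpts target directory is an assumption; see MODELS.md for
# where the inference scripts actually look for weights.
pip install "huggingface_hub[cli]"
huggingface-cli download lshzhm/DeepAudio-V1 --local-dir ./ckpts
```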
Inference

1. Video-to-audio (V2A) inference

```bash
bash v2a.sh
```

2. Video-to-speech (V2S) inference

```bash
bash v2s.sh
```

3. Text-to-speech (TTS) inference

```bash
bash tts.sh
```
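Since step 2 installs F5-TTS in editable mode, the upstream `f5-tts_infer-cli` entry point is also available for quick standalone TTS checks outside `tts.sh`. A hedged sketch (the reference audio, transcripts, and model name below are placeholders, and flag values vary across F5-TTS versions; see `f5-tts_infer-cli --help`):

```bash
# Standalone F5-TTS inference, independent of the repo's tts.sh.
# ref.wav and both transcripts are placeholders, not repo assets.
f5-tts_infer-cli \
  --model F5-TTS \
  --ref_audio ./demo/ref.wav \
  --ref_text "Transcript of the reference audio." \
  --gen_text "Text to synthesize in the reference speaker's voice."
```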
Evaluation

```bash
bash eval_v2c.sh
```
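For a quick spot check of the WER metric outside `eval_v2c.sh`, the openai-whisper CLI plus `jiwer` can score a single clip. A sketch, assuming the generated speech sits at `./results/sample.wav` and its reference transcript at `./ref.txt` (both paths are placeholders; the bundled script runs its own pipeline):

```bash
# Transcribe one generated clip with Whisper, then compute WER with jiwer.
# Paths and model size are placeholders, not repo defaults.
pip install openai-whisper jiwer
whisper ./results/sample.wav --model base --language en --output_format txt --output_dir .
python -c "import jiwer; print('WER:', jiwer.wer(open('ref.txt').read(), open('sample.txt').read()))"
```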
Acknowledgements

- MMAudio for the video-to-audio backbone and pretrained models
- F5-TTS for the text-to-speech and video-to-speech backbone
- V2C for the animated movie benchmark
- Wav2Vec2-Emotion for emotion recognition in the EMO-SIM evaluation
- WavLM-SV for speaker verification in the SPK-SIM evaluation
- Whisper for speech recognition in the WER evaluation