This paper has been accepted by ICML 2024. We plan to release the checkpoints (we name it as Open-UniAudio) in the next few days. Our plans includes:
- 2024.7.22: We first release a pre-trained checkpoint of UniAudio. This version is build on original audio codec https://github.com/yangdongchao/UniAudio/tree/main/codec . Compared with the training dataset in original paper, we further inclease the music dataset. In our paper, we only use the MSD dataset for music, and we find the performance is poor than other advanced system, thus we add DISCO-100k dataset into the training. The details of this version as follows: (a) Training dataset: speech+sound+music (b) Model size: n_layer: 16 n_head: 12 n_embd: 768 (c) Vocabulary size: audio tokens: 3*1024; special tokens: 128; phone tokens: 306; Semantic tokens: 512; singing tokens: about 200
wget https://huggingface.co/Dongchao/UniAudio/resolve/main/uniaudio_v1.zip
- Provides all of task recipes
- We scale the audio token vocabulary into 60k+.
- Provides the checkpoints of pre-training UniAudio, compared to paper's reported training data, we plan add more sound and music data to train the open UniAudio.
If you are interested in Open-UniAudio, welcome to contact us!
Demo page: http://dongchaoyang.top/UniAudio_demo/
paper: https://arxiv.org/pdf/2310.00704.pdf
UniAudio is a universal audio generation model, which can solve a lot of audio generation task with one model, such as TTS, VC, Singing voice synthesis, speech enhancement, speech extraction, text-to-sound, text-to-music, speech edit, audio edit, instructed TTS, and speech dereverberation. In the following, we will give the details of UniAudio. UniAudio supports user to define any task, we expect more researcher can join us.
- Neural Audio Codec Models: The whole training code for neural audio codec models
- Top-level Design: How to design a framework that support users to define their tasks.
- Training own UniAudio for any task with your own dataset: The details of training process
The overview of UniAudio as following picture shows.
The more details will be updatad soon.
Please refer to codec folder to find the training codec of Neural Audio Codec.
The framework of UniAudio is very simple and useful. It includes 4 steps: (1) define your task. (2) prepare data. (3) tokenize data and save it as .pth file. (4) Training and inference
We give an example of UniAudio to train TTS in LibriTTS
Firstly, create environment for UniAudio.
conda create -n uniaudio python=3.8
conda init
source ~/.bashrc
conda activate uniaudio
Then:
cd UniAudio
bash requirements.sh
bash UniAudio/download.sh
You can define the task format in the utils/task_definition.py
tts_format = {
'keys': ["phone_seq", "prompt_seq", "audio_seq"],
'type': ["phone", "audio_prompt", "audio"],
'features': [],
'loss_key': 'audio_seq',
}
After that, the dataloader of UniAudio will combine the phone_seq and audio_seq as one single sequence based on your definition, so that LM can process it.
We give an example to prepare data with LibriTTS. Firstly, you should download the dataset.
wget https://www.openslr.org/resources/60/train-clean-100.tar.gz
wget https://www.openslr.org/resources/60/train-clean-360.tar.gz
wget https://www.openslr.org/resources/60/test-clean.tar.gz # for evaluation
wget https://www.openslr.org/resources/60/train-other-500.tar.gz
tar -zxvf train-clean-100.tar.gz
tar -zxvf train-clean-360.tar.gz
tar -zxvf test-clean.tar.gz
tar -zxvf train-other-500.tar.gz
Secondly, please prepare the wav.scp
file based on your data path, see UniAudio/egs/TTS/data/train/wav.scp
find the format of wav.scp.
Thirdly, you should prepare the phone.scp file, an example can be found in UniAudio/egs/TTS/data/train/phone.scp
You can use UniAudio/tools/tokenizer/phone/text_tokenizer.py
to get the phone.scp.
Third, prepare utt2spk file, see an example in UniAudio/egs/TTS/data/train/utt2spk
After that, put your wav.scp, phone.scp and utt2spk
into UniAudio/egs/TTS/data/train folder
, similarly, you also should prepare the val set in UniAudio/egs/TTS/data/val
folder
Lastly, tokenize the audio and phone into tokens. Note that, you should set the number of GPUs, you want to use.
cd UniAudio/egs/TTS
bash run.sh --stage 2 --stop_stage 4
The details of annotation can be found in run.sh
cd UniAudio/egs/TTS
bash run.sh --stage 5 --stop_stage 5
cd UniAudio/egs/TTS
bash run.sh --stage 6 --stop_stage 6
In previous section, we give the details of how to train a TTS task with UniAudio. In the future, we will try to give more tasks's details, including:
- TTS in LibriLight
- VC in Librilight
- SE
- TSE
- Text-to-sound and music
- Sing Voice Generation
Excepting the 11 tasks presented in the paper, UniAudio also can other tasks, e.g. ASR, speaker verification, audio classification, and so on. But the performance of these tasks still has gap with other SOTAs.
It is easy to train all of tasks using UniAudio, when differnt tasks of data has ready.
- The first version of UniAudio, give an example to train TTS task in LibriTTS dataset.
- Add more task into egs
- Give more comprehensive documentation for UniAudio
- Combine all of task
- Give scripts to simulate speech enhancement and target speaker extraction data.
- Give more detailed information for evaluation scripts.
If you find this code useful in your research, please cite our work and give us a star
@article{yang2023uniaudio,
title={UniAudio: An Audio Foundation Model Toward Universal Audio Generation},
author={Dongchao Yang, Jinchuan Tian, Xu Tan, Rongjie Huang, Songxiang Liu, Xuankai Chang, Jiatong Shi, Sheng Zhao, Jiang Bian, Xixin Wu, Zhou Zhao, Helen Meng},
journal={arXiv preprint arXiv:2310.00704},
year={2023}
}
If you have any problem about the UniAudio code, please contact Dongchao (dcyang@se.cuhk.edu.hk) or Jinchuan (jinchuat@andrew.cmu.edu).
All of code is finished by Dongchao and Jinchuan. You can use the code under MIT license.