This repository is the official implementation of "Meta-TTS: Meta-Learning for Few-shot Speaker Adaptive Text-to-Speech".
[Figures: multi-task learning (left) and meta learning (right).]
This is how I set up my environment; yours does not need to match it exactly:
- Sign up for Comet.ml, find your workspace and API key via www.comet.ml/api/my/settings, and fill them in .comet.config. The Comet logger is used throughout the train/val/test stages.
  - Check my training logs here.
- [Optional] Install pyenv for Python version control and switch to Python 3.8.6.
# After downloading and installing pyenv:
pyenv install 3.8.6
pyenv local 3.8.6
- [Optional] Install pyenv-virtualenv as a pyenv plugin to get a clean virtual environment.
# After installing pyenv-virtualenv:
pyenv virtualenv meta-tts
pyenv activate meta-tts
- Install learn2learn from source.
# Install Cython first:
pip install cython
# Then install learn2learn from source:
git clone https://github.com/learnables/learn2learn.git
cd learn2learn
pip install -e .
- Install requirements:
pip install -r requirements.txt
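The .comet.config step above can also be scripted. This is a minimal sketch, assuming the file uses an INI layout with a `[comet]` section and `api_key`, `workspace`, and `project_name` keys; check Comet's documentation for the exact format. The helper name `write_comet_config` and the placeholder values are hypothetical and not part of this repository.

```python
import configparser
from pathlib import Path


def write_comet_config(path, api_key, workspace, project_name):
    """Write a Comet config file (assumed INI layout with a [comet] section)."""
    config = configparser.ConfigParser()
    config["comet"] = {
        "api_key": api_key,
        "workspace": workspace,
        "project_name": project_name,
    }
    path = Path(path)
    with path.open("w") as f:
        config.write(f)
    return path


# Example call (placeholder values, replace with your own settings):
# write_comet_config(".comet.config", "YOUR_API_KEY", "your-workspace", "meta-tts")
```

Keep the generated file out of version control, since it contains your API key.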
First, download LibriTTS and VCTK, change the paths in config/LibriTTS/preprocess.yaml and config/VCTK/preprocess.yaml, then run the following to prepare the corpora:
python3 prepare_align.py config/LibriTTS/preprocess.yaml
python3 prepare_align.py config/VCTK/preprocess.yaml
Alignments for LibriTTS are provided here, and alignments for VCTK are provided here.
Unzip the files into preprocessed_data/LibriTTS/TextGrid/ and preprocessed_data/VCTK/TextGrid/, respectively.
Then run the preprocessing script:
python3 preprocess.py config/LibriTTS/preprocess.yaml
# Copy stats from LibriTTS to VCTK so that pitch/energy normalization uses the same shift and bias for both corpora.
cp preprocessed_data/LibriTTS/stats.json preprocessed_data/VCTK/
python3 preprocess.py config/VCTK/preprocess.yaml
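Before running preprocess.py, it can help to confirm that the alignment archives were unzipped where the steps above expect them. This is a minimal sketch, assuming the alignments are files with a `.TextGrid` extension located anywhere under the TextGrid/ directories named above; `missing_alignments` is a hypothetical helper, not part of this repository.

```python
from pathlib import Path


def missing_alignments(root="preprocessed_data", corpora=("LibriTTS", "VCTK")):
    """Return the corpora whose TextGrid directory is absent or has no .TextGrid files."""
    missing = []
    for corpus in corpora:
        textgrid_dir = Path(root) / corpus / "TextGrid"
        # rglob also finds files nested in subdirectories (e.g. one per speaker).
        has_files = textgrid_dir.is_dir() and any(textgrid_dir.rglob("*.TextGrid"))
        if not has_files:
            missing.append(corpus)
    return missing
```

An empty list means both corpora have at least one alignment file in place; it does not check that every utterance is covered.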
To train the models in the paper, run this command:
python3 train.py -a <algorithm>
Available algorithms:
- base_emb_vad / base_emb_va / base_emb_d / base_emb
  - Baseline with embedding table.
- meta_emb_vad / meta_emb_va / meta_emb_d / meta_emb
  - Meta-TTS with embedding table.
- base_emb1_vad / base_emb1_va / base_emb1_d / base_emb1
  - Baseline with shared embedding.
- meta_emb1_vad / meta_emb1_va / meta_emb1_d / meta_emb1
  - Meta-TTS with shared embedding.
Note:
- *_vad: fine-tune embedding + variance adaptor + decoder
- *_va: fine-tune embedding + variance adaptor
- *_d: fine-tune embedding + decoder
- without *_vad/*_va/*_d: fine-tune embedding only
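The naming scheme above can be decoded mechanically. This is a minimal sketch that only restates the lists above; `describe_algorithm` and `FINE_TUNE_SUFFIXES` are hypothetical names for illustration and do not reflect how train.py actually parses the `-a` argument.

```python
# Suffix -> modules fine-tuned during adaptation, as listed in the note above.
FINE_TUNE_SUFFIXES = {
    "vad": ("embedding", "variance adaptor", "decoder"),
    "va": ("embedding", "variance adaptor"),
    "d": ("embedding", "decoder"),
}


def describe_algorithm(name):
    """Split an algorithm name such as 'meta_emb1_vad' into its three parts.

    Raises ValueError for names with an unexpected number of parts and
    KeyError for unknown parts.
    """
    parts = name.split("_")
    if len(parts) not in (2, 3):
        raise ValueError(f"unexpected algorithm name: {name!r}")
    training = {"base": "baseline", "meta": "Meta-TTS"}[parts[0]]
    embedding = {"emb": "embedding table", "emb1": "shared embedding"}[parts[1]]
    # No suffix means only the embedding is fine-tuned.
    modules = FINE_TUNE_SUFFIXES[parts[2]] if len(parts) == 3 else ("embedding",)
    return {"training": training, "embedding": embedding, "fine_tuned": modules}
```

For example, `describe_algorithm("base_emb")` describes the baseline with an embedding table where only the embedding is fine-tuned.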
Please use 8 V100 GPUs for meta models and 1 V100 GPU for baseline models; otherwise you may need to tune the gradient accumulation step setting (grad_acc_step) in config/*/train.yaml to get the correct meta batch size.
Note that each GPU has its own random seed, so even when the meta batch size is the same, using a different number of GPUs is equivalent to using a different random seed.
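As a worked example of the grad_acc_step adjustment: this is a minimal sketch, assuming the meta batch size equals the number of GPUs times grad_acc_step (each GPU contributing one task per step). That relation is an assumption made for illustration and is not confirmed here, so check config/*/train.yaml and the training code before relying on it. `grad_acc_step_for` is a hypothetical helper.

```python
def grad_acc_step_for(target_meta_batch_size, num_gpus):
    """Return grad_acc_step under the ASSUMED relation
    meta_batch_size = num_gpus * grad_acc_step."""
    if num_gpus < 1:
        raise ValueError("num_gpus must be at least 1")
    if target_meta_batch_size % num_gpus != 0:
        raise ValueError(
            "target meta batch size is not divisible by the number of GPUs"
        )
    return target_meta_batch_size // num_gpus
```

Under this assumption, reaching a hypothetical meta batch size of 8 on 2 GPUs would call for `grad_acc_step_for(8, 2)`, that is, 4 accumulation steps.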
After training, you can find your checkpoints under output/ckpt/LibriTTS/<project_name>/<experiment_key>/checkpoints/, where the project name is set in .comet.config.
To run inference with the models:
# LibriTTS
python3 test.py -a <algorithm> -e <experiment_key> -c <checkpoint_file_name>
# VCTK
python3 test.py -p config/VCTK/preprocess.yaml -t config/VCTK/train.yaml -m config/VCTK/model.yaml \
-a <algorithm> -e <experiment_key> -c <checkpoint_file_name>
The results will be saved under output/result/<corpus>/<experiment_key>/<algorithm>/.
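When testing several algorithms or checkpoints, the two commands above can be assembled programmatically. This is a minimal sketch that only reuses the flags and config paths shown above; `build_test_command` is a hypothetical helper, not part of this repository.

```python
def build_test_command(algorithm, experiment_key, checkpoint, corpus="LibriTTS"):
    """Return the argv list for test.py, mirroring the two commands above."""
    cmd = ["python3", "test.py"]
    if corpus == "VCTK":
        # VCTK needs its own preprocess/train/model configs.
        cmd += [
            "-p", "config/VCTK/preprocess.yaml",
            "-t", "config/VCTK/train.yaml",
            "-m", "config/VCTK/model.yaml",
        ]
    elif corpus != "LibriTTS":
        raise ValueError(f"unsupported corpus: {corpus!r}")
    cmd += ["-a", algorithm, "-e", experiment_key, "-c", checkpoint]
    return cmd
```

The returned list can be passed to `subprocess.run(cmd, check=True)` from the repository root.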
cd evaluation/ and check README.md.
Since our code uses the Comet logger, you might need to create a dummy experiment by running:
from comet_ml import Experiment
experiment = Experiment()
Then put the checkpoint files under output/ckpt/LibriTTS/<project_name>/<experiment_key>/checkpoints/.
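Placing the checkpoint files can be scripted as well. This is a minimal sketch, assuming you already know the experiment key of the dummy experiment (it should be readable via `experiment.get_key()`, but treat that call as an assumption and verify it against the Comet documentation). `install_checkpoints` is a hypothetical helper, not part of this repository.

```python
import shutil
from pathlib import Path


def install_checkpoints(files, project_name, experiment_key,
                        root="output/ckpt/LibriTTS"):
    """Copy checkpoint files into <root>/<project_name>/<experiment_key>/checkpoints/."""
    target = Path(root) / project_name / experiment_key / "checkpoints"
    target.mkdir(parents=True, exist_ok=True)
    # shutil.copy2 returns the path of the newly created file.
    return [Path(shutil.copy2(f, target)) for f in files]
```

The project name passed here should match the one set in .comet.config so that test.py can find the files.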
You can download pretrained models here.
Speaker verification: