This repository is the official implementation of "Meta-TTS: Meta-Learning for Few-shot Speaker-Adaptive Text-to-Speech".
Here is how I built my environment; you do not need to replicate it exactly:
- Sign up for Comet.ml, find your workspace and API key via www.comet.ml/api/my/settings, and fill them into `.comet.config`. The Comet logger is used throughout the train/val/test stages.
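A minimal `.comet.config` sketch (the values below are placeholders, and the key names follow Comet's standard INI-style config file; double-check them against your Comet version):

```ini
[comet]
# Placeholder values; replace with your own API key and workspace.
api_key = YOUR_API_KEY
workspace = YOUR_WORKSPACE
```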
- [Optional] Install pyenv for Python version control, and switch to Python 3.8.6.
```bash
# After downloading and installing pyenv:
pyenv install 3.8.6
pyenv local 3.8.6
```
- [Optional] Install pyenv-virtualenv as a pyenv plugin for a clean virtual environment.
```bash
# After installing pyenv-virtualenv:
pyenv virtualenv meta-tts
pyenv activate meta-tts
```
- Install learn2learn from source.
```bash
# Install Cython first:
pip install cython
# Then install learn2learn from source:
git clone https://github.com/learnables/learn2learn.git
cd learn2learn
pip install -e .
```
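Optionally, you can sanity-check that the editable install is the one being picked up (this simply prints the location of the imported package):

```bash
python -c "import learn2learn as l2l; print(l2l.__file__)"
```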
- Install requirements:
```bash
pip install -r requirements.txt
```
First, download LibriTTS and VCTK, then change the dataset paths in `config/LibriTTS/preprocess.yaml` and `config/VCTK/preprocess.yaml` to point to your local copies.
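As a rough sketch of what to edit (the key names here are assumptions based on typical FastSpeech2-style configs; follow the actual keys in your copy of `preprocess.yaml`):

```yaml
dataset: "LibriTTS"
path:
  corpus_path: "/path/to/LibriTTS"                    # point this at the downloaded corpus
  raw_path: "./raw_data/LibriTTS"                     # assumed intermediate directory
  preprocessed_path: "./preprocessed_data/LibriTTS"   # matches the paths used below
```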
Then run

```bash
python3 prepare_align.py config/LibriTTS/preprocess.yaml
python3 prepare_align.py config/VCTK/preprocess.yaml
```

to prepare the data for alignment.
The alignments of LibriTTS are provided here, and the alignments of VCTK are provided here.
You have to unzip the files into `preprocessed_data/LibriTTS/TextGrid/` and `preprocessed_data/VCTK/TextGrid/`.
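For example (the archive names below are hypothetical; substitute the files you actually downloaded):

```bash
mkdir -p preprocessed_data/LibriTTS/TextGrid/ preprocessed_data/VCTK/TextGrid/
# Hypothetical archive names; use whatever the downloaded alignment archives are called.
unzip LibriTTS_TextGrid.zip -d preprocessed_data/LibriTTS/TextGrid/
unzip VCTK_TextGrid.zip -d preprocessed_data/VCTK/TextGrid/
```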
Then run the preprocessing script:
```bash
python3 preprocess.py config/LibriTTS/preprocess.yaml
# Copy stats from LibriTTS to VCTK so that pitch/energy normalization uses the same shift and bias.
cp preprocessed_data/LibriTTS/stats.json preprocessed_data/VCTK/
python3 preprocess.py config/VCTK/preprocess.yaml
```
To train the model(s) in the paper, run this command:
```bash
python3 train.py -a <algorithm>
```
Available algorithms:
- `base_emb_vad`, `base_emb_va`, `base_emb_d`, `base_emb`
  - Baseline with embedding table.
- `meta_emb_vad`, `meta_emb_va`, `meta_emb_d`, `meta_emb`
  - Meta-TTS with embedding table.
- `base_emb1_vad`, `base_emb1_va`, `base_emb1_d`, `base_emb1`
  - Baseline with shared embedding.
- `meta_emb1_vad`, `meta_emb1_va`, `meta_emb1_d`, `meta_emb1`
  - Meta-TTS with shared embedding.

The suffix indicates which modules are fine-tuned:
- `*_vad`: fine-tune embedding + variance adaptor + decoder
- `*_va`: fine-tune embedding + variance adaptor
- `*_d`: fine-tune embedding + decoder
- no suffix: fine-tune embedding only
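For example, to train Meta-TTS with an embedding table, fine-tuning the embedding, variance adaptor, and decoder:

```bash
python3 train.py -a meta_emb_vad
```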