*Forked from [CorentinJ/Real-Time-Voice-Cloning](https://github.com/CorentinJ/Real-Time-Voice-Cloning).*
# Real-Time Voice Cloning
### Papers implemented
| URL | Designation | Title | Implementation source |
| --- | ----------- | ----- | --------------------- |
| [1802.08435](https://arxiv.org/pdf/1802.08435v1.pdf) | WaveRNN | Efficient Neural Audio Synthesis | [fatchord/WaveRNN](https://github.com/fatchord/WaveRNN) |
| [**1806.04558**](https://arxiv.org/pdf/1806.04558.pdf) | **SV2TTS** | **Transfer Learning from Speaker Verification to Multispeaker Text-To-Speech Synthesis** | This repo |
| [1712.05884](https://arxiv.org/pdf/1712.05884.pdf) | Tacotron 2 | Natural TTS Synthesis by Conditioning WaveNet on Mel Spectrogram Predictions | [Rayhane-mamah/Tacotron-2](https://github.com/Rayhane-mamah/Tacotron-2) |
| [1710.10467](https://arxiv.org/pdf/1710.10467.pdf) | GE2E | Generalized End-to-End Loss for Speaker Verification | This repo |
### Related papers
| URL | Designation | Title |
| --- | ----------- | ----- |
| [1808.10128](https://arxiv.org/pdf/1808.10128.pdf) | SST4TTS | Semi-Supervised Training for Improving Data Efficiency in End-to-End Speech Synthesis |
| [1710.07654](https://arxiv.org/pdf/1710.07654.pdf) | Deep Voice 3 | Deep Voice 3: Scaling Text-to-Speech with Convolutional Sequence Learning |
| [1705.08947](https://arxiv.org/pdf/1705.08947.pdf) | Deep Voice 2 | Deep Voice 2: Multi-Speaker Neural Text-to-Speech |
| [1703.10135](https://arxiv.org/pdf/1703.10135.pdf) | Tacotron | Tacotron: Towards End-to-End Speech Synthesis |
| [1702.07825](https://arxiv.org/pdf/1702.07825.pdf) | Deep Voice 1 | Deep Voice: Real-time Neural Text-to-Speech |
| [1609.03499](https://arxiv.org/pdf/1609.03499.pdf) | WaveNet | WaveNet: A Generative Model for Raw Audio |
| [1506.07503](https://arxiv.org/pdf/1506.07503.pdf) | Attention | Attention-Based Models for Speech Recognition |
### Datasets and preprocessing
Ideally, you want to keep all your datasets under the same root directory. By default, all preprocessing scripts output the cleaned data to a new directory, `SV2TTS`, created in your datasets root directory. Inside it, a directory is created for each model: the encoder, the synthesizer and the vocoder (a sketch of the resulting layout follows the dataset lists below).

You will need the following datasets:

For the encoder:
- **[LibriSpeech](http://www.openslr.org/12/):** train-other-500 (extract as `LibriSpeech/train-other-500`)
- **[VoxCeleb1](http://www.robots.ox.ac.uk/~vgg/data/voxceleb/vox1.html):** Dev A - D as well as the metadata file (extract as `VoxCeleb1/wav` and `VoxCeleb1/vox1_meta.csv`)
- **[VoxCeleb2](http://www.robots.ox.ac.uk/~vgg/data/voxceleb/vox2.html):** Dev A - H (extract as `VoxCeleb2/dev`)

For the synthesizer and the vocoder:
- **[LibriSpeech](http://www.openslr.org/12/):** train-clean-100 and train-clean-360 (extract as `LibriSpeech/train-clean-100` and `LibriSpeech/train-clean-360`)

Feel free to adapt the code to your needs. Other interesting datasets you could use:
- **[VCTK](https://homepages.inf.ed.ac.uk/jyamagis/page3/page58/page58.html)**, used in the SV2TTS paper
- **[M-AILABS](https://www.caito.de/2019/01/the-m-ailabs-speech-dataset/)**
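
For reference, here is the layout these instructions aim for. The exact names of the per-model subdirectories under `SV2TTS` are illustrative; the text above only says that one directory is created per model:

```
<datasets_root>
├── LibriSpeech
│   ├── train-clean-100
│   ├── train-clean-360
│   └── train-other-500
├── VoxCeleb1
│   ├── wav
│   └── vox1_meta.csv
├── VoxCeleb2
│   └── dev
└── SV2TTS            # created by the preprocessing scripts
    ├── encoder
    ├── synthesizer
    └── vocoder
```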
### Known issues
- There is no noise removal algorithm implemented to clean the data for the synthesizer and the vocoder. I've found Audacity's noise removal algorithm, which is based on Fourier analysis, to be quite good, but it's too much work to reimplement (a rough sketch of the idea is given after this list).
- The hyperparameters of the encoder and the vocoder are not exposed as arguments. I still have to decide how I want to handle this.
- I've filtered the non-English speakers out of VoxCeleb1 using its metadata file, but there is no such file for VoxCeleb2, so its non-English speakers are currently unfiltered (hopefully they're still a minority in the dataset). It's hard to tell whether this really has a negative impact on the model.
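
For what it's worth, the core of such a noise gate is small. Below is a minimal spectral-gating sketch in the same spirit, not Audacity's actual algorithm; it assumes `numpy` and `librosa` are available and that the first half-second of each clip is noise-only, and the function name and parameters are hypothetical:

```python
import numpy as np
import librosa

def denoise(wav, sr, noise_len_s=0.5, gate_db=-20.0, threshold=1.5):
    """Attenuate time-frequency bins whose magnitude falls below a noise
    profile estimated from the first `noise_len_s` seconds of the clip."""
    hop = 256
    spec = librosa.stft(wav, n_fft=1024, hop_length=hop)
    mag, phase = np.abs(spec), np.angle(spec)

    # Per-frequency noise profile from the (assumed) noise-only leading segment
    n_noise_frames = max(1, int(noise_len_s * sr / hop))
    profile = mag[:, :n_noise_frames].mean(axis=1, keepdims=True)

    # Gate: keep bins above threshold * profile, attenuate the rest by gate_db
    gain = 10.0 ** (gate_db / 20.0)
    mask = np.where(mag >= threshold * profile, 1.0, gain)
    return librosa.istft(mask * mag * np.exp(1j * phase), hop_length=hop)
```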
### TODO
- Eventually merge the `audio` module of each package into `utils/audio`.
- Find a PyTorch-based Tacotron framework that is as good as Rayhane Mamah's TensorFlow implementation.
- Let the user decide whether to use speaker embeddings or utterance embeddings for training the synthesizer (see the sketch after this list).
- The toolbox can always be improved.
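
On the speaker vs. utterance embedding point: in GE2E, a speaker embedding is simply the re-normalized mean of that speaker's utterance embeddings, so switching between the two is cheap. A hypothetical sketch, where `embed_utterance` stands in for the encoder's forward pass:

```python
import numpy as np

def speaker_embedding(utterance_wavs, embed_utterance):
    # Each utterance embedding is assumed to be L2-normalized, as in GE2E
    embeds = np.stack([embed_utterance(wav) for wav in utterance_wavs])  # (n, d)
    mean = embeds.mean(axis=0)
    return mean / np.linalg.norm(mean)  # project back onto the unit sphere
```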