Commit d7415ac

Merge pull request #640 from samuel-lunii/sd/synpaflexSupport
Synpaflex french dataset tacotron2 and MB-MelGAN support
2 parents eec964d + 88d5ef5 commit d7415ac

12 files changed: 480 additions and 8 deletions

README.md

Lines changed: 7 additions & 4 deletions
Original file line number | Diff line number | Diff line change
@@ -116,7 +116,7 @@ Prepare a dataset in the following format:
116116

117117
Where `metadata.csv` has the following format: `id|transcription`. This is an LJSpeech-like format; you can skip the preprocessing steps if your dataset is in another format.
118118

119-
Note that `NAME_DATASET` should be `[ljspeech/kss/baker/libritts]` for example.
119+
Note that `NAME_DATASET` should be `[ljspeech/kss/baker/libritts/synpaflex]` for example.
120120

121121
## Preprocessing
122122

@@ -132,14 +132,17 @@ The preprocessing has two steps:
132132

133133
To reproduce the steps above:
134134
```
135-
tensorflow-tts-preprocess --rootdir ./[ljspeech/kss/baker/libritts/thorsten] --outdir ./dump_[ljspeech/kss/baker/libritts/thorsten] --config preprocess/[ljspeech/kss/baker/thorsten]_preprocess.yaml --dataset [ljspeech/kss/baker/libritts/thorsten]
136-
tensorflow-tts-normalize --rootdir ./dump_[ljspeech/kss/baker/libritts/thorsten] --outdir ./dump_[ljspeech/kss/baker/libritts/thorsten] --config preprocess/[ljspeech/kss/baker/libritts/thorsten]_preprocess.yaml --dataset [ljspeech/kss/baker/libritts/thorsten]
135+
tensorflow-tts-preprocess --rootdir ./[ljspeech/kss/baker/libritts/thorsten/synpaflex] --outdir ./dump_[ljspeech/kss/baker/libritts/thorsten/synpaflex] --config preprocess/[ljspeech/kss/baker/thorsten/synpaflex]_preprocess.yaml --dataset [ljspeech/kss/baker/libritts/thorsten/synpaflex]
136+
tensorflow-tts-normalize --rootdir ./dump_[ljspeech/kss/baker/libritts/thorsten/synpaflex] --outdir ./dump_[ljspeech/kss/baker/libritts/thorsten/synpaflex] --config preprocess/[ljspeech/kss/baker/libritts/thorsten/synpaflex]_preprocess.yaml --dataset [ljspeech/kss/baker/libritts/thorsten/synpaflex]
137137
```
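The bracketed placeholders in the two commands above all expand to the same dataset name. As a minimal sketch, the helper below (`build_commands` is a hypothetical name, not part of TensorFlowTTS) assembles both command lines for one dataset. It assumes the dataset lives in `./<dataset>` and that the config file follows the `preprocess/<dataset>_preprocess.yaml` pattern shown above; it only builds the argument lists and does not run anything.

```python
import shlex


def build_commands(dataset, rootdir=None):
    """Build the preprocess and normalize command lines for one dataset.

    Hypothetical helper: only the flags shown in the README are used.
    """
    rootdir = rootdir or f"./{dataset}"          # assumed dataset location
    dump_dir = f"./dump_{dataset}"               # output of the first step
    config = f"preprocess/{dataset}_preprocess.yaml"

    preprocess = [
        "tensorflow-tts-preprocess",
        "--rootdir", rootdir,
        "--outdir", dump_dir,
        "--config", config,
        "--dataset", dataset,
    ]
    # Normalization reads from and writes to the dump directory.
    normalize = [
        "tensorflow-tts-normalize",
        "--rootdir", dump_dir,
        "--outdir", dump_dir,
        "--config", config,
        "--dataset", dataset,
    ]
    return preprocess, normalize


if __name__ == "__main__":
    for command in build_commands("synpaflex"):
        print(shlex.join(command))
```

Returning argument lists rather than one string keeps the commands safe to pass to `subprocess.run` if you choose to execute them.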
138138

139-
Right now we only support [`ljspeech`](https://keithito.com/LJ-Speech-Dataset/), [`kss`](https://www.kaggle.com/bryanpark/korean-single-speaker-speech-dataset), [`baker`](https://weixinxcxdb.oss-cn-beijing.aliyuncs.com/gwYinPinKu/BZNSYP.rar), [`libritts`](http://www.openslr.org/60/) and [`thorsten`](https://github.com/thorstenMueller/deep-learning-german-tts) for dataset argument. In the future, we intend to support more datasets.
139+
Right now we only support [`ljspeech`](https://keithito.com/LJ-Speech-Dataset/), [`kss`](https://www.kaggle.com/bryanpark/korean-single-speaker-speech-dataset), [`baker`](https://weixinxcxdb.oss-cn-beijing.aliyuncs.com/gwYinPinKu/BZNSYP.rar), [`libritts`](http://www.openslr.org/60/), [`thorsten`](https://github.com/thorstenMueller/deep-learning-german-tts) and
140+
[`synpaflex`](https://www.ortolang.fr/market/corpora/synpaflex-corpus/) for the `--dataset` argument. In the future, we intend to support more datasets.
140141

141142
**Note**: To run `libritts` preprocessing, please first read the instructions in [examples/fastspeech2_libritts](https://github.com/TensorSpeech/TensorFlowTTS/tree/master/examples/fastspeech2_libritts). The dataset must be reformatted before preprocessing is run.
142143

144+
**Note**: To run `synpaflex` preprocessing, please first run the notebook [notebooks/prepare_synpaflex.ipynb](https://github.com/TensorSpeech/TensorFlowTTS/tree/master/notebooks/prepare_synpaflex.ipynb). The dataset must be reformatted before preprocessing is run.
145+
143146
After preprocessing, the structure of the project folder should be:
144147
```
145148
|- [NAME_DATASET]/
Lines changed: 108 additions & 0 deletions
@@ -0,0 +1,108 @@
1+
2+
# This is the hyperparameter configuration file for Multi-Band MelGAN.
3+
# Please make sure this is adjusted for the LJSpeech dataset. If you want to
4+
# apply to the other dataset, you might need to carefully change some parameters.
5+
# This configuration performs 1000k iters.
6+
7+
###########################################################
8+
# FEATURE EXTRACTION SETTING #
9+
###########################################################
10+
sampling_rate: 22050
11+
hop_size: 256 # Hop size.
12+
format: "npy"
13+
14+
15+
###########################################################
16+
# GENERATOR NETWORK ARCHITECTURE SETTING #
17+
###########################################################
18+
model_type: "multiband_melgan_generator"
19+
20+
multiband_melgan_generator_params:
21+
out_channels: 4 # Number of output channels (number of subbands).
22+
kernel_size: 7 # Kernel size of initial and final conv layers.
23+
filters: 384 # Initial number of channels for conv layers.
24+
upsample_scales: [8, 4, 2] # List of Upsampling scales.
25+
stack_kernel_size: 3 # Kernel size of dilated conv layers in residual stack.
26+
stacks: 4 # Number of stacks in a single residual stack module.
27+
is_weight_norm: false # Use weight-norm or not.
28+
29+
###########################################################
30+
# DISCRIMINATOR NETWORK ARCHITECTURE SETTING #
31+
###########################################################
32+
multiband_melgan_discriminator_params:
33+
out_channels: 1 # Number of output channels.
34+
scales: 3 # Number of multi-scales.
35+
downsample_pooling: "AveragePooling1D" # Pooling type for the input downsampling.
36+
downsample_pooling_params: # Parameters of the above pooling function.
37+
pool_size: 4
38+
strides: 2
39+
kernel_sizes: [5, 3] # List of kernel size.
40+
filters: 16 # Number of channels of the initial conv layer.
41+
max_downsample_filters: 512 # Maximum number of channels of downsampling layers.
42+
downsample_scales: [4, 4, 4] # List of downsampling scales.
43+
nonlinear_activation: "LeakyReLU" # Nonlinear activation function.
44+
nonlinear_activation_params: # Parameters of nonlinear activation function.
45+
alpha: 0.2
46+
is_weight_norm: false # Use weight-norm or not.
47+
48+
###########################################################
49+
# STFT LOSS SETTING #
50+
###########################################################
51+
stft_loss_params:
52+
fft_lengths: [1024, 2048, 512] # List of FFT size for STFT-based loss.
53+
frame_steps: [120, 240, 50] # List of hop size for STFT-based loss
54+
frame_lengths: [600, 1200, 240] # List of window length for STFT-based loss.
55+
56+
subband_stft_loss_params:
57+
fft_lengths: [384, 683, 171] # List of FFT size for STFT-based loss.
58+
frame_steps: [30, 60, 10] # List of hop size for STFT-based loss
59+
frame_lengths: [150, 300, 60] # List of window length for STFT-based loss.
60+
61+
###########################################################
62+
# ADVERSARIAL LOSS SETTING #
63+
###########################################################
64+
lambda_feat_match: 10.0 # Loss balancing coefficient for feature matching loss
65+
lambda_adv: 2.5 # Loss balancing coefficient for adversarial loss.
66+
67+
###########################################################
68+
# DATA LOADER SETTING #
69+
###########################################################
70+
batch_size: 64 # Batch size per GPU, assuming gradient_accumulation_steps == 1.
71+
batch_max_steps: 8192 # Length of each audio in batch for training. Must be divisible by hop_size.
72+
batch_max_steps_valid: 8192 # Length of each audio for validation. Must be divisible by hop_size.
73+
remove_short_samples: true # Whether to remove samples the length of which are less than batch_max_steps.
74+
allow_cache: true # Whether to allow cache in dataset. If true, it requires cpu memory.
75+
is_shuffle: true # shuffle dataset after each epoch.
76+
77+
###########################################################
78+
# OPTIMIZER & SCHEDULER SETTING #
79+
###########################################################
80+
generator_optimizer_params:
81+
lr_fn: "PiecewiseConstantDecay"
82+
lr_params:
83+
boundaries: [100000, 200000, 300000, 400000, 500000, 600000, 700000]
84+
values: [0.0005, 0.0005, 0.00025, 0.000125, 0.0000625, 0.00003125, 0.000015625, 0.000001]
85+
amsgrad: false
86+
87+
discriminator_optimizer_params:
88+
lr_fn: "PiecewiseConstantDecay"
89+
lr_params:
90+
boundaries: [100000, 200000, 300000, 400000, 500000]
91+
values: [0.00025, 0.000125, 0.0000625, 0.00003125, 0.000015625, 0.000001]
92+
93+
amsgrad: false
94+
95+
gradient_accumulation_steps: 1
96+
###########################################################
97+
# INTERVAL SETTING #
98+
###########################################################
99+
discriminator_train_start_steps: 200000 # Step at which discriminator training begins.
100+
train_max_steps: 4000000 # Number of training steps.
101+
save_interval_steps: 20000 # Interval steps to save checkpoint.
102+
eval_interval_steps: 5000 # Interval steps to evaluate the network.
103+
log_interval_steps: 200 # Interval steps to record the training log.
104+
105+
###########################################################
106+
# OTHER SETTING #
107+
###########################################################
108+
num_save_intermediate_results: 1 # Number of batch to be saved as intermediate results.
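Several values in the Multi-Band MelGAN configuration above depend on each other. The sketch below copies the relevant values into a plain dictionary and checks three relationships: the product of `upsample_scales` times the number of subbands (`out_channels`) equals `hop_size`, the batch lengths are divisible by `hop_size` (as the config comments require), and each `PiecewiseConstantDecay` schedule has exactly one more value than it has boundaries. The first relationship is my reading of how a multi-band generator recovers the full sample rate, not something the diff states; `check_config` is a hypothetical helper.

```python
import math

# Values copied from the configuration above.
config = {
    "hop_size": 256,
    "out_channels": 4,            # number of subbands
    "upsample_scales": [8, 4, 2],
    "batch_max_steps": 8192,
    "batch_max_steps_valid": 8192,
    "generator_lr": {
        "boundaries": [100000, 200000, 300000, 400000, 500000, 600000, 700000],
        "values": [0.0005, 0.0005, 0.00025, 0.000125, 0.0000625,
                   0.00003125, 0.000015625, 0.000001],
    },
    "discriminator_lr": {
        "boundaries": [100000, 200000, 300000, 400000, 500000],
        "values": [0.00025, 0.000125, 0.0000625, 0.00003125,
                   0.000015625, 0.000001],
    },
}


def check_config(cfg):
    """Return a list of human-readable problems (empty if none are found)."""
    problems = []

    # Assumption: generator upsampling times subband count must equal hop_size.
    total_upsampling = math.prod(cfg["upsample_scales"]) * cfg["out_channels"]
    if total_upsampling != cfg["hop_size"]:
        problems.append(
            f"upsampling {total_upsampling} != hop_size {cfg['hop_size']}"
        )

    # Batch lengths must be a whole number of frames.
    for key in ("batch_max_steps", "batch_max_steps_valid"):
        if cfg[key] % cfg["hop_size"] != 0:
            problems.append(f"{key} is not divisible by hop_size")

    # A piecewise-constant schedule needs len(boundaries) + 1 values.
    for name in ("generator_lr", "discriminator_lr"):
        schedule = cfg[name]
        if len(schedule["values"]) != len(schedule["boundaries"]) + 1:
            problems.append(f"{name}: values/boundaries length mismatch")

    return problems


if __name__ == "__main__":
    print(check_config(config) or "no problems found")
```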
Lines changed: 86 additions & 0 deletions
@@ -0,0 +1,86 @@
1+
# This is the hyperparameter configuration file for Tacotron2 v1.
2+
# Please make sure this is adjusted for the LJSpeech dataset. If you want to
3+
# apply to the other dataset, you might need to carefully change some parameters.
4+
# This configuration performs 200k iters, but 65k iters is enough to get a good model.
5+
6+
###########################################################
7+
# FEATURE EXTRACTION SETTING #
8+
###########################################################
9+
hop_size: 256 # Hop size.
10+
format: "npy"
11+
12+
13+
###########################################################
14+
# NETWORK ARCHITECTURE SETTING #
15+
###########################################################
16+
model_type: "tacotron2"
17+
18+
tacotron2_params:
19+
dataset: synpaflex
20+
embedding_hidden_size: 512
21+
initializer_range: 0.02
22+
embedding_dropout_prob: 0.1
23+
n_speakers: 1
24+
n_conv_encoder: 5
25+
encoder_conv_filters: 512
26+
encoder_conv_kernel_sizes: 5
27+
encoder_conv_activation: 'relu'
28+
encoder_conv_dropout_rate: 0.5
29+
encoder_lstm_units: 256
30+
n_prenet_layers: 2
31+
prenet_units: 256
32+
prenet_activation: 'relu'
33+
prenet_dropout_rate: 0.5
34+
n_lstm_decoder: 1
35+
reduction_factor: 1
36+
decoder_lstm_units: 1024
37+
attention_dim: 128
38+
attention_filters: 32
39+
attention_kernel: 31
40+
n_mels: 80
41+
n_conv_postnet: 5
42+
postnet_conv_filters: 512
43+
postnet_conv_kernel_sizes: 5
44+
postnet_dropout_rate: 0.1
45+
attention_type: "lsa"
46+
47+
###########################################################
48+
# DATA LOADER SETTING #
49+
###########################################################
50+
batch_size: 32 # Batch size per GPU, assuming gradient_accumulation_steps == 1.
51+
remove_short_samples: true # Whether to remove samples the length of which are less than batch_max_steps.
52+
allow_cache: true # Whether to allow cache in dataset. If true, it requires cpu memory.
53+
mel_length_threshold: 32 # Remove all targets with mel_length <= 32.
54+
is_shuffle: true # shuffle dataset after each epoch.
55+
use_fixed_shapes: true # use_fixed_shapes for training (2x speed-up)
56+
# refer (https://github.com/dathudeptrai/TensorflowTTS/issues/34#issuecomment-642309118)
57+
58+
###########################################################
59+
# OPTIMIZER & SCHEDULER SETTING #
60+
###########################################################
61+
optimizer_params:
62+
initial_learning_rate: 0.001
63+
end_learning_rate: 0.00001
64+
decay_steps: 150000 # A value < train_max_steps is recommended.
65+
warmup_proportion: 0.02
66+
weight_decay: 0.001
67+
68+
gradient_accumulation_steps: 1
69+
var_train_expr: null # Trainable variable expression (e.g. 'embeddings|decoder_cell').
70+
# Separate patterns with |. If var_train_expr is null,
71+
# all variables are trained.
72+
###########################################################
73+
# INTERVAL SETTING #
74+
###########################################################
75+
train_max_steps: 200000 # Number of training steps.
76+
save_interval_steps: 2000 # Interval steps to save checkpoint.
77+
eval_interval_steps: 500 # Interval steps to evaluate the network.
78+
log_interval_steps: 200 # Interval steps to record the training log.
79+
start_schedule_teacher_forcing: 200001 # don't need to apply schedule teacher forcing.
80+
start_ratio_value: 0.5 # start ratio of scheduled teacher forcing.
81+
schedule_decay_steps: 50000 # decay step scheduled teacher forcing.
82+
end_ratio_value: 0.0 # end ratio of scheduled teacher forcing.
83+
###########################################################
84+
# OTHER SETTING #
85+
###########################################################
86+
num_save_intermediate_results: 1 # Number of results to be saved as intermediate results.
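The interval settings in the Tacotron2 configuration above interact in ways that are easy to miss. A small sketch, using only values copied from the config: it checks whether scheduled teacher forcing would ever start and whether the learning-rate decay finishes before training ends. It assumes the schedule becomes active once the step counter reaches `start_schedule_teacher_forcing` and that intervals trigger on exact multiples; both are assumptions about the trainer, and `summarize` is a hypothetical helper.

```python
# Values copied from the Tacotron2 configuration above.
config = {
    "train_max_steps": 200000,
    "decay_steps": 150000,
    "start_schedule_teacher_forcing": 200001,
    "save_interval_steps": 2000,
    "eval_interval_steps": 500,
}


def summarize(cfg):
    """Derive a few facts implied by the interval settings."""
    return {
        # Assumption: the schedule starts when the step counter reaches
        # this value, so a value above train_max_steps disables it.
        "scheduled_teacher_forcing_active":
            cfg["start_schedule_teacher_forcing"] <= cfg["train_max_steps"],
        "decay_finishes_before_training_ends":
            cfg["decay_steps"] < cfg["train_max_steps"],
        # Assumption: one checkpoint/evaluation per full interval.
        "checkpoints_saved":
            cfg["train_max_steps"] // cfg["save_interval_steps"],
        "evaluations_run":
            cfg["train_max_steps"] // cfg["eval_interval_steps"],
    }


if __name__ == "__main__":
    for key, value in summarize(config).items():
        print(f"{key}: {value}")
```

With the values above, scheduled teacher forcing stays off for the whole run, which matches the config comment saying it does not need to be applied.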

notebooks/prepare_synpaflex.ipynb

Lines changed: 111 additions & 0 deletions
@@ -0,0 +1,111 @@
1+
{
2+
"cells": [
3+
{
4+
"cell_type": "code",
5+
"execution_count": null,
6+
"metadata": {},
7+
"outputs": [],
8+
"source": [
9+
"import os\n",
10+
"\n",
11+
"import numpy as np\n",
12+
"import soundfile as sf\n",
13+
"from pathlib import Path\n",
14+
"from shutil import copyfile\n",
15+
"from tqdm import tqdm\n",
16+
"\n",
17+
"input_dataset_path = \"[your_local_path]/synpaflex-corpus/v0.1/\"\n",
18+
"reorganized_dataset_path = \"../synpaflex/\"\n",
19+
"\n",
20+
"maximal_duration = 12 # maximal audio file duration in seconds\n"
21+
]
22+
},
23+
{
24+
"cell_type": "code",
25+
"execution_count": null,
26+
"metadata": {},
27+
"outputs": [],
28+
"source": [
29+
"wav_dir = os.path.join(reorganized_dataset_path, \"wavs/\")\n",
30+
"os.makedirs(wav_dir, exist_ok=True)\n",
31+
"data = []\n",
32+
"total_duration = 0\n",
33+
"\n",
34+
"# Precomputing walk_count for tqdm\n",
35+
"walk_count = 0\n",
36+
"for subdir, dirs, files in os.walk(input_dataset_path):\n",
37+
" walk_count += 1\n",
38+
"\n",
39+
"# walk through dataset\n",
40+
"for subdir, dirs, files in tqdm(os.walk(input_dataset_path), total=walk_count, bar_format='Data Reorganization : {l_bar}{bar}|'):\n",
41+
" for filename in files:\n",
42+
" filepath = os.path.join(subdir, filename)\n",
43+
"\n",
44+
" # read wav files\n",
45+
" if filepath.endswith(\".wav\"):\n",
46+
" try:\n",
47+
" wav, sr = sf.read(filepath)\n",
48+
" duration = len(wav) / sr\n",
49+
" \n",
50+
" # Only keep files with shorter durations than maximal_duration\n",
51+
" if duration <= maximal_duration:\n",
52+
" total_duration += duration\n",
53+
" path = Path(filepath)\n",
54+
" current_path = Path(path.parent.absolute())\n",
55+
" \n",
56+
" # find corresponding text file\n",
57+
" txt_file_path = os.path.join(current_path, \"txt\", filename.replace('.wav','.txt'))\n",
58+
" if not os.path.exists(txt_file_path):\n",
59+
" parent_path = Path(current_path.parent.absolute())\n",
60+
" txt_file_path = os.path.join(parent_path, \"txt\", filename.replace('.wav', '.txt'))\n",
61+
" if not os.path.exists(txt_file_path):\n",
62+
" break\n",
63+
" norm_text_file_path = txt_file_path.replace(\".txt\", \"_norm.txt\")\n",
64+
" text = open(txt_file_path, \"r\").read()\n",
65+
" if os.path.exists(norm_text_file_path):\n",
66+
" norm_text = open(norm_text_file_path, 'r').read()\n",
67+
" else : \n",
68+
" norm_text = text\n",
69+
" \n",
70+
" # ignore file if text contains digits, otherwise copy wav file and keep metadata to memory \n",
71+
" if not any(chr.isdigit() for chr in text):\n",
72+
" data_line = filename.replace(\".wav\", \"\") + '|' + text + '|' + norm_text\n",
73+
" data.append(data_line)\n",
74+
" copyfile(filepath, os.path.join(wav_dir, filename))\n",
75+
"\n",
76+
" except RuntimeError:\n",
77+
" print(filepath + \" not recognized and ignored.\") \n",
78+
"\n",
79+
"# save metadata\n",
80+
"with open(os.path.join(reorganized_dataset_path, \"synpaflex.txt\"), 'w') as f:\n",
81+
" for item in data:\n",
82+
" f.write(\"%s\\n\" % item)\n",
83+
"\n",
84+
"# display reorganized dataset total duration\n",
85+
"duration_hours = total_duration / 3600\n",
86+
"print(\"total duration = \" + str(f\"{duration_hours:.2f}\") + \" hours\")"
87+
]
88+
}
89+
],
90+
"metadata": {
91+
"kernelspec": {
92+
"display_name": "Python 3",
93+
"language": "python",
94+
"name": "python3"
95+
},
96+
"language_info": {
97+
"codemirror_mode": {
98+
"name": "ipython",
99+
"version": 3
100+
},
101+
"file_extension": ".py",
102+
"mimetype": "text/x-python",
103+
"name": "python",
104+
"nbconvert_exporter": "python",
105+
"pygments_lexer": "ipython3",
106+
"version": "3.8.5"
107+
}
108+
},
109+
"nbformat": 4,
110+
"nbformat_minor": 4
111+
}
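The notebook above writes one `id|text|norm_text` line per kept utterance to `synpaflex.txt` and skips utterances whose text contains digits. As a minimal sketch of that metadata step in isolation, the two hypothetical helpers below build and parse such a line. Unlike the notebook, `build_metadata_line` collapses whitespace (including trailing newlines) in the transcripts; that is an addition of mine so that each utterance stays on a single line, not behavior taken from the commit.

```python
def build_metadata_line(filename, text, norm_text=None):
    """Return an `id|text|norm_text` line, or None if the text contains digits.

    Hypothetical helper mirroring the notebook's filtering rule.
    """
    # Assumption (not in the notebook): collapse whitespace so the
    # transcript cannot introduce extra line breaks in the metadata file.
    text = " ".join(text.split())
    if any(ch.isdigit() for ch in text):
        return None

    # Fall back to the raw text when no normalized transcript exists.
    norm_text = text if norm_text is None else " ".join(norm_text.split())

    utt_id = filename[: -len(".wav")] if filename.endswith(".wav") else filename
    return "|".join([utt_id, text, norm_text])


def parse_metadata_line(line):
    """Split a metadata line back into (id, text, norm_text)."""
    utt_id, text, norm_text = line.rstrip("\n").split("|", 2)
    return utt_id, text, norm_text


if __name__ == "__main__":
    line = build_metadata_line("sample_0001.wav", "Bonjour le monde.\n")
    print(line)
    print(parse_metadata_line(line))
```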

notebooks/tacotron2_inference.ipynb

Lines changed: 1 addition & 1 deletion
@@ -397,7 +397,7 @@
397397
"name": "python",
398398
"nbconvert_exporter": "python",
399399
"pygments_lexer": "ipython3",
400-
"version": "3.7.7"
400+
"version": "3.8.5"
401401
}
402402
},
403403
"nbformat": 4,
Lines changed: 19 additions & 0 deletions
@@ -0,0 +1,19 @@
1+
###########################################################
2+
# FEATURE EXTRACTION SETTING #
3+
###########################################################
4+
sampling_rate: 22050 # Sampling rate.
5+
fft_size: 1024 # FFT size.
6+
hop_size: 256 # Hop size. (fixed value, don't change)
7+
win_length: null # Window length.
8+
# If set to null, it will be the same as fft_size.
9+
window: "hann" # Window function.
10+
num_mels: 80 # Number of mel basis.
11+
fmin: 80 # Minimum freq in mel basis calculation.
12+
fmax: 7600 # Maximum frequency in mel basis calculation.
13+
global_gain_scale: 1.0 # Will be multiplied to all of waveform.
14+
trim_silence: true # Whether to trim the start and end of silence.
15+
trim_threshold_in_db: 20 # Need to tune carefully if the recording is not good.
16+
trim_frame_size: 2048 # Frame size in trimming.
17+
trim_hop_size: 512 # Hop size in trimming.
18+
format: "npy" # Feature file format. Only "npy" is supported.
19+
