TensorSpeech
diff --git a/README.md b/README.md
Lines changed: 7 additions & 4 deletions
diff --git a/examples/multiband_melgan/conf/multiband_melgan.synpaflex.v1.yaml b/examples/multiband_melgan/conf/multiband_melgan.synpaflex.v1.yaml
Lines changed: 108 additions & 0 deletions
diff --git a/examples/tacotron2/conf/tacotron2.synpaflex.v1.yaml b/examples/tacotron2/conf/tacotron2.synpaflex.v1.yaml
Lines changed: 86 additions & 0 deletions
diff --git a/notebooks/prepare_synpaflex.ipynb b/notebooks/prepare_synpaflex.ipynb
Lines changed: 111 additions & 0 deletions
diff --git a/notebooks/tacotron2_inference.ipynb b/notebooks/tacotron2_inference.ipynb
Lines changed: 1 addition & 1 deletion
diff --git a/preprocess/synpaflex_preprocess.yaml b/preprocess/synpaflex_preprocess.yaml
Lines changed: 19 additions & 0 deletions
@@ -116,7 +116,7 @@ Prepare a dataset in the following format:
 
 Where `metadata.csv` has the following format: `id|transcription`. This is a ljspeech-like format; you can ignore preprocessing steps if you have other format datasets.
 
-Note that `NAME_DATASET` should be `[ljspeech/kss/baker/libritts]` for example.
+Note that `NAME_DATASET` should be `[ljspeech/kss/baker/libritts/synpaflex]` for example.
 
 ## Preprocessing
 
@@ -132,14 +132,17 @@ The preprocessing has two steps:
 
 To reproduce the steps above:
 ```
-tensorflow-tts-preprocess --rootdir ./[ljspeech/kss/baker/libritts/thorsten] --outdir ./dump_[ljspeech/kss/baker/libritts/thorsten] --config preprocess/[ljspeech/kss/baker/thorsten]_preprocess.yaml --dataset [ljspeech/kss/baker/libritts/thorsten]
-tensorflow-tts-normalize --rootdir ./dump_[ljspeech/kss/baker/libritts/thorsten] --outdir ./dump_[ljspeech/kss/baker/libritts/thorsten] --config preprocess/[ljspeech/kss/baker/libritts/thorsten]_preprocess.yaml --dataset [ljspeech/kss/baker/libritts/thorsten]
+tensorflow-tts-preprocess --rootdir ./[ljspeech/kss/baker/libritts/thorsten/synpaflex] --outdir ./dump_[ljspeech/kss/baker/libritts/thorsten/synpaflex] --config preprocess/[ljspeech/kss/baker/libritts/thorsten/synpaflex]_preprocess.yaml --dataset [ljspeech/kss/baker/libritts/thorsten/synpaflex]
+tensorflow-tts-normalize --rootdir ./dump_[ljspeech/kss/baker/libritts/thorsten/synpaflex] --outdir ./dump_[ljspeech/kss/baker/libritts/thorsten/synpaflex] --config preprocess/[ljspeech/kss/baker/libritts/thorsten/synpaflex]_preprocess.yaml --dataset [ljspeech/kss/baker/libritts/thorsten/synpaflex]
 ```
 
-Right now we only support [`ljspeech`](https://keithito.com/LJ-Speech-Dataset/), [`kss`](https://www.kaggle.com/bryanpark/korean-single-speaker-speech-dataset), [`baker`](https://weixinxcxdb.oss-cn-beijing.aliyuncs.com/gwYinPinKu/BZNSYP.rar), [`libritts`](http://www.openslr.org/60/) and [`thorsten`](https://github.com/thorstenMueller/deep-learning-german-tts) for dataset argument. In the future, we intend to support more datasets.
+Right now we only support [`ljspeech`](https://keithito.com/LJ-Speech-Dataset/), [`kss`](https://www.kaggle.com/bryanpark/korean-single-speaker-speech-dataset), [`baker`](https://weixinxcxdb.oss-cn-beijing.aliyuncs.com/gwYinPinKu/BZNSYP.rar), [`libritts`](http://www.openslr.org/60/), [`thorsten`](https://github.com/thorstenMueller/deep-learning-german-tts) and
+[`synpaflex`](https://www.ortolang.fr/market/corpora/synpaflex-corpus/) for the dataset argument. In the future, we intend to support more datasets.
 
 **Note**: To run `libritts` preprocessing, please first read the instruction in [examples/fastspeech2_libritts](https://github.com/TensorSpeech/TensorFlowTTS/tree/master/examples/fastspeech2_libritts). We need to reformat it first before run preprocessing.
 
+**Note**: To run `synpaflex` preprocessing, please first run the notebook [notebooks/prepare_synpaflex.ipynb](https://github.com/TensorSpeech/TensorFlowTTS/tree/master/notebooks/prepare_synpaflex.ipynb). The dataset must be reformatted before running preprocessing.
+
 After preprocessing, the structure of the project folder should be:
 ```
 |- [NAME_DATASET]/
 
@@ -0,0 +1,108 @@
+
+# This is the hyperparameter configuration file for Multi-Band MelGAN.
+# Please make sure this is adjusted for the LJSpeech dataset. If you want to
+# apply it to another dataset, you may need to change some parameters carefully.
+# train_max_steps below is set to 4000k iters.
+
+###########################################################
+#                FEATURE EXTRACTION SETTING               #
+###########################################################
+sampling_rate: 22050
+hop_size: 256            # Hop size.
+format: "npy"
+
+
+###########################################################
+#         GENERATOR NETWORK ARCHITECTURE SETTING          #
+###########################################################
+model_type: "multiband_melgan_generator"
+
+multiband_melgan_generator_params:
+    out_channels: 4               # Number of output channels (number of subbands).
+    kernel_size: 7                # Kernel size of initial and final conv layers.
+    filters: 384                  # Initial number of channels for conv layers.
+    upsample_scales: [8, 4, 2]    # List of Upsampling scales.
+    stack_kernel_size: 3          # Kernel size of dilated conv layers in residual stack.
+    stacks: 4                     # Number of stacks in a single residual stack module.
+    is_weight_norm: false         # Use weight-norm or not.
+
+###########################################################
+#       DISCRIMINATOR NETWORK ARCHITECTURE SETTING        #
+###########################################################
+multiband_melgan_discriminator_params:
+    out_channels: 1                   # Number of output channels.
+    scales: 3                         # Number of multi-scales.
+    downsample_pooling: "AveragePooling1D"   # Pooling type for the input downsampling.
+    downsample_pooling_params:        # Parameters of the above pooling function.
+        pool_size: 4
+        strides: 2
+    kernel_sizes: [5, 3]              # List of kernel size.
+    filters: 16                       # Number of channels of the initial conv layer.
+    max_downsample_filters: 512       # Maximum number of channels of downsampling layers.
+    downsample_scales: [4, 4, 4]      # List of downsampling scales.
+    nonlinear_activation: "LeakyReLU" # Nonlinear activation function.
+    nonlinear_activation_params:      # Parameters of nonlinear activation function.
+        alpha: 0.2
+    is_weight_norm: false             # Use weight-norm or not.
+
+###########################################################
+#                   STFT LOSS SETTING                     #
+###########################################################
+stft_loss_params:
+    fft_lengths: [1024, 2048, 512]  # List of FFT size for STFT-based loss.
+    frame_steps: [120, 240, 50]     # List of hop size for STFT-based loss
+    frame_lengths: [600, 1200, 240] # List of window length for STFT-based loss.
+
+subband_stft_loss_params:
+    fft_lengths: [384, 683, 171]  # List of FFT size for STFT-based loss.
+    frame_steps: [30, 60, 10]     # List of hop size for STFT-based loss
+    frame_lengths: [150, 300, 60] # List of window length for STFT-based loss.
+
+###########################################################
+#               ADVERSARIAL LOSS SETTING                  #
+###########################################################
+lambda_feat_match: 10.0      # Loss balancing coefficient for feature matching loss
+lambda_adv: 2.5              # Loss balancing coefficient for adversarial loss.
+
+###########################################################
+#                  DATA LOADER SETTING                    #
+###########################################################
+batch_size: 64                 # Batch size for each GPU, assuming gradient_accumulation_steps == 1.
+batch_max_steps: 8192          # Length of each audio in batch for training. Make sure it is divisible by hop_size.
+batch_max_steps_valid: 8192    # Length of each audio for validation. Make sure it is divisible by hop_size.
+remove_short_samples: true     # Whether to remove samples shorter than batch_max_steps.
+allow_cache: true              # Whether to allow cache in dataset. If true, it requires cpu memory.
+is_shuffle: true               # shuffle dataset after each epoch.
+
+###########################################################
+#             OPTIMIZER & SCHEDULER SETTING               #
+###########################################################
+generator_optimizer_params:
+    lr_fn: "PiecewiseConstantDecay"
+    lr_params: 
+        boundaries: [100000, 200000, 300000, 400000, 500000, 600000, 700000]
+        values: [0.0005, 0.0005, 0.00025, 0.000125, 0.0000625, 0.00003125, 0.000015625, 0.000001]
+    amsgrad: false
+
+discriminator_optimizer_params:
+    lr_fn: "PiecewiseConstantDecay"
+    lr_params: 
+        boundaries: [100000, 200000, 300000, 400000, 500000]
+        values: [0.00025, 0.000125, 0.0000625, 0.00003125, 0.000015625, 0.000001]
+
+    amsgrad: false
+
+gradient_accumulation_steps: 1
+###########################################################
+#                    INTERVAL SETTING                     #
+###########################################################
+discriminator_train_start_steps: 200000  # Step at which discriminator training begins.
+train_max_steps: 4000000                 # Number of training steps.
+save_interval_steps: 20000               # Interval steps to save checkpoint.
+eval_interval_steps: 5000                # Interval steps to evaluate the network.
+log_interval_steps: 200                  # Interval steps to record the training log.
+
+###########################################################
+#                     OTHER SETTING                       #
+###########################################################
+num_save_intermediate_results: 1  # Number of batch to be saved as intermediate results.
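
Review note: the config above only works if a few of its numbers agree with each other. The sketch below checks two of those relationships in plain Python, using values copied from the YAML. It assumes that the generator's total upsampling (product of `upsample_scales`) times the number of subbands should equal `hop_size`, and it mimics the interval convention documented for Keras `PiecewiseConstantDecay` (a step equal to a boundary still uses the lower interval's value). The helper name `piecewise_constant` is hypothetical and is not part of TensorFlowTTS.

```python
from bisect import bisect_left
from math import prod

# Values copied from multiband_melgan.synpaflex.v1.yaml.
hop_size = 256
out_channels = 4                  # number of subbands
upsample_scales = [8, 4, 2]
batch_max_steps = 8192

# Assumption: each mel frame is upsampled by prod(upsample_scales) per subband,
# and merging the subbands multiplies the sample count by out_channels.
total_upsampling = prod(upsample_scales) * out_channels
assert total_upsampling == hop_size
assert batch_max_steps % hop_size == 0
frames_per_batch_item = batch_max_steps // hop_size

gen_boundaries = [100000, 200000, 300000, 400000, 500000, 600000, 700000]
gen_values = [0.0005, 0.0005, 0.00025, 0.000125, 0.0000625,
              0.00003125, 0.000015625, 0.000001]


def piecewise_constant(step, boundaries, values):
    """Hypothetical helper: pick the value for the interval containing `step`.

    A step equal to a boundary belongs to the lower interval.
    """
    if len(values) != len(boundaries) + 1:
        raise ValueError("values must have exactly one more entry than boundaries")
    return values[bisect_left(boundaries, step)]


if __name__ == "__main__":
    for step in (0, 100000, 250000, 700001):
        print(step, piecewise_constant(step, gen_boundaries, gen_values))
```

The length check matters here because the generator schedule has 7 boundaries and 8 values, while the discriminator schedule has 5 boundaries and 6 values; a mismatch in either list would be easy to miss when editing the YAML by hand.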
@@ -0,0 +1,86 @@
+# This is the hyperparameter configuration file for Tacotron2 v1.
+# Please make sure this is adjusted for the LJSpeech dataset. If you want to
+# apply it to another dataset, you may need to change some parameters carefully.
+# This configuration performs 200k iters, but 65k iters is enough to get a good model.
+
+###########################################################
+#                FEATURE EXTRACTION SETTING               #
+###########################################################
+hop_size: 256            # Hop size.
+format: "npy"
+
+
+###########################################################
+#              NETWORK ARCHITECTURE SETTING               #
+###########################################################
+model_type: "tacotron2"
+
+tacotron2_params:
+    dataset: synpaflex
+    embedding_hidden_size: 512
+    initializer_range: 0.02
+    embedding_dropout_prob: 0.1
+    n_speakers: 1
+    n_conv_encoder: 5
+    encoder_conv_filters: 512
+    encoder_conv_kernel_sizes: 5
+    encoder_conv_activation: 'relu'
+    encoder_conv_dropout_rate: 0.5
+    encoder_lstm_units: 256
+    n_prenet_layers: 2
+    prenet_units: 256
+    prenet_activation: 'relu'
+    prenet_dropout_rate: 0.5
+    n_lstm_decoder: 1
+    reduction_factor: 1
+    decoder_lstm_units: 1024
+    attention_dim: 128
+    attention_filters: 32
+    attention_kernel: 31
+    n_mels: 80
+    n_conv_postnet: 5
+    postnet_conv_filters: 512
+    postnet_conv_kernel_sizes: 5
+    postnet_dropout_rate: 0.1
+    attention_type: "lsa"
+
+###########################################################
+#                  DATA LOADER SETTING                    #
+###########################################################
+batch_size: 32              # Batch size for each GPU, assuming gradient_accumulation_steps == 1.
+remove_short_samples: true  # Whether to remove samples shorter than batch_max_steps.
+allow_cache: true           # Whether to allow cache in dataset. If true, it requires cpu memory.
+mel_length_threshold: 32    # Remove all targets with mel_length <= 32.
+is_shuffle: true            # shuffle dataset after each epoch.
+use_fixed_shapes: true      # use_fixed_shapes for training (2x speed-up)
+                            # refer (https://github.com/dathudeptrai/TensorflowTTS/issues/34#issuecomment-642309118)
+
+###########################################################
+#             OPTIMIZER & SCHEDULER SETTING               #
+###########################################################
+optimizer_params:
+    initial_learning_rate: 0.001
+    end_learning_rate: 0.00001
+    decay_steps: 150000          # < train_max_steps is recommended.
+    warmup_proportion: 0.02
+    weight_decay: 0.001
+    
+gradient_accumulation_steps: 1
+var_train_expr: null  # Trainable variable expression (e.g. 'embeddings|decoder_cell');
+                      # patterns must be separated by |. If var_train_expr is null,
+                      # all variables are trained.
+###########################################################
+#                    INTERVAL SETTING                     #
+###########################################################
+train_max_steps: 200000                 # Number of training steps.
+save_interval_steps: 2000               # Interval steps to save checkpoint.
+eval_interval_steps: 500                # Interval steps to evaluate the network.
+log_interval_steps: 200                 # Interval steps to record the training log.
+start_schedule_teacher_forcing: 200001  # Set above train_max_steps, so scheduled teacher forcing is never applied.
+start_ratio_value: 0.5                  # start ratio of scheduled teacher forcing.
+schedule_decay_steps: 50000             # decay step scheduled teacher forcing.
+end_ratio_value: 0.0                    # end ratio of scheduled teacher forcing.
+###########################################################
+#                     OTHER SETTING                       #
+###########################################################
+num_save_intermediate_results: 1  # Number of results to be saved as intermediate results.
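
Review note: the optimizer and interval settings above interact, so a quick numeric sketch may help when tuning them. The code below is an illustration only. It assumes a linear warmup over `warmup_proportion * train_max_steps` steps followed by a linear decay from `initial_learning_rate` to `end_learning_rate` over `decay_steps`, evaluated at the global step. The exact schedule used by the training script may differ, and the function name `learning_rate` is hypothetical.

```python
# Values copied from tacotron2.synpaflex.v1.yaml.
initial_learning_rate = 0.001
end_learning_rate = 0.00001
decay_steps = 150000
warmup_proportion = 0.02
train_max_steps = 200000
start_schedule_teacher_forcing = 200001

# Assumption: warmup length is a fraction of the total number of training steps.
warmup_steps = round(train_max_steps * warmup_proportion)


def learning_rate(step: int) -> float:
    """Hypothetical schedule: linear warmup, then linear decay to end_learning_rate."""
    if step < warmup_steps:
        return initial_learning_rate * step / warmup_steps
    # After decay_steps the rate stays at end_learning_rate.
    progress = min(step, decay_steps) / decay_steps
    return end_learning_rate + (initial_learning_rate - end_learning_rate) * (1.0 - progress)


# Scheduled teacher forcing only starts if its start step is reached during training.
teacher_forcing_schedule_active = start_schedule_teacher_forcing <= train_max_steps

if __name__ == "__main__":
    for step in (0, 2000, 4000, 100000, 150000, 200000):
        print(step, learning_rate(step))
    print("scheduled teacher forcing active:", teacher_forcing_schedule_active)
```

Under these assumptions the rate reaches its floor at step 150000 and stays there for the last 50000 steps, which matches the comment that `decay_steps` should be smaller than `train_max_steps`.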
@@ -0,0 +1,111 @@
+{
+ "cells": [
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import os\n",
+    "\n",
+    "import numpy as np\n",
+    "import soundfile as sf\n",
+    "from pathlib import Path\n",
+    "from shutil import copyfile\n",
+    "from tqdm import tqdm\n",
+    "\n",
+    "input_dataset_path = \"[your_local_path]/synpaflex-corpus/v0.1/\"\n",
+    "reorganized_dataset_path = \"../synpaflex/\"\n",
+    "\n",
+    "maximal_duration = 12 # maximal audio file duration in seconds\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "wav_dir = os.path.join(reorganized_dataset_path, \"wavs/\")\n",
+    "os.makedirs(wav_dir, exist_ok=True)\n",
+    "data = []\n",
+    "total_duration = 0\n",
+    "\n",
+    "# Precomputing walk_count for tqdm\n",
+    "walk_count = 0\n",
+    "for subdir, dirs, files in os.walk(input_dataset_path):\n",
+    "    walk_count += 1\n",
+    "\n",
+    "# walk through dataset\n",
+    "for subdir, dirs, files in tqdm(os.walk(input_dataset_path), total=walk_count, bar_format='Data Reorganization : {l_bar}{bar}|'):\n",
+    "    for filename in files:\n",
+    "        filepath = os.path.join(subdir, filename)\n",
+    "\n",
+    "        # read wav files\n",
+    "        if filepath.endswith(\".wav\"):\n",
+    "            try:\n",
+    "                wav, sr = sf.read(filepath)\n",
+    "                duration = len(wav) / sr\n",
+    "                \n",
+    "                # Only keep files with shorter durations than maximal_duration\n",
+    "                if duration <= maximal_duration:\n",
+    "                    total_duration += duration\n",
+    "                    path = Path(filepath)\n",
+    "                    current_path = Path(path.parent.absolute())\n",
+    "                    \n",
+    "                    # find corresponding text file\n",
+    "                    txt_file_path = os.path.join(current_path, \"txt\", filename.replace('.wav','.txt'))\n",
+    "                    if not os.path.exists(txt_file_path):\n",
+    "                        parent_path = Path(current_path.parent.absolute())\n",
+    "                        txt_file_path = os.path.join(parent_path, \"txt\", filename.replace('.wav', '.txt'))\n",
+    "                        if not os.path.exists(txt_file_path):\n",
+    "                            continue\n",
+    "                    norm_text_file_path = txt_file_path.replace(\".txt\", \"_norm.txt\")\n",
+    "                    text = Path(txt_file_path).read_text()\n",
+    "                    if os.path.exists(norm_text_file_path):\n",
+    "                        norm_text = Path(norm_text_file_path).read_text()\n",
+    "                    else : \n",
+    "                        norm_text = text\n",
+    "                    \n",
+    "                    # ignore file if text contains digits, otherwise copy wav file and keep metadata to memory \n",
+    "                    if not any(ch.isdigit() for ch in text):\n",
+    "                        data_line = filename.replace(\".wav\", \"\") + '|' + text + '|' + norm_text\n",
+    "                        data.append(data_line)\n",
+    "                        copyfile(filepath, os.path.join(wav_dir, filename))\n",
+    "\n",
+    "            except RuntimeError:\n",
+    "                print(filepath + \" not recognized and ignored.\")  \n",
+    "\n",
+    "# save metadata\n",
+    "with open(os.path.join(reorganized_dataset_path, \"synpaflex.txt\"), 'w') as f:\n",
+    "    for item in data:\n",
+    "        f.write(\"%s\\n\" % item)\n",
+    "\n",
+    "# display reorganized dataset total duration\n",
+    "duration_hours = total_duration / 3600\n",
+    "print(f\"total duration = {duration_hours:.2f} hours\")"
+   ]
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "Python 3",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.8.5"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 4
+}
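
Review note: the transcript lookup in the notebook above (look for `txt/<name>.txt` next to the wav file, then one directory up, prefer a `_norm.txt` sibling, skip texts containing digits) can be isolated into small functions that are easier to test. The sketch below is one possible refactoring, not the notebook's code. The function names `find_transcript` and `build_metadata_line` are hypothetical, and reading files as UTF-8 and stripping surrounding whitespace are assumptions that the notebook does not make.

```python
from pathlib import Path
from typing import Optional


def find_transcript(wav_path: Path) -> Optional[Path]:
    """Look for txt/<stem>.txt beside the wav file, then one directory up."""
    for base in (wav_path.parent, wav_path.parent.parent):
        candidate = base / "txt" / (wav_path.stem + ".txt")
        if candidate.exists():
            return candidate
    return None


def build_metadata_line(wav_path: Path) -> Optional[str]:
    """Return 'id|text|norm_text', or None if the file should be skipped."""
    txt_path = find_transcript(wav_path)
    if txt_path is None:
        return None  # no transcript found: skip this file only
    # Assumption: transcripts are UTF-8; surrounding whitespace is dropped.
    text = txt_path.read_text(encoding="utf-8").strip()
    norm_path = txt_path.with_name(txt_path.stem + "_norm.txt")
    if norm_path.exists():
        norm_text = norm_path.read_text(encoding="utf-8").strip()
    else:
        norm_text = text
    # Same filter as the notebook: ignore utterances whose text contains digits.
    if any(ch.isdigit() for ch in text):
        return None
    return f"{wav_path.stem}|{text}|{norm_text}"
```

Returning `None` for a missing transcript skips only that file, so the remaining files in the same directory are still processed.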
@@ -397,7 +397,7 @@
    "name": "python",
    "nbconvert_exporter": "python",
    "pygments_lexer": "ipython3",
-   "version": "3.7.7"
+   "version": "3.8.5"
   }
  },
  "nbformat": 4,
 
@@ -0,0 +1,19 @@
+###########################################################
+#                FEATURE EXTRACTION SETTING               #
+###########################################################
+sampling_rate: 22050     # Sampling rate.
+fft_size: 1024           # FFT size.
+hop_size: 256            # Hop size. (fixed value, don't change)
+win_length: null         # Window length.
+                         # If set to null, it will be the same as fft_size.
+window: "hann"           # Window function.
+num_mels: 80             # Number of mel basis.
+fmin: 80                 # Minimum freq in mel basis calculation.
+fmax: 7600               # Maximum frequency in mel basis calculation.
+global_gain_scale: 1.0   # Will be multiplied to all of waveform.
+trim_silence: true       # Whether to trim the start and end of silence.
+trim_threshold_in_db: 20 # Need to tune carefully if the recording is not good.
+trim_frame_size: 2048    # Frame size in trimming.
+trim_hop_size: 512       # Hop size in trimming.
+format: "npy"            # Feature file format. Only "npy" is supported.
+
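
Review note: to make the feature extraction settings above more concrete, the sketch below derives a few quantities from them in plain Python. It assumes a centered STFT, for which the number of frames is `1 + num_samples // hop_size` (the convention used by `librosa.stft` with `center=True`); whether the preprocessing pipeline uses exactly this framing is an assumption. The 12 second clip length comes from `maximal_duration` in the preparation notebook, and the function name `num_frames` is hypothetical.

```python
# Values copied from preprocess/synpaflex_preprocess.yaml.
sampling_rate = 22050
fft_size = 1024
hop_size = 256
win_length = None   # null in the YAML
fmax = 7600

# "If set to null, it will be the same as fft_size."
effective_win_length = fft_size if win_length is None else win_length

frame_shift_ms = 1000 * hop_size / sampling_rate          # about 11.6 ms
window_ms = 1000 * effective_win_length / sampling_rate   # about 46.4 ms
frames_per_second = sampling_rate / hop_size              # about 86 frames

# The mel filterbank cannot extend past the Nyquist frequency.
assert fmax <= sampling_rate / 2


def num_frames(num_samples: int) -> int:
    """Hypothetical helper: frame count for a centered STFT (assumed framing)."""
    return 1 + num_samples // hop_size


if __name__ == "__main__":
    longest_clip_samples = 12 * sampling_rate   # maximal_duration = 12 s
    print(f"frame shift: {frame_shift_ms:.2f} ms, window: {window_ms:.2f} ms")
    print("mel frames for a 12 s clip:", num_frames(longest_clip_samples))
```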