iSTFTNet: Fast and Lightweight Mel-spectrogram Vocoder Incorporating Inverse Short-time Fourier Transform
This repository is based on the open-source implementation of iSTFTNet (model C8C8I). Our contributions to the repository:
- shared the weights of the model we trained on a robust internal dataset consisting of Russian speech recorded in different acoustic conditions at a 22050 Hz sample rate;
- added loguru & wandb logging;
- added a Dockerfile for faster environment setup;
- updated the code with several scripts to compute mel-spectrograms and convert the model to .onnx.
Note: according to our tests, iSTFTNet shows even higher synthesis quality than HiFi-GAN, with a 2x RTF speedup.
To set up the environment with Docker:

```bash
bash run_docker.sh
```

Alternatively, create a conda environment and install the requirements:

```bash
conda create --name istft-vocoder python=3.10
pip install -r requirements.txt
```

Download the checkpoints:

```bash
bash download_checkpoints.sh
```

Your file structure should look like:
```
├── data
│   ├── awesome_checkpoints
│   │   ├── do_00975000
│   │   ├── g_00975000
│   │   └── g_00975000.onnx
│   ├── deep_voices_mel
│   │   ├── andrey_preispolnilsya.npy
│   │   ├── egor_dora.npy
│   │   └── kirill_lunch.npy
│   └── deep_voices_wav
│       ├── andrey_preispolnilsya.wav
│       ├── egor_dora.wav
│       └── kirill_lunch.wav
```

Note: we trained the model with batch size 16 using 4 A100 GPUs for ~1M steps.
| Filename | Description |
|---|---|
| do_00975000 | Discriminator checkpoint. |
| g_00975000 | Generator checkpoint. |
| g_00975000.onnx | .onnx model. |
| deep_voices_mel | Directory with 3 mel-spectrograms of the test audios. |
| deep_voices_wav | Directory with 3 original audios – voices of our team; these audios were not seen during training. |
To run inference with the downloaded test files:

```bash
python -m src.inference
```

To run inference with your own files or parameters:
| Parameter | Description |
|---|---|
| config_path | Path to config.json. |
| input_wavs_dir | Directory with your wav files to synthesize; default is /app/data/deep_voices_wavs. |
| input_mels_dir | Directory with pre-computed mel-spectrograms to synthesize from. Note that mel-spectrograms should be computed with the compute_mels_from_audio.py script; default is /app/data/deep_voices_mels. |
| compute_mels | Pass --no-compute_mels if you have precomputed mels; if not specified, mels will be computed from the audios in input_wavs_dir. |
| onnx_inference | If specified, checkpoint_file should be a .onnx file. |
| onnx_provider | Used if onnx_inference is specified; the default provider is CPUExecutionProvider for CPU inference. |
| checkpoint_file | Path to the generator checkpoint or .onnx model. |
| output_dir | Path where generated wavs will be saved, default is /app/data/generated_files. |
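For example, the following calls are a sketch using the default paths from the table above (the flag spellings follow the parameter names; adjust the paths to your setup):

```bash
# Inference with the PyTorch generator checkpoint on your own wavs
python -m src.inference \
    --config_path config.json \
    --input_wavs_dir /app/data/deep_voices_wav \
    --checkpoint_file /app/data/awesome_checkpoints/g_00975000 \
    --output_dir /app/data/generated_files

# Inference with the .onnx model on precomputed mel-spectrograms
python -m src.inference \
    --onnx_inference \
    --onnx_provider CPUExecutionProvider \
    --no-compute_mels \
    --input_mels_dir /app/data/deep_voices_mel \
    --checkpoint_file /app/data/awesome_checkpoints/g_00975000.onnx \
    --output_dir /app/data/generated_files
```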
To train the model:
- Login to your wandb account from the CLI: `wandb login`.
- Create `train.txt` and `val.txt` with `create_manifests.py`.
- Run `python -m src.train`.
Parameters for training and finetuning the model:
| Parameter | Description |
|---|---|
| input_training_file | Path to the train.txt. |
| input_validation_file | Path to the val.txt. |
| config_path | Path to the config.json. |
| input_mels_dir | Path to the directory with mel-spectrograms, specify if you would like to train / finetune the model on Acoustic Model outputs. |
| fine_tuning | If specified, mel-spectrograms will be looked up in input_mels_dir. |
| checkpoint_path | Path to the directory with checkpoints; to finetune the model on your data starting from our checkpoints, use /app/new_checkpoints. |
| training_epochs | N epochs to train the model. |
| wandb_log_interval | Interval (in steps) at which the training loss is logged to wandb. |
| checkpoint_interval | Interval (in steps) at which checkpoints are saved. |
| log_audio_interval | Interval (in steps) at which generated audios from the validation dataset are logged to wandb. |
| validation_interval | Interval (in steps) at which validation is run and the validation loss is logged to wandb. |
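For example, a finetuning run from our checkpoints could look like this (a sketch: the epoch and interval values are illustrative, and the flags follow the table above):

```bash
# Finetune from our checkpoints; mels are read from input_mels_dir
# because --fine_tuning is set. Epoch and interval values are illustrative.
python -m src.train \
    --config_path config.json \
    --input_training_file train.txt \
    --input_validation_file val.txt \
    --checkpoint_path /app/new_checkpoints \
    --fine_tuning \
    --input_mels_dir /app/data/deep_voices_mel \
    --training_epochs 100 \
    --checkpoint_interval 5000 \
    --validation_interval 1000
```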
Note: for correct inference and finetuning from our checkpoints, the parameters num_mels, n_fft, hop_size, win_size, sampling_rate, fmin, and fmax should not be changed.
Find the instructions to run inference with the .onnx model in the Inference block above. To convert a trained model to .onnx:

```bash
python -m scripts.convert_to_onnx
```

| Parameter | Description |
|---|---|
| checkpoint_file | Path to the generator checkpoint. |
| config_path | Path to the config.json. |
| converted_model_path | Path where converted model will be saved, default is /app/istft_vocoder.onnx. |
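For example (a sketch: the checkpoint path is illustrative, and the flags follow the table above):

```bash
# Convert the generator checkpoint to .onnx at the default output path
python -m scripts.convert_to_onnx \
    --checkpoint_file /app/data/awesome_checkpoints/g_00975000 \
    --config_path config.json \
    --converted_model_path /app/istft_vocoder.onnx
```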
```bibtex
@inproceedings{kaneko2022istftnet,
  title={{iSTFTNet}: Fast and Lightweight Mel-Spectrogram Vocoder Incorporating Inverse Short-Time Fourier Transform},
  author={Takuhiro Kaneko and Kou Tanaka and Hirokazu Kameoka and Shogo Seki},
  booktitle={ICASSP},
  year={2022},
}
```

```bibtex
@misc{deepvk2023istft,
  author = {Diatlova, Daria},
  title = {istft-net},
  year = {2023},
  publisher = {GitHub},
  journal = {GitHub repository},
  howpublished = {https://github.com/deepvk/istft-net},
}
```