The accompanying code for the papers Differentiable Time-Varying Linear Prediction in the Context of End-to-End Analysis-by-Synthesis (accepted at Interspeech 2024) and Singing Voice Synthesis Using Differentiable LPC and Glottal-Flow-Inspired Wavetables (published at ISMIR 2023).
The following instructions are for the Interspeech 2024 paper. For the ISMIR 2023 paper, please refer to this readme.
- Download the VCTK 0.92 dataset from here.
- Extract the dataset to a directory, e.g., `data/vctk_raw`.
- Run the following command to resample the dataset to 24 kHz wave files. The resampled files will be saved in the target directory with the same structure as the original files.
python scripts/resample_dir.py data/vctk_raw data/vctk --suffix .flac --sr 24000
- Extract the fundamental frequency (F0). The F0 values will be saved as `.pv` files in the same directory as the original files, using a 5 ms hop size (an illustrative sketch of this step follows the list).
python scripts/wav2f0.py data/vctk --f0-floor 60
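If you want to see roughly what the F0 extraction step does in Python, here is a minimal sketch. It assumes `pyworld` as the F0 estimator and a plain-text `.pv` format with one F0 value per frame; the repository's `scripts/wav2f0.py` may differ in both respects, so treat this as an illustration only.

```python
# Illustrative only: frame-level F0 with a 5 ms hop and a 60 Hz floor.
# Assumes pyworld and a plain-text .pv format; scripts/wav2f0.py may differ.
from pathlib import Path

import numpy as np
import pyworld as pw
import soundfile as sf


def wav2f0(wav_path: str, f0_floor: float = 60.0, hop_ms: float = 5.0) -> None:
    x, sr = sf.read(wav_path)          # mono waveform as float64
    f0, _ = pw.harvest(x, sr, f0_floor=f0_floor, frame_period=hop_ms)
    out_path = Path(wav_path).with_suffix(".pv")
    np.savetxt(out_path, f0)           # one F0 value (Hz) per frame, 0 = unvoiced


# wav2f0("data/vctk/p225/p225_001_mic1.wav")  # example path; adjust to your layout
```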
Below is the command to train each model in the Interspeech paper.
python autoencode.py fit --config cfg/ae/vctk.yaml --model cfg/ae/decoder/{MODEL}.yaml --trainer.logger false
`{MODEL}` corresponds to the following models:

- `ddsp` $\rightarrow$ DDSP
- `nhv` $\rightarrow$ NHV (neural homomorphic vocoder)
- `world` $\rightarrow$ $\nabla$ WORLD
- `mlsa` $\rightarrow$ MLSA (differentiable Mel-cepstral synthesis filter)
- `golf-v1` $\rightarrow$ GOLF-v1
- `golf` $\rightarrow$ GOLF-ff
- `golf-precise` $\rightarrow$ GOLF-ss
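For example, to train the GOLF-ff model:
python autoencode.py fit --config cfg/ae/vctk.yaml --model cfg/ae/decoder/golf.yaml --trainer.logger false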
By default, checkpoints are automatically saved under the `checkpoints/` directory.
Feel free to remove `--trainer.logger false` and edit the logger settings in the configuration file `cfg/ae/vctk.yaml` to fit your needs.
Please check out the LightningCLI instructions here.
After training the models, you can evaluate them using the following command. Replace `{YOUR_CONFIG}` and `{YOUR_CHECKPOINT}` with the corresponding configuration file and checkpoint.
python autoencode.py test -c {YOUR_CONFIG}.yaml --ckpt_path {YOUR_CHECKPOINT}.ckpt --data.duration 2 --data.overlap 0 --seed_everything false --data.wav_dir data/vctk --data.batch_size 32 --trainer.logger false
For PESQ/FAD evaluation, you'll first need to store the synthesised waveforms in a directory. Replace `{YOUR_CONFIG}`, `{YOUR_CHECKPOINT}`, and `{YOUR_OUTPUT_DIR}` with the corresponding configuration file, checkpoint, and output directory.
python autoencode.py predict -c {YOUR_CONFIG}.yaml --ckpt_path {YOUR_CHECKPOINT}.ckpt --trainer.logger false --seed_everything false --data.wav_dir data/vctk --trainer.callbacks+=ltng.cli.MyPredictionWriter --trainer.callbacks.output_dir {YOUR_OUTPUT_DIR}
Make a new directory and copy the following eight speakers, which form the test set, from `data/vctk` (a small copy sketch follows the listing below).
data/vctk_test
├── p360
├── p361
├── p362
├── p363
├── p364
├── p374
├── p376
├── s5
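If you'd rather script the copy, a minimal sketch using the speaker IDs and paths listed above:

```python
# Copy the eight test speakers from data/vctk into data/vctk_test.
import shutil
from pathlib import Path

test_speakers = ["p360", "p361", "p362", "p363", "p364", "p374", "p376", "s5"]
src, dst = Path("data/vctk"), Path("data/vctk_test")
for spk in test_speakers:
    shutil.copytree(src / spk, dst / spk, dirs_exist_ok=True)
```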
Then, calculate the PESQ scores:
python eval_pesq.py data/vctk_test {YOUR_OUTPUT_DIR}
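`eval_pesq.py` scores each synthesised file against its reference. If you want to sanity-check a single pair yourself, a minimal sketch using the `pesq` package could look like the following; this is an illustration under my own assumptions (16 kHz wideband mode, torchaudio for loading and resampling), not the repository's script.

```python
# Illustrative only: wideband PESQ for one reference/synthesised pair.
# PESQ 'wb' mode expects 16 kHz input, so both files are resampled first.
import torchaudio
from pesq import pesq


def pesq_pair(ref_path: str, syn_path: str) -> float:
    ref, sr_ref = torchaudio.load(ref_path)
    syn, sr_syn = torchaudio.load(syn_path)
    ref = torchaudio.functional.resample(ref, sr_ref, 16000).squeeze(0).numpy()
    syn = torchaudio.functional.resample(syn, sr_syn, 16000).squeeze(0).numpy()
    n = min(len(ref), len(syn))        # trim to a common length before scoring
    return pesq(16000, ref[:n], syn[:n], "wb")


# score = pesq_pair("reference.wav", "synthesised.wav")  # hypothetical paths
```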
For the FAD scores:
python fad.py data/vctk_test {YOUR_OUTPUT_DIR}
We use fadtk and the Descript Audio Codec for the FAD evaluation.
Please use the checkpoints trained with `golf.yaml` for the GOLF-fs model. Append `--model.decoder.end_filter models.filters.LTVMinimumPhaseFilterPrecise` to the evaluation commands above (`test`/`predict`) to use the sample-wise filter.
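For example, the `test` command for GOLF-fs becomes:
python autoencode.py test -c {YOUR_CONFIG}.yaml --ckpt_path {YOUR_CHECKPOINT}.ckpt --data.duration 2 --data.overlap 0 --seed_everything false --data.wav_dir data/vctk --data.batch_size 32 --trainer.logger false --model.decoder.end_filter models.filters.LTVMinimumPhaseFilterPrecise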
Please use the following commands to evaluate the non-differentiable WORLD model.
python autoencode.py test -c cfg/ae/pyworld.yaml --data.wav_dir data/vctk
python autoencode.py predict -c cfg/ae/pyworld.yaml --trainer.logger false --seed_everything false --data.wav_dir data/vctk --trainer.callbacks+=ltng.cli.MyPredictionWriter --trainer.callbacks.output_dir {YOUR_OUTPUT_DIR}
The checkpoints we used for evaluation are provided here.
Use the following command to benchmark the real-time factor of the models. Replace `{YOUR_CONFIG}` and `{YOUR_CHECKPOINT}` with the corresponding configuration file and checkpoint. Add `--cuda` to benchmark on GPU.
python test_rtf.py {YOUR_CONFIG}.yaml {YOUR_CHECKPOINT}.ckpt {EXAMPLE_FILE}.wav
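The real-time factor (RTF) is the wall-clock synthesis time divided by the duration of the synthesised audio, so values below 1 mean faster than real time. A minimal sketch of the measurement (illustrative only; `test_rtf.py` is the authoritative benchmark):

```python
# Illustrative only: RTF = wall-clock synthesis time / audio duration.
import time


def real_time_factor(synthesise, duration_sec: float) -> float:
    start = time.perf_counter()
    synthesise()                        # synthesise `duration_sec` seconds of audio
    elapsed = time.perf_counter() - start
    return elapsed / duration_sec       # < 1 means faster than real time
```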
- Individual FAD on each test speaker and PESQ scores
- MCD and MSS comparison table on W&B
- Interspeech Figure 2 and some ablation observations
- Script to synthesise listening test samples
- Script to calculate MUSHRA scores and ANOVA
- Differentiable LP in PyTorch
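The core operation behind GOLF is time-varying linear prediction: an all-pole synthesis filter whose coefficients change over time. Below is a naive, illustrative PyTorch reference of the sample-wise recursion (the coefficient layout and zero initial state are my assumptions); the actual training code uses a far more efficient differentiable implementation, so this only conveys the idea.

```python
# Illustrative only: naive sample-wise time-varying LP synthesis.
#   x: (T,) excitation signal
#   a: (T, p) LP coefficients [a_1, ..., a_p] for every output sample
#   y[n] = x[n] - sum_k a[n, k] * y[n - k], with zero initial state
import torch


def tv_lp_synthesis(x: torch.Tensor, a: torch.Tensor) -> torch.Tensor:
    T, p = a.shape
    hist = x.new_zeros(p)               # [y[n-1], ..., y[n-p]]
    out = []
    for n in range(T):
        y_n = x[n] - torch.dot(a[n], hist)
        out.append(y_n)
        hist = torch.cat([y_n.unsqueeze(0), hist[:-1]])  # shift the history
    return torch.stack(out)


# The Python loop above is differentiable but very slow; it is only meant to
# show the recursion, not to be used for training.
```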
If you find this code useful, please consider citing the following papers:
@inproceedings{ycy2023golf,
  title     = {Singing Voice Synthesis Using Differentiable LPC and Glottal-Flow-Inspired Wavetables},
  author    = {Yu, Chin-Yun and Fazekas, György},
  booktitle = {Proc. International Society for Music Information Retrieval},
  year      = {2023},
  pages     = {667--675},
  doi       = {10.5281/zenodo.10265377},
}

@inproceedings{ycy2024golf,
  title     = {Differentiable Time-Varying Linear Prediction in the Context of End-to-End Analysis-by-Synthesis},
  author    = {Chin-Yun Yu and György Fazekas},
  booktitle = {Proc. Interspeech},
  year      = {2024},
  pages     = {1820--1824},
  doi       = {10.21437/Interspeech.2024-1187},
}