Improve GPU utilisation by extracting phoneme alignments
TL;DR: Monotonic alignment search is not parallelisable. Once the alignment is learnt, save it and train on it.
Once Matcha-TTS has been trained long enough that its alignment plots are no longer changing drastically, you can extract these alignments and train on them directly. This improves GPU utilisation and makes training even faster, since you can increase the batch size and let the GPU parallelise over batches without the overhead of the alignment search.
Another benefit is that you get phoneme-wise alignments measured in mel frames; multiply by the hop size and divide by the sample rate (defaults: 256 and 22050 Hz) to convert them to seconds, which can be useful for analysis.
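As a minimal sketch of this conversion (assuming the default hop size of 256 samples and sample rate of 22050 Hz):

# A minimal sketch of the frames-to-seconds conversion described above,
# assuming the default hop size (256) and sample rate (22050 Hz).
HOP_LENGTH = 256      # samples per mel frame
SAMPLE_RATE = 22050   # Hz

def frames_to_seconds(n_frames: int) -> float:
    """Convert a duration in mel frames to seconds."""
    return n_frames * HOP_LENGTH / SAMPLE_RATE

print(frames_to_seconds(7))  # ~0.081 s for a phoneme lasting 7 frames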
If the dataset is structured as
data/
└── LJSpeech-1.1
    ├── metadata.csv
    ├── README
    ├── test.txt
    ├── train.txt
    ├── val.txt
    └── wavs/
then you can extract the phoneme-level alignments from a trained Matcha-TTS model using:
python matcha/utils/get_durations_from_trained_model.py -i <dataset_yaml> -c <checkpoint>
Example:
python matcha/utils/get_durations_from_trained_model.py -i ljspeech.yaml -c matcha_ljspeech.ckpt
or simply:
matcha-tts-get-durations -i ljspeech.yaml -c matcha_ljspeech.ckpt
This will create a folder durations right next to wavs:
data/
└── LJSpeech-1.1
    ├── durations   # Here
    ├── metadata.csv
    ├── README
    ├── test.txt
    ├── train.txt
    ├── val.txt
    └── wavs/
Each file will have .npy alignments and a JSON alignment output, which will look similar to:
[
  {
    "p": {
      "starttime": 0,
      "endtime": 5,
      "duration": 5
    }
  },
  {
    "ɹ": {
      "starttime": 5,
      "endtime": 8,
      "duration": 3
    }
  },
  {
    "ˈ": {
      "starttime": 8,
      "endtime": 11,
      "duration": 3
    }
  },
  {
    "ɪ": {
      "starttime": 11,
      "endtime": 15,
      "duration": 4
    }
  },
  {
    "n": {
      "starttime": 15,
      "endtime": 21,
      "duration": 6
    }
  },
  {
    "t": {
      "starttime": 21,
      "endtime": 28,
      "duration": 7
    }
  }
]
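As a rough sketch, here is how such a JSON alignment could be read back and converted to phoneme timings in seconds. The file path below is hypothetical (actual file names follow your utterance IDs), and the default hop size and sample rate are assumed:

import json

HOP_LENGTH = 256      # samples per mel frame (default)
SAMPLE_RATE = 22050   # Hz (default)

# Hypothetical path; actual file names follow your utterance IDs.
with open("data/LJSpeech-1.1/durations/LJ001-0001.json") as f:
    alignment = json.load(f)

for entry in alignment:
    for phoneme, times in entry.items():
        start_s = times["starttime"] * HOP_LENGTH / SAMPLE_RATE
        end_s = times["endtime"] * HOP_LENGTH / SAMPLE_RATE
        print(f"{phoneme}\t{start_s:.3f}s -> {end_s:.3f}s")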
In the dataset config, turn on load_durations.
Example: ljspeech.yaml
load_durations: True
or see an example in configs/experiment/ljspeech_from_durations.yaml
Training with precomputed durations lets you use larger batch sizes and utilise the GPUs more fully, since the CPU-based monotonic alignment search is removed, effectively improving parallelisation during training.