Improve GPU utilisation by extracting phoneme alignments

TL;DR: Monotonic alignment search is not parallelised. Once the alignments are learnt, save them and train on them.

Once Matcha-TTS has trained long enough that its alignment plots no longer change drastically, you can extract these alignments and train on them directly. This improves GPU utilisation and makes training even faster, because you can increase the batch size with little extra overhead: the remaining computation is parallelised over the batch on the GPU.

Another benefit is that you get phoneme-level alignments measured in mel frames. Multiply a frame count by the hop size and divide by the sample rate (defaults: 256 and 22050) to convert it to seconds. This could be useful for analysis.
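The frame-to-seconds conversion described above can be sketched as follows. This is a minimal illustration, not part of Matcha-TTS: the function name frames_to_seconds and its parameter names are hypothetical, and the defaults are the hop size and sample rate quoted above.

```python
def frames_to_seconds(n_frames, hop_length=256, sample_rate=22050):
    """Convert a duration in mel frames to seconds.

    hop_length and sample_rate default to the values quoted in the text
    (256 and 22050); pass your own if your dataset config differs.
    """
    return n_frames * hop_length / sample_rate


# A phoneme that spans 5 mel frames lasts 5 * 256 / 22050 seconds.
print(f"{frames_to_seconds(5):.4f} s")
```

If your dataset uses a different hop size or sample rate, pass those values explicitly rather than relying on the defaults.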

Extract phoneme alignments from Matcha-TTS

If the dataset is structured as

data/
└── LJSpeech-1.1
    ├── metadata.csv
    ├── README
    ├── test.txt
    ├── train.txt
    ├── val.txt
    └── wavs/

Then you can extract the phoneme-level alignments from a trained Matcha-TTS model using:

python matcha/utils/get_durations_from_trained_model.py -i <dataset_yaml> -c <checkpoint>

Example:

python matcha/utils/get_durations_from_trained_model.py -i ljspeech.yaml -c matcha_ljspeech.ckpt

or simply:

matcha-tts-get-durations -i ljspeech.yaml -c matcha_ljspeech.ckpt

This will create a folder named durations right next to wavs:

data/
└── LJSpeech-1.1
    ├── durations # Here
    ├── metadata.csv
    ├── README
    ├── test.txt
    ├── train.txt
    ├── val.txt
    └── wavs/

For each file there will be a .npy alignment and a JSON alignment output, which will look similar to:

[
    {
        "p": {
            "starttime": 0,
            "endtime": 5,
            "duration": 5
        }
    },
    {
        "ɹ": {
            "starttime": 5,
            "endtime": 8,
            "duration": 3
        }
    },
    {
        "ˈ": {
            "starttime": 8,
            "endtime": 11,
            "duration": 3
        }
    },
    {
        "ɪ": {
            "starttime": 11,
            "endtime": 15,
            "duration": 4
        }
    },
    {
        "n": {
            "starttime": 15,
            "endtime": 21,
            "duration": 6
        }
    },
    {
        "t": {
            "starttime": 21,
            "endtime": 28,
            "duration": 7
        }
    }
]
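For analysis, a JSON alignment in the shape shown above can be converted to seconds with a few lines of Python. This is a sketch under assumptions: it treats starttime and endtime as mel-frame counts (as the example values suggest), uses the default hop size and sample rate of 256 and 22050, and the helper name alignment_to_seconds is hypothetical rather than something Matcha-TTS provides.

```python
import json


def alignment_to_seconds(entries, hop_length=256, sample_rate=22050):
    """Return (phoneme, start_s, end_s) tuples from a list of
    {phoneme: {"starttime", "endtime", "duration"}} entries.

    Assumes the stored times are mel-frame counts.
    """
    seconds_per_frame = hop_length / sample_rate
    rows = []
    for entry in entries:
        # Each list item holds a single phoneme key mapped to its timing.
        for phoneme, timing in entry.items():
            rows.append(
                (
                    phoneme,
                    timing["starttime"] * seconds_per_frame,
                    timing["endtime"] * seconds_per_frame,
                )
            )
    return rows


# The first two entries of the example output, inlined for illustration.
sample = json.loads(
    '[{"p": {"starttime": 0, "endtime": 5, "duration": 5}},'
    ' {"ɹ": {"starttime": 5, "endtime": 8, "duration": 3}}]'
)

for phoneme, start_s, end_s in alignment_to_seconds(sample):
    print(f"{phoneme}\t{start_s:.3f}\t{end_s:.3f}")
```

To use it on real output, replace the inlined sample with json.load on one of the JSON files in the durations folder.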

Train using extracted alignments

In the dataset config, turn on load_durations. Example: ljspeech.yaml

load_durations: True

or see an example in configs/experiment/ljspeech_from_durations.yaml

Training this way allows larger batch sizes and makes better use of multiple GPUs, because the CPU-based monotonic alignment search is removed, which effectively improves parallelisation during training.
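To see why stored durations can stand in for the search, note that a list of per-phoneme frame counts fully determines a monotonic alignment between phonemes and mel frames. The sketch below expands durations into a hard alignment matrix in plain Python. It is illustrative only: the function name durations_to_alignment is hypothetical, and the actual Matcha-TTS implementation may build the alignment differently.

```python
def durations_to_alignment(durations):
    """Expand per-phoneme frame counts into a 0/1 alignment matrix.

    Row i marks the mel frames assigned to phoneme i. The result has
    shape (len(durations), sum(durations)) and is monotonic: each frame
    belongs to exactly one phoneme, and phonemes appear in order.
    """
    total_frames = sum(durations)
    alignment = [[0] * total_frames for _ in durations]
    start = 0
    for i, n_frames in enumerate(durations):
        for frame in range(start, start + n_frames):
            alignment[i][frame] = 1
        start += n_frames
    return alignment


# Three phonemes lasting 2, 1 and 3 frames respectively.
for row in durations_to_alignment([2, 1, 3]):
    print(row)
```

Because this expansion is a simple lookup with no search, it adds negligible cost per batch.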
