We provide an implementation of the unit-based HiFi-GAN vocoder with a duration prediction module used in the direct speech-to-speech translation models of [1, 2].
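At a high level, the vocoder embeds a sequence of discrete units, predicts a duration for each unit, repeats each unit embedding by its predicted duration, and feeds the expanded sequence to a HiFi-GAN generator. The sketch below only illustrates this flow; the class and parameter names (`UnitUpsampler`, `num_units`, `embed_dim`) are placeholders and do not correspond to the code in this repository.

```python
import torch
import torch.nn as nn

class UnitUpsampler(nn.Module):
    """Minimal sketch of unit embedding + duration-based upsampling.
    Shapes and module names are assumptions, not this repo's actual code."""

    def __init__(self, num_units=100, embed_dim=128):
        super().__init__()
        self.embedding = nn.Embedding(num_units, embed_dim)
        # Small conv stack predicting one log-duration per unit,
        # in the spirit of FastSpeech-style duration predictors.
        self.duration_predictor = nn.Sequential(
            nn.Conv1d(embed_dim, embed_dim, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.Conv1d(embed_dim, 1, kernel_size=1),
        )

    def forward(self, units):                       # units: (T,) deduplicated unit ids
        emb = self.embedding(units)                 # (T, C)
        log_dur = self.duration_predictor(emb.t().unsqueeze(0)).view(-1)  # (T,)
        dur = torch.clamp(torch.round(torch.exp(log_dur)), min=1).long()
        frames = torch.repeat_interleave(emb, dur, dim=0)                 # (sum(dur), C)
        return frames  # frame-level conditioning for the HiFi-GAN generator

# toy usage: three deduplicated units expanded to frame level
print(UnitUpsampler()(torch.tensor([5, 9, 2])).shape)
```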
An example of training with HuBERT units:

```
python -m torch.distributed.launch --nproc_per_node <NUM_GPUS> \
    -m examples.speech_to_speech_translation.train \
    --checkpoint_path checkpoints/lj_hubert100_dur1.0 \
    --config examples/speech_to_speech_translation/configs/hubert100_dw1.0.json
```
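The config name `hubert100_dw1.0.json` refers to 100 HuBERT units and, presumably, a duration-loss weight of 1.0: during training the duration predictor is supervised with the observed length of each run of repeated units, and its loss is added to the vocoder objective with that weight. Below is a minimal sketch of such a weighted duration loss; the exact formulation in this repository may differ, and the function name is illustrative.

```python
import torch
import torch.nn.functional as F

def duration_loss(pred_log_dur, target_dur, dur_weight=1.0):
    """Illustrative weighted duration loss: MSE between predicted and
    ground-truth durations in the log domain. An assumption, not the
    exact objective implemented in this repo."""
    target_log_dur = torch.log(target_dur.float() + 1.0)
    return dur_weight * F.mse_loss(pred_log_dur, target_log_dur)

# example: three units observed for 4, 2, and 7 frames respectively
loss = duration_loss(torch.zeros(3), torch.tensor([4, 2, 7]), dur_weight=1.0)
```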
To generate with duration prediction, simply run:
```
python -m examples.speech_to_speech_translation.inference \
    --checkpoint_file checkpoints/lj_hubert100_dur1.0 \
    -n 10 \
    --output_dir generations \
    --num-gpu <NUM_GPUS> \
    --input_code_file ./datasets/LJSpeech/hubert100/val.txt \
    --dur-prediction
```
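With `--dur-prediction`, the vocoder predicts how many frames each unit should span, so the input unit sequences are typically deduplicated (consecutive repeats collapsed) as in [1]. A small sketch of that collapsing step follows; the helper and the space-separated code format shown here are assumptions, not part of the repository's CLI.

```python
import itertools

def dedup_units(units):
    """Collapse consecutive duplicate units, e.g. [5, 5, 5, 9, 9, 2] -> [5, 9, 2].
    With duration prediction enabled, the vocoder re-expands each unit itself;
    this helper is illustrative and not part of the repository's CLI."""
    return [k for k, _ in itertools.groupby(units)]

# example: a line of space-separated unit ids (format assumed)
line = "5 5 5 9 9 2"
print(dedup_units([int(u) for u in line.split()]))  # -> [5, 9, 2]
```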
We also provide a fairseq implementation for inference; see "Convert unit sequences to waveform" in the fairseq speech-to-speech translation example.
[1] Direct speech-to-speech translation with discrete units
[2] Textless Speech-to-Speech Translation on Real Data