This repository has been archived by the owner on Nov 1, 2024. It is now read-only.

Unit-based HiFi-GAN Vocoder with Duration Prediction

We provide an implementation of the unit-based HiFi-GAN vocoder with a duration prediction module, as used in the direct speech-to-speech translation models in [1, 2].
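As background, the duration prediction module in [1] lets the vocoder take "reduced" unit sequences, in which consecutive repeats of the same unit are collapsed, and upsample each unit by a predicted duration before synthesis. The sketch below only illustrates that collapse and expand relationship on a toy sequence. The function names `reduce_units` and `expand_units` are hypothetical and not part of this repository, and the durations here are the ground-truth run lengths, whereas the model predicts them with a learned module.

```python
from itertools import groupby
from typing import List, Tuple


def reduce_units(units: List[int]) -> Tuple[List[int], List[int]]:
    """Collapse runs of repeated units; return (reduced units, run lengths)."""
    reduced: List[int] = []
    durations: List[int] = []
    for unit, run in groupby(units):
        reduced.append(unit)
        durations.append(sum(1 for _ in run))
    return reduced, durations


def expand_units(reduced: List[int], durations: List[int]) -> List[int]:
    """Repeat each unit by its duration (in frames) to get frame-level units."""
    return [unit for unit, dur in zip(reduced, durations) for _ in range(dur)]


# Toy frame-level unit sequence (values are arbitrary, for illustration only).
frame_units = [71, 71, 71, 12, 12, 45, 45, 45, 45, 12]
reduced, durations = reduce_units(frame_units)

# Expanding with the true run lengths recovers the original sequence.
assert expand_units(reduced, durations) == frame_units
```

With a trained vocoder, the run lengths would be replaced by the durations the module predicts for each reduced unit, so the expanded sequence generally differs from the original frame-level units.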

Training

# an example of training with HuBERT units

python -m torch.distributed.launch --nproc_per_node <NUM_GPUS> \
    -m examples.speech_to_speech_translation.train \
    --checkpoint_path checkpoints/lj_hubert100_dur1.0 \
    --config examples/speech_to_speech_translation/configs/hubert100_dw1.0.json

Inference

To generate with duration prediction, simply run:

python -m examples.speech_to_speech_translation.inference \
    --checkpoint_file checkpoints/lj_hubert100_dur1.0 \
    -n 10 \
    --output_dir generations \
    --num-gpu <NUM_GPUS> \
    --input_code_file ./datasets/LJSpeech/hubert100/val.txt \
    --dur-prediction

fairseq

We also provide an implementation in fairseq for inference. See "Convert unit sequences to waveform" in the example.

References

[1] Direct speech-to-speech translation with discrete units
[2] Textless Speech-to-Speech Translation on Real Data