We build and evaluate generative speech2speech systems using Log Mel Filterbank, Modified CPC, HuBERT Base and Wav2Vec 2.0 Large. Each system is composed of three components: speech2unit, ulm and unit2speech. The models and usage of each component are described in its own sub-directory; see the links below.
The speech-to-unit model (speech2unit) quantizes raw speech into learned discrete speech units. More details
The unit language model (ulm) is a generative language model trained on discrete speech units. More details
The unit-to-speech model (unit2speech) synthesizes speech from discrete speech units. More details
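To make "quantizing raw speech into discrete units" concrete, here is a minimal sketch. It assumes quantization means assigning each frame-level feature vector to its nearest centroid, as a k-means codebook would. The names `quantize` and `collapse_repeats`, the toy centroids, and the repeat-collapsing step are illustrative assumptions, not this repository's API.

```python
from typing import List, Sequence

Vector = Sequence[float]


def quantize(frames: Sequence[Vector], centroids: Sequence[Vector]) -> List[int]:
    """Map each feature frame to the index of its nearest centroid (hypothetical helper)."""

    def sq_dist(a: Vector, b: Vector) -> float:
        # Squared Euclidean distance between two equal-length vectors.
        return sum((x - y) ** 2 for x, y in zip(a, b))

    return [
        min(range(len(centroids)), key=lambda k: sq_dist(frame, centroids[k]))
        for frame in frames
    ]


def collapse_repeats(units: Sequence[int]) -> List[int]:
    """Drop consecutive duplicate units (an assumed, optional post-processing step)."""
    collapsed: List[int] = []
    for unit in units:
        if not collapsed or collapsed[-1] != unit:
            collapsed.append(unit)
    return collapsed


# Toy 2-D "features" and a 3-entry codebook, invented for illustration only.
centroids = [(0.0, 0.0), (1.0, 1.0), (4.0, 4.0)]
frames = [(0.1, -0.1), (0.2, 0.0), (0.9, 1.2), (3.8, 4.1), (4.2, 3.9)]

units = quantize(frames, centroids)
print(units)                    # [0, 0, 1, 2, 2]
print(collapse_repeats(units))  # [0, 1, 2]
```

In a real system the frames would come from one of the encoders listed above and the centroids from a trained codebook. The resulting unit sequence is what a unit language model would consume and what a unit-to-speech model would turn back into audio.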
We show how to compute the ASR-based metrics, as well as the zero-shot metrics proposed in our paper, here.
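As an illustration of what an ASR-based metric can look like, the sketch below computes word error rate (WER) between a reference transcript and an ASR transcript of generated speech. WER is used here only as a familiar example, and `edit_distance` and `word_error_rate` are hypothetical helpers, not functions from this repository. The metrics actually reported are defined in the linked instructions.

```python
from typing import List, Sequence


def edit_distance(ref: Sequence[str], hyp: Sequence[str]) -> int:
    """Levenshtein distance between two token sequences (row-by-row dynamic programming)."""
    prev: List[int] = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(
                min(
                    prev[j] + 1,             # deletion of a reference token
                    curr[j - 1] + 1,         # insertion of a hypothesis token
                    prev[j - 1] + (r != h),  # substitution (free if tokens match)
                )
            )
        prev = curr
    return prev[-1]


def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = word-level edit distance divided by the number of reference words."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference transcript must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


# Example: one word dropped out of six reference words.
wer = word_error_rate("the cat sat on the mat", "the cat sat on mat")
print(round(wer, 3))
```

Lower is better: a WER of 0 means the ASR transcript of the synthesized audio matches the reference exactly.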
We share two tools: one resynthesizes a given spoken utterance, and the other generates novel spoken language from a spoken prompt. More details