Self-Supervised LM-Based Zero-Shot Voice Conversion
♠︎ Model | ♣︎ Github | ♥︎ Paper | ♦︎ Demo
GenVC is an open-source, language model-based zero-shot voice conversion system that leverages self-supervised training and supports streaming voice conversion.
✅ Zero-shot Voice Conversion
✅ Streaming VC
✅ Self-supervised Training
Create a new conda environment and install the required packages:
# Create a new Conda environment
conda create -n genVC python=3.10
# Activate the environment
conda activate genVC
# Install necessary dependencies
pip install pip==20.3.4
pip install transformers==4.33.0
pip install fairseq
pip install torch==2.3.0 torchaudio==2.3.0
# Install additional requirements
pip install -r requirements.txt
We used Python 3.10, Torch 2.3, and Transformers 4.33 to train and test our models. The codebase is expected to work with Torch versions below 2.6 and with other versions of the Transformers package, although streaming inference may not be compatible with some newer Transformers releases. Following the installation steps above (including the pinned pip version) ensures that Fairseq installs without issues.
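If you want to confirm that an existing environment matches the versions we tested with, a quick check along these lines can help (a minimal sketch; the version strings below are the tested ones listed above, not strict requirements):

```python
import torch
import transformers

# Versions used to train and test the released models (see above).
TESTED = {"torch": "2.3.0", "transformers": "4.33.0"}

for name, mod in (("torch", torch), ("transformers", transformers)):
    installed = mod.__version__
    major_minor = TESTED[name].rsplit(".", 1)[0]
    note = "" if installed.startswith(major_minor) else "  <- differs from tested version"
    print(f"{name:<12} {installed}{note}")
```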
NOTE: top_k is one of the key hyperparameters for inference, and you can adjust it using --top_k to achieve better results. For streaming inference, greedy decoding is recommended; you can set top_k to 1.
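For intuition, top_k restricts sampling to the k most likely tokens at each decoding step, and with top_k = 1 it degenerates to greedy decoding. The sketch below illustrates the idea only; it is not GenVC's actual decoding loop, and `logits` is a hypothetical tensor of next-token scores:

```python
import torch

def sample_top_k(logits: torch.Tensor, top_k: int = 1) -> torch.Tensor:
    """Sample a token id from the top_k most likely entries of `logits`.

    With top_k = 1 this is greedy decoding: the argmax token is always chosen.
    """
    values, indices = torch.topk(logits, k=top_k)       # keep the k best scores
    probs = torch.softmax(values, dim=-1)                # renormalise over them
    choice = torch.multinomial(probs, num_samples=1)     # sample within the top-k
    return indices[choice]

# Example: greedy decoding (top_k=1) vs. sampling from a larger candidate set
logits = torch.randn(1024)
print(sample_top_k(logits, top_k=1))    # deterministic argmax, recommended for streaming
print(sample_top_k(logits, top_k=50))   # stochastic; tune per the note above
```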
Model | Training Set
---|---
GenVC_small | LibriTTS
GenVC_large | LibriTTS, MLS, Common Voice
We recommend downloading the model and placing it in the pre_trained/ directory.
python infer.py --model_path pre_trained/GenVC_small.pth --src_wav samples/EF4_ENG_0112_1.wav --ref_audio samples/EM1_ENG_0037_1.wav --output_path samples/converted.wav
python infer.py --model_path pre_trained/GenVC_small.pth --src_wav samples/EF4_ENG_0112_1.wav --ref_audio samples/EM1_ENG_0037_1.wav --output_path samples/converted.wav --streaming
We evaluated the latency and real-time factor (processing time / audio length) of our model on three different GPUs; the results are presented in the table below. The streaming experiment was conducted with 1-second chunk processing, and latency was measured as the time the model takes to generate the first audio signal after receiving an input signal. In a practical setting, the actual delay would be the sum of the audio length (1 second) and the processing latency, where the audio length represents the time a person needs to produce the speech.
GPU Type | Latency (ms) Avg. | Latency (ms) Min | Latency (ms) Max | RTF Avg. | RTF Min | RTF Max
---|---|---|---|---|---|---
H100 | 95.2 | 94.5 | 96.4 | 0.28 | 0.03 | 0.44
A100 | 129.7 | 123.9 | 148.4 | 0.38 | 0.04 | 0.73
1080-TI | 183.9 | 177.5 | 193.9 | 0.57 | 0.08 | 1.11
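As a rough illustration of how these numbers are defined (not our exact benchmarking script; `stream_convert` is a hypothetical generator that consumes input chunks and yields converted audio chunks), latency is the wall-clock time until the first output chunk and RTF is total processing time divided by input duration:

```python
import time

CHUNK_SEC = 1.0  # 1-second chunk processing, as in the experiment above

def measure(stream_convert, chunks):
    """Measure first-chunk latency and real-time factor for a streaming VC call."""
    start = time.perf_counter()
    first_out = None
    for _ in stream_convert(chunks):
        if first_out is None:
            first_out = time.perf_counter()      # time of the first generated audio
    end = time.perf_counter()

    latency_ms = (first_out - start) * 1000.0            # processing latency only
    rtf = (end - start) / (len(chunks) * CHUNK_SEC)      # processing time / audio length
    # The perceived delay also includes the 1 s needed to utter the first chunk:
    perceived_delay_s = CHUNK_SEC + latency_ms / 1000.0
    return latency_ms, rtf, perceived_delay_s
```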
We highly recommend using Weights & Biases (wandb) to track training progress. Our training history can serve as a reference when training your own model.
Download the LibriTTS corpus and place it in the data/ folder. Use our prepared metadata located in metafiles/libritts/. Each line in the metafile follows the format: utterance_path|spk.
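As an illustration of this format, the sketch below builds such a metafile from a LibriTTS-style directory layout, where the speaker ID is the top-level folder name. This is a hypothetical helper rather than the project's own preprocessing script, and the data_root and output filename are assumptions; adjust them to your setup:

```python
from pathlib import Path

data_root = Path("data/LibriTTS/train-clean-100")   # assumed location under data/
out_file = Path("metafiles/libritts/train.txt")     # hypothetical output name

with out_file.open("w") as f:
    for wav in sorted(data_root.rglob("*.wav")):
        spk = wav.relative_to(data_root).parts[0]    # LibriTTS: <spk>/<chapter>/<utt>.wav
        f.write(f"{wav}|{spk}\n")                    # utterance_path|spk
```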
We also provide pretrained tokenizers.
NOTE: Modify the configurations in train_*_dvae.py if necessary.
CUDA_VISIBLE_DEVICES=0 python train_audio_dvae.py
For phonetic tokenizer training, please download the ContentVec model and save it as pre_trained/contentVec.pt. If you've already downloaded the pre_trained directory from Hugging Face, no further download is necessary.
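If you want to sanity-check the ContentVec checkpoint before training, something along these lines should work. This is a minimal sketch that assumes the checkpoint loads as a fairseq HuBERT-style model; the layer choice (output_layer=12) and other details may differ from the actual training code:

```python
import torch
import torchaudio
from fairseq import checkpoint_utils

# Load the ContentVec checkpoint as a fairseq HuBERT-style model (assumption).
models, cfg, task = checkpoint_utils.load_model_ensemble_and_task(["pre_trained/contentVec.pt"])
model = models[0].eval()

wav, sr = torchaudio.load("samples/EF4_ENG_0112_1.wav")
wav = torchaudio.functional.resample(wav, sr, 16000)    # ContentVec expects 16 kHz audio

with torch.no_grad():
    # extract_features returns (features, padding_mask) for HuBERT-style models
    feats, _ = model.extract_features(source=wav, padding_mask=None, output_layer=12)
print(feats.shape)   # (1, num_frames, feature_dim)
```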
CUDA_VISIBLE_DEVICES=0 python train_content_dvae.py
CUDA_VISIBLE_DEVICES=0 python train_genVC.py
CUDA_VISIBLE_DEVICES=0 python train_vocoder.py
☑️ Multi-GPU Training
☑️ Causal Neural Vocoder
☑️ Multilingual VC
While our work holds significant promise, it also carries potential societal implications that warrant consideration. GenVC is a voice conversion system capable of transforming source speech into desired voices. While this technology has valuable applications, such as enhancing privacy by anonymizing voices and enabling accessibility for individuals with speech impairments, it also presents ethical challenges. Specifically, the ability to convincingly replicate voices can be misused to create audio deepfakes, which may be employed for malicious purposes, such as identity theft, fraud, and the spread of misinformation.
To mitigate these risks, we strongly advocate for the responsible and ethical use of voice conversion technologies. Researchers, developers, and users must comply with relevant laws and guidelines, ensuring that these systems are used exclusively for legitimate and beneficial applications. Transparency, informed consent, and robust safeguards should be prioritized to prevent misuse and protect individuals' rights and privacy.
- Coqui-AI/Trainer for the PyTorch model trainer
- Tortoise-TTS and XTTS for the LLM backbone
- Amphion for providing some of the vocoder codes
- My team members for their feedback, contributions, and support