VALL-E: Zero-Shot Text-to-Speech | PyTorch Implementation

Unofficial PyTorch implementation of VALL-E, a neural codec language model for zero-shot text-to-speech (TTS) and voice cloning. Train the models on your own data, then synthesize natural speech in a target voice from text and a short reference audio sample (the VALL-E paper uses a 3-second prompt).

What is VALL-E?

VALL-E, introduced in the paper "Neural Codec Language Models are Zero-Shot Text to Speech Synthesizers", is a zero-shot text-to-speech synthesizer from Microsoft Research. Given arbitrary text and a short audio prompt from a speaker, it generates high-quality speech that matches the speaker's voice, enabling voice cloning and personalized TTS without fine-tuning. This repository provides an open-source PyTorch reimplementation based on the EnCodec neural audio codec.

Keywords: text-to-speech, TTS, zero-shot TTS, voice cloning, neural codec, speech synthesis, VALL-E, PyTorch, EnCodec, autoregressive, non-autoregressive

Features

Zero-shot text-to-speech — Generate speech in a target voice from a single reference utterance
AR + NAR models — Autoregressive (AR) and non-autoregressive (NAR) transformer architectures
EnCodec tokenizer — Uses Facebook's EnCodec for neural audio quantization
DeepSpeed training — Scalable training with DeepSpeed
Synthesis CLI — Command-line interface for inference and voice cloning
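
To make the AR + NAR split concrete, here is a toy sketch of the two-stage decoding flow described in the VALL-E paper: the AR model produces the first quantizer level token by token, and the NAR model then fills in each remaining level in a single pass. The functions toy_ar_step and toy_nar_level are hypothetical stand-ins that return random tokens; they are not this repository's models or API, and the stop rule is invented for the demonstration.

```python
import random

NUM_LEVELS = 8        # quantizer levels used in the VALL-E paper
CODEBOOK_SIZE = 1024  # entries per EnCodec codebook
STOP = -1             # hypothetical end-of-sequence marker for this toy


def toy_ar_step(text, prompt, generated, rng):
    """Stand-in for the AR transformer: return the next first-level token."""
    # Invented stop rule: emit two tokens per text symbol, then stop.
    if len(generated) >= 2 * len(text):
        return STOP
    return rng.randrange(CODEBOOK_SIZE)


def toy_nar_level(text, prompt, levels, rng):
    """Stand-in for the NAR transformer: predict one whole level at once."""
    return [rng.randrange(CODEBOOK_SIZE) for _ in levels[0]]


def synthesize_tokens(text, prompt, seed=0):
    rng = random.Random(seed)
    # Stage 1: autoregressive decoding of the first quantizer level.
    first = []
    while True:
        token = toy_ar_step(text, prompt, first, rng)
        if token == STOP:
            break
        first.append(token)
    # Stage 2: each remaining level is predicted in parallel over time,
    # conditioned on all levels generated so far.
    levels = [first]
    for _ in range(1, NUM_LEVELS):
        levels.append(toy_nar_level(text, prompt, levels, rng))
    return levels


codes = synthesize_tokens(list("hello"), prompt=[[0] * 10] * NUM_LEVELS)
print(len(codes), len(codes[0]))  # 8 10
```

In the real system the resulting grid of tokens is passed to the EnCodec decoder to produce a waveform.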

Requirements

This trainer uses DeepSpeed. You need:

A GPU supported by DeepSpeed
CUDA or ROCm compiler installed

Installation

Install from GitHub:

pip install git+https://github.com/enhuiz/vall-e

Or clone with submodules:

git clone --recurse-submodules https://github.com/enhuiz/vall-e.git

Note: Tested with Python 3.10.7.

Quick Start: Train Your Own VALL-E Model

1. Prepare Data

Place your data in a folder (e.g. data/your_data):

Audio files: .wav suffix
Text files: .normalized.txt suffix
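
Before running the later steps, it can help to confirm that every audio file has a transcript and vice versa. The following is a minimal sketch, not part of this repository; the helper names unpaired and scan are hypothetical, and scanning subfolders recursively is an assumption about how the data folder is laid out.

```python
from pathlib import Path


def unpaired(names, audio_suffix=".wav", text_suffix=".normalized.txt"):
    """Return (audio stems without text, text stems without audio)."""
    audio = {n[: -len(audio_suffix)] for n in names if n.endswith(audio_suffix)}
    text = {n[: -len(text_suffix)] for n in names if n.endswith(text_suffix)}
    return sorted(audio - text), sorted(text - audio)


def scan(data_dir):
    """Collect file names under data_dir (recursively) and check pairing."""
    root = Path(data_dir)
    names = [p.relative_to(root).as_posix() for p in root.rglob("*") if p.is_file()]
    return unpaired(names)


missing_text, missing_audio = unpaired(
    ["a.wav", "a.normalized.txt", "b.wav", "c.normalized.txt"]
)
print(missing_text, missing_audio)  # ['b'] ['c']
```

Calling scan("data/your_data") applies the same check to a real folder.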

2. Quantize Audio

python -m vall_e.emb.qnt data/your_data
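
As a rough guide to what quantization produces, the arithmetic below estimates the size of the token grid for a clip. It assumes the 24 kHz EnCodec model at 6 kbps, the setting reported in the VALL-E paper; the bandwidth this repository actually uses is set in its quantization module and may differ.

```python
import math

SAMPLE_RATE = 24_000  # Hz, EnCodec 24 kHz model
HOP = 320             # samples per frame (total encoder stride)
BITS_PER_CODE = 10    # log2(1024) codebook entries


def frames_per_second():
    return SAMPLE_RATE // HOP


def num_codebooks(kbps):
    # Each codebook contributes frames_per_second * BITS_PER_CODE bits per second.
    return int(kbps * 1000) // (frames_per_second() * BITS_PER_CODE)


def token_grid(seconds, kbps=6.0):
    """Return (codebooks, frames) for a clip of the given duration."""
    frames = math.ceil(seconds * frames_per_second())
    return num_codebooks(kbps), frames


print(frames_per_second())  # 75
print(token_grid(3.0))      # (8, 225)
```

Under these assumptions a 3-second prompt becomes 8 levels of 225 tokens each.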

3. Generate Phonemes (G2P)

python -m vall_e.emb.g2p data/your_data

4. Configure Training

Create config/your_data/ar.yml and config/your_data/nar.yml. See config/test and vall_e/config.py for examples. Model presets (e.g. ar-quarter, ar-half, ar) are in vall_e/vall_e/__init__.py.

5. Train AR or NAR Model

python -m vall_e.train yaml=config/your_data/ar_or_nar.yml

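Here ar_or_nar.yml stands for either ar.yml or nar.yml; run the command once for each model. Type quit in the CLI to stop training at any time; the latest checkpoint is saved automatically.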

Export Trained Models

Export AR or NAR checkpoints:

python -m vall_e.export zoo/ar_or_nar.pt yaml=config/your_data/ar_or_nar.yml

Synthesis (Text-to-Speech / Voice Cloning)

Run zero-shot TTS with a reference audio file:

python -m vall_e <text> <ref_path> <out_path> --ar-ckpt zoo/ar.pt --nar-ckpt zoo/nar.pt
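
When calling the synthesis CLI from a script, passing the arguments as a list avoids shell-quoting problems with spaces and punctuation in the text. This is a minimal sketch that only wraps the command shown above; the helper names synthesis_command and synthesize, and the file names ref.wav and out.wav, are hypothetical.

```python
import subprocess
import sys


def synthesis_command(text, ref_path, out_path,
                      ar_ckpt="zoo/ar.pt", nar_ckpt="zoo/nar.pt"):
    """Build the argv list for the synthesis CLI shown above."""
    return [
        sys.executable, "-m", "vall_e",
        text, str(ref_path), str(out_path),
        "--ar-ckpt", str(ar_ckpt),
        "--nar-ckpt", str(nar_ckpt),
    ]


def synthesize(text, ref_path, out_path, **ckpts):
    # check=True raises CalledProcessError if the CLI exits with an error.
    subprocess.run(synthesis_command(text, ref_path, out_path, **ckpts), check=True)


cmd = synthesis_command("Hello, world!", "ref.wav", "out.wav")
print(cmd[1:])
```

The text stays a single argument however it is punctuated, so no extra quoting is needed.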

Colab Demo

The Google Colab notebook is a toy example that overfits a single utterance under data/test. It is not intended for production use. Pretrained checkpoints are not available yet.

Roadmap

AR model for first quantizer
Audio decoding from tokens
NAR model for remaining quantizers
Trainers for AR and NAR
AdaLN for NAR model
Sample-wise quantization level sampling for NAR training
Synthesis CLI
Pre-trained checkpoint and demos on LibriTTS
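
One reading of "sample-wise quantization level sampling" is that each sample in a batch gets its own randomly chosen target level, instead of one level shared by the whole batch. The VALL-E paper samples one training stage per step for the NAR model; the sketch below illustrates the per-sample variant with plain lists. It is an illustration under that assumption, not this repository's trainer, and the helper names are hypothetical.

```python
import random

NUM_LEVELS = 8  # quantizer levels; level 0 belongs to the AR model


def sample_levels(batch_size, rng):
    """Draw one NAR target level (1..NUM_LEVELS-1) per sample."""
    return [rng.randrange(1, NUM_LEVELS) for _ in range(batch_size)]


def split_for_level(codes, level):
    """Return (input levels below the target, target level tokens)."""
    return codes[:level], codes[level]


rng = random.Random(0)
# Toy batch: 3 samples, each with NUM_LEVELS levels of 4 tokens.
batch = [
    [[lvl * 10 + t for t in range(4)] for lvl in range(NUM_LEVELS)]
    for _ in range(3)
]
levels = sample_levels(len(batch), rng)
pairs = [split_for_level(codes, lvl) for codes, lvl in zip(batch, levels)]

for lvl, (inputs, target) in zip(levels, pairs):
    print(f"target level {lvl}: {len(inputs)} input levels, {len(target)} target tokens")
```

Each sample is then trained to predict its own target level from the levels below it.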

License & Citations

EnCodec is licensed under CC-BY-NC 4.0. If you use this code for audio quantization or decoding, comply with their license.

VALL-E (Microsoft):

@article{wang2023neural,
  title={Neural Codec Language Models are Zero-Shot Text to Speech Synthesizers},
  author={Wang, Chengyi and Chen, Sanyuan and Wu, Yu and Zhang, Ziqiang and Zhou, Long and Liu, Shujie and Chen, Zhuo and Liu, Yanqing and Wang, Huaming and Li, Jinyu and others},
  journal={arXiv preprint arXiv:2301.02111},
  year={2023}
}

EnCodec (Meta):

@article{defossez2022highfi,
  title={High Fidelity Neural Audio Compression},
  author={Défossez, Alexandre and Copet, Jade and Synnaeve, Gabriel and Adi, Yossi},
  journal={arXiv preprint arXiv:2210.13438},
  year={2022}
}

