We provide a beginner recipe that demonstrates how to train a cutting-edge TTS model. Specifically, it is Amphion's re-implementation of VALL-E, a zero-shot TTS architecture that uses a neural codec language model with discrete codes.
To date, Amphion TTS supports the following models and architectures:
- FastSpeech2: A non-autoregressive TTS architecture that utilizes feed-forward Transformer blocks.
- VITS: An end-to-end TTS architecture that utilizes a conditional variational autoencoder with adversarial learning.
- VALL-E: A zero-shot TTS architecture that uses a neural codec language model with discrete codes. This is our updated VALL-E implementation as of June 2024, which uses Llama as its underlying architecture. The previous VALL-E release can be found here.
- NaturalSpeech2 (👨💻 developing): An architecture for TTS that utilizes a latent diffusion model to generate natural-sounding voices.
- Jets: An end-to-end TTS model that jointly trains FastSpeech2 and HiFi-GAN with an alignment module.
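To make the VALL-E description above concrete, the sketch below illustrates the core idea of a neural codec language model: speech is represented as a sequence of discrete codec codes, and a language model predicts the next code autoregressively. The "model" here is a hypothetical stand-in (a seeded random score function), not Amphion's actual API; names like `next_code_logits` and `generate_codes` are illustrative only.

```python
import numpy as np

CODEBOOK_SIZE = 8  # assumed toy codebook size; real codecs use e.g. 1024

def next_code_logits(prefix):
    # Hypothetical stand-in for a trained codec language model:
    # returns a score for each codebook entry given the code prefix.
    # Seeding from the prefix makes the toy model deterministic.
    rng = np.random.default_rng(sum(prefix) + len(prefix))
    return rng.normal(size=CODEBOOK_SIZE)

def generate_codes(prompt_codes, n_steps):
    # Greedy autoregressive decoding over discrete codec codes:
    # at each step, pick the highest-scoring next code and append it.
    codes = list(prompt_codes)
    for _ in range(n_steps):
        logits = next_code_logits(codes)
        codes.append(int(np.argmax(logits)))
    return codes
```

In a real zero-shot TTS system, the prompt codes would come from encoding a short reference utterance with a neural codec, and the generated codes would be decoded back into a waveform; sampling would typically replace the greedy `argmax` step.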
Here are some TTS samples from Amphion.