A PyTorch implementation of VITS: Conditional Variational Autoencoder with Adversarial Learning
This project implements VITS (Conditional Variational Autoencoder with Adversarial Learning), a state-of-the-art end-to-end Text-to-Speech model that directly generates waveforms from text. Key features include:
- End-to-end text-to-speech synthesis
- Parallel (non-autoregressive) sampling for fast inference
- High-quality audio generation
- Multi-speaker support
- Emotion and style control
Requirements:
- Python 3.8+
- CUDA-compatible GPU (8GB+ VRAM)
- 16GB+ RAM
- 50GB+ disk space
- Create and activate a virtual environment:
python -m venv venv

# Linux/Mac
source venv/bin/activate

# Windows
.\venv\Scripts\activate
- Install PyTorch:
# Windows/Linux with CUDA 11.8
pip install torch torchaudio --index-url https://download.pytorch.org/whl/cu118

# CPU only
pip install torch torchaudio
- Install dependencies:
pip install -r requirements.txt
- Verify installation:
python -c "import torch; print(f'PyTorch: {torch.__version__}, CUDA: {torch.cuda.is_available()}')"
- Download the LJSpeech dataset (Linux/macOS):
mkdir -p data/raw/LJSpeech-1.1
wget https://data.keithito.com/data/speech/LJSpeech-1.1.tar.bz2 -P data/raw
tar -xvf data/raw/LJSpeech-1.1.tar.bz2 -C data/raw
rm data/raw/LJSpeech-1.1.tar.bz2
- Download the LJSpeech dataset (Windows PowerShell, using 7-Zip to extract):
New-Item -ItemType Directory -Force -Path "data\raw\LJSpeech-1.1"
Invoke-WebRequest -Uri "https://data.keithito.com/data/speech/LJSpeech-1.1.tar.bz2" -OutFile "data\raw\LJSpeech-1.1.tar.bz2"
& 'C:\Program Files\7-Zip\7z.exe' x "data\raw\LJSpeech-1.1.tar.bz2" -o"data\raw"
& 'C:\Program Files\7-Zip\7z.exe' x "data\raw\LJSpeech-1.1.tar" -o"data\raw"
Remove-Item "data\raw\LJSpeech-1.1.tar*"
- Or download it with a cross-platform Python script:
import requests, tarfile
from pathlib import Path

data_dir = Path("data/raw/LJSpeech-1.1")
data_dir.mkdir(parents=True, exist_ok=True)

url = "https://data.keithito.com/data/speech/LJSpeech-1.1.tar.bz2"
archive_path = data_dir.parent / "LJSpeech-1.1.tar.bz2"

print("Downloading LJSpeech dataset...")
response = requests.get(url, stream=True)
response.raise_for_status()  # fail early on a bad HTTP status
with open(archive_path, 'wb') as f:
    for chunk in response.iter_content(chunk_size=8192):
        f.write(chunk)

print("Extracting dataset...")
with tarfile.open(archive_path, 'r:bz2') as tar:
    tar.extractall(path=data_dir.parent)

archive_path.unlink()
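A quick sanity check after downloading (the 13,100-clip count is the published size of LJSpeech 1.1; the paths assume the layout created above):

from pathlib import Path

data_dir = Path("data/raw/LJSpeech-1.1")
wav_count = len(list((data_dir / "wavs").glob("*.wav")))
print(f"Found {wav_count} wav files (expected 13100)")
print(f"metadata.csv present: {(data_dir / 'metadata.csv').exists()}")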
- Prepare dataset:
python scripts/prepare_dataset.py --config configs/vits_config.yaml
- Start training:
# Single GPU
python scripts/train.py --config configs/vits_config.yaml

# Multi-GPU (e.g., 4 GPUs)
python scripts/train.py --config configs/vits_config.yaml --world_size 4
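The --world_size flag suggests the training script launches one process per GPU. As a rough, generic illustration of that pattern in PyTorch (a sketch only, not this repository's train.py; the model and the loop body are placeholders):

import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def ddp_worker(rank: int, world_size: int):
    # One process per GPU, rendezvousing over localhost.
    os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
    os.environ.setdefault("MASTER_PORT", "29500")
    dist.init_process_group("nccl", rank=rank, world_size=world_size)
    torch.cuda.set_device(rank)

    model = torch.nn.Linear(80, 80).cuda(rank)   # placeholder for the real model
    model = DDP(model, device_ids=[rank])
    # ... build optimizer/dataloader and run the training loop here ...

    dist.destroy_process_group()

if __name__ == "__main__":
    world_size = torch.cuda.device_count()
    torch.multiprocessing.spawn(ddp_worker, args=(world_size,), nprocs=world_size)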
- Monitor training:
# TensorBoard
tensorboard --logdir data/logs

# Weights & Biases monitoring is automatic if enabled in config
Inference example:
from src.inference import VITS
# Initialize model
vits = VITS(checkpoint="path/to/checkpoint")
# Basic synthesis
audio = vits.synthesize(
    text="Hello, world!",
    speaker_id=0,
    speed_factor=1.0,
)

# Save audio
vits.save_audio(audio, "output.wav")

# Batch processing
texts = [
    "First sentence.",
    "Second sentence.",
    "Third sentence.",
]
audios = vits.synthesize_batch(texts, speaker_id=0)
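Assuming synthesize_batch returns one waveform per input text, in order (an assumption about this project's API), the batch outputs can be written out with the same save_audio helper:

# Assumes one waveform per input text, in the same order as `texts`.
for i, audio in enumerate(audios):
    vits.save_audio(audio, f"output_{i:02d}.wav")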
Model architecture:

Text → [Text Encoder] → Hidden States
                            ↓
                 [Posterior Encoder]
                            ↓
                 [Flow Decoder] → Audio
                            ↓
        [Multi-Period Discriminator]
        [Multi-Scale Discriminator]
Key components:
- Text Encoder: Transformer-based with multi-head attention
- Flow Decoder: Normalizing flows with residual coupling (see the sketch after this list)
- Posterior Encoder: WaveNet-style architecture
- Discriminators: Multi-period and multi-scale for quality
- Voice Conversion: Optional cross-speaker style transfer
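To make the "residual coupling" idea concrete, here is a minimal affine coupling block in PyTorch. It is an illustrative sketch, not this repository's module: the hidden size and the simple Conv1d stack stand in for the WaveNet-style network VITS actually uses.

import torch
import torch.nn as nn

class ResidualCouplingLayer(nn.Module):
    """Minimal affine coupling block: half the channels are transformed
    conditioned on the other half, which keeps the layer exactly invertible."""

    def __init__(self, channels: int, hidden: int = 192):
        super().__init__()
        self.half = channels // 2
        # Placeholder for the WaveNet-style conditioning network.
        self.net = nn.Sequential(
            nn.Conv1d(self.half, hidden, 5, padding=2),
            nn.ReLU(),
            nn.Conv1d(hidden, self.half * 2, 5, padding=2),
        )

    def forward(self, x: torch.Tensor, reverse: bool = False):
        x0, x1 = x[:, :self.half], x[:, self.half:]
        m, logs = self.net(x0).chunk(2, dim=1)      # shift and log-scale
        if not reverse:
            x1 = m + x1 * torch.exp(logs)           # forward pass of the flow
            logdet = logs.sum(dim=[1, 2])           # log|det J| for the ELBO
            return torch.cat([x0, x1], dim=1), logdet
        x1 = (x1 - m) * torch.exp(-logs)            # exact inverse
        return torch.cat([x0, x1], dim=1)

Stacking several such layers and flipping the channel halves between them gives an invertible flow whose Jacobian determinant is cheap to compute.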
Troubleshooting:
- Out of Memory (OOM):
- Reduce the batch size in the config
- Enable gradient accumulation
- Use mixed precision (fp16); a combined sketch follows this list
- Poor Audio Quality:
- Check preprocessing parameters
- Verify loss convergence
- Ensure proper normalization
- Slow Training:
- Enable mixed precision
- Use DDP for multi-GPU
- Optimize dataloader workers
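The mixed-precision, gradient-accumulation, and dataloader suggestions above combine into a standard PyTorch pattern. This is a generic sketch with a placeholder model and random data, not this repository's training loop:

import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

# Placeholder model and data; substitute the real VITS model and dataset.
model = nn.Linear(80, 80).cuda()
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-4)
dataset = TensorDataset(torch.randn(256, 80), torch.randn(256, 80))

loader = DataLoader(
    dataset,
    batch_size=16,
    shuffle=True,
    num_workers=4,            # parallel CPU-side loading
    pin_memory=True,          # faster host-to-GPU copies
    persistent_workers=True,  # keep workers alive between epochs
)

scaler = torch.cuda.amp.GradScaler()  # loss scaling for fp16
accum_steps = 4                       # effective batch = 16 * 4

optimizer.zero_grad(set_to_none=True)
for step, (x, y) in enumerate(loader):
    x, y = x.cuda(non_blocking=True), y.cuda(non_blocking=True)
    with torch.cuda.amp.autocast():   # run the forward pass in mixed precision
        loss = nn.functional.mse_loss(model(x), y) / accum_steps
    scaler.scale(loss).backward()     # accumulate scaled gradients
    if (step + 1) % accum_steps == 0:
        scaler.step(optimizer)        # unscales grads, skips step on inf/nan
        scaler.update()
        optimizer.zero_grad(set_to_none=True)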
If you use this code, please cite the VITS paper:
@inproceedings{kim2021vits,
  title={Conditional Variational Autoencoder with Adversarial Learning for End-to-End Text-to-Speech},
  author={Kim, Jaehyeon and Kong, Jungil and Son, Juhee},
  booktitle={International Conference on Machine Learning},
  year={2021}
}
MIT License - see LICENSE file