While diffusion and flow-matching models have advanced TTS, generating high-arousal emotions remains a persistent challenge due to the trade-off between stability and expressiveness. Existing systems often suffer from linguistic collapse when pursuing high intensity or fail to meet target emotional levels under stable settings. In this work, we identify that standard Gaussian initialization inevitably introduces a neutral prosody bias, while uniform Classifier-Free Guidance often distorts the acoustic manifold, leading to artifacts. To address this, we propose an inference framework that rectifies the emotional trajectory. An Emotion-Rectified Noise Prior injects a semantic gradient at initialization to align sampling with the target emotional manifold, and Likelihood-Inverse Guidance adaptively schedules guidance via a conditional/unconditional likelihood ratio, strengthening guidance only when the trajectory drifts toward a neutral fallback. Extensive experiments demonstrate that our method effectively resolves the stability bottleneck in high-intensity scenarios, achieving superior linguistic accuracy and emotional fidelity without model retraining.
- Zero retraining: pure inference-time enhancement, works on any Flow-Matching TTS
- ERNP (Emotion-Rectified Noise Prior): steers the initial noise toward the emotional manifold via lookahead → calibration → re-normalization
- LIG (Likelihood-Inverse Guidance): replaces constant CFG with a dynamic λ(t) derived from recursive likelihood-ratio estimation
- No extra model calls: LIG reuses the existing conditional/unconditional velocity fields
- SOTA results: WER 4.41% → 2.53%, EMOS 3.63 → 3.89 on the HIED benchmark
- Plug-and-play: validated on CosyVoice2, IndexTTS2, and F5-TTS architectures
Our framework operates entirely at inference time with zero retraining, consisting of two complementary components:
ERNP (Emotion-Rectified Noise Prior) rectifies the initial Gaussian noise before the ODE solve via a two-step lookahead-calibration cycle, followed by re-normalization:
- Lookahead: forward one step from $x_0$ with high guidance strength $\lambda_{\text{init}}$: $\quad x_\tau = x_0 + \tau \cdot \tilde{v}_{\lambda_{\text{init}}}(x_0, 0)$
- Calibration: backward one step with base guidance $\lambda_{\text{base}}$: $\quad x_0^* = x_\tau - \tau \cdot \tilde{v}_{\lambda_{\text{base}}}(x_\tau, \tau)$
- Re-normalization: strictly standardize $x_0^*$ back to $\mathcal{N}(0, I)$
The net effect is a controlled displacement along the emotional semantic gradient, steering the starting point toward the target emotional manifold.
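The three steps above can be sketched in a few lines of plain Python. This is an illustrative sketch, not the code in `CFM.sample()`: the helper names (`guided_velocity`, `ernp`), the step size `tau=0.05`, the toy velocity fields, and the CFG combination `v_c + lam * (v_c - v_u)` are all assumptions. Only the two guidance values mirror the `--ernp_lambda_init 50.0` and `--ernp_lambda_base 2.0` flags used in this README's ablation commands.

```python
import random
import statistics

def guided_velocity(v_cond, v_uncond, x, t, lam):
    # Assumed CFG combination: v_c + lam * (v_c - v_u).
    vc, vu = v_cond(x, t), v_uncond(x, t)
    return [c + lam * (c - u) for c, u in zip(vc, vu)]

def ernp(x0, v_cond, v_uncond, tau=0.05, lam_init=50.0, lam_base=2.0):
    # 1) Lookahead: one forward Euler step from t=0 with strong guidance.
    v_hi = guided_velocity(v_cond, v_uncond, x0, 0.0, lam_init)
    x_tau = [x + tau * v for x, v in zip(x0, v_hi)]
    # 2) Calibration: one backward step from t=tau with base guidance.
    v_lo = guided_velocity(v_cond, v_uncond, x_tau, tau, lam_base)
    x_star = [x - tau * v for x, v in zip(x_tau, v_lo)]
    # 3) Re-normalization: standardize to zero mean and unit variance.
    mean = statistics.fmean(x_star)
    std = statistics.pstdev(x_star)
    return [(x - mean) / std for x in x_star]

# Toy stand-ins for the model (hypothetical): the conditional field pulls
# toward TARGET, the unconditional field pulls toward the origin.
TARGET = [1.0 if i % 2 == 0 else -1.0 for i in range(64)]

def toy_v_cond(x, t):
    return [g - xi for g, xi in zip(TARGET, x)]

def toy_v_uncond(x, t):
    return [-xi for xi in x]

rng = random.Random(0)
x0 = [rng.gauss(0.0, 1.0) for _ in range(64)]
x0_star = ernp(x0, toy_v_cond, toy_v_uncond)
print(round(statistics.fmean(x0_star), 6), round(statistics.pstdev(x0_star), 6))
```

In this toy setting the rectified noise keeps Gaussian-like first and second moments while its alignment with `TARGET` increases, which is the "controlled displacement" described above.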
LIG (Likelihood-Inverse Guidance) replaces constant CFG with a dynamic, trajectory-aware guidance schedule. We model the learned conditional distribution as an additive mixture of neutral and emotional components, and derive the per-step guidance strength $\lambda(t)$ from the resulting conditional/unconditional likelihood ratio.
x₀ ~ N(0,I) ──[ERNP]──▶ Rectified x₀* ──[LIG: dynamic λ(t)]──▶ Emotional speech x₁
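The pipeline above amounts to an ODE solve from t = 0 to t = 1 in which the guidance strength is re-evaluated at every step. The sketch below shows only that control flow, using a plain Euler solver. `euler_sample`, the toy velocity fields, and especially `toy_schedule` are hypothetical placeholders and not the LIG formula. The point it illustrates is that a schedule can be computed from the conditional and unconditional velocities CFG already evaluates, so no extra model calls are needed.

```python
import math

def euler_sample(x0, v_cond, v_uncond, schedule, steps=16):
    """Euler ODE solve from t=0 to t=1 with a per-step guidance strength."""
    x, dt, lambdas = list(x0), 1.0 / steps, []
    for i in range(steps):
        t = i * dt
        # The only two model evaluations per step, as in standard CFG.
        vc, vu = v_cond(x, t), v_uncond(x, t)
        lam = schedule(t, vc, vu)  # reuses vc and vu, no extra forward pass
        lambdas.append(lam)
        # Assumed CFG combination: v_c + lam * (v_c - v_u).
        x = [xi + dt * (c + lam * (c - u)) for xi, c, u in zip(x, vc, vu)]
    return x, lambdas

def constant_schedule(lam):
    # Baseline: the same guidance strength at every step.
    return lambda t, vc, vu: lam

def toy_schedule(lam_base=2.0, lam_max=15.0):
    # Placeholder schedule (NOT the LIG formula): guidance grows toward
    # lam_max when conditional and unconditional velocities nearly coincide.
    def schedule(t, vc, vu):
        gap = math.sqrt(sum((c - u) ** 2 for c, u in zip(vc, vu)))
        scale = math.sqrt(sum(c * c for c in vc)) + 1e-8
        closeness = math.exp(-gap / scale)  # in [0, 1]
        return lam_base + (lam_max - lam_base) * closeness
    return schedule

# Toy velocity fields (hypothetical stand-ins for the TTS model).
TARGET = [1.0, -1.0, 0.5, -0.5]

def toy_v_cond(x, t):
    return [g - xi for g, xi in zip(TARGET, x)]

def toy_v_uncond(x, t):
    return [-xi for xi in x]

start = [0.3, -1.2, 0.5, 0.9]
x_dyn, lams_dyn = euler_sample(start, toy_v_cond, toy_v_uncond, toy_schedule())
x_const, lams_const = euler_sample(start, toy_v_cond, toy_v_uncond, constant_schedule(2.0))
print([round(l, 2) for l in lams_dyn])
```

Passing the schedule as a callable keeps the solver unchanged when swapping constant CFG for a trajectory-aware rule.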
# Create environment
conda create -n emo-tts python=3.11
conda activate emo-tts
conda install ffmpeg
# Install PyTorch (match your CUDA version)
pip install torch==2.8.0+cu128 torchaudio==2.8.0+cu128 --extra-index-url https://download.pytorch.org/whl/cu128
# Install from source
cd Emo-TTS
pip install -e .

Emo-TTS/
├── src/emo_tts/
│   ├── model/
│   │   ├── cfm.py              # Core: CFM sampling with ERNP + LIG
│   │   ├── backbones/          # DiT, MMDiT, UNet-T
│   │   ├── modules.py          # Mel spectrogram, attention, etc.
│   │   ├── trainer.py          # Training loop
│   │   └── utils.py            # Utilities
│   ├── configs/                # Model architecture YAML configs
│   ├── infer/
│   │   ├── infer_cli.py        # CLI inference
│   │   ├── infer_gradio.py     # Gradio web UI
│   │   ├── infer_emo_test.py   # ERNP + LIG experiment inference
│   │   └── utils_infer.py      # Inference utilities
│   ├── train/                  # Training & finetuning scripts
│   ├── eval/                   # Evaluation tools (WER, EMOS, UTMOS)
│   └── runtime/                # Triton + TensorRT-LLM deployment
├── method.png
├── pyproject.toml
└── README.md
Key file: `src/emo_tts/model/cfm.py`. The `CFM.sample()` method integrates both ERNP (noise rectification before the ODE solve) and LIG (dynamic guidance inside the ODE solve).
# CLI inference
emo-tts_infer-cli --model EmoTTS_v1_Base \
--ref_audio "path/to/reference.wav" \
--ref_text "Transcription of the reference audio." \
--gen_text "Text you want to synthesize."

# Gradio web UI
emo-tts_infer-gradio

# Python API
from emo_tts.api import EmoTTS
tts = EmoTTS(model="EmoTTS_v1_Base")
wav, sr, spec = tts.infer(
ref_file="path/to/reference.wav",
ref_text="Transcription of the reference audio.",
gen_text="Text you want to synthesize.",
file_wave="output.wav",
)

We evaluate on the HIED benchmark (400 high-arousal emotional samples) across three TTS architectures: F5-TTS, CosyVoice2, and IndexTTS2. The two core metrics are:
- WER (↓): Word Error Rate via ASR, measuring linguistic accuracy
- EMOS (↑): Emotion Score via an emotion classifier, measuring emotional fidelity
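For reference, WER is the word-level edit distance between the ASR transcript and the target text, divided by the number of reference words. The following is a minimal pure-Python sketch of that definition. The function name `word_error_rate` is illustrative, and the repository's `eval_wer.py` may normalize text differently (casing, punctuation, Chinese tokenization).

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref, hyp = reference.lower().split(), hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] = edit distance between the first i-1 reference words
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr[j] = min(
                prev[j] + 1,         # deletion (reference word missing)
                curr[j - 1] + 1,     # insertion (extra hypothesis word)
                prev[j - 1] + cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)

# One substitution ("sat" -> "sit") and one deletion ("the") over 6 words.
print(word_error_rate("the cat sat on the mat", "the cat sit on mat"))
```

Because insertions count as errors, WER can exceed 1.0 when the hypothesis is much longer than the reference.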
All datasets used in this work are publicly available at 🤗 erminga/emo-tts.
HIED (High-Intensity Emotional Dataset) is our curated evaluation benchmark specifically designed to stress-test TTS systems under high-arousal emotional conditions.
| Details | |
|---|---|
| Total samples | 400 (100 per emotion) |
| Emotions | Angry, Happy, Sad, Surprise |
| Sources | ESD (354 samples), EmoV-DB (46 samples) |
| Avg duration | 3.85 s |
| Total duration | ~0.43 h |
| Acoustic features | RMS energy, F0 mean/std/range, speaking rate |
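As a quick consistency check on the table, the total duration follows from the sample count and the average duration, and the two source counts add up to the benchmark size:

```python
n_samples = 400          # total samples in HIED
avg_duration_s = 3.85    # average duration per sample, in seconds
source_counts = {"ESD": 354, "EmoV-DB": 46}

total_hours = n_samples * avg_duration_s / 3600
print(f"{total_hours:.2f} h")        # 0.43 h
print(sum(source_counts.values()))   # 400
```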
# Load HIED directly
from datasets import load_dataset
hied = load_dataset("erminga/emo-tts", "HIED", split="test")

HIED sample fields
| Field | Type | Description |
|---|---|---|
| `id` | string | Unique ID (HIED_0000 … HIED_0399) |
| `audio` | audio | Speech waveform |
| `emotion` | string | Emotion class (Angry / Happy / Sad / Surprise) |
| `source_dataset` | string | Origin (ESD / EmoV-DB) |
| `speaker` | string | Speaker identifier |
| `rms_energy` | float | RMS energy |
| `f0_mean` | float | Mean fundamental frequency (Hz) |
| `f0_std` | float | F0 standard deviation (Hz) |
| `f0_range` | float | F0 range (Hz) |
| `speaking_rate` | float | Speaking rate (phonemes/s) |
| `duration` | float | Duration (seconds) |
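Because every sample carries its acoustic features, per-emotion statistics need no audio decoding. The sketch below groups records by `emotion` and averages `f0_mean`. It uses hand-written toy records with made-up values in place of the real dataset, and `mean_f0_by_emotion` is a hypothetical helper. Rows of the loaded HIED split should work the same way, assuming each row behaves like a dict with the fields listed above.

```python
from collections import defaultdict
from statistics import fmean

def mean_f0_by_emotion(records):
    """Average the f0_mean field per emotion class."""
    buckets = defaultdict(list)
    for rec in records:
        buckets[rec["emotion"]].append(rec["f0_mean"])
    return {emotion: fmean(values) for emotion, values in buckets.items()}

# Toy records with invented values, for illustration only.
toy_records = [
    {"id": "toy_0", "emotion": "Angry", "f0_mean": 200.0},
    {"id": "toy_1", "emotion": "Angry", "f0_mean": 220.0},
    {"id": "toy_2", "emotion": "Sad", "f0_mean": 150.0},
]
print(mean_f0_by_emotion(toy_records))
```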
| Dataset | Emotions | Speakers | Language | Reference |
|---|---|---|---|---|
| ESD | Neutral, Happy, Sad, Angry, Surprise | 10 EN + 10 ZH | EN / ZH | Zhou et al., 2022 |
| EmoV-DB | Neutral, Amused, Angry, Sleepy, Disgusted | 4 (bea, jenie, josh, sam) | EN / FR | OpenSLR-115 · Adigwe et al., 2018 |
| Expresso | 8 read + 26 improvised styles | 4 (2M, 2F), 48 kHz | EN | ylacombe/expresso · Nguyen et al., 2023 |
All three source datasets, along with the HIED benchmark, are mirrored in our HuggingFace repository for one-stop download:
# Download everything (~10 GB)
huggingface-cli download erminga/emo-tts --repo-type dataset --local-dir ./emo-tts-data
Step 1. Download the HIED dataset:
from datasets import load_dataset
hied = load_dataset("erminga/emo-tts", "HIED", split="test")

Step 2. Run inference (Baseline vs. ERNP + LIG ablations):

# Baseline: standard CFG
python src/emo_tts/infer/infer_emo_test.py \
--config configs/emo_infer.yaml \
--output_dir results/baseline
# ERNP only: emotion-rectified noise prior
python src/emo_tts/infer/infer_emo_test.py \
--config configs/emo_infer.yaml \
--ernp_lambda_init 50.0 --ernp_lambda_base 2.0 \
--output_dir results/ernp_only
# LIG only: likelihood-inverse guidance
python src/emo_tts/infer/infer_emo_test.py \
--config configs/emo_infer.yaml \
--lig_pi 0.99 --lig_lambda_max 15.0 --lig_sigma 0.5 \
--output_dir results/lig_only
# Full method: ERNP + LIG
python src/emo_tts/infer/infer_emo_test.py \
--config configs/emo_infer.yaml \
--ernp_lambda_init 50.0 --ernp_lambda_base 2.0 \
--lig_pi 0.99 --lig_lambda_max 15.0 --lig_sigma 0.5 \
--output_dir results/ernp_lig

Step 3. Evaluate (WER + EMOS + UTMOSv2):

# Quote the extras spec so shells such as zsh do not expand the brackets
pip install -e ".[eval]"
# WER: Word Error Rate
# EN: Whisper (Radford et al., 2023)
# ZH: FunASR (Gao et al., 2023)
python src/emo_tts/eval/eval_wer.py \
--gen_wav_dir results/ernp_lig \
--gpu_nums 8
# EMOS: Emotion Score via emotion2vec (Ma et al., 2024)
python src/emo_tts/eval/eval_emos.py \
--gen_wav_dir results/ernp_lig
# UTMOSv2: speech quality (MOS prediction)
python src/emo_tts/eval/eval_utmos.py \
--audio_dir results/ernp_lig --ext wav

| Tool | Purpose | Source |
|---|---|---|
| Whisper | English ASR (WER) | openai/whisper · Radford et al., 2023 |
| FunASR | Chinese ASR (WER) | modelscope/FunASR · Gao et al., 2023 |
| emotion2vec | Emotion Score (EMOS) | ddlBoJack/emotion2vec · Ma et al., 2024 |
| UTMOSv2 | Speech Quality (MOS) | sarulab-speech/UTMOSv2 · HF |
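The four ablation runs from Step 2 differ only in which flag groups are passed, so they can be generated from one table. This is a small sketch that assumes it is run from the repository root. `build_command` and the `VARIANTS` mapping are illustrative names, while the script path, flags, and values are the ones listed in Step 2.

```python
import shlex

SCRIPT = "src/emo_tts/infer/infer_emo_test.py"
ERNP_FLAGS = ["--ernp_lambda_init", "50.0", "--ernp_lambda_base", "2.0"]
LIG_FLAGS = ["--lig_pi", "0.99", "--lig_lambda_max", "15.0", "--lig_sigma", "0.5"]

# Output directory name -> extra flags for that ablation.
VARIANTS = {
    "baseline": [],
    "ernp_only": ERNP_FLAGS,
    "lig_only": LIG_FLAGS,
    "ernp_lig": ERNP_FLAGS + LIG_FLAGS,
}

def build_command(name, config="configs/emo_infer.yaml"):
    """Return the argv list for one ablation variant."""
    return [
        "python", SCRIPT,
        "--config", config,
        *VARIANTS[name],
        "--output_dir", f"results/{name}",
    ]

for name in VARIANTS:
    # Print only; to execute, pass the list to subprocess.run(..., check=True).
    print(shlex.join(build_command(name)))
```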
Code is released under the MIT License.
