feat(voices): coqui VITS engine + 36 voices across 33 languages #149

Merged: JarbasAl merged 7 commits into dev from feat/more-voices on Jun 6, 2026

Conversation

@JarbasAl JarbasAl commented Jun 5, 2026

Adds coqui VITS support and converts 32 zoo voices across 29 languages — a large coverage boost across phoonnx architectures.

Engine + tokenization

  • engine=coqui -> VitsAdapter, with the correct VitsCharacters tokenization: vocab [pad] + punctuations + (graphemes + ipa) + [blank], unsorted, is_unique=False (no dedup; blank id = full-list length). This differs from Graphemes/IPAPhonemes (sorted); the off-by-one blank id was the cause of the "gibberish" output.
  • Refactor: the generic Coqui bridge moved out of glowtts_config.py into coqui_config.py (it serves GlowTTS/VITS/FastPitch); glowtts keeps only Larynx + a back-compat re-export.
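The vocabulary layout described above can be illustrated with a minimal sketch. The function names, the `<PAD>`/`<BLNK>` token strings, and the tiny symbol sets are hypothetical and chosen only for illustration; they are not taken from the phoonnx or Coqui code. The sketch takes the blank id to be the index of the trailing entry in the full, non-deduplicated list.

```python
def build_vits_vocab(pad, punctuations, characters, blank):
    """Build the table in VitsCharacters order: unsorted, duplicates kept."""
    vocab = [pad] + list(punctuations) + list(characters) + [blank]
    # A dict comprehension keeps the LAST index seen for a repeated symbol.
    char_to_id = {ch: idx for idx, ch in enumerate(vocab)}
    return vocab, char_to_id


def build_deduped_vocab(pad, punctuations, characters, blank):
    """The incorrect variant: same order, but duplicates removed."""
    symbols = [pad] + list(punctuations) + list(characters) + [blank]
    vocab = list(dict.fromkeys(symbols))  # order-preserving dedup
    return vocab, {ch: idx for idx, ch in enumerate(vocab)}


PAD, BLANK = "<PAD>", "<BLNK>"   # hypothetical token strings
punctuations = "!,. "
characters = "abc" + "aɛ"        # graphemes + ipa; "a" appears in both

vocab, char_to_id = build_vits_vocab(PAD, punctuations, characters, BLANK)
dedup_vocab, dedup_ids = build_deduped_vocab(PAD, punctuations, characters, BLANK)

print(len(vocab), char_to_id[BLANK])        # 11 10
print(len(dedup_vocab), dedup_ids[BLANK])   # 10 9
```

With one repeated symbol, deduplication shortens the list by one and moves the blank from index 10 to index 9, which is the kind of off-by-one shift the description refers to.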

Exporter (the language unlock)

The css10/CommonVoice models were blocked by two coqui quirks; scripts/conversion/coqui_vits_export now:

  • infers dims from the checkpoint (configs disagree, e.g. css10 says hidden=192 but the checkpoint is 196),
  • detects the language embedding (emb_l) even for single-language models (encoder = text_emb + lang_emb), baking langid=0 so no extra onnx input.
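A rough sketch of the two exporter behaviours listed above, working on parameter shapes only. The parameter names `text_encoder.emb.weight` and `emb_l.weight`, the function name, and the example shapes are assumptions made for this illustration (only `emb_l` is named in the description); the 192 + 4 split is likewise illustrative, picked to match the 192 vs 196 mismatch mentioned above.

```python
def infer_encoder_dims(shapes):
    """Infer encoder dims from checkpoint tensor shapes instead of the config.

    `shapes` maps parameter name -> shape tuple. With a real PyTorch
    checkpoint this could be built as
    {k: tuple(v.shape) for k, v in state_dict.items()}.
    """
    text_emb_dim = shapes["text_encoder.emb.weight"][1]
    lang_shape = shapes.get("emb_l.weight")
    has_lang_emb = lang_shape is not None
    lang_emb_dim = lang_shape[1] if has_lang_emb else 0
    return {
        "text_emb_dim": text_emb_dim,
        "lang_emb_dim": lang_emb_dim,
        # encoder width = text embedding + language embedding
        "hidden": text_emb_dim + lang_emb_dim,
        "has_language_embedding": has_lang_emb,
        # a single-language model can use a fixed language id
        "baked_language_id": 0 if has_lang_emb else None,
    }


# Illustrative shapes only, not read from a real checkpoint.
example_shapes = {
    "text_encoder.emb.weight": (130, 192),
    "emb_l.weight": (1, 4),
}
dims = infer_encoder_dims(example_shapes)
print(dims["hidden"])  # 196
```

Reading shapes from the weights means a stale or inconsistent config value cannot cause a size mismatch when the model is rebuilt for export.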

Voices (32, languages: bg bn cs da ee el en es et fi fr ga ha hr hu it ln lt lv mt nl pl pt ro sk sl sv tw yo)

Incl. en/vctk (109 speakers), 11 CommonVoice, 5 css10, 6 openbible African (Ewe/Hausa/Lingala/Twi×2/Yoruba), it/mai. All load from the index + synthesize (VitsAdapter); suite 221 passed.

Edge cases skipped: uk-mai (degenerate synth), de-thorsten (needs gruut[de]), ca/fa-custom (non-standard arch).

🤖 Generated with Claude Code

Summary by CodeRabbit

  • New Features
    • Added support for Coqui VITS voice models with automatic configuration conversion from Coqui TTS format.
    • Expanded voice index to include multiple Coqui VITS voices across different languages (Bulgarian, Bengali, Czech, Danish, German, and others).
    • Enhanced German language support with phoneme processing capabilities.

JarbasAl and others added 3 commits June 5, 2026 23:02
Coqui VITS uses VitsCharacters, not Graphemes/IPAPhonemes: the vocab is
[pad] + punctuations + (graphemes + ipa_characters) + [blank], NOT sorted, and
is_unique=False (no dedup; char_to_id keeps the last occurrence; num_chars counts
the full list incl. the trailing blank). Deduping shifts the interspersed blank id
by one -> garbage. voice_config_from_coqui now builds this exact table when the
config's characters_class is VitsCharacters, with multi-speaker support. Enables
converting the coqui zoo VITS models. Golden test locks the vocab order/blank id.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
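
To show why a shifted blank id corrupts the whole input, here is a small sketch of interspersing a blank between token ids. The function name and the ids are hypothetical; the sketch only assumes that a blank id is placed before, between, and after the text tokens.

```python
def intersperse_blank(token_ids, blank_id):
    """Return token_ids with blank_id before, between, and after each id."""
    result = [blank_id] * (len(token_ids) * 2 + 1)
    result[1::2] = token_ids  # odd positions hold the real tokens
    return result


tokens = [5, 6, 7]                          # illustrative ids
correct = intersperse_blank(tokens, 10)     # blank id from the full list
shifted = intersperse_blank(tokens, 9)      # blank id after deduping

print(correct)  # [10, 5, 10, 6, 10, 7, 10]
print(shifted)  # [9, 5, 9, 6, 9, 7, 9]
```

More than half of the positions in the sequence carry the blank id, so an id that is off by one feeds the model a different symbol at every other step.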
Convert + index coqui zoo VITS via the truly-dynamic export_vits exporter and the
VitsCharacters tokenization fix: en/ljspeech, en/vctk (109 speakers), it/mai
female+male. engine=coqui -> VitsAdapter. None-safe phonemes in the bridge.

Models with type-2 decoders / multilingual heads (css10, CommonVoice) need
per-architecture exporter handling (follow-up). coqui_vits.json wired into the
manager (index entries must not carry extra fields like num_speakers).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

Standalone VITS exporter (vendored pure-torch model, truly-dynamic text->audio
export). Converts standard coqui VITS (en/ljspeech, en/vctk multi-speaker,
it/mai). Models with non-standard/inconsistent architectures (css10 emb!=hidden,
CommonVoice multilingual) need per-model dim handling. MPL-2.0 vendored code.

coderabbitai Bot commented Jun 5, 2026

Caution

Review failed

The pull request is closed.

📥 Commits

Reviewing files that changed from the base of the PR and between 84aa9a5 and 0c007d0.

📒 Files selected for processing (28)
  • phoonnx/engines/coqui_config.py
  • phoonnx/engines/glowtts_config.py
  • phoonnx/model_manager.py
  • phoonnx/voice_index/coqui_vits.json
  • pyproject.toml
  • scripts/conversion/coqui_vits_export/__init__.py
  • scripts/conversion/coqui_vits_export/base_tts.py
  • scripts/conversion/coqui_vits_export/characters.py
  • scripts/conversion/coqui_vits_export/export_vits.py
  • scripts/conversion/coqui_vits_export/generic/__init__.py
  • scripts/conversion/coqui_vits_export/generic/gated_conv.py
  • scripts/conversion/coqui_vits_export/generic/normalization.py
  • scripts/conversion/coqui_vits_export/generic/res_conv_bn.py
  • scripts/conversion/coqui_vits_export/generic/time_depth_sep_conv.py
  • scripts/conversion/coqui_vits_export/generic/wavenet.py
  • scripts/conversion/coqui_vits_export/glow_tts/__init__.py
  • scripts/conversion/coqui_vits_export/glow_tts/decoder.py
  • scripts/conversion/coqui_vits_export/glow_tts/duration_predictor.py
  • scripts/conversion/coqui_vits_export/glow_tts/encoder.py
  • scripts/conversion/coqui_vits_export/glow_tts/glow.py
  • scripts/conversion/coqui_vits_export/glow_tts/transformer.py
  • scripts/conversion/coqui_vits_export/helpers.py
  • scripts/conversion/coqui_vits_export/hifigan_generator.py
  • scripts/conversion/coqui_vits_export/networks.py
  • scripts/conversion/coqui_vits_export/sdp.py
  • scripts/conversion/coqui_vits_export/transforms.py
  • scripts/conversion/coqui_vits_export/vits.py
  • tests/test_fastpitch.py

📝 Walkthrough

This PR introduces a comprehensive Coqui VITS export and conversion framework. It refactors the Coqui config bridge into a dedicated module, implements complete VITS architecture components (encoder, flow, duration prediction, vocoder), and provides training/inference infrastructure with ONNX export capabilities. A voice index system is added to manage Coqui VITS voice metadata.

Changes

VITS Export Framework & Coqui Bridge

| Layer | File(s) | Summary |
|---|---|---|
| Config Bridge Refactoring & Voice Index Integration | phoonnx/engines/coqui_config.py, phoonnx/engines/glowtts_config.py, phoonnx/model_manager.py, phoonnx/voice_index/coqui_vits.json, pyproject.toml | Relocates voice_config_from_coqui to dedicated coqui_config.py module handling Coqui-TTS→phoonnx VoiceConfig conversion with phonemizer backend mapping and VITS-specific tokenizer construction; glowtts_config.py re-exports for backwards compatibility; adds coqui_vits.json voice index with Hugging Face URLs and metadata; TTSModelManager.merge_default_voices() now loads voice index data; gruut[de] dependency added to German extras. |
| Character/Vocabulary System & Base TTS Class | scripts/conversion/coqui_vits_export/characters.py, scripts/conversion/coqui_vits_export/base_tts.py | Implements symbol parsing and vocabulary builders: BaseVocabulary for token list + special tokens, BaseCharacters for combined character/punctuation/special-token vocabulary, and subclasses IPAPhonemes/Graphemes with config-driven initialization and legacy compatibility. BaseTTS base class handles config/args initialization with num_chars synchronization, optional multi-speaker embedding initialization, and test sentence handling; BaseTTSE2E ensures end-to-end char-count propagation. |
| Data Loading & Synthesis Infrastructure | scripts/conversion/coqui_vits_export/base_tts.py | Implements format_batch computing duration/stop-target from attention masks, optional weighted sampling (get_sampler) for language/speaker/length, get_data_loader with DDP/feature-cache support, and synthesis pipeline: _get_language_id with validation, _get_speaker_id_or_dvector selecting id/d-vector/cloned embeddings, synthesize tokenizing text, resolving speaker/language, calling inference, and post-processing outputs via Griffin-Lim/silence-trimming. |
| Neural Building Blocks & Core Architectures | scripts/conversion/coqui_vits_export/generic/normalization.py, scripts/conversion/coqui_vits_export/generic/res_conv_bn.py, scripts/conversion/coqui_vits_export/generic/gated_conv.py, scripts/conversion/coqui_vits_export/generic/time_depth_sep_conv.py, scripts/conversion/coqui_vits_export/generic/wavenet.py, scripts/conversion/coqui_vits_export/glow_tts/transformer.py | Implements normalization layers (LayerNorm, ActNorm with data-dependent initialization), 1D convolution blocks with batch norm and residual connections, gated convolutions with GLU gating, time-depth-separable convolutions with residual adds, WaveNet-style dilated stacks with dropout/residual/skip connections and optional conditioning, and relative-position transformer with multi-head attention (including proximal bias), feed-forward networks, and layer normalization. |
| GlowTTS Encoder, Decoder & Flow Components | scripts/conversion/coqui_vits_export/glow_tts/encoder.py, scripts/conversion/coqui_vits_export/glow_tts/duration_predictor.py, scripts/conversion/coqui_vits_export/glow_tts/decoder.py, scripts/conversion/coqui_vits_export/glow_tts/glow.py | Implements complete GlowTTS: token embedding with optional prenet/postnet, encoder routing through selectable backends (relative transformer, gated/residual/time-depth-separable convolution), 2-layer duration predictor with speaker/language conditioning, invertible decoder with flow blocks (ActNorm, InvConvNear, CouplingBlock), and affine coupling layers using WaveNet-based conditioning with optional sigmoid-scaled parameters. |
| VITS-Specific Networks, Duration Prediction & Transforms | scripts/conversion/coqui_vits_export/networks.py, scripts/conversion/coqui_vits_export/helpers.py, scripts/conversion/coqui_vits_export/sdp.py, scripts/conversion/coqui_vits_export/transforms.py | Implements VITS text encoder with optional language embedding, residual affine coupling blocks (single/stacked), posterior encoder for VAE latents, helpers for masking/segmentation/alignment-path/attention generation, stochastic duration predictor with spline flows, and rational-quadratic spline transforms supporting bounded/unbounded domains with forward/inverse log-determinant computation. |
| HiFi-GAN Vocoder & Full VITS Model Implementation | scripts/conversion/coqui_vits_export/hifigan_generator.py, scripts/conversion/coqui_vits_export/vits.py, scripts/conversion/coqui_vits_export/export_vits.py | Implements HiFi-GAN generator with multi-kernel residual upsampling and optional pre-projection/global/per-layer conditioning; complete Vits model wiring text/posterior encoders, normalizing flow, duration predictor (stochastic/deterministic), HiFi-GAN decoder, speaker/language embeddings, and training forward pass with optional speaker-encoder loss; inference paths (synthesis, voice conversion), training integration (two-optimizer GAN, mixed precision), batch formatting, sampling/checkpointing, ONNX export, and character compatibility. Standalone VitsExport exporter script loads checkpoints and exports to ONNX. |
| Export Utility & Test Updates | tests/test_fastpitch.py | Test file updated to import voice_config_from_coqui from refactored coqui_config module; added assertions for phonemizer→PhonemeType mapping and VITS character vocabulary ordering validation including blank token placement. |

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs

  • TigreGotico/phoonnx#143: Both PRs modify the Coqui config bridge—PR #143 introduced voice_config_from_coqui in glowtts_config.py, while this PR refactors it into a dedicated coqui_config.py module and re-exports from glowtts_config.py for backwards compatibility.
  • TigreGotico/phoonnx#148: Both PRs enhance the Coqui→VoiceConfig bridge by extending voice_config_from_coqui to accept an engine parameter and derive phoneme_type/tokenization from Coqui phonemizer backend, overlapping directly in the config refactoring.
  • TigreGotico/phoonnx#104: Both PRs modify phoonnx/model_manager.py at the merge_default_voices() function—this PR adds loading coqui_vits.json, while PR #104 adjusts robustness/cache handling in the same function.

🐰 A VITS dream takes flight,
With flows and coupling blocks so tight,
Phonemes and voices dance with glee,
As neural nets craft melody—
ONNX exports shine so bright! ✨

@github-actions github-actions Bot added feature and removed feature labels Jun 5, 2026
github-actions Bot commented Jun 5, 2026

I've done the heavy lifting! Here are the check results. 🏋️‍♂️

I've aggregated the results of the automated checks for this PR below.

🏷️ Release Preview

Ensuring our release process remains smooth and efficient. 🚂

Current: 1.12.0a1 → Next: 1.13.0a1

| Signal | Value |
|---|---|
| Label | feature |
| PR title | feat(voices): coqui VITS engine + 32 voices across 29 languages |
| Bump | minor |

⚠️ No conventional commit prefix — alpha-only bump.
Suggested: fix: update the thing or feat: update the thing


🚀 Release Channel Compatibility

Predicted next version: 1.13.0a1

Channel Status Note Current Constraint
Stable Not in channel -
Testing Not in channel -
Alpha Not in channel -

📊 Coverage

Ensuring every change is backed by a test. ✅

39.7% total coverage

Files below 80% coverage (37 files)
| File | Coverage | Missing lines |
|---|---|---|
| phoonnx/cli.py | 0.0% | 98 |
| phoonnx/thirdparty/kog2p/__init__.py | 0.0% | 203 |
| phoonnx/thirdparty/mantoq/unicode_symbol2label.py | 0.0% | 1 |
| phoonnx/thirdparty/bw2ipa.py | 7.5% | 86 |
| phoonnx/thirdparty/mantoq/pyarabic/number.py | 7.7% | 371 |
| phoonnx/thirdparty/mantoq/buck/phonetise_buckwalter.py | 10.4% | 180 |
| phoonnx/thirdparty/hangul2ipa.py | 16.6% | 372 |
| phoonnx/phonemizers/en.py | 17.5% | 104 |
| phoonnx/thirdparty/mantoq/pyarabic/trans.py | 18.2% | 135 |
| phoonnx/model_manager.py | 19.9% | 214 |
| phoonnx/voice.py | 21.7% | 220 |
| phoonnx/thirdparty/zh_num.py | 23.1% | 83 |
| phoonnx/phonemizers/mul.py | 23.9% | 236 |
| phoonnx/thirdparty/tashkeel/__init__.py | 23.9% | 89 |
| phoonnx/phonemizers/zh.py | 27.0% | 92 |
| phoonnx/phonemizers/ko.py | 30.4% | 32 |
| phoonnx/phonemizers/gl.py | 31.1% | 42 |
| phoonnx/phonemizers/ar.py | 31.2% | 44 |
| phoonnx/thirdparty/mantoq/buck/tokenization.py | 32.5% | 27 |
| phoonnx/thirdparty/phonikud/__init__.py | 35.3% | 11 |
| phoonnx/phonemizers/ja.py | 36.0% | 32 |
| phoonnx/phonemizers/fa.py | 36.4% | 14 |
| phoonnx/phonemizers/pt.py | 38.1% | 13 |
| phoonnx/thirdparty/mantoq/pyarabic/normalize.py | 38.1% | 13 |
| phoonnx/thirdparty/mantoq/pyarabic/araby.py | 39.7% | 298 |
| phoonnx/phonemizers/he.py | 40.0% | 12 |
| phoonnx/phonemizers/vi.py | 40.0% | 12 |
| phoonnx/phonemizers/base.py | 40.8% | 71 |
| phoonnx/thirdparty/mantoq/pyarabic/stack.py | 45.5% | 6 |
| phoonnx/thirdparty/mantoq/num2words.py | 47.6% | 11 |
| phoonnx/phonemizers/mwl.py | 50.0% | 8 |
| phoonnx/tokenizer.py | 52.4% | 147 |
| phoonnx/thirdparty/mantoq/__init__.py | 60.0% | 10 |
| phoonnx/thirdparty/mantoq/pyarabic/arabrepr.py | 60.0% | 6 |
| phoonnx/config.py | 61.1% | 130 |
| phoonnx/engines/vocoders/griffinlim.py | 61.4% | 27 |
| phoonnx/engines/optispeech.py | 69.6% | 24 |

Full report: download the coverage-report artifact.

🔍 Lint

Checking the alignment of your contribution. 📏

ruff: issues found — see job log

🔒 Security (pip-audit)

Evaluating the risk associated with these changes. ⚖️

✅ No known vulnerabilities found (61 packages scanned).

📋 Repo Health

A thorough inspection of the project's hygiene. 🧼

⚠️ Some required files are missing.

Latest Version: 1.12.0a1

phoonnx/version.py — Version file
README.md — README
LICENSE — License file
pyproject.toml — pyproject.toml
⚠️ setup.py — setup.py
CHANGELOG.md — Changelog
phoonnx/version.py has valid version block markers

⚖️ License Check

Ensuring our licenses allow for commercial use. 🏢

❌ License violations detected (43 packages) — review required before merging.

Dependency                          License Name                                            License Type         Misc                                    
phoonnx:1.3.3                       Error                                                   Error                                                        

License Type                        Found                                                  
Error                               1

License distribution: 14× MIT License, 7× Apache Software License, 5× MIT, 3× Apache-2.0, 2× BSD-3-Clause, 2× ISC License (ISCL), 1× 3-Clause BSD License, 1× Apache Software License; BSD License, +8 more

Full breakdown — 43 packages
| Package | Version | License | URL |
|---|---|---|---|
| build | 1.5.0 | MIT | link |
| certifi | 2026.5.20 | Mozilla Public License 2.0 (MPL 2.0) | link |
| charset-normalizer | 3.4.7 | MIT | link |
| click | 8.4.1 | BSD-3-Clause | link |
| combo_lock | 0.3.1 | Apache-2.0 | link |
| dateparser | 1.4.0 | BSD License | link |
| filelock | 3.29.1 | MIT | link |
| flatbuffers | 25.12.19 | Apache Software License | link |
| idna | 3.18 | BSD-3-Clause | link |
| json-database | 0.10.1 | MIT | link |
| kthread | 0.2.3 | MIT License | link |
| langcodes | 3.5.1 | MIT License | link |
| markdown-it-py | 4.2.0 | MIT License | link |
| mdurl | 0.1.2 | MIT License | link |
| memory-tempfile | 2.2.3 | MIT License | link |
| numpy | 2.4.6 | BSD-3-Clause AND 0BSD AND MIT AND Zlib AND CC0-1.0 | link |
| onnxruntime | 1.26.0 | MIT License | link |
| ovos-config | 2.1.1 | Apache-2.0 | link |
| ovos-date-parser | 0.7.0a5 | Apache Software License | link |
| ovos-number-parser | 0.5.1 | Apache Software License | link |
| ovos-utils | 0.8.5 | Apache-2.0 | link |
| packaging | 26.2 | Apache-2.0 OR BSD-2-Clause | link |
| pexpect | 4.9.0 | ISC License (ISCL) | link |
| phoonnx | 1.12.0a1 | Apache Software License | link |
| protobuf | 7.35.0 | 3-Clause BSD License | link |
| ptyprocess | 0.7.0 | ISC License (ISCL) | link |
| pyee | 13.0.1 | MIT License | link |
| Pygments | 2.20.0 | BSD-2-Clause | link |
| pyproject_hooks | 1.2.0 | MIT License | link |
| python-dateutil | 2.9.0.post0 | Apache Software License; BSD License | link |
| pytz | 2026.2 | MIT License | link |
| PyYAML | 6.0.3 | MIT License | link |
| quebra-frases | 0.3.7 | Apache Software License | link |
| regex | 2026.5.9 | Apache-2.0 AND CNRI-Python | link |
| requests | 2.34.2 | Apache Software License | link |
| rich | 13.9.4 | MIT License | link |
| rich-click | 1.9.8 | MIT License (Copyright (c) 2022 Phil Ewels) | link |
| six | 1.17.0 | MIT License | link |
| typing_extensions | 4.15.0 | PSF-2.0 | link |
| tzlocal | 5.3.1 | MIT License | link |
| unicode-rbnf | 2.4.0 | MIT License | |
| urllib3 | 2.7.0 | MIT | link |
| watchdog | 6.0.0 | Apache Software License | link |

Policy: Apache 2.0 (universal donor). StrongCopyleft / NetworkCopyleft / WeakCopyleft / Other / Error categories fail. MPL allowed.

🔨 Build Tests

I tried building your changes, and here's what happened! 🔨

✅ All versions pass

Python versions checked (Build, Install, Tests): 3.10, 3.11, 3.12, 3.13, 3.14

Automating the path to a better future 🌈

JarbasAl and others added 2 commits June 5, 2026 23:36
voice_config_from_coqui had grown to handle GlowTTS, VITS (VitsCharacters) and
FastPitch -- it is a generic Coqui-TTS config bridge, not GlowTTS-specific. Move
it to phoonnx/engines/coqui_config.py; glowtts_config keeps only the Larynx
bridge and re-exports voice_config_from_coqui for back-compat. Also None-safe the
Graphemes/IPAPhonemes path (phonemes/characters may be null in some configs).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

Convert the coqui zoo VITS via the fixed exporter + VitsCharacters tokenization.
The unlock for the language set was the exporter inferring dims from the
checkpoint (configs disagree, e.g. css10 hidden 192 vs 196) and detecting the
language embedding (emb_l) even for single-language models (langid baked to 0).

Languages: bg bn cs da ee el en es et fi fr ga ha hr hu it ln lt lv mt nl pl pt
ro sk sl sv tw yo -- incl. en/vctk (109 speakers), 11 CommonVoice, 5 css10,
6 openbible African, it/mai. Edge cases skipped: uk-mai (degenerate synth),
de-thorsten (needs gruut[de]), ca/fa custom (non-standard).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@JarbasAl JarbasAl changed the title from "feat(voices): coqui VITS engine support + zoo voices (en/vctk 109-spk, it/mai)" to "feat(voices): coqui VITS engine + 32 voices across 29 languages" Jun 5, 2026
@github-actions github-actions Bot added feature and removed feature labels Jun 5, 2026
JarbasAl and others added 2 commits June 6, 2026 00:24
- de-thorsten: IPAPhonemes stores its IPA in the 'characters' field (not
  'phonemes'); the bridge now falls back to characters for phoneme models. Needs
  gruut[de] (added as the 'de' extra) since it was trained on gruut German.
- uk-mai: was never broken -- it is Ukrainian graphemes (Cyrillic), the batch's
  English sanity text just wasn't in its vocab. Converted with proper text.
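
The de-thorsten fallback described above could look roughly like the following sketch. The function name and the dictionary layout are hypothetical stand-ins for a Coqui `characters` config block; only the field names `phonemes` and `characters` come from the commit message.

```python
def symbols_for_phoneme_model(characters_cfg):
    """Pick the symbol string for a phoneme model, tolerating null fields.

    Prefers 'phonemes'; falls back to 'characters' when 'phonemes' is
    missing or null, and to an empty string when both are absent.
    """
    cfg = characters_cfg or {}
    return cfg.get("phonemes") or cfg.get("characters") or ""


# Illustrative config: IPA stored under 'characters', 'phonemes' left null.
ipa_in_characters = {"characters": "abɛʃ", "phonemes": None}
print(symbols_for_phoneme_model(ipa_in_characters))  # abɛʃ
print(repr(symbols_for_phoneme_model(None)))         # ''
```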

34 coqui VITS voices, 31 languages.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@JarbasAl JarbasAl changed the title from "feat(voices): coqui VITS engine + 32 voices across 29 languages" to "feat(voices): coqui VITS engine + 34 voices across 31 languages" Jun 5, 2026
@github-actions github-actions Bot added feature and removed feature labels Jun 5, 2026
@JarbasAl JarbasAl marked this pull request as ready for review June 6, 2026 18:17
@JarbasAl JarbasAl merged commit ccd5ec4 into dev Jun 6, 2026
13 of 14 checks passed
@github-actions github-actions Bot added feature and removed feature labels Jun 6, 2026
@JarbasAl JarbasAl changed the title from "feat(voices): coqui VITS engine + 34 voices across 31 languages" to "feat(voices): coqui VITS engine + 36 voices across 33 languages" Jun 6, 2026
@github-actions github-actions Bot added feature and removed feature labels Jun 6, 2026