feat(voices): coqui VITS engine + 36 voices across 33 languages #149
Coqui VITS uses VitsCharacters, not Graphemes/IPAPhonemes: the vocab is [pad] + punctuations + (graphemes + ipa_characters) + [blank], NOT sorted, and is_unique=False (no dedup; char_to_id keeps the last occurrence; num_chars counts the full list incl. the trailing blank). Deduping shifts the interspersed blank id by one -> garbage. voice_config_from_coqui now builds this exact table when the config's characters_class is VitsCharacters, with multi-speaker support. Enables converting the coqui zoo VITS models. Golden test locks the vocab order/blank id. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
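The vocab rule in this commit message can be illustrated with a small toy, which is only a sketch and not the code in voice_config_from_coqui. The helper names (build_vits_vocab, intersperse), the pad/blank strings, and the tiny symbol sets are all assumptions made for illustration; the toy also assumes the blank id is the position of the trailing blank entry.

```python
def build_vits_vocab(pad, punctuations, characters, blank):
    """Assemble symbols as [pad] + punctuations + characters + [blank], unsorted, no dedup."""
    vocab = [pad] + list(punctuations) + list(characters) + [blank]
    # A plain dict comprehension keeps the LAST index of a repeated symbol.
    char_to_id = {ch: i for i, ch in enumerate(vocab)}
    return vocab, char_to_id


def intersperse(ids, blank_id):
    """Place blank_id before, between, and after every token id."""
    out = [blank_id] * (2 * len(ids) + 1)
    out[1::2] = ids
    return out


# Toy inventory (hypothetical): 'a' appears twice, as it can when
# graphemes and IPA characters are concatenated.
pad, blank = "<PAD>", "<BLNK>"
vocab, char_to_id = build_vits_vocab(pad, "!,. ", "abca", blank)

num_chars = len(vocab)          # counts duplicates and the trailing blank
blank_id = char_to_id[blank]    # position of the trailing blank in the full list

# What goes wrong with dedup: the list shrinks, so the blank slides down by one.
deduped = list(dict.fromkeys(vocab))
deduped_blank_id = deduped.index(blank)

token_ids = intersperse([char_to_id["b"], char_to_id["a"]], blank_id)
```

In this toy the full list has 10 entries and the blank sits at id 9, while the deduplicated list puts it at id 8, so every interspersed blank would index the wrong embedding row.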
Convert + index coqui zoo VITS via the truly-dynamic export_vits exporter and the VitsCharacters tokenization fix: en/ljspeech, en/vctk (109 speakers), it/mai female+male. engine=coqui -> VitsAdapter. None-safe phonemes in the bridge. Models with type-2 decoders / multilingual heads (css10, CommonVoice) need per-architecture exporter handling (follow-up). coqui_vits.json wired into the manager (index entries must not carry extra fields like num_speakers). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Standalone VITS exporter (vendored pure-torch model, truly-dynamic text->audio export). Converts standard coqui VITS (en/ljspeech, en/vctk multi-speaker, it/mai). Models with non-standard/inconsistent architectures (css10 emb!=hidden, CommonVoice multilingual) need per-model dim handling. MPL-2.0 vendored code.
Caution: Review failed. The pull request is closed.
Walkthrough
This PR introduces a comprehensive Coqui VITS export and conversion framework. It refactors the Coqui config bridge into a dedicated module, implements complete VITS architecture components (encoder, flow, duration prediction, vocoder), and provides training/inference infrastructure with ONNX export capabilities. A voice index system is added to manage Coqui VITS voice metadata.
Estimated code review effort: 4 (Complex), ~60 minutes
Automated check results for this PR:

- Release Preview. Current:
- Release Channel Compatibility. Predicted next version:
- Coverage: ❌ 39.7% total coverage. Files below 80% coverage (37 files).
- Lint: ❌ ruff: issues found (see job log).
- Security (pip-audit): ✅ No known vulnerabilities found (61 packages scanned).
- Repo Health: Latest Version: ✅
- License Check: ❌ License violations detected (43 packages); review required before merging. License distribution: 14× MIT License, 7× Apache Software License, 5× MIT, 3× Apache-2.0, 2× BSD-3-Clause, 2× ISC License (ISCL), 1× 3-Clause BSD License, 1× Apache Software License; BSD License, +8 more. Policy: Apache 2.0 (universal donor). StrongCopyleft / NetworkCopyleft / WeakCopyleft / Other / Error categories fail. MPL allowed.
- Build Tests: ✅ All versions pass.
voice_config_from_coqui had grown to handle GlowTTS, VITS (VitsCharacters) and FastPitch -- it is a generic Coqui-TTS config bridge, not GlowTTS-specific. Move it to phoonnx/engines/coqui_config.py; glowtts_config keeps only the Larynx bridge and re-exports voice_config_from_coqui for back-compat. Also None-safe the Graphemes/IPAPhonemes path (phonemes/characters may be null in some configs). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Convert the coqui zoo VITS via the fixed exporter + VitsCharacters tokenization. The unlock for the language set was the exporter inferring dims from the checkpoint (configs disagree, e.g. css10 hidden 192 vs 196) and detecting the language embedding (emb_l) even for single-language models (langid baked to 0). Languages: bg bn cs da ee el en es et fi fr ga ha hr hu it ln lt lv mt nl pl pt ro sk sl sv tw yo -- incl. en/vctk (109 speakers), 11 CommonVoice, 5 css10, 6 openbible African, it/mai. Edge cases skipped: uk-mai (degenerate synth), de-thorsten (needs gruut[de]), ca/fa custom (non-standard). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
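A rough sketch of the "infer dims from the checkpoint" idea follows. It is not the exporter's code: the function name, the state-dict key names, and the toy shapes are assumptions, and plain tuples stand in for tensor shapes so the example runs without torch. The toy numbers are chosen only so that the text embedding plus a language embedding reproduces the 192 vs 196 disagreement mentioned above.

```python
def infer_encoder_dims(shapes,
                       text_emb_key="text_encoder.emb.weight",  # assumed key name
                       lang_emb_key="emb_l.weight"):            # assumed key name
    """Read embedding sizes from checkpoint shapes instead of trusting the config."""
    num_chars, text_dim = shapes[text_emb_key]
    has_lang = lang_emb_key in shapes
    lang_dim = shapes[lang_emb_key][1] if has_lang else 0
    return {
        "num_chars": num_chars,
        "text_emb_dim": text_dim,
        "lang_emb_dim": lang_dim,
        # encoder width = text embedding + language embedding
        "encoder_hidden": text_dim + lang_dim,
        "has_language_embedding": has_lang,
    }


# Toy checkpoint shapes (hypothetical): a single-language model that still
# carries a one-row language embedding table.
with_lang = infer_encoder_dims({
    "text_encoder.emb.weight": (130, 192),
    "emb_l.weight": (1, 4),
})

# Toy checkpoint without a language embedding.
without_lang = infer_encoder_dims({"text_encoder.emb.weight": (130, 192)})
```

With a real checkpoint the shapes would come from each tensor's shape attribute in the loaded state dict; when a language embedding is detected on a single-language model, the language id can be fixed to 0 at export time so the ONNX graph needs no extra input.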
- de-thorsten: IPAPhonemes stores its IPA in the 'characters' field (not 'phonemes'); the bridge now falls back to characters for phoneme models. Needs gruut[de] (added as the 'de' extra) since it was trained on gruut German.
- uk-mai: was never broken -- it is Ukrainian graphemes (Cyrillic), the batch's English sanity text just wasn't in its vocab. Converted with proper text.

34 coqui VITS voices, 31 languages.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
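The None-safe fallback described for phoneme models might look roughly like the sketch below. The function name and the uses_phonemes flag are hypothetical; only the 'characters' and 'phonemes' field names come from the commit messages, and the real bridge may structure this differently.

```python
def symbol_inventory(characters_cfg, uses_phonemes):
    """Pick the symbol string from a Coqui-style 'characters' config block.

    Either field may be missing or null, so both are coerced to "" first.
    """
    cfg = characters_cfg or {}
    characters = cfg.get("characters") or ""
    phonemes = cfg.get("phonemes") or ""
    if uses_phonemes:
        # Some phoneme models keep their IPA under 'characters' instead.
        return phonemes or characters
    return characters


# IPA stored under 'characters', 'phonemes' null (the de-thorsten style case).
ipa_in_characters = symbol_inventory({"characters": "aɪʊ", "phonemes": None}, True)

# Both fields populated: phoneme models prefer 'phonemes'.
prefers_phonemes = symbol_inventory({"characters": "abc", "phonemes": "ɛɔ"}, True)

# Entire block null: returns an empty inventory instead of raising.
empty = symbol_inventory(None, False)
```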
Adds coqui VITS support and converts 32 zoo voices across 29 languages — a large coverage boost across phoonnx architectures.
Engine + tokenization
- engine=coqui → VitsAdapter with the correct VitsCharacters tokenization: vocab [pad] + punctuations + (graphemes + ipa) + [blank], unsorted, is_unique=False (no dedup; blank id = full-list length). This differs from Graphemes/IPAPhonemes (sorted); the off-by-one blank was the "gibberish".
- The generic Coqui config bridge moves out of glowtts_config.py into coqui_config.py (it serves GlowTTS/VITS/FastPitch); glowtts keeps only Larynx + a back-compat re-export.

Exporter (the language unlock)
The css10/CommonVoice models were blocked by two coqui quirks; scripts/conversion/coqui_vits_export now:
- infers dims from the checkpoint instead of trusting the config (configs disagree, e.g. css10 declares hidden=192 but the checkpoint is 196),
- detects the language embedding (emb_l) even for single-language models (encoder = text_emb + lang_emb), baking langid=0 so no extra onnx input.

Voices (32, languages: bg bn cs da ee el en es et fi fr ga ha hr hu it ln lt lv mt nl pl pt ro sk sl sv tw yo)
Incl. en/vctk (109 speakers), 11 CommonVoice, 5 css10, 6 openbible African (Ewe/Hausa/Lingala/Twi×2/Yoruba), it/mai. All load from the index + synthesize (VitsAdapter); suite 221 passed.

Edge cases skipped: uk-mai (degenerate synth), de-thorsten (needs gruut[de]), ca/fa-custom (non-standard arch).

🤖 Generated with Claude Code