feat(engines): VITS2 + StyleTTS2 family (pure StyleTTS2 + Kokoro, multilingual) #153
- StyleTTS2Adapter (Engine.STYLETTS2): single-graph 'tokens + speed [+ style] [+ attention_mask] -> waveform', covering pure StyleTTS2 (baked reference style) and Kokoro (per-voice style pack, length-indexed). StyleTTS2 $-pad convention.
- Pure StyleTTS2 indexed (ddatt/en-styletts2): the DDATT 5-onnx pipeline STITCHED into one graph via onnx.compose (plbert->bert->final, ref_p/ref_s baked) -- no re-export. Validated through the pipeline (rms 0.14).
- VITS2 (frappuccino/vits2-ru-natasha): runs on the VitsAdapter (identical I/O); Russian graphemes. vits2.json.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…le packs)

- StyleTTS2Adapter.configure() loads a per-voice style pack from engine_params['style_path'], reshaped to [N, 256] (Kokoro: 510 rows indexed by token length; a fixed style is [1, 256]).
- model_manager: style_url field + download_style() -> engine_params['style_path'] (mirrors the vocoder per-voice-asset flow).
- Indexed 29 English Kokoro voices (af/am/bf/bm) on the shared Kokoro-82M fp16 onnx (potato-size) + misaki G2P; per-voice .bin styles. Voices verified distinct.
- pyproject: spacy>=3.7 guard on the misaki extras (en/ja/vi/zh).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
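The style-pack lookup described in this commit can be sketched roughly as follows. This is a minimal illustration, not the adapter's actual code: the function names `load_style_pack` and `select_style` are hypothetical, the float32 encoding of the .bin file is an assumption, and clamping out-of-range token counts to the last row is also an assumption. Only the [N, 256] shape, the 510-row Kokoro pack, and the [1, 256] fixed style come from the commit message.

```python
from array import array

STYLE_DIM = 256  # from the commit message: packs are reshaped to [N, 256]


def load_style_pack(raw: bytes) -> list[array]:
    """Interpret raw bytes as an [N, 256] table of style rows.

    Assumption: the .bin holds native-endian float32 values.
    """
    flat = array("f")
    flat.frombytes(raw)  # raises ValueError if len(raw) is not a multiple of 4
    if len(flat) == 0 or len(flat) % STYLE_DIM != 0:
        raise ValueError(f"expected a multiple of {STYLE_DIM} floats, got {len(flat)}")
    return [flat[i:i + STYLE_DIM] for i in range(0, len(flat), STYLE_DIM)]


def select_style(pack: list[array], n_tokens: int) -> array:
    """Pick the style row for an utterance with n_tokens tokens.

    A fixed style ([1, 256]) always yields row 0. Clamping to the last
    row for long inputs is an assumption made for this sketch.
    """
    row = min(max(n_tokens, 0), len(pack) - 1)
    return pack[row]


# Synthetic Kokoro-like pack: 510 rows, row i filled with the value i.
raw = array("f", [float(i // STYLE_DIM) for i in range(510 * STYLE_DIM)]).tobytes()
pack = load_style_pack(raw)

# Synthetic fixed style: a single row.
fixed = load_style_pack(array("f", [0.0] * STYLE_DIM).tobytes())
```

With the synthetic pack above, `select_style(pack, 42)` returns row 42, and any token count at or above 510 falls back to row 509. The fixed pack returns its only row regardless of input length.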
Caution: Review failed. The pull request is closed.

ℹ️ Recent review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID:
📒 Files selected for processing (11)
📝 Walkthrough
This PR introduces StyleTTS2/Kokoro engine support with optional per-voice style packs, expands MisakiPhonemizer with language-specific variants and alphabet selection, and updates config enums, model discovery, and dependencies to support both features.

Changes
StyleTTS2 Engine and Misaki Phonemizer Expansion
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~25 minutes
Possibly related PRs
Greetings! I've analyzed your changes and have some results to share. 🖖
I've aggregated the results of the automated checks for this PR below.

📋 Repo Health
Scanning for any signs of 'comment' bad breath. 🌬️
Latest Version: ✅

🔍 Lint
Everything looks good so far! ✅
❌ ruff: issues found; see job log

📊 Coverage
How well-protected is our logic? Let's find out! 🛡️
❌ 40.5% total coverage
Files below 80% coverage (37 files)
Full report: download the

⚖️ License Check
Checking for any restrictive patent clauses. 📜
❌ License violations detected (43 packages): review required before merging.
License distribution: 14× MIT License, 7× Apache Software License, 5× MIT, 3× Apache-2.0, 2× BSD-3-Clause, 2× ISC License (ISCL), 1× 3-Clause BSD License, 1× Apache Software License; BSD License, +8 more
Full breakdown: 43 packages
Policy: Apache 2.0 (universal donor). StrongCopyleft / NetworkCopyleft / WeakCopyleft / Other / Error categories fail. MPL allowed.

🔨 Build Tests
The build pipeline has finished its work. 🏁
✅ All versions pass
🏷️ Release Preview
A look ahead at the next milestone. 🚩
Current:
🚀 Release Channel Compatibility
Predicted next version:

🔒 Security (pip-audit)
Ensuring our dependency tree is clean of rot. 🌳
✅ No known vulnerabilities found (61 packages scanned).

Thanks for making OVOS better today! 🙌
Index 13 Asian-language Kokoro voices on the StyleTTS2 engine, exercising the misaki ja (JAG2P, openjtalk/unidic) and zh (ZHG2P, pypinyin/jieba) G2Ps:

- ja (5): jf_alpha/gongitsune/nezumi/tebukuro, jm_kumo
- zh (8): zf_xiaobei/xiaoni/xiaoxiao/xiaoyi, zm_yunjian/yunxi/yunxia/yunyang

Per-language config (lang_code drives the misaki dispatch); shared Kokoro-82M fp16 onnx + per-voice style packs. Validated from the index.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…uropean

- misaki zh G2P version switch: ZHG2P(version=...) wired through get_phonemizer's model arg (phonemizer_model). v1.0 zh = IPA (tone marks), v1.1 zh = bopomofo + tone numbers; the version must match the model's vocab.
- Kokoro v1.1-zh finetune (int8, CPU-stable potato-size): 100 Chinese (zf/zm, version 1.1) + 3 English voices.
- Kokoro v1.0 European voices via espeak (misaki's EspeakG2P fallback): es/fr/hi/it/pt (13).
- Kokoro v0.19 legacy (int8): 11 English voices.
- fp16 onnx NaNs on CPU (no fp16 kernels) -> int8 model_quantized for v1.1-zh/v0.19.

styletts2.json: 170 voices. All validated from the index (no NaN).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
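A "no NaN" validation of the kind mentioned in this commit could look roughly like the sketch below. It is not taken from the PR: `waveform_stats`, `is_valid`, and the `min_rms` threshold are hypothetical names and values chosen for illustration. The only grounded idea is that a synthesized waveform should contain no non-finite samples and should not be silent.

```python
import math


def waveform_stats(samples: list[float]) -> tuple[float, int]:
    """Return (rms, non_finite_count) for a mono waveform.

    RMS is computed over the finite samples only, so a few NaNs do not
    turn the whole statistic into NaN.
    """
    finite = [s for s in samples if math.isfinite(s)]
    bad = len(samples) - len(finite)
    rms = math.sqrt(sum(s * s for s in finite) / len(finite)) if finite else 0.0
    return rms, bad


def is_valid(samples: list[float], min_rms: float = 1e-3) -> bool:
    """Accept a waveform only if every sample is finite and it is not silent.

    min_rms is a hypothetical threshold, not a value from the PR.
    """
    rms, bad = waveform_stats(samples)
    return bad == 0 and rms >= min_rms


# One second of a 440 Hz sine at 16 kHz with amplitude 0.5.
sine = [0.5 * math.sin(2 * math.pi * 440 * n / 16000) for n in range(16000)]

# The same signal with one sample replaced by NaN, as a broken fp16 run might produce.
broken = sine[:100] + [float("nan")] + sine[101:]
```

Here `is_valid(sine)` is true and `is_valid(broken)` is false. A check like this would flag the fp16-on-CPU failure described above without needing to listen to the audio.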
…ia alphabet

misaki is not a thin wrapper: en ships ~6MB curated lexicons + spacy, ja a cutlet romanizer + lexicon, zh adds tone sandhi + frontend on pypinyin; only the espeak fallback (es/fr/hi/it/pt) is a passthrough.

Split the single dispatching MisakiPhonemizer into per-language phoneme types: MISAKI_EN, MISAKI_JA, MISAKI_ZH, MISAKI_KO, MISAKI_VI.

The zh IPA-vs-bopomofo difference is just a representation, so it's the ALPHABET, not a separate class or version param:

- MISAKI_ZH + Alphabet.IPA -> misaki v1.0 (IPA + tone marks)
- MISAKI_ZH + Alphabet.BOPOMOFO -> v1.1 (bopomofo + tone numbers)

Added Alphabet.BOPOMOFO; misaki phonemizers default to IPA. The base class stays a back-compat dispatcher for the legacy 'misaki' type. Kokoro voices re-indexed to the explicit types (v1.1-zh = misaki_zh + bopomofo). Suite 229.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
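The alphabet-driven selection described in this commit might be modeled as in the sketch below. The enums here are stand-ins defined locally, not the project's real classes. `resolve_misaki`, the lowercase enum values other than `misaki_zh`, and the "1.0"/"1.1" labels are assumptions for illustration. The commit only states that MISAKI_ZH with IPA maps to misaki v1.0 and with BOPOMOFO to v1.1, and that misaki phonemizers default to IPA.

```python
from enum import Enum
from typing import Optional


class Alphabet(Enum):
    """Stand-in enum; only the two members named in the commit."""
    IPA = "ipa"
    BOPOMOFO = "bopomofo"


class PhonemeType(Enum):
    """Stand-in for the per-language types; values are assumed by analogy
    with the 'misaki_zh' string used in the commit message."""
    MISAKI_EN = "misaki_en"
    MISAKI_JA = "misaki_ja"
    MISAKI_ZH = "misaki_zh"
    MISAKI_KO = "misaki_ko"
    MISAKI_VI = "misaki_vi"


# Version labels follow the commit text; they are descriptive labels here,
# not necessarily the literal argument the misaki library expects.
ZH_VERSION_BY_ALPHABET = {Alphabet.IPA: "1.0", Alphabet.BOPOMOFO: "1.1"}


def resolve_misaki(
    phoneme_type: PhonemeType, alphabet: Alphabet = Alphabet.IPA
) -> tuple[str, Optional[str]]:
    """Return (language, zh_version) for a phoneme type and alphabet.

    zh_version is None for every language except zh. Rejecting bopomofo
    for non-zh languages is an assumption made for this sketch.
    """
    lang = phoneme_type.value.removeprefix("misaki_")
    if phoneme_type is PhonemeType.MISAKI_ZH:
        return lang, ZH_VERSION_BY_ALPHABET[alphabet]
    if alphabet is Alphabet.BOPOMOFO:
        raise ValueError(f"bopomofo is only meaningful for zh, not {lang!r}")
    return lang, None
```

Treating the representation as an alphabet keeps one zh type in the index: a v1.1-zh voice is simply `PhonemeType.MISAKI_ZH` paired with `Alphabet.BOPOMOFO`, while every other misaki voice can rely on the IPA default.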
Two new architectures (VITS2, StyleTTS2) + the full Kokoro family (170 StyleTTS2 voices).
VITS2
frappuccino/vits2-ru-natasha: identical I/O to VITS, runs on the existing VitsAdapter. vits2.json.

StyleTTS2 engine
New StyleTTS2Adapter (Engine.STYLETTS2): single-graph tokens + speed [+ style] [+ attention_mask] → waveform, end-to-end. Per-voice style packs flow via style_url → engine_params['style_path'] → configure().

- Pure StyleTTS2 (ddatt/en-styletts2): the real 5-onnx DDATT pipeline stitched into one graph with onnx.compose (ref style baked, no diffusion at inference). No re-export.

Kokoro: every public variant (170 voices)

Key fixes:

- ZHG2P(version=...) wired through get_phonemizer's model arg. v1.0 zh = IPA (tone marks ↓↗↘), v1.1 zh = bopomofo + tone numbers (ㄋㄧ2ㄏㄠ3); must match the model's vocab.
- EspeakG2P == phoonnx espeak (identical IPA).
- model_quantized (potato-size, CPU-stable) for v1.1-zh/v0.19.

Coverage impact
misaki now exercised across en/ja/zh (both reprs); European Kokoro via espeak.

🤖 Generated with Claude Code