feat(engines): VITS2 + StyleTTS2 family (pure StyleTTS2 + Kokoro, multilingual) #153

Merged
JarbasAl merged 5 commits into dev from feat/vits2-styletts
Jun 7, 2026
Conversation

@JarbasAl JarbasAl commented Jun 7, 2026

Two new architectures (VITS2, StyleTTS2) + the full Kokoro family (170 StyleTTS2 voices).

VITS2

  • frappuccino/vits2-ru-natasha — identical I/O to VITS, runs on the existing VitsAdapter. vits2.json.

StyleTTS2 engine

New StyleTTS2Adapter (Engine.STYLETTS2): single-graph tokens + speed [+ style] [+ attention_mask] → waveform, end-to-end. Per-voice style packs flow via style_url → engine_params['style_path'] → configure().

  • Pure StyleTTS2 (ddatt/en-styletts2) — the real 5-onnx DDATT pipeline stitched into one graph with onnx.compose (ref style baked, no diffusion at inference). No re-export.
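
The adapter inputs described above can be sketched roughly as follows. This is a minimal illustration, not the actual StyleTTS2Adapter code: the input names (`tokens`, `speed`, `style`, `attention_mask`), the pad id value, and the choice of the unpadded token count as the style row index are all assumptions, and a real adapter would presumably feed only the inputs the graph declares.

```python
import numpy as np

PAD_ID = 0  # assumption: id of the "$" pad symbol in the voice's vocabulary


def build_feeds(phoneme_ids, speed=1.0, style_pack=None, pad_id=PAD_ID):
    """Build a feed dict for a single-graph StyleTTS2/Kokoro ONNX model (sketch)."""
    # Pad the sequence with the pad id at both ends; int64 with a batch axis.
    tokens = np.array([[pad_id, *phoneme_ids, pad_id]], dtype=np.int64)
    feeds = {
        "tokens": tokens,
        "speed": np.array([speed], dtype=np.float32),
        "attention_mask": np.ones_like(tokens),
    }
    if style_pack is not None:
        # A Kokoro pack holds one row per token length; a fixed style has one row.
        # Clamping keeps the lookup valid for both layouts.
        row = min(len(phoneme_ids), style_pack.shape[0] - 1)
        feeds["style"] = style_pack[row][None, :].astype(np.float32)
    return feeds
```

With five phoneme ids the token tensor has seven entries (5 tokens + 2 pads), which matches the padding case the test suite checks.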

Kokoro — every public variant (170 voices)

| Variant | Voices | G2P | Notes |
| --- | --- | --- | --- |
| v1.0 | 55 | misaki en/ja/zh + espeak es/fr/hi/it/pt | base; European langs use misaki's espeak fallback |
| v1.1-zh (finetune) | 103 | misaki zh v1.1 (bopomofo) + en | 100 Chinese voices; ZHG2P version switch |
| v0.19 (legacy) | 11 | misaki en | older weights |

Key fixes:

  • misaki zh version switch: ZHG2P(version=...) wired through get_phonemizer's model arg. v1.0 zh = IPA (tone marks ↓↗↘), v1.1 zh = bopomofo + tone numbers (ㄋㄧ2ㄏㄠ3); must match the model's vocab.
  • espeak for European Kokoro — verified misaki's EspeakG2P == phoonnx espeak (identical IPA).
  • fp16 NaNs on CPU (no fp16 kernels) → int8 model_quantized (potato-size, CPU-stable) for v1.1-zh/v0.19.

Coverage impact

  • Archs: VITS2 + StyleTTS2. The onnx-stitch trick generalizes to split-onnx pipelines.
  • G2P: misaki now exercised across en/ja/zh(both reprs); European Kokoro via espeak.
  • All 170 voices validated from the index (no NaN). Suite 226 green.
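
A "no NaN" validation like the one mentioned above could look something like this. It is only a sketch of the idea: the function name and the silence threshold are hypothetical, and the PR does not show how the validation was actually implemented.

```python
import numpy as np


def check_waveform(wav, min_rms=1e-3):
    """Return the RMS of a waveform, raising if it is empty, non-finite or silent."""
    wav = np.asarray(wav, dtype=np.float64).reshape(-1)
    if wav.size == 0:
        raise ValueError("empty waveform")
    # fp16 graphs without CPU kernels can emit NaN/Inf samples; reject them.
    if not np.isfinite(wav).all():
        raise ValueError("waveform contains NaN or Inf")
    rms = float(np.sqrt(np.mean(wav ** 2)))
    # min_rms is an assumed threshold for "the model produced audible output".
    if rms < min_rms:
        raise ValueError(f"waveform is near-silent (rms={rms:.6f})")
    return rms
```

Running a check of this kind over every indexed voice is one way to catch an fp16 model that needs to be swapped for its int8 variant.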

🤖 Generated with Claude Code

Summary by CodeRabbit

  • New Features

    • Added support for StyleTTS2 and Kokoro TTS engines with style-conditioned synthesis capabilities.
    • Added language-specific phonemizers for English, Japanese, Chinese, Korean, and Vietnamese.
    • Added Bopomofo alphabet support.
    • Added Russian VITS2 voice model to the voice index.
  • Chores

    • Updated optional language dependencies to include spacy support.
  • Tests

    • Added test coverage for language-specific phonemizers and StyleTTS2 engine functionality.

JarbasAl and others added 2 commits June 7, 2026 17:34
- StyleTTS2Adapter (Engine.STYLETTS2): single-graph 'tokens + speed [+ style]
  [+ attention_mask] -> waveform', covering pure StyleTTS2 (baked reference style)
  and Kokoro (per-voice style pack, length-indexed). StyleTTS2 $-pad convention.
- Pure StyleTTS2 indexed (ddatt/en-styletts2): the DDATT 5-onnx pipeline STITCHED
  into one graph via onnx.compose (plbert->bert->final, ref_p/ref_s baked) -- no
  re-export. Validated through the pipeline (rms 0.14).
- VITS2 (frappuccino/vits2-ru-natasha): runs on the VitsAdapter (identical I/O);
  Russian graphemes. vits2.json.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…le packs)

- StyleTTS2Adapter.configure() loads a per-voice style pack from
  engine_params['style_path'], reshaped to [N, 256] (Kokoro: 510 rows indexed by
  token length; a fixed style is [1, 256]).
- model_manager: style_url field + download_style() -> engine_params['style_path']
  (mirrors the vocoder per-voice-asset flow).
- Indexed 29 English Kokoro voices (af/am/bf/bm) on the shared Kokoro-82M fp16
  onnx (potato-size) + misaki G2P; per-voice .bin styles. Voices verified distinct.
- pyproject: spacy>=3.7 guard on the misaki extras (en/ja/vi/zh).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
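
The style-pack loading described in this commit might be sketched as below. It assumes the per-voice `.bin` file is a headerless float32 dump, which the commit message does not state explicitly; `load_style_pack` is a hypothetical helper name, not the adapter's real method.

```python
import numpy as np

STYLE_DIM = 256  # per the commit message: packs are reshaped to [N, 256]


def load_style_pack(path, style_dim=STYLE_DIM):
    """Load a style pack and reshape it to [N, style_dim] (sketch)."""
    # Assumption: the file is raw float32 values with no header.
    raw = np.fromfile(path, dtype=np.float32)
    if raw.size == 0 or raw.size % style_dim != 0:
        raise ValueError(
            f"{path}: {raw.size} floats is not a multiple of {style_dim}"
        )
    # Kokoro packs give N == 510 (one row per token length);
    # a fixed style gives N == 1.
    return raw.reshape(-1, style_dim)
```

In the flow described above, the path would come from `engine_params['style_path']` after the model manager has downloaded the file referenced by `style_url`.
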

coderabbitai Bot commented Jun 7, 2026

Review Change Stack

Caution

Review failed

The pull request is closed.

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: fe986cf0-9db2-455f-ae3d-205363a261e8

📥 Commits

Reviewing files that changed from the base of the PR and between b2faeef and fb175a4.

📒 Files selected for processing (11)
  • phoonnx/config.py
  • phoonnx/engines/__init__.py
  • phoonnx/engines/styletts2.py
  • phoonnx/model_manager.py
  • phoonnx/phonemizers/__init__.py
  • phoonnx/phonemizers/mul.py
  • phoonnx/voice_index/styletts2.json
  • phoonnx/voice_index/vits2.json
  • pyproject.toml
  • tests/test_misaki_split.py
  • tests/test_styletts2.py

📝 Walkthrough

This PR introduces StyleTTS2/Kokoro engine support with optional per-voice style packs, expands MisakiPhonemizer with language-specific variants and alphabet selection, and updates config enums, model discovery, and dependencies to support both features.

Changes

StyleTTS2 Engine and Misaki Phonemizer Expansion

| Layer / File(s) | Summary |
| --- | --- |
| Config enum and phonemizer dispatch extensions (phoonnx/config.py) | Engine enum gains STYLETTS2, Alphabet gains BOPOMOFO, and PhonemeType expands to language-specific MISAKI_EN, MISAKI_JA, MISAKI_ZH, MISAKI_KO, MISAKI_VI variants. The get_phonemizer() dispatch logic is updated to pass alphabet to phonemizer constructors and route new PhonemeType variants to their corresponding classes. |
| Misaki phonemizer alphabet support and language variants (phoonnx/phonemizers/mul.py, phoonnx/phonemizers/__init__.py) | MisakiPhonemizer constructor accepts alphabet parameter and introduces a zh_version property to select Chinese backend variant (IPA or BOPOMOFO). Five new subclasses (MisakiEnPhonemizer, MisakiJaPhonemizer, MisakiZhPhonemizer, MisakiKoPhonemizer, MisakiViPhonemizer) narrow MISAKI_LANGS by language while inheriting lazy-loading dispatch behavior. |
| Misaki phonemizer test suite (tests/test_misaki_split.py) | Tests validate that get_phonemizer() maps each PhonemeType.MISAKI_* to the correct Misaki subclass, verify zh_version is alphabet-driven, confirm language-scope narrowing via MISAKI_LANGS, and assert backward compatibility of the base MisakiPhonemizer. |
| StyleTTS2 adapter implementation and registration (phoonnx/engines/styletts2.py, phoonnx/engines/__init__.py) | New StyleTTS2Adapter handles token padding with StyleTTS2 pad ID at both ends, optional style pack indexing by token sequence length, and selects the largest output tensor as waveform. The adapter is registered with priority 33 and supports detection for both styletts2 and kokoro engines. Phoneme IDs are converted to int64, attention mask is created, and speed parameter is passed through. |
| StyleTTS2 adapter test suite (tests/test_styletts2.py) | Tests confirm adapter registration and engine detection, validate feed dict padding (5 tokens + 2 pad = 7), verify style pack token-length indexing for Kokoro, validate waveform selection in output parsing, and confirm configure() loads style binary from engine parameters and reshapes to [510, 256]. |
| Model manager style URL and download support (phoonnx/model_manager.py) | TTSModelInfo adds optional style_url field and download_style() method to cache StyleTTS2/Kokoro style embeddings locally. engine_params() passes the resolved style_path into synthesis parameters. merge_default_voices() additionally loads vits2.json and styletts2.json bundled voice indexes. |
| Dependency and voice index updates (pyproject.toml, phoonnx/voice_index/vits2.json) | pyproject.toml adds spacy>=3.7 to en, ja, vi, zh language extras. New vits2.json entry for frappuccino/vits2-ru-natasha with Hugging Face model/config URLs and phoneme/encoding metadata. |
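
The "largest output tensor" rule from the adapter summary above can be illustrated with a short sketch. The function name is hypothetical and the real output parsing may differ; the idea is just that a stitched or exported graph can return auxiliary tensors next to the audio, and the audio is by far the largest of them.

```python
import numpy as np


def pick_waveform(outputs):
    """Return the largest output tensor, flattened to 1-D float32 (sketch)."""
    if not outputs:
        raise ValueError("model returned no outputs")
    # Compare by element count, so a [1, T] waveform beats small side outputs
    # such as per-token durations.
    largest = max(outputs, key=lambda t: np.asarray(t).size)
    return np.asarray(largest, dtype=np.float32).reshape(-1)
```

Selecting by size avoids hard-coding an output name, which is convenient when different exports label the waveform differently.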

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related PRs

  • TigreGotico/phoonnx#131: Introduces the BaseOnnxAdapter framework and phoonnx.engines registration mechanism that this PR extends with StyleTTS2Adapter.
  • TigreGotico/phoonnx#149: Also extends TTSModelManager.merge_default_voices() to load additional bundled voice indexes, though for different engines.
  • TigreGotico/phoonnx#70: Similarly modifies get_phonemizer() to incorporate Alphabet into phonemizer wiring for a different phonemizer family.

Poem

🐰 A hop through configs new and sound,
Language-split Misakis abound!
StyleTTS2 paints with style and speed,
For Kokoro's flowing voice we need.
Padded tokens, waveforms bright—
Speech synthesis takes flight! 🎵

@github-actions github-actions Bot added feature and removed feature labels Jun 7, 2026

github-actions Bot commented Jun 7, 2026

Greetings! I've analyzed your changes and have some results to share. 🖖

I've aggregated the results of the automated checks for this PR below.

📋 Repo Health

Scanning for any signs of 'comment' bad breath. 🌬️

⚠️ Some required files are missing.

Latest Version: 1.15.0a1

phoonnx/version.py — Version file
README.md — README
LICENSE — License file
pyproject.toml — pyproject.toml
⚠️ setup.py — setup.py
CHANGELOG.md — Changelog
phoonnx/version.py has valid version block markers

🔍 Lint

Everything looks good so far! ✅

ruff: issues found — see job log

📊 Coverage

How well-protected is our logic? Let's find out! 🛡️

40.5% total coverage

Files below 80% coverage (37 files)
| File | Coverage | Missing lines |
| --- | --- | --- |
| phoonnx/cli.py | 0.0% | 98 |
| phoonnx/thirdparty/kog2p/__init__.py | 0.0% | 203 |
| phoonnx/thirdparty/mantoq/unicode_symbol2label.py | 0.0% | 1 |
| phoonnx/thirdparty/bw2ipa.py | 7.5% | 86 |
| phoonnx/thirdparty/mantoq/pyarabic/number.py | 7.7% | 371 |
| phoonnx/thirdparty/mantoq/buck/phonetise_buckwalter.py | 10.4% | 180 |
| phoonnx/thirdparty/hangul2ipa.py | 16.6% | 372 |
| phoonnx/phonemizers/en.py | 17.5% | 104 |
| phoonnx/thirdparty/mantoq/pyarabic/trans.py | 18.2% | 135 |
| phoonnx/model_manager.py | 19.4% | 229 |
| phoonnx/voice.py | 21.7% | 220 |
| phoonnx/thirdparty/zh_num.py | 23.1% | 83 |
| phoonnx/thirdparty/tashkeel/__init__.py | 23.9% | 89 |
| phoonnx/phonemizers/zh.py | 27.0% | 92 |
| phoonnx/phonemizers/mul.py | 27.6% | 234 |
| phoonnx/phonemizers/ko.py | 30.4% | 32 |
| phoonnx/phonemizers/gl.py | 31.1% | 42 |
| phoonnx/phonemizers/ar.py | 31.2% | 44 |
| phoonnx/thirdparty/mantoq/buck/tokenization.py | 32.5% | 27 |
| phoonnx/thirdparty/phonikud/__init__.py | 35.3% | 11 |
| phoonnx/phonemizers/ja.py | 36.0% | 32 |
| phoonnx/phonemizers/fa.py | 36.4% | 14 |
| phoonnx/phonemizers/pt.py | 38.1% | 13 |
| phoonnx/thirdparty/mantoq/pyarabic/normalize.py | 38.1% | 13 |
| phoonnx/thirdparty/mantoq/pyarabic/araby.py | 39.7% | 298 |
| phoonnx/phonemizers/he.py | 40.0% | 12 |
| phoonnx/phonemizers/vi.py | 40.0% | 12 |
| phoonnx/phonemizers/base.py | 40.8% | 71 |
| phoonnx/thirdparty/mantoq/pyarabic/stack.py | 45.5% | 6 |
| phoonnx/thirdparty/mantoq/num2words.py | 47.6% | 11 |
| phoonnx/phonemizers/mwl.py | 50.0% | 8 |
| phoonnx/tokenizer.py | 52.4% | 147 |
| phoonnx/thirdparty/mantoq/__init__.py | 60.0% | 10 |
| phoonnx/thirdparty/mantoq/pyarabic/arabrepr.py | 60.0% | 6 |
| phoonnx/engines/vocoders/griffinlim.py | 61.4% | 27 |
| phoonnx/config.py | 65.8% | 120 |
| phoonnx/engines/optispeech.py | 69.6% | 24 |

Full report: download the coverage-report artifact.

⚖️ License Check

Checking for any restrictive patent clauses. 📜

❌ License violations detected (43 packages) — review required before merging.

Dependency                          License Name                                            License Type         Misc                                    
phoonnx:1.3.3                       Error                                                   Error                                                        

License Type                        Found                                                  
Error                               1

License distribution: 14× MIT License, 7× Apache Software License, 5× MIT, 3× Apache-2.0, 2× BSD-3-Clause, 2× ISC License (ISCL), 1× 3-Clause BSD License, 1× Apache Software License; BSD License, +8 more

Full breakdown — 43 packages
| Package | Version | License | URL |
| --- | --- | --- | --- |
| build | 1.5.0 | MIT | link |
| certifi | 2026.5.20 | Mozilla Public License 2.0 (MPL 2.0) | link |
| charset-normalizer | 3.4.7 | MIT | link |
| click | 8.4.1 | BSD-3-Clause | link |
| combo_lock | 0.3.1 | Apache-2.0 | link |
| dateparser | 1.4.0 | BSD License | link |
| filelock | 3.29.1 | MIT | link |
| flatbuffers | 25.12.19 | Apache Software License | link |
| idna | 3.18 | BSD-3-Clause | link |
| json-database | 0.10.1 | MIT | link |
| kthread | 0.2.3 | MIT License | link |
| langcodes | 3.5.1 | MIT License | link |
| markdown-it-py | 4.2.0 | MIT License | link |
| mdurl | 0.1.2 | MIT License | link |
| memory-tempfile | 2.2.3 | MIT License | link |
| numpy | 2.4.6 | BSD-3-Clause AND 0BSD AND MIT AND Zlib AND CC0-1.0 | link |
| onnxruntime | 1.26.0 | MIT License | link |
| ovos-config | 2.1.1 | Apache-2.0 | link |
| ovos-date-parser | 0.7.0a5 | Apache Software License | link |
| ovos-number-parser | 0.5.1 | Apache Software License | link |
| ovos-utils | 0.8.5 | Apache-2.0 | link |
| packaging | 26.2 | Apache-2.0 OR BSD-2-Clause | link |
| pexpect | 4.9.0 | ISC License (ISCL) | link |
| phoonnx | 1.15.0a1 | Apache Software License | link |
| protobuf | 7.35.0 | 3-Clause BSD License | link |
| ptyprocess | 0.7.0 | ISC License (ISCL) | link |
| pyee | 13.0.1 | MIT License | link |
| Pygments | 2.20.0 | BSD-2-Clause | link |
| pyproject_hooks | 1.2.0 | MIT License | link |
| python-dateutil | 2.9.0.post0 | Apache Software License; BSD License | link |
| pytz | 2026.2 | MIT License | link |
| PyYAML | 6.0.3 | MIT License | link |
| quebra-frases | 0.3.7 | Apache Software License | link |
| regex | 2026.5.9 | Apache-2.0 AND CNRI-Python | link |
| requests | 2.34.2 | Apache Software License | link |
| rich | 13.9.4 | MIT License | link |
| rich-click | 1.9.8 | MIT License | link |
| six | 1.17.0 | MIT License | link |
| typing_extensions | 4.15.0 | PSF-2.0 | link |
| tzlocal | 5.3.1 | MIT License | link |
| unicode-rbnf | 2.4.0 | MIT License | |
| urllib3 | 2.7.0 | MIT | link |
| watchdog | 6.0.0 | Apache Software License | link |

Policy: Apache 2.0 (universal donor). StrongCopyleft / NetworkCopyleft / WeakCopyleft / Other / Error categories fail. MPL allowed.

🔨 Build Tests

The build pipeline has finished its work. 🏁

✅ All versions pass

Python Build Install Tests
3.10
3.11
3.12
3.13
3.14

🏷️ Release Preview

A look ahead at the next milestone. 🚩

Current: 1.15.0a1; Next: 1.16.0a1

| Signal | Value |
| --- | --- |
| Label | feature |
| PR title | feat(engines): VITS2 + StyleTTS2 family (pure StyleTTS2 + Kokoro, multilingual) |
| Bump | minor |

⚠️ No conventional commit prefix — alpha-only bump.
Suggested: fix: update the thing or feat: update the thing


🚀 Release Channel Compatibility

Predicted next version: 1.16.0a1

Channel Status Note Current Constraint
Stable Not in channel -
Testing Not in channel -
Alpha Not in channel -

🔒 Security (pip-audit)

Ensuring our dependency tree is clean of rot. 🌳

✅ No known vulnerabilities found (61 packages scanned).


Thanks for making OVOS better today! 🙌

Index 13 Asian-language Kokoro voices on the StyleTTS2 engine, exercising the
misaki ja (JAG2P, openjtalk/unidic) and zh (ZHG2P, pypinyin/jieba) G2Ps:
- ja (5): jf_alpha/gongitsune/nezumi/tebukuro, jm_kumo
- zh (8): zf_xiaobei/xiaoni/xiaoxiao/xiaoyi, zm_yunjian/yunxi/yunxia/yunyang
Per-language config (lang_code drives the misaki dispatch); shared Kokoro-82M
fp16 onnx + per-voice style packs. Validated from the index.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@JarbasAl JarbasAl changed the title feat(engines): VITS2 + StyleTTS2 family (pure StyleTTS2 + Kokoro) feat(engines): VITS2 + StyleTTS2 family (pure StyleTTS2 + Kokoro, multilingual) Jun 7, 2026
@github-actions github-actions Bot added feature and removed feature labels Jun 7, 2026
…uropean

- misaki zh G2P version switch: ZHG2P(version=...) wired through get_phonemizer's
  model arg (phonemizer_model). v1.0 zh = IPA (tone marks), v1.1 zh = bopomofo +
  tone numbers; the version must match the model's vocab.
- Kokoro v1.1-zh finetune (int8, CPU-stable potato-size): 100 Chinese (zf/zm,
  version 1.1) + 3 English voices.
- Kokoro v1.0 European voices via espeak (misaki's EspeakG2P fallback): es/fr/hi/
  it/pt (13).
- Kokoro v0.19 legacy (int8): 11 English voices.
- fp16 onnx NaNs on CPU (no fp16 kernels) -> int8 model_quantized for v1.1-zh/v0.19.

styletts2.json: 170 voices. All validated from the index (no NaN).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@github-actions github-actions Bot added feature and removed feature labels Jun 7, 2026
…ia alphabet

misaki is not a thin wrapper — en ships ~6MB curated lexicons + spacy, ja a cutlet
romanizer + lexicon, zh adds tone sandhi + frontend on pypinyin; only the espeak
fallback (es/fr/hi/it/pt) is a passthrough. Split the single dispatching
MisakiPhonemizer into per-language phoneme types:

  MISAKI_EN  MISAKI_JA  MISAKI_ZH  MISAKI_KO  MISAKI_VI

The zh IPA-vs-bopomofo difference is just a representation, so it's the ALPHABET,
not a separate class or version param: MISAKI_ZH + Alphabet.IPA -> misaki v1.0
(IPA + tone marks), + Alphabet.BOPOMOFO -> v1.1 (bopomofo + tone numbers). Added
Alphabet.BOPOMOFO; misaki phonemizers default to IPA. The base class stays a
back-compat dispatcher for the legacy 'misaki' type. Kokoro voices re-indexed to
the explicit types (v1.1-zh = misaki_zh + bopomofo). Suite 229.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
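
The alphabet-driven choice described in this commit can be sketched as a small lookup. The enum below is a stand-in with assumed string values (the real Alphabet enum in phoonnx/config.py has more members), and the returned strings label the Kokoro/misaki representation; what exactly gets passed to ZHG2P(version=...) is not shown in the PR, so that part is left out.

```python
from enum import Enum


class Alphabet(str, Enum):
    # Stand-in for phoonnx's Alphabet enum; the values here are assumptions.
    IPA = "ipa"
    BOPOMOFO = "bopomofo"


def zh_version(alphabet=Alphabet.IPA):
    """Map the configured alphabet to the misaki zh representation (sketch)."""
    versions = {
        Alphabet.IPA: "1.0",       # IPA + tone marks
        Alphabet.BOPOMOFO: "1.1",  # bopomofo + tone numbers
    }
    try:
        return versions[Alphabet(alphabet)]
    except (KeyError, ValueError):
        raise ValueError(
            f"unsupported alphabet for misaki zh: {alphabet!r}"
        ) from None
```

Keeping the switch on the alphabet means a voice config only has to state which representation its vocabulary expects, and the default stays IPA as the commit describes.
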
@JarbasAl JarbasAl marked this pull request as ready for review June 7, 2026 18:54
@JarbasAl JarbasAl merged commit 9b07bf1 into dev Jun 7, 2026
11 of 12 checks passed
@github-actions github-actions Bot added feature and removed feature labels Jun 7, 2026