feat(engines): GlowTTS / Larynx inference adapter#143
Walkthrough

This PR adds complete GlowTTS (Larynx) TTS engine support with a parametric Griffin-Lim vocoder fallback. Changes include the GlowTTS ONNX adapter, config conversion from Larynx/Coqui formats, mel preprocessing and Griffin-Lim vocoder implementation, voice registry with 50+ voice definitions, model manager integration for parametric vocoders, comprehensive documentation, and test coverage.
Add GlowTTS (flow-based acoustic model + separate vocoder) support: the engine behind Larynx, the mimic3/piper precursor. It is two-stage like Matcha-TTS (text -> mel, then a vocoder), so the adapter reuses the vocoder registry.

- GlowTTSAdapter: input/input_lengths/scales=[noise_scale, length_scale] -> mel; picks the mel by its n_mels axis (Larynx emits an extra output) and runs the vocoder from engine_params.
- glowtts_config.py: voice_config_from_larynx() builds a native VoiceConfig from a Larynx config.json + phonemes.txt (gruut, blank-interspersed tokenization).
- Engine.GLOWTTS: registered with detect_priority before VITS (both have a `scales` input, but GlowTTS is identified by its mel output).
- Mirror Larynx voices (cmu_aew, ljspeech) to OpenVoiceOS/phoonnx-glowtts with modernized native configs, and the HiFi-GAN vocoder to phoonnx-vocoders; voice_index/glowtts.json links them.

Verified: voices load from the index (auto-download model + vocoder) and synthesize end-to-end. 9 unit tests; full suite 176 passed.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
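The "pick the mel by its n_mels axis" step can be sketched as follows. This is a minimal illustration, not the adapter's actual code: the function name `pick_mel`, the default `n_mels=80`, and the `(batch, n_mels, frames)` output layout are assumptions made for the example.

```python
import numpy as np


def pick_mel(outputs, n_mels=80):
    """Return the first 3-D output that has an axis of size n_mels.

    The result is always laid out as (batch, n_mels, frames).
    Hypothetical helper: name, default and layout are assumptions.
    """
    for out in outputs:
        arr = np.asarray(out)
        if arr.ndim != 3:
            continue  # skip scalars, length tensors and other extras
        if arr.shape[1] == n_mels:
            return arr
        if arr.shape[2] == n_mels:
            # (batch, frames, n_mels) -> (batch, n_mels, frames)
            return arr.transpose(0, 2, 1)
    raise ValueError(f"no output with an n_mels={n_mels} axis found")


# Example: a model that emits an extra non-mel output next to the mel.
extra = np.zeros((1, 1, 37), dtype=np.float32)
mel = np.ones((1, 80, 37), dtype=np.float32)
picked = pick_mel([extra, mel])
```

A shape check like this is ambiguous when the frame count happens to equal `n_mels`; a real adapter would likely also consult the output names or the voice config.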
Mirror the full Larynx glow_tts voice set (9 languages: en/de/es/fr/it/nl/ru/sv/sw, 51 voices) to OpenVoiceOS/phoonnx-glowtts with native configs. Phonemizer is auto-detected per voice from phonemes.txt (IPA -> gruut, plain chars -> graphemes); all 51 are gruut. Each is linked to the HiFi-GAN vocoder.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
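One way the per-voice phonemizer detection could work is to look for IPA-specific characters in the symbol table. The sketch below is a guess at such a heuristic, not the converter's actual rule: the function name `detect_phonemizer` and the choice of Unicode blocks are assumptions.

```python
def detect_phonemizer(symbols):
    """Guess the phonemizer from a voice's symbol table.

    Hypothetical heuristic: any character from the Unicode
    "IPA Extensions" (U+0250-U+02AF) or "Spacing Modifier Letters"
    (U+02B0-U+02FF) blocks is taken as a sign of IPA, hence gruut.
    """
    for symbol in symbols:
        for ch in symbol:
            if 0x0250 <= ord(ch) <= 0x02FF:
                return "gruut"
    return "graphemes"


ipa_voice = detect_phonemizer(["a", "b", "ʃ", "ˈ"])
plain_voice = detect_phonemizer(["a", "b", "c", " "])
```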
Convert + mirror coqui-TTS GlowTTS voices (official zoo) alongside Larynx, with their finetuned, mel-matched vocoders for neural quality where available.

- GriffinLimVocoder: parametric mel->audio vocoder (no model file), matching coqui's AudioProcessor de-normalization (db_to_amp / symmetric norm). Universal fallback for voices with no mel-matched neural vocoder.
- "melgan" vocoder alias (multiband-melgan is a 1-output mel->audio ONNX).
- voice_config_from_coqui(): build a native VoiceConfig from a coqui GlowTTS config ([pad,eos,bos]+chars/phonemes vocab; graphemes or espeak).
- GlowTTSAdapter + model_manager: support a parametric vocoder (vocoder_type + config, no vocoder_url) so Griffin-Lim voices load via the standard path.
- voice_index/glowtts.json: 58 voices (51 Larynx + 7 coqui official); vocoders 53 hifigan / 2 melgan / 3 griffinlim.

Acoustic + HiFi-GAN/MelGAN vocoders are converted by standalone exporters that vendor only coqui's pure-torch model code (no coqui-tts dependency).

Verified: voices load from the index (auto-download model + vocoder) and synthesize. Full suite 182 passed.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
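The de-normalization that precedes Griffin-Lim (symmetric norm, then db_to_amp) can be sketched with NumPy alone. This follows the general shape of a coqui-style audio processor, but the function name `denormalize_mel` and all default values are illustrative assumptions; a real voice would read them from its config. The Griffin-Lim phase reconstruction itself is not shown.

```python
import numpy as np


def denormalize_mel(mel, *, min_level_db=-100.0, ref_level_db=20.0,
                    max_norm=4.0, symmetric_norm=True, clip_norm=True,
                    spec_gain=20.0):
    """Map a normalized mel back to linear amplitude.

    Hypothetical helper; parameter defaults are assumptions.
    """
    mel = np.asarray(mel, dtype=np.float64)
    if symmetric_norm:
        # Normalized range is [-max_norm, max_norm].
        if clip_norm:
            mel = np.clip(mel, -max_norm, max_norm)
        db = (mel + max_norm) * (-min_level_db) / (2 * max_norm) + min_level_db
    else:
        # Normalized range is [0, max_norm].
        if clip_norm:
            mel = np.clip(mel, 0.0, max_norm)
        db = mel * (-min_level_db) / max_norm + min_level_db
    db = db + ref_level_db
    # db_to_amp: invert amp_to_db = spec_gain * log10(amplitude).
    return np.power(10.0, db / spec_gain)


# With ref_level_db=0, the top of the range maps to amplitude 1.0
# and the bottom to 10 ** (min_level_db / spec_gain).
amps = denormalize_mel(np.array([4.0, -4.0]), ref_level_db=0.0)
```

The resulting amplitude mel would then be projected back to a linear spectrogram and passed to a Griffin-Lim routine such as the one in librosa.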
Multiband-MelGAN expects stats-normalized mels (scale_stats.npy mean/std), while GlowTTS emits dB-scale mels; feeding one to the other produced garbage. Add a config-flagged _preprocess_mel step on BaseVocoder so a converted vocoder declares its input convention:

- stats_norm + mel_mean/mel_std -> standard-scale the mel (Coqui StandardScaler).

The melgan vocoder.json carries the stats (from the vocoder's scale_stats.npy), so the runtime applies (mel - mean)/std before the ONNX. Opt-in per flag: HiFi-GAN voices (no stats) are untouched. en/ljspeech + uk/mai are neural MelGAN again (no Griffin-Lim fallback).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
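The opt-in standard scaling described above amounts to a per-channel `(mel - mean) / std`. A minimal sketch, assuming the config is a plain dict with the `stats_norm`, `mel_mean` and `mel_std` keys named in this commit and that the mel is laid out as `(batch, n_mels, frames)`; the free function `preprocess_mel` is a stand-in for the `_preprocess_mel` method:

```python
import numpy as np


def preprocess_mel(mel, config):
    """Standard-scale the mel when the vocoder config asks for it.

    Stand-in for the method on the vocoder base class; the dict-based
    config and the (batch, n_mels, frames) layout are assumptions.
    """
    if not config.get("stats_norm"):
        return mel  # e.g. HiFi-GAN voices: pass through untouched
    mean = np.asarray(config["mel_mean"], dtype=np.float32).reshape(1, -1, 1)
    std = np.asarray(config["mel_std"], dtype=np.float32).reshape(1, -1, 1)
    return (mel - mean) / std


mel = np.ones((1, 2, 3), dtype=np.float32)
melgan_cfg = {"stats_norm": True, "mel_mean": [1.0, 0.0], "mel_std": [1.0, 2.0]}
scaled = preprocess_mel(mel, melgan_cfg)
untouched = preprocess_mel(mel, {})
```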
Add docs/vocoders.md documenting the shared vocoder registry used by GlowTTS, Matcha-TTS and OptiSpeech: the vocoder families (vocos/wavenext/hifigan/melgan/raw/griffinlim), how a voice links its vocoder in the index, the config-driven mel preprocessing flags (stats_norm), and how to use, replace, swap, and add vocoders. Cross-linked from glowtts.md and matcha.md.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
ff37adb to
0a3bd38
Compare
Automated check results for this PR:

- Lint: ❌ ruff: issues found (see job log).
- Coverage: ❌ 38.8% total coverage; 37 files below 80% coverage.
- Security (pip-audit): ✅ No known vulnerabilities found (61 packages scanned).
- License check: ❌ License violations detected (43 packages); review required before merging. License distribution: 14× MIT License, 7× Apache Software License, 5× MIT, 3× Apache-2.0, 2× BSD-3-Clause, 2× ISC License (ISCL), 1× 3-Clause BSD License, 1× Apache Software License; BSD License, +8 more. Policy: Apache 2.0 (universal donor). StrongCopyleft / NetworkCopyleft / WeakCopyleft / Other / Error categories fail. MPL allowed.
- Build tests: ✅ All versions pass.
librosa lives in the [train] extra, not core, so a core install hits ModuleNotFoundError when a Griffin-Lim voice loads, and CI build_tests failed on test_griffinlim_mel_to_audio. Give GriffinLimVocoder a clear ImportError with an install hint, and skip the GL synthesis test when librosa is absent. Neural vocoders (HiFi-GAN/MelGAN) and all other engines are unaffected. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
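The "clear ImportError with an install hint" pattern can be sketched generically. The helper name `require_module` and the hint text are assumptions for illustration; the actual message in GriffinLimVocoder may differ.

```python
import importlib


def require_module(name, hint):
    """Import an optional dependency or fail with an actionable message.

    Hypothetical helper: the name and hint wording are assumptions.
    """
    try:
        return importlib.import_module(name)
    except ImportError as exc:
        raise ImportError(
            f"{name} is required for this feature but is not installed. {hint}"
        ) from exc


# A Griffin-Lim vocoder would call this lazily, only when it is used:
#   librosa = require_module("librosa", "Install the train extra.")
json_mod = require_module("json", "Part of the standard library.")
```

On the test side, `pytest.importorskip("librosa")` skips a test when the module is absent, which matches the skip behavior described in this commit.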
Adds GlowTTS support — the flow-based engine behind Larynx, the mimic3/piper precursor. GlowTTS is two-stage (text→mel + a separate vocoder), so it reuses the vocoder registry built for Matcha-TTS.
What's in
- `GlowTTSAdapter`: `input`/`input_lengths`/`scales=[noise_scale, length_scale]` → mel; finds the mel by its `n_mels` axis (Larynx emits an extra output) and runs the vocoder from `engine_params`.
- `glowtts_config.py`: `voice_config_from_larynx()` builds a native `VoiceConfig` from a Larynx `config.json` + `phonemes.txt` (gruut phonemizer, blank-interspersed, 46-symbol table).
- `Engine.GLOWTTS` + registration. Priority: GlowTTS shares the `scales` input with VITS, so it's probed first and is distinguished by its mel (not waveform) output. VITS/Matcha detection unaffected.
- Larynx voices (`cmu_aew`, `ljspeech`) → `OpenVoiceOS/phoonnx-glowtts` with modernized native configs; the HiFi-GAN vocoder → `OpenVoiceOS/phoonnx-vocoders`. `voice_index/glowtts.json` links them (`vocoder_url`).
- `docs/glowtts.md`.

Verified
Voices load from the index (auto-download model + vocoder) and synthesize end-to-end (en-US, gruut → mel → HiFi-GAN). 9 unit tests; full suite 176 passed, 1 skipped.
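The blank-interspersed tokenization used by the Larynx configs can be sketched as below. This is an illustration under assumptions: the helper names, the `_` blank symbol, the toy symbol table, and placing a blank at both ends of the sequence are not confirmed by this PR.

```python
def intersperse(ids, blank_id=0):
    """Insert blank_id before, between and after every token id."""
    result = [blank_id] * (2 * len(ids) + 1)
    result[1::2] = ids
    return result


def tokenize(phonemes, symbol_to_id, blank="_"):
    """Map phonemes to ids, drop unknown symbols, then add blanks.

    Hypothetical helper; dropping unknown symbols is an assumption.
    """
    ids = [symbol_to_id[p] for p in phonemes if p in symbol_to_id]
    return intersperse(ids, symbol_to_id[blank])


# Toy symbol table standing in for the one built from phonemes.txt.
table = {"_": 0, "h": 5, "ə": 6}
token_ids = tokenize(["h", "ə", "?"], table)
```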
🤖 Generated with Claude Code