feat(tts): add Qwen3-TTS as third TTS backend #311
Open
basnijholt wants to merge 18 commits into main
Add Qwen3-TTS as a new TTS backend option alongside Kokoro and Piper.

Features:
- High-quality multilingual neural TTS with 10+ language support
- 9 built-in voices (Vivian, Ryan, Serena, Dylan, Eric, Aiden, etc.)
- OpenAI voice name mapping (alloy, echo, fable, etc. → Qwen speakers)
- Subprocess isolation for clean memory management on TTL unload
- GPU acceleration (CUDA/MPS) with CPU fallback

Usage:
    pip install "agent-cli[qwen-tts]"
    agent-cli server tts --backend qwen

The model (~3GB) auto-downloads from HuggingFace on first use.
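The voice name mapping listed under Features could look roughly like the sketch below. The PR text only says that OpenAI names map to Qwen speakers, so the specific pairings, the default speaker, and the `resolve_speaker` helper are all assumptions made for illustration.

```python
# Hypothetical pairings: the PR does not say which OpenAI name maps to which speaker.
OPENAI_TO_QWEN = {
    "alloy": "Vivian",
    "echo": "Ryan",
    "fable": "Serena",
}

# Speaker names taken from the PR description (six of the nine are listed there).
QWEN_SPEAKERS = ("Vivian", "Ryan", "Serena", "Dylan", "Eric", "Aiden")

DEFAULT_SPEAKER = "Vivian"  # assumed default, not confirmed by the PR


def resolve_speaker(voice: str) -> str:
    """Map an OpenAI-style or native voice name to a Qwen speaker name."""
    key = voice.strip().lower()
    if key in OPENAI_TO_QWEN:
        return OPENAI_TO_QWEN[key]
    # Accept native speaker names case-insensitively.
    for speaker in QWEN_SPEAKERS:
        if speaker.lower() == key:
            return speaker
    # Unknown names fall back to the default instead of raising.
    return DEFAULT_SPEAKER
```

Falling back to a default keeps OpenAI-compatible clients working when they send a voice name the backend does not know; raising an error instead would be an equally reasonable choice.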
- Include qwen-tts in @requires_extras decorator so users can install only qwen-tts without piper or kokoro
- Remove unused QWEN_SAMPLE_RATE constant (sample rate comes from model)
- Add TestQwenBackend test class with 7 tests covering initialization, voice mapping, language support, and streaming status
- Add test_create_qwen_backend to TestBackendFactory
- Regenerate auto-generated docs to show qwen in --backend options
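To illustrate the "any one of these extras is enough" behavior described in the first item, here is a minimal sketch. It is not the project's `@requires_extras` implementation: `requires_any_extra`, `is_installed`, and the demo table are hypothetical names, and the table only mimics the `name: [description, [modules]]` shape visible in `_extras.json`.

```python
import functools
import importlib.util

# Stand-in table in the same shape as agent_cli/_extras.json:
# extra name -> [description, [importable module names]].
# Entries and module names here are illustrative, not the real file.
DEMO_EXTRAS = {
    "json-demo": ["Always available (stdlib json)", ["json"]],
    "missing-demo": ["Never available", ["module_that_does_not_exist_xyz"]],
}


def is_installed(extra, table):
    """Return True if every module listed for the extra can be imported."""
    _description, modules = table[extra]
    return all(importlib.util.find_spec(name) is not None for name in modules)


def requires_any_extra(*extras, table):
    """Allow the call if at least one of the named extras is installed."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            if not any(is_installed(extra, table) for extra in extras):
                wanted = ", ".join(extras)
                raise RuntimeError(f"Install at least one of these extras: {wanted}")
            return func(*args, **kwargs)
        return wrapper
    return decorator


@requires_any_extra("missing-demo", "json-demo", table=DEMO_EXTRAS)
def start_server():
    return "started"
```

With this shape, adding a new backend to the accepted set is a one-word change at the decorator call site.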
- Pass cache_dir to Qwen3TTSModel.from_pretrained() so --cache-dir works
- Remove no-op try/except around flash attention dict assignment
- Remove unused SUPPORTED_LANGUAGES list and associated test
The Qwen backend gets sample rate dynamically from model output, so this constant was never used.
basnijholt
commented
Jan 25, 2026
agent_cli/server/cli.py
Outdated
if backend == "qwen":
    from huggingface_hub import snapshot_download  # noqa: PLC0415

    from agent_cli.server.tts.backends.base import (  # noqa: PLC0415
Owner
Author
First party imports at top
audio_int16 = (audio * 32767).astype(np.int16)

buffer = io.BytesIO()
with wave.open(buffer, "wb") as wav:
agent_cli/_extras.json
Outdated
"vad": ["Voice Activity Detection (silero-vad)", ["silero_vad"]],
"faster-whisper": ["Whisper ASR (CUDA/CPU)", ["faster_whisper"]],
"mlx-whisper": ["Whisper ASR (Apple Silicon)", ["mlx_whisper"]]
"audio": [
- Use existing pcm_to_wav helper instead of duplicating WAV conversion
- Remove redundant librosa/soundfile deps (already transitive from qwen-tts)
- Restore compact JSON format for _extras.json
- Remove unused io/wave imports from qwen.py
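For readers unfamiliar with the conversion being deduplicated, the sketch below shows the general float-to-16-bit-PCM-to-WAV step using only the standard library. It is not the project's `pcm_to_wav` helper, whose signature is not shown in this PR; `float_to_wav_bytes` is a hypothetical stand-in, and the mono, 16-bit layout is an assumption.

```python
import io
import wave


def float_to_wav_bytes(samples, sample_rate):
    """Encode mono float samples in [-1.0, 1.0] as a 16-bit PCM WAV file."""
    pcm = bytearray()
    for sample in samples:
        # Clip first so out-of-range values cannot overflow int16.
        clipped = max(-1.0, min(1.0, sample))
        pcm += int(clipped * 32767).to_bytes(2, "little", signed=True)

    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav:
        wav.setnchannels(1)       # mono (assumed)
        wav.setsampwidth(2)       # 2 bytes per sample = 16-bit
        wav.setframerate(sample_rate)
        wav.writeframes(bytes(pcm))
    return buffer.getvalue()
```

The sample rate is a parameter here because, as the commits above note, the Qwen backend takes it from the model output instead of a constant.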
Restore original format for _extras.json - just add qwen-tts entry without changing existing entries or ordering.
Adds torch.compile() with reduce-overhead mode for 20-30% faster inference on CUDA. First few inferences will be slower due to JIT compilation warmup. Based on optimizations from Qwen3-TTS-Openai-Fastapi repo.
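A guarded call along these lines is one way to apply the optimization only where it helps. This is a sketch, not the PR's code: `maybe_compile` is a hypothetical helper, and which part of the Qwen model actually gets compiled is not stated here.

```python
def maybe_compile(module):
    """Return a torch.compile()d module on CUDA, or the module unchanged otherwise."""
    try:
        import torch
    except ImportError:
        return module  # torch not installed: nothing to compile

    # torch.compile() was introduced in PyTorch 2.0; skip it on older versions.
    if not hasattr(torch, "compile") or not torch.cuda.is_available():
        return module

    # "reduce-overhead" trades warmup time for lower per-call overhead,
    # which is why the first few inferences are slower.
    return torch.compile(module, mode="reduce-overhead")
```

Returning the original module on CPU and MPS keeps those code paths free of compilation warmup they would not benefit from under this commit's CUDA-only claim.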
Automatically enables Flash Attention 2 if flash-attn package is installed, providing ~10% additional speedup on top of torch.compile(). Install with: pip install flash-attn
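The automatic detection can be sketched as a check for the installed package before building the model-loading arguments. `attention_kwargs` is a hypothetical helper; `attn_implementation="flash_attention_2"` is the Hugging Face Transformers convention, and it is an assumption that the Qwen3-TTS loader accepts the same keyword.

```python
import importlib.util


def attention_kwargs():
    """Build model-loading kwargs, opting into Flash Attention 2 if available."""
    kwargs = {}
    # The flash-attn distribution installs an importable module named flash_attn.
    if importlib.util.find_spec("flash_attn") is not None:
        kwargs["attn_implementation"] = "flash_attention_2"
    return kwargs
```

Because the key is simply omitted when the package is missing, the model falls back to its default attention implementation without any extra error handling.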