Add Qwen3-ASR support with OpenAI-compatible transcriptions endpoint#6
Conversation
- Add `/v1/audio/transcriptions` endpoint for speech-to-text
- Support Qwen3-ASR-0.6B and Qwen3-ASR-1.7B models
- Add `ASR_MODEL_PATH` environment variable
- Rename env vars: `CUSTOMVOICE_MODEL_PATH` -> `TTS_CUSTOMVOICE_MODEL_PATH`, `BASE_MODEL_PATH` -> `TTS_BASE_MODEL_PATH`
- Update Docker image name to `qwen3-audio-api`
- Add ffmpeg dependency for audio format conversion
- Fix Docker CMD to use venv python directly (avoid `uv sync` on start)
- Suppress nagisa SyntaxWarning via `PYTHONWARNINGS` env var
- Add CI tests for ASR (Phase 4) and TTS+ASR round-trip (Phase 5)
- Update documentation for ASR usage

Co-Authored-By: Claude <noreply@anthropic.com>
Pull request overview
This PR adds Qwen3-ASR (Automatic Speech Recognition) support to the existing Qwen3-TTS API server, transforming it into a comprehensive audio processing API that provides both text-to-speech and speech-to-text capabilities through OpenAI-compatible endpoints.
Changes:
- Implements `/v1/audio/transcriptions` endpoint following the OpenAI API format for speech-to-text functionality (see the sketch after this list)
- Renames environment variables for clarity (`BASE_MODEL_PATH` → `TTS_BASE_MODEL_PATH`, `CUSTOMVOICE_MODEL_PATH` → `TTS_CUSTOMVOICE_MODEL_PATH`) and adds `ASR_MODEL_PATH`
- Adds new dependencies (qwen-asr, python-multipart, av, nagisa, soynlp) and includes ffmpeg for audio format conversion
- Updates Docker configuration to suppress nagisa SyntaxWarning and prevent ruff download at container startup
- Expands CI test coverage with Phase 4 (ASR-only) and Phase 5 (TTS→ASR round-trip) tests
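For orientation, a minimal sketch of what an OpenAI-compatible transcriptions endpoint looks like in FastAPI. The form-field names follow the OpenAI spec, but the handler body, the default model name, and the `transcribe` stub are illustrative assumptions, not the PR's actual implementation in `python/main.py`:

```python
# Hypothetical sketch of an OpenAI-compatible /v1/audio/transcriptions
# endpoint; the real handler in this PR may differ.
from fastapi import FastAPI, File, Form, UploadFile
from fastapi.responses import PlainTextResponse

app = FastAPI()

def transcribe(audio_bytes: bytes, model: str) -> str:
    """Placeholder for the actual Qwen3-ASR inference call."""
    raise NotImplementedError

@app.post("/v1/audio/transcriptions")
async def create_transcription(
    file: UploadFile = File(...),                   # audio file to transcribe
    model: str = Form(default="qwen3-asr-0.6b"),    # assumed default name
    prompt: str | None = Form(default=None),        # OpenAI compatibility only
    response_format: str = Form(default="json"),
    temperature: float = Form(default=0.0),         # OpenAI compatibility only
):
    audio_bytes = await file.read()
    text = transcribe(audio_bytes, model)
    if response_format == "text":
        return PlainTextResponse(text)
    return {"text": text}
```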
Reviewed changes
Copilot reviewed 8 out of 9 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| python/uv.lock | Adds dependencies for ASR support (qwen-asr, av, nagisa, etc.) and pins transformers to 4.57.6 |
| python/pyproject.toml | Updates project description and adds qwen-asr, python-multipart dependencies with transformers override |
| python/main.py | Implements ASR endpoint with audio conversion helpers (sketched after this table) and model loading logic |
| python/TEST_PLAN.md | Adds Phase 4 (ASR-only) and Phase 5 (TTS→ASR round-trip) test scenarios |
| python/README.md | Documents new ASR features, endpoints, and usage examples |
| python/Dockerfile.cuda | Adds PYTHONWARNINGS environment variable and changes CMD to use venv python directly |
| python/Dockerfile | Same Docker improvements as CUDA version |
| README.md | Updates project description to include ASR capabilities |
| .github/workflows/ci.yml | Adds ffmpeg installation, ASR model downloads, and new test phases |
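As context for the `python/main.py` row, one common shape for an ffmpeg-based conversion helper — a sketch assuming uploads are normalized to 16 kHz mono WAV before inference. The function name and parameters are illustrative; the PR may instead use the `av` bindings it adds as a dependency:

```python
# Hypothetical helper: normalize an uploaded audio file to 16 kHz mono WAV
# via the ffmpeg CLI. Illustrative only; not taken from the PR.
import subprocess
import tempfile
from pathlib import Path

def convert_to_wav(audio_bytes: bytes, sample_rate: int = 16000) -> bytes:
    with tempfile.TemporaryDirectory() as tmp:
        src = Path(tmp) / "input"       # ffmpeg sniffs the container format
        dst = Path(tmp) / "output.wav"
        src.write_bytes(audio_bytes)
        subprocess.run(
            ["ffmpeg", "-y", "-i", str(src),
             "-ac", "1",                # downmix to mono
             "-ar", str(sample_rate),   # resample for the ASR model
             str(dst)],
            check=True,                 # raise on conversion failure
            capture_output=True,        # keep ffmpeg logs out of stdout
        )
        return dst.read_bytes()
```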
```python
    prompt: str | None = Form(default=None),
    response_format: str = Form(default="json"),
    temperature: float = Form(default=0.0),
```
The parameters 'prompt' and 'temperature' are accepted in the create_transcription function but are not used in the implementation. While this is mentioned in the inline comment as "not currently used", accepting parameters without using them can be confusing for API consumers. Consider either implementing support for these parameters or documenting in the API reference that they are accepted for OpenAI compatibility but currently ignored by the Qwen3-ASR model.
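One way to act on this suggestion and make the compatibility-only parameters self-documenting — a sketch, not the PR's code; the dependency function, description strings, and warning are assumptions:

```python
# Sketch: surface that `prompt` and `temperature` are accepted only for
# OpenAI compatibility. The descriptions appear in the generated OpenAPI
# docs; the log line warns callers at runtime. Illustrative only.
import logging

from fastapi import Form

logger = logging.getLogger("transcriptions")

def compat_params(
    prompt: str | None = Form(
        default=None,
        description="Accepted for OpenAI compatibility; ignored by Qwen3-ASR.",
    ),
    temperature: float = Form(
        default=0.0,
        description="Accepted for OpenAI compatibility; ignored by Qwen3-ASR.",
    ),
) -> None:
    if prompt is not None or temperature != 0.0:
        logger.warning("prompt/temperature are ignored by Qwen3-ASR")
```

This could be wired into the endpoint as a FastAPI dependency (`Depends(compat_params)`), keeping the compatibility shims out of the handler signature.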
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Summary

- Add `/v1/audio/transcriptions` endpoint following OpenAI API format
- Rename env vars: `BASE_MODEL_PATH` → `TTS_BASE_MODEL_PATH`, `CUSTOMVOICE_MODEL_PATH` → `TTS_CUSTOMVOICE_MODEL_PATH`
- Rename Docker image from `qwen-tts-api` to `qwen3-audio-api`

Test plan

- `docker build -t qwen3-audio-api .`
- `curl -X POST http://localhost:8000/v1/audio/speech -H "Content-Type: application/json" -d '{"input":"Hello world","voice":"Vivian"}' --output test.wav`
- `curl -X POST http://localhost:8000/v1/audio/transcriptions -F file=@test.wav`

🤖 Generated with Claude Code
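A scripted version of the TTS→ASR round-trip from the test plan above might look like this — a sketch assuming the server runs on localhost:8000. The payloads mirror the curl commands; the substring assertion is an illustrative pass criterion, not necessarily the CI's actual Phase 5 check:

```python
# Hypothetical round-trip check: synthesize speech, then transcribe it
# and verify the original text survives. Mirrors the curl-based test plan.
import requests

BASE = "http://localhost:8000"
TEXT = "Hello world"

# 1. TTS: text -> WAV bytes
tts = requests.post(
    f"{BASE}/v1/audio/speech",
    json={"input": TEXT, "voice": "Vivian"},
    timeout=120,
)
tts.raise_for_status()

# 2. ASR: WAV bytes -> transcript
asr = requests.post(
    f"{BASE}/v1/audio/transcriptions",
    files={"file": ("test.wav", tts.content, "audio/wav")},
    timeout=120,
)
asr.raise_for_status()

# 3. Loose comparison; ASR output may differ in casing and punctuation
transcript = asr.json()["text"]
assert TEXT.lower() in transcript.lower().rstrip("."), transcript
print("round-trip OK:", transcript)
```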