Add voice cloning support via Base model by juntao · Pull Request #5 · second-state/qwen3_audio_api

juntao · 2026-01-27T23:50:28Z

Summary

Add support for loading both CustomVoice and Base model families
Enable voice cloning from reference audio via audio_sample and audio_sample_text request parameters
Require at least one of CUSTOMVOICE_MODEL_PATH or BASE_MODEL_PATH at startup
Add GitHub Actions CI workflow
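
The startup requirement above can be sketched as a small check. This is an illustrative sketch, not the code in python/main.py: the helper name `resolve_model_paths` is hypothetical, and treating an empty string the same as an unset variable is an assumption. Only the two environment variable names come from this PR.

```python
import os


def resolve_model_paths(env=None):
    """Return (customvoice_path, base_path); raise if neither is configured.

    Hypothetical helper. Only the env var names are taken from the PR.
    """
    env = os.environ if env is None else env
    # Assumption: an empty value counts as "not set".
    customvoice = env.get("CUSTOMVOICE_MODEL_PATH") or None
    base = env.get("BASE_MODEL_PATH") or None
    if customvoice is None and base is None:
        raise RuntimeError(
            "Set at least one of CUSTOMVOICE_MODEL_PATH or BASE_MODEL_PATH."
        )
    return customvoice, base
```

Passing a plain dict instead of reading `os.environ` directly keeps the check easy to exercise for all four configurations in the test plan.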

Test plan

Verify server starts with only CUSTOMVOICE_MODEL_PATH set
Verify server starts with only BASE_MODEL_PATH set
Verify server starts with both model paths set
Verify server fails to start with neither model path set
Test speech generation with a predefined voice (CustomVoice model)
Test voice cloning with audio_sample parameter (Base model)
Verify HTTP 400 when requesting voice cloning without Base model loaded
Verify HTTP 400 when requesting predefined voice without CustomVoice model loaded

🤖 Generated with Claude Code

Co-Authored-By: Claude <noreply@anthropic.com>

Support loading both CustomVoice and Base model families. The Base model enables voice cloning from a reference audio sample via the new audio_sample and audio_sample_text request parameters. At least one of CUSTOMVOICE_MODEL_PATH or BASE_MODEL_PATH must be set at startup. Co-Authored-By: Claude <noreply@anthropic.com>

Copilot

Pull request overview

Adds voice cloning support by allowing the server to load and route requests between CustomVoice and Base Qwen3-TTS model families, and updates documentation/CI accordingly.

Changes:

Load CustomVoice and/or Base models at startup via CUSTOMVOICE_MODEL_PATH and BASE_MODEL_PATH (requires at least one).
Add voice cloning request parameters (audio_sample, audio_sample_text) and route /v1/audio/speech accordingly.
Replace the prior GitHub Actions workflow with a multi-phase integration CI workflow that uploads generated audio artifacts.
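
The routing described in the changes above might look roughly like the following sketch. It is not the handler from python/main.py: `RouteError` and `pick_model_family` are hypothetical names, and the CustomVoice error message is invented for illustration. The truthiness check on `audio_sample` and the Base model error text mirror the diff hunk shown in this review.

```python
class RouteError(Exception):
    """Hypothetical stand-in for an HTTP error response."""

    def __init__(self, status_code, detail):
        super().__init__(detail)
        self.status_code = status_code
        self.detail = detail


def pick_model_family(audio_sample, has_customvoice, has_base):
    """Return "base" for voice cloning requests, otherwise "customvoice"."""
    # Mirrors the truthiness check in the diff: "" and None both mean no sample.
    if audio_sample:
        if not has_base:
            raise RouteError(
                400,
                "audio_sample requires a base model. "
                "Set BASE_MODEL_PATH to enable voice cloning.",
            )
        return "base"
    if not has_customvoice:
        # Assumed wording; the actual message is not shown in this PR page.
        raise RouteError(400, "Predefined voices require a CustomVoice model.")
    return "customvoice"
```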

Reviewed changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated 6 comments.

File	Description
python/main.py	Adds dual-model loading and request routing for voice cloning vs. preset voices.
python/README.md	Documents model families, new env vars, and new request parameters with examples.
python/TEST_PLAN.md	Adds an integration test plan describing the 3 server configurations and expected behaviors.
python/EXAMPLE_CC_SESSION.md	Adds a transcript-style session log describing how changes were produced/tested.
.github/workflows/test-tts-api.yml	Removes the previous single-scenario API workflow.
.github/workflows/ci.yml	Adds a new multi-phase CI workflow that downloads models, runs integration calls, and uploads wav artifacts.

Copilot · 2026-01-28T00:17:38Z

.github/workflows/ci.yml

+on:
+  push:
+  pull_request:
+


This workflow now runs on every push/PR (no branch or path filtering). Given it downloads multi-GB model weights and does long-running inference, it should be scoped (e.g., to main and/or python/**) to avoid expensive CI runs on unrelated changes.

Copilot · 2026-01-28T00:17:38Z

.github/workflows/ci.yml

+
+      - uses: astral-sh/setup-uv@v6
+        with:
+          version: "latest"


setup-uv is configured without a pinned Python version. Since ubuntu-latest can change Python versions over time, explicitly set python-version: "3.12" (matching the project requirement) to prevent CI breakage when the runner image updates.

Suggested change

-          version: "latest"
+          version: "latest"
+          python-version: "3.12"

Copilot · 2026-01-28T00:17:38Z

.github/workflows/ci.yml

+          version: "latest"
+
+      - name: Install dependencies
+        run: uv sync --project python


CI installs dependencies with uv sync --project python without --frozen (and without excluding dev deps). For reproducible builds and faster CI, consider using uv sync --project python --frozen --no-dev so the workflow fails if uv.lock/pyproject.toml drift and avoids installing unnecessary packages.

Suggested change

-        run: uv sync --project python
+        run: uv sync --project python --frozen --no-dev

Copilot · 2026-01-28T00:17:38Z

python/EXAMPLE_CC_SESSION.md

+Me
+
+```
+Review the README.md file. Install dependencies. Download the 0.6B CustomVoice model (exclude it from git).          
+  Start an API server. Make API requests to generate chinese and english audio files (also exclude the WAV files       
+  from git). 
+```


This file appears to be a raw Claude Code session transcript (and includes outdated guidance like MODEL_PATH=...). It’s likely not suitable to keep in-repo; consider removing it or converting the relevant parts into durable documentation (e.g., README/TEST_PLAN) instead of a chat log.

Copilot · 2026-01-28T00:17:38Z

python/main.py

+    if request.audio_sample:
+        base_model: Qwen3TTSModel | None = app.state.base_model
+        if base_model is None:
+            raise HTTPException(
+                status_code=400,
+                detail=(
+                    "audio_sample requires a base model. "
+                    "Set BASE_MODEL_PATH to enable voice cloning."
+                ),
+            )
+        use_icl = request.audio_sample_text is not None
+        with _inference_lock:
+            wavs, sr = base_model.generate_voice_clone(
+                text=request.input,
+                language=request.language,
+                ref_audio=request.audio_sample,
+                ref_text=request.audio_sample_text,


audio_sample_text is accepted but silently ignored when audio_sample is omitted, because the request then routes to the CustomVoice branch. Also, the check "if request.audio_sample:" treats an empty string as "not provided". Consider validating the parameter combination explicitly (e.g., reject audio_sample_text without audio_sample, and enforce non-empty audio_sample/audio_sample_text when provided) so clients get a clear 400 instead of unexpected routing.
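
One possible shape for the explicit validation suggested above is sketched below. The function name `validate_clone_params` and the error strings are hypothetical. In the server, a non-None result would presumably be turned into an HTTP 400 before any routing happens.

```python
def validate_clone_params(audio_sample, audio_sample_text):
    """Return an error message for an invalid combination, or None if valid.

    Hypothetical helper; both arguments are optional strings.
    """
    # Reject fields that are present but empty instead of treating them as absent.
    if audio_sample is not None and audio_sample.strip() == "":
        return "audio_sample must not be empty"
    if audio_sample_text is not None and audio_sample_text.strip() == "":
        return "audio_sample_text must not be empty"
    # The reference transcript is only meaningful alongside a reference sample.
    if audio_sample_text is not None and audio_sample is None:
        return "audio_sample_text requires audio_sample"
    return None
```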

The endpoint now handles both JSON and multipart/form-data. Use multipart with curl -F to upload audio_sample as a binary file, avoiding base64 encoding and shell argument length limits. Co-Authored-By: Claude <noreply@anthropic.com>

juntao and others added 4 commits January 27, 2026 23:49

Add example session and test plan docs

c9bad40

Co-Authored-By: Claude <noreply@anthropic.com>

Fix CI to reference files in python/ directory

2624e89

Add --project python to uv commands and use python/main.py path so the workflow runs correctly from the repository root. Co-Authored-By: Claude <noreply@anthropic.com>

Remove old test-tts-api.yml workflow

54e9e56

Superseded by ci.yml. Co-Authored-By: Claude <noreply@anthropic.com>

juntao requested a review from Copilot January 28, 2026 00:09

Copilot started reviewing on behalf of juntao January 28, 2026 00:09

Send base64-encoded audio content instead of file paths

3da5e3a

The audio_sample field now expects base64-encoded audio data so that clients can talk to a remote server. The underlying model already supports base64 strings natively. Co-Authored-By: Claude <noreply@anthropic.com>
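
A client-side sketch of the base64 encoding described in this commit is shown below. The helper name `build_clone_payload` is hypothetical. The field names `input`, `audio_sample`, and `audio_sample_text` appear in this PR, but the full request schema is not shown here, so the server may expect additional fields.

```python
import base64
import json


def build_clone_payload(audio_bytes, text, ref_text=None):
    """Build a JSON request body carrying the reference audio as base64 text."""
    payload = {
        "input": text,
        # Raw audio bytes are encoded so they can travel inside a JSON string.
        "audio_sample": base64.b64encode(audio_bytes).decode("ascii"),
    }
    if ref_text is not None:
        payload["audio_sample_text"] = ref_text
    return json.dumps(payload)
```

Writing the returned string to a file and posting that file keeps the large base64 string out of the shell argument list.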

Copilot AI reviewed Jan 28, 2026

juntao and others added 3 commits January 28, 2026 00:22

Use jq + file payload to avoid argument list too long

f616652

Pipe base64 output into jq to build the JSON payload in a file, then pass it to curl via -d @file. This avoids expanding the large base64 string as a shell argument. Co-Authored-By: Claude <noreply@anthropic.com>

Remove EXAMPLE_CC_SESSION.md

568f29b

Co-Authored-By: Claude <noreply@anthropic.com>

juntao merged commit cd34d97 into main Jan 28, 2026
2 checks passed

juntao deleted the add-voice-cloning branch January 28, 2026 02:07

juntao merged 8 commits into main from add-voice-cloning
