Qwen3-TTS Streaming Server is a high-performance, production-ready FastAPI wrapper for Qwen3-TTS. This project focuses on delivering ultra-low latency audio streaming for real-time human-computer interaction.
- Commercial-Grade Text Normalization: Integrated WeTextProcessing and a specialized English abbreviation handler. Automatically converts numbers (e.g., "2025" -> "二零二五"), units, and injects spaces into all-caps abbreviations (e.g., "RISC-V" -> "R I S C V") to ensure 100% correct pronunciation and eliminate hallucination.
- Extreme Performance Optimization: Utilizes a specialized monkey-patching technique to intercept model forward passes, enabling the delivery of the first audio chunk almost instantly.
- Smart Queue Management: Multiple requests from the same `client_id` (e.g., from an LLM stream) are automatically queued and processed in order, ensuring a seamless multi-sentence speaking experience.
- On-Demand Interruption: New `/tts/interrupt` endpoint allows users to stop the current speech and automatically flush all queued requests, perfect for handling user interruptions in voice chat.
- Integrated WebUI Service: `webui.html` is now served directly at the root URL (`/`), no manual file opening required.
- Raw Binary Streaming: Replaced SSE/Base64 with raw PCM 16-bit binary streaming, reducing bandwidth overhead by ~33% and lowering client-side CPU usage.
- Sliding Window Audio Reconstruction: An advanced algorithm ensures seamless audio stitching and high-quality output during streaming.
- Production Configuration: Full support for CLI arguments and Environment Variables (Model path, Host, Port, Reference Audio).
- Batch Decoding: Configurable `chunk-size` on the server to balance RTF performance and real-time feel.
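The ~33% figure for raw binary streaming follows from how Base64 works: every 3 input bytes become 4 output characters. The sketch below works through that arithmetic for one second of 24 kHz, 16-bit mono audio. The function names are illustrative and are not part of the server.

```javascript
// Length of the padded Base64 encoding of `byteLength` raw bytes.
function base64Length(byteLength) {
  return 4 * Math.ceil(byteLength / 3);
}

// Fractional size increase of Base64 over the raw payload.
function base64Overhead(byteLength) {
  return base64Length(byteLength) / byteLength - 1;
}

const bytesPerSecond = 24000 * 2; // 24 kHz, 2 bytes per sample, mono
console.log(base64Length(bytesPerSecond));   // Base64 characters per second of audio
console.log(base64Overhead(bytesPerSecond)); // about one third
```

SSE framing adds further per-event bytes on top of this, so the saving from raw streaming is at least the Base64 inflation.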
The following metrics were captured using the included webui.html on a system equipped with an NVIDIA RTX 4070 GPU with chunk_size: 6 enabled.
- Time to First Byte (TTFB): ~380ms (From request sent to first audio data arrival)
- Real-Time Factor (RTF): 0.7 - 0.9 (Generating 1s of audio in 0.7s - 0.9s)
- Throughput: Zero-overhead binary delivery suitable for production-scale interaction.
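For reference, these metrics can be derived from three timestamps and the number of PCM bytes received. The sketch below is one way to compute them, assuming the generation time is approximated by the wall-clock time from sending the request to receiving the last byte. The helper names are hypothetical, and `webui.html` may measure things differently.

```javascript
const SAMPLE_RATE = 24000;  // from the documented response format
const BYTES_PER_SAMPLE = 2; // 16-bit PCM, mono

// Seconds of audio represented by `byteLength` bytes of raw PCM.
function pcmDurationSeconds(byteLength) {
  return byteLength / (SAMPLE_RATE * BYTES_PER_SAMPLE);
}

// RTF below 1 means audio is generated faster than it plays back.
function realTimeFactor(generationSeconds, audioSeconds) {
  return generationSeconds / audioSeconds;
}

// Summarize one request from timestamps in milliseconds and the total byte count.
function summarizeRequest({ sentAt, firstByteAt, lastByteAt, totalBytes }) {
  const audioSeconds = pcmDurationSeconds(totalBytes);
  return {
    ttfbMs: firstByteAt - sentAt,
    audioSeconds,
    rtf: realTimeFactor((lastByteAt - sentAt) / 1000, audioSeconds),
  };
}
```

In a browser, the timestamps could come from `performance.now()` taken before `fetch`, at the first `reader.read()` result, and when the stream reports `done`.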
- Sub-500ms Latency: Industry-leading response time for real-time voice interaction.
- Smart Text Front-end: Powered by WeTextProcessing, handling complex Chinese-English mixed text, dates, and numbers with production-level accuracy.
- Abbreviation "Spell-out": Automatically handles technical terms like RISC-V, AI, and LLM by injecting spaces to guide the model's pronunciation.
- X-Vector Only Mode: Support for pure speaker embedding cloning, which 100% eliminates prompt leakage (no more hallucinating or repeating reference text).
- Configurable Streaming Buffer: Prevent audio stuttering during network fluctuations by buffering tokens before sending (`chunk_size`).
- Server-Side Pre-Buffering: Accumulate initial chunks before sending the first packet (`pre_buffer`) to provide a stable initial stream.
- High-Fidelity Voice Cloning: Supports rapid 3-second voice cloning with the 12Hz-1.7B-Base model.
- Multi-Language Support: Native support for CN, EN, JP, KR, DE, FR, RU, PT, ES, IT.
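To illustrate the abbreviation spell-out described above, here is a minimal sketch of the transformation. It is not the project's handler (which runs server-side in Python and may apply different rules); the function name and the regular expression are assumptions chosen to reproduce the documented "RISC-V" -> "R I S C V" behavior.

```javascript
// Insert spaces between the letters of all-caps tokens (two or more letters),
// dropping hyphens that join all-caps parts, e.g. "RISC-V" -> "R I S C V".
function spellOutAbbreviations(text) {
  return text.replace(/\b[A-Z]{2,}(?:-[A-Z]+)*\b/g, (match) =>
    match.replace(/-/g, "").split("").join(" ")
  );
}

console.log(spellOutAbbreviations("Ask the LLM about RISC-V."));
```

Capitalized words such as "Ask" are left alone because the pattern requires at least two consecutive uppercase letters.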
```bash
conda create -n qwen3-tts python=3.12 -y
conda activate qwen3-tts
# Install the project and all dependencies (including WeTextProcessing and FastAPI)
pip install -e .
# Highly recommended: FlashAttention 2 for faster inference
pip install -U flash-attn --no-build-isolation
```

Download the Qwen3-TTS-12Hz-1.7B-Base model (or other variants) to your local directory:
```bash
# Example using ModelScope
pip install -U modelscope
modelscope download --model Qwen/Qwen3-TTS-12Hz-1.7B-Base --local_dir ./Qwen3-TTS-12Hz-1.7B-Base
```

```bash
# Recommended: X-Vector mode is enabled by default in server.py to prevent prompt leakage
# Now includes global control over temperature and repetition penalty via CLI
python server.py \
  --model-path ./Qwen3-TTS-12Hz-1.7B-Base \
  --ref-audio reference.wav \
  --temperature 0.8 \
  --repetition-penalty 1.1 \
  --chunk-size 6
```

Endpoint: POST /tts/stream
Request Body:
```json
{
  "text": "Hello world.",
  "language": "Auto",
  "client_id": "user_123"
}
```

Note: `temperature` and `repetition_penalty` are no longer accepted in the request body. They are now managed globally on the server side to ensure synthesis stability and prevent repetition loops.
Response: audio/l16;rate=24000
- Returns a raw PCM 16-bit, 24,000Hz byte stream.
- Queuing: Multiple requests with the same `client_id` will queue up and play sequentially (ideal for LLM streaming outputs).
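Because requests sharing a `client_id` are queued in order, a client fed by an LLM stream can submit each sentence as soon as it is complete. The sketch below shows one way to do that. The helper names are hypothetical, and the sentence splitter is deliberately naive (it would, for example, split on the period in "3.14").

```javascript
// Split accumulated LLM output into complete sentences plus an unfinished tail.
function takeCompleteSentences(buffer) {
  const sentences = [];
  const pattern = /[^.!?。！？]+[.!?。！？]+/g;
  let consumed = 0;
  let match;
  while ((match = pattern.exec(buffer)) !== null) {
    const sentence = match[0].trim();
    if (sentence) sentences.push(sentence);
    consumed = pattern.lastIndex;
  }
  return { sentences, rest: buffer.slice(consumed) };
}

// Build fetch options for one sentence, using the documented request fields.
function buildStreamRequest(text, clientId, language = "Auto") {
  return {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ text, language, client_id: clientId }),
  };
}
```

A caller would append each LLM token to a buffer, call `takeCompleteSentences`, send every returned sentence with `fetch(url, buildStreamRequest(sentence, clientId))`, and keep `rest` as the new buffer.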
Endpoint: POST /tts/interrupt
Query Parameters:
- `client_id` (Optional, default: `"default"`): The ID of the client to interrupt.
Action:
- Immediately stops the current inference for the specified client.
- Flushes the queue: All other requests for this `client_id` that are currently waiting will be discarded, allowing new speech to start instantly.
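The queue and interrupt semantics described above can be pictured with a toy model. This is only an illustration of the documented behavior (per-client FIFO order, flush on interrupt), not the server's implementation; the class and method names are invented for the example.

```javascript
// Toy model: one FIFO queue of pending texts per client ID.
class SpeechQueues {
  constructor() {
    this.queues = new Map();
  }

  // Add a request to the end of the client's queue.
  enqueue(clientId, text) {
    if (!this.queues.has(clientId)) this.queues.set(clientId, []);
    this.queues.get(clientId).push(text);
  }

  // Remove and return the next request for a client, or undefined if none.
  next(clientId) {
    const queue = this.queues.get(clientId);
    return queue ? queue.shift() : undefined;
  }

  // Discard everything waiting for this client; return how many were dropped.
  interrupt(clientId) {
    const dropped = (this.queues.get(clientId) || []).length;
    this.queues.set(clientId, []);
    return dropped;
  }
}
```

Interrupting one client leaves the queues of other clients untouched, which is what allows several users to speak in parallel.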
| CLI Argument | Environment Variable | Default | Description |
|---|---|---|---|
| `--model-path` | `MODEL_PATH` | `./Qwen3-TTS-12Hz-1.7B-Base` | Path to model weights |
| `--ref-audio` | `REF_AUDIO_PATH` | None | Path to reference audio for cloning (Required) |
| `--host` | `HOST` | `0.0.0.0` | Server host |
| `--port` | `PORT` | `9000` | Server port |
| `--temperature` | None | `0.8` | Sampling temperature (Server-side global) |
| `--repetition-penalty` | None | `1.1` | Repetition penalty (Server-side global) |
| `--chunk-size` | None | `1` | Global tokens to buffer before decoding/sending (Higher = better RTF) |
| `--pre-buffer` | None | `0` | Number of chunks to buffer on server before sending the first packet |
| `client_id` | API Only | `"default"` | Unique ID per user to enable parallel processing |
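As a conceptual model of how `--chunk-size` and `--pre-buffer` shape the stream, the sketch below batches a token sequence and holds back the first few batches. It is an assumption about the general behavior described in the table, not the server's code; in particular, releasing the pre-buffered chunks as a single packet is a guess, and both function names are hypothetical.

```javascript
// Group tokens into batches of `chunkSize`; the last batch may be shorter.
function* batchTokens(tokens, chunkSize) {
  let batch = [];
  for (const token of tokens) {
    batch.push(token);
    if (batch.length === chunkSize) {
      yield batch;
      batch = [];
    }
  }
  if (batch.length > 0) yield batch; // flush the final partial batch
}

// Hold back the first `preBuffer` chunks and release them together,
// then pass later chunks through one at a time.
function* withPreBuffer(chunks, preBuffer) {
  let held = [];
  let released = preBuffer <= 0;
  for (const chunk of chunks) {
    if (released) {
      yield chunk;
      continue;
    }
    held.push(chunk);
    if (held.length >= preBuffer) {
      yield held.flat();
      held = [];
      released = true;
    }
  }
  if (held.length > 0) yield held.flat(); // stream ended before the buffer filled
}
```

Larger values of either setting delay the first packet in exchange for fewer, larger sends, which matches the trade-off between RTF and real-time feel noted in the table.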
The server now hosts a ready-to-use HTML client directly.
- Start the server (e.g., on port 9000).
- Open your browser and navigate to `http://localhost:9000`.
- Configure your Client ID and Reference Audio.
- Click Play to start synthesis and observe real-time metrics.
- Request: Send a `POST` request to `/tts/stream`.
- Stream Consumption: Use `fetch` and `response.body.getReader()`.
- Conversion: Convert the incoming `Uint8Array` (bytes) to `Int16Array`, then normalize to `Float32` for Web Audio.
- Playback: Schedule buffers using `AudioContext.createBufferSource()`.
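The playback step relies on gapless scheduling: each chunk starts where the previous one ends, or immediately if playback has fallen behind. A minimal sketch of that logic, kept independent of Web Audio so it can be reasoned about on its own (the class name is hypothetical):

```javascript
// Pure scheduling logic; `now` would be `audioCtx.currentTime` in a browser.
class ChunkScheduler {
  constructor() {
    this.nextStartTime = 0;
  }

  // Return the start time for a chunk lasting `duration` seconds.
  schedule(duration, now) {
    const startTime = Math.max(now, this.nextStartTime);
    this.nextStartTime = startTime + duration;
    return startTime;
  }

  // Forget previously scheduled audio, e.g. after an interruption.
  reset(now) {
    this.nextStartTime = now;
  }
}
```

Taking the maximum of the clock and the planned start time means a late chunk plays right away instead of being scheduled in the past.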
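```javascript
async function speakBinary(text) {
  const audioCtx = new AudioContext({ sampleRate: 24000 });
  let nextStartTime = audioCtx.currentTime;
  const response = await fetch('http://localhost:9000/tts/stream', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ text, language: "Chinese" })
  });
  if (!response.ok) throw new Error(`TTS request failed: HTTP ${response.status}`);
  const reader = response.body.getReader();
  let leftover = new Uint8Array(0); // Odd trailing byte carried over between reads
  while (true) {
    const { done, value } = await reader.read(); // value is Uint8Array
    if (done) break;
    // Reads are not guaranteed to end on a 16-bit sample boundary, so prepend
    // the carried-over byte and keep only whole samples for this pass.
    const bytes = new Uint8Array(leftover.length + value.length);
    bytes.set(leftover);
    bytes.set(value, leftover.length);
    const sampleCount = Math.floor(bytes.length / 2);
    leftover = bytes.slice(sampleCount * 2);
    if (sampleCount === 0) continue;
    // Convert PCM16 bytes to Float32 (Int16Array reads in host byte order)
    const int16Array = new Int16Array(bytes.buffer, 0, sampleCount);
    const float32Array = new Float32Array(sampleCount);
    for (let i = 0; i < sampleCount; i++) {
      float32Array[i] = int16Array[i] / 32768.0;
    }
    // Create and play buffer
    const buffer = audioCtx.createBuffer(1, float32Array.length, 24000);
    buffer.getChannelData(0).set(float32Array);
    const source = audioCtx.createBufferSource();
    source.buffer = buffer;
    source.connect(audioCtx.destination);
    // Start where the previous chunk ends, or now if playback has fallen behind
    const startTime = Math.max(audioCtx.currentTime, nextStartTime);
    source.start(startTime);
    nextStartTime = startTime + buffer.duration;
  }
}
```

To implement a "Stop-to-Talk" feature (e.g., when a user interrupts the AI), the client must do two things: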
- Call the server interrupt API: This stops the server-side inference and flushes the queue.
- Clear local audio context: Immediately stop the browser's audio playback.
```javascript
// Global state for playback control
let audioCtx = new AudioContext({ sampleRate: 24000 });
let activeSources = [];
let nextStartTime = 0; // Shared with the playback loop so it can be reset here

async function interrupt(clientId = "default") {
  // 1. Tell the server to stop and flush the queue
  await fetch(`http://localhost:9000/tts/interrupt?client_id=${encodeURIComponent(clientId)}`, { method: 'POST' });
  // 2. Stop all currently scheduled audio chunks in the browser
  activeSources.forEach(source => {
    try { source.stop(); } catch (e) { /* Source has already finished */ }
  });
  activeSources = [];
  // 3. Reset the playback timer
  nextStartTime = audioCtx.currentTime;
}

// During playback (inside the streaming loop, where `buffer` and `startTime` are defined),
// keep track of sources:
const source = audioCtx.createBufferSource();
source.buffer = buffer;
source.connect(audioCtx.destination);
source.start(startTime);
activeSources.push(source); // Track this source to stop it later if needed
```

Developed based on the excellent work of the Qwen Team. This extension aims to provide the community with a high-speed serving alternative for real-time AI applications.
Adheres to the original License provided by the Qwen team.
