Qwen3-TTS Streaming Server

🤗 Hugging Face | 🤖 ModelScope | 📑 Blog | 📑 Paper

Overview

Qwen3-TTS Streaming Server is a high-performance, production-ready FastAPI wrapper for Qwen3-TTS. This project focuses on delivering ultra-low latency audio streaming for real-time human-computer interaction.

What's New in This Fork?

Commercial-Grade Text Normalization: Integrates WeTextProcessing and a specialized English abbreviation handler. It automatically converts numbers (e.g., "2025" -> "二零二五") and units, and spells out all-caps abbreviations by inserting spaces between the letters (e.g., "RISC-V" -> "R I S C V"), which guides the model toward the correct pronunciation and reduces hallucination.
Extreme Performance Optimization: Utilizes a specialized monkey-patching technique to intercept model forward passes, enabling the delivery of the first audio chunk almost instantly.
Smart Queue Management: Multiple requests from the same client_id (e.g., from an LLM stream) are automatically queued and processed in order, ensuring a seamless multi-sentence speaking experience.
On-Demand Interruption: New /tts/interrupt endpoint allows users to stop the current speech and automatically flush all queued requests, perfect for handling user interruptions in voice chat.
Integrated WebUI Service: webui.html is now served directly at the root URL (/), so you no longer need to open the file manually.
Raw Binary Streaming: Replaced SSE/Base64 with raw 16-bit PCM binary streaming, removing Base64's ~33% size overhead and lowering client-side CPU usage.
Sliding Window Audio Reconstruction: An advanced algorithm ensures seamless audio stitching and high-quality output during streaming.
Production Configuration: Full support for CLI arguments and Environment Variables (Model path, Host, Port, Reference Audio).
Batch Decoding: Configurable chunk-size on the server to balance RTF performance and real-time feel.
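
The ~33% figure in the Raw Binary Streaming item follows from how Base64 works: every 3 input bytes become 4 output characters. The arithmetic can be sketched as below for one second of 16-bit mono audio at 24,000 Hz. The function names are illustrative and are not part of the server.

```javascript
// Bytes of raw PCM for a given duration (16-bit mono = 2 bytes per sample).
function pcmBytes(seconds, sampleRate = 24000) {
    return Math.round(seconds * sampleRate) * 2;
}

// Length in characters of the padded Base64 encoding of n bytes.
function base64Length(n) {
    return 4 * Math.ceil(n / 3);
}

const raw = pcmBytes(1);                 // 48000 bytes
const encoded = base64Length(raw);       // 64000 characters
const overhead = (encoded - raw) / raw;  // one third

console.log(raw, encoded, (overhead * 100).toFixed(1) + '%');
```

SSE framing (the `data:` prefix and blank-line separators) would add a little more on top of this, which the sketch ignores.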

Performance Benchmark

The following metrics were captured using the included webui.html on a system with an NVIDIA RTX 4070 GPU and chunk_size set to 6.

Real-World Metrics:

Time to First Byte (TTFB): ~380ms (From request sent to first audio data arrival)
Real-Time Factor (RTF): 0.7 - 0.9 (Generating 1s of audio in 0.7s - 0.9s)
Throughput: Zero-overhead binary delivery suitable for production-scale interaction.
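
For readers who want to reproduce the RTF figure from their own client, here is a minimal sketch of the calculation: generation time divided by the duration of audio produced, with the duration derived from the number of PCM bytes received. The helper names are hypothetical, and the 24,000 Hz 16-bit mono format is taken from the API section of this README.

```javascript
// Duration in seconds of a 16-bit mono PCM payload.
function audioSeconds(pcmByteCount, sampleRate = 24000) {
    return pcmByteCount / 2 / sampleRate;
}

// Real-time factor: elapsed generation time / audio duration.
// Values below 1 mean audio is produced faster than it plays back.
function realTimeFactor(elapsedMs, pcmByteCount, sampleRate = 24000) {
    const seconds = audioSeconds(pcmByteCount, sampleRate);
    if (seconds === 0) return NaN; // no audio received yet
    return (elapsedMs / 1000) / seconds;
}

// 96,000 bytes is 2 s of audio; receiving it over 1.6 s gives an RTF of 0.8.
console.log(realTimeFactor(1600, 96000).toFixed(2));
```

Measured this way from the client, the value also includes network time, so it is an upper bound on the server's own RTF.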

Features

Sub-500ms Latency: Industry-leading response time for real-time voice interaction.
Smart Text Front-end: Powered by WeTextProcessing, handling complex Chinese-English mixed text, dates, and numbers with production-level accuracy.
Abbreviation "Spell-out": Automatically handles technical terms like RISC-V, AI, and LLM by injecting spaces to guide the model's pronunciation.
X-Vector Only Mode: Support for pure speaker embedding cloning, which 100% eliminates prompt leakage (no more hallucinating or repeating reference text).
Configurable Streaming Buffer: Prevent audio stuttering during network fluctuations by buffering tokens before sending (chunk_size).
Server-Side Pre-Buffering: Accumulate initial chunks before sending the first packet (pre_buffer) to provide a stable initial stream.
High-Fidelity Voice Cloning: Supports rapid 3-second voice cloning with the 12Hz-1.7B-Base model.
Multi-Language Support: Native support for CN, EN, JP, KR, DE, FR, RU, PT, ES, IT.
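
The abbreviation "spell-out" idea can be illustrated with a short sketch. This is not the server's actual implementation, only one plausible rule under stated assumptions: a token made of uppercase ASCII letters, optionally joined by hyphens, is split into space-separated letters, and single-letter tokens are left alone. The function name is hypothetical.

```javascript
// Insert spaces between the letters of all-caps tokens such as "RISC-V" or "LLM".
function spellOutAbbreviations(text) {
    return text.replace(/\b[A-Z]+(?:-[A-Z]+)*\b/g, (match) => {
        const letters = match.replace(/-/g, ''); // drop hyphens
        if (letters.length < 2) return match;    // leave "I", "A" untouched
        return letters.split('').join(' ');
    });
}

console.log(spellOutAbbreviations('RISC-V'));          // R I S C V
console.log(spellOutAbbreviations('AI and LLM'));      // A I and L L M
console.log(spellOutAbbreviations('Hello, I use AI')); // Hello, I use A I
```

A production rule would likely also need an exception list for acronyms that are read as words.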

Quickstart

1. Environment Setup

conda create -n qwen3-tts python=3.12 -y
conda activate qwen3-tts

# Install the project and all dependencies (including WeTextProcessing and FastAPI)
pip install -e .

# Highly recommended: FlashAttention 2 for faster inference
pip install -U flash-attn --no-build-isolation

2. Model Preparation

Download the Qwen3-TTS-12Hz-1.7B-Base model (or other variants) to your local directory:

# Example using ModelScope
pip install -U modelscope
modelscope download --model Qwen/Qwen3-TTS-12Hz-1.7B-Base --local_dir ./Qwen3-TTS-12Hz-1.7B-Base

3. Launch the Server

# Recommended: X-Vector mode is enabled by default in server.py to prevent prompt leakage
# Now includes global control over temperature and repetition penalty via CLI
python server.py \
  --model-path ./Qwen3-TTS-12Hz-1.7B-Base \
  --ref-audio reference.wav \
  --temperature 0.8 \
  --repetition-penalty 1.1 \
  --chunk-size 6

API Reference

Streaming TTS (Binary)

Endpoint: POST /tts/stream

Request Body:

{
  "text": "Hello world.",
  "language": "Auto",
  "client_id": "user_123"
}

Note: temperature and repetition_penalty are no longer accepted in the request body. They are now managed globally on the server side to ensure synthesis stability and prevent repetition loops.

Response: audio/l16;rate=24000

Returns a raw PCM 16-bit, 24,000Hz byte stream.
Queuing: Multiple requests with the same client_id will queue up and play sequentially (ideal for LLM streaming outputs).
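
As a sketch of how a client might feed LLM output into this queue, the helpers below split text into sentences and build one /tts/stream request per sentence, all sharing the same client_id. The function names and the splitting rule are assumptions for illustration. Only the endpoint and the body fields come from this README.

```javascript
// Split after sentence-ending punctuation. Latin punctuation must be followed
// by whitespace (so "3.14" stays intact); CJK punctuation needs no space.
function splitSentences(text) {
    return text
        .split(/(?<=[.!?])\s+|(?<=[。！？])/)
        .map((s) => s.trim())
        .filter((s) => s.length > 0);
}

// Build fetch() arguments for each sentence, all tagged with the same client_id
// so the server queues them in order.
function buildStreamRequests(text, clientId, baseUrl = 'http://localhost:9000') {
    return splitSentences(text).map((sentence) => ({
        url: `${baseUrl}/tts/stream`,
        init: {
            method: 'POST',
            headers: { 'Content-Type': 'application/json' },
            body: JSON.stringify({ text: sentence, language: 'Auto', client_id: clientId }),
        },
    }));
}

const requests = buildStreamRequests('Hello world. Pi is 3.14! 你好。再见！', 'user_123');
console.log(requests.map((r) => JSON.parse(r.init.body).text));
```

Each entry can then be passed to `fetch(r.url, r.init)` and its response body consumed as described in the Client Integration Guide.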

Interrupt & Flush Queue

Endpoint: POST /tts/interrupt

Query Parameters:

client_id (Optional, default: "default"): The ID of the client to interrupt.

Action:

Immediately stops the current inference for the specified client.
Flushes the queue: All other requests for this client_id that are currently waiting will be discarded, allowing new speech to start instantly.

Configuration

| CLI Argument | Environment Variable | Default | Description |
| --- | --- | --- | --- |
| `--model-path` | `MODEL_PATH` | `./Qwen3-TTS-12Hz-1.7B-Base` | Path to model weights |
| `--ref-audio` | `REF_AUDIO_PATH` | `None` | Path to reference audio for cloning (Required) |
| `--host` | `HOST` | `0.0.0.0` | Server host |
| `--port` | `PORT` | `9000` | Server port |
| `--temperature` | None | `0.8` | Sampling temperature (Server-side global) |
| `--repetition-penalty` | None | `1.1` | Repetition penalty (Server-side global) |
| `--chunk-size` | None | `1` | Global tokens to buffer before decoding/sending (Higher = better RTF) |
| `--pre-buffer` | None | `0` | Number of chunks to buffer on server before sending the first packet |
| `client_id` | API Only | `"default"` | Unique ID per user to enable parallel processing |
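
To get a feel for the chunk-size trade-off, the sketch below estimates how much audio one chunk represents. It assumes each token maps to a fixed slice of audio and takes the token rate as a parameter, because this README does not state it. The value 12 used in the example is only a guess based on the "12Hz" in the model name, and the estimate ignores any overlap introduced by the sliding-window reconstruction.

```javascript
// Approximate audio duration (ms) carried by one chunk of `chunkSize` tokens.
// `tokensPerSecond` is an assumption supplied by the caller, not a documented value.
function chunkAudioMs(chunkSize, tokensPerSecond) {
    return (chunkSize / tokensPerSecond) * 1000;
}

console.log(chunkAudioMs(1, 12).toFixed(1)); // 83.3
console.log(chunkAudioMs(6, 12));            // 500
```

Under these assumptions, a larger chunk size means fewer decode calls (better RTF) but a longer wait before each packet can be sent.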

WebUI Testing

The server now hosts a ready-to-use HTML client directly.

Start the server (e.g., on port 9000).
Open your browser and navigate to http://localhost:9000.
Configure your Client ID and Reference Audio.
Click Play to start synthesis and observe real-time metrics.

Client Integration Guide (Binary)

1. High-Level Architecture

Request: Send a POST request to /tts/stream.
Stream Consumption: Use fetch and response.body.getReader().
Conversion: Convert the incoming Uint8Array (bytes) to Int16Array, then normalize to Float32 for Web Audio.
Playback: Schedule buffers using AudioContext.createBufferSource().

2. JavaScript Implementation

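async function speakBinary(text) {
    const audioCtx = new AudioContext({ sampleRate: 24000 });
    let nextStartTime = audioCtx.currentTime;

    const response = await fetch('http://localhost:9000/tts/stream', {
        method: 'POST',
        headers: { 'Content-Type': 'application/json' },
        body: JSON.stringify({ text, language: "Chinese" })
    });
    if (!response.ok) throw new Error(`TTS request failed: ${response.status}`);

    const reader = response.body.getReader();
    let leftover = new Uint8Array(0); // odd trailing byte carried over to the next chunk

    while (true) {
        const { done, value } = await reader.read(); // value is Uint8Array
        if (done) break;

        // Network chunks can end in the middle of a 16-bit sample,
        // so prepend any byte carried over from the previous chunk.
        const bytes = new Uint8Array(leftover.length + value.length);
        bytes.set(leftover, 0);
        bytes.set(value, leftover.length);
        const sampleCount = bytes.length >> 1;
        leftover = bytes.slice(sampleCount * 2);
        if (sampleCount === 0) continue;

        // Convert PCM16 bytes to Float32. Little-endian is assumed here,
        // matching what Int16Array would read on typical hardware.
        const view = new DataView(bytes.buffer, 0, sampleCount * 2);
        const float32Array = new Float32Array(sampleCount);
        for (let i = 0; i < sampleCount; i++) {
            float32Array[i] = view.getInt16(i * 2, true) / 32768.0;
        }

        // Create and play buffer
        const buffer = audioCtx.createBuffer(1, float32Array.length, 24000);
        buffer.getChannelData(0).set(float32Array);

        const source = audioCtx.createBufferSource();
        source.buffer = buffer;
        source.connect(audioCtx.destination);

        // Schedule each chunk right after the previous one to avoid gaps
        const startTime = Math.max(audioCtx.currentTime, nextStartTime);
        source.start(startTime);
        nextStartTime = startTime + buffer.duration;
    }
}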

3. Handling Interruption (Stop-to-Talk)

To implement a "Stop-to-Talk" feature (e.g., when a user interrupts the AI), the client must do two things:

Call the server interrupt API: This stops the server-side inference and flushes the queue.
Clear local audio context: Immediately stop the browser's audio playback.

// Global state for playback control
const audioCtx = new AudioContext({ sampleRate: 24000 });
let nextStartTime = audioCtx.currentTime; // shared with the playback loop
let activeSources = [];

async function interrupt(clientId = "default") {
    // 1. Tell the server to stop and flush the queue
    await fetch(`http://localhost:9000/tts/interrupt?client_id=${encodeURIComponent(clientId)}`, { method: 'POST' });

    // 2. Stop all currently scheduled audio chunks in the browser
    activeSources.forEach(source => {
        try { source.stop(); } catch (e) { /* source already finished */ }
    });
    activeSources = [];

    // 3. Reset the playback timer
    nextStartTime = audioCtx.currentTime;
}

// During playback (inside the streaming loop, where `buffer` and `startTime` are defined),
// keep track of sources:
const source = audioCtx.createBufferSource();
source.buffer = buffer;
source.connect(audioCtx.destination);
source.start(startTime);
activeSources.push(source); // Track this source to stop it later if needed
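
Stopping the scheduled sources does not by itself stop the client's read loop, which may still be waiting on the response body. One way to handle that, sketched here as an assumption rather than as part of the bundled webui.html, is to pair each playback with an AbortController and pass its signal to fetch. The PlaybackSession class is hypothetical.

```javascript
// Groups the abort signal and the scheduled audio sources of one playback.
class PlaybackSession {
    constructor() {
        this.controller = new AbortController();
        this.sources = new Set();
    }

    // Pass this as `signal` in the fetch() options for /tts/stream.
    get signal() {
        return this.controller.signal;
    }

    // Remember a started source and forget it once it finishes playing.
    track(source) {
        this.sources.add(source);
        source.onended = () => this.sources.delete(source);
        return source;
    }

    // Abort the in-flight request and silence everything already scheduled.
    stop() {
        this.controller.abort();
        for (const source of this.sources) {
            try { source.stop(); } catch (e) { /* source already finished */ }
        }
        this.sources.clear();
    }
}
```

With this in place, the client-side part of an interruption becomes a single `session.stop()` call. A pending `reader.read()` then rejects with an AbortError, which the streaming loop should catch and treat as a normal exit.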

Credits

Developed based on the excellent work of the Qwen Team. This extension aims to provide the community with a high-speed serving alternative for real-time AI applications.

License

Adheres to the original License provided by the Qwen team.
