flyw/Qwen3-TTS-Streaming-Server

Qwen3-TTS Streaming Server

  🤗 Hugging Face   |   🤖 ModelScope   |   📑 Blog   |   📑 Paper  

Overview

Qwen3-TTS Streaming Server is a high-performance, production-ready FastAPI wrapper for Qwen3-TTS. This project focuses on delivering ultra-low latency audio streaming for real-time human-computer interaction.

What's New in This Fork?

  • Commercial-Grade Text Normalization: Integrates WeTextProcessing and a specialized English abbreviation handler. Numbers (e.g., "2025" -> "二零二五") and units are converted automatically, and spaces are injected into all-caps abbreviations (e.g., "RISC-V" -> "R I S C V") so that the model pronounces them correctly and is less prone to hallucination.
  • Extreme Performance Optimization: Utilizes a specialized monkey-patching technique to intercept model forward passes, enabling the delivery of the first audio chunk almost instantly.
  • Smart Queue Management: Multiple requests from the same client_id (e.g., from an LLM stream) are automatically queued and processed in order, ensuring a seamless multi-sentence speaking experience.
  • On-Demand Interruption: New /tts/interrupt endpoint allows users to stop the current speech and automatically flush all queued requests, perfect for handling user interruptions in voice chat.
  • Integrated WebUI Service: The webui.html is now served directly at the root URL (/), no manual file opening required.
  • Raw Binary Streaming: Replaced SSE/Base64 with raw PCM 16-bit binary streaming, removing the ~33% size overhead of Base64 encoding and lowering client-side CPU usage.
  • Sliding Window Audio Reconstruction: An advanced algorithm ensures seamless audio stitching and high-quality output during streaming.
  • Production Configuration: Full support for CLI arguments and Environment Variables (Model path, Host, Port, Reference Audio).
  • Batch Decoding: Configurable chunk-size on the server to balance RTF performance and real-time feel.
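
The bandwidth point in the Raw Binary Streaming item can be checked with a few lines of arithmetic. The sketch below is illustrative only and is not taken from the server code; it assumes mono audio and uses one second of silence in place of real speech.

```python
import base64

SAMPLE_RATE = 24_000   # Hz, matches the audio/l16;rate=24000 response format
BYTES_PER_SAMPLE = 2   # 16-bit PCM; a single (mono) channel is an assumption

# One second of silent PCM stands in for real audio.
raw = bytes(SAMPLE_RATE * BYTES_PER_SAMPLE)
encoded = base64.b64encode(raw)

overhead = len(encoded) / len(raw) - 1   # extra bytes added by Base64
saving = 1 - len(raw) / len(encoded)     # bytes saved by sending raw PCM instead

print(f"raw: {len(raw)} B, base64: {len(encoded)} B")
print(f"Base64 overhead: {overhead:.1%}, saving from raw streaming: {saving:.1%}")
```

Base64 emits 4 output bytes for every 3 input bytes, so the encoded payload is about one third larger than the raw PCM, and dropping it shrinks the transfer by about one quarter.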

Performance Benchmark

The following metrics were captured with the included webui.html on a system with an NVIDIA RTX 4070 GPU, with chunk_size set to 6.

Real-World Metrics:

  • Time to First Byte (TTFB): ~380ms (From request sent to first audio data arrival)
  • Real-Time Factor (RTF): 0.7 - 0.9 (Generating 1s of audio in 0.7s - 0.9s)
  • Throughput: Zero-overhead binary delivery suitable for production-scale interaction.
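
For readers who want to reproduce the RTF figure on their own hardware, the following sketch shows how it can be derived from the number of PCM bytes received and the wall-clock generation time. The function names are hypothetical, and mono output is an assumption.

```python
SAMPLE_RATE = 24_000   # Hz, from the documented response format
BYTES_PER_SAMPLE = 2   # 16-bit PCM, assuming one channel


def audio_seconds(num_bytes: int) -> float:
    """Duration of the audio represented by num_bytes of raw PCM."""
    return num_bytes / (SAMPLE_RATE * BYTES_PER_SAMPLE)


def real_time_factor(generation_seconds: float, num_bytes: int) -> float:
    """RTF = time spent generating / duration of the generated audio."""
    duration = audio_seconds(num_bytes)
    if duration <= 0:
        raise ValueError("no audio received")
    return generation_seconds / duration


# 480,000 bytes is 10 s of audio; producing it in 8 s gives an RTF of 0.8.
print(real_time_factor(8.0, 480_000))
```

An RTF below 1.0 means audio is produced faster than it plays back, which is what keeps a streaming client from running dry.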

Features

  • Sub-500ms Latency: The first audio arrives in under 500ms (about 380ms TTFB in the benchmark above), fast enough for real-time voice interaction.
  • Smart Text Front-end: Powered by WeTextProcessing, handling complex Chinese-English mixed text, dates, and numbers with production-level accuracy.
  • Abbreviation "Spell-out": Automatically handles technical terms like RISC-V, AI, and LLM by injecting spaces to guide the model's pronunciation.
  • X-Vector Only Mode: Supports cloning from the speaker embedding alone, which avoids prompt leakage (the model no longer hallucinates or repeats the reference text).
  • Configurable Streaming Buffer: Prevent audio stuttering during network fluctuations by buffering tokens before sending (chunk_size).
  • Server-Side Pre-Buffering: Accumulate initial chunks before sending the first packet (pre_buffer) to provide a stable initial stream.
  • High-Fidelity Voice Cloning: Supports rapid 3-second voice cloning with the 12Hz-1.7B-Base model.
  • Multi-Language Support: Native support for CN, EN, JP, KR, DE, FR, RU, PT, ES, IT.
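
The abbreviation "spell-out" idea can be illustrated with a small regular expression. This is a minimal sketch of the technique, not the handler shipped in this repository; the function name `spell_out` and the exact matching rule (two or more capital letters, optionally followed by hyphen-joined capital groups) are assumptions.

```python
import re

# An all-caps token such as "AI", "LLM" or "RISC-V" that is not glued to other
# ASCII letters or digits. Lookarounds are used instead of \b so that the
# pattern also fires inside Chinese text, where neighbours are word characters.
_ABBREV = re.compile(r"(?<![A-Za-z0-9])[A-Z]{2,}(?:-[A-Z]+)*(?![A-Za-z0-9])")


def spell_out(text: str) -> str:
    """Insert spaces between the letters of all-caps abbreviations."""
    def expand(match: re.Match) -> str:
        letters = match.group(0).replace("-", "")
        return " ".join(letters)
    return _ABBREV.sub(expand, text)


print(spell_out("RISC-V"))                # R I S C V
print(spell_out("AI and LLM"))            # A I and L L M
print(spell_out("Hello World"))           # Hello World (unchanged)
```

Ordinary capitalized words are left alone because the pattern requires at least two consecutive capital letters at the start of the token.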

Quickstart

1. Environment Setup

conda create -n qwen3-tts python=3.12 -y
conda activate qwen3-tts

# Install the project and all dependencies (including WeTextProcessing and FastAPI)
pip install -e .

# Highly recommended: FlashAttention 2 for faster inference
pip install -U flash-attn --no-build-isolation

2. Model Preparation

Download the Qwen3-TTS-12Hz-1.7B-Base model (or other variants) to your local directory:

# Example using ModelScope
pip install -U modelscope
modelscope download --model Qwen/Qwen3-TTS-12Hz-1.7B-Base --local_dir ./Qwen3-TTS-12Hz-1.7B-Base

3. Launch the Server

# Recommended: X-Vector mode is enabled by default in server.py to prevent prompt leakage
# Now includes global control over temperature and repetition penalty via CLI
python server.py \
  --model-path ./Qwen3-TTS-12Hz-1.7B-Base \
  --ref-audio reference.wav \
  --temperature 0.8 \
  --repetition-penalty 1.1 \
  --chunk-size 6
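
For orientation, the flags above and the environment variables listed in the Configuration table could be wired together roughly as follows. This is a hedged sketch rather than the actual contents of server.py; `build_parser` is a hypothetical helper, and the precedence (CLI flag, then environment variable, then default) is an assumption.

```python
import argparse
import os


def build_parser(env=None) -> argparse.ArgumentParser:
    """Build a parser whose defaults fall back to environment variables."""
    env = os.environ if env is None else env
    parser = argparse.ArgumentParser(description="Qwen3-TTS streaming server (sketch)")
    # Options that have an environment-variable counterpart in the README table.
    parser.add_argument("--model-path", default=env.get("MODEL_PATH", "./Qwen3-TTS-12Hz-1.7B-Base"))
    parser.add_argument("--ref-audio", default=env.get("REF_AUDIO_PATH"))
    parser.add_argument("--host", default=env.get("HOST", "0.0.0.0"))
    parser.add_argument("--port", type=int, default=int(env.get("PORT", "9000")))
    # Options documented as CLI-only.
    parser.add_argument("--temperature", type=float, default=0.8)
    parser.add_argument("--repetition-penalty", type=float, default=1.1)
    parser.add_argument("--chunk-size", type=int, default=1)
    parser.add_argument("--pre-buffer", type=int, default=0)
    return parser


# Example: PORT comes from the environment, chunk size from the command line.
args = build_parser(env={"PORT": "9100"}).parse_args(
    ["--ref-audio", "reference.wav", "--chunk-size", "6"]
)
print(args.port, args.chunk_size, args.model_path)
```

Passing `env` explicitly keeps the example independent of the real process environment; in a real entry point it would simply default to `os.environ`.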

API Reference

Streaming TTS (Binary)

Endpoint: POST /tts/stream

Request Body:

{
  "text": "Hello world.",
  "language": "Auto",
  "client_id": "user_123"
}

Note: temperature and repetition_penalty are no longer accepted in the request body. They are now managed globally on the server side to ensure synthesis stability and prevent repetition loops.

Response: audio/l16;rate=24000

  • Returns a raw PCM 16-bit, 24,000Hz byte stream.
  • Queuing: Multiple requests with the same client_id will queue up and play sequentially (ideal for LLM streaming outputs).
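
Outside the browser, the stream can be consumed with the Python standard library alone. The sketch below assumes the server runs at http://localhost:9000, sends mono audio, and emits little-endian samples (which is what the JavaScript client later in this README implicitly relies on); `stream_tts` and `write_wav` are hypothetical helper names.

```python
import json
import urllib.request
import wave


def stream_tts(text, client_id="default", base_url="http://localhost:9000", language="Auto"):
    """Yield raw PCM chunks from POST /tts/stream as they arrive."""
    body = json.dumps({"text": text, "language": language, "client_id": client_id}).encode("utf-8")
    request = urllib.request.Request(
        f"{base_url}/tts/stream",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        while True:
            chunk = response.read(4096)
            if not chunk:
                break
            yield chunk


def write_wav(chunks, path, sample_rate=24000):
    """Wrap an iterable of raw PCM16 chunks in a WAV container."""
    with wave.open(path, "wb") as wav:
        wav.setnchannels(1)      # mono is an assumption
        wav.setsampwidth(2)      # 16-bit samples
        wav.setframerate(sample_rate)
        for chunk in chunks:
            wav.writeframes(chunk)
```

With a server running, `write_wav(stream_tts("Hello world.", client_id="user_123"), "hello.wav")` would save the synthesized speech to disk while it streams.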

Interrupt & Flush Queue

Endpoint: POST /tts/interrupt

Query Parameters:

  • client_id (Optional, default: "default"): The ID of the client to interrupt.

Action:

  1. Immediately stops the current inference for the specified client.
  2. Flushes the queue: All other requests for this client_id that are currently waiting will be discarded, allowing new speech to start instantly.
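
The queue-and-flush behaviour described above can be modelled with one asyncio lock per client plus a generation counter. This is a conceptual sketch under those assumptions, not the server's actual implementation; the class `ClientQueues` and its methods are hypothetical, and string chunks stand in for audio.

```python
import asyncio
from collections import defaultdict


class ClientQueues:
    """Run jobs for the same client_id in order and support interrupt-and-flush."""

    def __init__(self):
        self._locks = defaultdict(asyncio.Lock)   # asyncio.Lock wakes waiters in FIFO order
        self._generation = defaultdict(int)       # bumped on interrupt to invalidate older jobs

    async def speak(self, client_id, chunks, sink):
        """Append chunks to sink; return False if the job was interrupted or flushed."""
        ticket = self._generation[client_id]
        async with self._locks[client_id]:
            for chunk in chunks:
                if self._generation[client_id] != ticket:
                    return False
                sink.append(chunk)
                await asyncio.sleep(0)   # yield control, standing in for real inference
        return True

    def interrupt(self, client_id="default"):
        """Stop the running job and discard every job queued before this call."""
        self._generation[client_id] += 1


async def demo():
    queues, out = ClientQueues(), []
    first = asyncio.create_task(queues.speak("user_123", ["a1", "a2", "a3"], out))
    second = asyncio.create_task(queues.speak("user_123", ["b1", "b2"], out))
    await asyncio.sleep(0)               # let the first job emit one chunk
    queues.interrupt("user_123")
    results = await asyncio.gather(first, second)
    third = await queues.speak("user_123", ["c1"], out)   # new speech starts right away
    return out, results, third


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

Jobs that were already waiting keep their old ticket, so they exit as soon as they acquire the lock, while a request submitted after the interrupt picks up the new generation and runs normally.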

Configuration

| CLI Argument | Environment Variable | Default | Description |
| --- | --- | --- | --- |
| --model-path | MODEL_PATH | ./Qwen3-TTS-12Hz-1.7B-Base | Path to model weights |
| --ref-audio | REF_AUDIO_PATH | None | Path to reference audio for cloning (Required) |
| --host | HOST | 0.0.0.0 | Server host |
| --port | PORT | 9000 | Server port |
| --temperature | None | 0.8 | Sampling temperature (Server-side global) |
| --repetition-penalty | None | 1.1 | Repetition penalty (Server-side global) |
| --chunk-size | None | 1 | Global number of tokens to buffer before decoding/sending (Higher = better RTF) |
| --pre-buffer | None | 0 | Number of chunks to buffer on the server before sending the first packet |
| client_id | API Only | "default" | Unique ID per user to enable parallel processing |
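
To make the interaction between --chunk-size and --pre-buffer more concrete, here is a small sketch of the two buffering stages. It only models the behaviour described in the table; the function names are hypothetical, and merging the pre-buffered chunks into a single first packet is an assumption about how the server sends them.

```python
from typing import Iterable, Iterator, List


def batch_tokens(tokens: Iterable[int], chunk_size: int = 1) -> Iterator[List[int]]:
    """Group generated tokens into batches of chunk_size; the last batch may be shorter."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    batch: List[int] = []
    for token in tokens:
        batch.append(token)
        if len(batch) == chunk_size:
            yield batch
            batch = []
    if batch:
        yield batch


def with_pre_buffer(chunks: Iterable[bytes], pre_buffer: int = 0) -> Iterator[bytes]:
    """Hold back the first pre_buffer chunks, send them as one packet, then stream the rest."""
    iterator = iter(chunks)
    held: List[bytes] = []
    for chunk in iterator:
        held.append(chunk)
        if len(held) >= pre_buffer:
            break
    if held:
        yield b"".join(held)
    yield from iterator


print(list(batch_tokens(range(7), chunk_size=3)))
print(list(with_pre_buffer([b"a", b"b", b"c"], pre_buffer=2)))
```

Larger batches give the decoder more work per call, which tends to help RTF, at the cost of a longer wait before each packet; the pre-buffer trades a slightly later first packet for a cushion against early stutter.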

WebUI Testing

The server now hosts a ready-to-use HTML client directly.

  1. Start the server (e.g., on port 9000).
  2. Open your browser and navigate to http://localhost:9000.
  3. Configure your Client ID and Reference Audio.
  4. Click Play to start synthesis and observe real-time metrics.


Client Integration Guide (Binary)

1. High-Level Architecture

  1. Request: Send a POST request to /tts/stream.
  2. Stream Consumption: Use fetch and response.body.getReader().
  3. Conversion: Convert the incoming Uint8Array (bytes) to Int16Array, then normalize to Float32 for Web Audio.
  4. Playback: Schedule buffers using AudioContext.createBufferSource().
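
The conversion step is the same in any language. As a language-neutral reference, the sketch below performs it in Python and also handles a detail that is easy to miss: a network chunk may end in the middle of a 2-byte sample. The function name is hypothetical, and little-endian byte order is an assumption.

```python
import struct
from typing import Iterable, Iterator, List


def pcm16_to_float(chunks: Iterable[bytes]) -> Iterator[List[float]]:
    """Convert raw PCM16 chunks to lists of floats in [-1.0, 1.0)."""
    leftover = b""
    for chunk in chunks:
        data = leftover + chunk
        usable = len(data) - (len(data) % 2)   # whole samples only
        leftover = data[usable:]               # carry an odd trailing byte forward
        if usable == 0:
            continue
        samples = struct.unpack(f"<{usable // 2}h", data[:usable])
        yield [sample / 32768.0 for sample in samples]


# Two samples (0x4000 and 0xC000) split awkwardly across two chunks.
print(list(pcm16_to_float([b"\x00\x40\x00", b"\xc0"])))
```

Dividing by 32768 maps the signed 16-bit range onto the floating-point range that audio APIs such as Web Audio expect.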

2. JavaScript Implementation

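async function speakBinary(text, clientId = "default") {
    const audioCtx = new AudioContext({ sampleRate: 24000 });
    let nextStartTime = audioCtx.currentTime;

    const response = await fetch('http://localhost:9000/tts/stream', {
        method: 'POST',
        headers: { 'Content-Type': 'application/json' },
        body: JSON.stringify({ text, language: "Chinese", client_id: clientId })
    });
    if (!response.ok) {
        throw new Error(`TTS request failed with status ${response.status}`);
    }

    const reader = response.body.getReader();
    let leftover = new Uint8Array(0); // odd trailing byte carried over from the previous chunk

    while (true) {
        const { done, value } = await reader.read(); // value is Uint8Array
        if (done) break;

        // Network chunks are not guaranteed to end on a 2-byte sample boundary,
        // so prepend any carried-over byte and hold back a trailing odd byte.
        const bytes = new Uint8Array(leftover.length + value.length);
        bytes.set(leftover);
        bytes.set(value, leftover.length);
        const usable = bytes.length - (bytes.length % 2);
        leftover = bytes.slice(usable);
        if (usable === 0) continue;

        // Convert PCM16 bytes to Float32 (the copy above guarantees 2-byte alignment)
        const int16Array = new Int16Array(bytes.buffer, 0, usable / 2);
        const float32Array = new Float32Array(int16Array.length);
        for (let i = 0; i < int16Array.length; i++) {
            float32Array[i] = int16Array[i] / 32768.0;
        }

        // Create and play buffer
        const buffer = audioCtx.createBuffer(1, float32Array.length, 24000);
        buffer.getChannelData(0).set(float32Array);

        const source = audioCtx.createBufferSource();
        source.buffer = buffer;
        source.connect(audioCtx.destination);

        // Schedule each chunk right after the previous one to avoid gaps
        const startTime = Math.max(audioCtx.currentTime, nextStartTime);
        source.start(startTime);
        nextStartTime = startTime + buffer.duration;
    }
}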

3. Handling Interruption (Stop-to-Talk)

To implement a "Stop-to-Talk" feature (e.g., when a user interrupts the AI), the client must do two things:

  1. Call the server interrupt API: This stops the server-side inference and flushes the queue.
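  2. Clear local audio context: Immediately stop the browser's audio playback.

// Global state for playback control
let audioCtx = new AudioContext({ sampleRate: 24000 });
let activeSources = [];
let nextStartTime = 0;

async function interrupt(clientId = "default") {
    // 1. Tell the server to stop and flush the queue
    await fetch(`http://localhost:9000/tts/interrupt?client_id=${encodeURIComponent(clientId)}`, { method: 'POST' });

    // 2. Stop all currently scheduled audio chunks in the browser
    activeSources.forEach(source => {
        try { source.stop(); } catch (e) { /* source was already stopped */ }
    });
    activeSources = [];

    // 3. Reset the playback timer
    nextStartTime = audioCtx.currentTime;
}

// During playback, schedule every decoded AudioBuffer through this helper
// so that each source is tracked and can be stopped later:
function playChunk(buffer) {
    const source = audioCtx.createBufferSource();
    source.buffer = buffer;
    source.connect(audioCtx.destination);
    const startTime = Math.max(audioCtx.currentTime, nextStartTime);
    source.start(startTime);
    nextStartTime = startTime + buffer.duration;
    activeSources.push(source); // Track this source to stop it later if needed
    // Forget sources that finished playing on their own
    source.onended = () => { activeSources = activeSources.filter(s => s !== source); };
}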

Credits

Developed based on the excellent work of the Qwen Team. This extension aims to provide the community with a high-speed serving alternative for real-time AI applications.

License

Adheres to the original License provided by the Qwen team.

About

Qwen3-TTS is an open-source series of TTS models developed by the Qwen team at Alibaba Cloud, supporting stable, expressive, and streaming speech generation, free-form voice design, and vivid voice cloning.
