Real-time Speech-to-Text WebSocket server for conversational agents.
STT Server provides low-latency streaming transcription with turn-taking awareness. It processes continuous audio streams and returns transcription segments as they become available, distinguishing between intermediate (tentative) and final results.
The server solves the problem of integrating speech recognition into conversational AI systems where:
- Low latency is critical for natural conversation flow
- Turn boundaries need to be detected automatically
- Transcription should be progressive (showing partial results)
The system uses a multi-stage async pipeline:
```
Audio → [VAD] → [ASR] → [Sink] → WebSocket
          │       │        │
          │       │        └─ Serializes to JSON and sends to client
          │       └─ Transcribes audio using Canary-Qwen-2.5B
          └─ Detects speech/silence, segments by turn boundaries
```
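The stage-chaining idea can be sketched with asyncio queues: each stage consumes from an input queue and produces to an output queue, with a `None` sentinel for shutdown. This is a simplified illustration; the real infrastructure lives in `stt_server/pipeline.py` and the names below are not its actual API.

```python
import asyncio

async def stage(fn, inq: asyncio.Queue, outq: asyncio.Queue):
    """Apply fn to every item until a None sentinel arrives."""
    while (item := await inq.get()) is not None:
        await outq.put(fn(item))
    await outq.put(None)  # propagate shutdown downstream

async def main():
    q1, q2, q3 = asyncio.Queue(), asyncio.Queue(), asyncio.Queue()
    tasks = [
        asyncio.create_task(stage(lambda x: x, q1, q2)),   # pass-through "VAD"
        asyncio.create_task(stage(str.upper, q2, q3)),     # toy "ASR"
    ]
    for item in ["hello", "world", None]:
        await q1.put(item)
    out = []
    while (item := await q3.get()) is not None:
        out.append(item)
    await asyncio.gather(*tasks)
    return out

print(asyncio.run(main()))  # -> ['HELLO', 'WORLD']
```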
VAD Stage - Uses Silero VAD to classify audio frames as speech or silence. Implements a hysteresis-based state machine to prevent chatter at silence boundaries. Emits audio chunks on small gaps (0.3s) for continuous transcription, and end-of-turn signals on large gaps (1.5s).
ASR Stage - Transcribes audio chunks using Canary-Qwen-2.5B (2.5B parameter model). Maintains audio overlap for context continuity and uses semi-global alignment to merge overlapping transcriptions.
Sink Stage - Terminal stage that serializes transcription segments to JSON and sends them directly to the WebSocket client.
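The VAD stage's gap-based chunking can be sketched as follows. This is a simplified illustration of the 0.3s/1.5s behavior described above, not the actual stage implementation; all names are hypothetical.

```python
from dataclasses import dataclass, field

CHUNK_GAP_S = 0.3  # small gap: flush buffered audio for transcription
TURN_GAP_S = 1.5   # large gap: signal end of turn

@dataclass
class GapTracker:
    silence_s: float = 0.0
    buffer: list = field(default_factory=list)

    def on_frame(self, frame, is_speech: bool, frame_s: float):
        """Returns ("chunk", audio), ("end_of_turn", None), or (None, None)."""
        if is_speech:
            self.silence_s = 0.0
            self.buffer.append(frame)
            return (None, None)
        self.silence_s += frame_s
        if self.silence_s >= TURN_GAP_S:
            self.silence_s = 0.0
            return ("end_of_turn", None)
        if self.silence_s >= CHUNK_GAP_S and self.buffer:
            chunk, self.buffer = self.buffer, []
            return ("chunk", chunk)
        return (None, None)
```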
Lightweight inference-only wrapper for Canary-Qwen-2.5B. Located in nemo_lite/.
The official NVIDIA NeMo toolkit has heavy dependencies that are problematic for deployment: `lhotse`, `nv-one-logger-*`, `fiddle`, `lightning`, `hydra` - none of which are needed for inference.
nemo_lite provides the same transcription capability with minimal dependencies:
- `torch`, `torchaudio` - Core tensor operations
- `transformers`, `peft`, `safetensors` - Model loading
- `librosa` - Mel filterbank (matches NeMo exactly)
Key components:
- `AudioPreprocessor` - Converts PCM to mel spectrogram (128 features, 16kHz)
- `FastConformer` - 32-layer encoder (1024 dim, 8x temporal downsampling)
- `Qwen3-1.7B + LoRA` - Text generation via HuggingFace transformers
Rust/Python library for word sequence merging. Located in strops-rs/.
Provides a `merge_by_overlap(prev, new)` function that uses semi-global alignment to find where the suffix of the previous transcription overlaps with the prefix of the new transcription. This maintains context continuity when processing audio in overlapping chunks.
```python
>>> from strops import merge_by_overlap
>>> merge_by_overlap(["The", "quick", "brown", "fox"], ["brown", "fox", "jumps"])
["The", "quick", "brown", "fox", "jumps"]
```

Built with maturin and pyo3 for Python bindings.
- NVIDIA GPU with CUDA - Required for real-time performance (RTX 2070 or better recommended)
- CPU mode - Works but significantly slower, not suitable for real-time use
- Nix - For dependency management (or manually install Python dependencies)
- ~5GB disk space - For model weights (downloaded on first run)
Development mode:
```bash
# Enter the Nix development shell
nix develop

# Run the server
python stt_server/server.py --port 15751 --host 0.0.0.0

# Or with CPU mode
STT_DEVICE=cpu python stt_server/server.py
```

Using the built package:
```bash
nix build .#stt-server
./result/bin/stt-server --port 15751
```

The server exposes:
- `GET /health` - Health check endpoint
- `WebSocket /ws/transcribe` - Streaming transcription
Client sends AudioFrame messages:
```json
{
  "samples": "<base64-encoded 16-bit PCM>",
  "sampleRate": 16000,
  "channels": 1
}
```

Server sends TranscriptionSegment messages:
```json
{
  "text": "transcribed text here",
  "isFinal": false,
  "isEndOfTurn": false
}
```

Stream from microphone:
```bash
python -m stt_server.scripts.stt_client
```

Interactive device selection is offered if multiple microphones are available.
Stream from audio file:
```bash
python -m stt_server.scripts.stt_client path/to/audio.mp3
```

Supports WAV, FLAC, MP3, OGG with automatic resampling to 16kHz mono.
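A minimal hand-rolled client for the protocol looks roughly like this. The chunk size and the use of the third-party `websockets` package are assumptions; the bundled `stt_client` script is the full-featured version.

```python
import asyncio, base64, json

def encode_frame(chunk: bytes) -> str:
    """Wrap raw 16-bit PCM bytes in the AudioFrame JSON message."""
    return json.dumps({
        "samples": base64.b64encode(chunk).decode(),
        "sampleRate": 16000,
        "channels": 1,
    })

async def stream(pcm: bytes, url: str = "ws://localhost:15751/ws/transcribe"):
    import websockets  # third-party: pip install websockets
    async with websockets.connect(url) as ws:
        # ~100 ms of audio per frame: 3200 bytes at 16 kHz mono, 16-bit
        for i in range(0, len(pcm), 3200):
            await ws.send(encode_frame(pcm[i : i + 3200]))
        while True:
            seg = json.loads(await ws.recv())
            print(seg["text"], "(final)" if seg["isFinal"] else "")
            if seg["isEndOfTurn"]:
                break
```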
Step 1: Add stt-server to your flake inputs
```nix
# flake.nix
{
  inputs = {
    nixpkgs.url = "github:NixOS/nixpkgs/nixos-unstable";
    stt-server.url = "github:breakds/stt-server";
  };

  outputs = { nixpkgs, stt-server, ... }: {
    nixosConfigurations.your-host = nixpkgs.lib.nixosSystem {
      system = "x86_64-linux";
      specialArgs = { inherit stt-server; };
      modules = [ ./configuration.nix ];
    };
  };
}
```

Step 2: Configure the service
```nix
# configuration.nix
{ stt-server, ... }:
{
  imports = [ stt-server.nixosModules.default ];
  nixpkgs.overlays = [ stt-server.overlays.default ];

  # Required for CUDA support
  nixpkgs.config.cudaSupport = true;

  services.stt-server = {
    enable = true;
    port = 15751;
    host = "0.0.0.0";
    device = "cuda";      # or "cpu"
    openFirewall = true;  # for internal network access
  };
}
```

Configuration options:
- `port` - Server port (default: 15751)
- `host` - Bind address (default: "0.0.0.0")
- `device` - "cuda" or "cpu" (default: "cuda")
- `package` - The stt-server package to use
- `openFirewall` - Open the TCP port in the firewall (default: false)
Notes:
- Model weights are cached in `/var/cache/stt-server` (managed by systemd)
- For CUDA support, ensure `nixpkgs.config.cudaSupport = true` is set
- After adding `openFirewall = true`, you may need to run `sudo systemctl restart firewall` for the port to open
Main development:
```bash
nix develop
```

Python environment with all ML dependencies (torch, transformers, etc.) and dev tools (basedpyright, ruff).
strops development:
```bash
nix develop .#strops
```

Rust toolchain for developing the strops library.
```bash
python -m unittest discover -s stt_server/tests
```

Project layout:

```
stt-server/
├── stt_server/            # Main Python package
│   ├── server.py          # FastAPI WebSocket server
│   ├── session.py         # Transcription session management
│   ├── pipeline.py        # Async pipeline infrastructure
│   ├── data_types.py      # Pydantic models for protocol
│   ├── stages/            # Pipeline stages (VAD, ASR, Sink)
│   ├── scripts/           # CLI tools (stt_client)
│   └── tests/             # Unit tests
├── nemo_lite/             # Lightweight Canary-Qwen wrapper
│   ├── model.py           # Main CanaryQwen class
│   ├── preprocessing.py   # Mel spectrogram extraction
│   ├── conformer_lite/    # FastConformer encoder
│   ├── qwen/              # Qwen3 LLM wrapper
│   └── weights.py         # Weight loading utilities
├── strops-rs/             # Rust/Python sequence alignment
│   ├── src/               # Rust source
│   └── nix/               # Nix packaging
└── nix/                   # Nix configuration
    ├── development.nix    # Dev shell
    ├── release.nix        # Package/module exports
    ├── packages/          # Nix packages
    └── modules/           # NixOS modules
```
MIT