
feat: add voxtral-server HTTP transcription server (OpenAI Whisper-compatible)#6

Open
kikduck wants to merge 2 commits into andrijdavid:main from kikduck:feat/http-server
Conversation


@kikduck kikduck commented Mar 8, 2026

Summary

Add a standalone HTTP server (voxtral-server) with an OpenAI Whisper-compatible API for audio transcription, similar to what whisper.cpp and llama.cpp offer.

The model is loaded once at startup and reused across requests. Inference is serialized via std::mutex (voxtral_context is not thread-safe). This avoids the overhead of loading the ~2.8 GB model for each transcription.

New files

  • src/server.cpp — HTTP server (~550 lines)
  • CMakeLists.txt — New voxtral-server target + FetchContent for cpp-httplib v0.20.0

API Endpoints

  Method  Path                        Description
  GET     /health                     Health check → {"status":"ok"}
  GET     /v1/models                  List loaded model
  POST    /v1/audio/transcriptions    Transcribe audio (OpenAI Whisper-compatible)

Transcription endpoint

Accepts two input methods:

  • Multipart file upload (curl -F "file=@audio.wav") — standard OpenAI Whisper format
  • JSON with base64 ({"audio_base64": "..."}) — convenient for programmatic clients

Returns {"text": "...", "duration": 2.301} (JSON) or plain text.

Additional changes

  • New --stdin interactive mode for the voxtral CLI, which keeps the model loaded across transcriptions when reading audio paths from stdin (see second commit)

Build

mkdir build && cd build
cmake .. -DCMAKE_BUILD_TYPE=Release   # VOXTRAL_BUILD_SERVER=ON by default
cmake --build . -j$(nproc)
# Produces: voxtral, voxtral-server, voxtral-quantize

To disable: cmake .. -DVOXTRAL_BUILD_SERVER=OFF

Usage

./voxtral-server --model path/to/Q4_K_M.gguf --gpu auto --port 8090

# Test
curl http://localhost:8090/health
curl -X POST http://localhost:8090/v1/audio/transcriptions -F "file=@audio.wav"

Design decisions

  • No external dependencies besides cpp-httplib (header-only, fetched at build time)
  • RAII temp files for uploaded audio (auto-cleaned)
  • CORS enabled for browser clients
  • Signal handling (SIGINT/SIGTERM) for graceful shutdown
  • Configurable: host, port, threads, max-tokens, GPU backend, log level

kikduck added 2 commits March 8, 2026 13:28
clear_kv_cache() and kv_cache_shift_left() used memset/memmove (CPU ops)
on pointers returned by ggml_get_data(). When the KV cache is allocated
on a GPU backend (CUDA, Metal, Vulkan) via ggml_backend_alloc_ctx_tensors,
these pointers are device addresses -- accessing them from the CPU causes
an immediate SIGSEGV.

The encoder was unaffected because it does not use a KV cache
(non-autoregressive). The crash occurred systematically at the decoder
prefill step when calling clear_kv_cache().

Replace:
- clear_kv_cache: memset -> ggml_backend_tensor_memset
- kv_cache_shift_left: memmove/memset -> ggml_backend_tensor_get/set/memset

These ggml backend-agnostic APIs handle CPU and GPU transfers correctly.

Tested on RTX 5090 (Blackwell, SM 12.0) with CUDA 12.8.

Made-with: Cursor

Add a standalone HTTP server (voxtral-server) with an OpenAI
Whisper-compatible API for audio transcription.

Features:
- POST /v1/audio/transcriptions (multipart file upload + JSON base64)
- GET /health, GET /v1/models
- Model loaded once at startup, inference serialized via mutex
- CORS support for browser clients
- Temporary files auto-cleaned via RAII
- Signal handling for graceful shutdown
- cpp-httplib (header-only) fetched via CMake FetchContent

Also adds --stdin interactive mode to the CLI (voxtral), allowing
the model to stay loaded between transcriptions when reading audio
paths from stdin.

Build: cmake .. -DVOXTRAL_BUILD_SERVER=ON (default: ON)
Made-with: Cursor