
feat: add voxtral-server HTTP transcription server (OpenAI Whisper-compatible)#6

Open
kikduck wants to merge 2 commits into andrijdavid:main from kikduck:feat/http-server
Conversation


@kikduck kikduck commented Mar 8, 2026

Summary

Add a standalone HTTP server (voxtral-server) with an OpenAI Whisper-compatible API for audio transcription, similar to what whisper.cpp and llama.cpp offer.

The model is loaded once at startup and reused across requests. Inference is serialized via std::mutex (voxtral_context is not thread-safe). This avoids the overhead of loading the ~2.8 GB model for each transcription.

New files

  • src/server.cpp — HTTP server (~550 lines)
  • CMakeLists.txt — New voxtral-server target + FetchContent for cpp-httplib v0.20.0

API Endpoints

  Method  Path                        Description
  GET     /health                     Health check → {"status":"ok"}
  GET     /v1/models                  List loaded model
  POST    /v1/audio/transcriptions    Transcribe audio (OpenAI Whisper-compatible)

Transcription endpoint

Accepts two input methods:

  • Multipart file upload (curl -F "file=@audio.wav") — standard OpenAI Whisper format
  • JSON with base64 ({"audio_base64": "..."}) — convenient for programmatic clients

Returns {"text": "...", "duration": 2.301} (JSON) or plain text.

Additional changes

  • New --stdin interactive mode for the voxtral CLI, which keeps the model loaded across transcriptions when reading audio paths from stdin (see second commit)

Build

mkdir build && cd build
cmake .. -DCMAKE_BUILD_TYPE=Release   # VOXTRAL_BUILD_SERVER=ON by default
cmake --build . -j$(nproc)
# Produces: voxtral, voxtral-server, voxtral-quantize

To disable: cmake .. -DVOXTRAL_BUILD_SERVER=OFF

Usage

./voxtral-server --model path/to/Q4_K_M.gguf --gpu auto --port 8090

# Test
curl http://localhost:8090/health
curl -X POST http://localhost:8090/v1/audio/transcriptions -F "file=@audio.wav"

Design decisions

  • No external dependencies besides cpp-httplib (header-only, fetched at build time)
  • RAII temp files for uploaded audio (auto-cleaned)
  • CORS enabled for browser clients
  • Signal handling (SIGINT/SIGTERM) for graceful shutdown
  • Configurable: host, port, threads, max-tokens, GPU backend, log level

kikduck added 2 commits March 8, 2026 13:28
clear_kv_cache() and kv_cache_shift_left() used memset/memmove (CPU ops)
on pointers returned by ggml_get_data(). When the KV cache is allocated
on a GPU backend (CUDA, Metal, Vulkan) via ggml_backend_alloc_ctx_tensors,
these pointers are device addresses -- accessing them from the CPU causes
an immediate SIGSEGV.

The encoder was unaffected because it does not use a KV cache
(non-autoregressive). The crash occurred systematically at the decoder
prefill step when calling clear_kv_cache().

Replace:
- clear_kv_cache: memset -> ggml_backend_tensor_memset
- kv_cache_shift_left: memmove/memset -> ggml_backend_tensor_get/set/memset

These ggml backend-agnostic APIs handle CPU and GPU transfers correctly.

Tested on RTX 5090 (Blackwell, SM 12.0) with CUDA 12.8.

Made-with: Cursor

Add a standalone HTTP server (voxtral-server) with an OpenAI
Whisper-compatible API for audio transcription.

Features:
- POST /v1/audio/transcriptions (multipart file upload + JSON base64)
- GET /health, GET /v1/models
- Model loaded once at startup, inference serialized via mutex
- CORS support for browser clients
- Temporary files auto-cleaned via RAII
- Signal handling for graceful shutdown
- cpp-httplib (header-only) fetched via CMake FetchContent

Also adds --stdin interactive mode to the CLI (voxtral), allowing
the model to stay loaded between transcriptions when reading audio
paths from stdin.

Build: cmake .. -DVOXTRAL_BUILD_SERVER=ON (default: ON)
Made-with: Cursor