feat: add voxtral-server HTTP transcription server (OpenAI Whisper-compatible) #6
Open
kikduck wants to merge 2 commits into andrijdavid:main
Conversation
clear_kv_cache() and kv_cache_shift_left() used memset/memmove (CPU ops) on pointers returned by ggml_get_data(). When the KV cache is allocated on a GPU backend (CUDA, Metal, Vulkan) via ggml_backend_alloc_ctx_tensors, these pointers are device addresses; accessing them from the CPU causes an immediate SIGSEGV. The encoder was unaffected because it is non-autoregressive and does not use a KV cache. The crash occurred systematically at the decoder prefill step, on the first call to clear_kv_cache().

Replace:
- clear_kv_cache: memset -> ggml_backend_tensor_memset
- kv_cache_shift_left: memmove/memset -> ggml_backend_tensor_get/set/memset

These backend-agnostic ggml APIs handle CPU and GPU transfers correctly.

Tested on RTX 5090 (Blackwell, SM 12.0) with CUDA 12.8.

Made-with: Cursor
Add a standalone HTTP server (voxtral-server) with an OpenAI Whisper-compatible API for audio transcription.

Features:
- POST /v1/audio/transcriptions (multipart file upload + JSON base64)
- GET /health, GET /v1/models
- Model loaded once at startup, inference serialized via mutex
- CORS support for browser clients
- Temporary files auto-cleaned via RAII
- Signal handling for graceful shutdown
- cpp-httplib (header-only) fetched via CMake FetchContent

Also adds an --stdin interactive mode to the CLI (voxtral), allowing the model to stay loaded between transcriptions when reading audio paths from stdin.

Build: cmake .. -DVOXTRAL_BUILD_SERVER=ON (default: ON)

Made-with: Cursor
Summary
Add a standalone HTTP server (voxtral-server) with an OpenAI Whisper-compatible API for audio transcription, similar to what whisper.cpp and llama.cpp offer.

The model is loaded once at startup and reused across requests. Inference is serialized via std::mutex (voxtral_context is not thread-safe). This avoids the overhead of loading the ~2.8 GB model for each transcription.

New files

- src/server.cpp -- HTTP server (~550 lines)
- CMakeLists.txt -- new voxtral-server target + FetchContent for cpp-httplib v0.20.0

API Endpoints
- GET /health -> {"status":"ok"}
- GET /v1/models
- POST /v1/audio/transcriptions

Transcription endpoint
Accepts two input methods:

- Multipart file upload (curl -F "file=@audio.wav") -- standard OpenAI Whisper format
- JSON body ({"audio_base64": "..."}) -- convenient for programmatic clients

Returns {"text": "...", "duration": 2.301} (JSON) or plain text.

Additional changes
--stdin mode (src/main.cpp): reads audio paths from stdin, keeps the model loaded between transcriptions, and terminates each transcript with a __VOXTRAL_END__ sentinel. Useful for integration with scripts or Python subprocesses.

Build
To disable: cmake .. -DVOXTRAL_BUILD_SERVER=OFF

Usage
Design decisions