fix: use backend-agnostic APIs for KV cache on GPU backends by kikduck · Pull Request #5 · andrijdavid/voxtral.cpp

kikduck · 2026-03-08T12:29:41Z

Summary

clear_kv_cache() and kv_cache_shift_left() use memset/memmove (CPU operations) on pointers returned by ggml_get_data(). When the KV cache is allocated on a GPU backend (CUDA, Metal, Vulkan) via ggml_backend_alloc_ctx_tensors, these pointers are device addresses — accessing them from the CPU causes an immediate SIGSEGV.

The encoder is unaffected because it does not use a KV cache (non-autoregressive). The crash occurs systematically at the decoder prefill step when calling clear_kv_cache().

Changes

| Function | Before (broken) | After (fixed) |
| --- | --- | --- |
| `clear_kv_cache` | `memset(ggml_get_data(tensor), 0, size)` | `ggml_backend_tensor_memset(tensor, 0, 0, size)` |
| `kv_cache_shift_left` | `memmove`/`memset` on device pointer | `ggml_backend_tensor_get` → CPU buffer → `ggml_backend_tensor_set` + `ggml_backend_tensor_memset` |

These ggml_backend_* APIs are backend-agnostic and handle CPU↔GPU transfers correctly.
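
The staging pattern in the `kv_cache_shift_left` row can be sketched without ggml at all. What follows is a minimal illustration, not the code from this PR: `FakeTensor` and the `fake_tensor_*` helpers are hypothetical stand-ins whose parameter order mirrors `ggml_backend_tensor_get`, `ggml_backend_tensor_set`, and `ggml_backend_tensor_memset`, and the cache layout (contiguous rows of `row_bytes` bytes, one row per position) is an assumption.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical stand-in for a tensor whose storage is not host-addressable.
// Callers go through the fake_tensor_* helpers and never touch `device` directly.
struct FakeTensor {
    std::vector<uint8_t> device;
};

// Device -> host copy (same parameter order as ggml_backend_tensor_get).
void fake_tensor_get(const FakeTensor & t, void * dst, size_t offset, size_t size) {
    assert(offset + size <= t.device.size());
    std::memcpy(dst, t.device.data() + offset, size);
}

// Host -> device copy (same parameter order as ggml_backend_tensor_set).
void fake_tensor_set(FakeTensor & t, const void * src, size_t offset, size_t size) {
    assert(offset + size <= t.device.size());
    std::memcpy(t.device.data() + offset, src, size);
}

// Device-side fill (same parameter order as ggml_backend_tensor_memset).
void fake_tensor_memset(FakeTensor & t, uint8_t value, size_t offset, size_t size) {
    assert(offset + size <= t.device.size());
    std::memset(t.device.data() + offset, value, size);
}

// Drop the oldest `shift` rows, move the remaining rows to the front,
// and zero the freed tail. Assumed layout: n_rows rows of row_bytes each.
void kv_shift_left(FakeTensor & t, size_t row_bytes, size_t n_rows, size_t shift) {
    assert(shift <= n_rows);
    const size_t keep_bytes = (n_rows - shift) * row_bytes;
    if (keep_bytes > 0) {
        // Source and destination overlap, so stage through a host buffer
        // instead of copying in place.
        std::vector<uint8_t> staging(keep_bytes);
        fake_tensor_get(t, staging.data(), shift * row_bytes, keep_bytes);
        fake_tensor_set(t, staging.data(), 0, keep_bytes);
    }
    fake_tensor_memset(t, 0, keep_bytes, shift * row_bytes);
}
```

The round trip through host memory costs two transfers per shift, which is the price of staying backend-agnostic in this approach.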

Testing

- Tested on RTX 5090 (Blackwell, SM 12.0) with CUDA Toolkit 12.8
- Encoder: ~146 ms, Adapter: ~1.4 ms, Decoder: ~500 ms (prefill + 35 autoregressive steps)
- CPU-only mode still works identically (backend APIs fall back to memset/memcpy for CPU tensors)
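
One way to picture why CPU-only behaviour does not change is a dispatch through a per-buffer interface, where the host implementation is just `memset`/`memcpy`. This is a rough sketch of the idea only: `Buffer`, `HostBuffer`, and `tensor_memset` are hypothetical names rather than ggml types, and ggml's real buffer interface differs in detail.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical per-backend buffer interface.
struct Buffer {
    virtual ~Buffer() = default;
    virtual void memset_bytes(uint8_t value, size_t offset, size_t size) = 0;
    virtual void get_bytes(void * dst, size_t offset, size_t size) const = 0;
};

// Host implementation: the storage is ordinary memory, so the calls
// reduce to plain memset/memcpy.
struct HostBuffer final : Buffer {
    HostBuffer(size_t n, uint8_t fill) : bytes(n, fill) {}

    void memset_bytes(uint8_t value, size_t offset, size_t size) override {
        assert(offset + size <= bytes.size());
        std::memset(bytes.data() + offset, value, size);
    }

    void get_bytes(void * dst, size_t offset, size_t size) const override {
        assert(offset + size <= bytes.size());
        std::memcpy(dst, bytes.data() + offset, size);
    }

    std::vector<uint8_t> bytes;
};

// Backend-agnostic entry point: the caller never sees a raw pointer,
// so the same call works whether the buffer lives on the host or a device.
void tensor_memset(Buffer & buf, uint8_t value, size_t offset, size_t size) {
    buf.memset_bytes(value, offset, size);
}
```

A GPU implementation of the same interface would issue a device-side fill or transfer instead, without any change at the call site.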

Impact

This is a critical bug fix for anyone using --gpu auto|cuda|metal|vulkan. Without this fix, transcription crashes with SIGSEGV on the first decoder prefill. CPU-only mode (--gpu none) was unaffected.

clear_kv_cache() and kv_cache_shift_left() used memset/memmove (CPU ops) on pointers returned by ggml_get_data(). When the KV cache is allocated on a GPU backend (CUDA, Metal, Vulkan) via ggml_backend_alloc_ctx_tensors, these pointers are device addresses -- accessing them from the CPU causes an immediate SIGSEGV.

The encoder was unaffected because it does not use a KV cache (non-autoregressive). The crash occurred systematically at the decoder prefill step when calling clear_kv_cache().

Replace:
- clear_kv_cache: memset -> ggml_backend_tensor_memset
- kv_cache_shift_left: memmove/memset -> ggml_backend_tensor_get/set/memset

These ggml backend-agnostic APIs handle CPU and GPU transfers correctly.

Tested on RTX 5090 (Blackwell, SM 12.0) with CUDA 12.8.

Made-with: Cursor

kikduck mentioned this pull request Mar 8, 2026

feat: add voxtral-server HTTP transcription server (OpenAI Whisper-compatible) #6

Open

kikduck wants to merge 1 commit into andrijdavid:main from kikduck:fix/gpu-backend-kv-cache
