fix: use backend-agnostic APIs for KV cache on GPU backends#5

Open
kikduck wants to merge 1 commit into andrijdavid:main from kikduck:fix/gpu-backend-kv-cache
kikduck commented Mar 8, 2026

Summary

clear_kv_cache() and kv_cache_shift_left() use memset/memmove (CPU operations) on pointers returned by ggml_get_data(). When the KV cache is allocated on a GPU backend (CUDA, Metal, Vulkan) via ggml_backend_alloc_ctx_tensors, these pointers are device addresses — accessing them from the CPU causes an immediate SIGSEGV.

The encoder is unaffected because it does not use a KV cache (non-autoregressive). The crash occurs systematically at the decoder prefill step when calling clear_kv_cache().

Changes

| Function | Before (broken) | After (fixed) |
|---|---|---|
| clear_kv_cache | memset(ggml_get_data(tensor), 0, size) | ggml_backend_tensor_memset(tensor, 0, 0, size) |
| kv_cache_shift_left | memmove/memset on device pointer | ggml_backend_tensor_get → CPU buffer → ggml_backend_tensor_set + ggml_backend_tensor_memset |

These ggml_backend_* APIs are backend-agnostic and handle CPU↔GPU transfers correctly.
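
For reviewers who want to see the staging pattern in isolation, here is a minimal sketch of the get → CPU buffer → set + memset sequence. It is not the code in this PR: `mock_tensor` and the `mock_tensor_*` accessors are hypothetical stand-ins that only mirror the argument order of `ggml_backend_tensor_get`, `ggml_backend_tensor_set` and `ggml_backend_tensor_memset`, so the example builds without ggml. It also assumes the cache is one contiguous buffer with the sequence position as the outermost dimension, which may not match the real tensor layout.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for a backend tensor. Callers treat `storage` as
 * opaque (as if it were device memory) and go through the accessors only. */
typedef struct {
    uint8_t *storage;
    size_t   nbytes;
} mock_tensor;

/* Copy `size` bytes at `offset` from the tensor into host memory. */
void mock_tensor_get(const mock_tensor *t, void *dst, size_t offset, size_t size) {
    assert(offset + size <= t->nbytes);
    memcpy(dst, t->storage + offset, size);
}

/* Copy `size` bytes from host memory into the tensor at `offset`. */
void mock_tensor_set(mock_tensor *t, const void *src, size_t offset, size_t size) {
    assert(offset + size <= t->nbytes);
    memcpy(t->storage + offset, src, size);
}

/* Fill `size` bytes of the tensor at `offset` with `value`. */
void mock_tensor_memset(mock_tensor *t, uint8_t value, size_t offset, size_t size) {
    assert(offset + size <= t->nbytes);
    memset(t->storage + offset, value, size);
}

/* Zero the whole cache without touching the raw pointer from the caller. */
void kv_clear(mock_tensor *t) {
    mock_tensor_memset(t, 0, 0, t->nbytes);
}

/* Drop the oldest `shift` rows: stage the surviving tail in a host buffer,
 * write it back at offset 0, then zero the rows freed at the end.
 * Returns 0 on success, -1 if the staging buffer cannot be allocated. */
int kv_shift_left(mock_tensor *t, size_t row_bytes, size_t n_rows, size_t shift) {
    assert(row_bytes * n_rows <= t->nbytes);
    if (shift == 0) {
        return 0;
    }
    if (shift >= n_rows) {
        mock_tensor_memset(t, 0, 0, row_bytes * n_rows);
        return 0;
    }
    size_t keep = (n_rows - shift) * row_bytes; /* bytes that survive */
    size_t drop = shift * row_bytes;            /* bytes that are discarded */

    void *staging = malloc(keep);
    if (staging == NULL) {
        return -1;
    }
    mock_tensor_get(t, staging, drop, keep);  /* tensor -> host */
    mock_tensor_set(t, staging, 0, keep);     /* host -> tensor, shifted */
    mock_tensor_memset(t, 0, keep, drop);     /* clear the freed tail */
    free(staging);
    return 0;
}
```

The staging buffer costs one extra round trip per shift. That is the price of not relying on an overlapping device-side copy, which this sketch does not assume the backend provides.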

Testing

  • Tested on RTX 5090 (Blackwell, SM 12.0) with CUDA Toolkit 12.8
  • Encoder: ~146 ms, Adapter: ~1.4 ms, Decoder: ~500 ms (prefill + 35 autoregressive steps)
  • CPU-only mode still works identically (backend APIs fall back to memset/memcpy for CPU tensors)

Impact

This is a critical bug fix for anyone using --gpu auto|cuda|metal|vulkan. Without this fix, transcription crashes with SIGSEGV on the first decoder prefill. CPU-only mode (--gpu none) was unaffected.

clear_kv_cache() and kv_cache_shift_left() used memset/memmove (CPU ops)
on pointers returned by ggml_get_data(). When the KV cache is allocated
on a GPU backend (CUDA, Metal, Vulkan) via ggml_backend_alloc_ctx_tensors,
these pointers are device addresses -- accessing them from the CPU causes
an immediate SIGSEGV.

The encoder was unaffected because it does not use a KV cache
(non-autoregressive). The crash occurred systematically at the decoder
prefill step when calling clear_kv_cache().

Replace:
- clear_kv_cache: memset -> ggml_backend_tensor_memset
- kv_cache_shift_left: memmove/memset -> ggml_backend_tensor_get/set/memset

These ggml backend-agnostic APIs handle CPU and GPU transfers correctly.

Tested on RTX 5090 (Blackwell, SM 12.0) with CUDA 12.8.

Made-with: Cursor