fix: use backend-agnostic APIs for KV cache on GPU backends#5
Open
kikduck wants to merge 1 commit into andrijdavid:main from
Conversation
clear_kv_cache() and kv_cache_shift_left() used memset/memmove (CPU ops) on pointers returned by ggml_get_data(). When the KV cache is allocated on a GPU backend (CUDA, Metal, Vulkan) via ggml_backend_alloc_ctx_tensors, these pointers are device addresses -- accessing them from the CPU causes an immediate SIGSEGV. The encoder was unaffected because it does not use a KV cache (non-autoregressive). The crash occurred systematically at the decoder prefill step when calling clear_kv_cache().

Replace:
- clear_kv_cache: memset -> ggml_backend_tensor_memset
- kv_cache_shift_left: memmove/memset -> ggml_backend_tensor_get/set/memset

These ggml backend-agnostic APIs handle CPU and GPU transfers correctly.

Tested on RTX 5090 (Blackwell, SM 12.0) with CUDA 12.8.

Made-with: Cursor
Summary
clear_kv_cache() and kv_cache_shift_left() use memset/memmove (CPU operations) on pointers returned by ggml_get_data(). When the KV cache is allocated on a GPU backend (CUDA, Metal, Vulkan) via ggml_backend_alloc_ctx_tensors, these pointers are device addresses; accessing them from the CPU causes an immediate SIGSEGV. The encoder is unaffected because it does not use a KV cache (non-autoregressive). The crash occurs systematically at the decoder prefill step when calling clear_kv_cache().

Changes
- clear_kv_cache: memset(ggml_get_data(tensor), 0, size) → ggml_backend_tensor_memset(tensor, 0, 0, size)
- kv_cache_shift_left: memmove/memset on the device pointer → ggml_backend_tensor_get into a CPU staging buffer, then ggml_backend_tensor_set + ggml_backend_tensor_memset

These ggml_backend_* APIs are backend-agnostic and handle CPU↔GPU transfers correctly.

Testing

Tested on RTX 5090 (Blackwell, SM 12.0) with CUDA 12.8.
Impact
This is a critical bug fix for anyone using --gpu auto|cuda|metal|vulkan. Without this fix, transcription crashes with SIGSEGV on the first decoder prefill. CPU-only mode (--gpu none) was unaffected.