Description
Name and Version
version: 5648 (ed52f36)
built with cc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0 for x86_64-linux-gnu
Operating systems
Linux
GGML backends
CUDA
Hardware
3x RTX 3090
Models
Cohere Command-A Q4_K_S
Problem description & steps to reproduce
Requests with the same prompt do not properly utilize the KV cache:
If I send a text completion or chat completion request once and then send the exact same request again, the prompt is processed from the beginning instead of being reused from the KV cache.
The relevant line in the log:
forcing full prompt re-processing due to lack of cache data (likely due to SWA, see https://github.com/ggml-org/llama.cpp/pull/13194#issuecomment-2868343055)
Before the "first bad commit", regenerating requests worked as expected.
First Bad Commit
3600cc2 is the first bad commit
commit 3600cc2
Author: Georgi Gerganov <ggerganov@gmail.com>
Date:   Sat May 31 15:57:44 2025 +0300

    llama : use n_swa + n_ubatch cells for SWA cache (#13833)

    * llama : use n_swa + n_ubatch cells for SWA cache

    ggml-ci

    * llama : add warning about multi-sqeuence SWA contexts
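For completeness, this is roughly how the regression can be narrowed down with git bisect (a sketch; the good ref is a placeholder and the build flags match my CUDA setup):

```sh
# Rough bisect sketch; <known-good-ref> is a placeholder for any build from
# before the regression. At each bisect step: rebuild, re-run the two identical
# requests above, and check the server log for the
# "forcing full prompt re-processing" line.
git bisect start
git bisect bad master
git bisect good <known-good-ref>

cmake -B build -DGGML_CUDA=ON
cmake --build build --target llama-server -j

# ...run the test, then mark the current commit accordingly:
git bisect good    # or: git bisect bad
```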
Relevant log output
./llama-server -m CohereForAI_c4ai-command-a-03-2025-Q4_K_S-00001-of-00002.gguf -c 4096 --temp 0.7 --min_p 0.1 -ctk q8_0 -ctv q8_0 -fa -ngl 999 --jinja
srv update_slots: all slots are idle
srv params_from_: Chat format: Content-only
slot launch_slot_: id 0 | task 0 | processing task
slot update_slots: id 0 | task 0 | new prompt, n_ctx_slot = 40960, n_keep = 0, n_prompt_tokens = 524
slot update_slots: id 0 | task 0 | kv cache rm [0, end)
slot update_slots: id 0 | task 0 | prompt processing progress, n_past = 524, n_tokens = 524, progress = 1.000000
slot update_slots: id 0 | task 0 | prompt done, n_past = 524, n_tokens = 524
slot release: id 0 | task 0 | stop processing: n_past = 526, truncated = 0
slot print_timing: id 0 | task 0 |
prompt eval time = 3220.10 ms / 524 tokens ( 6.15 ms per token, 162.73 tokens per second)
eval time = 177.28 ms / 3 tokens ( 59.09 ms per token, 16.92 tokens per second)
total time = 3397.38 ms / 527 tokens
srv update_slots: all slots are idle
srv log_server_r: request: POST /v1/chat/completions 127.0.0.1 200
srv params_from_: Chat format: Content-only
slot launch_slot_: id 0 | task 4 | processing task
slot update_slots: id 0 | task 4 | new prompt, n_ctx_slot = 40960, n_keep = 0, n_prompt_tokens = 524
slot update_slots: id 0 | task 4 | n_past = 524, cache_tokens.size() = 526, seq_id = 0, pos_min = 0, n_swa = 4096
slot update_slots: id 0 | task 4 | forcing full prompt re-processing due to lack of cache data (likely due to SWA, see https://github.com/ggml-org/llama.cpp/pull/13194#issuecomment-2868343055)
slot update_slots: id 0 | task 4 | kv cache rm [0, end)
slot update_slots: id 0 | task 4 | prompt processing progress, n_past = 524, n_tokens = 524, progress = 1.000000
slot update_slots: id 0 | task 4 | prompt done, n_past = 524, n_tokens = 524
slot release: id 0 | task 4 | stop processing: n_past = 527, truncated = 0
slot print_timing: id 0 | task 4 |
prompt eval time = 3233.55 ms / 524 tokens ( 6.17 ms per token, 162.05 tokens per second)
eval time = 239.97 ms / 4 tokens ( 59.99 ms per token, 16.67 tokens per second)
total time = 3473.52 ms / 528 tokens
srv update_slots: all slots are idle
srv log_server_r: request: POST /v1/chat/completions 127.0.0.1 200