Description
Name and Version
version: 5648 (ed52f36)
built with cc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0 for x86_64-linux-gnu
Operating systems
Linux
GGML backends
CUDA
Hardware
3x RTX 3090
Models
Cohere Command-A Q4_K_S
Problem description & steps to reproduce
Requests with the same prompt do not properly utilize the KV cache:
If I send a text completion or chat completion request once and then send the exact same request again, the prompt is processed from the beginning instead of being reused from the KV cache.
The relevant line in the log:
forcing full prompt re-processing due to lack of cache data (likely due to SWA, see https://github.com/ggml-org/llama.cpp/pull/13194#issuecomment-2868343055)
Before the "first bad commit", regenerating requests worked as expected.
First Bad Commit
3600cc2 is the first bad commit
commit 3600cc2
Author: Georgi Gerganov <ggerganov@gmail.com>
Date:   Sat May 31 15:57:44 2025 +0300

    llama : use n_swa + n_ubatch cells for SWA cache (#13833)

    * llama : use n_swa + n_ubatch cells for SWA cache

    ggml-ci

    * llama : add warning about multi-sqeuence SWA contexts
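For completeness, this is roughly how the regression can be narrowed down with git bisect (a sketch; the good ref is a placeholder and the build flags match my CUDA setup):

```sh
# Rough bisect sketch; <known-good-ref> is a placeholder for any build from
# before the regression. At each bisect step: rebuild, re-run the two identical
# requests above, and check the server log for the
# "forcing full prompt re-processing" line.
git bisect start
git bisect bad master
git bisect good <known-good-ref>

cmake -B build -DGGML_CUDA=ON
cmake --build build --target llama-server -j

# ...run the test, then mark the current commit accordingly:
git bisect good    # or: git bisect bad
```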
Relevant log output
./llama-server -m CohereForAI_c4ai-command-a-03-2025-Q4_K_S-00001-of-00002.gguf -c 4096 --temp 0.7 --min_p 0.1 -ctk q8_0 -ctv q8_0 -fa -ngl 999 --jinja
srv update_slots: all slots are idle
srv params_from_: Chat format: Content-only
slot launch_slot_: id 0 | task 0 | processing task
slot update_slots: id 0 | task 0 | new prompt, n_ctx_slot = 40960, n_keep = 0, n_prompt_tokens = 524
slot update_slots: id 0 | task 0 | kv cache rm [0, end)
slot update_slots: id 0 | task 0 | prompt processing progress, n_past = 524, n_tokens = 524, progress = 1.000000
slot update_slots: id 0 | task 0 | prompt done, n_past = 524, n_tokens = 524
slot release: id 0 | task 0 | stop processing: n_past = 526, truncated = 0
slot print_timing: id 0 | task 0 |
prompt eval time = 3220.10 ms / 524 tokens ( 6.15 ms per token, 162.73 tokens per second)
eval time = 177.28 ms / 3 tokens ( 59.09 ms per token, 16.92 tokens per second)
total time = 3397.38 ms / 527 tokens
srv update_slots: all slots are idle
srv log_server_r: request: POST /v1/chat/completions 127.0.0.1 200
srv params_from_: Chat format: Content-only
slot launch_slot_: id 0 | task 4 | processing task
slot update_slots: id 0 | task 4 | new prompt, n_ctx_slot = 40960, n_keep = 0, n_prompt_tokens = 524
slot update_slots: id 0 | task 4 | n_past = 524, cache_tokens.size() = 526, seq_id = 0, pos_min = 0, n_swa = 4096
slot update_slots: id 0 | task 4 | forcing full prompt re-processing due to lack of cache data (likely due to SWA, see https://github.com/ggml-org/llama.cpp/pull/13194#issuecomment-2868343055)
slot update_slots: id 0 | task 4 | kv cache rm [0, end)
slot update_slots: id 0 | task 4 | prompt processing progress, n_past = 524, n_tokens = 524, progress = 1.000000
slot update_slots: id 0 | task 4 | prompt done, n_past = 524, n_tokens = 524
slot release: id 0 | task 4 | stop processing: n_past = 527, truncated = 0
slot print_timing: id 0 | task 4 |
prompt eval time = 3233.55 ms / 524 tokens ( 6.17 ms per token, 162.05 tokens per second)
eval time = 239.97 ms / 4 tokens ( 59.99 ms per token, 16.67 tokens per second)
total time = 3473.52 ms / 528 tokens
srv update_slots: all slots are idle
srv log_server_r: request: POST /v1/chat/completions 127.0.0.1 200