Eval bug: Command-A forces full-prompt re-processing due to lack of cache data #14157

Closed
@schynce

Description

Name and Version

version: 5648 (ed52f36)
built with cc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0 for x86_64-linux-gnu

Operating systems

Linux

GGML backends

CUDA

Hardware

3x RTX 3090

Models

Cohere Command-A Q4_K_S

Problem description & steps to reproduce

Requests with the same prompt do not properly utilize the KV cache:

If I request a text completion or chat completion once and then send the exact same request again, the prompt is processed from the beginning instead of reusing the KV cache.
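For reference, a minimal repro sketch of the two-identical-requests pattern, assuming llama-server is listening on its default http://localhost:8080 and using the /v1/chat/completions endpoint shown in the log below; the prompt content and the timing comparison are arbitrary:

    # Minimal repro sketch. Assumptions: llama-server on http://localhost:8080
    # (the default port), the /v1/chat/completions endpoint from the log, and
    # an arbitrary prompt; only the relative timing of the two calls matters.
    import time

    import requests

    URL = "http://localhost:8080/v1/chat/completions"

    payload = {
        "messages": [{"role": "user", "content": "Summarize this sentence. " * 100}],
        "max_tokens": 8,
        "temperature": 0.7,
    }

    def timed_request() -> float:
        t0 = time.perf_counter()
        r = requests.post(URL, json=payload, timeout=600)
        r.raise_for_status()
        return time.perf_counter() - t0

    first = timed_request()
    second = timed_request()  # identical prompt: expected to reuse the KV cache

    print(f"first: {first:.2f}s, second: {second:.2f}s")
    # With a working prompt cache the second call should be noticeably faster;
    # with Command-A both calls take roughly the same time (full re-processing).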

The relevant line in the log:

forcing full prompt re-processing due to lack of cache data (likely due to SWA, see https://github.com/ggml-org/llama.cpp/pull/13194#issuecomment-2868343055)

Before the "first bad commit", regenerating requests worked as expected.

First Bad Commit

3600cc2 is the first bad commit

commit 3600cc2
Author: Georgi Gerganov <ggerganov@gmail.com>
Date:   Sat May 31 15:57:44 2025 +0300

    llama : use n_swa + n_ubatch cells for SWA cache (#13833)

    * llama : use n_swa + n_ubatch cells for SWA cache

    ggml-ci

    * llama : add warning about multi-sqeuence SWA contexts
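For context, this commit sizes the SWA cache at n_swa + n_ubatch cells instead of the full context. A back-of-the-envelope sketch with the values from the log below (n_ubatch = 512 is an assumed default, not something reported in the log):

    # Illustrative arithmetic only, not llama.cpp code. n_ubatch = 512 is assumed.
    n_swa    = 4096   # sliding-window size reported in the log below
    n_ubatch = 512    # assumed default micro-batch size
    n_ctx    = 40960  # n_ctx_slot reported in the log below

    swa_cells = n_swa + n_ubatch  # cells allocated for the SWA cache after this commit
    print(f"SWA cache cells: {swa_cells} vs. context size: {n_ctx}")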

Relevant log output

./llama-server -m CohereForAI_c4ai-command-a-03-2025-Q4_K_S-00001-of-00002.gguf -c 4096 --temp 0.7 --min_p 0.1 -ctk q8_0 -ctv q8_0 -fa -ngl 999 --jinja

srv  update_slots: all slots are idle
srv  params_from_: Chat format: Content-only
slot launch_slot_: id  0 | task 0 | processing task
slot update_slots: id  0 | task 0 | new prompt, n_ctx_slot = 40960, n_keep = 0, n_prompt_tokens = 524
slot update_slots: id  0 | task 0 | kv cache rm [0, end)
slot update_slots: id  0 | task 0 | prompt processing progress, n_past = 524, n_tokens = 524, progress = 1.000000
slot update_slots: id  0 | task 0 | prompt done, n_past = 524, n_tokens = 524
slot      release: id  0 | task 0 | stop processing: n_past = 526, truncated = 0
slot print_timing: id  0 | task 0 |
prompt eval time =    3220.10 ms /   524 tokens (    6.15 ms per token,   162.73 tokens per second)
       eval time =     177.28 ms /     3 tokens (   59.09 ms per token,    16.92 tokens per second)
      total time =    3397.38 ms /   527 tokens
srv  update_slots: all slots are idle
srv  log_server_r: request: POST /v1/chat/completions 127.0.0.1 200
srv  params_from_: Chat format: Content-only
slot launch_slot_: id  0 | task 4 | processing task
slot update_slots: id  0 | task 4 | new prompt, n_ctx_slot = 40960, n_keep = 0, n_prompt_tokens = 524
slot update_slots: id  0 | task 4 | n_past = 524, cache_tokens.size() = 526, seq_id = 0, pos_min = 0, n_swa = 4096
slot update_slots: id  0 | task 4 | forcing full prompt re-processing due to lack of cache data (likely due to SWA, see https://github.com/ggml-org/llama.cpp/pull/13194#issuecomment-2868343055)
slot update_slots: id  0 | task 4 | kv cache rm [0, end)
slot update_slots: id  0 | task 4 | prompt processing progress, n_past = 524, n_tokens = 524, progress = 1.000000
slot update_slots: id  0 | task 4 | prompt done, n_past = 524, n_tokens = 524
slot      release: id  0 | task 4 | stop processing: n_past = 527, truncated = 0
slot print_timing: id  0 | task 4 |
prompt eval time =    3233.55 ms /   524 tokens (    6.17 ms per token,   162.05 tokens per second)
       eval time =     239.97 ms /     4 tokens (   59.99 ms per token,    16.67 tokens per second)
      total time =    3473.52 ms /   528 tokens
srv  update_slots: all slots are idle
srv  log_server_r: request: POST /v1/chat/completions 127.0.0.1 200
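The PR comment linked in the warning suggests the server falls back to full re-processing whenever it cannot guarantee that the SWA cache still contains everything needed to continue from the cached prefix. A rough sketch of that kind of check using the values from the log above, assuming the condition is roughly pos_min > n_past - n_swa (an assumption for illustration, not the exact server code):

    # Illustration only; the real check lives in llama.cpp's server code and may differ.
    # Values are taken from the log above (task 4).
    n_past  = 524   # length of the matching cached prefix
    n_swa   = 4096  # sliding-window size reported for Command-A
    pos_min = 0     # oldest cached position for the sequence

    # Hypothetical SWA-safety check: reuse the cache only if every position older
    # than (n_past - n_swa) is still present; otherwise re-process the whole prompt.
    if pos_min > n_past - n_swa:
        print("forcing full prompt re-processing (cannot prove SWA cache completeness)")
    else:
        print("cached prefix can be reused")

With these values the check fires even though the entire 524-token prompt is still in the cache, which matches the full re-processing seen in the log.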
