kv-cache : fix SWA checks + disable cacheless iSWA #15811

ggerganov · 2025-09-05T05:21:01Z

Supporting iSWA models without constructing a KV cache needs more work: the existing llm_graph_input_attn_no_cache assumes a single KQ mask, whereas iSWA needs two masks, one for the SWA layers and one for the non-SWA layers. Cacheless iSWA is therefore disabled for now.
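To illustrate why a single KQ mask is not enough for iSWA, here is a minimal sketch that builds both masks for one batch. It assumes causal attention and a standard one-sided window; pos_t, kq_mask, iswa_masks, outside_window and build_masks are hypothetical names made up for this example, not llama.cpp identifiers.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-ins for illustration only; not the llama.cpp types.
using pos_t   = int32_t;
using kq_mask = std::vector<float>; // row-major [n x n]: 0.0f = attend, -INFINITY = masked

// Assumed standard window: key p0 is hidden from query p1 once it is
// n_swa or more positions behind it.
static bool outside_window(uint32_t n_swa, pos_t p0, pos_t p1) {
    return p1 - p0 >= (pos_t) n_swa;
}

struct iswa_masks {
    kq_mask full; // non-SWA layers: causal masking only
    kq_mask swa;  // SWA layers: causal masking plus the window
};

static iswa_masks build_masks(const std::vector<pos_t> & pos, uint32_t n_swa) {
    assert(n_swa > 0);
    const size_t n = pos.size();
    iswa_masks m { kq_mask(n*n, 0.0f), kq_mask(n*n, 0.0f) };
    for (size_t i = 0; i < n; ++i) {     // query index
        for (size_t j = 0; j < n; ++j) { // key index
            if (pos[j] > pos[i]) {
                // future token: masked for every layer
                m.full[i*n + j] = -INFINITY;
                m.swa [i*n + j] = -INFINITY;
            } else if (outside_window(n_swa, pos[j], pos[i])) {
                // too far in the past: masked for SWA layers only
                m.swa[i*n + j] = -INFINITY;
            }
        }
    }
    return m;
}
```

With positions 0..3 and n_swa = 2, the query at position 3 still sees position 0 through the full mask but not through the SWA mask, so the two kinds of layers cannot share one mask.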

Also fix a regression for iSWA models introduced in #15798. When masking the attention, hparams.swa_type must be applied only to the SWA layers, not to all layers. This was previously handled by the KV cache, which is why it keeps its own swa_type, distinct from the one in hparams.

ggml-ci

ggerganov · 2025-09-05T05:24:14Z

src/llama-hparams.h

+    // note that this function uses different SWA parameters from those in the hparams
+    // TODO: think of a better place for this function
+    // TODO: pack the SWA params in a struct?
+    static bool is_masked_swa(uint32_t n_swa, llama_swa_type swa_type, llama_pos p0, llama_pos p1);


Changed this to a static function.

Maybe it should become a member like this:

Suggested change

-    static bool is_masked_swa(uint32_t n_swa, llama_swa_type swa_type, llama_pos p0, llama_pos p1);
+    bool is_masked_swa(uint32_t il, llama_pos p0, llama_pos p1) const;

But let's refactor this after master has stabilized.
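A rough sketch of how the member form could sit on top of the static one, resolving the effective SWA type per layer before delegating. This is only an illustration under simplifying assumptions: hparams_sketch, swa_type_t and the swa_layers vector are hypothetical stand-ins, and only a standard one-sided window is modeled.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical, simplified stand-ins; the names echo the PR but this is not the llama.cpp code.
using llama_pos = int32_t;

enum swa_type_t { SWA_NONE, SWA_STANDARD };

struct hparams_sketch {
    uint32_t          n_swa    = 0;
    swa_type_t        swa_type = SWA_NONE;
    std::vector<bool> swa_layers; // assumed layout: true for layers that use the sliding window

    // Static form: the caller supplies the SWA parameters explicitly.
    static bool is_masked_swa(uint32_t n_swa, swa_type_t type, llama_pos p0, llama_pos p1) {
        assert(p0 >= 0 && p1 >= 0);
        switch (type) {
            case SWA_NONE:     return false;
            case SWA_STANDARD: return p1 - p0 >= (llama_pos) n_swa;
        }
        return false;
    }

    // Member form: pick the effective type for layer il, then delegate.
    // Non-SWA layers never apply the window, whatever swa_type says.
    bool is_masked_swa(uint32_t il, llama_pos p0, llama_pos p1) const {
        const swa_type_t type = swa_layers.at(il) ? swa_type : SWA_NONE;
        return is_masked_swa(n_swa, type, p0, p1);
    }
};
```

The point of the member form is that a caller can no longer apply hparams.swa_type to a non-SWA layer by accident, which is the kind of mistake behind the regression described above.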

ggerganov · 2025-09-05T07:39:18Z

Merging to fix regular SWA models such as gpt-oss. We can improve EmbeddingGemma support from master.


kv-cache : fix SWA checks + disable cacheless iSWA

43b78f1

ggml-ci

ggerganov mentioned this pull request Sep 5, 2025

Eval bug: gpt-oss incoherent output #15808

Closed

ggerganov requested a review from danbev September 5, 2025 05:24

danbev approved these changes Sep 5, 2025

ggerganov merged commit c610b6c into master Sep 5, 2025
55 checks passed

ggerganov deleted the gg/kv-cache-fix-swa branch September 5, 2025 07:39

walidbr pushed a commit to walidbr/llama.cpp that referenced this pull request Sep 7, 2025

kv-cache : fix SWA checks + disable cacheless iSWA (ggml-org#15811)

b883c5e

ggml-ci

Nexesenex added a commit to Nexesenex/croco.cpp that referenced this pull request Oct 26, 2025

Revert "kv-cache : fix SWA checks + disable cacheless iSWA (ggml-org#15811)"

1c70d78

This reverts commit c610b6c.
