
# llama : use n_swa + n_ubatch cells for SWA cache #13833


Merged: 2 commits into master on May 31, 2025

## Conversation

**ggerganov** (Member) commented on May 27, 2025:

Target: #13845

### Overview

- SWA cache now uses less memory (see the sizing sketch after this list)
- Enable SWA speculative decoding
- Allow short SWA rollbacks (avoids cache recalculations caused by whitespace truncation of the last response)
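
A minimal sketch of the sizing idea behind the title, with hypothetical names (the PR's actual cache setup may also account for padding and multiple sequences): a sliding-window-attention layer only attends to the last `n_swa` positions, so its KV cache does not need a full `n_ctx` cells.

```cpp
#include <algorithm>
#include <cstdint>
#include <cstdio>

// Sketch (hypothetical helper, not the library's actual API): an SWA layer's
// cache needs n_swa cells for the attention window plus n_ubatch cells for
// the tokens currently being decoded.
uint32_t swa_cache_size(uint32_t n_ctx, uint32_t n_swa, uint32_t n_ubatch) {
    // Never allocate more than a full-context cache would need.
    return std::min(n_ctx, n_swa + n_ubatch);
}

int main() {
    // e.g. a 1024-token window with 512-token micro-batches at n_ctx = 8192:
    // 1536 cells per SWA layer instead of 8192.
    printf("%u\n", swa_cache_size(8192, 1024, 512));
}
```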
```sh
./scripts/compare-commits.sh master gg/swa-optimize -m models/gemma-3-4b/ggml-model-q8_0.gguf -d 8192 -p 0 -b 512,1024,2048,4096,8192 -n 32 -fa 0,1 -t 1
```
| Model | Batch size | FA | Test | t/s master | t/s gg/swa-optimize | Speedup |
| --- | --- | --- | --- | --- | --- | --- |
| gemma3 4B Q8_0 | 512 | No | tg32@d8192 | 75.63 | 75.83 | 1.00 |
| gemma3 4B Q8_0 | 512 | Yes | tg32@d8192 | 75.12 | 75.13 | 1.00 |
| gemma3 4B Q8_0 | 1024 | No | tg32@d8192 | 73.68 | 76.03 | 1.03 |
| gemma3 4B Q8_0 | 1024 | Yes | tg32@d8192 | 74.94 | 75.15 | 1.00 |
| gemma3 4B Q8_0 | 2048 | No | tg32@d8192 | 69.73 | 75.92 | 1.09 |
| gemma3 4B Q8_0 | 2048 | Yes | tg32@d8192 | 74.62 | 75.14 | 1.01 |
| gemma3 4B Q8_0 | 4096 | No | tg32@d8192 | 62.43 | 75.92 | 1.22 |
| gemma3 4B Q8_0 | 4096 | Yes | tg32@d8192 | 74.39 | 75.15 | 1.01 |
| gemma3 4B Q8_0 | 8192 | No | tg32@d8192 | 54.37 | 76.00 | 1.40 |
| gemma3 4B Q8_0 | 8192 | Yes | tg32@d8192 | 73.25 | 75.11 | 1.03 |

ggerganov changed the title from "llama : use n_swa + n_ubatch cells for SWA cache + auto-batch" to "llama : use n_swa + n_ubatch cells for SWA cache" on May 28, 2025
ggerganov changed the base branch from gg/kv-cache-simplify-part3 to gg/auto-batch on May 28, 2025 at 07:52
**aviallon** (Contributor) commented on May 28, 2025:

I'll try testing.
Edit: I got distracted and forgot to test it. Oops.

ggerganov marked this pull request as ready for review on May 30, 2025 at 14:36
ggerganov requested a review from ngxson as a code owner on May 30, 2025 at 14:36
A review thread on this diff excerpt:

```cpp
const auto pos_min = llama_kv_self_seq_pos_min(ctx, slot.id);
if (pos_min > 0) {
    SLT_WRN(slot, "n_past = %d, cache_tokens.size() = %d, seq_id = %d, pos_min = %d\n",
            slot.n_past, (int) slot.cache_tokens.size(), slot.id, pos_min);
    if (pos_min == -1 || pos_min > slot.n_past - n_swa) {
        // ...
```
**Collaborator** commented:

`pos_min == -1` means the seq is empty. In this case, I think the behavior of setting `n_past = 0` is expected, so we don't necessarily need to log the warning.

**ggerganov** (Member, Author) replied:

If the sequence is not present in the KV cache (i.e. `pos_min == -1`), but we somehow decided that `slot.n_past > 0` (see the condition above), then this is still unexpected. I think we might even want to abort in such cases, because it means there is a bug somewhere.
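
In other words, a slot's cached prefix is only reusable while the SWA cache still holds every position down to `slot.n_past - n_swa`. A minimal sketch of that validity check, with a hypothetical helper name (not the PR's exact code):

```cpp
#include <cstdio>
#include <cstdlib>

// Sketch (hypothetical helper, not the PR's exact code): decide how much of
// a slot's cached prefix is still usable after SWA pruning.
// pos_min is the smallest position still present in the cache for this
// sequence, or -1 if the sequence is absent.
int reusable_n_past(int n_past, int pos_min, int n_swa) {
    if (pos_min == -1) {
        if (n_past > 0) {
            // The cache says the sequence is empty, yet we believed we had
            // n_past cached tokens: inconsistent bookkeeping, likely a bug.
            fprintf(stderr, "bug: n_past = %d but sequence is empty\n", n_past);
            abort();
        }
        return 0;
    }
    if (pos_min > n_past - n_swa) {
        // Positions needed to resume at n_past were already pruned out of
        // the window: fall back to reprocessing the whole prompt.
        return 0;
    }
    return n_past; // short rollbacks within the window stay cheap
}
```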

Base automatically changed from gg/auto-batch to master on May 31, 2025 at 09:56
ggerganov merged commit 3600cc2 into master on May 31, 2025 (46 checks passed)
ggerganov deleted the gg/swa-optimize branch on May 31, 2025 at 12:57