
# llama : use n_swa + n_ubatch cells for SWA cache #13833


Merged: 2 commits into master on May 31, 2025

## Conversation

**ggerganov** (Member) commented on May 27, 2025:

Target: #13845

### Overview

- SWA cache now uses less memory (see the sizing sketch after this list)
- Enable SWA speculative decoding
- Allow short SWA rollbacks (avoids cache recalculations caused by whitespace truncation of the last response)
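
A minimal sketch of the sizing idea behind the title, with hypothetical names (the PR's actual cache setup may also account for padding and multiple sequences): a sliding-window-attention layer only attends to the last `n_swa` positions, so its KV cache does not need a full `n_ctx` cells.

```cpp
#include <algorithm>
#include <cstdint>
#include <cstdio>

// Sketch (hypothetical helper, not the library's actual API): an SWA layer's
// cache needs n_swa cells for the attention window plus n_ubatch cells for
// the tokens currently being decoded.
uint32_t swa_cache_size(uint32_t n_ctx, uint32_t n_swa, uint32_t n_ubatch) {
    // Never allocate more than a full-context cache would need.
    return std::min(n_ctx, n_swa + n_ubatch);
}

int main() {
    // e.g. a 1024-token window with 512-token micro-batches at n_ctx = 8192:
    // 1536 cells per SWA layer instead of 8192.
    printf("%u\n", swa_cache_size(8192, 1024, 512));
}
```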
```sh
./scripts/compare-commits.sh master gg/swa-optimize -m models/gemma-3-4b/ggml-model-q8_0.gguf -d 8192 -p 0 -b 512,1024,2048,4096,8192 -n 32 -fa 0,1 -t 1
```
| Model | Batch size | FA | Test | t/s master | t/s gg/swa-optimize | Speedup |
| --- | --- | --- | --- | --- | --- | --- |
| gemma3 4B Q8_0 | 512 | No | tg32@d8192 | 75.63 | 75.83 | 1.00 |
| gemma3 4B Q8_0 | 512 | Yes | tg32@d8192 | 75.12 | 75.13 | 1.00 |
| gemma3 4B Q8_0 | 1024 | No | tg32@d8192 | 73.68 | 76.03 | 1.03 |
| gemma3 4B Q8_0 | 1024 | Yes | tg32@d8192 | 74.94 | 75.15 | 1.00 |
| gemma3 4B Q8_0 | 2048 | No | tg32@d8192 | 69.73 | 75.92 | 1.09 |
| gemma3 4B Q8_0 | 2048 | Yes | tg32@d8192 | 74.62 | 75.14 | 1.01 |
| gemma3 4B Q8_0 | 4096 | No | tg32@d8192 | 62.43 | 75.92 | 1.22 |
| gemma3 4B Q8_0 | 4096 | Yes | tg32@d8192 | 74.39 | 75.15 | 1.01 |
| gemma3 4B Q8_0 | 8192 | No | tg32@d8192 | 54.37 | 76.00 | 1.40 |
| gemma3 4B Q8_0 | 8192 | Yes | tg32@d8192 | 73.25 | 75.11 | 1.03 |

ggerganov changed the title from "llama : use n_swa + n_ubatch cells for SWA cache + auto-batch" to "llama : use n_swa + n_ubatch cells for SWA cache" on May 28, 2025
ggerganov changed the base branch from gg/kv-cache-simplify-part3 to gg/auto-batch on May 28, 2025 at 07:52
**aviallon** (Contributor) commented on May 28, 2025:

I'll try testing.
Edit: I got distracted and forgot to test it. Oops.

ggerganov marked this pull request as ready for review on May 30, 2025 at 14:36
ggerganov requested a review from ngxson as a code owner on May 30, 2025 at 14:36
A review thread on this diff excerpt:

```cpp
const auto pos_min = llama_kv_self_seq_pos_min(ctx, slot.id);
if (pos_min > 0) {
    SLT_WRN(slot, "n_past = %d, cache_tokens.size() = %d, seq_id = %d, pos_min = %d\n",
            slot.n_past, (int) slot.cache_tokens.size(), slot.id, pos_min);
    if (pos_min == -1 || pos_min > slot.n_past - n_swa) {
        // ...
```
**Collaborator** commented:

`pos_min == -1` means the seq is empty. In this case, I think the behavior of setting `n_past = 0` is expected, so we don't necessarily need to log the warning.

**ggerganov** (Member, Author) replied:

If the sequence is not present in the KV cache (i.e. `pos_min == -1`), but we somehow decided that `slot.n_past > 0` (see the condition above), then this is still unexpected. I think we might even want to abort in such cases, because it means there is a bug somewhere.
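
In other words, a slot's cached prefix is only reusable while the SWA cache still holds every position down to `slot.n_past - n_swa`. A minimal sketch of that validity check, with a hypothetical helper name (not the PR's exact code):

```cpp
#include <cstdio>
#include <cstdlib>

// Sketch (hypothetical helper, not the PR's exact code): decide how much of
// a slot's cached prefix is still usable after SWA pruning.
// pos_min is the smallest position still present in the cache for this
// sequence, or -1 if the sequence is absent.
int reusable_n_past(int n_past, int pos_min, int n_swa) {
    if (pos_min == -1) {
        if (n_past > 0) {
            // The cache says the sequence is empty, yet we believed we had
            // n_past cached tokens: inconsistent bookkeeping, likely a bug.
            fprintf(stderr, "bug: n_past = %d but sequence is empty\n", n_past);
            abort();
        }
        return 0;
    }
    if (pos_min > n_past - n_swa) {
        // Positions needed to resume at n_past were already pruned out of
        // the window: fall back to reprocessing the whole prompt.
        return 0;
    }
    return n_past; // short rollbacks within the window stay cheap
}
```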

Base automatically changed from gg/auto-batch to master on May 31, 2025 at 09:56
ggerganov merged commit 3600cc2 into master on May 31, 2025 (46 checks passed)
ggerganov deleted the gg/swa-optimize branch on May 31, 2025 at 12:57