
Conversation

@taylorchu commented Nov 29, 2025

The SWA pattern key was originally introduced in #14400.

This PR changes the behavior so that llama.cpp reads the sliding window attention pattern directly from the GGUF file (e.g. the attention.sliding_window_pattern metadata, when present).

This brings the following benefits:

  • Allows newer models to ship their correct SWA pattern without requiring a code change in llama.cpp
  • Keeps backward compatibility with existing models
  • Still permits specific model overrides in code (as before) for cases where the metadata is absent or needs to be corrected

The loading precedence (highest priority first) is:

  • Model-specific hard-coded override (if any)
  • Value from GGUF metadata
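
For illustration, the precedence could look roughly like this in the hparams loader (a sketch, not the exact diff: LLM_KV_ATTENTION_SLIDING_WINDOW_PATTERN is assumed to be the enum behind the sliding_window_pattern key, and get_key_or_arr is assumed to accept the per-layer bool array):

    // read the window size; if the model has one, default to the standard SWA mask
    ml.get_key(LLM_KV_ATTENTION_SLIDING_WINDOW, hparams.n_swa, false);
    if (hparams.n_swa > 0) {
        hparams.swa_type = LLAMA_SWA_TYPE_STANDARD;
        // optional per-layer pattern from GGUF: true = sliding window, false = full attention
        ml.get_key_or_arr(LLM_KV_ATTENTION_SLIDING_WINDOW_PATTERN, hparams.swa_layers, hparams.n_layer, false);
    }
    // a model-specific case can still hard-code the pattern afterwards and take precedence, e.g.:
    //     hparams.set_swa_pattern(6); // every 6th layer uses full attention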

@taylorchu (Author) commented Nov 29, 2025

The tests pass locally.

@CISC (Collaborator) commented Nov 29, 2025

I'm not sure why the sliding_window_pattern metadata was introduced in that PR (@ngxson?) and never used, but it was most likely a mistake; the same thing can be achieved by adding hparams.n_swa_pattern as in #15641, which can be passed to set_swa_pattern.

@ngxson (Collaborator) commented Nov 29, 2025

@CISC It was initially made to reflect the layer_types in the config: https://huggingface.co/google/gemma-3n-E2B-it/blob/main/config.json#L140

IIRC it was a last-minute change because the ollama engine requires this KV (so that users can freely use the same GGUF on either ollama or llama.cpp).

Around that time, another model was also using this metadata, but it never made it to the public, and neither did the code to handle it.

So honestly I don't have a strong opinion on whether we should merge this PR or not. It seems like adding it doesn't break anything anyway (because for most models the pattern will be overwritten by set_swa_pattern), so it's probably fine to merge and wait until some models actually use it.
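
For context, set_swa_pattern derives the per-layer swa_layers flags from a single period; a simplified sketch of its semantics (not the verbatim implementation) looks like this:

    // with a period of n_pattern, every n_pattern-th layer uses full attention
    // and the remaining layers use the sliding window; this overwrites whatever
    // was read from GGUF metadata earlier
    void llama_hparams::set_swa_pattern(uint32_t n_pattern) {
        for (uint32_t il = 0; il < n_layer; ++il) {
            swa_layers[il] = n_pattern == 0 || ((il + 1) % n_pattern != 0);
        }
    }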

@taylorchu (Author)

I was deciding between:

  1. set_swa_pattern with that metadata
  2. just passing the raw swa_types

I chose option 2 because swa_types already exists, and we don't have to specify full-attention-first or full-attention-last as a separate param.

@ngxson (Collaborator) commented Nov 29, 2025

Realistically, no model to this day uses more than one swa_type (excluding the NONE value), so swa_types is not a good option.

@taylorchu (Author)

Yeah, the type is bool (sliding attention or full attention). Only LLM_ARCH_LLAMA4 uses chunked attention. It seems the future trend is linear attention rather than SWA, so I am going to keep bool as is instead of an enum.


ml.get_key(LLM_KV_ATTENTION_SLIDING_WINDOW, hparams.n_swa, false);
if (hparams.n_swa > 0) {
    hparams.swa_type = LLAMA_SWA_TYPE_STANDARD;

Collaborator

I don't think we can assume anything about swa_type here. Please remove this line.

Author

This line is required; otherwise SWA won't activate. A similar structure is used below.

Collaborator

I don't get it. How does SWA not get activated?

swa_type should be set per-model. Please remove this line.

Author

        case LLAMA_SWA_TYPE_STANDARD:
            {
                if (p1 - p0 >= (int32_t) n_swa) {
                    return true;
                }
            } break;

SWA requires specific masking to work; without it, the model will generate gibberish. The default semantic is "standard". I think specific models can set it to chunked or symmetric.
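
To spell out the dependency: the mask helper only restricts attention when swa_type is something other than NONE, so leaving it at the default disables the window entirely. A condensed sketch of that switch (the signature and the CHUNKED arithmetic are approximations, not the exact llama.cpp code):

    static bool is_masked_swa(llama_swa_type swa_type, uint32_t n_swa, int32_t p0, int32_t p1) {
        switch (swa_type) {
            case LLAMA_SWA_TYPE_NONE:
                break; // nothing is ever masked -> plain full attention
            case LLAMA_SWA_TYPE_STANDARD:
                if (p1 - p0 >= (int32_t) n_swa) {
                    return true; // p0 falls outside the window that ends at p1
                }
                break;
            case LLAMA_SWA_TYPE_CHUNKED:
                if (p0 / (int32_t) n_swa != p1 / (int32_t) n_swa) {
                    return true; // tokens in different chunks cannot attend to each other
                }
                break;
            default:
                break;
        }
        return false;
    }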

Comment on lines 562 to 564
} else {
    hparams.swa_type = LLAMA_SWA_TYPE_NONE;
}

Collaborator

Remove this branch too; it's redundant.

Author

I can remove this because swa_type is LLAMA_SWA_TYPE_NONE by default, but it seems LLAMA_SWA_TYPE_NONE is also set explicitly below. I added it just in case.

@ngxson (Collaborator) commented Nov 29, 2025

Can you also test this PR with one of the models using SWA (for example, gemma3n) to confirm that it does not break anything?

@ngxson (Collaborator) commented Nov 29, 2025

Hmm, on second thought, I think swa_layers should only be read by models that use it, the same as the existing n_swa logic.

I don't quite like this change because it reads everything at the global level. This affects all models and can potentially break things, because we cannot test every GGUF in the wild.

So unless you can argue otherwise, I prefer closing this PR and only implementing this when a new model actually needs it.
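
The per-model variant described here would keep the read inside the architecture switch, roughly like this (a sketch using LLM_ARCH_QWEN3 as the example arch; the pattern KV enum name is assumed):

    case LLM_ARCH_QWEN3:
        {
            // only SWA-capable architectures read the window size and the pattern,
            // mirroring how n_swa is handled today
            ml.get_key(LLM_KV_ATTENTION_SLIDING_WINDOW, hparams.n_swa, false);
            if (hparams.n_swa > 0) {
                hparams.swa_type = LLAMA_SWA_TYPE_STANDARD;
                ml.get_key_or_arr(LLM_KV_ATTENTION_SLIDING_WINDOW_PATTERN, hparams.swa_layers, hparams.n_layer, false);
            }
            // ... remaining Qwen3 hparams
        } break;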

@taylorchu (Author)

I tested with gemma3n: https://huggingface.co/unsloth/gemma-3n-E2B-it-GGUF/blob/main/gemma-3n-E2B-it-Q4_0.gguf

llama_model_loader: - kv  23:           gemma3n.attention.sliding_window u32              = 512
llama_model_loader: - kv  24:            gemma3n.attention.head_count_kv u32              = 2
llama_model_loader: - kv  25:                   gemma3n.altup.active_idx u32              = 0
llama_model_loader: - kv  26:                   gemma3n.altup.num_inputs u32              = 4
llama_model_loader: - kv  27:   gemma3n.embedding_length_per_layer_input u32              = 256
llama_model_loader: - kv  28:         gemma3n.attention.shared_kv_layers u32              = 10
llama_model_loader: - kv  29:          gemma3n.activation_sparsity_scale arr[f32,30]      = [1.644854, 1.644854, 1.644854, 1.6448...
llama_model_loader: - kv  30:   gemma3n.attention.sliding_window_pattern arr[bool,30]     = [true, true, true, true, false, true,...
...
load_tensors: loading model tensors, this can take a while... (mmap = true)
load_tensors: layer   0 assigned to device Metal, is_swa = 1
load_tensors: layer   1 assigned to device Metal, is_swa = 1
load_tensors: layer   2 assigned to device Metal, is_swa = 1
load_tensors: layer   3 assigned to device Metal, is_swa = 1
load_tensors: layer   4 assigned to device Metal, is_swa = 0
load_tensors: layer   5 assigned to device Metal, is_swa = 1
load_tensors: layer   6 assigned to device Metal, is_swa = 1
load_tensors: layer   7 assigned to device Metal, is_swa = 1
load_tensors: layer   8 assigned to device Metal, is_swa = 1
load_tensors: layer   9 assigned to device Metal, is_swa = 0
load_tensors: layer  10 assigned to device Metal, is_swa = 1
load_tensors: layer  11 assigned to device Metal, is_swa = 1
load_tensors: layer  12 assigned to device Metal, is_swa = 1
load_tensors: layer  13 assigned to device Metal, is_swa = 1
load_tensors: layer  14 assigned to device Metal, is_swa = 0
load_tensors: layer  15 assigned to device Metal, is_swa = 1
load_tensors: layer  16 assigned to device Metal, is_swa = 1
load_tensors: layer  17 assigned to device Metal, is_swa = 1
load_tensors: layer  18 assigned to device Metal, is_swa = 1
load_tensors: layer  19 assigned to device Metal, is_swa = 0
load_tensors: layer  20 assigned to device Metal, is_swa = 1
load_tensors: layer  21 assigned to device Metal, is_swa = 1
load_tensors: layer  22 assigned to device Metal, is_swa = 1
load_tensors: layer  23 assigned to device Metal, is_swa = 1
load_tensors: layer  24 assigned to device Metal, is_swa = 0
load_tensors: layer  25 assigned to device Metal, is_swa = 1
load_tensors: layer  26 assigned to device Metal, is_swa = 1
load_tensors: layer  27 assigned to device Metal, is_swa = 1
load_tensors: layer  28 assigned to device Metal, is_swa = 1
load_tensors: layer  29 assigned to device Metal, is_swa = 0
load_tensors: layer  30 assigned to device Metal, is_swa = 0

@taylorchu (Author) commented Nov 29, 2025

This should be low-risk because we only read swa_pattern from the metadata if n_swa > 0. If the swa_pattern metadata is not present, each layer defaults to false (full attention).

@taylorchu (Author)

@ngxson I technically only need Qwen3. I updated the PR to address your concern.

taylorchu changed the title from "Load Sliding Window Attention (SWA) pattern from GGUF metadata" to "Load Sliding Window Attention (SWA) pattern from GGUF metadata (qwen3)" on Nov 30, 2025
github-actions bot added the model (Model specific) label on Nov 30, 2025
@ngxson (Collaborator) commented Nov 30, 2025

Why? Qwen3 doesn't need it. IMO you shouldn't just push something only because you need it.

@CISC (Collaborator) commented Nov 30, 2025

> @ngxson I technically only need Qwen3. I updated the PR to address your concern.

Are you working on a Qwen3-based model with SWA?

@taylorchu (Author)

> @ngxson I technically only need Qwen3. I updated the PR to address your concern.

> Are you working on a Qwen3-based model with SWA?

Yes

@ngxson (Collaborator) commented Nov 30, 2025

If that's the case, you should have mentioned it explicitly in the description. Otherwise we have no idea what this is used for.

Just note that we generally don't accept these kinds of changes unless the model is public, or about to be public. We cannot verify that this actually works if we don't have access to the model.

@taylorchu (Author) commented Nov 30, 2025

Technically, Qwen > 1.5 should already be tested with SWA: QwenLM/Qwen3#168. This is also included in Hugging Face.
Btw, I hope this is supported in all Qwen > 1.5 models here, but I only tested with Qwen3. If someone needs it in older versions, they can add it later.

@ngxson (Collaborator) commented Nov 30, 2025

> This is also included in Hugging Face.

Can you specify exactly which model? Qwen3 does not have such a setting in its config.json. Honestly, it would be way easier for us if you just linked to the exact model and the config.json that has it.

Also, just want to note that you also need to handle this in the conversion script for the Qwen models.

@ngxson (Collaborator) commented Nov 30, 2025

> Technically, Qwen > 1.5 should already be tested with SWA: QwenLM/Qwen3#168

The mentioned discussion is from 2024 and it was about Qwen2; they reused the same repo and just renamed it. There is no explicit confirmation of whether Qwen3 supports SWA or not.

Honestly, the motivation of this PR is unclear and there is nothing to test it with.

@taylorchu (Author)

https://huggingface.co/docs/transformers/main/en/model_doc/qwen3#transformers.Qwen3Config.max_window_layers
https://huggingface.co/docs/transformers/main/en/model_doc/qwen2#transformers.Qwen2Config.max_window_layers

The goal of the PR is stated in the description; the suggestion made later is what makes it confusing. I wanted to support all models. You claimed there was a risk it would break other models, so I reduced the scope to the Qwen3 model only. You asked me to test with gemma3n, and I did. Now you are asking for a specific Qwen3 model, when there is clear Hugging Face support for SWA but the Qwen team has not released a model using it. We have been arguing over non-technical viewpoints, and I find it unproductive. In the end, this PR is further away from being merged after all the work.

Let's take a step back: what are the reasonable steps to support SWA in general, and what do you think should be included from here?

@ngxson (Collaborator) commented Nov 30, 2025

@taylorchu I got frustrated because you only mentioned Qwen and max_window_layers when I asked about it. Had you mentioned it from the beginning, it would have saved me time and made everything more productive.

> I wanted to support all models.

And no, we cannot "automatically" support SWA for all models. You already discovered the reason: each model graph has to manually make a call to build_attn_inp_kv_iswa, and not all classes do this. That's why the SWA config is loaded per-model, not globally as you assumed initially.
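
For readers following along, the constraint is roughly this (a sketch: only build_attn_inp_kv_iswa is named in the discussion; is_swa_any and build_attn_inp_kv are assumed names for the "any layer uses SWA" check and the plain attention input):

    // inside a model's graph builder: the SWA-aware KV input has to be requested
    // explicitly, so a swa_layers pattern loaded from metadata does nothing for a
    // model class whose graph never takes this branch
    auto * inp_attn = hparams.is_swa_any()
        ? build_attn_inp_kv_iswa()  // separate SWA / non-SWA KV cache handling
        : build_attn_inp_kv();      // single unified KV cache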

And obviously the changes you included here won't work without handling them in the GGUF conversion script; that's why I asked for the Qwen config. Nowhere in the discussion did I bring up anything non-technical.

You also can't assume that Qwen will eventually release a model supporting SWA; they have not explicitly confirmed that. If they never release one (or release one with a different implementation), then this will effectively become dead code.

So no, it is not good practice to implement features that we cannot test (in other words, a Qwen model with SWA does not exist at this moment).

ngxson closed this Nov 30, 2025
@taylorchu (Author)

First of all, thanks for reviewing anyway.

By the non-technical viewpoint, I mean "we only support features that a public, already-released model actually uses; support is never added preemptively just because Hugging Face Transformers code or a config field exists." That is a fine policy if you clearly state it in the README.

A test model can be trained with Hugging Face on a small dataset and verified afterwards.

None of what you said is a true technical blocker, not even build_attn_inp_kv_iswa.
