Add IndexCache support for GLM5 DSA #45424

Merged
vasqu merged 6 commits into huggingface:main from louzongzhi:glm5
Apr 22, 2026
Conversation

@louzongzhi
Contributor

What does this PR do?

This PR implements IndexCache support for GLM5's DeepSeek Sparse Attention (DSA), enabling cross-layer index reuse to accelerate long-context inference.

IndexCache accelerates sparse attention by reusing top-k token indices across consecutive layers, removing ~75% of redundant indexer computations while maintaining accuracy.

Key implementation details:

  1. Configuration options: Added index_topk_freq, index_topk_pattern, and is_nextn to GlmMoeDsaConfig for flexible layer scheduling (Full/Shared pattern)
  2. Attention layer: Implemented skip_topk/next_skip_topk logic in GlmMoeDsaAttention to determine whether to compute new indices or reuse previous layer's indices
  3. State passing: Added prev_topk_indices parameter propagation through GlmMoeDsaDecoderLayer and GlmMoeDsaModel for cross-layer index sharing
  4. Compatibility: Maintains backward compatibility—returns standard 2-tuple when IndexCache disabled, 3-tuple with topk_indices when enabled
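The scheduling idea behind points 1–2 can be sketched as follows. This is a simplified, hypothetical illustration of the cross-layer reuse (the function names and string labels are mine, not the actual modeling code); layers scheduled as "full" run the indexer, while "shared" layers reuse the most recent full layer's indices:

```python
# Sketch of the cross-layer index-reuse schedule described above.
# Hypothetical simplification; the real GlmMoeDsaAttention code differs.

def build_schedule(num_layers: int, index_topk_freq: int) -> list[str]:
    """Layers that recompute top-k indices are 'full'; the rest reuse
    the nearest preceding full layer's indices ('shared')."""
    return [
        "full" if max(i - 1, 0) % index_topk_freq == 0 else "shared"
        for i in range(num_layers)
    ]

def run_layers(schedule: list[str]) -> list[str]:
    """Simulate which indexer result each layer ends up using."""
    used, prev_topk_indices = [], None
    for layer_idx, kind in enumerate(schedule):
        if kind == "full" or prev_topk_indices is None:
            prev_topk_indices = f"indices@{layer_idx}"  # run the indexer
        used.append(prev_topk_indices)  # shared layers reuse the cached indices
    return used

schedule = build_schedule(8, index_topk_freq=2)
```

With larger `index_topk_freq` values, proportionally more layers become "shared", which is where the reported ~75% reduction in indexer computations comes from.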

Performance impact:

  • 1.82× prefill speedup on 200K context (indexer time reduced by 75%)
  • Compatible with existing KV cache and beam search

Reference: https://github.com/THUDM/IndexCache

Code Agent Policy

  • I confirm that this is not a pure code agent PR.

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline, Pull Request section?
  • Was this discussed/approved via a Github issue or the forum? Please add a link to it if that's the case.
  • Did you make sure to update the documentation with your changes?
  • Did you write any new necessary tests?

Who can review?

@ArthurZucker @Cyrilvallez @vasqu

Collaborator

@ArthurZucker left a comment

Sounds good!

Comment on lines +362 to +378
self.is_nextn = config.is_nextn
if self.is_nextn:
    self.skip_topk = False
    self.next_skip_topk = False
else:
    self.index_topk_freq = config.index_topk_freq
    self.index_topk_pattern = config.index_topk_pattern
    if self.index_topk_pattern is None:
        self.skip_topk = max(layer_idx - 1, 0) % self.index_topk_freq != 0
        self.next_skip_topk = layer_idx % self.index_topk_freq != 0
    else:
        self.skip_topk = self.index_topk_pattern[layer_idx] == "S"
        if layer_idx < len(self.index_topk_pattern) - 1:
            self.next_skip_topk = self.index_topk_pattern[layer_idx + 1] == "S"
        else:
            self.next_skip_topk = False

Collaborator

All of this should never happen here. You should just be doing self.is_nextn = config.is_next_n[layer_idx].
This makes it explicit which layers are skipping topK, and which are not!

Contributor Author

Agreed! I'll refactor to use is_next_n: List[bool] in Config (similar to mlp_type) instead of the complex derivation logic in __init__. Much cleaner.

Comment on lines +142 to +143
index_topk_freq: int = 1
index_topk_pattern: str | None = None
Collaborator

see my comment!

Contributor Author

I kept index_topk_freq and index_topk_pattern to align with the IndexCache paper terminology (Shared vs Full patterns). However, as per your first suggestion, these will only be used in the Config to construct the is_next_n list—there won't be any derivation logic in the layer __init__. The layer will simply read config.is_next_n[layer_idx].

Collaborator

yep, it's much simpler, explicit, and aligned with what we try to have!

Comment on lines 473 to 480
if self.next_skip_topk is None:
    return attn_output, attn_weights
else:
    if self.next_skip_topk:
        return attn_output, attn_weights, topk_indices
    else:
        return attn_output, attn_weights, None

Collaborator

let's always return topk maybe? let's simplify our life

Contributor Author

My concern is that the original implementation only returned 2 values, so forcing a 3-tuple return might break backward compatibility for existing code that expects (output, weights). However, I agree that a consistent API is cleaner, so I'll refactor to always return 3 values and handle the compatibility aspect properly.
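The always-3-tuple convention being discussed could look roughly like this. A minimal sketch with hypothetical names (`attention_forward`, `compute_indices`), not the actual GlmMoeDsaAttention signature:

```python
# Sketch of an always-3-tuple return: the third slot carries the top-k
# indices when this layer computed them, else None. Hypothetical names.
from typing import Any

def attention_forward(compute_indices: bool) -> tuple[Any, Any, Any]:
    attn_output, attn_weights = "out", "weights"
    # Always return a 3-tuple; None signals there is nothing to cache.
    topk_indices = [3, 1, 4] if compute_indices else None
    return attn_output, attn_weights, topk_indices

# Callers that only need the classic pair unpack and drop the third slot:
out, weights, _ = attention_forward(compute_indices=False)
```

The cost is that every caller must unpack three values, which is the backward-compatibility concern raised above.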

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

Collaborator

@ArthurZucker left a comment

cool that looks simpler!

Comment on lines +142 to +145
    self.index_topk_pattern = "F" * self.num_hidden_layers
else:
    self.index_topk_pattern = "".join(
        "F" if (i == 0 or (i - 1) % self.index_topk_freq == 0) else "S"
Collaborator

IDK what F stands for! + typing should reflect that pattern should be a tuple, no? You can default it to `("skip", "meaningful name", ...)`

Contributor Author

"F" stands for Full (layers that run the indexer independently to compute top-k indices) and "S" stands for Shared (layers that reuse cached indices from the nearest preceding Full layer). We keep this as a string pattern (e.g., "FFSF...") rather than a tuple to maintain consistency with the IndexCache paper (GLM-5) and the existing implementations in SGLang and vLLM.

Contributor Author

IDK what F stands for! + typing should reflect that pattern should be a tuple, no? You can default it to `("skip", "meaningful name", ...)`

Done! Changed from string pattern "FSFS..." to list format ["full", "shared", ...] with explicit naming. The generation logic now uses max(i - 1, 0) % freq to match the official IndexCache implementation exactly, and the type annotation is updated to list[str]. This should be much clearer while maintaining consistency with the reference implementation.

@louzongzhi force-pushed the glm5 branch 7 times, most recently from b43c161 to b246219 on April 15, 2026 13:28
@louzongzhi requested a review from ArthurZucker April 15, 2026 13:29
@louzongzhi
Contributor Author

Moves index_topk_pattern generation from Attention.__init__ to
Config.__post_init__ as suggested. Layers now simply check
`config.index_topk_pattern[layer_idx]` instead of computing
skip conditions, matching the mlp_layer_types pattern for
consistent explicit configuration.
Collaborator

@ArthurZucker left a comment

LGTM thanks for adding this!

Comment on lines +137 to +138
if self.index_topk_pattern is None:
self.index_topk_pattern = [
Collaborator

can we just use something similar to layer_types? instead of freq + pattern we just have a list that we default to the pattern? 🤗

Contributor Author

can we just use something similar to layer_types? instead of freq + pattern we just have a list that we default to the pattern? 🤗

Updated as suggested:

if self.indexer_types is None:
    pattern = kwargs.pop("index_topk_pattern", None)
    freq = kwargs.pop("index_topk_freq", 1)
    if pattern is not None:
        self.indexer_types = (
            [{"F": "full", "S": "shared"}[c] for c in pattern]
            if isinstance(pattern, str)
            else list(pattern)
        )
    else:
        self.indexer_types = [
            "full" if (max(i - 1, 0) % freq) == 0 else "shared"
            for i in range(self.num_hidden_layers)
        ]

The legacy fallbacks are kept because the official IndexCache repo's patches for vLLM and SGLang currently expose these exact kwargs to end users. For example, in SGLang users launch with:

--json-model-override-args '{"index_topk_freq": 2}'
# or
--json-model-override-args '{"index_topk_pattern": "FFSFSSSFSSFFFSSSFFFSFSSSSSSFFSFFSFFSSFFFFFFSFFFFFSFFSSSSSSFSFFFSFSSSFSFFSFFSSS"}'

And in vLLM:

--hf-overrides '{"index_topk_freq": 2}'
# or
--hf-overrides '{"index_topk_pattern": "FFSF..."}'

The official README documents index_topk_freq and index_topk_pattern as the two configuration parameters for both engines. Removing them outright would break existing deployments that rely on these patches. New usage can pass indexer_types directly; the old args are deprecated and only consulted as fallbacks.
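To make the fallback behavior concrete, here is a standalone sketch under stated assumptions: `DummyConfig` is hypothetical (the real class is GlmMoeDsaConfig), and it mirrors the snippet above:

```python
# Standalone sketch of the deprecation fallback described above.
# DummyConfig is hypothetical; the real class is GlmMoeDsaConfig.
class DummyConfig:
    def __init__(self, num_hidden_layers=4, indexer_types=None, **kwargs):
        self.num_hidden_layers = num_hidden_layers
        self.indexer_types = indexer_types
        if self.indexer_types is None:
            # Legacy kwargs, consulted only when indexer_types is absent.
            pattern = kwargs.pop("index_topk_pattern", None)
            freq = kwargs.pop("index_topk_freq", 1)
            if pattern is not None:
                self.indexer_types = [{"F": "full", "S": "shared"}[c] for c in pattern]
            else:
                self.indexer_types = [
                    "full" if max(i - 1, 0) % freq == 0 else "shared"
                    for i in range(self.num_hidden_layers)
                ]

# Legacy string pattern and freq both resolve to the explicit list form:
cfg_pattern = DummyConfig(index_topk_pattern="FFSF")
cfg_freq = DummyConfig(num_hidden_layers=4, index_topk_freq=2)
```

Both legacy styles converge on the same `indexer_types` list, so deployments passing the old overrides keep working.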

If this looks good, I'll push the commit shortly.

Collaborator

yeah of course! sounds good

Collaborator

@ArthurZucker left a comment

ty!

@vasqu enabled auto-merge April 22, 2026 12:53
@vasqu disabled auto-merge April 22, 2026 12:57
@vasqu
Contributor

vasqu commented Apr 22, 2026

@louzongzhi is this ready for merge? let us know 🤗

@louzongzhi
Contributor Author

@louzongzhi is this ready for merge? let us know 🤗

Please give me a moment. Installing TileLang messed up my environment, so I'm reconfiguring it now. I'll submit the commit shortly.

@github-actions
Contributor

[For maintainers] Suggested jobs to run (before merge)

run-slow: glm_moe_dsa

@louzongzhi
Contributor Author

@vasqu @ArthurZucker All done. indexer_types is in place with backward-compatible fallback for index_topk_pattern/index_topk_freq, and modeling references are updated. Please take a look.

Contributor

@vasqu left a comment

LGTM, thanks for iterating, merging in a second

@vasqu enabled auto-merge April 22, 2026 13:48
@vasqu added this pull request to the merge queue Apr 22, 2026
Merged via the queue into huggingface:main with commit bca7eee Apr 22, 2026
21 checks passed
@louzongzhi deleted the glm5 branch April 22, 2026 14:09