Add IndexCache support for GLM5 DSA #45424
Conversation
self.is_nextn = config.is_nextn
if self.is_nextn:
    self.skip_topk = False
    self.next_skip_topk = False
else:
    self.index_topk_freq = config.index_topk_freq
    self.index_topk_pattern = config.index_topk_pattern
    if self.index_topk_pattern is None:
        self.skip_topk = max(layer_idx - 1, 0) % self.index_topk_freq != 0
        self.next_skip_topk = layer_idx % self.index_topk_freq != 0
    else:
        self.skip_topk = self.index_topk_pattern[layer_idx] == "S"
        if layer_idx < len(self.index_topk_pattern) - 1:
            self.next_skip_topk = self.index_topk_pattern[layer_idx + 1] == "S"
        else:
            self.next_skip_topk = False
all of this should never happen here. You should just be doing self.is_nextn = config.is_next_n[layer_idx].
This makes it explicit which layers are skipping topK, and which are not!
Agreed! I'll refactor to use is_next_n: List[bool] in Config (similar to mlp_type) instead of the complex derivation logic in __init__. Much cleaner.
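A minimal sketch of that shape, with placeholder values (hypothetical code, not the PR's implementation):

# Hypothetical sketch of the agreed shape: the config carries one explicit
# flag per layer, and the layer just indexes into it (no derivation logic).
class GlmMoeDsaConfig:
    def __init__(self, num_hidden_layers: int = 4):
        self.num_hidden_layers = num_hidden_layers
        # Explicit per-layer flags, analogous to mlp_type; values here are
        # placeholders (e.g. only a trailing NextN layer set to True).
        self.is_next_n = [False] * (num_hidden_layers - 1) + [True]

class GlmMoeDsaAttention:
    def __init__(self, config: GlmMoeDsaConfig, layer_idx: int):
        self.is_nextn = config.is_next_n[layer_idx]  # plain lookup, nothing else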
index_topk_freq: int = 1
index_topk_pattern: str | None = None
I kept index_topk_freq and index_topk_pattern to align with the IndexCache paper terminology (Shared vs Full patterns). However, as per your first suggestion, these will only be used in the Config to construct the is_next_n list; there won't be any derivation logic in the layer __init__. The layer will simply read config.is_next_n[layer_idx].
yep it's much simpler, explicit and aligned with what we try to have!
if self.next_skip_topk is None:
    return attn_output, attn_weights
else:
    if self.next_skip_topk:
        return attn_output, attn_weights, topk_indices
    else:
        return attn_output, attn_weights, None
let's always return topk maybe? let's simplify our life
My concern is that the original implementation only returned 2 values, so forcing a 3-tuple return might break backward compatibility for existing code that expects (output, weights). However, I agree that a consistent API is cleaner, so I'll refactor to always return 3 values and handle the compatibility aspect properly.
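A standalone sketch of the uniform 3-tuple return being discussed (hypothetical code, assuming a skip_topk flag and a prev_topk_indices argument; not the PR's actual forward):

import torch

def attention_forward(hidden_states: torch.Tensor, skip_topk: bool,
                      prev_topk_indices: torch.Tensor | None = None):
    # Placeholders for the real attention math.
    attn_output, attn_weights = hidden_states, None
    if skip_topk:
        # Shared layer: reuse indices cached by the nearest preceding Full layer.
        topk_indices = prev_topk_indices
    else:
        # Full layer: compute fresh indices (toy top-k over the last dimension).
        topk_indices = hidden_states.topk(k=min(4, hidden_states.shape[-1]), dim=-1).indices
    # Uniform API: always three values; callers that only want
    # (output, weights) can simply ignore the third element.
    return attn_output, attn_weights, topk_indices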
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
ArthurZucker
left a comment
cool that looks simpler!
    self.index_topk_pattern = "F" * self.num_hidden_layers
else:
    self.index_topk_pattern = "".join(
        "F" if (i == 0 or (i - 1) % self.index_topk_freq == 0) else "S"
IDK what F stands for! + typing should reflect it: pattern should be a tuple, no? You can default it to ("skip", "meaningful name", ...)
"F" stands for Full (layers that run the indexer independently to compute top-k indices) and "S" stands for Shared (layers that reuse cached indices from the nearest preceding Full layer). We keep this as a string pattern (e.g., "FFSF...") rather than a tuple to maintain consistency with the IndexCache paper (GLM-5) and the existing implementations in SGLang and vLLM.
Done! Changed from string pattern "FSFS..." to list format ["full", "shared", ...] with explicit naming. The generation logic now uses max(i - 1, 0) % freq to match the official IndexCache implementation exactly, and the type annotation is updated to list[str]. This should be much clearer while maintaining consistency with the reference implementation.
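For concreteness, here is how the max(i - 1, 0) % freq rule plays out (a standalone snippet for illustration, not the PR's actual code):

def build_indexer_types(num_layers: int, freq: int) -> list[str]:
    # Layer 0 and every layer i where (i - 1) is a multiple of `freq` run the
    # indexer themselves ("full"); all other layers reuse cached indices ("shared").
    return ["full" if max(i - 1, 0) % freq == 0 else "shared" for i in range(num_layers)]

print(build_indexer_types(8, 2))
# ['full', 'full', 'shared', 'full', 'shared', 'full', 'shared', 'full']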
Force-pushed from b43c161 to b246219:
Moves index_topk_pattern generation from Attention.__init__ to Config.__post_init__ as suggested. Layers now simply check `config.index_topk_pattern[layer_idx]` instead of computing skip conditions, matching the mlp_layer_types pattern for consistent explicit configuration.
ArthurZucker
left a comment
LGTM thanks for adding this!
if self.index_topk_pattern is None:
    self.index_topk_pattern = [
can we just use something similar to layer_types? instead of freq + pattern we just have a list that we default to the pattern? 🤗
Updated as suggested:
if self.indexer_types is None:
    pattern = kwargs.pop("index_topk_pattern", None)
    freq = kwargs.pop("index_topk_freq", 1)
    if pattern is not None:
        self.indexer_types = [{"F": "full", "S": "shared"}[c] for c in pattern] if isinstance(pattern, str) else list(pattern)
    else:
        self.indexer_types = ["full" if (max(i - 1, 0) % freq) == 0 else "shared" for i in range(self.num_hidden_layers)]

The legacy fallbacks are kept because the official IndexCache repo's patches for vLLM and SGLang currently expose these exact kwargs to end users. For example, in SGLang users launch with:
--json-model-override-args '{"index_topk_freq": 2}'
# or
--json-model-override-args '{"index_topk_pattern": "FFSFSSSFSSFFFSSSFFFSFSSSSSSFFSFFSFFSSFFFFFFSFFFFFSFFSSSSSSFSFFFSFSSSFSFFSFFSSS"}'

And in vLLM:
--hf-overrides '{"index_topk_freq": 2}'
# or
--hf-overrides '{"index_topk_pattern": "FFSF..."}'

The official README documents index_topk_freq and index_topk_pattern as the two configuration parameters for both engines. Removing them outright would break existing deployments that rely on these patches. New usage can pass indexer_types directly; the old args are deprecated and only consulted as fallbacks.
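For illustration, a hypothetical sketch of the three entry points (the GlmMoeDsaConfig constructor calls here are assumed):

# Hypothetical usage sketch; assumes GlmMoeDsaConfig accepts these kwargs.
cfg_new = GlmMoeDsaConfig(indexer_types=["full", "full", "shared", "full"])   # new style
cfg_pattern = GlmMoeDsaConfig(index_topk_pattern="FFSF", num_hidden_layers=4)  # deprecated fallback
cfg_freq = GlmMoeDsaConfig(index_topk_freq=2, num_hidden_layers=4)             # deprecated fallback

All three resolve to the same indexer_types list for a 4-layer model.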
If this looks good, I'll push the commit shortly.
yeah of course! sounds good
@louzongzhi is this ready for merge? let us know 🤗
Please give me a moment. Installing TileLang messed up my environment, so I'm reconfiguring it now. I'll submit the commit shortly.
[For maintainers] Suggested jobs to run (before merge): run-slow: glm_moe_dsa
@vasqu @ArthurZucker All done. indexer_types is in place with backward-compatible fallback for index_topk_pattern/index_topk_freq, and modeling references are updated. Please take a look.
vasqu
left a comment
LGTM, thanks for iterating, merging in a second
What does this PR do?
This PR implements IndexCache support for GLM5's DeepSeek Sparse Attention (DSA), enabling cross-layer index reuse to accelerate long-context inference.
IndexCache accelerates sparse attention by reusing top-k token indices across consecutive layers, removing ~75% of redundant indexer computations while maintaining accuracy.
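As a schematic illustration of that reuse (hypothetical standalone code; the actual modeling code differs), a "full" layer recomputes the top-k indices and every following "shared" layer reuses them:

import torch

def run_layers(hidden_states: torch.Tensor, indexer_types: list[str]):
    # Schematic only: "full" layers recompute and cache top-k indices;
    # "shared" layers keep reusing the last "full" layer's indices.
    prev_topk_indices = None
    for layer_type in indexer_types:
        if layer_type == "full":
            prev_topk_indices = hidden_states.abs().topk(k=2, dim=-1).indices
        # A real "shared" layer would attend only over the tokens selected
        # by prev_topk_indices instead of running the indexer again.
    return hidden_states, prev_topk_indices

out, idx = run_layers(torch.randn(1, 8), ["full", "shared", "shared", "full"])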
Key implementation details:
- index_topk_freq, index_topk_pattern, and is_nextn added to GlmMoeDsaConfig for flexible layer scheduling (Full/Shared pattern)
- skip_topk/next_skip_topk logic in GlmMoeDsaAttention to determine whether to compute new indices or reuse the previous layer's indices
- prev_topk_indices parameter propagation through GlmMoeDsaDecoderLayer and GlmMoeDsaModel for cross-layer index sharing
- Attention returns topk_indices when enabled

Performance impact:
Reference: https://github.com/THUDM/IndexCache
Who can review?
@ArthurZucker @Cyrilvallez @vasqu