[CB] Allow block sharing in hybrid models #42877
Conversation
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
Looks cool, let me know if I can help add some docs for it here!
@stevhliu I am pushing a lot of new features to CB right now, will let you know when things settle down so we can do a big push on docs! Thanks!
ArthurZucker left a comment
Sounds good! The perf numbers don't show a diff; they should for models that were hybrid, no?
self.use_prefix_sharing = allow_prefix_sharing and group_types == ["full_attention"]
self._block_manager = BlockManager(num_blocks, self.block_size, self.use_prefix_sharing)
# We only use prefix sharing if the whole model has only full attention layers and block sharing is allowed
self.use_prefix_sharing = allow_block_sharing and group_types == ["full_attention"]
no rename on the self attr?
Good question! prefix_sharing is only possible if block sharing (which is more of a memory optimization) is enabled AND the model has no sliding window layers: if there are any, they will create sliding window groups with no shareable blocks, hence no prefix sharing.
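To make the distinction explicit, here is a minimal sketch of how the two settings relate, assuming names like `allow_block_sharing` and `group_types` that mirror the snippet quoted above (it is not the PR's exact implementation):

```python
def resolve_sharing_flags(allow_block_sharing: bool, group_types: list[str]) -> tuple[bool, bool]:
    # Block sharing (mainly a memory optimization) only needs the flag to be on.
    use_block_sharing = allow_block_sharing
    # Prefix sharing additionally requires every layer group to be full attention:
    # a sliding-window group would hold non-shareable blocks and break the prefix chain.
    use_prefix_sharing = allow_block_sharing and group_types == ["full_attention"]
    return use_block_sharing, use_prefix_sharing
```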
for b in range(len(prompt_ids) // self.block_size):
    tokens = prompt_ids[b * self.block_size : (b + 1) * self.block_size]
    current_hash = self._block_manager.compute_hash(current_hash, tokens)
# Prefix sharing is only supported when there is only one full attention layer group, so group_id=0.
Is the comment still valid? I thought this PR allowed different groups to exist, and only acts on one.
Yes, because prefix sharing is still only activated if there is only one group -- cf. the comment above.
But then to mark as complete we loop over the groups, no?
Yes, because marking a block as complete is useful in the context of block sharing, which can happen in a hybrid model. But here we are in the context of prefix sharing, which is more restrictive, so we know there is only one group. Maybe I am missing something here.
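To make the two code paths being discussed concrete, here is a hedged sketch (all names and the hash function are hypothetical, not the PR's actual code): prefix lookup hashes the prompt one full block at a time and only ever concerns group 0, while completion marking loops over every group because block sharing also applies to hybrid models:

```python
import hashlib
from dataclasses import dataclass

BLOCK_SIZE = 32  # illustrative value

def compute_hash(parent_hash: str, tokens: list[int]) -> str:
    # Chain the parent hash with this block's tokens so two requests share a
    # block only when their entire prefix up to that block matches.
    payload = parent_hash + "," + ",".join(map(str, tokens))
    return hashlib.sha256(payload.encode()).hexdigest()

def prompt_block_hashes(prompt_ids: list[int]) -> list[str]:
    # Prefix sharing walks the prompt one full block at a time; it is only
    # enabled for single-group full-attention models, hence group_id=0 only.
    hashes, current = [], ""
    for b in range(len(prompt_ids) // BLOCK_SIZE):
        tokens = prompt_ids[b * BLOCK_SIZE : (b + 1) * BLOCK_SIZE]
        current = compute_hash(current, tokens)
        hashes.append(current)
    return hashes

@dataclass
class SketchBlock:
    block_id: int
    is_complete: bool = False

def mark_shareable_blocks_as_complete(blocks_per_group: dict[int, list[SketchBlock]]) -> None:
    # Block sharing applies to every full-attention group of a hybrid model,
    # so completion marking loops over all groups, not just group 0.
    for blocks in blocks_per_group.values():
        for block in blocks:
            block.is_complete = True
```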
| a no-op.""" | ||
| num_complete_blocks = 0 if not self.use_prefix_sharing else self.blocks_to_complete.pop(state.request_id) | ||
| def mark_shareable_blocks_as_complete(self, state: RequestState) -> None: | ||
| """Marks the shareable blocks that have been computed in the forward pass as complete. If block sharing is off, |
| """Marks the shareable blocks that have been computed in the forward pass as complete. If block sharing is off, | |
| """Marks the shareable blocks that have been computed in the forward pass as complete (meaning it contains cache for tokens that are already processed, vs empty cache for futur new tokens). If block sharing is off, |
Not entirely, but I can see the confusion. Will adapt
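As a rough illustration of what "complete" means in this thread (hypothetical names, not the PR's actual implementation): a block is complete once every one of its slots holds KV entries for tokens that have already gone through a forward pass, and only such blocks are candidates for sharing:

```python
def is_block_complete(num_cached_tokens_in_block: int, block_size: int) -> bool:
    # A partially filled block still has empty slots reserved for future tokens,
    # so it cannot be hashed and shared yet.
    return num_cached_tokens_in_block == block_size
```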
class Block:
class Block:  # TODO: rename to ShareableBlock and update the docs
SlidingBlock(Block) is not shareable
Yes, so they won't create this kind of object! Hence the proposed name change.
# If the block is shareable, we keep track of the allocated blocks as partial blocks
if shareable:
In general with transformers we don't want if/else, we want 2 classes: one for shareable, one for non-shareable. Splitting the logic by class usually scales better.
The thing is, for non-shareable blocks, we don't need the block object at all: we just want the physical block_id. Hence no need to create a Python object that will never be used and keep track of it. So less overhead!
ok got it!
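A sketch of the design choice described above (names are illustrative): full-attention groups track block objects so they can later be hashed, marked complete, and reference-counted, while sliding-window groups only keep the raw physical block ids and pay no extra bookkeeping cost:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class ShareableBlock:
    """Bookkeeping for a block that may be reused across requests."""
    physical_block_id: int
    hash: Optional[str] = None  # set once the block is complete
    ref_count: int = 1          # how many requests currently point at this block

def allocate(shareable: bool, free_block_ids: list[int], partial_blocks: dict[int, ShareableBlock]) -> int:
    block_id = free_block_ids.pop()
    if shareable:
        # Shareable blocks need a Python object so they can later be hashed,
        # marked complete, and reference-counted.
        partial_blocks[block_id] = ShareableBlock(physical_block_id=block_id)
    # Non-shareable (sliding-window) blocks need no extra object: the physical
    # block id is all the bookkeeping required, which keeps overhead low.
    return block_id
```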
self._uninit_block_ids.extend(blocks)

def mark_blocks_as_complete(
def mark_shareable_blocks_as_complete(
Only shareable blocks can be marked as complete, and only shareable blocks need this logic (same as the above comment). How can we better "encapsulate" this?
I guess this will make more sense when Block is renamed to ShareableBlock. In that case, I think the function name makes sense. The issue is I renamed one but not the other imo.
ok!
self._allow_block_sharing = allow_block_sharing
self._use_prefix_sharing = allow_block_sharing  # approximation until the cache is created
Kinda weird to have 2 attrs that do the same thing
They don't! The block sharing boolean allows hybrid models to do block sharing. The use_prefix_sharing bool is updated after we know if the model is hybrid or not. Hence the comment about the approximation!
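A small sketch of the lifecycle being described (hypothetical structure, not the PR's exact code): `_allow_block_sharing` is fixed at construction, while `_use_prefix_sharing` starts out as an approximation and is tightened once the cache reveals whether the model is hybrid:

```python
class SchedulerSketch:
    def __init__(self, allow_block_sharing: bool) -> None:
        self._allow_block_sharing = allow_block_sharing
        # Approximation: we do not yet know whether the model is hybrid.
        self._use_prefix_sharing = allow_block_sharing

    def on_cache_created(self, group_types: list[str]) -> None:
        # Once the cache exists, the real layer group structure is known, so
        # the prefix-sharing flag can be corrected for hybrid models.
        self._use_prefix_sharing = self._allow_block_sharing and group_types == ["full_attention"]
```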
Not really sadly, this is more of a memory optimization. It will ensure that parallel decoding won't eat too much memory with hybrid models. The perf table is here more as a sanity check to make sure we did not incur any major overhead with these changes.
Summary
This PR increases the granularity of KV cache sharing. Previously, sharing was only enabled for full-attention-only models. With this PR, sharing is enabled for any model that has full attention layers. It can still be disabled with a flag.
This PR paves the way for parallel decoding by making it more efficient for hybrid architectures.
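As a rough illustration of the increased granularity (the group-type strings and function name are illustrative, not the exact ones used in the code): in a hybrid model, only the full-attention groups now participate in block sharing, instead of sharing being disabled for the whole model:

```python
def shareable_groups(group_types: list[str], allow_block_sharing: bool = True) -> list[int]:
    # Before this PR: sharing required the whole model to be full attention.
    # After this PR: every full-attention group can share blocks, even if the
    # model also has sliding-window groups; the flag can still turn it off.
    if not allow_block_sharing:
        return []
    return [i for i, t in enumerate(group_types) if t == "full_attention"]

# Example: a hybrid model with one full-attention and one sliding-window group
# now shares blocks for group 0 instead of sharing nothing at all.
assert shareable_groups(["full_attention", "sliding_attention"]) == [0]
```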
Performance
Tests
No new failures for the CB tests.
Sanity check
Ran the command `python examples/pytorch/continuous_batching.py --samples 20 --add-prefix --compile --compare` and outputs were nearly the same and made sense.