
Conversation

@Cyrilvallez (Member) commented Jul 30, 2025

What does this PR do?

Big simplifications everywhere, but most notably:

  • all caches are initialized lazily -> no more device issues with device_map (which used to break the static dynamo addresses through device movement), no dimension issues with TP, and a much simpler preparation for generate (all properties are derived at the first update) -> simpler and more efficient (no device copies); see the sketch after this list
  • early_initialization provides a way to initialize everything before update is called -> this is needed for export, as we can't trace correctly if initialization is lazy
  • removed CacheProcessor -> QuantizedProcessor becomes QuantizedLayers instead, and offloading alone does not justify the Processor boilerplate -> much easier to have offloading as part of the Layer and Cache themselves (it is also much more robust now regarding devices)
  • Hybrid and HybridChunked now check chunk_attention_size correctly again (this check was lost before, which would break Llama4)
  • code much easier to follow and understand -> more maintainable
  • this is also a big step towards completely removing the cache_position, which would simplify the library a lot, and will come in a follow-up PR
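To make the first two bullets concrete, here is a minimal, purely illustrative sketch of the lazy/early initialization idea (the class and signatures below are hypothetical and only mirror the PR's naming, they are not the library's actual code):

import torch

class LazyKVLayer:
    """Toy cache layer: buffers are allocated on the first `update`, so shape, dtype and
    device are derived from the incoming states instead of being configured up front."""

    def __init__(self, max_cache_len):
        self.max_cache_len = max_cache_len
        self.keys = None
        self.values = None

    def lazy_initialization(self, key_states):
        # Derive everything (batch, heads, head_dim, dtype, device) from the first real states seen.
        batch, heads, _, head_dim = key_states.shape
        shape = (batch, heads, self.max_cache_len, head_dim)
        self.keys = torch.zeros(shape, dtype=key_states.dtype, device=key_states.device)
        self.values = torch.zeros(shape, dtype=key_states.dtype, device=key_states.device)

    def early_initialization(self, batch, heads, head_dim, dtype, device):
        # Same allocation, but done explicitly before any update (what export/tracing needs).
        self.lazy_initialization(torch.empty(batch, heads, 0, head_dim, dtype=dtype, device=device))

    def update(self, key_states, value_states, cache_position):
        if self.keys is None:
            self.lazy_initialization(key_states)
        self.keys[:, :, cache_position] = key_states
        self.values[:, :, cache_position] = value_states
        return self.keys, self.values

# e.g.:
layer = LazyKVLayer(max_cache_len=128)
k = v = torch.randn(1, 8, 10, 64)
keys, values = layer.update(k, v, cache_position=torch.arange(10))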

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@Cyrilvallez Cyrilvallez changed the title Simplify/make more explicit the caching logic Rework the Cache logic to make it simpler and more general Jul 31, 2025
@Cyrilvallez Cyrilvallez changed the title Rework the Cache logic to make it simpler and more general Refactor the Cache logic to make it simpler and more general Aug 1, 2025
@Cyrilvallez (Member Author)

Slow tests are the same on this PR and on main for llama (arguably the most important model), mistral (the most tested sliding-window model), gemma2 (the most tested hybrid model), gemma3 (hybrid model), llama4 (hybrid chunked) - so all scenarios are green!

@gante (Member) commented Aug 5, 2025

@Cyrilvallez 🔥🔥 PR

nano requests:

  1. add early_initialization and lazy_initialization to the documented methods (in [[autodoc]] Cache)
  2. We can't throw informative exceptions at compilation time. But we can mitigate related problems with comments: In the docstring of lazy_initialization, let's mention that this function can never be called at compilation time, and that early_initialization should be called ahead of compilation instead 🙏

1 + 2 = if users open issues about manual compilation, we can link to the docstrings :D
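For request 1, the addition to the cache docs should look roughly like this (hypothetical excerpt; the exact method list may differ):

[[autodoc]] Cache
    - update
    - early_initialization
    - lazy_initialization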

@Cyrilvallez (Member Author) commented Aug 5, 2025

Hey @gante! Done with 1. 👌
Concerning 2., we actually CAN correctly compile even lazy_initialization, that's the beauty of it! Of course, this is not efficient and leads to recompiles, as prefill should not be compiled for performance reasons anyway, but it works!
You can try on the following snippet:

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_id = "google/gemma-2-2b"
torch_device = 0

EXPECTED_COMPLETIONS = [
    " the people, the food, the culture, the history, the music, the art, the architecture",
    ", green, yellow, orange, purple, pink, brown, black, white, gray, silver",
]

input_text = [
    "This is a nice place. " * 800 + "I really enjoy the scenery,",  # This is larger than 4096 tokens
    "A list of colors: red, blue",  # This will almost all be padding tokens
]
tokenizer = AutoTokenizer.from_pretrained(model_id, padding_side="left")  # left padding for batched decoder-only generation
inputs = tokenizer(input_text, padding=True, return_tensors="pt").to(torch_device)

model = AutoModelForCausalLM.from_pretrained(
    model_id, attn_implementation="sdpa", torch_dtype=torch.bfloat16, device_map=torch_device
)
# Compile the full forward! So call to `lazy_initialization` will be compiled as well
model.compile(fullgraph=True)

# Make sure prefill is larger than sliding window
input_size = inputs.input_ids.shape[-1]
assert input_size > model.config.sliding_window

out = model.generate(**inputs, max_new_tokens=20, cache_implementation="hybrid")[:, input_size:]
output_text = tokenizer.batch_decode(out)

assert output_text == EXPECTED_COMPLETIONS

and you'll see it works! Only compilation with CUDA graphs, e.g. mode="reduce-overhead", will fail, but that mode is not guaranteed to give better perfs anyway!
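For readers who want to see this "recompiles, but still works" behavior in isolation (without loading a model), here is a tiny self-contained toy, not transformers code, just an illustration of a lazily-allocated buffer under torch.compile(fullgraph=True):

import torch

class LazyCacheLayer:
    # Toy stand-in for a lazily-initialized cache layer: the buffer is only allocated
    # the first time `update` is called, i.e. inside the compiled region.
    def __init__(self):
        self.buf = None

    def update(self, x):
        if self.buf is None:
            self.buf = torch.zeros_like(x)  # "lazy initialization"
        self.buf = self.buf + x
        return self.buf

@torch.compile(fullgraph=True)
def step(layer, x):
    return layer.update(x)

layer = LazyCacheLayer()
x = torch.ones(4)
print(step(layer, x))  # first call traces the init branch without error
print(step(layer, x))  # the `is None` guard now fails -> recompile, but the result stays correct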

@Cyrilvallez Cyrilvallez changed the title Refactor the Cache logic to make it simpler and more general [core] Refactor the Cache logic to make it simpler and more general Aug 5, 2025
@gante (Member) commented Aug 5, 2025

@Cyrilvallez fair! (I would still make a note of it though, for power users that don't rely on generate 😮 )

@ArthurZucker (Collaborator) left a comment

Nothing much to say: very very clean! 🤗

  • compiling the full call may not be the best: w.r.t. accelerate and hooks in general, I would rather just compile the call!

Comment on lines +60 to +64
def offload(self):
    """Offload this layer's data to CPU device."""
    if self.keys is not None:
        self.keys = self.keys.to("cpu", non_blocking=True)
        self.values = self.values.to("cpu", non_blocking=True)
Collaborator:

I think for CUDA we needed / had better perfs with a different stream

Comment on lines +736 to +737
offload_only_non_sliding (`bool`, *optional*, defaults to `True`):
If `offloading` is `True`, this further decides if only the non-sliding layers will be offloaded (because
Collaborator:

nice comment

Comment on lines +785 to +786
with self.prefetch_stream if _is_torch_greater_or_equal_than_2_7 else torch.cuda.stream(self.prefetch_stream):
    self.layers[layer_idx].prefetch()
Collaborator:

ah ok there's the stream
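For context, the stream trick discussed in these two hunks, issuing the copies on a side CUDA stream so they overlap with compute on the default stream, looks roughly like this in generic PyTorch (a sketch, not the PR's actual code; `layer` here is any object holding `keys`/`values` tensors):

import torch

prefetch_stream = torch.cuda.Stream()  # assumes a CUDA device is available

def offload_layer(layer):
    # GPU -> CPU copy; pinned CPU memory is needed for the reverse copy to be truly asynchronous.
    layer.keys = layer.keys.to("cpu", non_blocking=True)
    layer.values = layer.values.to("cpu", non_blocking=True)

def prefetch_layer(layer, device="cuda"):
    # Issue the CPU -> GPU copies on a side stream so they can overlap with compute
    # running on the default stream.
    with torch.cuda.stream(prefetch_stream):
        layer.keys = layer.keys.to(device, non_blocking=True)
        layer.values = layer.values.to(device, non_blocking=True)

def wait_for_prefetch():
    # Before using the prefetched tensors, make the default stream wait for the copies.
    torch.cuda.current_stream().wait_stream(prefetch_stream)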

@ArthurZucker (Collaborator)

run-slow: bamba, dia, falcon_h1, gptj, granitemoehybrid, jamba, kyutai_speech_to_text, lfm2, musicgen, musicgen_melody, rag, zamba, zamba2

github-actions bot (Contributor) commented Aug 8, 2025

This comment contains run-slow, running the specified jobs:

models: ['models/bamba', 'models/dia', 'models/falcon_h1', 'models/gptj', 'models/granitemoehybrid', 'models/jamba', 'models/kyutai_speech_to_text', 'models/lfm2', 'models/musicgen', 'models/musicgen_melody', 'models/rag', 'models/zamba', 'models/zamba2']
quantizations: [] ...

github-actions bot (Contributor) commented Aug 8, 2025

[For maintainers] Suggested jobs to run (before merge)

run-slow: bamba, dia, falcon_h1, gptj, granitemoehybrid, jamba, kyutai_speech_to_text, lfm2, musicgen, musicgen_melody, rag, zamba, zamba2

@Cyrilvallez (Member Author)

Slow tests are similar to main, merging!

@Cyrilvallez Cyrilvallez merged commit dc11a3c into main Aug 8, 2025
23 of 25 checks passed
@Cyrilvallez Cyrilvallez deleted the explicit-cache branch August 8, 2025 12:47
BenjaminBossan added a commit to BenjaminBossan/peft that referenced this pull request Aug 12, 2025
Resolves current CI errors with prefix tuning.

Due to some recent changes in transformers (surfaced by
huggingface/transformers#39797), checking
hasattr(cache, "max_cache_len") results in an error:

>>> cache = DynamicCache()
>>> hasattr(cache, "max_cache_len")
Traceback (most recent call last):
  File "/home/name/work/forks/transformers/foo.py", line 9, in <module>
    hasattr(cache, "max_cache_len")
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/name/work/forks/transformers/src/transformers/cache_utils.py", line 916, in max_cache_len
    return max(values)
           ^^^^^^^^^^^
ValueError: max() iterable argument is empty

This has been reported and will be fixed in transformers. On the PEFT
side, it is safest to check the cache type and avoid accessing this
attribute in the first place, which is what this PR does.

Moreover, that PR also changed the argument order used to initialize
HybridCache (this will probably also be reverted in transformers), which is
also taken into account in this PR by only using keyword arguments.
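The guard described in this commit message can be illustrated with a short sketch (not PEFT's actual code, just the idea of checking the cache type instead of probing the attribute):

from transformers import DynamicCache

def get_max_cache_len(cache):
    # DynamicCache grows as needed and has no fixed capacity, so don't touch
    # `max_cache_len` on it at all; for other caches, read the attribute if present.
    if isinstance(cache, DynamicCache):
        return None
    return getattr(cache, "max_cache_len", None)

print(get_max_cache_len(DynamicCache()))  # None, without hitting the ValueError shown above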
@Cyrilvallez Cyrilvallez mentioned this pull request Aug 13, 2025
This was referenced Aug 15, 2025
BenjaminBossan added a commit to huggingface/peft that referenced this pull request Aug 21, 2025
Resolves current CI errors with prefix tuning.

Due to some recent changes in transformers (surfaced by
huggingface/transformers#39797), checking
hasattr(cache, "max_cache_len") results in an error. This PR fixes it.

Moreover, that PR also changed the argument order used to initialize
HybridCache (this will probably also be reverted in transformers), which is
also taken into account in this PR by only using keyword arguments.

Finally, HybridCache will be deprecated and later removed, so move the
import inside a version guard.
csqaiub added a commit to csqaiub/peft that referenced this pull request Sep 28, 2025