[cache refactor] Move all the caching logic to a per-layer approach #39106
Conversation
Squash for refactor: Replace monolithic cache classes with modular LayeredCache (huggingface#38077)
- Introduces CacheLayer and Cache base classes
- Ports Static, Dynamic, Offloaded, Quantized, Hybrid, etc. to use layers
- Implements method/attr dispatch across layers to reduce boilerplate
- Adds CacheProcessor hooks for offloading, quantization, etc.
- Updates and passes tests
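To make the structure concrete, here is a minimal sketch of the per-layer idea from the commit message (illustrative only; the real `CacheLayer`/`Cache` classes in `cache_utils.py` carry many more methods, and the cross-layer dispatch is implemented generically to cut boilerplate):

```python
import torch


class CacheLayer:
    """Holds the key/value states of a single decoder layer."""

    def __init__(self):
        self.keys = None
        self.values = None

    def update(self, key_states, value_states):
        # Append the new states along the sequence dimension.
        if self.keys is None:
            self.keys, self.values = key_states, value_states
        else:
            self.keys = torch.cat([self.keys, key_states], dim=-2)
            self.values = torch.cat([self.values, value_states], dim=-2)
        return self.keys, self.values

    def reset(self):
        self.keys, self.values = None, None


class Cache:
    """Cache-wide operations simply dispatch to every layer."""

    def __init__(self, num_layers):
        self.layers = [CacheLayer() for _ in range(num_layers)]

    def update(self, key_states, value_states, layer_idx):
        return self.layers[layer_idx].update(key_states, value_states)

    def reset(self):
        for layer in self.layers:
            layer.reset()
```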
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
force-pushed from a2fe24c to 04d7a0b
@zucchini-nlp I noticed I was breaking QuantizedCaches. It was hard to spot because no test covered it.
force-pushed from d97a02d to 26c28af
Super super cool, glad to see the cache being refactored. Left a few comments in the quant cache, I think it is not the same as in current main.
Can we also check that generation with main vs. the PR branch is identical when low-bit quantizing and generating much longer than the residual length? Not in a test case, but as a sanity check. It's been a while since I looked at this cache.
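A sanity check along those lines could look like the snippet below (a sketch, not part of the PR: the model id and token counts are arbitrary, and the `cache_config` dict follows the documented quantized-cache generation API; `residual_length` defaults to 128, so 512 new tokens goes well past it). Run it once on main and once on the PR branch, then diff the outputs:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-3.2-1B"  # any small causal LM works
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float16, device_map="auto")

inputs = tokenizer("The theory of relativity states that", return_tensors="pt").to(model.device)

# Greedy decoding well past the residual length, with a low-bit quantized cache
# (backend "quanto" requires the optimum-quanto package).
out = model.generate(
    **inputs,
    do_sample=False,
    max_new_tokens=512,
    cache_implementation="quantized",
    cache_config={"backend": "quanto", "nbits": 2},
)
print(tokenizer.decode(out[0]))  # diff this between main and the PR branch
```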
src/transformers/cache_utils.py
Outdated
"config and it's not set to None." | ||
) | ||
# Adjust max_cache_len for sliding window layers (they can't be larger than sliding window) | ||
max_cache_len = max_cache_len or config.max_position_embeddings |
`config.max_position_embeddings` doesn't always reflect the actual max length a model can handle, and I think sometimes it's filled with nonsense values. Maybe we should have a default `max_cache_length` instead, wdyt?
hmmm, good catch. This comes from transformers/src/transformers/cache_utils.py, line 1615 at 1283877:
self.max_cache_len = max_cache_len if max_cache_len is not None else config.max_position_embeddings
This is only relevant for StaticCaches, which are initialized purposely for torch.compile, so it is probably very uncommon for this param to be uninitialized. The problem with setting a super-high default is that we will allocate an equally big tensor for the static cache. Are the nonsense values usually too big? Or what do you mean?
Yeah, I meant the values are super large. For example, examining a random Llama config shows 10M tokens. Though technically with RoPE it can go to an arbitrarily large sequence length, maybe we should have a default of 1024/2048 tokens in static caches?
https://huggingface.co/meta-llama/Llama-4-Scout-17B-16E-Instruct/blob/main/config.json#L28
I realize it's not really an issue of this PR, but since we started the major clean-up, it's time to bring up the discussion 😄
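For scale, a quick back-of-envelope computation (illustrative Llama-like dimensions; only the 10M figure comes from the thread) shows why allocating a static cache at `config.max_position_embeddings` can be prohibitive:

```python
# Static cache size = (keys + values) x layers x KV heads x head_dim x max_len x bytes/element.
# Assumed dims: 32 layers, 8 KV heads, head_dim 128, fp16 (2 bytes per element).
def static_cache_bytes(max_len, num_layers=32, num_kv_heads=8, head_dim=128, bytes_per_el=2):
    return 2 * num_layers * num_kv_heads * head_dim * max_len * bytes_per_el

print(f"{static_cache_bytes(10_000_000) / 1e9:,.0f} GB")  # ~1,311 GB at 10M tokens
print(f"{static_cache_bytes(2048) / 1e9:.3f} GB")         # ~0.268 GB at 2048 tokens
```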
maybe set it to something like 2048 with a warning? cc @gante
Thanks! A good improvement 🤗
src/transformers/cache_utils.py
Outdated
"""

def __init__(
good for BC, but maybe we want to move forward with just specifying `cache_processor="offloaded"`? wdyt @Cyrilvallez
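In other words, the suggestion is to prefer the second spelling below (`cache_processor="offloaded"` is the hypothetical kwarg from this comment, not a settled API):

```python
from transformers import DynamicCache, OffloadedCache

# Kept for backward compatibility:
cache = OffloadedCache()

# Direction suggested in this comment (hypothetical kwarg, API not settled here):
cache = DynamicCache(cache_processor="offloaded")
```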
all done @ArthurZucker! Thanks for the detailed review, learning a lot from you.
Good first steps! There's still some work to be done; I'm separating my feedback below by sections.
Dependencies: Let's merge #38086 before we continue this review, this diff is very polluted 👀
Complexity: This PR introduces a lot of complexity, which is the root of many long-term maintenance problems. Let's try to simplify as much as we can 🙏 (e.g. kill `KVProxy`).
Cache configs: From a model implementation perspective, all it needs to define is the type of cache layers it expects (a list) and, in some edge cases, a few additional cache kwargs (a dict). I think we would do ourselves a favor if we don't use `CacheConfig` classes: fewer classes to maintain, simpler cache-related `model.config` parameters, fewer tests. See complexity comment above. [Let's move this discussion to Slack]
Documentation: In the diff, we remove the docs for all methods of the caches. This means users won't be able to easily access information about available methods. Since we expect users to import and use `Cache` (or subclasses of it), we should make sure `Cache` has all its methods in the docs. We could also write something like "See the documentation of Cache for shared methods" on all other caches' docstrings, to avoid repeating the same information in our docs.
Docstrings: Some classes are missing the `__init__` arguments in their docstring (e.g. `CacheLayer`), and some non-dunder methods don't have docstrings with args/return (e.g. `from_kv`). Let's make sure we dot our i's and cross our t's after we settle on the interfaces :)
@manueldeprada some comments above are regarding code blocks that no longer exist, feel free to just resolve them if they no longer apply 🤗
force-pushed from d684339 to 4c03e0f
force-pushed from 5dc5fb4 to a6b7562
force-pushed from a6b7562 to 27916bc
Thanks, looks nice now! Small nits
the file is super long, we might split `cache_utils` into `configuration_utils|utils|layers`? @gante
lmk! 500 out of 2500 LOC are deprecated, including the configuration classes.
We could split the remaining 2000 LOC into caches (1000 LOC), layers (400 LOC), and processors (600 LOC).
Last nit
[For maintainers] Suggested jobs to run (before merge): run-slow: bamba, bart, bigbird_pegasus, biogpt, blenderbot, blenderbot_small, dia, falcon_h1, gemma3n, gptj, granitemoehybrid, informer, jamba, lfm2, longt5, m2m_100
Alright, merging! 🤗🚀
Nice work!
Our vLLM CI tests for hybrid models that compare against transformers are broken since we upgraded to the latest transformers, and I've traced the issue to this PR. Are there any changes in this PR that could cause the generated tokens to be different?
@Cyrilvallez I can confirm that using current main, the test I'm using to debug is passing. I will check the rest now, but it looks good.
Can confirm all correctness tests pass using latest main. There is one test still failing, but due to a different issue that I will report separately.
Correction: it looks like all tests involving mamba2 are passing, but tests involving mamba1 still have mismatching output.
Can you provide the gen code that produces the mismatch?
@tdoublep which mamba1 tests? I ran all in models/language/generation/test_hybrid.py::test_models and they pass with Cyril's PR.
@manueldeprada Apologies, I should have followed up here. I can't reproduce that mamba1 failure anymore; it must have been something else related to my dev env. I can confirm that the latest transformers main looks good from the vLLM hybrid test perspective.
[cache refactor] Move all the caching logic to a per-layer approach (huggingface#39106)
* Squash for refactor: Replace monolithic cache classes with modular LayeredCache (huggingface#38077)
  - Introduces CacheLayer and Cache base classes
  - Ports Static, Dynamic, Offloaded, Quantized, Hybrid, etc. to use layers
  - Implements method/attr dispatch across layers to reduce boilerplate
  - Adds CacheProcessor hooks for offloading, quantization, etc.
  - Updates and passes tests
* fix quantized, add tests
* remove CacheProcessorList
* raushan review, arthur review
* joao review: minor things
* remove cache configs, make CacheLayer a mixin (joaos review)
* back to storage inside Cache()
* remove cachebase for decorator
* no more __getattr__
* fix tests
* joaos review except docs
* fix ast deprecations for python 3.14: replace node.n by node.value and use `ast.Constant`; more verbose exceptions in `fix_docstring` on docstring formatting issues
* Revert "back to storage inside Cache()" (reverts commit 27916bc)
* cyril review
* simplify cache export
* fix lfm2 cache
* HybridChunked to layer
* BC proxy object for cache.key_cache[i]=...
* reorder classes
* bfff come on LFM2
* better tests for hybrid and hybridChunked
* complete coverage for hybrid chunked caches (prefill chunking)
* reimplementing HybridChunked
* cyril review
* fix ci
* docs for cache refactor
* docs
* oopsie
* oopsie
* fix after merge
* cyril review
* arthur review
* opsie
* fix lfm2
* opsie2
This PR completes Part 1 of the cache refactor tracked in #38077.
Summary:
- `Cache` is structured as a list of layers.
- Cache-wide methods (`reset()`, `crop()`, `batch_split()`) now auto-propagate to layers.

Implementation details:
- `cache.key_cache` and `cache.value_cache` are exposed through `KVProxy` to efficiently return a layer-indexed list of keys or values and keep BC (see the sketch after this list).
- Offloading and quantization are handled by a `CacheProcessor`. In the future, it can be expanded to a `CacheProcessorList` if needed.
- `MambaCache` moved to `modeling_mamba.py` (#38086).
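The BC proxy can be pictured as a list-like view over the per-layer storage (a sketch of the idea, not the PR's exact code):

```python
class KVProxy:
    """List-like view so legacy code such as `cache.key_cache[idx]` (reads and
    in-place writes) keeps working while tensors live inside each layer."""

    def __init__(self, layers, attr):
        self.layers = layers  # list of per-layer cache objects
        self.attr = attr      # "keys" or "values"

    def __getitem__(self, layer_idx):
        return getattr(self.layers[layer_idx], self.attr)

    def __setitem__(self, layer_idx, tensor):
        setattr(self.layers[layer_idx], self.attr, tensor)

    def __len__(self):
        return len(self.layers)
```

A cache can then expose `key_cache` as `KVProxy(self.layers, "keys")`, so both reads and writes like `cache.key_cache[i] = new_keys` land on the right layer without duplicating storage.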