Cache System Refactor: Layered Architecture #38077

manueldeprada · 2025-05-12T08:00:44Z

Overview

This PR introduces a layered architecture for the cache system, allowing for better composition of caches.

Current class structure

graph TD
    B[Cache]
    B --> F[StaticCache]
    B --> I[HybridCache]
    B --> G[DynamicCache]
    F --> H[SlidingWindowCache]
    F --> P[OffloadedStaticCache]
    B --> M[EncoderDecoderCache]
    B --> N[HybridChunkedCache]
    N --> O[OffloadedHybridCache]
    G --> J[QuantizedCache]
    J --> Q[QuantoQuantizedCache]
    J --> R[HQQQuantizedCache]
    G --> K[OffloadedCache]
    B --> L[MambaCache]

Goals

Replace all existing cache implementations with layered caches
Enable model-specific cache configurations through layer composition
Improve test coverage and maintainability
Reduce code duplication through shared layer implementations

New structure

Cache
- Keeps a layers list of CacheLayer instances.
- Dynamically delegates method and attribute calls to the layers (e.g., crop(), reset(), is_compilable, etc.)
CacheLayer
- Base type for all layers
- Examples:
  - StaticLayer
  - DynamicLayer
  - ...
StaticCache and DynamicCache are now empty shells, kept for backward compatibility (BC), that only define their layer type:

    class DynamicCache(Cache):
        pattern_block = (DynamicLayer,)

Offloading, quantization, etc. are pluggable CacheProcessors that can wrap any cache, static or dynamic.
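To make the wrapping idea concrete, here is a minimal sketch of a processor that hooks into a cache update. Everything in it is an assumption made for illustration: the hook names `pre_update` and `post_update`, the `ToyCache` class, and the use of plain Python lists instead of tensors are not taken from this PR, and rounding merely stands in for quantization.

```python
class CacheProcessor:
    """Hypothetical hook interface; the method names are illustrative only."""

    def pre_update(self, key, value, layer_idx):
        # Called before the new states are stored. Default: pass through.
        return key, value

    def post_update(self, key, value, layer_idx):
        # Called on the stored states before they are returned. Default: pass through.
        return key, value


class RoundingProcessor(CacheProcessor):
    """Toy stand-in for quantization: rounds values before they are stored."""

    def pre_update(self, key, value, layer_idx):
        return [round(k, 1) for k in key], [round(v, 1) for v in value]


class ToyCache:
    """Toy cache (not the real Cache class) holding one list per layer."""

    def __init__(self, num_layers, processor=None):
        self.keys = [[] for _ in range(num_layers)]
        self.values = [[] for _ in range(num_layers)]
        self.processor = processor or CacheProcessor()

    def update(self, key, value, layer_idx):
        key, value = self.processor.pre_update(key, value, layer_idx)
        self.keys[layer_idx].extend(key)
        self.values[layer_idx].extend(value)
        return self.processor.post_update(
            self.keys[layer_idx], self.values[layer_idx], layer_idx
        )


cache = ToyCache(num_layers=2, processor=RoundingProcessor())
keys, values = cache.update([0.123, 0.456], [1.04, 2.0], layer_idx=0)
print(keys, values)
```

Because the processor only sees the states passing through `update`, the same processor object could in principle be attached to a static or a dynamic cache without either knowing about it.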

Layer patterns & method propagation

Every cache class is now defined just by a pattern_block, a tuple of CacheLayer subclasses that should repeat across depth. The base Cache instantiates layer_types = [pattern_block[i % len(pattern_block)] for i in range(config.num_layers)], so e.g. pattern_block = (StaticLayer, SlidingWindowLayer) yields an alternating Static/Sliding schedule.
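The expansion described above can be sketched in a few lines. This is an illustration rather than the PR's code: the layer classes are empty stand-ins and `build_layer_types` is a hypothetical helper name. Only the modulo indexing is taken from the description.

```python
class CacheLayer:
    """Stand-in base class; the real layers hold key/value states."""


class StaticLayer(CacheLayer):
    pass


class SlidingWindowLayer(CacheLayer):
    pass


def build_layer_types(pattern_block, num_layers):
    # Hypothetical helper: repeat the pattern across the model depth.
    return [pattern_block[i % len(pattern_block)] for i in range(num_layers)]


layer_types = build_layer_types((StaticLayer, SlidingWindowLayer), 5)
print([t.__name__ for t in layer_types])
```

With five layers and a two-element pattern, even-indexed layers get StaticLayer and odd-indexed layers get SlidingWindowLayer. A one-element pattern such as `(DynamicLayer,)` simply repeats the same type at every depth.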

Anything not found on the cache itself is forwarded automatically: if it’s an attribute, the cache returns the unique value across the first full pattern (or errors if they differ); if it’s a method, the cache builds a dispatcher that calls the method on every layer, threading a state object and respecting each layer’s return_early flag. This removes almost all boilerplate for ops like reset(), crop(), get_mask_sizes(), batch_split(), etc., and lets new layer types slot in without touching the main classes.
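A simplified sketch of this forwarding, assuming a `__getattr__`-based implementation and at least one layer. `LayeredCache` and the toy `DynamicLayer` below are illustrative names, and the sketch deliberately leaves out the threaded state object and the return_early flag, whose exact semantics are not spelled out here.

```python
class DynamicLayer:
    """Toy layer with one attribute and one method to forward."""

    is_compilable = False

    def __init__(self):
        self.tokens = [1, 2, 3]

    def reset(self):
        self.tokens.clear()


class LayeredCache:
    """Toy cache: forwards unknown attributes and methods to its layers."""

    def __init__(self, pattern_block, num_layers):
        self.pattern_block = pattern_block
        self.layers = [pattern_block[i % len(pattern_block)]() for i in range(num_layers)]

    def __getattr__(self, name):
        # Only reached when normal lookup on the cache itself fails.
        if name in ("layers", "pattern_block"):
            raise AttributeError(name)  # avoid recursion before __init__ has run
        first_pattern = self.layers[: len(self.pattern_block)]
        values = [getattr(layer, name) for layer in first_pattern]
        if callable(values[0]):
            # Method: build a dispatcher that calls it on every layer.
            def dispatcher(*args, **kwargs):
                return [getattr(layer, name)(*args, **kwargs) for layer in self.layers]

            return dispatcher
        # Attribute: must be identical across the first full pattern.
        if any(v != values[0] for v in values[1:]):
            raise AttributeError(f"layers disagree on {name!r}: {values}")
        return values[0]


cache = LayeredCache((DynamicLayer,), num_layers=4)
print(cache.is_compilable)
cache.reset()
print(all(layer.tokens == [] for layer in cache.layers))
```

The cache class itself never mentions `reset` or `is_compilable`, which is why a new layer type with extra methods would not require changes to it in this scheme.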

Progress

Part 1: Porting to layered classes. No new functionality, just refactor into new classes.

Part 2: Improvements, new incremental features

Mark cache_position as mandatory and start deprecation cycle.
Check if torch.cond optimization for small sentences speeds up torch.compiled generation (partial commit).
Check if casts are needed only for GPT-J, move that to model code. See this.
Refactor and document Llama4's ChunkedAttention and hybrid approach into layers.

Part 3: Config based cache composition

Design layer composition system based on configurations instead of Cache classes.
Port Hybrid caches to use file definitions.
Update documentation with new configuration options

Tests review

Note

The code builds on #37972


HuggingFaceDocBuilderDev · 2025-05-12T08:16:45Z

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.


…39106)

* Squash for refactor: Replace monolithic cache classes with modular LayeredCache (#38077)
  - Introduces CacheLayer and Cache base classes
  - Ports Static, Dynamic, Offloaded, Quantized, Hybrid, etc. to use layers
  - Implements method/attr dispatch across layers to reduce boilerplate
  - Adds CacheProcessor hooks for offloading, quantization, etc.
  - Updates and passes tests
* fix quantized, add tests
* remove CacheProcessorList
* raushan review, arthur review
* joao review: minor things
* remove cache configs, make CacheLayer a mixin (joaos review)
* back to storage inside Cache()
* remove cachebase for decorator
* no more __getattr__
* fix tests
* joaos review except docs
* fix ast deprecations for python 3.14: replace node.n by node.value and use `ast.Constant`
  More verbose exceptions in `fix_docstring` on docstring formatting issues.
* Revert "back to storage inside Cache()"
  This reverts commit 27916bc.
* cyril review
* simplify cache export
* fix lfm2 cache
* HybridChunked to layer
* BC proxy object for cache.key_cache[i]=...
* reorder classes
* bfff come on LFM2
* better tests for hybrid and hybridChunked
* complete coverage for hybrid chunked caches (prefill chunking)
* reimplementing HybridChunked
* cyril review
* fix ci
* docs for cache refactor
* docs
* oopsie
* oopsie
* fix after merge
* cyril review
* arthur review
* opsie
* fix lfm2
* opsie2


manueldeprada and others added 22 commits May 6, 2025 11:29

squash rebase

acb901e

ruff

4eacd7d

Merge branch 'main' into cache-fix2

05d2ce6

ruff

6b765bd

fix hybrid cache in torch compile

32cd5f6

Merge branch 'main' into cache-fix2

4ddd8d6

Merge branch 'main' of github.com:huggingface/transformers into cache…

9bfdcbc

…-fix2

joaos suggestions

ec26e69

Merge branch 'main' into cache-fix2

9858f2c

ruff

95805f3

Trigger Build

f08ea20

ruff

b3b0133

Merge branch 'main' into cache-fix2

016d9db

Update src/transformers/cache_utils.py

3de7505

Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com>

suggestions

214e517

Merge branch 'main' of github.com:huggingface/transformers into cache…

468d887

…-fix2

ruff

36e07a2

revert naming change

deacc67

Merge branch 'main' of github.com:huggingface/transformers into cache…

8548b8f

…-fix2

Merge remote-tracking branch 'upstream/main' into cache-fix2

326d2b2

cache refactor initial commit

177ac80

ruff

689d9c2

manueldeprada added 2 commits May 12, 2025 16:11

temp

e272712

Refactor MambaCache to modeling_mamba.py (parity with Zamba)

1755d6f

manueldeprada mentioned this pull request May 12, 2025

Refactor MambaCache to modeling_mamba.py #38086

Merged

manueldeprada and others added 3 commits May 12, 2025 17:16

ruff

93f7b8a

Merge branch 'main' into main

be81dae

fix dummies

dbdf2cc

manueldeprada mentioned this pull request May 12, 2025

New cache tests and modular Hybrid Cache #37972

Merged

manueldeprada and others added 19 commits June 26, 2025 19:13

fix config docs

8c65d29

fix docs

fea393f

add export info

0182047

Merge branch 'modular_falcon_mamba' into cache-refactor

097e161

Merge branch 'modular_falcon_mamba' into main

1887d53

merge modular falcon branch

59be6d6

Merge branch 'main' into main

2477ebb

oopsie

abb9cd3

Merge branch 'main' of https://github.com/manueldeprada/transformers …

203f103

…into main

Merge branch 'main' into cache-refactor

d1e6941

Merge branch 'main' of github.com:huggingface/transformers into cache…

f82a1b5

…-refactor

update

5e21827

Merge branch 'main' into cache-refactor

40f7b6f

oopsie

5ea9be8

ruff

2bdf6dc

remove stateful propagate, remove bloat

507ac93

remove dead code

bf63614

Merge branch 'main' of github.com:huggingface/transformers into cache…

17b10ce

…-refactor

fix docstring

46ca0da

manueldeprada mentioned this pull request Jun 29, 2025

[cache refactor] Move all the caching logic to a per-layer approach #39106

Merged

manueldeprada closed this Sep 18, 2025
