@manueldeprada manueldeprada commented May 12, 2025

Overview

This PR introduces a layered architecture for the cache system, allowing for better composition of caches.

Current class structure

graph TD
    B[Cache]
    B --> F[StaticCache]
    B --> I[HybridCache]
    B --> G[DynamicCache]
    F --> H[SlidingWindowCache]
    F --> P[OffloadedStaticCache]
    B --> M[EncoderDecoderCache]
    B --> N[HybridChunkedCache]
    N --> O[OffloadedHybridCache]
    G --> J[QuantizedCache]
    J --> Q[QuantoQuantizedCache]
    J --> R[HQQQuantizedCache]
    G --> K[OffloadedCache]
    B --> L[MambaCache]

Goals

  1. Replace all existing cache implementations with layered caches
  2. Enable model-specific cache configurations through layer composition
  3. Improve test coverage and maintainability
  4. Reduce code duplication through shared layer implementations

New structure

  • Cache
    • Keeps a layers list of CacheLayer instances.
    • Dynamically delegates method and attribute calls to the layers (e.g., crop(), reset(), is_compilable).
  • CacheLayer
    • Base type for all layers
    • Examples:
      • StaticLayer
      • DynamicLayer
      • ...
  • StaticCache and DynamicCache are now empty shells, kept for backward compatibility (BC), that only define their layer type:
    class DynamicCache(Cache):
        pattern_block = (DynamicLayer,)
  • Offloading, quantization, etc. are pluggable CacheProcessors that can wrap any cache, static or dynamic.
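
To make the processor idea concrete, here is a minimal sketch of how a pluggable processor could wrap a cache's update step. It is an illustration only, not this PR's code: the hook names pre_update and post_update, the ToyCache class, and the rounding "quantizer" are all assumptions, and plain Python lists stand in for tensors.

```python
class CacheProcessor:
    """Hypothetical hook interface: runs before and after each cache update."""

    def pre_update(self, key, value, layer_idx):
        return key, value

    def post_update(self, key, value, layer_idx):
        return key, value


class RoundingProcessor(CacheProcessor):
    """Toy stand-in for quantization: rounds entries to integers before storage."""

    def pre_update(self, key, value, layer_idx):
        return [round(k) for k in key], [round(v) for v in value]


class ToyCache:
    """Stores per-layer keys/values in lists; the processor is optional."""

    def __init__(self, num_layers, processor=None):
        self.keys = [[] for _ in range(num_layers)]
        self.values = [[] for _ in range(num_layers)]
        self.processor = processor or CacheProcessor()

    def update(self, key, value, layer_idx):
        # The processor sees the new states first, then the stored result.
        key, value = self.processor.pre_update(key, value, layer_idx)
        self.keys[layer_idx].extend(key)
        self.values[layer_idx].extend(value)
        return self.processor.post_update(
            self.keys[layer_idx], self.values[layer_idx], layer_idx
        )


cache = ToyCache(num_layers=2, processor=RoundingProcessor())
k, v = cache.update([0.2, 1.8], [2.9, 4.1], layer_idx=0)
```

Because the cache only talks to the hook interface, the same processor can in principle be attached to a static or a dynamic cache without a dedicated subclass for each combination.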

Layer patterns & method propagation

Every cache class is now defined just by a pattern_block, a tuple of CacheLayer subclasses that should repeat across depth. The base Cache instantiates layer_types = [pattern_block[i % len(pattern_block)] for i in range(config.num_layers)], so e.g. pattern_block = (StaticLayer, SlidingWindowLayer) yields an alternating Static/Sliding schedule.
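
The expansion rule above can be checked with a small standalone sketch. The layer classes here are empty placeholders and expand_pattern is a hypothetical helper name; only the list comprehension is taken from the description.

```python
class CacheLayer:
    """Placeholder base type for the sketch."""


class StaticLayer(CacheLayer):
    pass


class SlidingWindowLayer(CacheLayer):
    pass


def expand_pattern(pattern_block, num_layers):
    # Repeat the pattern cyclically until every model layer has a layer type.
    return [pattern_block[i % len(pattern_block)] for i in range(num_layers)]


layer_types = expand_pattern((StaticLayer, SlidingWindowLayer), num_layers=5)
names = [layer_type.__name__ for layer_type in layer_types]
```

With five layers and a two-element pattern, the schedule alternates and the last repetition of the pattern is simply cut short.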

Anything not found on the cache itself is forwarded automatically: if it’s an attribute, the cache returns the unique value across the first full pattern (or errors if they differ); if it’s a method, the cache builds a dispatcher that calls the method on every layer, threading a state object and respecting each layer’s return_early flag. This removes almost all boilerplate for ops like reset(), crop(), get_mask_sizes(), batch_split(), etc., and lets new layer types slot in without touching the main classes.
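
A simplified sketch of this forwarding, assuming it is built on Python's __getattr__ fallback. It deliberately leaves out the state object and the return_early flag, whose exact semantics are not spelled out here, and just collects each layer's result in a list. The toy layers and their fields are hypothetical.

```python
class DynamicLayer:
    is_compilable = False

    def __init__(self):
        self.tokens = []

    def reset(self):
        self.tokens.clear()

    def get_seq_length(self):
        return len(self.tokens)


class StaticLayer(DynamicLayer):
    is_compilable = True


class Cache:
    pattern_block = (DynamicLayer,)

    def __init__(self, num_layers):
        block = self.pattern_block
        self.layers = [block[i % len(block)]() for i in range(num_layers)]

    def __getattr__(self, name):
        # Only reached when normal attribute lookup on the cache fails.
        if name == "layers":
            raise AttributeError(name)  # avoid recursion before __init__ finishes
        first_pattern = self.layers[: len(self.pattern_block)]
        values = [getattr(layer, name) for layer in first_pattern]
        if all(callable(v) for v in values):
            # Method: build a dispatcher that calls it on every layer.
            def dispatcher(*args, **kwargs):
                return [getattr(layer, name)(*args, **kwargs) for layer in self.layers]

            return dispatcher
        # Attribute: the layers of the first full pattern must agree.
        if any(v != values[0] for v in values[1:]):
            raise AttributeError(f"layers disagree on attribute {name!r}")
        return values[0]


class MixedCache(Cache):
    pattern_block = (DynamicLayer, StaticLayer)


cache = Cache(num_layers=3)
cache.layers[0].tokens.append("tok")
lengths_before = cache.get_seq_length()
cache.reset()
lengths_after = cache.get_seq_length()
```

Here Cache defines neither reset() nor get_seq_length(), yet both work because the lookup falls through to the layers. On MixedCache, reading is_compilable raises, since the two layer types in the pattern disagree.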

Progress

Part 1: Porting to layered classes. No new functionality, just refactor into new classes.

  • New base CacheLayer and (layered) Cache classes
  • StaticCache port
  • DynamicCache port and tests
  • Strip functionality from DynamicCache and StaticCache into the layers, add error handling.
  • Remove SinkCache
  • Define hook system for offloading, quantization without specialized classes.
  • OffloadedCache
  • SlidingWindowCache port
  • Make the cache exportable by initializing the correct layer types
  • QuantizedCache port
  • EncoderDecoderCache port
  • HybridCache port
  • Replace Cache with LayeredCache
  • Test mllama and layerskip-llama.
  • Bring back SinkCache as custom decode on the Hub.
  • Test "Avoid incorrect generations for KV caches containing more than sliding_window tokens" (#38156) on SlidingCache
  • Run and fix all model tests.
  • Run some benchmarks to confirm there is no performance degradation.

Part 2: Improvements, new incremental features

  • Mark cache_position as mandatory and start deprecation cycle.
  • Check if torch.cond optimization for small sentences speeds up torch.compiled generation (partial commit).
  • Check if casts are needed only for GPT-J, move that to model code. See this.
  • Refactor and document Llama4's ChunkedAttention and hybrid approach into layers.

Part 3: Config based cache composition

  • Design layer composition system based on configurations instead of Cache classes.
  • Port Hybrid caches to use file definitions.
  • Update documentation with new configuration options

Tests review

  • StaticCache tests
  • DynamicCache tests
  • OffloadedCache tests
  • SlidingWindowCache tests
  • QuantizedCache tests
  • MambaCache tests
  • EncoderDecoderCache tests
  • HybridCache tests

Note

The code builds on #37972

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

manueldeprada added a commit to manueldeprada/transformers that referenced this pull request Jun 29, 2025
…yeredCache (huggingface#38077)

- Introduces CacheLayer and Cache base classes
- Ports Static, Dynamic, Offloaded, Quantized, Hybrid, etc. to use layers
- Implements method/attr dispatch across layers to reduce boilerplate
- Adds CacheProcessor hooks for offloading, quantization, etc.
- Updates and passes tests
Cyrilvallez pushed a commit that referenced this pull request Jul 22, 2025
…39106)

* Squash for refactor: Replace monolithic cache classes with modular LayeredCache (#38077)

- Introduces CacheLayer and Cache base classes
- Ports Static, Dynamic, Offloaded, Quantized, Hybrid, etc. to use layers
- Implements method/attr dispatch across layers to reduce boilerplate
- Adds CacheProcessor hooks for offloading, quantization, etc.
- Updates and passes tests

* fix quantized, add tests

* remove CacheProcessorList

* raushan review, arthur review

* joao review: minor things

* remove cache configs, make CacheLayer a mixin (joaos review)

* back to storage inside Cache()

* remove cachebase for decorator

* no more __getattr__

* fix tests

* joaos review except docs

* fix ast deprecations for python 3.14: replace node.n by node.value and use `ast.Constant`

More verbose exceptions in `fix_docstring` on docstring formatting issues.

* Revert "back to storage inside Cache()"

This reverts commit 27916bc.

* cyril review

* simplify cache export

* fix lfm2 cache

* HybridChunked to layer

* BC proxy object for cache.key_cache[i]=...

* reorder classes

* bfff come on LFM2

* better tests for hybrid and hybridChunked

* complete coverage for hybrid chunked caches (prefill chunking)

* reimplementing HybridChunked

* cyril review

* fix ci

* docs for cache refactor

* docs

* oopsie

* oopsie

* fix after merge

* cyril review

* arthur review

* opsie

* fix lfm2

* opsie2
zaristei pushed a commit to zaristei/transformers that referenced this pull request Sep 9, 2025
…uggingface#39106)