fix: sliding window KV cache for Gemma-3 models (issue #2145) #2153
+62
−12
fixes #2145
Overview
This PR addresses a critical issue affecting Gemma-3 models (4B-IT, 12B-IT, 27B-IT) that caused them to produce gibberish or repetitive text after approximately 800-1000 tokens of continuous long-form generation. The fix introduces correct sliding-window KV cache management for models that use hybrid attention architectures.
Key Fixes
1. Sliding Window KV Cache Limiting (`litgpt/model.py` - `build_kv_cache`): sliding-window layers now allocate a KV cache of `sliding_window_size` (1024 tokens) instead of the full sequence length. A standalone sketch of the idea follows this list.
2. Circular Buffer Implementation (`litgpt/model.py` - `KVCache` class)
3. Attention Mask Dimension Fix (`litgpt/model.py` - `CausalSelfAttention.forward`)
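To make the shape of these three changes concrete, here is a minimal, self-contained sketch of a circular-buffer KV cache bounded by the sliding window, with a mask built against the cache length rather than the full sequence length. This is not the PR's actual code: the class and method names (`SlidingWindowKVCache`, `update`, `mask_for`) are illustrative, not litgpt's API, and the real changes live in `build_kv_cache`, `KVCache`, and `CausalSelfAttention.forward` in `litgpt/model.py`.

```python
"""Illustrative sketch only; names and shapes are assumptions, not litgpt's code."""
import torch


class SlidingWindowKVCache:
    def __init__(self, batch_size: int, n_heads: int, window_size: int, head_dim: int):
        shape = (batch_size, n_heads, window_size, head_dim)
        # Fix (1): buffers hold `window_size` positions, not max_seq_length.
        self.k = torch.zeros(shape)
        self.v = torch.zeros(shape)
        self.window_size = window_size
        # Absolute token position stored in each slot (-1 = empty).
        self.slot_pos = torch.full((window_size,), -1, dtype=torch.long)

    def update(self, pos: int, k: torch.Tensor, v: torch.Tensor):
        """Write the K/V of the token at absolute position `pos`.

        k, v: (batch, n_heads, 1, head_dim)
        """
        slot = pos % self.window_size  # Fix (2): circular overwrite of the oldest slot
        self.k[:, :, slot : slot + 1] = k
        self.v[:, :, slot : slot + 1] = v
        self.slot_pos[slot] = pos
        return self.k, self.v

    def mask_for(self, pos: int) -> torch.Tensor:
        """Boolean mask over cache slots for a query at absolute position `pos`.

        Fix (3): the mask has length `window_size` (the cache length), so it
        matches the cached keys/values instead of the full sequence length.
        """
        in_window = self.slot_pos > pos - self.window_size
        causal = (self.slot_pos >= 0) & (self.slot_pos <= pos)
        return causal & in_window


if __name__ == "__main__":
    cache = SlidingWindowKVCache(batch_size=1, n_heads=2, window_size=4, head_dim=8)
    for pos in range(6):  # decode 6 tokens with a window of 4
        k = torch.randn(1, 2, 1, 8)
        v = torch.randn(1, 2, 1, 8)
        keys, values = cache.update(pos, k, v)
        mask = cache.mask_for(pos)
        # From position 4 onward, the slot holding the oldest token is
        # overwritten and the mask only admits the last 4 positions.
        print(f"pos={pos} attendable slots: {mask.tolist()}")
```

In the sketch, the per-slot position bookkeeping is what keeps the mask correct after the buffer wraps around; without it, stale entries from earlier tokens would remain attendable, which is the kind of silent corruption that shows up only after the window fills (roughly the 800-1000 token mark reported in #2145).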
Why This Matters
The Hidden Bug
This issue was difficult to detect in production:
Testing Summary:
- Gemma-3-4B-IT: All three long-form test prompts passed.
- Llama-3.2-3B: All three long-form test prompts passed.
- Configuration: `chat_generate()` (low-level), `max_new_tokens=1500`
Known Limitations
- `LLM.generate()` (high-level API) still requires integration work.
- The `chat_generate()` API is fully functional.
- Workaround: call `chat_generate()` directly (example in the full PR description).
Notes
- `litgpt/model.py` (~40 lines)
- `runlitgpt_v1.py` - this file uses the high-level API implementation. There are still some bugs in `litgpt/api.py`, which will be addressed in another PR.
- `runlitgpt.py` - this code uses the low-level API implementation and now works for both models.