
Conversation

Chelsi-create commented Nov 2, 2025

fixes #2145

Overview

This PR addresses a critical issue affecting Gemma-3 models (4B-IT, 12B-IT, 27B-IT) that caused them to produce gibberish or repetitive text after approximately 800-1000 tokens during continuous long-form generation.
The fix introduces correct sliding window KV cache management for models using hybrid attention architectures.

Key Fixes

1. Sliding Window KV Cache Limiting (litgpt/model.py - build_kv_cache)

  • Use sliding_window_size (1024 tokens) for sliding-window layers instead of the full sequence length (see the sketch after this list).
  • Allocate correct cache size during initialization.
  • Global attention layers continue to use the full-sequence cache.
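A minimal sketch of the cache-sizing idea, assuming a per-layer flag for sliding-window attention and a sliding_window_size value (these names and the helper below are illustrative, not litgpt's actual build_kv_cache code):

```python
# Hedged sketch, not litgpt's implementation: pick a per-layer cache length so
# sliding-window layers never allocate more than the window.
import torch

def kv_cache_shape(
    batch_size: int,
    n_heads: int,
    head_size: int,
    max_seq_length: int,
    sliding_window_size: int | None,
    uses_sliding_window: bool,
) -> tuple[int, int, int, int]:
    if uses_sliding_window and sliding_window_size is not None:
        # A sliding-window layer only attends to the last `sliding_window_size`
        # tokens, so a longer cache would just waste memory.
        cache_len = min(max_seq_length, sliding_window_size)
    else:
        # Global-attention layers still need room for the full sequence.
        cache_len = max_seq_length
    return (batch_size, n_heads, cache_len, head_size)

# Example: a sliding-window layer with a 1024-token window gets a 1024-slot
# cache even when max_seq_length is 8192; a global layer gets the full 8192.
k = torch.zeros(kv_cache_shape(1, 8, 256, 8192, 1024, True))
v = torch.zeros(kv_cache_shape(1, 8, 256, 8192, 1024, True))
```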

2. Circular Buffer Implementation (litgpt/model.py - KVCache class)

  • Implement modulo-based position indexing for sequences beyond the window size.
  • Ensure the KV cache never exceeds 1024 tokens for sliding-window layers (see the sketch below).
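A minimal sketch of the circular-buffer technique (an illustration, not the litgpt KVCache class itself): absolute token positions are mapped to buffer slots with a modulo, so the buffer stays at a fixed size while always holding the most recent window of tokens.

```python
# Illustrative circular-buffer KV cache; tensors use the usual
# (batch, heads, sequence, head_size) layout.
import torch

class CircularKVCache(torch.nn.Module):
    def __init__(self, batch_size, n_heads, cache_len, head_size, device=None, dtype=None):
        super().__init__()
        shape = (batch_size, n_heads, cache_len, head_size)
        self.register_buffer("k", torch.zeros(shape, device=device, dtype=dtype))
        self.register_buffer("v", torch.zeros(shape, device=device, dtype=dtype))
        self.cache_len = cache_len

    def forward(self, input_pos: torch.Tensor, k: torch.Tensor, v: torch.Tensor):
        # Wrap absolute positions into the fixed-size buffer: with a 1024-slot
        # cache, position 1024 overwrites slot 0, 1025 overwrites slot 1, etc.
        slots = input_pos % self.cache_len
        self.k.index_copy_(2, slots, k)
        self.v.index_copy_(2, slots, v)
        return self.k, self.v
```

With a 1024-slot buffer, writing token position 1500 lands in slot 1500 % 1024 = 476, so the cache never grows past the window.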

3. Attention Mask Dimension Fix (litgpt/model.py - CausalSelfAttention.forward)

  • Adjust attention mask dimensions to align with actual KV cache sizes.
  • Prevent dimension mismatch errors when the cache is smaller than the expected sequence length (see the sketch below).
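A simplistic sketch of the dimension alignment, under assumed mask shapes; the real fix in CausalSelfAttention.forward also has to keep the mask consistent with which absolute positions currently occupy the cache slots:

```python
# Illustration only: trim the mask's key dimension to the actual cache length
# so the attention computation sees matching sizes.
import torch

def align_mask_to_cache(mask: torch.Tensor | None, cache_len: int) -> torch.Tensor | None:
    # `mask` is assumed to be (B, 1, T, max_seq_length); the cached keys/values
    # of a sliding-window layer only cover `cache_len` positions.
    if mask is not None and mask.size(-1) > cache_len:
        mask = mask[..., -cache_len:]  # keep only the most recent window of columns
    return mask
```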

Why This Matters

The Hidden Bug

This issue was difficult to detect in production:

  • Interactive chat sessions (100–300 tokens) always reset the KV cache between turns, so the problem never surfaced.
  • Continuous generation (>1024 tokens) triggered the bug, leading to repetitive text.

Testing Summary:

  • Gemma-3-4B-IT: All three long-form test prompts passed.
  • Llama-3.2-3B: All three long-form test prompts passed.
  • Before fix: severe repetition (150-200 token loops).
  • After fix: coherent outputs with limited repetition (7-9 tokens).

Configuration:

  • API: chat_generate() (low-level)
  • max_new_tokens=1500

Known Limitations

  • LLM.generate() (high-level API) still requires integration work.
  • Low-level chat_generate() API is fully functional.
  • Temporary workaround: use chat_generate() directly (example in full PR description).

Notes

  • Single file modified: litgpt/model.py (~40 lines)
  • No breaking changes: Fully backward compatible
  • Validated: Affected models tested

runlitgpt_v1.py - this file uses the high-level API implementation. There are still some bugs in litgpt/api.py, which will be addressed in another PR.
runlitgpt.py - this file uses the low-level API implementation and now works for both models.

Chelsi-create changed the title from "Fix sliding window KV cache for Gemma-3 models" to "fix: sliding window KV cache for Gemma-3 models (issue #2145)" on Nov 2, 2025
KaelanDt (Contributor) commented Nov 5, 2025

Hi @Chelsi-create, thank you for the PR!
There are still a few tests failing related to your changes; could you check that the tests pass?
https://github.com/Lightning-AI/litgpt/actions/runs/19088536192/job/54534011522?pr=2153

Successfully merging this pull request may close these issues.

litgpt model responses using simple "out-of-box" code example become incoherent / repetitive after a few hundred tokens
