Research-quality implementation of novel KV cache eviction policies for efficient long-context LLM inference. This project implements and comprehensively evaluates attention-score-based eviction (H2O), semantic clustering, and learned eviction policies to achieve 50-87.5% KV cache compression while maintaining generation quality.
This repository contains a complete experimental framework for evaluating KV cache compression strategies, addressing the critical memory bottleneck in long-context language model inference. Through 55 experimental configurations across 5 setups, we demonstrate that attention-score-based eviction (H2O) achieves superior compression-performance trade-offs compared to baseline methods.
- 8x Compression (87.5% memory savings): H2O-256-80/20 achieves 19.23 perplexity (only 12.3% degradation from baseline)
- 4x Compression (75% savings): All methods achieve baseline performance (17.12 perplexity) with zero degradation
- Computational Overhead: H2O adds only 10.33% overhead, making it practical for real-time deployment
- Document Type Stability: Consistent performance across narrative, code, and QA domains
This work addresses five key research questions:
- RQ1: Can attention-score-based eviction (H2O) maintain quality with 50-80% cache compression?
- RQ2: Does semantic clustering improve upon pure attention-based eviction?
- RQ3: Can learned eviction policies outperform hand-crafted heuristics?
- RQ4: How do different document types (code, narrative, QA) affect optimal eviction strategies?
- RQ5: What is the trade-off between eviction overhead and memory savings?
# Clone the repository
git clone https://github.com/suhasramanand/kv-cache-compression.git
cd kv-cache-compression
# Install dependencies
pip install -r requirements.txtfrom transformers import AutoTokenizer, AutoModelForCausalLM
from kv_cache_compression.ipynb import H2OCache
# Load model
model = AutoModelForCausalLM.from_pretrained("TinyLlama/TinyLlama-1.1B-Chat-v1.0")
tokenizer = AutoTokenizer.from_pretrained("TinyLlama/TinyLlama-1.1B-Chat-v1.0")
# Create H2O cache with 8x compression
cache = H2OCache(
budget=256, # Cache budget (8x compression for 2048 context)
heavy_ratio=0.8, # 80% for heavy hitters (attention sinks)
recent_ratio=0.2, # 20% for recent tokens
per_layer=True # Per-layer cache management
)
cache.initialize_trackers(model.config.num_hidden_layers)
# Generate with compressed cache
input_ids = tokenizer("Your prompt here", return_tensors="pt").input_ids
outputs = model.generate(
input_ids=input_ids,
past_key_values=cache,
max_new_tokens=256,
output_attentions=True # Required for H2O attention tracking
)| Method | Budget | Compression | Perplexity | Memory (GB) | Overhead |
|---|---|---|---|---|---|
| Full Cache | 2048 | 1.0x | 17.12 | 0.8 | - |
| H2O-256-80/20 | 256 | 8.0x | 19.23 | 0.8 | 10.33% |
| LRU-256 | 256 | 8.0x | 18.96 | 0.8 | 6.20% |
| SemanticCluster-256-K32 | 256 | 8.0x | 29.28 | 0.8 | 97.84% |
| SlidingWindow-256 | 256 | 8.0x | 44.38 | 0.8 | 0.31% |
| Method | Eviction Time (ms) | Overhead % | Tokens/Sec |
|---|---|---|---|
| SlidingWindow | 0.017 ± 0.003 | 0.31% | 33,927 |
| LRU | 0.361 ± 0.013 | 6.20% | 31,477 |
| H2O | 0.620 ± 0.022 | 10.33% | 30,563 |
| SemanticCluster | 295.603 ± 24.233 | 97.84% | 607 |
Measured through 100 runs per method (10 runs for SemanticCluster)
| Document Type | H2O-256-80/20 | SemanticCluster-256-K32 | LRU-256 | SlidingWindow-256 |
|---|---|---|---|---|
| Narrative (WikiText-2) | 17.21 | 29.28 | 18.96 | 44.38 |
| Code (Python) | 1.39 | 1.39 | 1.39 | N/A |
| QA (narrativeqa) | 415.59* | 1611.73* | 165.71* | N/A |
Note: Perplexity is not appropriate for QA evaluation; F1/EM metrics recommended
- SlidingWindowCache: Simple truncation-based eviction
- LRUCache: Least-recently-used eviction with hash table
- H2OCache: Attention-score-based eviction with heavy-hitter and recent token pools
- SemanticClusterCache: K-means clustering of key vectors for semantic-aware eviction
- LearnedEvictionCache: MLP-based learned policy trained on H2O oracle labels
- Per-layer cache management: Independent cache budgets per transformer layer
- Attention tracking: Cumulative attention score accumulation for H2O
- Memory profiling: Peak GPU memory measurement
- Comprehensive evaluation: Perplexity, compression ratio, overhead analysis
- Visualization: Attention pattern analysis, cluster quality metrics
kv-cache-compression/
├── kv_cache_compression.ipynb # Main Jupyter notebook with all experiments
├── requirements.txt # Python dependencies
├── README.md # This file
├── .gitignore # Git ignore rules
├── images/ # Visualization figures
│ ├── budget_analysis_*.png
│ ├── strategy_comparison_*.png
│ ├── h2o_config_analysis_*.png
│ └── ...
└── results/ # Experimental results
└── latest/
├── setup1_baseline_*.csv
├── setup2_h2o_*.csv
├── setup3_semantic_*.csv
├── setup4_sequence_length_*.csv
├── overhead_measurements.json
└── ablation_study.json
The notebook contains 5 comprehensive experimental setups:
- Setup 1: Baseline Configurations - SlidingWindow and LRU across 4 budgets (256, 512, 1024, 2048)
- Setup 2: H2O Configurations - 20 configurations across 4 budgets and 5 ratio settings (50/50, 70/30, 30/70, 80/20, 20/80)
- Setup 3: Semantic Clustering - 12 configurations across 3 budgets and 4 cluster counts (K=8, 16, 32, 64)
- Setup 4: Sequence Length Analysis - Scalability analysis for 1024, 2048, 4096 sequence lengths
- Setup 5: Comprehensive Evaluation - Representative configurations from all setups
- Experiment A: Computational Overhead - Actual timing measurements (100 runs per method)
- Experiment B: Document Type Evaluation - Performance across narrative, code, and QA datasets
- Analysis D: Attention Pattern Visualization - Attention sink phenomenon analysis
- Analysis E: Cluster Quality Metrics - Silhouette score correlation with perplexity
- At 8x compression, H2O-256-80/20 achieves 19.23 perplexity vs 44.38 for SlidingWindow
- Optimal ratio is 80/20 (heavy/recent), demonstrating importance of attention sinks
- Ablation study shows 30,219% degradation when removing attention tracking
- Heavy hitters are far more important than recency alone
- Weak correlation (0.131) between cluster quality and perplexity
- Very high computational overhead (97.84%) makes it impractical
- K-means optimizes for distance, not token importance
- All methods achieve very low perplexity (1.39) on code
- Structured syntax makes eviction strategy less critical for code generation
- H2O adds only 10.33% overhead despite attention tracking
- SlidingWindow has minimal overhead (0.31%) but poor quality at high compression
- Model: TinyLlama-1.1B-Chat-v1.0 (22 layers, 32 heads, 2048 hidden dim)
- Context Length: 2048 tokens (evaluated up to 4096)
- GPU: Tesla T4 (16GB VRAM) or compatible
- Framework: PyTorch 2.1+, HuggingFace Transformers 4.40+
The system uses output_attentions=True during generation to track attention weights. For T4 GPUs, we use SDPA (scaled dot product attention) which supports attention output, avoiding Flash Attention 2 incompatibility.
Uses torch.index_select with sorted indices to avoid creating unnecessary tensor copies during eviction operations.
If you use this code in your research, please cite:
@misc{kv-cache-compression2024,
title={KV Cache Compression for Long-Context LLM Inference: A Comprehensive Evaluation},
author={Suhas Reddy Baluvanahally Ramananda and Gautam Raju and Namratha Tiptur Manjunath},
year={2024},
howpublished={\url{https://github.com/suhasramanand/kv-cache-compression}}
}- H2O: Heavy-Hitter Oracle for Efficient Generative Inference of Large Language Models
- LongBench: A Bilingual, Multitask Benchmark for Long Context Understanding
- TinyLlama: An Open-Source Small Language Model
This is a research project for a Master's thesis. Contributions, issues, and pull requests are welcome! Please feel free to open an issue for bugs or feature requests.
This project is licensed under the MIT License - see the LICENSE file for details.
- Suhas Reddy Baluvanahally Ramananda - Northeastern University
- Gautam Raju - Northeastern University
- Namratha Tiptur Manjunath - Syracuse University
- HuggingFace for the Transformers library
- The H2O paper authors for the attention-score-based eviction concept
- TinyLlama team for the open-source model
Note: This project is part of ongoing research. Results may vary with different models, hardware, or evaluation settings. For production deployment, we recommend running your own evaluations with your specific use case.