
nanoGPT-inference-optimization

A specialized fork of nanoGPT focused on inference optimization, starting with Key-Value (KV) Caching.

Benchmark Results

🚀 Overview

Large Language Models (LLMs) like GPT-2 are autoregressive: they generate text one token at a time. A naive implementation re-computes the attention keys and values for every previous token at each step, so the cost of producing a single new token grows as O(N²) in the sequence length.

This repository implements KV Caching, a critical optimization that stores the Key and Value vectors of past tokens. Each step then only computes the query, key, and value for the new token and attends over the cached keys and values, reducing the per-token cost to O(N).
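As a minimal PyTorch sketch of the idea (single attention head, toy weights, not the repository's code), the attention output for the new token is identical either way; only the amount of recomputation differs:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
d, T = 64, 16                                        # head dimension, tokens generated so far
W_q, W_k, W_v = (torch.randn(d, d) / d**0.5 for _ in range(3))  # toy projection weights
history = torch.randn(T, d)                          # hidden states of the previous tokens
x_new = torch.randn(1, d)                            # hidden state of the newest token
q = x_new @ W_q

# Without a cache: every step re-projects K and V for the entire history (quadratic work per token).
full = torch.cat([history, x_new])
out_slow = F.softmax(q @ (full @ W_k).T / d**0.5, dim=-1) @ (full @ W_v)

# With a KV cache: the history's K/V were stored at earlier steps; only the new token is projected.
K_cache, V_cache = history @ W_k, history @ W_v      # in practice built up incrementally
K_cache = torch.cat([K_cache, x_new @ W_k])
V_cache = torch.cat([V_cache, x_new @ W_v])
out_fast = F.softmax(q @ K_cache.T / d**0.5, dim=-1) @ V_cache

print(torch.allclose(out_slow, out_fast, atol=1e-6))  # True: same output, far less recomputation
```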

📊 Benchmark Analysis

Per-Token Latency Trace

To visualize why this matters, here is the latency of every single token generated in a sequence (batch size 1), averaged over 5 runs:

Token Latency

  • No Cache (Red): Latency increases with every new token (linear growth). The model has to re-process the entire history every time.
  • KV Cache (Blue): After the first token (prefill), latency is flat and constant. The model only processes the one new token.

We benchmarked the implementation on a standard CPU environment using GPT-2 (124M). The benchmark script (benchmark_kv.py) measures three key metrics across various batch sizes:

  1. TTFT (Time To First Token): The latency to process the prompt and generate the first token (prefill phase).
  2. TTPT (Time Per Token): The average latency to generate each subsequent token (decoding phase).
  3. Throughput: The total number of tokens generated per second.
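benchmark_kv.py is the source of truth for these numbers; the snippet below is only a minimal sketch of how the three metrics can be measured, assuming nanoGPT's generate(idx, max_new_tokens) interface:

```python
import time
import torch

def benchmark(model, prompt_ids, max_new_tokens=100, warmup=3):
    """Rough TTFT / TTPT / throughput measurement from wall-clock time."""
    with torch.no_grad():
        for _ in range(warmup):                          # stabilize kernels/allocators before timing
            model.generate(prompt_ids, max_new_tokens=8)

        t0 = time.perf_counter()
        model.generate(prompt_ids, max_new_tokens=1)      # prefill + first token
        ttft = time.perf_counter() - t0

        t0 = time.perf_counter()
        model.generate(prompt_ids, max_new_tokens=max_new_tokens)
        total = time.perf_counter() - t0

    ttpt = (total - ttft) / (max_new_tokens - 1)          # approximate average decode latency
    throughput = prompt_ids.size(0) * max_new_tokens / total  # tokens/sec across the batch
    return ttft, ttpt, throughput
```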

Results (CPU)

Benchmark Results

Key Findings:

  • Throughput: KV Cache (Blue) maintains high throughput as batch size increases, reaching ~635 tokens/sec. Without the cache (Red), throughput collapses because the O(N²) re-computation dominates.
  • TTPT Stability: With KV Cache, the time per token remains constant. Without it, TTPT grows linearly with sequence length, making generation progressively slower.
  • Memory: The KV cache consumes additional memory (Purple line), which grows linearly with batch size and sequence length. This is the trade-off for speed.

🧠 Visualization

To understand how the KV cache works, we built an animated dashboard (visualize_kv.py).

KV Dashboard

  • Query (Top): The vector for the current token being generated.
  • Key Cache (Left): The stored keys for all previous tokens. The heatmap shows the activation patterns.
  • Attention (Right): The computed attention scores. Note how the model "attends" to specific past tokens.
  • Value Cache (Center): The stored values that will be weighted by the attention scores to form the output.
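visualize_kv.py produces the animated version; the following is a minimal static sketch of the same panels, using made-up tensors in place of a real cache:

```python
import matplotlib.pyplot as plt
import torch

# Hypothetical per-step data: cached keys (T, head_dim) and the attention
# weights the current query assigns to them (T,).
T, d_head = 32, 64
K_cache = torch.randn(T, d_head)
att = torch.softmax(torch.randn(T), dim=-1)

fig, (ax_k, ax_a) = plt.subplots(1, 2, figsize=(10, 4))
ax_k.imshow(K_cache.numpy(), aspect="auto", cmap="viridis")
ax_k.set_title("Key Cache (tokens x head dim)")
ax_a.bar(range(T), att.numpy())
ax_a.set_title("Attention over cached tokens")
plt.tight_layout()
plt.savefig("kv_step.png")
```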

🛠️ Implementation Details

The core changes were made in model.py:

  1. CausalSelfAttention.forward: Modified to accept past_kv (the cache) and return new_kv (the updated cache).
  2. GPT.generate: Updated the generation loop to:
    • Pass the full prompt for the first step (prefill).
    • Pass only the last generated token for subsequent steps (decoding).
    • Maintain the past_kv state across steps.
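The exact code lives in model.py; below is a condensed sketch of that loop, assuming a hypothetical model(idx, past_kv=...) -> (logits, past_kv) signature and using greedy decoding for brevity (nanoGPT samples from the distribution instead):

```python
import torch

@torch.no_grad()
def generate_with_cache(model, idx, max_new_tokens):
    """Prefill/decode pattern; the model signature here is an assumption, not the fork's exact API."""
    past_kv = None
    for _ in range(max_new_tokens):
        # Prefill: feed the whole prompt once. Decode: feed only the newest token.
        idx_cond = idx if past_kv is None else idx[:, -1:]
        logits, past_kv = model(idx_cond, past_kv=past_kv)  # each attention layer appends its new k, v
        next_id = torch.argmax(logits[:, -1, :], dim=-1, keepdim=True)  # greedy for simplicity
        idx = torch.cat([idx, next_id], dim=1)
    return idx
```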

💡 Technical Insights and Learnings

  1. The "Prefill" Cost is Unavoidable: Even with KV Cache, the first token (Time-To-First-Token) is always slow because the model must process the entire prompt in parallel. The cache only speeds up the subsequent tokens (decoding).
  2. Prompt Length Matters: The performance gap only becomes obvious once sequences are long enough. Without the cache, per-token latency grows roughly linearly with sequence length because the whole history is re-processed at every step; with the KV cache, per-token latency stays nearly flat no matter how long the prompt or generated text becomes.
  3. Warm-up is Critical: Our initial benchmarks showed "spikes" in latency. We learned that performing 3-5 "warm-up" runs is essential to stabilize CPU/GPU clock speeds, memory allocators, and PyTorch kernels before measuring.
  4. Batching Hides Overhead: On CPU (and GPU), dispatching individual kernels for a single token (Batch Size 1) is inefficient. Increasing the batch size allows us to process multiple streams in parallel, significantly increasing Throughput (tokens/sec) even if the latency per token (TTPT) stays roughly the same.
  5. The Memory Trade-off: There is no free lunch: we gain speed by spending memory. The KV cache grows linearly with sequence length and batch size, and for long sequences (e.g., 1024 tokens) and large batches this footprint can become the new bottleneck (see the back-of-the-envelope sketch after this list).
  6. Production Safety: A robust implementation must handle the model's context limit (block_size, 1024 tokens for GPT-2). We added logic to detect when the sequence exceeds this limit and truncate/reset the cache to prevent crashes.
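To make the memory trade-off in item 5 concrete, here is a back-of-the-envelope estimate for GPT-2 (124M), which has 12 layers and a 768-dimensional embedding (12 heads × 64), assuming fp32 storage:

```python
# KV-cache size estimate for GPT-2 (124M) in fp32.
n_layer, n_embd, bytes_per_elem = 12, 768, 4
seq_len, batch_size = 1024, 8

# Each layer stores one key vector and one value vector (n_embd floats each) per token.
bytes_per_token = 2 * n_layer * n_embd * bytes_per_elem       # 73,728 bytes ≈ 72 KiB
cache_bytes = bytes_per_token * seq_len * batch_size
print(f"{bytes_per_token / 1024:.1f} KiB per token, "
      f"{cache_bytes / 2**20:.0f} MiB for batch={batch_size}, seq_len={seq_len}")
# -> 72.0 KiB per token, 576 MiB for batch=8, seq_len=1024
```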

💻 Usage

1. Installation

pip install torch numpy matplotlib tiktoken

2. Run Benchmark

Run the comprehensive benchmark suite to generate the metrics and plots:

python benchmark_kv.py

3. Run Visualization

Generate the animated GIF dashboard:

python visualize_kv.py

📂 Repository Structure

  • model.py: The optimized GPT-2 model with KV Caching.
  • model_original.py: The original, unoptimized implementation (for reference).
  • benchmark_kv.py: Advanced benchmarking script (TTFT, TTPT, Throughput).
  • visualize_kv.py: Visualization tool for KV cache dynamics.
  • train.py: (Original) Training script.

Forked from karpathy/nanoGPT
