GhostKV Lab

“A research simulator for query-time bounded elimination of reconstructable KV-cache witnesses in long-context transformer inference.”

GhostKV Lab is a lightweight Python repository for studying whether sketch-based bounded elimination can reduce KV-cache memory movement while preserving attention quality in long-context decode workloads. It is built as a synthetic evaluation harness first: no heavyweight model downloads, no kernel claims, and no fabricated benchmark results.

The current empirical emphasis is on failure analysis as much as on success cases. The most important result in the repository today is that the GPT-2 frontier sweep found no safe-ish operating point, meaning one with false_elimination_rate <= 5% together with elimination_rate >= 30%.

Patent Notice

This repository is associated with Indian provisional patent application 202641062451, titled:

“GHOSTKV: A SYSTEM AND METHOD FOR QUERY-TIME BOUNDED ELIMINATION OF RECONSTRUCTABLE KEY-VALUE WITNESSES IN TRANSFORMER ATTENTION MECHANISMS”

Filed on 2026-05-17.

The repository is intended as a research and evaluation harness for exploring the underlying systems concepts. A concise note is available in docs/patent_notice.md.

Current Status


Synthetic GhostKV simulator: working
GPT-2 real attention validation: working
False-elimination frontier analysis: working
Hierarchical elimination experiments: working
Synthetic result generation pipeline: working
Modern Llama/Mistral validation: pending
GPU kernel integration: pending
Production inference integration: not implemented

Current Research Focus

The current focus is false-elimination frontier analysis on real transformer attention tensors.

The key question:

Can GhostKV eliminate meaningful amounts of cold KV state while keeping false elimination acceptably low?

Current experiments focus on:

attention sketch preservation
bounded elimination behavior
layer/head sensitivity
hierarchical elimination
synthetic memory-traffic modeling

Latency reduction and production inference integration remain future work.

Current Headline Result

Under the current GPT-2 real-attention frontier sweep, GhostKV Lab did not find safe-ish operating points meeting:

false_elimination_rate <= 5%
elimination_rate >= 30%

This is currently the most important result in the repository because it shows that coarse ranking preservation alone is not enough. High rank correlation can coexist with weak extreme-rank preservation and unacceptable elimination tradeoffs.
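The two criteria above can be made concrete with a small sketch. The metric definitions used here are assumptions for illustration: elimination_rate is taken as the fraction of cold tokens eliminated, and false_elimination_rate as the fraction of the true top-k tokens (ranked by exact score) that were eliminated. The definitions used by experiments/false_elimination_frontier.py may differ, and the names frontier_point and is_safe_ish are hypothetical.

```python
import numpy as np


def frontier_point(exact_scores, upper_bounds, theta_elim, top_k):
    """Return (elimination_rate, false_elimination_rate) for one threshold.

    Assumed definitions, not taken from the repository:
      elimination_rate       = eliminated tokens / all cold tokens
      false_elimination_rate = eliminated true top-k tokens / top_k
    """
    eliminated = upper_bounds < theta_elim
    important = np.zeros(len(exact_scores), dtype=bool)
    important[np.argsort(exact_scores)[-top_k:]] = True
    elimination_rate = float(eliminated.mean())
    false_elimination_rate = float((eliminated & important).sum() / top_k)
    return elimination_rate, false_elimination_rate


def is_safe_ish(elimination_rate, false_elimination_rate):
    """Check the two headline criteria from this README."""
    return bool(false_elimination_rate <= 0.05 and elimination_rate >= 0.30)


# Toy data only: noisy stand-ins for sketch-based bounds, not repository results.
rng = np.random.default_rng(0)
exact_scores = rng.standard_normal(256)
upper_bounds = exact_scores + 0.5 * rng.standard_normal(256)

for theta_elim in np.linspace(-2.0, 2.0, 9):
    elim, false_elim = frontier_point(exact_scores, upper_bounds, theta_elim, top_k=16)
    print(f"theta={theta_elim:+.1f} elim={elim:.2f} false={false_elim:.2f} "
          f"safe_ish={is_safe_ish(elim, false_elim)}")
```

Sweeping theta_elim this way traces one frontier curve; a real sweep would repeat it per layer and per head on captured attention tensors.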

See:

Run:

make frontier

Results are written to:

results/frontier/

What GhostKV Is

GhostKV is a systems-oriented hypothesis for KV-cache handling during decode:

Cold KV-cache entries are converted into compact ghost records.
Each ghost record stores an attention sketch vector, a semantic anchor identifier, and a residual uncertainty term.
At query time, the simulator computes a conservative attention upper bound for each ghost record:

AttnUB(Q, G_i) = sketch_sim(Q, G_i.sketch) + epsilon_res_i + sigma_anchor_i

Ghost records whose upper bound falls below theta_elim are eliminated.
Surviving ghost records are resurrected and included in exact attention.

The key property in this repository is exactness over survivors: approximation is confined to the elimination stage. Once candidates survive elimination, the simulator treats attention over hot + resurrected tokens as exact.
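A minimal sketch of the bound and the elimination test follows. It assumes sketch_sim is a plain dot product in sketch space and uses constant placeholder values for epsilon_res and sigma_anchor; the function names are hypothetical and are not the API of src/ghostkv.

```python
import numpy as np


def attention_upper_bound(q_sketch, ghost_sketches, eps_res, sigma_anchor):
    """AttnUB(Q, G_i) = sketch_sim(Q, G_i.sketch) + epsilon_res_i + sigma_anchor_i."""
    # Assumption: sketch_sim is a dot product between the query sketch
    # and each ghost record's sketch vector.
    sketch_sim = ghost_sketches @ q_sketch
    return sketch_sim + eps_res + sigma_anchor


def surviving_mask(upper_bounds, theta_elim):
    """True where a ghost record survives; bounds below theta_elim are eliminated."""
    return upper_bounds >= theta_elim


rng = np.random.default_rng(0)
n_ghosts, sketch_dim = 8, 4
ghost_sketches = rng.standard_normal((n_ghosts, sketch_dim))
q_sketch = rng.standard_normal(sketch_dim)
eps_res = np.full(n_ghosts, 0.1)         # placeholder residual uncertainty
sigma_anchor = np.full(n_ghosts, 0.05)   # placeholder anchor uncertainty

ub = attention_upper_bound(q_sketch, ghost_sketches, eps_res, sigma_anchor)
survivors = surviving_mask(ub, theta_elim=0.0)
print(f"{int(survivors.sum())} of {n_ghosts} ghost records survive")
```

Because both uncertainty terms are added to the sketch similarity, larger values make the bound more conservative: fewer records are eliminated and fewer are eliminated falsely.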

What GhostKV Is Not

Not a production LLM runtime
Not a CUDA kernel implementation
Not a proof of speedup
Not a substitute for real-model validation

This repository started with synthetic tensors and now also includes GPT-2 attention-tensor validation. Broader modern-model validation remains future work.

Architecture

KV Cache
  |
  +--> Hot / Warm / Ghost / Archive
                    |
Query --> Sketch --> Bound --> Eliminate or Resurrect --> Exact Attend

The working intuition is simple: eliminate before moving, but only if the elimination bound remains conservative enough to avoid unacceptable false elimination.

Repository Layout

ghostkv-lab/
  docs/
  src/ghostkv/
  experiments/
  tests/
  results/
  data/

Quickstart

PowerShell

These commands work from the repo root in Windows PowerShell:

python -m venv .venv
.venv\Scripts\activate
python -m pip install -e ".[dev]"
python -m pytest
python experiments/sketch_quality_audit.py
python experiments/elimination_tradeoff.py
python experiments/bandwidth_model_demo.py
python experiments/synthetic_decode_simulation.py
python experiments/generate_results.py
python experiments/real_attention_validation.py
python experiments/hierarchical_elimination.py
python experiments/false_elimination_frontier.py

WSL / Linux / macOS

WSL is recommended for reproducible experiment workflows, especially for the heavier plotting and HuggingFace-based validation scripts.

python -m venv .venv
source .venv/bin/activate
python -m pip install -e ".[dev]"
pytest
make results
make frontier
python experiments/real_attention_validation.py

From Windows, the same workflow can be invoked explicitly through WSL:

wsl -e bash -c "pytest"
wsl -e bash -c "make results"
wsl -e bash -c "make frontier"
wsl -e bash -c "python experiments/real_attention_validation.py"

If you prefer not to create a virtual environment, the same install and run commands work with the active Python environment as long as it is Python 3.10+.

Core Idea

Ghost records are compact witnesses for cold KV entries:

attention sketch vector
semantic anchor id
residual uncertainty value

At each decode step:

Project the query into sketch space.
Compute conservative upper bounds for ghost records.
Eliminate records with bounds below theta_elim.
Resurrect survivors.
Run exact attention over hot tokens plus resurrected tokens.
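The five steps above can be sketched for a single head as follows. This is an illustration under stated assumptions, not the simulator's implementation: the sketch is a Gaussian random projection, sketch_sim is a dot product, the uncertainty terms are constants, and resurrection is modeled as simply indexing the cold arrays. The name decode_step and all parameter names are hypothetical.

```python
import numpy as np


def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()


def decode_step(q, hot_k, hot_v, cold_k, cold_v, proj,
                eps_res, sigma_anchor, theta_elim):
    # 1. Project the query into sketch space.
    q_sketch = proj @ q
    # 2. Conservative upper bounds for ghost records. The cold-key sketches
    #    are recomputed here for brevity; a real system would store them.
    cold_sketches = cold_k @ proj.T
    upper_bounds = cold_sketches @ q_sketch + eps_res + sigma_anchor
    # 3. Eliminate records with bounds below theta_elim.
    survive = upper_bounds >= theta_elim
    # 4. Resurrect survivors (modeled as a plain gather).
    k = np.concatenate([hot_k, cold_k[survive]])
    v = np.concatenate([hot_v, cold_v[survive]])
    # 5. Exact attention over hot tokens plus resurrected tokens.
    weights = softmax(k @ q / np.sqrt(q.shape[0]))
    return weights @ v, survive


rng = np.random.default_rng(0)
head_dim, sketch_dim, n_hot, n_cold = 16, 4, 4, 12
proj = rng.standard_normal((sketch_dim, head_dim)) / np.sqrt(sketch_dim)
q = rng.standard_normal(head_dim)
hot_k = rng.standard_normal((n_hot, head_dim))
hot_v = rng.standard_normal((n_hot, head_dim))
cold_k = rng.standard_normal((n_cold, head_dim))
cold_v = rng.standard_normal((n_cold, head_dim))
eps_res = np.full(n_cold, 0.1)
sigma_anchor = np.full(n_cold, 0.05)

out, survive = decode_step(q, hot_k, hot_v, cold_k, cold_v, proj,
                           eps_res, sigma_anchor, theta_elim=0.0)
print(f"resurrected {int(survive.sum())} of {n_cold} cold tokens")
```

With theta_elim set to negative infinity every cold token survives and the output equals full exact attention, which is a convenient sanity check for any implementation of this loop.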

Why This Repo Exists

Long-context inference can become bottlenecked by KV-cache movement rather than only by arithmetic throughput. This repository exists to evaluate whether bounded elimination can reduce the amount of KV state that must be moved or re-read on each decode step without aggressively approximating the final attention calculation.
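For a sense of scale, the back-of-envelope calculation below estimates how many bytes of K and V state a single decode step reads for one sequence. The first function is the standard KV-cache size formula. The second is a hypothetical traffic model introduced only for illustration, with made-up inputs; it is not the model used in experiments/bandwidth_model_demo.py.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, bytes_per_elem=2):
    """Bytes of K and V state read by one full decode step for one sequence."""
    # The factor of 2 covers keys and values.
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem


def traffic_with_elimination(full_bytes, cold_fraction, elimination_rate, sketch_ratio):
    """Hypothetical per-step traffic model (an assumption, not the repo's model).

    Hot tokens are read in full, every ghost sketch is read, and only the
    surviving cold tokens are resurrected and read in full.
    """
    hot = 1.0 - cold_fraction
    cold = cold_fraction * (sketch_ratio + (1.0 - elimination_rate))
    return full_bytes * (hot + cold)


# GPT-2 small shape: 12 layers, 12 heads, head_dim 64; fp16; 1024-token context.
full = kv_cache_bytes(n_layers=12, n_kv_heads=12, head_dim=64, seq_len=1024)
print(f"full KV read per step: {full / 2**20:.0f} MiB")  # 36 MiB

# Illustrative inputs only; none of these values are measured.
reduced = traffic_with_elimination(full, cold_fraction=0.8,
                                   elimination_rate=0.3, sketch_ratio=0.125)
print(f"modeled traffic ratio: {reduced / full:.2f}")
```

Under this toy model, elimination only reduces traffic when elimination_rate exceeds sketch_ratio, since every ghost sketch must be read before anything can be eliminated.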

Experiments

experiments/sketch_quality_audit.py: compares exact scores and sketch-space scores across sketch dimensions
experiments/elimination_tradeoff.py: sweeps elimination thresholds and sketch dimensions
experiments/bandwidth_model_demo.py: compares illustrative memory footprints for full KV, quantized KV, and GhostKV
experiments/synthetic_decode_simulation.py: runs a multi-step decode simulation and summarizes aggregate metrics
experiments/generate_results.py: regenerates synthetic CSV outputs, PNG plots, and RESULTS.md
experiments/real_attention_validation.py: captures GPT-2 Q/K tensors and evaluates ranking preservation on real attention states
experiments/hierarchical_elimination.py: compares flat and hierarchical elimination on real attention tensors
experiments/false_elimination_frontier.py: sweeps theta_elim on real attention tensors to map elimination versus false-elimination frontiers by layer and head

Synthetic and real-attention experiments are both intended to inform feasibility, not to claim production benefit.

Known Findings So Far

Random projections preserve global similarity structure more effectively than exact top-attention ranking.
Real transformer tensors behave differently from synthetic Gaussian tensors.
False elimination remains the primary technical challenge.
Some attention heads and layers appear substantially more sketch-preserving than others.
Hierarchical elimination may improve elimination behavior in principle, but the current naive clustering baseline does not yet outperform flat elimination consistently.
The current GPT-2 frontier sweep did not find safe-ish operating points with false elimination at or below 5% and elimination at or above 30%.
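The distinction between global similarity structure and top-attention ranking can be measured with two separate metrics. The sketch below computes a Spearman rank correlation over all tokens and a top-k recall over only the highest-scoring tokens, using a Gaussian random projection on synthetic keys. The helper names and the toy sizes are assumptions for illustration; the repository's audit scripts may use different metrics.

```python
import numpy as np


def spearman(a, b):
    """Spearman rank correlation, assuming no tied values."""
    rank_a = np.argsort(np.argsort(a))
    rank_b = np.argsort(np.argsort(b))
    return float(np.corrcoef(rank_a, rank_b)[0, 1])


def top_k_recall(exact, approx, k):
    """Fraction of the exact top-k indices that also appear in the approximate top-k."""
    top_exact = set(np.argsort(exact)[-k:].tolist())
    top_approx = set(np.argsort(approx)[-k:].tolist())
    return len(top_exact & top_approx) / k


# Synthetic Gaussian tensors only; real attention tensors behave differently.
rng = np.random.default_rng(0)
n_tokens, head_dim, sketch_dim = 512, 64, 16
keys = rng.standard_normal((n_tokens, head_dim))
query = rng.standard_normal(head_dim)
proj = rng.standard_normal((sketch_dim, head_dim)) / np.sqrt(sketch_dim)

exact_scores = keys @ query
sketch_scores = (keys @ proj.T) @ (proj @ query)

print(f"spearman over all tokens: {spearman(exact_scores, sketch_scores):.2f}")
print(f"top-16 recall:            {top_k_recall(exact_scores, sketch_scores, 16):.2f}")
```

Reporting both numbers side by side makes it visible when a sketch ranks the bulk of tokens well but still misplaces the few tokens that dominate attention.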

Generate Results

make demo

This runs the test suite and then generates synthetic CSV outputs, PNG plots, and a refreshed RESULTS.md summary. If you only want to regenerate artifacts, use make results.

Additional targets:

make real-validation
make hierarchical
make frontier
make all-results

If make is not available in your shell, the equivalent commands are:

python -m pytest
python experiments/generate_results.py

For reproducible experiment workflows on Windows, using WSL is recommended:

wsl -e bash -c "pytest"
wsl -e bash -c "make results"
wsl -e bash -c "make frontier"

Current State Of The Project

What currently works:

synthetic sketch-quality sweeps
elimination-threshold experiments
GPT-2 attention tensor capture on CPU
per-layer and per-head real attention metrics
flat versus hierarchical elimination comparisons
decode-step simulation with exact attention on surviving candidates
illustrative bandwidth and resurrection modeling
CSV, plot, and markdown result generation

What is currently simulated:

anchor and residual uncertainty terms
resurrection cost estimates
memory-traffic comparisons

What remains hypothetical or unvalidated:

quality retention on benchmark tasks
runtime overlap between resurrection and decode compute
end-to-end latency benefit in a production inference stack
generalization from GPT-2 to larger modern models such as Llama, Mistral, and GQA-based decoders

What is future work:

broader real-model Q/K capture
LongBench and retrieval-style validation
FlashAttention-compatible survivor paths
GPU and memory-tier experiments

Roadmap

Phase 1 — Synthetic Validation

synthetic sketch quality
elimination sweeps
bandwidth modeling

Phase 2 — Real Attention Validation

GPT-2 Q/K capture
layer/head frontier analysis
false elimination measurement

Phase 3 — Modern Model Validation

TinyLlama
Mistral
Llama-3 style architectures
grouped-query attention behavior

Phase 4 — Runtime Integration

FlashAttention-compatible survivor path
decode-side resurrection overlap
GPU kernel hooks
memory movement instrumentation

Phase 5 — Memory-System Exploration

hierarchical ghost indexes
learned sketch functions
CXL / near-memory filtering
memory-side elimination experiments

Additional detail is in docs/roadmap.md.

Development Notes

Python 3.10+
Main dependencies: numpy, matplotlib, torch, transformers
Test runner: pytest
Editable install supported via pip install -e ".[dev]"

License Clarification

The source code in this repository is available under the MIT License. That copyright license applies to the code itself; it does not by itself waive any separate patent rights that may be associated with related patent filings.

License

MIT. See LICENSE.

Limitations

GPT-2 is not representative of all modern LLMs.
The repository does not include a production decode kernel.
No real memory movement reduction is measured yet.
The resurrection pipeline is still simulated.
There is no FlashAttention integration.
There is no end-to-end throughput benchmark.
There is no proof of quality preservation on downstream tasks.

This repository currently explores feasibility and methodology, not production deployment.

Disclaimer

GhostKV Lab is an experimental research repository exploring systems concepts related to KV-cache memory movement and bounded elimination in transformer inference workloads.

Current experiments are synthetic or small-model analytical studies intended for methodology exploration. The repository does not currently implement a production transformer runtime.
