vCache

Reliable and Efficient Semantic Prompt Caching


Semantic caching reduces LLM latency and cost by returning cached model responses for semantically similar prompts (not just exact matches). vCache is the first verified semantic cache that guarantees user-defined error rate bounds. vCache replaces static thresholds with online-learned, embedding-specific decision boundaries—no manual fine-tuning required. This enables reliable cached response reuse across any embedding model or workload.

[NOTE] vCache is currently in active development. Features and APIs may change as we continue to improve the system.

🚀 Quick Install

Install vCache in editable mode:

pip install -e .

Then, set your OpenAI key:

export OPENAI_API_KEY="your_api_key_here"

(Note: vCache uses OpenAI by default for both LLM inference and embedding generation, but you can configure any other backend)

Finally, use vCache in your Python code:

from vcache import VCache

vcache: VCache = VCache()
response: str = vcache.infer("Is the sky blue?")
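
Subsequent, semantically similar prompts may then be answered from the cache instead of triggering a new LLM call (subject to the learned decision boundary); the paraphrase below is only illustrative:

# A paraphrase of the previous prompt; it may be served from the cache.
follow_up: str = vcache.infer("Does the sky look blue?")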

By default, vCache uses the following components (spelled out explicitly in the sketch after this list):

  • OpenAIInferenceEngine
  • OpenAIEmbeddingEngine
  • HNSWLibVectorDB
  • InMemoryEmbeddingMetadataStorage
  • NoEvictionPolicy
  • StringComparisonSimilarityEvaluator
  • VerifiedDecisionPolicy with a maximum failure rate of 2%
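
For reference, this default stack corresponds roughly to the explicit configuration below. Treat it as a sketch: the no-argument constructors and the default model names they imply are assumptions, and a plain VCache() already applies all of these defaults for you.

from vcache import (
    HNSWLibVectorDB,
    InMemoryEmbeddingMetadataStorage,
    OpenAIEmbeddingEngine,
    OpenAIInferenceEngine,
    VCache,
    VCacheConfig,
    VerifiedDecisionPolicy,
)

# Spell out the default components explicitly (the similarity evaluator and
# eviction policy are left at their defaults: StringComparisonSimilarityEvaluator
# and NoEvictionPolicy).
config: VCacheConfig = VCacheConfig(
    inference_engine=OpenAIInferenceEngine(),
    embedding_engine=OpenAIEmbeddingEngine(),
    vector_db=HNSWLibVectorDB(),
    embedding_metadata_storage=InMemoryEmbeddingMetadataStorage(),
)
vcache: VCache = VCache(config, VerifiedDecisionPolicy(delta=0.02))  # 2% maximum failure rate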

⚙️ Advanced Configuration

vCache is modular and highly configurable. Below is an example showing how to customize key components:

from vcache import (
    HNSWLibVectorDB,
    InMemoryEmbeddingMetadataStorage,
    LLMComparisonSimilarityEvaluator,
    OpenAIEmbeddingEngine,
    OpenAIInferenceEngine,
    VCache,
    VCacheConfig,
    VCachePolicy,
    VerifiedDecisionPolicy,
)

# 1. Configure the components for vCache
config: VCacheConfig = VCacheConfig(
    inference_engine=OpenAIInferenceEngine(model_name="gpt-4.1-2025-04-14"),
    embedding_engine=OpenAIEmbeddingEngine(model_name="text-embedding-3-small"),
    vector_db=HNSWLibVectorDB(),
    embedding_metadata_storage=InMemoryEmbeddingMetadataStorage(),
    similarity_evaluator=LLMComparisonSimilarityEvaluator(
        inference_engine=OpenAIInferenceEngine(model_name="gpt-4.1-nano-2025-04-14")
    ),
)

# 2. Choose a caching policy
policy: VCachePolicy = VerifiedDecisionPolicy(delta=0.03)

# 3. Initialize vCache with the configuration and policy
vcache: VCache = VCache(config, policy)

response: str = vcache.infer("Is the sky blue?")

You can swap out any component—such as the eviction policy or vector database—for your specific use case.

You can find complete working examples in the playground directory:

  • example_1.py - Basic usage with sample data processing
  • example_2.py - Advanced usage with cache hit tracking and timing

Eviction Policy

vCache supports FIFO, LRU, MRU, and a custom SCU eviction policy. See the Eviction Policy Documentation for further details.
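
A hypothetical configuration sketch: the class name LRUEvictionPolicy, its max_size argument, and the eviction_policy keyword below are illustrative assumptions; the Eviction Policy Documentation lists the actual identifiers.

from vcache import VCache, VCacheConfig, VerifiedDecisionPolicy

# Hypothetical names: replace LRUEvictionPolicy, max_size, and eviction_policy
# with the identifiers documented for the eviction policies.
from vcache import LRUEvictionPolicy

config: VCacheConfig = VCacheConfig(
    eviction_policy=LRUEvictionPolicy(max_size=10_000),
)
vcache: VCache = VCache(config, VerifiedDecisionPolicy(delta=0.02))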

🧠 What Is Semantic Caching?

Semantic caching reduces LLM latency and cost by returning cached model responses for semantically similar prompts (not just exact matches), so you avoid paying inference cost and latency for repeated questions that have the same answer.

vCache Architecture

Architecture Overview

  1. Embed & Store
    Each prompt is converted to a fixed-length vector (an “embedding”) and stored in a vector database along with its LLM response.

  2. Nearest-Neighbor Lookup
    When a new prompt arrives, the cache embeds it and finds its most similar stored prompt using a similarity metric (e.g., cosine similarity).

  3. Similarity Score
    The system computes a score between 0 and 1 that quantifies how “close” the new prompt is to the retrieved entry.

  4. Decision: Exploit vs. Explore

    • Exploit (cache hit): If the similarity is above a confidence bound, return the cached response.
    • Explore (cache miss): Otherwise, query the LLM for a response, add its embedding and answer to the cache, and return it (see the sketch below).
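
The flow above as an illustrative sketch, not vCache's internal API; embed, nearest, add, and llm stand in for the configured embedding engine, vector database, and inference engine:

from typing import Callable, List, Optional, Tuple

def semantic_lookup(
    prompt: str,
    embed: Callable[[str], List[float]],
    nearest: Callable[[List[float]], Optional[Tuple[str, float, float]]],  # (response, similarity, boundary)
    add: Callable[[List[float], str], None],
    llm: Callable[[str], str],
) -> str:
    query = embed(prompt)              # 1. Embed the prompt
    match = nearest(query)             # 2. Nearest-neighbor lookup
    if match is not None:
        response, similarity, boundary = match
        if similarity >= boundary:     # 3./4. Exploit: similarity clears the decision boundary
            return response            #       -> cache hit
    answer = llm(prompt)               # 4. Explore: query the LLM,
    add(query, answer)                 #    cache the new embedding and answer,
    return answer                      #    and return the fresh response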


Why Fixed Thresholds Fall Short

Existing semantic caches rely on a global static threshold to decide whether to reuse a cached response (exploit) or invoke the LLM (explore). If the similarity score exceeds this threshold, the cache reuses the response; otherwise, it queries the model. This strategy is fundamentally limited.

  • Uniform threshold, diverse prompts: A fixed threshold assumes similarity scores are comparable across all prompts and embeddings, ignoring that what counts as "similar enough" is context-dependent.
  • Threshold too low → false positives: Prompts with low semantic similarity may be incorrectly treated as equivalent, resulting in reused responses that do not match the intended output.
  • Threshold too high → false negatives: Prompts with semantically equivalent meaning may fail the similarity check, forcing unnecessary LLM inference and reducing cache efficiency.
  • No correctness control: There is no mechanism to ensure or even estimate how often reused answers will be wrong.

In short, fixed thresholds trade correctness for simplicity and offer no guarantees. Please refer to the vCache paper for further details.
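
For contrast, the static-threshold rule reduces to a single global cut-off, sketched below with an arbitrary illustrative value; nothing in it bounds how often a reused answer will be wrong:

STATIC_THRESHOLD = 0.85  # one global, hand-tuned value applied to every prompt

def static_decision(similarity: float) -> bool:
    """Return True to reuse the cached response (exploit), False to call the LLM (explore)."""
    return similarity >= STATIC_THRESHOLD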

Introducing vCache

vCache overcomes these limitations with two ideas:

  • Per-Prompt Decision Boundary
    vCache learns a custom decision boundary for each cached prompt, based on past observations of how often a given similarity score actually corresponded to the correct response.

  • Built-In Error Constraint
    You specify a maximum error rate (e.g., 1%), and vCache adjusts every per-prompt decision boundary online to respect that bound while maximizing the cache hit rate. No offline training or manual fine-tuning is required (see the sketch after this list).
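
A minimal sketch of setting the error constraint, reusing the VerifiedDecisionPolicy shown in the Advanced Configuration section; constructing VCacheConfig() without arguments is assumed to fall back to the defaults listed above:

from vcache import VCache, VCacheConfig, VerifiedDecisionPolicy

# delta=0.01 asks vCache to keep the rate of incorrect cache hits below 1%.
policy = VerifiedDecisionPolicy(delta=0.01)
vcache = VCache(VCacheConfig(), policy)

response: str = vcache.infer("Is the sky blue?")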

Benefits

  • Reliability
    Formally bounds the rate of incorrect cache hits to your chosen tolerance.

  • Performance
    Matches or exceeds static-threshold systems in cache hit rate and end-to-end latency.

  • Simplicity
    Plug in any embedding model; vCache learns and adapts automatically at runtime.


Please refer to the vCache paper for further details.

🛠 Developer Guide

For advanced usage and development setup, see the Developer Guide.

📊 Benchmarking vCache

vCache includes a benchmarking framework to evaluate:

  • Cache hit rate
  • Error rate
  • Latency improvement
  • ...

We provide three open benchmarks:

  • SemCacheLmArena (chat-style prompts) - Dataset ↗
  • SemCacheClassification (classification queries) - Dataset ↗
  • SemCacheSearchQueries (real-world search logs) - Dataset ↗

See the Benchmarking Documentation for instructions.
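
Independent of the framework, a rough manual latency check needs only the public infer call and the standard library; the sketch below is illustrative, and whether the second prompt hits the cache depends on the learned decision boundary:

import time

from vcache import VCache

vcache = VCache()

# Time an initial prompt and a semantically similar follow-up.
for prompt in ["Is the sky blue?", "Does the sky look blue?"]:
    start = time.perf_counter()
    vcache.infer(prompt)
    print(f"{prompt!r}: {time.perf_counter() - start:.3f}s")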

📄 Citation

If you use vCache for your research, please cite our paper.

@article{schroeder2025adaptive,
  title={vCache: Verified Semantic Prompt Caching},
  author={Schroeder, Luis Gaspar and Desai, Aditya and Cuadron, Alejandro and Chu, Kyle and Liu, Shu and Zhao, Mark and Krusche, Stephan and Kemper, Alfons and Zaharia, Matei and Gonzalez, Joseph E},
  journal={arXiv preprint arXiv:2502.03771},
  year={2025}
}
