Semantic caching reduces LLM latency and cost by returning cached model responses for semantically similar prompts (not just exact matches). vCache is the first verified semantic cache that guarantees user-defined error rate bounds. vCache replaces static thresholds with online-learned, embedding-specific decision boundaries—no manual fine-tuning required. This enables reliable cached response reuse across any embedding model or workload.
[NOTE] vCache is currently in active development. Features and APIs may change as we continue to improve the system.
Install vCache in editable mode:
pip install -e .
Then, set your OpenAI key:
export OPENAI_API_KEY="your_api_key_here"
(Note: vCache uses OpenAI by default for both LLM inference and embedding generation, but you can configure any other backend)
Finally, use vCache in your Python code:
from vcache import VCache
vcache: VCache = VCache()
response: str = vcache.infer("Is the sky blue?")
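A second, semantically similar prompt can then be served from the cache instead of the LLM. The follow-up below is illustrative only; whether it actually hits the cache depends on the decision boundary vCache has learned so far.
from vcache import VCache

vcache: VCache = VCache()
first: str = vcache.infer("Is the sky blue?")          # likely a cache miss: calls the LLM and stores the answer
second: str = vcache.infer("Does the sky look blue?")  # may be a cache hit: reuses the stored answer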
By default, vCache uses:
- OpenAIInferenceEngine
- OpenAIEmbeddingEngine
- HNSWLibVectorDB
- InMemoryEmbeddingMetadataStorage
- NoEvictionPolicy
- StringComparisonSimilarityEvaluator
- VerifiedDecisionPolicy with a maximum failure rate of 2%
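Spelled out, the default setup corresponds roughly to the sketch below. It is an approximation: it assumes StringComparisonSimilarityEvaluator is importable from the top-level vcache package like the other components, that the engines pick their default model names when constructed without arguments, and it leaves the default NoEvictionPolicy implicit.
from vcache import (
    HNSWLibVectorDB,
    InMemoryEmbeddingMetadataStorage,
    OpenAIEmbeddingEngine,
    OpenAIInferenceEngine,
    StringComparisonSimilarityEvaluator,  # assumed to be exported like the other components
    VCache,
    VCacheConfig,
    VerifiedDecisionPolicy,
)

config: VCacheConfig = VCacheConfig(
    inference_engine=OpenAIInferenceEngine(),
    embedding_engine=OpenAIEmbeddingEngine(),
    vector_db=HNSWLibVectorDB(),
    embedding_metadata_storage=InMemoryEmbeddingMetadataStorage(),
    similarity_evaluator=StringComparisonSimilarityEvaluator(),
    # NoEvictionPolicy is the default eviction policy; its parameter is omitted here.
)

# A maximum failure rate of 2% corresponds to delta=0.02.
vcache: VCache = VCache(config, VerifiedDecisionPolicy(delta=0.02))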
vCache is modular and highly configurable. Below is an example showing how to customize key components:
from vcache import (
HNSWLibVectorDB,
InMemoryEmbeddingMetadataStorage,
LLMComparisonSimilarityEvaluator,
OpenAIEmbeddingEngine,
OpenAIInferenceEngine,
VCache,
VCacheConfig,
VCachePolicy,
VerifiedDecisionPolicy,
)
# 1. Configure the components for vCache
config: VCacheConfig = VCacheConfig(
inference_engine=OpenAIInferenceEngine(model_name="gpt-4.1-2025-04-14"),
embedding_engine=OpenAIEmbeddingEngine(model_name="text-embedding-3-small"),
vector_db=HNSWLibVectorDB(),
embedding_metadata_storage=InMemoryEmbeddingMetadataStorage(),
similarity_evaluator=LLMComparisonSimilarityEvaluator(
inference_engine=OpenAIInferenceEngine(model_name="gpt-4.1-nano-2025-04-14")
),
)
# 2. Choose a caching policy
policy: VCachePolicy = VerifiedDecisionPolicy(delta=0.03)
# 3. Initialize vCache with the configuration and policy
vcache: VCache = VCache(config, policy)
response: str = vcache.infer("Is the sky blue?")
You can swap out any component—such as the eviction policy or vector database—for your specific use case.
You can find complete working examples in the playground directory:
- example_1.py - Basic usage with sample data processing
- example_2.py - Advanced usage with cache hit tracking and timing
vCache supports FIFO, LRU, MRU, and a custom SCU eviction policy. See the Eviction Policy Documentation for further details.
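Selecting one of these policies would look roughly like the sketch below. The LRUEvictionPolicy class and the eviction_policy parameter are assumptions (only the policy names are listed above, not their classes), so consult the Eviction Policy Documentation for the exact API.
from vcache import (
    HNSWLibVectorDB,
    VCache,
    VCacheConfig,
    VerifiedDecisionPolicy,
)

# Hypothetical names below: check the Eviction Policy Documentation for the real
# class names and the exact VCacheConfig parameter.
# from vcache import LRUEvictionPolicy

config: VCacheConfig = VCacheConfig(
    vector_db=HNSWLibVectorDB(),
    # eviction_policy=LRUEvictionPolicy(),  # assumed parameter name and constructor
)

vcache: VCache = VCache(config, VerifiedDecisionPolicy(delta=0.02))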
Semantic caching reduces LLM latency and cost by returning cached model responses for semantically similar prompts (not just exact matches)—so you don’t pay for inference cost and latency on repeated questions that have the same answer.
- Embed & Store: Each prompt is converted to a fixed-length vector (an "embedding") and stored in a vector database along with its LLM response.
- Nearest-Neighbor Lookup: When a new prompt arrives, the cache embeds it and finds its most similar stored prompt using a similarity metric (e.g., cosine similarity).
- Similarity Score: The system computes a score between 0 and 1 that quantifies how "close" the new prompt is to the retrieved entry.
- Decision: Exploit vs. Explore
  - Exploit (cache hit): If the similarity is above a confidence bound, return the cached response.
  - Explore (cache miss): Otherwise, query the LLM for a response, add its embedding and answer to the cache, and return the fresh response.
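Putting the four steps together, here is a minimal sketch of the flow in plain Python. The cache, embed, call_llm, and should_exploit callables are placeholders, not vCache's internal API; a static-threshold cache would implement should_exploit as a fixed cutoff, while vCache learns it per entry (see below).
def cosine_similarity(a, b):
    """Similarity in [0, 1] for non-negative embeddings; higher means closer."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sum(x * x for x in a) ** 0.5
    norm_b = sum(y * y for y in b) ** 0.5
    return dot / (norm_a * norm_b)

def semantic_cache_infer(prompt, cache, embed, call_llm, should_exploit):
    """Generic semantic-cache flow: embed, look up the nearest neighbor, then exploit or explore."""
    query_vec = embed(prompt)                                 # embed the incoming prompt
    hit = cache.nearest(query_vec)                            # nearest-neighbor lookup in the vector DB
    if hit is not None:
        score = cosine_similarity(query_vec, hit.embedding)   # similarity score
        if should_exploit(score, hit):                        # exploit: confident enough to reuse
            return hit.response
    response = call_llm(prompt)                               # explore: query the LLM ...
    cache.add(query_vec, response)                            # ... and cache the new embedding + answer
    return response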
Existing semantic caches rely on a global static threshold to decide whether to reuse a cached response (exploit) or invoke the LLM (explore). If the similarity score exceeds this threshold, the cache reuses the response; otherwise, it queries the model. This strategy is fundamentally limited.
- Uniform threshold, diverse prompts: A fixed threshold assumes one similarity cutoff fits every embedding, ignoring that similarity is context-dependent.
- Threshold too low → false positives: Prompts with low semantic similarity may be incorrectly treated as equivalent, resulting in reused responses that do not match the intended output.
- Threshold too high → false negatives: Prompts with semantically equivalent meaning may fail the similarity check, forcing unnecessary LLM inference and reducing cache efficiency.
- No correctness control: There is no mechanism to ensure or even estimate how often reused answers will be wrong.
In short, fixed thresholds trade correctness for simplicity and offer no guarantees. Please refer to the vCache paper for further details.
vCache overcomes these limitations with two ideas:
- Per-Prompt Decision Boundary: vCache learns a custom decision boundary for each cached prompt, based on past observations of how often a given similarity score actually corresponded to the correct response.
- Built-In Error Constraint: You specify a maximum error rate (e.g., 1%). vCache adjusts every per-prompt decision boundary online to maximize the cache hit rate while respecting that bound, with no offline training or manual fine-tuning required.
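To make this concrete, here is a conceptual sketch with hypothetical names; it is not vCache's actual estimator, which uses the statistically verified procedure described in the paper. Each cached entry keeps labeled observations of past similarity scores, and its boundary is the lowest similarity at which the observed error rate still stays below the user-chosen bound delta.
from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass
class CachedEntry:
    """Hypothetical per-entry metadata: a stand-in for vCache's embedding metadata storage."""
    response: str
    # (similarity, was_correct) pairs observed for this entry so far.
    observations: List[Tuple[float, bool]] = field(default_factory=list)

def decision_boundary(entry: CachedEntry, delta: float) -> float:
    """Lowest similarity threshold whose observed error rate stays within delta (illustrative only)."""
    boundary = 1.0  # with no evidence, never exploit
    for threshold in sorted({s for s, _ in entry.observations}, reverse=True):
        above = [ok for s, ok in entry.observations if s >= threshold]
        error_rate = above.count(False) / len(above)
        if error_rate <= delta:
            boundary = threshold  # safe to lower the boundary this far
        else:
            break
    return boundary

def should_exploit(similarity: float, entry: CachedEntry, delta: float = 0.02) -> bool:
    """Exploit (cache hit) only if the similarity clears this entry's learned boundary."""
    return similarity >= decision_boundary(entry, delta)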
Together, these ideas give vCache:
- Reliability: Formally bounds the rate of incorrect cache hits to your chosen tolerance.
- Performance: Matches or exceeds static-threshold systems in cache hit rate and end-to-end latency.
- Simplicity: Plug in any embedding model; vCache learns and adapts automatically at runtime.
Please refer to the vCache paper for further details.
For advanced usage and development setup, see the Developer Guide.
vCache includes a benchmarking framework to evaluate:
- Cache hit rate
- Error rate
- Latency improvement
- ...
We provide three open benchmarks:
- SemCacheLmArena (chat-style prompts) - Dataset ↗
- SemCacheClassification (classification queries) - Dataset ↗
- SemCacheSearchQueries (real-world search logs) - Dataset ↗
See the Benchmarking Documentation for instructions.
If you use vCache for your research, please cite our paper.
@article{schroeder2025adaptive,
title={vCache: Verified Semantic Prompt Caching},
author={Schroeder, Luis Gaspar and Desai, Aditya and Cuadron, Alejandro and Chu, Kyle and Liu, Shu and Zhao, Mark and Krusche, Stephan and Kemper, Alfons and Zaharia, Matei and Gonzalez, Joseph E},
journal={arXiv preprint arXiv:2502.03771},
year={2025}
}