A command-line utility that queries Anthropic's Claude with optional semantic caching, backed by vector search in ScyllaDB Cloud or PostgreSQL pgvector. The semantic cache uses SentenceTransformer embeddings to identify similar prompts and return cached responses, reducing API calls and improving response times.
- Features
- Project Structure
- Requirements
- Installation
- Setting Up Cache Backends
- Usage
- Command-Line Options
- Environment Variables
- How Semantic Caching Works
- Database Schema
- Demo
- Examples
- Performance Considerations
- Cache Behavior
- Troubleshooting
- Benchmarking Cache Performance
- Contributing
- License
- Acknowledgments
- 🤖 Claude AI Integration: Uses Autogen with Anthropic's Claude models
- 🔍 Semantic Caching: Vector-based caching with ScyllaDB or PostgreSQL pgvector for similar prompt detection
- ⏱️ TTL Support: 1-hour default TTL (auto-expiration in ScyllaDB, query-time filtering in PostgreSQL)
- 🎯 Similarity Threshold: Enforced 0.95 default threshold to control cache hit quality
- ⚡ Fast Retrieval: Cosine similarity search using HNSW indexes
- 🎛️ Flexible Configuration: Command-line arguments or environment variables
- 🔧 Customizable Models: Configure both Claude and SentenceTransformer models
- 📊 Cache Control: Enable/disable caching on demand
- 🗄️ Multiple Backends: Choose between ScyllaDB Cloud or PostgreSQL pgvector
- 🧹 Expired-Entry Cleanup: ScyllaDB deletes expired entries automatically; PostgreSQL supports manual cleanup of expired entries
The project consists of four main components:
- AI Agent (ai_agent_with_cache.py) - Main CLI tool for querying Claude with semantic caching
- Performance Benchmark (benchmark.py) - Compare cache performance between backends
  - Comprehensive benchmark suite with multiple test scenarios
  - Measures cache hits, semantic similarity, cache misses, and scale
  - Exports results to JSON or CSV format
- ScyllaDB Cloud Management (scylla-cloud/) - Deployment tool for managing ScyllaDB Cloud clusters
  - deploy-scylla-cloud.py - Create, destroy, and manage clusters with vector search
  - See scylla-cloud/README.md for detailed documentation
- PostgreSQL pgvector Docker Management (postgres-pgvector-docker/) - Local PostgreSQL with pgvector
  - deploy-pgvector-docker.py - Manage local PostgreSQL containers with pgvector
  - See postgres-pgvector-docker/README.md for detailed documentation
- Python 3.8+
- Cache backend (optional, only needed when using semantic caching):
  - ScyllaDB Cloud cluster, OR
  - PostgreSQL with pgvector extension (can use the included Docker tool)
- Anthropic API key
- Clone the repository:
git clone <repository-url>
cd ai-agent-with-semantic-cache
- Install dependencies:
pip install -r requirements.txt
Or manually (the quotes around psycopg[binary] keep shells such as zsh from treating the brackets as a glob pattern):
pip install autogen-ext autogen-core sentence-transformers scylla-driver "psycopg[binary]" pgvector numpy anthropic
Note on ScyllaDB Driver: This project uses scylla-driver (not cassandra-driver). The scylla-driver is a fork of cassandra-driver optimized for ScyllaDB, but it still uses the cassandra namespace internally. When importing, use from cassandra.cluster import ... even though the package is scylla-driver.
- Set up your environment variables (optional):
export ANTHROPIC_API_KEY="your-api-key"
# For ScyllaDB:
export SCYLLA_CONTACT_POINTS="your-scylla-host.com"
export SCYLLA_USER="your-username"
export SCYLLA_PASSWORD="your-password"
# For PostgreSQL:
export POSTGRES_HOST="localhost"
export POSTGRES_PORT="5432"
export POSTGRES_USER="postgres"
export POSTGRES_PASSWORD="postgres"
Use the included Docker management tool for quick local setup:
cd postgres-pgvector-docker
# Start PostgreSQL with pgvector
./deploy-pgvector-docker.py start --name pgvector-cache
# Check status
./deploy-pgvector-docker.py status --name pgvector-cache
# Get connection info
./deploy-pgvector-docker.py info --name pgvector-cache
For detailed documentation, see postgres-pgvector-docker/README.md.
Deploy a ScyllaDB Cloud cluster with vector search:
cd scylla-cloud
# Set your ScyllaDB Cloud API key
export SCYLLA_CLOUD_API_KEY="your-cloud-api-key"
# Create a cluster with vector search
./deploy-scylla-cloud.py create \
--name my-vector-cache \
--cloud-provider AWS \
--region us-east-1
# Check status
./deploy-scylla-cloud.py status --name my-vector-cache
# Get connection information
./deploy-scylla-cloud.py info --name my-vector-cache --format json
For detailed documentation, see scylla-cloud/README.md.
./ai_agent_with_cache.py \
--prompt "What is the capital of France?" \
--with-cache none \
--anthropic-api-key "your-api-key"
./ai_agent_with_cache.py \
--prompt "What is the capital of France?" \
--with-cache pgvector \
--anthropic-api-key "your-api-key" \
--postgres-host "localhost" \
--postgres-port 5432 \
--postgres-user "postgres" \
--postgres-password "postgres"
./ai_agent_with_cache.py \
--prompt "What is the capital of France?" \
--with-cache scylla \
--anthropic-api-key "your-api-key" \
--scylla-contact-points "your-host.com" \
--scylla-user "your-username" \
--scylla-password "your-password"
# Use L2 distance with pgvector
./ai_agent_with_cache.py \
--prompt "Explain machine learning" \
--with-cache pgvector \
--similarity-function l2
# Use custom Claude and embedding models
./ai_agent_with_cache.py \
--prompt "Summarize this article" \
--with-cache scylla \
--anthropic-api-model "claude-opus-4-20250514" \
--sentence-transformer-model "paraphrase-MiniLM-L6-v2"
Note: The current CLI uses a fixed cache TTL of 3600 seconds and a fixed similarity threshold of 0.95.
export ANTHROPIC_API_KEY="your-api-key"
# For PostgreSQL pgvector:
export POSTGRES_HOST="localhost"
export POSTGRES_USER="postgres"
export POSTGRES_PASSWORD="postgres"
./ai_agent_with_cache.py --prompt "What is the capital of France?" --with-cache pgvector
# For ScyllaDB:
export SCYLLA_CONTACT_POINTS="your-host.com"
export SCYLLA_USER="your-username"
export SCYLLA_PASSWORD="your-password"
./ai_agent_with_cache.py --prompt "What is the capital of France?" --with-cache scylla
- --prompt: The prompt to send to Claude
- --with-cache {none,scylla,pgvector}: Type of semantic cache (default: scylla)
  - none: Disable caching entirely
  - scylla: Use ScyllaDB with vector search
  - pgvector: Use PostgreSQL with pgvector extension
- --postgres-host: PostgreSQL host (default: localhost)
- --postgres-port: PostgreSQL port (default: 5432)
- --postgres-user: PostgreSQL username (default: postgres)
- --postgres-password: PostgreSQL password (default: empty string)
- --postgres-database: PostgreSQL database name (default: postgres)
- --postgres-schema: PostgreSQL schema name (default: llm_cache)
- --postgres-table: PostgreSQL table name (default: llm_responses)
- --scylla-contact-points: Comma-separated list of ScyllaDB hosts (default: 127.0.0.1)
- --scylla-user: ScyllaDB username (default: scylla)
- --scylla-password: ScyllaDB password (default: empty string)
- --scylla-keyspace: Keyspace name (default: llm_cache)
- --scylla-table: Table name (default: llm_responses)
- --similarity-function {cosine,l2,inner_product,l1}: Vector similarity function for PostgreSQL pgvector (default: cosine)
  - cosine: Cosine distance (default, best for normalized embeddings)
  - l2: Euclidean (L2) distance
  - inner_product: Negative inner product
  - l1: Manhattan (L1) distance
- ScyllaDB note: ScyllaDB currently uses cosine similarity for ANN index and threshold checks.
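For reference, pgvector provides one distance operator per function listed above. The following is a minimal sketch of how the --similarity-function values could map onto those operators. The dictionary and helper names are hypothetical, and the tool's actual query construction may differ.

```python
# Hypothetical mapping from --similarity-function values to pgvector operators.
PGVECTOR_OPERATORS = {
    "cosine": "<=>",         # cosine distance
    "l2": "<->",             # Euclidean (L2) distance
    "inner_product": "<#>",  # negative inner product
    "l1": "<+>",             # Manhattan (L1) distance (pgvector 0.7.0 or later)
}


def order_by_clause(similarity_function: str, column: str = "embedding") -> str:
    """Build an ORDER BY fragment for a nearest-neighbour query (illustrative only)."""
    try:
        operator = PGVECTOR_OPERATORS[similarity_function]
    except KeyError:
        raise ValueError(f"unsupported similarity function: {similarity_function}")
    # %s is a placeholder for the query embedding, bound later by the database driver.
    return f"ORDER BY {column} {operator} %s"


print(order_by_clause("l2"))
```

One thing to keep in mind: a pgvector HNSW index is built for a specific operator class, so an index created with vector_cosine_ops would only accelerate the cosine operator. Queries using another function would likely need a matching index to avoid a sequential scan.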
- Default TTL: 3600 seconds (1 hour)
- ScyllaDB: Automatic deletion after TTL expires (USING TTL)
- PostgreSQL: Time-based filtering in queries; expired rows can be removed with cleanup_expired()
- Current limitation: TTL is not exposed as a CLI flag or environment variable.
- --anthropic-api-key: Anthropic API key (overrides ANTHROPIC_API_KEY env var)
- --anthropic-api-model: Claude model to use (default: claude-sonnet-4-5-20250929)
- --sentence-transformer-model: Embedding model (default: all-MiniLM-L6-v2)
The following environment variables are supported by the CLI:
- ANTHROPIC_API_KEY: Anthropic API key
- ANTHROPIC_API_MODEL: Claude model name
- SENTENCE_TRANSFORMER_MODEL: SentenceTransformer model name
- SIMILARITY_FUNCTION: Vector similarity function for PostgreSQL pgvector
- POSTGRES_HOST: PostgreSQL host
- POSTGRES_PORT: PostgreSQL port
- POSTGRES_USER: PostgreSQL username
- POSTGRES_PASSWORD: PostgreSQL password
- POSTGRES_DATABASE: PostgreSQL database name
- POSTGRES_SCHEMA: PostgreSQL schema name
- POSTGRES_TABLE: PostgreSQL table name
- SCYLLA_CONTACT_POINTS: ScyllaDB hosts
- SCYLLA_USER: ScyllaDB username
- SCYLLA_PASSWORD: ScyllaDB password
- SCYLLA_KEYSPACE: ScyllaDB keyspace
- SCYLLA_TABLE: ScyllaDB table
Note: Command-line arguments always take precedence over environment variables. Similarity threshold and TTL are currently fixed in the CLI.
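One common way to get this precedence with argparse is to use the environment variable as the default for each flag, so an explicit flag always wins. The sketch below illustrates that pattern for two of the PostgreSQL options. The build_parser helper is hypothetical and is not taken from ai_agent_with_cache.py.

```python
import argparse
import os


def build_parser(env=None):
    """Return a parser whose defaults come from environment variables."""
    env = os.environ if env is None else env
    parser = argparse.ArgumentParser()
    # If the flag is absent, argparse falls back to the default, which is the
    # environment value when set and the documented default otherwise.
    parser.add_argument("--postgres-host",
                        default=env.get("POSTGRES_HOST", "localhost"))
    parser.add_argument("--postgres-port", type=int,
                        default=int(env.get("POSTGRES_PORT", "5432")))
    return parser


# Environment supplies the host; the command line overrides the port.
fake_env = {"POSTGRES_HOST": "db.internal", "POSTGRES_PORT": "6000"}
args = build_parser(fake_env).parse_args(["--postgres-port", "6543"])
print(args.postgres_host, args.postgres_port)
```

Passing the environment as a parameter is only a convenience for testing; a real CLI would normally read os.environ directly.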
- Embedding Generation: When you submit a prompt, the tool generates a 384-dimension vector embedding using SentenceTransformer
- Vector Search:
  - ScyllaDB: Uses ANN vector search with cosine similarity
  - PostgreSQL: Uses the selected similarity function (--similarity-function / SIMILARITY_FUNCTION)
- Similarity Threshold: Results are filtered by a fixed threshold of 0.95
  - ScyllaDB: Calculates cosine similarity in Python after retrieval
  - PostgreSQL: Converts threshold to distance and filters in SQL
- TTL Filtering: Expired entries are excluded using a fixed TTL of 3600 seconds
  - ScyllaDB: Uses native TTL (automatic deletion)
  - PostgreSQL: Filters by created_at timestamp in queries
- Cache Hit/Miss:
  - Hit: If a similar prompt above threshold is found, the cached response is returned instantly
  - Miss: The prompt is sent to Claude, and both the embedding and response are cached with TTL
- Storage: Cached entries include the prompt text, embedding vector, response, and timestamp
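The threshold step can be illustrated with a small, dependency-free sketch. It assumes plain Python lists as embeddings, and the function names are hypothetical. For cosine, distance is 1 minus similarity, so the 0.95 similarity threshold corresponds to a maximum distance of 0.05. This matches the demo output, where a similarity of 0.9570 is reported with a distance of 0.0430.

```python
import math

SIMILARITY_THRESHOLD = 0.95  # fixed threshold described above


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat a zero vector as dissimilar to everything
    return dot / (norm_a * norm_b)


def max_cosine_distance(threshold=SIMILARITY_THRESHOLD):
    """Convert a similarity threshold into the equivalent cosine distance bound."""
    return 1.0 - threshold


def is_similar_enough(query_embedding, cached_embedding,
                      threshold=SIMILARITY_THRESHOLD):
    """Post-retrieval check: accept the candidate only at or above the threshold."""
    return cosine_similarity(query_embedding, cached_embedding) >= threshold


print(is_similar_enough([1.0, 0.0], [1.0, 0.0]))
print(is_similar_enough([1.0, 0.0], [0.0, 1.0]))
```

The first form (computing similarity after retrieval) corresponds to the ScyllaDB path described above, while the distance bound is the kind of value a SQL filter would compare against on the PostgreSQL path.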
The tool automatically creates the following PostgreSQL schema:
CREATE SCHEMA llm_cache;
CREATE EXTENSION vector;
CREATE TABLE llm_cache.llm_responses (
prompt_hash TEXT PRIMARY KEY,
prompt TEXT NOT NULL,
embedding vector(384),
response TEXT NOT NULL,
created_at TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP
);
CREATE INDEX llm_responses_embedding_idx
ON llm_cache.llm_responses
USING hnsw (embedding vector_cosine_ops);
The tool automatically creates the following ScyllaDB schema:
CREATE KEYSPACE llm_cache
WITH replication = {'class': 'NetworkTopologyStrategy', 'replication_factor': 3};
CREATE TABLE llm_responses (
prompt_hash text PRIMARY KEY,
prompt text,
embedding vector<float, 384>,
response text,
created_at timestamp
);
CREATE CUSTOM INDEX embedding_ann_index
ON llm_cache.llm_responses(embedding)
USING 'vector_index'
WITH OPTIONS = {'similarity_function': 'COSINE'};
The following demo shows how the semantic cache works with PostgreSQL and pgvector:
- Create the database in Docker:
$ ./postgres-pgvector-docker/deploy-pgvector-docker.py start \
--name pgvector-local
- Set up some environment variables:
export ANTHROPIC_API_KEY="your-api-key"
# For PostgreSQL:
export POSTGRES_HOST="localhost"
export POSTGRES_PORT="5432"
export POSTGRES_USER="postgres"
export POSTGRES_PASSWORD="postgres"
# Suppresses some unnecessary messages
export TOKENIZERS_PARALLELISM=false
- Ask Anthropic and the cache about the capital of France:
$ ./ai_agent_with_cache.py \
--prompt "What is the capital of France?" \
--with-cache pgvector
Loading SentenceTransformer model...
Generating embedding for prompt...
Embedding dimension: 384
Connecting to PostgreSQL...
Checking cache...
[-] Cache miss - querying Claude...
[+] Response cached successfully (TTL: 3600s)
Claude's response:
--------------------------------------------------------------------------------
The capital of France is Paris.
--------------------------------------------------------------------------------
This triggered a cache miss, because it was the first time that we asked the question.
- Ask Anthropic and the cache the same question again:
$ ./ai_agent_with_cache.py \
--prompt "What is the capital of France?" \
--with-cache pgvector
Loading SentenceTransformer model...
Generating embedding for prompt...
Embedding dimension: 384
Connecting to PostgreSQL...
Checking cache...
[+] Cache hit! (similarity: 1.0000, distance: 0.0000)
[Using cached response]
Claude's response:
--------------------------------------------------------------------------------
The capital of France is Paris.
--------------------------------------------------------------------------------
This triggered a cache hit, because it was an exact match.
- Ask Anthropic and the cache about the current capital of France, which is a different question with a similar meaning:
$ ./ai_agent_with_cache.py \
--prompt "What is the current capital of France?" \
--with-cache pgvector
Loading SentenceTransformer model...
Generating embedding for prompt...
Embedding dimension: 384
Connecting to PostgreSQL...
Checking cache...
[-] Cache miss - querying Claude...
[+] Response cached successfully (TTL: 3600s)
Claude's response:
--------------------------------------------------------------------------------
The current capital of France is Paris.
--------------------------------------------------------------------------------
This triggered a cache miss, because the similarity (0.9438) was lower than the threshold (0.95).
- Ask Anthropic and the cache about the capital of France right now, which is another different question with a similar meaning:
$ ./ai_agent_with_cache.py \
--prompt "What is the capital of France right now?" \
--with-cache pgvector
Loading SentenceTransformer model...
Generating embedding for prompt...
Embedding dimension: 384
Connecting to PostgreSQL...
Checking cache...
[+] Cache hit! (similarity: 0.9570, distance: 0.0430)
[Using cached response]
Claude's response:
--------------------------------------------------------------------------------
The current capital of France is Paris.
--------------------------------------------------------------------------------
This triggered a cache hit, because the similarity (0.9570) was higher than the threshold (0.95).
- Stop the database container:
$ ./postgres-pgvector-docker/deploy-pgvector-docker.py stop \
--name pgvector-local
- Clean up the database container:
$ ./postgres-pgvector-docker/deploy-pgvector-docker.py destroy \
--name pgvector-local \
--remove-volumes
The following demo shows how the semantic cache works with ScyllaDB Cloud with Vector Search:
- Create the cluster in ScyllaDB Cloud:
$ export SCYLLA_CLOUD_API_KEY="your-api-key"
$ ./scylla-cloud/deploy-scylla-cloud.py create \
--name my-vector-cache \
--allowed-ips "MY_IP"
- Wait for the cluster to become available:
$ ./scylla-cloud/deploy-scylla-cloud.py status \
--name my-vector-cache
- Wait for the vector search nodes to become available.
The vector search nodes are provisioned after the cluster itself is ready, so wait a little longer for them to be ready.
- Obtain connection information:
$ ./scylla-cloud/deploy-scylla-cloud.py info \
--name my-vector-cache
- Set up some environment variables:
export ANTHROPIC_API_KEY="your-api-key"
# For ScyllaDB:
export SCYLLA_CONTACT_POINTS="your-host.com"
export SCYLLA_USER="your-username"
export SCYLLA_PASSWORD="your-password"
# Suppresses some unnecessary messages
export TOKENIZERS_PARALLELISM=false
- Ask Anthropic and the cache about the capital of France:
$ ./ai_agent_with_cache.py \
--prompt "What is the capital of France?" \
--with-cache scylla
Loading SentenceTransformer model...
Generating embedding for prompt...
Embedding dimension: 384
Connecting to ScyllaDB...
Waiting for vector index to initialize...
Checking cache...
[-] Cache miss - querying Claude...
[+] Response cached successfully (TTL: 3600s)
Claude's response:
--------------------------------------------------------------------------------
The capital of France is Paris.
--------------------------------------------------------------------------------
This triggered a cache miss, because it was the first time that we asked the question.
- Ask Anthropic and the cache the same question again:
$ ./ai_agent_with_cache.py \
--prompt "What is the capital of France?" \
--with-cache scylla
Loading SentenceTransformer model...
Generating embedding for prompt...
Embedding dimension: 384
Connecting to ScyllaDB...
Waiting for vector index to initialize...
Checking cache...
[+] Cache hit! (similarity: 1.0000)
[Using cached response]
Claude's response:
--------------------------------------------------------------------------------
The capital of France is Paris.
--------------------------------------------------------------------------------
This triggered a cache hit, because it was an exact match.
- Ask Anthropic and the cache about the current capital of France, which is a different question with a similar meaning:
$ ./ai_agent_with_cache.py \
--prompt "What is the current capital of France?" \
--with-cache scylla
Loading SentenceTransformer model...
Generating embedding for prompt...
Embedding dimension: 384
Connecting to ScyllaDB...
Waiting for vector index to initialize...
Checking cache...
[-] Similarity too low (0.9438 < 0.95)
[-] Cache miss - querying Claude...
[+] Response cached successfully (TTL: 3600s)
Claude's response:
--------------------------------------------------------------------------------
The current capital of France is Paris.
--------------------------------------------------------------------------------
This triggered a cache miss, because the similarity (0.9438) was lower than the threshold (0.95).
- Ask Anthropic and the cache about the capital of France right now, which is another different question with a similar meaning:
$ ./ai_agent_with_cache.py \
--prompt "What is the capital of France right now?" \
--with-cache scylla
Loading SentenceTransformer model...
Generating embedding for prompt...
Embedding dimension: 384
Connecting to ScyllaDB...
Waiting for vector index to initialize...
Checking cache...
[+] Cache hit! (similarity: 0.9570)
[Using cached response]
Claude's response:
--------------------------------------------------------------------------------
The current capital of France is Paris.
--------------------------------------------------------------------------------
This triggered a cache hit, because the similarity (0.9570) was higher than the threshold (0.95).
- Clean up the cluster:
$ ./scylla-cloud/deploy-scylla-cloud.py destroy \
--name my-vector-cache
First run (cache miss):
./ai_agent_with_cache.py --prompt "Explain quantum computing"
# Output: [-] Cache miss - querying Claude...
# Response time: ~2-3 seconds
Second run (cache hit):
./ai_agent_with_cache.py --prompt "Explain quantum computing"
# Output: [+] Cache hit! (similarity: 1.0000)
# Response time: ~100-200ms
PostgreSQL uses time-based filtering for TTL, so expired entries remain in the database until cleaned up. To remove expired entries:
import asyncio

from ai_agent_with_cache import PgVectorCache


async def main():
    # Create cache instance
    cache = PgVectorCache(
        host="localhost",
        user="postgres",
        password="postgres",
        ttl_seconds=3600,  # 1 hour
    )
    # Connect and clean up; close the connection even if cleanup fails
    await cache.connect()
    try:
        await cache.cleanup_expired()  # Logs how many rows were removed (if any)
    finally:
        await cache.close()


# Top-level await is not valid in a regular script, so run the coroutine explicitly
asyncio.run(main())
You can schedule this as a periodic task (e.g., a cron job) or run it manually when needed. ScyllaDB does not need this because it deletes expired entries automatically.
# Use L2 distance instead of cosine
./ai_agent_with_cache.py \
--prompt "Explain machine learning" \
--with-cache pgvector \
--similarity-function l2
# Use inner product for normalized vectors
./ai_agent_with_cache.py \
--prompt "Explain machine learning" \
--with-cache pgvector \
--similarity-function inner_product
# Use Claude Opus
./ai_agent_with_cache.py \
--prompt "Write a poem" \
--anthropic-api-model "claude-opus-4-20250514"
# Use different embedding model
./ai_agent_with_cache.py \
--prompt "Summarize this text" \
--sentence-transformer-model "paraphrase-MiniLM-L6-v2"
./ai_agent_with_cache.py \
--prompt "What's the weather like?" \
--with-cache none
PostgreSQL pgvector:
- First Request: Includes model loading time (~1-2 seconds for SentenceTransformer)
- Cache Hit: Typically 50-150ms (local PostgreSQL)
- Cache Miss: Depends on Claude API response time (~2-5 seconds)
- Index Type: Uses HNSW for better query performance
- Embedding Dimension: 384 for default model (all-MiniLM-L6-v2)
- TTL Overhead: Minimal - timestamp filtering in WHERE clause
- Cleanup: Manual via cleanup_expired() method (scheduled or on-demand)
ScyllaDB:
- First Request: Includes model loading time (~1-2 seconds for SentenceTransformer)
- Cache Hit: Typically 100-200ms (depends on ScyllaDB latency)
- Cache Miss: Depends on Claude API response time (~2-5 seconds)
- Embedding Dimension: 384 for default model (all-MiniLM-L6-v2)
- TTL Overhead: None - native TTL with automatic deletion
- Cleanup: Automatic - no maintenance required
TTL:
- Default: 3600 seconds (1 hour)
- Current CLI behavior: TTL is fixed at 3600 seconds
- ScyllaDB: Uses native USING TTL clause; entries are automatically deleted after expiration
- PostgreSQL: Uses time-based filtering in queries; expired entries remain until cleanup
Similarity threshold:
- Default: 0.95 (95% similarity for cosine)
- Current CLI behavior: Threshold is fixed at 0.95
- ScyllaDB: Calculates cosine similarity in Python after retrieval, filters results
- PostgreSQL: Converts to distance threshold, filters in SQL WHERE clause
- Cache Output: Shows similarity score on cache hits: [+] Cache hit! (similarity: 0.9876)
A cached response is returned when all of the following hold:
- A semantically similar prompt is found (via vector search)
- Similarity meets or exceeds the threshold (default: 0.95)
- The entry has not expired (TTL check passes)
A new Claude query is made when any of the following holds:
- No similar prompts found in cache
- Similar prompts exist but similarity < threshold
- All similar prompts have expired (past TTL)
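The rules above can be condensed into a single predicate. This is a sketch under stated assumptions rather than the tool's own code: the function name is hypothetical, similarity is passed as None when the vector search returns no candidate, and an entry is treated as expired once its age reaches the TTL (the exact boundary behavior of the tool is not documented here).

```python
from datetime import datetime, timedelta, timezone

TTL_SECONDS = 3600           # fixed TTL described above
SIMILARITY_THRESHOLD = 0.95  # fixed threshold described above


def should_use_cached(similarity, created_at, now=None,
                      ttl_seconds=TTL_SECONDS, threshold=SIMILARITY_THRESHOLD):
    """Return True only if a candidate exists, is similar enough, and is unexpired."""
    if similarity is None:
        return False  # vector search found no candidate
    if similarity < threshold:
        return False  # candidate exists but is not similar enough
    now = now or datetime.now(timezone.utc)
    # Query-time TTL check, analogous to filtering on created_at in SQL.
    return now - created_at < timedelta(seconds=ttl_seconds)


now = datetime(2025, 1, 1, 12, 0, tzinfo=timezone.utc)
fresh = now - timedelta(minutes=10)
stale = now - timedelta(hours=2)
print(should_use_cached(0.9570, fresh, now=now))  # similar and fresh
print(should_use_cached(0.9438, fresh, now=now))  # below threshold
print(should_use_cached(1.0, stale, now=now))     # expired
```

With ScyllaDB the expiry branch would rarely matter, since native TTL removes expired rows before they can be returned.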
ValueError: ANTHROPIC_API_KEY must be set either via --anthropic-api-key or as an environment variable
Solution: Set the API key via command-line or environment variable.
psycopg.OperationalError: connection failed
Solution:
- Verify PostgreSQL is running: ./postgres-pgvector-docker/deploy-pgvector-docker.py status --name your-container
- Check connection parameters (host, port, username, password)
- Ensure pgvector extension is installed
NoHostAvailable: Unable to connect to any servers
Solution: Verify your ScyllaDB contact points, credentials, and network connectivity. If using ScyllaDB Cloud, ensure your cluster is active using ./scylla-cloud/deploy-scylla-cloud.py status --name your-cluster.
✗ Vector index not ready yet. Try again in a few seconds.
or
Cache lookup error: Error from server: code=2200 [Invalid query] message="ANN ordering by vector requires the column to be indexed using 'vector_index'"
Solution: The vector index is still initializing. This is most common when:
- First connecting to a new ScyllaDB Cloud cluster
- Creating a new keyspace for the first time
- Cloud deployments with higher network latency
The tool automatically waits 5 seconds for index initialization. If you still see this error:
- Wait 10-15 seconds and try your query again
- For cloud deployments, initialization may take longer
- Verify the cluster has vector search enabled:
./scylla-cloud/deploy-scylla-cloud.py info --name your-cluster
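If you call the cache from your own code, the "wait and try again" advice can be automated with a retry loop. The helper below is a generic sketch with exponential backoff. It is not part of the tool, and its name and parameters are hypothetical.

```python
import time


def retry_until_ready(operation, attempts=5, initial_delay=1.0, backoff=2.0,
                      retry_on=(Exception,), sleep=time.sleep):
    """Call operation() until it succeeds, sleeping longer after each failure."""
    delay = initial_delay
    for attempt in range(1, attempts + 1):
        try:
            return operation()
        except retry_on:
            if attempt == attempts:
                raise  # out of attempts: surface the last error
            sleep(delay)
            delay *= backoff


# Stand-in for a cache lookup that fails while the index initializes.
state = {"calls": 0}


def flaky_lookup():
    state["calls"] += 1
    if state["calls"] < 3:
        raise RuntimeError("Vector index not ready yet")
    return "cached response"


print(retry_until_ready(flaky_lookup, initial_delay=0.01))
```

In practice you would likely narrow retry_on to the driver exception that carries the "Invalid query" error instead of retrying on every exception.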
Compare the performance of ScyllaDB and PostgreSQL pgvector backends using the included benchmark script:
With local PostgreSQL:
./benchmark.py --backends both \
--postgres-password postgres \
--scylla-contact-points "your-host.com" \
--scylla-user scylla \
--scylla-password "your-password"
With remote PostgreSQL:
./benchmark.py --backends both \
--postgres-host db.example.com \
--postgres-port 5432 \
--postgres-user myuser \
--postgres-password mypassword \
--postgres-database mydb \
--scylla-contact-points "node-0.scylla.cloud,node-1.scylla.cloud" \
--scylla-user scylla \
--scylla-password "your-password"
The benchmark tests four key scenarios:
- Cache Hit Performance: Queries the same cached prompt 100 times to measure pure retrieval speed
- Semantic Similarity Matching: Tests whether semantically similar prompts trigger cache hits
- Cache Miss Performance: Measures lookup + write latency for new prompts
- Concurrency Testing (optional): Tests read/write performance under concurrent load
# Test only PostgreSQL (local)
./benchmark.py --backends pgvector
# Test PostgreSQL with remote instance
./benchmark.py --backends pgvector \
--postgres-host db.example.com \
--postgres-port 5432 \
--postgres-user myuser \
--postgres-password mypassword \
--postgres-database mydb
# Test only ScyllaDB
./benchmark.py --backends scylla
# Save results to JSON
./benchmark.py --backends both --output json
# Use custom prompts
./benchmark.py --prompts-file my_prompts.txt
Test cache performance under concurrent load to measure throughput (QPS) and latency at different concurrency levels:
# Run concurrency tests with default levels (1, 5, 10, 25)
./benchmark.py --backends both \
--concurrency-test \
--postgres-password postgres \
--scylla-contact-points "your-host.com" \
--scylla-user scylla \
--scylla-password "your-password"
# Custom concurrency levels and operation count
./benchmark.py --backends pgvector \
--concurrency-test \
--concurrency-levels "1,10,50,100" \
--concurrent-operations 200
Concurrency tests measure:
- Concurrent Reads: Multiple simultaneous cache lookups (tests read scalability)
- Concurrent Writes: Multiple simultaneous cache inserts (tests write contention)
- Mixed Workload: Realistic 80% read / 20% write ratio (tests real-world performance)
Each test reports:
- QPS (Queries Per Second): Throughput at the given concurrency level
- Latency Percentiles: p50, p95, p99 latencies under concurrent load
- Success/Failure Rates: Error rate under load
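To make these metrics concrete, here is one way they could be derived from raw measurements using only the standard library. How benchmark.py computes them internally is not shown in this document, so treat the summarize function and its field names as illustrative assumptions.

```python
import statistics


def summarize(latencies_ms, wall_clock_seconds, failures=0):
    """Summarize one concurrency level from successful-operation latencies.

    latencies_ms needs at least two samples for statistics.quantiles.
    """
    # n=100 yields 99 cut points; index k-1 is the k-th percentile.
    cuts = statistics.quantiles(latencies_ms, n=100, method="inclusive")
    successes = len(latencies_ms)
    return {
        "qps": successes / wall_clock_seconds,
        "p50_ms": cuts[49],
        "p95_ms": cuts[94],
        "p99_ms": cuts[98],
        "error_rate": failures / (successes + failures),
    }


# Synthetic example: 101 operations taking 1..101 ms, completed in 2 seconds.
report = summarize(list(range(1, 102)), wall_clock_seconds=2.0, failures=0)
print(report)
```

Note that QPS uses the wall-clock duration of the whole run and not the sum of latencies, because concurrent operations overlap in time.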
Use Cases:
- Comparing local vs remote database performance under load
- Determining optimal concurrency for your workload
- Identifying bottlenecks and scaling limits
- Testing connection pool configurations
Edit benchmark_prompts.txt to customize the prompts used in benchmarks. The file contains 1,189 prompts (excluding comments and blank lines) organized into categories:
- Base prompts (for cache population)
- Semantically similar variants (for similarity testing)
- Programming concepts, data structures, algorithms (200 prompts)
- Database, web development, cloud/DevOps (300 prompts)
- Security, ML/AI, business topics (250 prompts)
- Science, general knowledge, and miscellaneous (400+ prompts)
- Short queries, long-form questions, and edge cases (50+ prompts)
Performance comparison (local PostgreSQL vs ScyllaDB Cloud):
| Metric | PostgreSQL pgvector | ScyllaDB Cloud |
|---|---|---|
| Cache Hit (p50) | 1.15ms | 169.39ms |
| Semantic Hit Rate | 100% | Varies |
| Cache Miss Write (p50) | ~0ms | 307.98ms |
Note: Network latency significantly impacts cloud-based backends. For fair comparisons, deploy both backends in the same environment (both local or both cloud).
ScyllaDB shows 0% semantic similarity hit rate:
- The vector index needs time to initialize (5-10 seconds)
- The benchmark automatically waits and verifies the index before testing
- If you see errors about "ANN ordering requires indexed column", wait longer and retry
Different results between runs:
- First run includes model loading time (~1-2 seconds for SentenceTransformer)
- Cold vs warm cache affects initial query performance
- Network conditions vary for cloud backends
ScyllaDB: The tool waits 5 seconds for index initialization. This is sufficient for most scenarios, but cloud deployments or large existing caches may require additional time. The ScyllaDB cache includes:
- Configurable connection pooling (default: 10 connections per host, 1024 max requests per connection)
- Prepared statement caching for INSERT operations
- Proper core and max connection pool configuration for optimal concurrent performance
PostgreSQL: HNSW indexes are created automatically and are immediately usable. For large datasets, you may want to create indexes after loading initial data for better performance. The PostgreSQL cache includes:
- AsyncConnectionPool with configurable size (default: 10 connections)
- Parameterized SQL queries for lookup and insert operations
- Connection pooling for efficient concurrent operations
Contributions are welcome! Please feel free to submit a Pull Request.
This project is licensed under the MIT License.
- Anthropic for Claude AI
- ScyllaDB for high-performance vector search
- pgvector for PostgreSQL vector similarity search
- SentenceTransformers for semantic embeddings
- Autogen for AI agent framework