ArchAI - CypherMind is an advanced natural language to Cypher query translation system that democratizes access to Neo4j graph databases. It combines cutting-edge LLM technology with intelligent caching, template-based query generation, and graph data science capabilities to provide a production-grade solution for querying graph databases using natural language.
- Multi-Strategy Query Resolution: intelligent cascade of six cache resolution levels with LLM generation as the final fallback (recent cache → templates → exact match → semantic signature → similarity search → fuzzy matching → LLM)
- Graph Data Science Integration: Execute advanced graph algorithms (PageRank, Betweenness Centrality, Louvain Community Detection, Node Similarity) directly from natural language
- Template-Based Query System: Zero-latency responses for common query patterns with NLP-powered parameter extraction
- Multi-LLM Support: Provider-agnostic architecture via LiteLLM (Gemini, GPT-4, Claude, and more)
- Advanced Semantic Caching: Qdrant-powered vector similarity search with quantization, indexing, and multiple caching layers
- Async Operations: Full async support for concurrent query processing and batch operations
- Production-Ready: Comprehensive test suite with 74+ test cases, extensive error handling, and performance monitoring
- Sub-100ms response time for cached and templated queries
- 8x-16x memory reduction through vector quantization
- Intelligent fallback that handles query variations without invoking the LLM
- Multi-layer caching (recent results, templates, frequent queries, vector database)
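The "recent results" layer above is a small in-memory LRU cache. A minimal sketch of the idea (an illustration, not the project's implementation — the real system keeps the last 3 queries in this layer and caps the larger cache at 10,000 entries):

```python
from collections import OrderedDict
from typing import Any, Optional

class RecentCache:
    """Tiny LRU cache illustrating the in-memory 'recent results' layer."""

    def __init__(self, max_size: int = 3):
        self.max_size = max_size
        self._data: "OrderedDict[str, Any]" = OrderedDict()

    def get(self, question: str) -> Optional[Any]:
        if question not in self._data:
            return None
        self._data.move_to_end(question)  # mark as most recently used
        return self._data[question]

    def put(self, question: str, result: Any) -> None:
        self._data[question] = result
        self._data.move_to_end(question)
        if len(self._data) > self.max_size:
            self._data.popitem(last=False)  # evict least recently used

cache = RecentCache(max_size=3)
for q in ["q1", "q2", "q3", "q4"]:
    cache.put(q, f"result-{q}")
# after inserting four entries with max_size=3, "q1" has been evicted
```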
- Language: Python 3.8+
- Web Framework: Streamlit
- Graph Database: Neo4j 5.x
- Vector Database: Qdrant
- LLM Integration: LiteLLM (multi-provider support)
- neo4j-driver: Official Neo4j Python driver
- litellm (1.70.4): Multi-provider LLM abstraction layer
- qdrant-client (1.13.3): Vector database client with quantization support
- fastembed: Fast local text embeddings
- spacy (3.7+): NLP for entity extraction and query analysis
- rapidfuzz (3.0+): Fuzzy string matching for query variations
- pandas: Data manipulation and tabular display
- python-dotenv: Environment configuration
- pytest (7.0+): Testing framework
- pytest-mock (3.10+): Mocking support
- pytest-asyncio (0.21+): Async test support
├── src/
│ ├── app_streamlit.py - Main Streamlit application with enhanced UI
│ ├── main.py - Data import script
│ ├── backend/
│ │ ├── llm.py - Multi-LLM integration (Gemini/GPT/Claude)
│ │ ├── semantic_cache.py - Advanced 6-strategy semantic caching
│ │ ├── gds_manager.py - Graph Data Science algorithm execution
│ │ ├── import_data.py - Graph data import from CSV
│ │ └── utils/
│ │ └── streamlit_app_utils.py - Utility functions for UI
├── tests/
│ ├── backend/
│ │ ├── test_llm.py - 11 tests for LLM integration
│ │ ├── test_semantic_cache.py - 23 tests for caching strategies
│ │ ├── test_gds_manager.py - 10 tests for GDS algorithms
│ │ ├── test_import_data.py - 9+ tests for data import
│ │ └── utils/
│ │ └── test_streamlit_app_utils.py
│ ├── test_app_streamlit.py - 20+ tests for UI
│ └── test_main.py - 10+ tests for main script
├── data/ - Data files for import
├── data_fake/
│ └── query_template.json - Query template library
├── img/
│ ├── logo_cyphermind.png
│ └── component_diagram.png
├── .env.example - Environment variables template
├── docker-compose.yml - Docker orchestration
├── Dockerfile - Application container
├── pytest.ini - Test configuration
└── requirements_streamlit.txt - Production dependencies
- Python 3.8 or higher
- Neo4j instance (local or cloud) - version 5.x recommended
- Qdrant instance (local, cloud, or in-memory)
- LLM API key (Gemini, OpenAI, Anthropic, etc.)
- Docker & Docker Compose (optional, for containerized deployment)
1. Clone the repository

git clone https://github.com/ArchAI-Labs/cypher_mind.git
cd cypher_mind

2. Create a virtual environment

python -m venv .venv
source .venv/bin/activate  # On Windows: .venv\Scripts\activate

3. Install dependencies

pip install -r requirements_streamlit.txt

4. Download the spaCy language model (for NLP entity extraction)

python -m spacy download en_core_web_sm
Create a .env file in the project root with the following variables:
# Neo4j Configuration
NEO4J_URI=neo4j://localhost:7687
NEO4J_USER=neo4j
NEO4J_PASSWORD=your_password
# LLM Configuration (choose one)
MODEL=gemini/gemini-pro # For Google Gemini
# MODEL=gpt-4 # For OpenAI GPT-4
# MODEL=claude-3-opus-20240229 # For Anthropic Claude
GEMINI_API_KEY=your_gemini_api_key # If using Gemini
# OPENAI_API_KEY=your_openai_key # If using OpenAI
# ANTHROPIC_API_KEY=your_anthropic_key # If using Anthropic
# Qdrant Configuration
QDRANT_COLLECTION=cypher_cache
QDRANT_MODE=memory # Options: memory, docker, cloud
# QDRANT_HOST=your_cloud_host # For cloud mode
# QDRANT_API_KEY=your_qdrant_key # For cloud mode
# QDRANT_URL=http://localhost:6333 # For docker mode
# Embedding Configuration
EMBEDDER=sentence-transformers/all-MiniLM-L6-v2
VECTOR_SIZE=384
# Data Import Configuration
NODE_URL=data/nodes.json # Path to node definitions
REL_URL=data/relationships.json # Path to relationship definitions
SAMPLE_QUESTIONS=data/sample_questions.json
RESET=false # Set to "true" to reset DB on startup
# Context Configuration (for LLM schema awareness)
NODE_CONTEXT_URL=data/node_context.json # Node types and properties
REL_CONTEXT_URL=data/rel_context.json # Relationship types and properties
# Template Configuration (optional)
QUERY_TEMPLATE_PATH=data_fake/query_template.json # For template-based queries
1. Build the container

docker compose build --no-cache

2. Start all services (Neo4j, Qdrant, CypherMind)

docker compose up -d

3. Access the application

- Streamlit UI: http://localhost:8501
- Neo4j Browser: http://localhost:7474
- Qdrant Dashboard: http://localhost:6333/dashboard
1. Start Neo4j and Qdrant (if not using cloud services)

# Neo4j
neo4j start

# Qdrant (using Docker)
docker run -p 6333:6333 qdrant/qdrant

2. Import initial data (optional)

python src/main.py

3. Run the Streamlit application

streamlit run src/app_streamlit.py

4. Access the application at http://localhost:8501
Create query templates in data_fake/query_template.json for common query patterns:
{
"templates": [
{
"intent": "get_top_users",
"template": "get top {count} users from project {project}",
"parameters": ["count", "project"],
"cypher_template": "MATCH (p:Person)-[:WORKS_ON]->(proj:Project {name: '{project}'}) RETURN p.name, p.email LIMIT {count}",
"priority": 1,
"parameter_patterns": {
"count": "\\b(?:top|first)\\s+(\\d+)\\b",
"project": "project\\s+([A-Za-z0-9_\\s]+)"
},
"aliases": [
"show me top {count} users in project {project}",
"list {count} users working on {project}"
]
}
]
}

Benefits:
- Zero-latency query execution (no LLM call)
- Consistent query structure
- Parameter validation
- Support for query variations via aliases
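The `parameter_patterns` shown above are ordinary regular expressions. A simplified sketch of how extraction and substitution can work (an illustration of the mechanism, not the project's exact matcher):

```python
import re
from typing import Optional

# Patterns and template copied from the example above
parameter_patterns = {
    "count": r"\b(?:top|first)\s+(\d+)\b",
    "project": r"project\s+([A-Za-z0-9_\s]+)",
}
cypher_template = (
    "MATCH (p:Person)-[:WORKS_ON]->(proj:Project {name: '{project}'}) "
    "RETURN p.name, p.email LIMIT {count}"
)

def fill_template(question: str) -> Optional[str]:
    """Extract parameters from the question and fill the Cypher template."""
    params = {}
    for name, pattern in parameter_patterns.items():
        m = re.search(pattern, question, re.IGNORECASE)
        if not m:
            return None  # a parameter is missing, so the template does not apply
        params[name] = m.group(1).strip()
    # str.replace instead of str.format, because the Cypher contains literal braces
    cypher = cypher_template
    for name, value in params.items():
        cypher = cypher.replace("{" + name + "}", value)
    return cypher

query = fill_template("get top 5 users from project Apollo")
```

A template hit like this skips the LLM entirely, which is where the zero-latency benefit comes from.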
The GDS Manager allows you to execute graph algorithms directly from natural language or programmatically:
- PageRank - Identifies influential nodes
- Betweenness Centrality - Finds bridge nodes
- Closeness Centrality - Measures node accessibility
- Louvain Community Detection - Discovers communities
- Node Similarity - Finds similar nodes
from backend.gds_manager import GDSManager
import os
# Initialize
uri = os.getenv("NEO4J_URI")
user = os.getenv("NEO4J_USER")
password = os.getenv("NEO4J_PASSWORD")
gds = GDSManager(uri, user, password)
# Create a graph projection
gds.create_graph_projection(
graph_name="my_graph",
node_projection=["Person", "Project"],
relationship_projection={
"WORKS_ON": {
"type": "WORKS_ON",
"orientation": "NATURAL"
}
}
)
# Run PageRank
results = gds.run_pagerank(
graph_name="my_graph",
write_property="pagerank"
)
# Get top ranked nodes
gds.get_top_nodes_by_algorithm(
algorithm="pagerank",
property_name="pagerank",
limit=10
)
# Cleanup
gds.drop_graph_projection("my_graph")
gds.close()

Import nodes and relationships from CSV files:
from backend.import_data import GraphImport
import os
uri = os.getenv("NEO4J_URI")
user = os.getenv("NEO4J_USER")
password = os.getenv("NEO4J_PASSWORD")
importer = GraphImport(uri, user, password)
# Define node files
node_files = {
"data/persons.csv": ("Person", ["name", "email", "age"], "id"),
"data/projects.csv": ("Project", ["name", "description"], "id")
}
# Define relationship files
relationship_files = {
"data/works_on.csv": ("WORKS_ON", "Person", "Project", ["person_id", "project_id"], ["role"])
}
# Import all data
importer.import_all(node_files, relationship_files)
importer.close()

Before using the Streamlit app, create context JSON files to help the LLM understand your graph schema:
node_context.json:
{
"Person": ["name", "email", "age"],
"Project": ["name", "description", "start_date"]
}

rel_context.json:
{
"WORKS_ON": ["person_id", "project_id", "role"],
"MANAGES": ["manager_id", "project_id"]
}

User Question
↓
[1] Recent Results Cache (Last 3 queries, in-memory)
↓ (miss)
[2] Template Matching (Regex + NLP parameter extraction)
↓ (miss)
[3] Exact Vector Match (Similarity > 0.95)
↓ (miss)
[4] Semantic Signature (NLP entity matching)
↓ (miss)
[5] Semantic Similarity (Fuzzy vector search)
↓ (miss)
[6] Fuzzy String Matching (Levenshtein distance)
↓ (miss)
[7] LLM Generation (Gemini/GPT/Claude)
↓
Store in Cache → Return Result
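Level 6 falls back to fuzzy string matching; the project uses rapidfuzz for this. The idea can be sketched with the standard library's difflib — an illustrative stand-in, not the project's implementation:

```python
from difflib import SequenceMatcher
from typing import Optional

# Hypothetical cached question -> Cypher pairs
cached_queries = {
    "show me top 10 users": "MATCH (u:User) RETURN u LIMIT 10",
    "list all projects": "MATCH (p:Project) RETURN p",
}

def fuzzy_lookup(question: str, threshold: float = 0.85) -> Optional[str]:
    """Return the cached Cypher whose question is closest to the input."""
    best_score, best_cypher = 0.0, None
    for cached_q, cypher in cached_queries.items():
        score = SequenceMatcher(None, question.lower(), cached_q.lower()).ratio()
        if score > best_score:
            best_score, best_cypher = score, cypher
    return best_cypher if best_score >= threshold else None

hit = fuzzy_lookup("show me the top 10 users")  # small wording variation
miss = fuzzy_lookup("delete everything")        # unrelated, falls through to the LLM
```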
- Streamlit UI receives user input
- Semantic Cache attempts multi-strategy resolution
- LLM Module generates Cypher if cache misses
- Neo4j Driver executes queries
- GDS Manager handles graph algorithm requests
- Results formatted and displayed
- Cache Updated with new query-result pairs
- Layered Architecture: Clear separation of UI, business logic, and data access
- Strategy Pattern: Multiple query resolution strategies with intelligent fallback
- Cache-Aside Pattern: Multi-layer caching with write-through
- Repository Pattern: Abstracted database access via managers
- Template Method: Extensible algorithm execution framework
- Async/Await: Non-blocking operations for concurrent processing
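The strategy pattern with intelligent fallback boils down to an ordered cascade: try each resolver in turn, and the first one that answers wins. A simplified illustration of the design, with hypothetical strategy functions (not the project's actual classes):

```python
from typing import Callable, List, Optional

# Each strategy takes a question and returns a Cypher string, or None on a miss.
Strategy = Callable[[str], Optional[str]]

def recent_cache(q: str) -> Optional[str]:
    return {"show me top 10 users": "MATCH (u:User) RETURN u LIMIT 10"}.get(q)

def template_match(q: str) -> Optional[str]:
    return "MATCH (p:Project) RETURN p" if "projects" in q else None

def llm_generate(q: str) -> Optional[str]:
    return f"// would call the LLM for: {q}"  # placeholder for the real LLM call

def resolve(question: str, strategies: List[Strategy]) -> str:
    """Try each strategy in order; the first non-None answer wins."""
    for strategy in strategies:
        answer = strategy(question)
        if answer is not None:
            return answer
    raise RuntimeError("no strategy produced a query")

cascade = [recent_cache, template_match, llm_generate]
result = resolve("list all projects", cascade)
```

Adding a new resolution level means appending one function to the list, which is what keeps the cascade extensible.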
The project includes a comprehensive test suite with 74+ test cases:
pytest

# LLM integration tests
pytest tests/backend/test_llm.py
# Semantic cache tests (all 6 strategies)
pytest tests/backend/test_semantic_cache.py
# GDS algorithm tests
pytest tests/backend/test_gds_manager.py
# UI tests
pytest tests/test_app_streamlit.py

# Run with coverage
pytest --cov=src --cov-report=html

| Module | Tests | Coverage Areas |
|---|---|---|
| LLM Integration | 11 | Schema generation, Cypher generation, validation, intent extraction, query cleaning |
| Semantic Cache | 23 | All 6 search strategies, parameter extraction, async operations, batch search, performance stats |
| GDS Manager | 10 | Graph projections, PageRank, Betweenness, Louvain, Node Similarity, error handling |
| Data Import | 9+ | Node/relationship import, batch operations, constraint creation |
| Streamlit UI | 20+ | Session management, cache controls, UI interactions, error handling |
| Utilities | 5+ | Result formatting, sample question generation |
| Strategy | Avg Response Time | Hit Rate (typical) |
|---|---|---|
| Recent Cache | <10ms | 15-20% |
| Template Match | <50ms | 25-30% |
| Exact Match | <100ms | 10-15% |
| Semantic Signature | <150ms | 15-20% |
| Similarity Search | <200ms | 20-25% |
| Fuzzy Match | <250ms | 5-10% |
| LLM Generation | 1-3s | Last resort |
- Vector Quantization: 8x reduction with scalar quantization, 16x with binary
- LRU Caching: 10,000 entry limit prevents memory bloat
- Payload Indexing: Fast filtering without full vector search
- Disk Storage: Optional for large-scale deployments
For High Throughput:
QDRANT_MODE=cloud # Use Qdrant Cloud for distributed search
EMBEDDER=sentence-transformers/all-MiniLM-L6-v2 # Fast embeddings
SIMILARITY_THRESHOLD=0.70 # Lower threshold, more cache hits

For High Accuracy:
MODEL=gpt-4 # More powerful LLM
EMBEDDER=sentence-transformers/all-mpnet-base-v2 # Higher quality embeddings
SIMILARITY_THRESHOLD=0.85 # Higher threshold, more LLM calls

For Low Latency:
QDRANT_MODE=memory # In-memory vector search
QUERY_TEMPLATE_PATH=data_fake/query_template.json # Enable templates
SIMILARITY_THRESHOLD=0.75 # Balanced threshold

from backend.semantic_cache import SemanticCache
cache = SemanticCache(
collection_name="cypher_cache",
mode="memory",
embedder_model="sentence-transformers/all-MiniLM-L6-v2"
)
# Smart search with 6-level cascade
result = cache.smart_search(
query="show me top 10 users",
similarity_threshold=0.75
)
# Async batch search
results = await cache.async_batch_search(
queries=["query1", "query2", "query3"],
similarity_threshold=0.75
)
# Store query-result pair
cache.store_query(
query="show me top 10 users",
cypher="MATCH (u:User) RETURN u LIMIT 10",
result=[{"name": "John"}],
template_used="get_top_users"
)
# Get performance statistics
stats = cache.get_detailed_stats()

from backend.gds_manager import GDSManager
gds = GDSManager(uri, user, password)
# Create graph projection
gds.create_graph_projection(
graph_name="social_graph",
node_projection=["Person"],
relationship_projection="KNOWS"
)
# Run algorithms
pagerank_results = gds.run_pagerank("social_graph", write_property="score")
communities = gds.run_louvain("social_graph", write_property="community")
centrality = gds.run_betweenness("social_graph", write_property="centrality")
# Get results
top_influencers = gds.get_top_nodes_by_algorithm("pagerank", "score", limit=10)
# Cleanup
gds.drop_graph_projection("social_graph")

from backend.llm import (
ask_neo4j_llm,
extract_query_intent,
validate_cypher_syntax,
clean_cypher_query
)
# Generate Cypher from natural language
response = ask_neo4j_llm(
question="Who are the top 5 influencers?",
schema_info=schema,
sample_questions=samples
)
# Extract intent
intent = extract_query_intent("Show me all projects managed by John")
# Returns: {
# "action": "retrieve",
# "entities": ["projects", "John"],
# "filters": {"manager": "John"},
# "limit": None
# }
# Validate Cypher
is_valid = validate_cypher_syntax("MATCH (n) RETURN n")
# Clean LLM-generated query
clean_query = clean_cypher_query("```cypher\nMATCH (n) RETURN n\n```")

1. Qdrant Connection Error
Error: Failed to connect to Qdrant
Solution: Ensure Qdrant is running (docker run -p 6333:6333 qdrant/qdrant)
2. Neo4j Authentication Failed
Error: Neo4j authentication failed
Solution: Verify NEO4J_USER and NEO4J_PASSWORD in .env file
3. LLM API Key Invalid
Error: API key authentication failed
Solution: Check your GEMINI_API_KEY/OPENAI_API_KEY in .env
4. Slow Query Performance
Issue: Queries taking >5 seconds
Solution:
- Enable query templates for common patterns
- Lower similarity threshold (0.70-0.75)
- Use vector quantization in Qdrant
- Check Qdrant collection size and optimize
5. Cache Not Working
Issue: Every query calls LLM
Solution:
- Verify Qdrant connection
- Check QDRANT_COLLECTION exists
- Ensure embedder model is downloaded
- Review similarity_threshold (may be too high)
Enable detailed logging:
import logging
logging.basicConfig(level=logging.DEBUG)

Check cache statistics:
from backend.semantic_cache import SemanticCache
cache = SemanticCache()
stats = cache.get_detailed_stats()
print(f"Cache hit rate: {stats['cache_hit_rate']}%")
print(f"Total queries: {stats['total_searches']}")

- ✅ Multi-LLM support via LiteLLM
- ✅ Comprehensive test suite (74+ tests)
- ✅ Graph Data Science integration
- ✅ Template-based query system
- ✅ Advanced 6-strategy semantic caching
- ✅ Async operations support
- ✅ NLP entity extraction with spaCy
- ✅ Fuzzy matching with rapidfuzz
- ✅ Vector quantization for memory optimization
We welcome contributions! Please see our contributing guidelines:
- Fork the repository
- Create a feature branch (git checkout -b feature/amazing-feature)
- Write tests for your changes
- Ensure all tests pass (pytest)
- Commit your changes (git commit -m 'Add amazing feature')
- Push to the branch (git push origin feature/amazing-feature)
- Open a Pull Request
# Install development dependencies
pip install -r requirements_streamlit.txt
pip install pytest pytest-mock pytest-asyncio pytest-cov
# Run tests with coverage
pytest --cov=src --cov-report=html
# Format code
black src/ tests/
# Lint code
flake8 src/ tests/

This project is licensed under the Apache License 2.0 - see the LICENSE file for details.
If you use CypherMind in your research or project, please cite:
@software{cyphermind2025,
title={CypherMind: Advanced Natural Language to Cypher Translation},
author={ArchAI Labs},
year={2025},
url={https://github.com/ArchAI-Labs/cypher_mind}
}

- Neo4j for the powerful graph database platform
- Qdrant for the high-performance vector database
- LiteLLM for unified LLM API access
- Streamlit for the interactive web framework
- spaCy for advanced NLP capabilities
- ArchAI automated documentation system for project analysis
- Documentation: GitHub Wiki
- Issues: GitHub Issues
- Discussions: GitHub Discussions
Built with ❤️ by ArchAI Labs
Generated and maintained with the support of ArchAI, an automated documentation system.

