ArchAI - CypherMind

Overview

ArchAI - CypherMind translates natural-language questions into Cypher queries so that users can query Neo4j graph databases without writing Cypher themselves. It combines LLM-based query generation with multi-layer caching, template-based query generation, and Graph Data Science (GDS) support.

Key Features

Multi-Strategy Query Resolution: Six cache and template strategies tried in order, with LLM generation as the final fallback (recent cache → templates → exact match → semantic signature → similarity search → fuzzy matching → LLM)
Graph Data Science Integration: Execute advanced graph algorithms (PageRank, Betweenness Centrality, Louvain Community Detection, Node Similarity) directly from natural language
Template-Based Query System: Low-latency responses (no LLM call) for common query patterns, with NLP-powered parameter extraction
Multi-LLM Support: Provider-agnostic architecture via LiteLLM (Gemini, GPT-4, Claude, and more)
Advanced Semantic Caching: Qdrant-powered vector similarity search with quantization, indexing, and multiple caching layers
Async Operations: Full async support for concurrent query processing and batch operations
Production-Ready: Comprehensive test suite with 74+ test cases, extensive error handling, and performance monitoring

Performance Highlights

Sub-100ms response time for recent-cache, template, and exact-match hits; under 250ms for the other cache strategies
8x-16x memory reduction through vector quantization
Fallback strategies that handle query variations without invoking the LLM
Multi-layer caching (recent results, templates, frequent queries, vector database)

Technology Stack

Core Technologies

Language: Python 3.8+
Web Framework: Streamlit
Graph Database: Neo4j 5.x
Vector Database: Qdrant
LLM Integration: LiteLLM (multi-provider support)

Key Libraries

neo4j: Official Neo4j Python driver (formerly published as neo4j-driver, a package name that is now deprecated)
litellm (1.70.4): Multi-provider LLM abstraction layer
qdrant-client (1.13.3): Vector database client with quantization support
fastembed: Fast local text embeddings
spacy (3.7+): NLP for entity extraction and query analysis
rapidfuzz (3.0+): Fuzzy string matching for query variations
pandas: Data manipulation and tabular display
python-dotenv: Environment configuration

Testing & Development

pytest (7.0+): Testing framework
pytest-mock (3.10+): Mocking support
pytest-asyncio (0.21+): Async test support

Directory Structure

├── src/
│   ├── app_streamlit.py              - Main Streamlit application with enhanced UI
│   ├── main.py                       - Data import script
│   ├── backend/
│   │   ├── llm.py                    - Multi-LLM integration (Gemini/GPT/Claude)
│   │   ├── semantic_cache.py         - Advanced 6-strategy semantic caching
│   │   ├── gds_manager.py            - Graph Data Science algorithm execution
│   │   ├── import_data.py            - Graph data import from CSV
│   │   └── utils/
│   │       └── streamlit_app_utils.py - Utility functions for UI
├── tests/
│   ├── backend/
│   │   ├── test_llm.py               - 11 tests for LLM integration
│   │   ├── test_semantic_cache.py    - 23 tests for caching strategies
│   │   ├── test_gds_manager.py       - 10 tests for GDS algorithms
│   │   ├── test_import_data.py       - 9+ tests for data import
│   │   └── utils/
│   │       └── test_streamlit_app_utils.py
│   ├── test_app_streamlit.py         - 20+ tests for UI
│   └── test_main.py                  - 10+ tests for main script
├── data/                             - Data files for import
├── data_fake/
│   └── query_template.json           - Query template library
├── img/
│   ├── logo_cyphermind.png
│   └── component_diagram.png
├── .env.example                      - Environment variables template
├── docker-compose.yml                - Docker orchestration
├── Dockerfile                        - Application container
├── pytest.ini                        - Test configuration
└── requirements_streamlit.txt        - Production dependencies

Getting Started

Prerequisites

Python 3.8 or higher
Neo4j instance (local or cloud) - version 5.x recommended
Qdrant instance (local, cloud, or in-memory)
LLM API key (Gemini, OpenAI, Anthropic, etc.)
Docker & Docker Compose (optional, for containerized deployment)

Installation

Clone the repository

git clone https://github.com/ArchAI-Labs/cypher_mind.git
cd cypher_mind

Create a virtual environment

python -m venv .venv
source .venv/bin/activate  # On Windows: .venv\Scripts\activate

Install dependencies

pip install -r requirements_streamlit.txt

Download spaCy language model (for NLP entity extraction)
```
python -m spacy download en_core_web_sm
```

Configuration

Create a .env file in the project root with the following variables:

# Neo4j Configuration
NEO4J_URI=neo4j://localhost:7687
NEO4J_USER=neo4j
NEO4J_PASSWORD=your_password

# LLM Configuration (choose one)
MODEL=gemini/gemini-pro                    # For Google Gemini
# MODEL=gpt-4                              # For OpenAI GPT-4
# MODEL=claude-3-opus-20240229             # For Anthropic Claude
GEMINI_API_KEY=your_gemini_api_key         # If using Gemini
# OPENAI_API_KEY=your_openai_key           # If using OpenAI
# ANTHROPIC_API_KEY=your_anthropic_key     # If using Anthropic

# Qdrant Configuration
QDRANT_COLLECTION=cypher_cache
QDRANT_MODE=memory                         # Options: memory, docker, cloud
# QDRANT_HOST=your_cloud_host              # For cloud mode
# QDRANT_API_KEY=your_qdrant_key           # For cloud mode
# QDRANT_URL=http://localhost:6333         # For docker mode

# Embedding Configuration
EMBEDDER=sentence-transformers/all-MiniLM-L6-v2
VECTOR_SIZE=384

# Data Import Configuration
NODE_URL=data/nodes.json                   # Path to node definitions
REL_URL=data/relationships.json            # Path to relationship definitions
SAMPLE_QUESTIONS=data/sample_questions.json
RESET=false                                # Set to "true" to reset DB on startup

# Context Configuration (for LLM schema awareness)
NODE_CONTEXT_URL=data/node_context.json    # Node types and properties
REL_CONTEXT_URL=data/rel_context.json      # Relationship types and properties

# Template Configuration (optional)
QUERY_TEMPLATE_PATH=data_fake/query_template.json  # For template-based queries
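
Several of these variables depend on one another (QDRANT_URL only matters in docker mode, for example), so a startup check can catch a misconfigured .env early. The following is a minimal sketch and not part of the project: `validate_config`, the required-variable lists, and the default mode of `memory` are assumptions derived from the inline comments above.

```python
from typing import Dict, List, Mapping

# Variables assumed to be needed in every mode (taken from the example above).
ALWAYS_REQUIRED = [
    "NEO4J_URI", "NEO4J_USER", "NEO4J_PASSWORD", "MODEL", "QDRANT_COLLECTION",
]

# Extra variables per QDRANT_MODE, following the inline comments above.
MODE_REQUIRED: Dict[str, List[str]] = {
    "memory": [],
    "docker": ["QDRANT_URL"],
    "cloud": ["QDRANT_HOST", "QDRANT_API_KEY"],
}


def validate_config(env: Mapping[str, str]) -> List[str]:
    """Return a list of problems; an empty list means the config looks usable."""
    problems = [f"missing {name}" for name in ALWAYS_REQUIRED if not env.get(name)]
    mode = env.get("QDRANT_MODE", "memory")  # default is an assumption
    if mode not in MODE_REQUIRED:
        problems.append(f"unknown QDRANT_MODE: {mode}")
    else:
        problems += [
            f"missing {name} (required for {mode} mode)"
            for name in MODE_REQUIRED[mode]
            if not env.get(name)
        ]
    return problems
```

A typical call would be `validate_config(os.environ)` once python-dotenv's `load_dotenv()` has populated the environment.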

Running the Application

Option 1: Using Docker (Recommended)

Build the container
```
docker compose build --no-cache
```
Start all services (Neo4j, Qdrant, CypherMind)
```
docker compose up -d
```
Access the application
- Streamlit UI: http://localhost:8501
- Neo4j Browser: http://localhost:7474
- Qdrant Dashboard: http://localhost:6333/dashboard

Option 2: Local Development

Start Neo4j and Qdrant (if not using cloud services)

# Neo4j
neo4j start

# Qdrant (using Docker)
docker run -p 6333:6333 qdrant/qdrant

Import initial data (optional)
```
python src/main.py
```
Run the Streamlit application
```
streamlit run src/app_streamlit.py
```
Access the application at http://localhost:8501

Usage Guide

Query Template System

Create query templates in data_fake/query_template.json for common query patterns:

{
  "templates": [
    {
      "intent": "get_top_users",
      "template": "get top {count} users from project {project}",
      "parameters": ["count", "project"],
      "cypher_template": "MATCH (p:Person)-[:WORKS_ON]->(proj:Project {name: '{project}'}) RETURN p.name, p.email LIMIT {count}",
      "priority": 1,
      "parameter_patterns": {
        "count": "\\b(?:top|first)\\s+(\\d+)\\b",
        "project": "project\\s+([A-Za-z0-9_\\s]+)"
      },
      "aliases": [
        "show me top {count} users in project {project}",
        "list {count} users working on {project}"
      ]
    }
  ]
}

Benefits:

Low-latency query generation (no LLM call)
Consistent query structure
Parameter validation
Support for query variations via aliases
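
To make the template mechanics concrete, the sketch below applies the `parameter_patterns` regexes from the example template to a question and fills in the `cypher_template`. It is an illustration only: `extract_parameters` and `render_cypher` are hypothetical helper names, and the project's own matching logic may differ.

```python
import re
from typing import Dict, Optional

# Values copied from the example template above.
PARAMETER_PATTERNS = {
    "count": r"\b(?:top|first)\s+(\d+)\b",
    "project": r"project\s+([A-Za-z0-9_\s]+)",
}
CYPHER_TEMPLATE = (
    "MATCH (p:Person)-[:WORKS_ON]->(proj:Project {name: '{project}'}) "
    "RETURN p.name, p.email LIMIT {count}"
)


def extract_parameters(question: str, patterns: Dict[str, str]) -> Optional[Dict[str, str]]:
    """Apply each regex; return None if any parameter cannot be found."""
    values = {}
    for name, pattern in patterns.items():
        match = re.search(pattern, question, flags=re.IGNORECASE)
        if match is None:
            return None
        values[name] = match.group(1).strip()
    return values


def render_cypher(template: str, values: Dict[str, str]) -> str:
    """Substitute {name} placeholders with plain string replacement.

    str.format is avoided because the Cypher map literal {name: ...}
    also uses braces and would be misread as a format field.
    """
    for name, value in values.items():
        template = template.replace("{" + name + "}", value)
    return template


params = extract_parameters("get top 5 users from project Apollo", PARAMETER_PATTERNS)
if params is not None:
    print(render_cypher(CYPHER_TEMPLATE, params))
```

Because the extracted values are spliced into the query text, it is worth validating them (the patterns above already restrict the character set) or passing them as Neo4j query parameters where possible.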

Graph Data Science (GDS) Integration

The GDS Manager allows you to execute graph algorithms directly from natural language or programmatically:

Available Algorithms

PageRank - Identifies influential nodes
Betweenness Centrality - Finds bridge nodes
Closeness Centrality - Measures node accessibility
Louvain Community Detection - Discovers communities
Node Similarity - Finds similar nodes

Usage Example

from backend.gds_manager import GDSManager
import os

# Initialize
uri = os.getenv("NEO4J_URI")
user = os.getenv("NEO4J_USER")
password = os.getenv("NEO4J_PASSWORD")

gds = GDSManager(uri, user, password)

# Create a graph projection
gds.create_graph_projection(
    graph_name="my_graph",
    node_projection=["Person", "Project"],
    relationship_projection={
        "WORKS_ON": {
            "type": "WORKS_ON",
            "orientation": "NATURAL"
        }
    }
)

# Run PageRank
results = gds.run_pagerank(
    graph_name="my_graph",
    write_property="pagerank"
)

# Get top ranked nodes
gds.get_top_nodes_by_algorithm(
    algorithm="pagerank",
    property_name="pagerank",
    limit=10
)

# Cleanup
gds.drop_graph_projection("my_graph")
gds.close()
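
For readers who want to see what such calls typically correspond to in Cypher, the sketch below builds the standard GDS procedure calls for projecting a graph, running PageRank in write mode, and dropping the projection. GDSManager's internals are not shown in this document, so treating it as a wrapper around these procedures is an assumption, and the helper names are hypothetical.

```python
from typing import Any, Dict, List, Tuple

# A query is a Cypher string plus the parameters to send with it.
Query = Tuple[str, Dict[str, Any]]


def project_graph(graph_name: str, node_labels: List[str],
                  relationships: Dict[str, Any]) -> Query:
    """Native projection via the gds.graph.project procedure."""
    cypher = "CALL gds.graph.project($graph_name, $nodes, $rels)"
    return cypher, {"graph_name": graph_name, "nodes": node_labels, "rels": relationships}


def pagerank_write(graph_name: str, write_property: str) -> Query:
    """PageRank in write mode: scores are stored on the nodes."""
    cypher = (
        "CALL gds.pageRank.write($graph_name, {writeProperty: $prop}) "
        "YIELD nodePropertiesWritten, ranIterations"
    )
    return cypher, {"graph_name": graph_name, "prop": write_property}


def drop_graph(graph_name: str) -> Query:
    """Remove the in-memory projection once it is no longer needed."""
    return "CALL gds.graph.drop($graph_name)", {"graph_name": graph_name}


steps = [
    project_graph("my_graph", ["Person", "Project"],
                  {"WORKS_ON": {"type": "WORKS_ON", "orientation": "NATURAL"}}),
    pagerank_write("my_graph", "pagerank"),
    drop_graph("my_graph"),
]
for cypher, params in steps:
    print(cypher, params)
```

Each pair could be executed with the Neo4j driver's `session.run(cypher, params)`.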

Data Import

Import nodes and relationships from CSV files:

from backend.import_data import GraphImport
import os

uri = os.getenv("NEO4J_URI")
user = os.getenv("NEO4J_USER")
password = os.getenv("NEO4J_PASSWORD")

importer = GraphImport(uri, user, password)

# Define node files
node_files = {
    "data/persons.csv": ("Person", ["name", "email", "age"], "id"),
    "data/projects.csv": ("Project", ["name", "description"], "id")
}

# Define relationship files
relationship_files = {
    "data/works_on.csv": ("WORKS_ON", "Person", "Project", ["person_id", "project_id"], ["role"])
}

# Import all data
importer.import_all(node_files, relationship_files)
importer.close()
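
The node-file tuples above have the shape (label, property columns, id column). As a rough illustration of how one such file could become a single batched write, the sketch below turns CSV text into an `UNWIND ... MERGE` statement plus its parameters. This is not GraphImport's actual implementation; `build_node_import` and the `id` property name on the node are assumptions.

```python
import csv
import io
from typing import Any, Dict, List, Tuple


def build_node_import(csv_text: str, label: str, properties: List[str],
                      id_column: str) -> Tuple[str, Dict[str, Any]]:
    """Build one batched MERGE statement covering every row of a node CSV."""
    rows = []
    for record in csv.DictReader(io.StringIO(csv_text)):
        rows.append({
            "id": record[id_column],
            "props": {name: record[name] for name in properties},
        })
    # Labels cannot be passed as query parameters, so the label is
    # interpolated; it must come from trusted configuration, not user input.
    cypher = (
        f"UNWIND $rows AS row "
        f"MERGE (n:{label} {{id: row.id}}) "
        f"SET n += row.props"
    )
    return cypher, {"rows": rows}


sample = "id,name,email,age\n1,Ada,ada@example.com,36\n"
cypher, params = build_node_import(sample, "Person", ["name", "email", "age"], "id")
print(cypher)
print(params)
```

Note that `csv.DictReader` yields every value as a string, so numeric columns such as `age` would need converting before import if typed properties matter.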

Context Configuration

Before using the Streamlit app, create context JSON files to help the LLM understand your graph schema:

node_context.json:

{
  "Person": ["name", "email", "age"],
  "Project": ["name", "description", "start_date"]
}

rel_context.json:

{
  "WORKS_ON": ["person_id", "project_id", "role"],
  "MANAGES": ["manager_id", "project_id"]
}
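
One plausible way to use these files is to flatten them into a short schema description that is placed in the LLM prompt. The sketch below shows that idea; `describe_schema` and the output layout are hypothetical and not taken from the project.

```python
import json
from typing import Dict, List


def describe_schema(node_context: Dict[str, List[str]],
                    rel_context: Dict[str, List[str]]) -> str:
    """Render node and relationship context as plain text for a prompt."""
    lines = ["Node labels and properties:"]
    for label, props in node_context.items():
        lines.append(f"  (:{label}) {', '.join(props)}")
    lines.append("Relationship types and properties:")
    for rel_type, props in rel_context.items():
        lines.append(f"  [:{rel_type}] {', '.join(props)}")
    return "\n".join(lines)


# Inline JSON stands in for the contents of the two context files.
node_context = json.loads('{"Person": ["name", "email", "age"]}')
rel_context = json.loads('{"WORKS_ON": ["person_id", "project_id", "role"]}')
schema_text = describe_schema(node_context, rel_context)
print(schema_text)
```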

Architecture

Query Resolution Flow

User Question
    ↓
[1] Recent Results Cache (Last 3 queries, in-memory)
    ↓ (miss)
[2] Template Matching (Regex + NLP parameter extraction)
    ↓ (miss)
[3] Exact Vector Match (Similarity > 0.95)
    ↓ (miss)
[4] Semantic Signature (NLP entity matching)
    ↓ (miss)
[5] Semantic Similarity (Fuzzy vector search)
    ↓ (miss)
[6] Fuzzy String Matching (Levenshtein distance)
    ↓ (miss)
[7] LLM Generation (Gemini/GPT/Claude)
    ↓
Store in Cache → Return Result
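
The flow above is an ordered fallback chain: each level either answers or passes the question on. A minimal sketch of that control flow follows, with every strategy reduced to a function that returns Cypher or `None`. The names `resolve`, `Strategy`, and the toy strategies are illustrative assumptions, not the project's API.

```python
from typing import Callable, Dict, List, Optional, Tuple

# A strategy takes the question and returns Cypher, or None on a miss.
Strategy = Callable[[str], Optional[str]]


def resolve(question: str,
            strategies: List[Tuple[str, Strategy]],
            generate_with_llm: Callable[[str], str],
            store: Callable[[str, str], None]) -> Tuple[str, str]:
    """Try each strategy in order; fall back to the LLM and cache its answer."""
    for name, strategy in strategies:
        cypher = strategy(question)
        if cypher is not None:
            return name, cypher
    cypher = generate_with_llm(question)
    store(question, cypher)  # the cache is populated only after a full miss
    return "llm", cypher


# Toy wiring: a dict acts as the cache, and the "LLM" is a stub.
cache: Dict[str, str] = {}
strategies = [("recent_cache", cache.get)]
fake_llm = lambda q: "MATCH (n) RETURN n LIMIT 5"

first = resolve("show five nodes", strategies, fake_llm, cache.__setitem__)
second = resolve("show five nodes", strategies, fake_llm, cache.__setitem__)
print(first[0], second[0])
```

The second call is answered by the cache because the first call stored the LLM's result.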

Component Interaction

Streamlit UI receives user input
Semantic Cache attempts multi-strategy resolution
LLM Module generates Cypher if cache misses
Neo4j Driver executes queries
GDS Manager handles graph algorithm requests
Results formatted and displayed
Cache Updated with new query-result pairs

Architectural Patterns

Layered Architecture: Clear separation of UI, business logic, and data access
Strategy Pattern: Multiple query resolution strategies with intelligent fallback
Cache-Aside Pattern: Multi-layer caching in which entries are written once a miss has been resolved
Repository Pattern: Abstracted database access via managers
Template Method: Extensible algorithm execution framework
Async/Await: Non-blocking operations for concurrent processing
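
As an illustration of the Async/Await entry, the sketch below resolves several questions concurrently with `asyncio.gather`. The coroutine `resolve_one` is a stand-in that only simulates latency; it is an assumption for demonstration and does not call any project code.

```python
import asyncio
from typing import List


async def resolve_one(question: str) -> str:
    """Stand-in for a cache lookup or LLM call that waits on network I/O."""
    await asyncio.sleep(0.01)  # simulated latency
    return f"resolved: {question}"


async def resolve_many(questions: List[str]) -> List[str]:
    # gather keeps results in input order while the waits overlap in time.
    return list(await asyncio.gather(*(resolve_one(q) for q in questions)))


results = asyncio.run(resolve_many(["query1", "query2", "query3"]))
print(results)
```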

Testing

The project includes a comprehensive test suite with 74+ test cases:

Run All Tests

pytest

Run Specific Test Modules

# LLM integration tests
pytest tests/backend/test_llm.py

# Semantic cache tests (all 6 strategies)
pytest tests/backend/test_semantic_cache.py

# GDS algorithm tests
pytest tests/backend/test_gds_manager.py

# UI tests
pytest tests/test_app_streamlit.py

Run with Coverage

pytest --cov=src --cov-report=html

Test Coverage Summary

| Module | Tests | Coverage Areas |
|---|---|---|
| LLM Integration | 11 | Schema generation, Cypher generation, validation, intent extraction, query cleaning |
| Semantic Cache | 23 | All 6 search strategies, parameter extraction, async operations, batch search, performance stats |
| GDS Manager | 10 | Graph projections, PageRank, Betweenness, Louvain, Node Similarity, error handling |
| Data Import | 9+ | Node/relationship import, batch operations, constraint creation |
| Streamlit UI | 20+ | Session management, cache controls, UI interactions, error handling |
| Utilities | 5+ | Result formatting, sample question generation |

Performance Optimization

Caching Strategy Performance

| Strategy | Avg Response Time | Hit Rate (typical) |
|---|---|---|
| Recent Cache | <10ms | 15-20% |
| Template Match | <50ms | 25-30% |
| Exact Match | <100ms | 10-15% |
| Semantic Signature | <150ms | 15-20% |
| Similarity Search | <200ms | 20-25% |
| Fuzzy Match | <250ms | 5-10% |
| LLM Generation | 1-3s | Last resort |
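
As a worked example of what these figures imply, the sketch below estimates the expected latency per query. It rests on stated assumptions: each latency is taken at its upper bound and treated as the total time for a query resolved at that level, each hit rate is taken at its lower bound (the lower bounds sum to 90%), the remaining share goes to the LLM, and the LLM is costed at 2 s, the midpoint of the quoted range.

```python
# (strategy, upper-bound latency in ms, lower-bound hit rate) from the table.
STRATEGIES = [
    ("Recent Cache", 10, 0.15),
    ("Template Match", 50, 0.25),
    ("Exact Match", 100, 0.10),
    ("Semantic Signature", 150, 0.15),
    ("Similarity Search", 200, 0.20),
    ("Fuzzy Match", 250, 0.05),
]
LLM_LATENCY_MS = 2000  # assumed midpoint of the 1-3 s range

cache_share = sum(rate for _, _, rate in STRATEGIES)
llm_share = 1.0 - cache_share
expected_ms = (
    sum(ms * rate for _, ms, rate in STRATEGIES) + LLM_LATENCY_MS * llm_share
)
print(f"cache share: {cache_share:.0%}, expected latency: {expected_ms:.0f} ms")
```

Under these assumptions the cached paths contribute about 99 ms and the LLM share about 200 ms, so the rare LLM calls dominate the average.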

Memory Optimization

Vector Quantization: 8x reduction with scalar quantization, 16x with binary
LRU Caching: 10,000 entry limit prevents memory bloat
Payload Indexing: Fast filtering without full vector search
Disk Storage: Optional for large-scale deployments
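
To put the reduction factors in perspective, the arithmetic below estimates raw vector memory for the 384-dimensional embeddings configured earlier. Using 10,000 vectors as the collection size is an illustrative assumption borrowed from the LRU limit, and the figures exclude payloads and index overhead.

```python
BYTES_PER_FLOAT32 = 4
VECTOR_SIZE = 384   # VECTOR_SIZE from the example .env
ENTRIES = 10_000    # illustrative collection size (assumption)

raw_bytes = VECTOR_SIZE * BYTES_PER_FLOAT32 * ENTRIES
for label, factor in [("unquantized", 1), ("8x reduction", 8), ("16x reduction", 16)]:
    print(f"{label}: {raw_bytes / factor / 1_000_000:.2f} MB")
```

That is 15.36 MB unquantized, 1.92 MB at 8x, and 0.96 MB at 16x, so the savings only become significant at much larger collection sizes.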

Configuration Tuning

For High Throughput:

QDRANT_MODE=cloud  # Use Qdrant Cloud for distributed search
EMBEDDER=sentence-transformers/all-MiniLM-L6-v2  # Fast embeddings
SIMILARITY_THRESHOLD=0.70  # Lower threshold: more cache hits, but a higher risk of reusing a query that does not match the question

For High Accuracy:

MODEL=gpt-4  # More powerful LLM
EMBEDDER=sentence-transformers/all-mpnet-base-v2  # Higher quality embeddings
VECTOR_SIZE=768  # all-mpnet-base-v2 produces 768-dimensional vectors
SIMILARITY_THRESHOLD=0.85  # Higher threshold, more LLM calls

For Low Latency:

QDRANT_MODE=memory  # In-memory vector search
QUERY_TEMPLATE_PATH=data_fake/query_template.json  # Enable templates
SIMILARITY_THRESHOLD=0.75  # Balanced threshold
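
The effect of SIMILARITY_THRESHOLD can be seen with a toy example. The sketch below assumes the cache compares embeddings by cosine similarity and accepts a match when the score reaches the threshold; both the metric and the helper names are assumptions, and the 2-d vectors are stand-ins for real embeddings.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def is_cache_hit(score: float, threshold: float) -> bool:
    return score >= threshold


cached = [1.0, 0.0]      # embedding of a stored question (toy values)
incoming = [0.8, 0.6]    # embedding of the new question (toy values)
score = cosine_similarity(cached, incoming)  # approximately 0.8

hits = {t: is_cache_hit(score, t) for t in (0.70, 0.75, 0.85)}
print(score, hits)
```

With a score of roughly 0.8, the 0.70 and 0.75 settings reuse the cached query while 0.85 falls through to the LLM.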

API Reference

SemanticCache

import asyncio

from backend.semantic_cache import SemanticCache

cache = SemanticCache(
    collection_name="cypher_cache",
    mode="memory",
    embedder_model="sentence-transformers/all-MiniLM-L6-v2"
)

# Smart search with 6-level cascade
result = cache.smart_search(
    query="show me top 10 users",
    similarity_threshold=0.75
)

# Async batch search (await is only valid inside a coroutine,
# so the call is wrapped and driven with asyncio.run)
async def batch_search():
    return await cache.async_batch_search(
        queries=["query1", "query2", "query3"],
        similarity_threshold=0.75
    )

results = asyncio.run(batch_search())

# Store query-result pair
cache.store_query(
    query="show me top 10 users",
    cypher="MATCH (u:User) RETURN u LIMIT 10",
    result=[{"name": "John"}],
    template_used="get_top_users"
)

# Get performance statistics
stats = cache.get_detailed_stats()

GDSManager

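import os

from backend.gds_manager import GDSManager

# Connection settings come from the environment variables defined in .env
uri = os.getenv("NEO4J_URI")
user = os.getenv("NEO4J_USER")
password = os.getenv("NEO4J_PASSWORD")

gds = GDSManager(uri, user, password)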

# Create graph projection
gds.create_graph_projection(
    graph_name="social_graph",
    node_projection=["Person"],
    relationship_projection="KNOWS"
)

# Run algorithms
pagerank_results = gds.run_pagerank("social_graph", write_property="score")
communities = gds.run_louvain("social_graph", write_property="community")
centrality = gds.run_betweenness("social_graph", write_property="centrality")

# Get results
top_influencers = gds.get_top_nodes_by_algorithm("pagerank", "score", limit=10)

# Cleanup
gds.drop_graph_projection("social_graph")

LLM Module

from backend.llm import (
    ask_neo4j_llm,
    extract_query_intent,
    validate_cypher_syntax,
    clean_cypher_query
)

# Generate Cypher from natural language
# (schema and samples are assumed to have been loaded beforehand)
response = ask_neo4j_llm(
    question="Who are the top 5 influencers?",
    schema_info=schema,
    sample_questions=samples
)

# Extract intent
intent = extract_query_intent("Show me all projects managed by John")
# Returns: {
#   "action": "retrieve",
#   "entities": ["projects", "John"],
#   "filters": {"manager": "John"},
#   "limit": None
# }

# Validate Cypher
is_valid = validate_cypher_syntax("MATCH (n) RETURN n")

# Clean LLM-generated query
clean_query = clean_cypher_query("```cypher\nMATCH (n) RETURN n\n```")
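
LLMs often wrap their answer in a Markdown code fence, which is why a cleaning step is needed before execution. The sketch below shows one way to strip such a fence. It is a hypothetical stand-in named `strip_code_fence`, not the project's `clean_cypher_query`, whose actual behavior may cover more cases.

```python
import re

# The fence marker is built from single backticks so that this example
# can itself be placed inside a fenced block.
FENCE = "`" * 3


def strip_code_fence(text: str) -> str:
    """Remove a surrounding Markdown code fence (with optional language tag)."""
    pattern = re.escape(FENCE) + r"[A-Za-z]*\s*\n(.*?)\n?" + re.escape(FENCE)
    match = re.search(pattern, text, flags=re.DOTALL)
    body = match.group(1) if match else text
    return body.strip()


raw = FENCE + "cypher\nMATCH (n) RETURN n\n" + FENCE
cleaned = strip_code_fence(raw)
print(cleaned)
```

Text without a fence is returned unchanged apart from trimming surrounding whitespace.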

Troubleshooting

Common Issues

1. Qdrant Connection Error

Error: Failed to connect to Qdrant
Solution: Ensure Qdrant is running (docker run -p 6333:6333 qdrant/qdrant)

2. Neo4j Authentication Failed

Error: Neo4j authentication failed
Solution: Verify NEO4J_USER and NEO4J_PASSWORD in .env file

3. LLM API Key Invalid

Error: API key authentication failed
Solution: Check that the API key matching your MODEL provider (GEMINI_API_KEY, OPENAI_API_KEY, or ANTHROPIC_API_KEY) is set in .env

4. Slow Query Performance

Issue: Queries taking >5 seconds
Solution:
- Enable query templates for common patterns
- Lower the similarity threshold (0.70-0.75) so that more queries resolve from the cache
- Use vector quantization in Qdrant
- Check Qdrant collection size and optimize

5. Cache Not Working

Issue: Every query calls LLM
Solution:
- Verify Qdrant connection
- Check that the collection named in QDRANT_COLLECTION exists
- Ensure embedder model is downloaded
- Review similarity_threshold (may be too high)

Debug Mode

Enable detailed logging:

import logging

logging.basicConfig(level=logging.DEBUG)

Check cache statistics:

from backend.semantic_cache import SemanticCache

cache = SemanticCache()
stats = cache.get_detailed_stats()
print(f"Cache hit rate: {stats['cache_hit_rate']}%")
print(f"Total queries: {stats['total_searches']}")

Roadmap

Completed Features ✅

✅ Multi-LLM support via LiteLLM
✅ Comprehensive test suite (74+ tests)
✅ Graph Data Science integration
✅ Template-based query system
✅ Advanced 6-strategy semantic caching
✅ Async operations support
✅ NLP entity extraction with spaCy
✅ Fuzzy matching with rapidfuzz
✅ Vector quantization for memory optimization

Contributing

We welcome contributions! Please see our contributing guidelines:

Fork the repository
Create a feature branch (git checkout -b feature/amazing-feature)
Write tests for your changes
Ensure all tests pass (pytest)
Commit your changes (git commit -m 'Add amazing feature')
Push to the branch (git push origin feature/amazing-feature)
Open a Pull Request

Development Setup

# Install development dependencies
pip install -r requirements_streamlit.txt
pip install pytest pytest-mock pytest-asyncio pytest-cov black flake8

# Run tests with coverage
pytest --cov=src --cov-report=html

# Format code
black src/ tests/

# Lint code
flake8 src/ tests/

License

This project is licensed under the Apache License 2.0 - see the LICENSE file for details.

Citation

If you use CypherMind in your research or project, please cite:

@software{cyphermind2025,
  title={CypherMind: Advanced Natural Language to Cypher Translation},
  author={ArchAI Labs},
  year={2025},
  url={https://github.com/ArchAI-Labs/cypher_mind}
}

Acknowledgments

Neo4j for the powerful graph database platform
Qdrant for the high-performance vector database
LiteLLM for unified LLM API access
Streamlit for the interactive web framework
spaCy for advanced NLP capabilities
ArchAI automated documentation system for project analysis

Support

Documentation: GitHub Wiki
Issues: GitHub Issues
Discussions: GitHub Discussions

Built with ❤️ by ArchAI Labs

Generated and maintained with the support of ArchAI, an automated documentation system.
