A comprehensive comparative study evaluating the performance of Vector-based RAG versus Graph-based RAG systems for Enterprise Knowledge Retrieval.
This project implements and benchmarks four RAG pipelines, comparing their quality and efficiency on corporate documents (Collective Bargaining Agreements, Company Regulations, and Ethical Codes):
- Simple Standard RAG - Basic chunking and standard retrieval with file routing
- Custom Standard RAG - Intelligent title-based chunking with optimized, routed retrieval
- GraphRAG Custom (Strict Mode) - Hand-crafted graph schema with hybrid Vector/Cypher retrieval
- GraphRAG Open (Automatic Mode) - Fully automated graph extraction and hybrid retrieval
After evaluating 15 questions across all four systems using LLM-as-a-judge and RAGAS metrics:
LLM-as-a-judge wins (15 questions):

| Pipeline | Wins | Win Rate |
|---|---|---|
| Custom Standard RAG | 8 | 53.3% |
| GraphRAG Strict Mode | 5 | 33.3% |
| Simple Standard RAG | 2 | 13.3% |
| GraphRAG Open Mode | 0 | 0.0% |
| Tie | 0 | 0.0% |

Average token usage per query:

| Pipeline | Tokens/query | Relative cost |
|---|---|---|
| Custom Standard RAG | 794.1 | 1.0x |
| Simple Standard RAG | 1494.4 | 1.9x |
| GraphRAG Open Mode | 7208.5 | 9.1x |
| GraphRAG Strict Mode | 7848.7 | 9.9x |

RAGAS metrics:

| Pipeline | Faithfulness | Answer Relevance | Context Relevance |
|---|---|---|---|
| Simple Standard RAG | 0.585 | 0.670 | 0.500 |
| Custom Standard RAG | 0.843 | 0.699 | 0.827 |
| GraphRAG Open Mode | 0.718 | 0.625 | 0.731 |
| GraphRAG Strict Mode | 0.924 | 0.701 | 0.865 |
Metric Definitions (Range: [0-1], Higher is Better):
- Faithfulness: Measures if the answer is derived only from the retrieved context (hallucination check).
- High: Factually accurate to source. Low: Hallucinated content.
- Answer Relevance: Measures how pertinent the answer is to the user's question.
- High: Directly addresses the query. Low: Vague or off-topic.
- Context Relevance: Measures if the retrieved context contains only the necessary information (signal-to-noise ratio).
- High: Precise retrieval. Low: Too much noise or missing info.
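For intuition, these scores roughly correspond to the following ratios (paraphrased from the RAGAS documentation; the library's exact claim-splitting and relevance prompts are internal details and may differ between versions):

```math
\text{Faithfulness} = \frac{\left|\{\text{answer claims supported by the retrieved context}\}\right|}{\left|\{\text{answer claims}\}\right|}
\qquad
\text{Context Relevance} = \frac{\left|\{\text{context sentences needed to answer the question}\}\right|}{\left|\{\text{context sentences}\}\right|}
```

Answer Relevance is computed indirectly: RAGAS generates N candidate questions from the answer and averages their embedding similarity to the original question q:

```math
\text{Answer Relevance} \approx \frac{1}{N}\sum_{i=1}^{N}\cos\!\big(E(q),\, E(q_i)\big)
```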
The comparison between Simple and Custom Standard RAG reveals the massive impact of optimization:
- Custom Standard RAG is the most efficient (794 tokens) and achieved the highest win rate (53.3%). It uses optimized retrieval (routed hybrid search) and precise chunking based on extracted titles.
- Simple Standard RAG performed significantly worse (13.3% wins) and used ~2x more tokens (1494) than the Custom version. Its low Context Relevance (0.500) suggests the retrieved contexts are often too broad or irrelevant, which confuses the LLM.
- GraphRAG Strict achieves the absolute highest quality scores (Faithfulness > 0.92) but at a 10x higher token cost.
- Custom Standard RAG remains the best balanced choice for production, offering excellent quality (Faithfulness 0.84) with minimal resource usage.
For enterprise knowledge retrieval:
- Best Quality: GraphRAG Strict Mode (custom schema + hybrid retrieval). Use it when accuracy is key and cost/latency are secondary.
- Best Efficiency: Custom Standard RAG (vector-based, excellent quality-to-cost ratio). Production default, best balance of cost, speed, and accuracy.
- Best for Prototyping: GraphRAG Open Mode (quick setup, acceptable quality). Use it for rapid prototyping or time-constrained proof-of-concepts.
```
compare_rag/
├── std_rag/                     # Standard Vector RAG implementation
│   ├── rag.py                   # Main RAG pipeline
│   ├── retrieve.py              # Intelligent retrieval with file routing
│   ├── paragraph_injection.py   # Vector DB injection with metadata
│   └── README.md                # Detailed implementation docs
│
├── graph_rag/                   # Graph-based RAG implementation
│   ├── ingest.py                # Entry point for graph ingestion
│   ├── main.py                  # Entry point for querying
│   ├── src/
│   │   ├── ingestion/           # PDF loading, chunking, extraction
│   │   ├── retrieval/           # Similarity + Cypher retrievers
│   │   ├── graph/               # Neo4j client wrapper
│   │   └── config/              # Settings and credentials
│   ├── prompts/                 # Extraction and retrieval prompts
│   └── README.md                # Detailed implementation docs
│
└── test_rag/                    # Evaluation framework
    ├── compare.py               # Query all RAG systems
    ├── judge.py                 # LLM-as-a-judge evaluation
    ├── prepare_ragas_data.py    # Convert to RAGAS format
    ├── run_ragas_eval.py        # RAGAS metrics evaluation
    ├── view_results.py          # Results visualization
    ├── questions.json           # Test questions
    ├── QA.json                  # Responses from all systems
    ├── evaluation_results.json  # Judge scores + reasoning
    └── ragas_results.json       # RAGAS metrics
```

A baseline vector-based system representing a "vanilla" RAG implementation.
Key Features:
- Naive Chunking: Fixed-size token chunking with overlap
- Simple Retrieval: Standard cosine similarity search with file routing
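To make "naive chunking" concrete, here is a minimal sketch of fixed-size token chunking with overlap. The whitespace tokenizer and the size/overlap values are placeholders for this example, not the exact parameters used in the pipeline.

```python
def chunk_tokens(text: str, chunk_size: int = 512, overlap: int = 64) -> list[str]:
    """Split text into fixed-size token windows with overlap (whitespace tokens as a stand-in)."""
    tokens = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, max(len(tokens), 1), step):
        window = tokens[start:start + chunk_size]
        if window:
            chunks.append(" ".join(window))
        if start + chunk_size >= len(tokens):
            break
    return chunks
```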
Performance:
- 1494 tokens/query
- 2 wins in quality evaluation
- Faithfulness: 0.585 (lowest of the four pipelines)
A sophisticated vector-based retrieval system using Milvus DB and LangChain.
Key Features:
- Intelligent File Routing: LLM-powered document detection to route queries to relevant files
- Paragraph-Level Understanding: Extracts and matches paragraph titles for precise retrieval
- Multi-Level Search:
- Standard semantic search across all documents
- Complete search with file routing + title matching
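A hedged sketch of how the routed search can be wired up: an LLM call picks the relevant source file, then the vector search is filtered to chunks from that file via a metadata expression. Package names (`langchain_milvus`, `langchain_huggingface`), the local Milvus path, and the `source_file` metadata field are assumptions for this example, not the exact names used in `retrieve.py`.

```python
from langchain_huggingface import HuggingFaceEmbeddings
from langchain_milvus import Milvus

embeddings = HuggingFaceEmbeddings(
    model_name="sentence-transformers/paraphrase-multilingual-mpnet-base-v2"
)
store = Milvus(
    embedding_function=embeddings,
    connection_args={"uri": "./milvus_demo.db"},  # Milvus Lite local file (illustrative path)
)

def route_query(llm, question: str, files: list[str]) -> str:
    """Ask the LLM (e.g. an AzureChatOpenAI instance) which document the question concerns."""
    prompt = (
        f"Question: {question}\nAvailable documents: {', '.join(files)}\n"
        "Reply with exactly one file name."
    )
    return llm.invoke(prompt).content.strip()

def routed_search(llm, question: str, files: list[str], k: int = 4):
    """Route first, then restrict the similarity search to chunks of the routed file."""
    target = route_query(llm, question, files)
    return store.similarity_search(question, k=k, expr=f'source_file == "{target}"')
```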
Technology Stack:
- Vector DB: Milvus Lite (local)
- Embeddings: paraphrase-multilingual-mpnet-base-v2
- LLM: Azure OpenAI
- Chunking: Token-based with overlap
Performance:
- 794 tokens/query (most efficient)
- 8 wins in quality evaluation
- Faithfulness: 0.843
See std_rag/README.md for detailed implementation.
A custom-designed knowledge graph with hand-crafted schema and hybrid retrieval strategy.
Architecture:
- Predefined Schema:
  - Nodes: Articolo (Article), Diritto (Right), Dovere (Duty), Argomento (Topic)
  - Relationships: MENZIONA_ARTICOLO, DEFINISCE_DIRITTO, DEFINISCE_DOVERE, HA_ARGOMENTO
- Hybrid Retrieval:
- Vector Similarity: Find relevant chunks via embeddings
- Graph Traversal: Expand context by following relationships (sequential chunks, related topics)
- Text-to-Cypher: Convert questions to Cypher queries for structural questions (based on few-shot examples)
- Parallel Processing: Similarity and Cypher retrievers run concurrently
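The parallel-processing step can be sketched as below; `similarity_retriever` and `cypher_retriever` are placeholders standing in for the project's actual components in `src/retrieval/`, each assumed to return a list of context strings.

```python
from concurrent.futures import ThreadPoolExecutor

def hybrid_retrieve(question: str, similarity_retriever, cypher_retriever) -> list[str]:
    """Run both retrievers concurrently and merge their contexts."""
    with ThreadPoolExecutor(max_workers=2) as pool:
        sim_future = pool.submit(similarity_retriever, question)   # embeddings + graph expansion
        cypher_future = pool.submit(cypher_retriever, question)    # LLM-generated Cypher query
        contexts = sim_future.result() + cypher_future.result()
    # De-duplicate while preserving order before building the final prompt.
    return list(dict.fromkeys(contexts))
```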
Technology Stack:
- Graph DB: Neo4j (local or Aura)
- Embeddings: paraphrase-multilingual-mpnet-base-v2
- LLM: Azure OpenAI
- Entity Extraction: LLMGraphTransformer with custom prompts
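The strict ingestion can be expressed roughly as below, constraining LangChain's `LLMGraphTransformer` to the predefined schema before writing into Neo4j. This is a sketch of the approach, not the project's actual code (its prompts and parameters live in `graph_rag/prompts/` and `src/ingestion/`); depending on your LangChain version, `Neo4jGraph` may instead come from `langchain_neo4j`.

```python
from langchain_experimental.graph_transformers import LLMGraphTransformer
from langchain_community.graphs import Neo4jGraph

ALLOWED_NODES = ["Articolo", "Diritto", "Dovere", "Argomento"]
ALLOWED_RELS = ["MENZIONA_ARTICOLO", "DEFINISCE_DIRITTO", "DEFINISCE_DOVERE", "HA_ARGOMENTO"]

def ingest_chunks(llm, chunks, neo4j_url, username, password):
    """Extract schema-constrained entities/relations from chunks and store them in Neo4j."""
    transformer = LLMGraphTransformer(
        llm=llm,
        allowed_nodes=ALLOWED_NODES,
        allowed_relationships=ALLOWED_RELS,
    )
    graph_docs = transformer.convert_to_graph_documents(chunks)  # chunks: list of Document objects
    graph = Neo4jGraph(url=neo4j_url, username=username, password=password)
    graph.add_graph_documents(graph_docs, include_source=True)   # keep chunk provenance
```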
Performance:
- 7848 tokens/query (~10x more than Custom Standard RAG)
- 5 wins in quality evaluation
- Highest RAGAS scores: Faithfulness 0.924, Context Relevance 0.865
See graph_rag/README.md for detailed implementation.
A fully automatic graph construction approach that requires no schema design.
How It Works:
- Automatic Entity Extraction: LLM extracts any entities from the text
- Generic Relationships: All entities connected via a generic HAS_ENTITY relationship
- No Schema Constraints: Adapts to any document type
- Same Retrieval Strategy: Uses vector similarity + graph traversal and Text-to-Cypher (without few-shot examples)
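In contrast with the strict ingestion sketched earlier, open mode simply drops the schema constraints. Reusing the hypothetical `llm`, `chunks`, and `graph` objects from that sketch, the core difference is roughly:

```python
# No allowed_nodes / allowed_relationships: the LLM may extract any entity type it finds.
open_transformer = LLMGraphTransformer(llm=llm)
graph_docs = open_transformer.convert_to_graph_documents(chunks)
graph.add_graph_documents(graph_docs, include_source=True)
```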
Advantages:
- ✅ Implementation speed: Can be set up in hours
- ✅ No domain expertise required: No need to design a schema
- ✅ Domain agnostic: Works on any document type
Disadvantages:
- ❌ Lower quality: 0 wins in the judge evaluation; RAGAS scores well below GraphRAG Strict
- ❌ High token cost: 7208 tokens/query with no corresponding quality gain
- ❌ Generic structure: Misses domain-specific relationships
Performance:
- 7208 tokens/query
- 0 wins in quality evaluation
- Faithfulness: 0.718
A comprehensive testing pipeline combining LLM-as-a-judge and RAGAS metrics.
```bash
# Step 1: Query all RAG systems
python test_rag/compare.py              # Generates: QA.json (questions + answers from all systems)

# Step 2: LLM-as-a-judge evaluation
python test_rag/judge.py                # Generates: evaluation_results.json (winner + reasoning for each question)

# Step 3: Prepare RAGAS dataset
python test_rag/prepare_ragas_data.py   # Generates: ragas_dataset.hf (HuggingFace dataset format)

# Step 4: Run RAGAS evaluation
python test_rag/run_ragas_eval.py       # Generates: ragas_results.json, ragas_results.csv

# Step 5: View comprehensive results
python test_rag/view_results.py
```

1. LLM-as-a-Judge (judge.py)
- Uses Azure OpenAI to compare answers side-by-side
- Evaluates: accuracy, completeness, relevance, clarity
- Outputs: winner (A/B/C/Tie) + reasoning
2. RAGAS Framework (run_ragas_eval.py)
- Faithfulness: Are answers grounded in retrieved context?
- Answer Relevance: Does the answer address the question?
- Context Relevance: Is retrieved context relevant to the question?
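To make the judge step concrete, here is a hedged sketch of a pairwise LLM-as-a-judge call with Azure OpenAI. The prompt wording, function names, and environment-variable usage are illustrative; the real ones are defined in `judge.py` and the `.env` files.

```python
import os
from openai import AzureOpenAI

client = AzureOpenAI(
    azure_endpoint=os.environ["AZURE_ENDPOINT"],
    api_key=os.environ["AZURE_API_KEY"],
    api_version=os.environ["AZURE_API_VERSION"],
)

def judge(question: str, answers: dict[str, str]) -> str:
    """Ask the judge model to pick the best answer (by label, or 'Tie') with reasoning."""
    labeled = "\n\n".join(f"Answer {label}:\n{text}" for label, text in answers.items())
    prompt = (
        f"Question: {question}\n\n{labeled}\n\n"
        "Compare the answers on accuracy, completeness, relevance and clarity. "
        "Reply with the winning label (or 'Tie') followed by a short justification."
    )
    response = client.chat.completions.create(
        model=os.environ["AZURE_DEPLOYMENT"],
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return response.choices[0].message.content
```

The RAGAS step, conceptually, boils down to an `evaluate()` call over a HuggingFace dataset with `question`, `answer`, and `contexts` columns. Metric names vary slightly between RAGAS versions (older releases expose `context_relevancy`, newer ones favour `context_precision`), so adjust the imports to the installed version; the sample rows below are invented for illustration.

```python
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy

dataset = Dataset.from_dict({
    "question": ["What does Article 5 regulate?"],
    "answer": ["Article 5 regulates working hours ..."],
    "contexts": [["Article 5: working hours are ..."]],
})
# evaluate() needs an evaluation LLM/embeddings; by default these are picked up from the
# environment, or they can be passed explicitly via the llm= / embeddings= arguments.
result = evaluate(dataset, metrics=[faithfulness, answer_relevancy])
print(result)
```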
- Python 3.10+
- Neo4j Database (for GraphRAG)
- Azure OpenAI API Key
Create .env files in std_rag/ and graph_rag/ with your credentials:
```bash
# Azure OpenAI Configuration
AZURE_ENDPOINT=
AZURE_DEPLOYMENT=
AZURE_API_VERSION=
AZURE_API_KEY=

# Neo4j Configuration (for GraphRAG only)
NEO4J_URI=
NEO4J_USERNAME=
NEO4J_PASSWORD=
```
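A minimal way to load these values at runtime (assuming `python-dotenv` is installed; the project's own config modules may read them differently):

```python
import os
from dotenv import load_dotenv

load_dotenv()  # reads the .env file from the current working directory

azure_endpoint = os.getenv("AZURE_ENDPOINT")
azure_api_key = os.getenv("AZURE_API_KEY")
neo4j_uri = os.getenv("NEO4J_URI")  # only required for the GraphRAG pipelines
```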
Key takeaways:
- There is no one-size-fits-all solution: The "best" RAG system depends on your constraints (quality vs. cost vs. development time).
- Custom implementations win on quality: GraphRAG Strict's hand-crafted schema delivers the highest metrics, but it requires domain expertise and roughly 10x more tokens.
- Vector RAG is surprisingly competitive: Custom Standard RAG achieves nearly equivalent quality at roughly one-tenth of the token cost, making it the best value proposition.
- Automatic approaches sacrifice quality: GraphRAG Open is fast to build but does not deliver production-ready quality.
- Hybrid strategies matter: GraphRAG Strict's combination of vector search, graph traversal, and Cypher queries provides the most comprehensive retrieval.
- Evaluation is critical: In ground-truth-free scenarios, using both LLM-as-a-judge and RAGAS provides complementary insights into system performance.


