|
| 1 | +# Introduction to GraphRAG Workshop |
| 2 | + |
| 3 | +> Learn how to combine graph databases with generative AI to improve the quality of LLM-generated content through GraphRAG. |
| 4 | + |
| 5 | +This workshop covers building production-ready GraphRAG applications using Neo4j and OpenAI models. Key focus areas include schema-driven entity extraction, three retriever types (Vector, Vector+Cypher, Text2Cypher), hybrid approaches for different query types, and production considerations for scaling and optimization. The framework enables intelligent retrieval systems that leverage graph structure for explainable, deterministic results. |
| 6 | + |
| 7 | +[Learn more about this course](https://graphacademy.neo4j.com/courses/workshop-graphrag-introduction) |
| 8 | + |
| 9 | +## Concepts |
| 10 | + |
| 11 | +* **GraphRAG** - Retrieval Augmented Generation using graph databases to improve LLM response quality through structured, relationship-aware retrieval |
| 12 | +* **Knowledge Graph** - Graph database storing entities, relationships, and properties with semantic meaning, extracted from unstructured data |
| 13 | +* **Lexical Graph** - Graph structure preserving document hierarchy (Document → Chunk) while storing text chunks for semantic search |
| 14 | +* **Domain Graph** - Graph containing business domain knowledge with structured schemas for deterministic retrieval |
| 15 | +* **Entity Extraction** - LLM-powered process of identifying and structuring entities and relationships from unstructured text |
| 16 | +* **Schema-Driven Extraction** - Using predefined entity types and relationships to guide AI extraction with validation and quality control |
| 17 | +* **Vector Retriever** - Component performing semantic search across text chunks using vector embeddings and cosine similarity |
| 18 | +* **Vector + Cypher Retriever** - Hybrid retriever combining semantic search with graph traversal to provide enriched contextual results |
| 19 | +* **Text2Cypher Retriever** - Natural language to Cypher conversion enabling precise structured queries against the knowledge graph |
| 20 | +* **Index-free Adjacency** - Neo4j storage design in which each node holds direct pointers to its relationships, so traversal follows pointers instead of computing joins |
| 21 | +* **Organizing Principles** - Rules defining how to classify and relate entities consistently within domain-specific schemas |
| 22 | + |
| 23 | +## Overview |
| 24 | + |
| 25 | +### Summary of GraphRAG Implementation and Best Practices |
| 26 | + |
| 27 | +#### **Module 1: Introduction and Fundamentals** |
| 28 | + |
| 29 | +1. **What is GraphRAG** |
| 30 | + - GraphRAG combines graph databases with generative AI to improve LLM response quality |
| 31 | + - Unlike vector-based RAG, GraphRAG provides deterministic, relationship-aware retrieval |
| 32 | + - Addresses vector RAG limitations: opaque similarity, chunk isolation, and hallucination-prone responses |
| 33 | + - Enables both local search (entity neighborhoods) and global search (pattern analysis) |
| 34 | + |
| 35 | +2. **Vector RAG vs GraphRAG Comparison** |
| 36 | + - **Vector RAG strengths**: Contextual questions, synonyms, fuzzy queries, broad exploration |
| 37 | + - **Vector RAG weaknesses**: Fact-based queries, numerical data, logical operations, entity connections |
| 38 | + - **GraphRAG advantages**: Explicit relationships, structured retrieval, rich context, index-free adjacency |
| 39 | + - **Use case decision**: Factual/numerical queries → GraphRAG, Semantic/exploratory → Vector RAG |
| 40 | + |
| 41 | +3. **Knowledge Graph Types for GraphRAG** |
| 42 | + - **Lexical Graphs**: Document hierarchy with text chunks and vector embeddings |
| 43 | + - **Lexical + Entities**: Combines document structure with extracted entity connections |
| 44 | + - **Domain Graphs**: Business domain knowledge with structured schemas and ontologies |
| 45 | + - **Memory Graphs**: Semantic and episodic memory for conversational context |
| 46 | + |
| 47 | +--- |
| 48 | + |
| 49 | +#### **Module 2: Building Knowledge Graphs** |
| 50 | + |
| 51 | +1. **GraphRAG Pipeline Architecture** |
| 52 | + - **Document Processing**: PDF extraction and semantic text chunking |
| 53 | + - **Entity Extraction**: LLM-powered identification of domain entities and relationships |
| 54 | + - **Graph Storage**: Structured entities and relationships stored in Neo4j |
| 55 | + - **Vector Indexing**: Embeddings generated for chunks enabling semantic search |
| 56 | + - **Quality Control**: Schema validation and entity resolution for consistency |
| 57 | + |
| 58 | +2. **Schema-Driven Extraction Best Practices** |
| 59 | + ```python |
| 60 | + # Entity schema definition |
| 61 | + entities = [ |
| 62 | + {"label": "Company", "properties": [{"name": "name", "type": "STRING"}]}, |
| 63 | + {"label": "Executive", "properties": [{"name": "name", "type": "STRING"}]}, |
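| 64 | + {"label": "FinancialMetric", "properties": [{"name": "name", "type": "STRING"}]}, |
|  | + # RiskFactor and StockType are referenced by the relationship schema below |
|  | + {"label": "RiskFactor", "properties": [{"name": "name", "type": "STRING"}]}, |
|  | + {"label": "StockType", "properties": [{"name": "name", "type": "STRING"}]} |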
| 65 | + ] |
| 66 | + |
| 67 | + # Relationship schema definition |
| 68 | + relations = [ |
| 69 | + {"label": "HAS_METRIC", "source": "Company", "target": "FinancialMetric"}, |
| 70 | + {"label": "FACES_RISK", "source": "Company", "target": "RiskFactor"}, |
| 71 | + {"label": "ISSUED_STOCK", "source": "Company", "target": "StockType"} |
| 72 | + ] |
| 73 | + ``` |
| 74 | + |
| 75 | +3. **SimpleKGPipeline Implementation** |
| 76 | + ```python |
| 77 | + from neo4j_graphrag.experimental.pipeline.kg_builder import SimpleKGPipeline |
| 78 | + |
| 79 | + # Complete pipeline configuration |
| 80 | + kg_builder = SimpleKGPipeline( |
| 81 | + driver=driver, |
| 82 | + llm=llm, |
| 83 | + embedder=embedder, |
| 84 | + prompt_template=custom_prompt, # Guided extraction prompts |
| 85 | + entities=entities, # Entity schema |
| 86 | + relations=relations, # Relationship schema |
| 87 | + from_pdf=True, # PDF processing enabled |
| 88 | + ) |
| 89 | + |
| 90 | + # Execute extraction pipeline; with from_pdf=True, pass a PDF path (pdf_path is a placeholder) rather than raw text |
| 91 | + result = await kg_builder.run_async(file_path=pdf_path) |
| 92 | + ``` |
| 93 | + |
| 94 | +4. **Custom Extraction Prompts** |
| 95 | + ```python |
|  | + from neo4j_graphrag.generation.prompts import ERExtractionTemplate |
|  | + |
| 96 | + # Company validation prompt example |
| 97 | + company_instruction = ( |
| 98 | + "You are an expert in extracting company information from SEC filings. " |
| 99 | + "When extracting, the company name must match exactly as shown below. " |
| 100 | + "If the text refers to 'the Company', you MUST look up the exact name. " |
| 101 | + "UNDER NO CIRCUMSTANCES output generic phrases." |
| 102 | + ) |
| 103 | + |
| 104 | + custom_template = company_instruction + ERExtractionTemplate.DEFAULT_TEMPLATE |
| 105 | + prompt_template = ERExtractionTemplate(template=custom_template) |
| 106 | + ``` |
| 107 | + |
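The Document Processing step of the pipeline in this module splits text into chunks before entity extraction. The following is a minimal sketch of fixed-size chunking with overlap in plain Python. The names `chunk_text`, `chunk_size`, and `overlap` are illustrative assumptions and are not part of the Neo4j GraphRAG library; the workshop's own pipeline may split text differently (for example, on semantic boundaries).

```python
def chunk_text(text: str, chunk_size: int = 200, overlap: int = 20) -> list[str]:
    """Split text into character windows of `chunk_size` that overlap by `overlap` characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in [0, chunk_size)")

    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window has reached the end of the text
        if start + chunk_size >= len(text):
            break
    return chunks


if __name__ == "__main__":
    for chunk in chunk_text("abcdefghij", chunk_size=4, overlap=1):
        print(chunk)
```

The overlap keeps a sentence that straddles a boundary visible in both neighboring chunks, at the cost of storing and embedding some text twice.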
| 108 | +--- |
| 109 | + |
| 110 | +#### **Module 3: Querying and Retrieval** |
| 111 | + |
| 112 | +1. **Vector Retriever Implementation** |
| 113 | + ```python |
| 114 | + from neo4j_graphrag.retrievers import VectorRetriever |
| 115 | + |
| 116 | + vector_retriever = VectorRetriever( |
| 117 | + driver=driver, |
| 118 | + index_name='chunkEmbeddings', |
| 119 | + embedder=embedder, |
| 120 | + return_properties=['text'] |
| 121 | + ) |
| 122 | + |
| 123 | + # Best for: Semantic exploration, broad topic search, conceptual questions |
| 124 | + result = vector_retriever.search(query_text="What are Apple's main risks?", top_k=5) |
| 125 | + ``` |
| 126 | + |
| 127 | +2. **Vector + Cypher Retriever Implementation** |
| 128 | + ```python |
| 129 | + from neo4j_graphrag.retrievers import VectorCypherRetriever |
| 130 | + |
| 131 | + # Cypher run after the vector search; the retriever binds each matched chunk to `node` and its similarity to `score` |
| 132 | + retrieval_query = """ |
| 133 | + // Start from the chunks returned by the vector index (no text filter needed) |
| 134 | + WITH node, score |
| 135 | + |
| 136 | + // Traverse to the companies extracted from each chunk |
| 137 | + MATCH (node)<-[:FROM_CHUNK]-(company:Company) |
| 138 | + OPTIONAL MATCH (company)-[r]->(related) |
| 139 | + WHERE NOT related:Chunk AND NOT related:Document |
| 140 | + |
| 141 | + RETURN node.text AS context, |
| 142 | + company.name AS entity, |
| 143 | + collect(DISTINCT related.name) AS related_entities, |
| 144 | + collect(DISTINCT type(r)) AS relationship_types |
| 145 | + """ |
| 146 | + |
| 147 | + vector_cypher_retriever = VectorCypherRetriever( |
| 148 | + driver=driver, |
| 149 | + index_name='chunkEmbeddings', |
| 150 | + embedder=embedder, |
| 151 | + retrieval_query=retrieval_query |
| 152 | + ) |
| 153 | + |
| 154 | + # Best for: Entity-specific context, relationship exploration |
| 155 | + result = vector_cypher_retriever.search(query_text="Apple's financial performance", top_k=5) |
| 156 | + ``` |
| 157 | + |
| 158 | +3. **Text2Cypher Retriever Implementation** |
| 159 | + ```python |
| 160 | + from neo4j_graphrag.retrievers import Text2CypherRetriever |
| 161 | + |
| 162 | + # Schema information for LLM |
| 163 | + schema = """ |
| 164 | + Node types: |
| 165 | + - Company: {name: STRING} |
| 166 | + - FinancialMetric: {name: STRING} |
| 167 | + - RiskFactor: {name: STRING} |
| 168 | + |
| 169 | + Relationship types: |
| 170 | + - (Company)-[:HAS_METRIC]->(FinancialMetric) |
| 171 | + - (Company)-[:FACES_RISK]->(RiskFactor) |
| 172 | + """ |
| 173 | + |
| 174 | + text2cypher_retriever = Text2CypherRetriever( |
| 175 | + driver=driver, |
| 176 | + llm=llm, |
| 177 | + neo4j_schema=schema |
| 178 | + ) |
| 179 | + |
| 180 | + # Best for: Precise queries, numerical data, structured facts |
| 181 | + result = text2cypher_retriever.search(query_text="How many companies face cybersecurity risks?") |
| 182 | + ``` |
| 183 | + |
| 184 | +4. **Advanced Query Patterns** |
| 185 | + ```cypher |
| 186 | + // Count entities by type |
| 187 | + MATCH (e) |
| 188 | + WHERE NOT e:Document AND NOT e:Chunk |
| 189 | + RETURN labels(e) as entityType, count(e) as count |
| 190 | + ORDER BY count DESC |
| 191 | + |
| 192 | + // Explore company relationships |
| 193 | + MATCH (c:Company {name: 'APPLE INC'}) |
| 194 | + RETURN c.name, |
| 195 | + COUNT { (c)-[r1]->(extracted) WHERE NOT extracted:Chunk } AS extractedEntities, |
| 196 | + COUNT { (:AssetManager)-[:OWNS]->(c) } AS assetManagers, |
| 197 | + COUNT { (c)-[:FROM_CHUNK]->(chunk:Chunk) } AS textChunks |
| 198 | + |
| 199 | + // Find relationship patterns |
| 200 | + MATCH (c:Company)-[r]->(entity) |
| 201 | + RETURN c.name, type(r) as relationship, entity.name |
| 202 | + ORDER BY c.name, relationship |
| 203 | + ``` |
| 204 | + |
| 205 | +--- |
| 206 | + |
| 207 | +#### **Key Implementation Insights** |
| 208 | + |
| 209 | +**Technology Stack:** Neo4j GraphRAG Library, OpenAI GPT-4o/embeddings, LangChain integration |
| 210 | + |
| 211 | +**Quality & Performance:** |
| 212 | +- Schema-driven extraction with entity validation and context resolution |
| 213 | +- Specific relationship types and proper MERGE patterns for performance |
| 214 | +- Caching and traversal depth limits for optimization |
| 215 | +- Vector RAG for exploration, GraphRAG for precise/factual queries |
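
One way to apply the traversal depth limit mentioned above is a bounded variable-length pattern. The sketch below builds such a query as a string. The helper name `bounded_neighborhood_query` and the cap of 5 hops are assumptions made for illustration; the `Company`, `Chunk`, and `Document` labels come from the workshop's schema. The hop bound is written into the query text as a validated integer literal, while the company name stays a query parameter.

```python
def bounded_neighborhood_query(max_hops: int = 2, limit: int = 50) -> str:
    """Build a Cypher query that explores at most `max_hops` outgoing hops from a company."""
    # Validate strictly (bool is rejected) because the values are inlined into the query text
    if type(max_hops) is not int or not 1 <= max_hops <= 5:
        raise ValueError("max_hops must be an integer between 1 and 5")
    if type(limit) is not int or limit < 1:
        raise ValueError("limit must be a positive integer")

    return "\n".join([
        "MATCH (c:Company {name: $name})-[*1.." + str(max_hops) + "]->(n)",
        "WHERE NOT n:Chunk AND NOT n:Document",
        "RETURN DISTINCT labels(n) AS labels, n.name AS name",
        "LIMIT " + str(limit),
    ])


if __name__ == "__main__":
    print(bounded_neighborhood_query(max_hops=2))
```

The resulting string would be executed with a parameter map such as `{"name": "APPLE INC"}`; the upper bound and `LIMIT` keep the result size predictable even for highly connected nodes.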
| 216 | + |
| 217 | +## Implementation Patterns |
| 218 | + |
| 219 | +### Entity Extraction with Quality Control |
| 220 | + |
| 221 | +```python |
| 222 | +from typing import List |
|  | + |
|  | +from langchain_core.output_parsers import PydanticOutputParser |
| 223 | +from pydantic import BaseModel, Field |
| 224 | + |
| 225 | +class Entity(BaseModel): |
| 226 | +    name: str = Field(description="The entity name") |
| 227 | +    type: str = Field(description="Entity type: Person, Location, Organization") |
| 228 | +    properties: dict = Field(description="Additional properties") |
| 229 | + |
| 230 | +class Relationship(BaseModel): |
| 231 | +    source: str = Field(description="Source entity name") |
| 232 | +    target: str = Field(description="Target entity name") |
| 233 | +    type: str = Field(description="Relationship type") |
| 234 | +    properties: dict = Field(default_factory=dict) |
| 235 | + |
|  | +class EntityList(BaseModel): |
|  | +    """Wrapper model: PydanticOutputParser expects a single BaseModel class, not List[Entity].""" |
|  | +    entities: List[Entity] |
|  | + |
| 236 | +# Structured output parser for consistent extraction |
| 237 | +parser = PydanticOutputParser(pydantic_object=EntityList) |
| 238 | +``` |
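
Quality control can also include checking extracted relationships against the schema before they are written to the graph. The following is a minimal sketch, assuming the relation schema uses the `label`/`source`/`target` dictionaries shown in Module 2. The function name `filter_valid_relationships` and the `source_type`/`target_type` keys on extracted items are hypothetical and chosen only for this example.

```python
# Relation schema in the same shape as the Module 2 example
ALLOWED_RELATIONS = [
    {"label": "HAS_METRIC", "source": "Company", "target": "FinancialMetric"},
    {"label": "FACES_RISK", "source": "Company", "target": "RiskFactor"},
]


def filter_valid_relationships(extracted, allowed=ALLOWED_RELATIONS):
    """Split extracted relationships into (valid, rejected) according to the schema."""
    allowed_triples = {(r["source"], r["label"], r["target"]) for r in allowed}
    valid, rejected = [], []
    for rel in extracted:
        triple = (rel["source_type"], rel["type"], rel["target_type"])
        if triple in allowed_triples:
            valid.append(rel)
        else:
            rejected.append(rel)
    return valid, rejected


if __name__ == "__main__":
    sample = [
        {"source_type": "Company", "type": "FACES_RISK", "target_type": "RiskFactor"},
        {"source_type": "Executive", "type": "FACES_RISK", "target_type": "RiskFactor"},
    ]
    valid, rejected = filter_valid_relationships(sample)
    print(len(valid), "valid;", len(rejected), "rejected")
```

Keeping the rejected items, rather than silently dropping them, makes it possible to review what the LLM produced and adjust the prompt or schema.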
| 239 | + |
| 240 | +### Graph Modeling Best Practices |
| 241 | + |
| 242 | +```cypher |
| 243 | +// Good - specific, performant relationships |
| 244 | +MATCH (company:Company)-[:HAS_METRIC]->(metric:FinancialMetric) |
| 245 | +WHERE metric.name CONTAINS 'revenue' |
| 246 | + |
| 247 | +// Bad - generic, slow relationships |
| 248 | +MATCH (company:Company)-[r:RELATED_TO]->(entity) |
| 249 | +WHERE r.type = 'has_metric' AND entity.name CONTAINS 'revenue' |
| 250 | + |
| 251 | +// Proper MERGE usage - match on the identifying key only, then set the other properties |
| 252 | +MERGE (c:Company {name: "Apple Inc"}) |
| 253 | +SET c.ticker = "AAPL", c.sector = "Technology" |
| 254 | + |
| 255 | +// Bad - MERGE matches on every listed property, so any differing value creates a duplicate node |
| 256 | +MERGE (c:Company {name: "Apple Inc", ticker: "AAPL", sector: "Technology"}) |
| 257 | +``` |
| 258 | + |
| 259 | +### Text-to-Cypher Safety and Validation |
| 260 | + |
| 261 | +```python |
| 262 | +TEXT_TO_CYPHER_PROMPT = """ |
| 263 | +You are an expert at converting natural language to Cypher queries. |
| 264 | +Use ONLY read-only operations: MATCH, RETURN, WHERE, ORDER BY, LIMIT. |
| 265 | +NEVER use CREATE, DELETE, SET, MERGE, or any write operations. |
| 266 | + |
| 267 | +Schema: |
| 268 | +{schema} |
| 269 | + |
| 270 | +Examples: |
| 271 | +Question: "How many companies are in the database?" |
| 272 | +Cypher: MATCH (c:Company) RETURN count(c) as total_companies |
| 273 | + |
| 274 | +Question: "What financial metrics does Apple have?" |
| 275 | +Cypher: MATCH (c:Company {{name: 'APPLE INC'}})-[:HAS_METRIC]->(m:FinancialMetric) RETURN m.name |
| 276 | + |
| 277 | +Convert this question: {question} |
| 278 | +Only return the Cypher query, nothing else. |
| 279 | +""" |
| 280 | + |
|  | +import re  # normally placed at the top of the module |
|  | + |
| 281 | +def validate_cypher_safety(query: str) -> bool: |
| 282 | +    """Validate that a Cypher query contains only read operations.""" |
| 283 | +    forbidden_keywords = ['CREATE', 'DELETE', 'SET', 'MERGE', 'REMOVE', 'DROP'] |
| 284 | +    # Match whole words only, so names such as OFFSET or created_at are not flagged |
| 285 | +    return not any(re.search(rf'\b{keyword}\b', query, re.IGNORECASE) for keyword in forbidden_keywords) |
| 286 | +``` |
| 287 | + |
| 288 | +### Hybrid Retrieval Strategy |
| 289 | + |
| 290 | +```python |
| 291 | +class HybridGraphRAGRetriever: |
| 292 | +    def __init__(self, vector_retriever, vector_cypher_retriever, text2cypher_retriever): |
| 293 | +        self.vector_retriever = vector_retriever |
| 294 | +        self.vector_cypher_retriever = vector_cypher_retriever |
| 295 | +        self.text2cypher_retriever = text2cypher_retriever |
| 296 | + |
| 297 | +    def route_query(self, question: str) -> str: |
| 298 | +        """Route questions to appropriate retriever based on type""" |
| 299 | +        if any(word in question.lower() for word in ['how many', 'count', 'number of']): |
| 300 | +            return 'text2cypher' |
| 301 | +        elif any(word in question.lower() for word in ['what', 'which', 'who']): |
| 302 | +            return 'vector_cypher' |
| 303 | +        else: |
| 304 | +            return 'vector' |
| 305 | + |
| 306 | +    def retrieve(self, question: str, top_k: int = 5): |
| 307 | +        retriever_type = self.route_query(question) |
| 308 | + |
| 309 | +        if retriever_type == 'text2cypher': |
| 310 | +            return self.text2cypher_retriever.search(question) |
| 311 | +        elif retriever_type == 'vector_cypher': |
| 312 | +            return self.vector_cypher_retriever.search(question, top_k=top_k) |
| 313 | +        else: |
| 314 | +            return self.vector_retriever.search(question, top_k=top_k) |
| 315 | +``` |
| 316 | + |
| 317 | +### Error Handling and Robustness |
| 318 | + |
| 319 | +```python |
| 320 | +def robust_text_to_cypher(question: str, max_retries: int = 3) -> str | None: |
| 321 | +    """Generate Cypher with retry logic and validation; return None if every attempt fails.""" |
| 322 | +    for _ in range(max_retries): |
| 323 | +        try: |
| 324 | +            cypher = llm_generate_cypher(question)  # assumed helper that calls the LLM |
| 325 | +            if not validate_cypher_safety(cypher): |
| 326 | +                raise ValueError("Unsafe Cypher query detected") |
| 327 | +            graph.query(f"EXPLAIN {cypher}")  # Plan without executing; appending LIMIT breaks queries that already end in LIMIT |
| 328 | +            return cypher |
| 329 | +        except Exception as e: |
| 330 | +            # Feed the error back to the LLM on the next attempt |
| 331 | +            question = f"[Previous attempt failed: {e}] {question}" |
| 332 | +    return None |
| 333 | +``` |
| 334 | + |
| 335 | +## Production Considerations |
| 336 | + |
| 337 | +### Scaling Strategies |
| 338 | +- **Small datasets** (< 100k nodes): Single Neo4j instance with vector indexes |
| 339 | +- **Medium datasets** (100k - 1M nodes): Neo4j cluster with read replicas for retrieval |
| 340 | +- **Large datasets** (> 1M nodes): Distributed setup with graph partitioning strategies |
| 341 | + |
| 342 | +### Caching Implementation |
| 343 | +```python |
|  | +import hashlib |
|  | +import json |
|  | +from functools import lru_cache |
|  | +from typing import Dict, List |
|  | + |
| 344 | +# Cache Cypher queries (in-process LRU cache) and entity extraction (Redis) |
| 345 | +@lru_cache(maxsize=1000) |
| 346 | +def cached_graph_query(query: str, params_str: str): |
| 347 | +    return graph.query(query, json.loads(params_str)) |
| 348 | + |
| 349 | +def get_or_cache_entities(text_chunk: str) -> List[Dict]: |
|  | +    # Use a content hash: built-in hash() is salted per process, so keys would not match across workers |
| 350 | +    cache_key = f"entities:{hashlib.sha256(text_chunk.encode('utf-8')).hexdigest()}" |
| 351 | +    cached = redis_client.get(cache_key) |
| 352 | +    if cached: |
|  | +        return json.loads(cached) |
| 353 | + |
| 354 | +    entities = extract_entities(text_chunk) |
| 355 | +    redis_client.setex(cache_key, 3600, json.dumps(entities))  # expire after one hour |
| 356 | +    return entities |
| 357 | +``` |
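
`cached_graph_query` takes its parameters as a JSON string because `lru_cache` requires hashable arguments. The sketch below shows one way the calling side might serialize parameters so that equal parameter maps always hit the same cache entry. `run_query` is a stand-in for `graph.query`, and `query_with_cache` and `CALLS` are hypothetical names used only for this illustration.

```python
import json
from functools import lru_cache

CALLS = []  # records backend calls so the effect of caching is visible


def run_query(query: str, params: dict) -> list:
    """Stand-in for graph.query; a real implementation would call Neo4j."""
    CALLS.append((query, params))
    return [{"query": query, "params": params}]


@lru_cache(maxsize=1000)
def cached_graph_query(query: str, params_str: str):
    return run_query(query, json.loads(params_str))


def query_with_cache(query: str, params: dict):
    # sort_keys makes the serialized form independent of dict insertion order
    return cached_graph_query(query, json.dumps(params, sort_keys=True))


if __name__ == "__main__":
    query_with_cache("MATCH (c:Company {name: $name}) RETURN c", {"name": "APPLE INC"})
    query_with_cache("MATCH (c:Company {name: $name}) RETURN c", {"name": "APPLE INC"})
    print("backend calls:", len(CALLS))
```

Two caveats apply to this sketch: cached results become stale after the graph is written to, and cache hits return the same list object, so callers should treat the result as read-only.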
| 358 | + |
| 359 | +### Common Mistakes and Solutions |
| 360 | + |
| 361 | +1. **Information Overload in Knowledge Graphs** |
| 362 | + - Problem: LLM extracts too much irrelevant information |
| 363 | + - Solution: Use explicit constraints and few-shot examples with negative examples |
| 364 | + |
| 365 | +2. **Slow Graph Traversals** |
| 366 | + - Problem: Queries traverse too many relationships |
| 367 | + - Solution: Add intermediate aggregation nodes and limit traversal depth |
| 368 | + |
| 369 | +3. **Poor Entity Resolution** |
| 370 | + - Problem: Same entities created with different names |
| 371 | + - Solution: Use existing graph data for entity resolution and similarity matching |
| 372 | + |
| 373 | +4. **Assuming Binary Choice** |
| 374 | + - Problem: Treating GraphRAG vs Vector RAG as an either/or decision |
| 375 | + - Solution: Use hybrid approaches that combine both methods strategically |
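
The similarity matching suggested for entity resolution above can be sketched with the standard library alone. This is an illustrative approach and not the workshop's implementation: the suffix list, the `0.85` threshold, and the function names `normalize_name` and `resolve_entity` are all assumptions, and a production system would more likely compare against names already stored in the graph using embeddings or a full-text index.

```python
import difflib
import re

# Common legal suffixes to ignore when comparing company names (illustrative list)
SUFFIXES = {"inc", "corp", "corporation", "co", "ltd", "llc", "plc"}


def normalize_name(name: str) -> str:
    """Lowercase, strip punctuation, and drop trailing legal suffixes."""
    tokens = re.sub(r"[^a-z0-9 ]", " ", name.lower()).split()
    while tokens and tokens[-1] in SUFFIXES:
        tokens.pop()
    return " ".join(tokens)


def resolve_entity(name: str, existing_names: list[str], threshold: float = 0.85):
    """Return the best-matching existing name, or None if nothing is similar enough."""
    target = normalize_name(name)
    if not target:
        return None
    best_name, best_score = None, 0.0
    for candidate in existing_names:
        score = difflib.SequenceMatcher(None, target, normalize_name(candidate)).ratio()
        if score > best_score:
            best_name, best_score = candidate, score
    return best_name if best_score >= threshold else None


if __name__ == "__main__":
    known = ["APPLE INC", "MICROSOFT CORP"]
    print(resolve_entity("Apple, Inc.", known))
    print(resolve_entity("Tesla", known))
```

When a match is found, the extracted entity can be merged into the existing node instead of creating a second node under a slightly different name.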
| 376 | + |
| 377 | + |