Commit 03a1a9a

add llms.txt to questions lesson
1 parent d74345b commit 03a1a9a

File tree

1 file changed

+377
-0
lines changed
  • asciidoc/courses/workshop-graphrag-introduction/modules/3-querying/lessons/5-questions

@@ -0,0 +1,377 @@
# Introduction to GraphRAG Workshop

> Learn how to combine graph databases with generative AI to improve the quality of LLM-generated content through GraphRAG.

This workshop covers building production-ready GraphRAG applications using Neo4j and OpenAI models. Key focus areas include schema-driven entity extraction, three retriever types (Vector, Vector+Cypher, Text2Cypher), hybrid approaches for different query types, and production considerations for scaling and optimization. The framework enables intelligent retrieval systems that leverage graph structure for explainable, deterministic results.

[Learn more about this course](https://graphacademy.neo4j.com/courses/workshop-graphrag-introduction)

## Concepts

* **GraphRAG** - Retrieval Augmented Generation using graph databases to improve LLM response quality through structured, relationship-aware retrieval
* **Knowledge Graph** - Graph of entities, relationships, and properties with semantic meaning, often extracted from unstructured data
* **Lexical Graph** - Graph structure preserving document hierarchy (Document → Chunk) while storing text chunks for semantic search
* **Domain Graph** - Graph containing business domain knowledge with structured schemas for deterministic retrieval
* **Entity Extraction** - LLM-powered process of identifying and structuring entities and relationships from unstructured text
* **Schema-Driven Extraction** - Using predefined entity types and relationships to guide AI extraction with validation and quality control
* **Vector Retriever** - Component performing semantic search across text chunks using vector embeddings and cosine similarity
* **Vector + Cypher Retriever** - Hybrid retriever combining semantic search with graph traversal to provide enriched contextual results
* **Text2Cypher Retriever** - Natural language to Cypher conversion enabling precise structured queries against the knowledge graph
* **Index-free Adjacency** - Neo4j storage design where each node holds direct pointers to its relationships, so traversals follow pointers instead of computing joins
* **Organizing Principles** - Rules defining how to classify and relate entities consistently within domain-specific schemas

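To make the Vector Retriever entry concrete, here is a minimal sketch of ranking chunks by cosine similarity. The three-dimensional vectors and chunk names are made up for illustration; in practice the embeddings would come from an embedding model and the search would run against a Neo4j vector index rather than a Python loop.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat a zero vector as dissimilar to everything
    return dot / (norm_a * norm_b)


def top_k_chunks(query_vector, chunk_vectors, k=2):
    """Return the k (name, score) pairs with the highest similarity."""
    scored = [
        (name, cosine_similarity(query_vector, vector))
        for name, vector in chunk_vectors.items()
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Hypothetical embeddings, for illustration only
chunks = {
    "risk_factors": [0.9, 0.1, 0.0],
    "revenue_summary": [0.1, 0.9, 0.2],
    "board_members": [0.0, 0.2, 0.9],
}
query = [0.8, 0.2, 0.1]
results = top_k_chunks(query, chunks, k=2)
```

The ranking depends only on the direction of the vectors, not their length, which is why cosine similarity is a common choice for comparing embeddings.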
## Overview

### Summary of GraphRAG Implementation and Best Practices

#### **Module 1: Introduction and Fundamentals**

1. **What is GraphRAG**
   - GraphRAG combines graph databases with generative AI to improve LLM response quality
   - Unlike vector-based RAG, GraphRAG provides deterministic, relationship-aware retrieval
   - Addresses vector RAG limitations: opaque similarity, chunk isolation, and hallucination-prone responses
   - Enables both local search (entity neighborhoods) and global search (pattern analysis)

2. **Vector RAG vs GraphRAG Comparison**
   - **Vector RAG strengths**: Contextual questions, synonyms, fuzzy queries, broad exploration
   - **Vector RAG weaknesses**: Fact-based queries, numerical data, logical operations, entity connections
   - **GraphRAG advantages**: Explicit relationships, structured retrieval, rich context, index-free adjacency
   - **Use case decision**: Factual/numerical queries → GraphRAG, Semantic/exploratory → Vector RAG

3. **Knowledge Graph Types for GraphRAG**
   - **Lexical Graphs**: Document hierarchy with text chunks and vector embeddings
   - **Lexical + Entities**: Combines document structure with extracted entity connections
   - **Domain Graphs**: Business domain knowledge with structured schemas and ontologies
   - **Memory Graphs**: Semantic and episodic memory for conversational context

---

#### **Module 2: Building Knowledge Graphs**

1. **GraphRAG Pipeline Architecture**
   - **Document Processing**: PDF extraction and semantic text chunking
   - **Entity Extraction**: LLM-powered identification of domain entities and relationships
   - **Graph Storage**: Structured entities and relationships stored in Neo4j
   - **Vector Indexing**: Embeddings generated for chunks enabling semantic search
   - **Quality Control**: Schema validation and entity resolution for consistency

2. **Schema-Driven Extraction Best Practices**
   ```python
   # Entity schema definition
   # RiskFactor and StockType are declared here because the relationship
   # schema below uses them as targets.
   entities = [
       {"label": "Company", "properties": [{"name": "name", "type": "STRING"}]},
       {"label": "Executive", "properties": [{"name": "name", "type": "STRING"}]},
       {"label": "FinancialMetric", "properties": [{"name": "name", "type": "STRING"}]},
       {"label": "RiskFactor", "properties": [{"name": "name", "type": "STRING"}]},
       {"label": "StockType", "properties": [{"name": "name", "type": "STRING"}]},
   ]

   # Relationship schema definition
   relations = [
       {"label": "HAS_METRIC", "source": "Company", "target": "FinancialMetric"},
       {"label": "FACES_RISK", "source": "Company", "target": "RiskFactor"},
       {"label": "ISSUED_STOCK", "source": "Company", "target": "StockType"},
   ]
   ```

3. **SimpleKGPipeline Implementation**
   ```python
   from neo4j_graphrag.experimental.pipeline.kg_builder import SimpleKGPipeline

   # Complete pipeline configuration.
   # driver, llm, embedder, and custom_prompt are assumed to be created earlier.
   kg_builder = SimpleKGPipeline(
       driver=driver,
       llm=llm,
       embedder=embedder,
       prompt_template=custom_prompt,  # Guided extraction prompts
       entities=entities,              # Entity schema
       relations=relations,            # Relationship schema
       from_pdf=True,                  # PDF processing enabled
   )

   # Execute the extraction pipeline (inside an async function or notebook cell).
   # With from_pdf=True the pipeline reads a PDF, so pass file_path;
   # pdf_path is a placeholder. Use text=... only when from_pdf=False.
   result = await kg_builder.run_async(file_path=pdf_path)
   ```

4. **Custom Extraction Prompts**
   ```python
   from neo4j_graphrag.generation.prompts import ERExtractionTemplate

   # Company validation prompt example
   company_instruction = (
       "You are an expert in extracting company information from SEC filings. "
       "When extracting, the company name must match exactly as shown below. "
       "If the text refers to 'the Company', you MUST look up the exact name. "
       "UNDER NO CIRCUMSTANCES output generic phrases."
   )

   # Prepend the custom instruction to the library's default extraction template
   custom_template = company_instruction + ERExtractionTemplate.DEFAULT_TEMPLATE
   prompt_template = ERExtractionTemplate(template=custom_template)
   ```

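Because the relationship schema refers to entity labels by name, it is easy for the two lists to drift apart. The following is a minimal sketch of a consistency check, assuming the dictionary shapes used in the schema example above (`label`, `source`, `target`). The function name `find_undeclared_labels` and the sample data are hypothetical and not part of the Neo4j GraphRAG library.

```python
def find_undeclared_labels(entities, relations):
    """Return (relation, end, label) triples whose label is not a declared entity."""
    declared = {entity["label"] for entity in entities}
    problems = []
    for relation in relations:
        for end in ("source", "target"):
            if relation[end] not in declared:
                problems.append((relation["label"], end, relation[end]))
    return problems


# Deliberately incomplete schema, for illustration only
example_entities = [
    {"label": "Company"},
    {"label": "FinancialMetric"},
]
example_relations = [
    {"label": "HAS_METRIC", "source": "Company", "target": "FinancialMetric"},
    {"label": "FACES_RISK", "source": "Company", "target": "RiskFactor"},
]

problems = find_undeclared_labels(example_entities, example_relations)
for relation_label, end, missing in problems:
    print(f"{relation_label}: {end} label '{missing}' is not a declared entity")
```

Running a check like this before starting an extraction run is cheap compared with discovering the mismatch after the LLM calls have been made.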
---

#### **Module 3: Querying and Retrieval**

1. **Vector Retriever Implementation**
   ```python
   from neo4j_graphrag.retrievers import VectorRetriever

   vector_retriever = VectorRetriever(
       driver=driver,
       index_name='chunkEmbeddings',
       embedder=embedder,
       return_properties=['text']
   )

   # Best for: Semantic exploration, broad topic search, conceptual questions
   # The search text is passed as query_text
   result = vector_retriever.search(query_text="What are Apple's main risks?", top_k=5)
   ```

2. **Vector + Cypher Retriever Implementation**
   ```python
   from neo4j_graphrag.retrievers import VectorCypherRetriever

   # Custom Cypher for graph traversal.
   # The retriever runs the vector index search first and then this query,
   # with `node` (the matched chunk) and `score` already in scope.
   retrieval_query = """
   WITH node AS chunk, score

   // Traverse to entities extracted from the chunk
   MATCH (chunk)<-[:FROM_CHUNK]-(company:Company)
   OPTIONAL MATCH (company)-[r]->(related)
   WHERE NOT related:Chunk AND NOT related:Document

   RETURN chunk.text AS context,
          score,
          company.name AS entity,
          collect(DISTINCT related.name) AS related_entities,
          collect(DISTINCT type(r)) AS relationship_types
   """

   vector_cypher_retriever = VectorCypherRetriever(
       driver=driver,
       index_name='chunkEmbeddings',
       embedder=embedder,
       retrieval_query=retrieval_query
   )

   # Best for: Entity-specific context, relationship exploration
   result = vector_cypher_retriever.search(query_text="Apple's financial performance", top_k=5)
   ```

3. **Text2Cypher Retriever Implementation**
   ```python
   from neo4j_graphrag.retrievers import Text2CypherRetriever

   # Schema information for LLM
   schema = """
   Node types:
   - Company: {name: STRING}
   - FinancialMetric: {name: STRING}
   - RiskFactor: {name: STRING}

   Relationship types:
   - (Company)-[:HAS_METRIC]->(FinancialMetric)
   - (Company)-[:FACES_RISK]->(RiskFactor)
   """

   text2cypher_retriever = Text2CypherRetriever(
       driver=driver,
       llm=llm,
       neo4j_schema=schema
   )

   # Best for: Precise queries, numerical data, structured facts
   result = text2cypher_retriever.search(query_text="How many companies face cybersecurity risks?")
   ```

4. **Advanced Query Patterns**
   ```cypher
   // Count entities by type
   MATCH (e)
   WHERE NOT e:Document AND NOT e:Chunk
   RETURN labels(e) AS entityType, count(e) AS count
   ORDER BY count DESC;

   // Explore company relationships
   MATCH (c:Company {name: 'APPLE INC'})
   RETURN c.name,
     COUNT { (c)-[r1]->(extracted) WHERE NOT extracted:Chunk } AS extractedEntities,
     COUNT { (:AssetManager)-[:OWNS]->(c) } AS assetManagers,
     COUNT { (c)-[:FROM_CHUNK]->(chunk:Chunk) } AS textChunks;

   // Find relationship patterns
   MATCH (c:Company)-[r]->(entity)
   RETURN c.name, type(r) AS relationship, entity.name
   ORDER BY c.name, relationship;
   ```

---

#### **Key Implementation Insights**

**Technology Stack:** Neo4j GraphRAG Library, OpenAI GPT-4o/embeddings, LangChain integration

**Quality & Performance:**
- Schema-driven extraction with entity validation and context resolution
- Specific relationship types and proper MERGE patterns for performance
- Caching and traversal depth limits for optimization
- Vector RAG for exploration, GraphRAG for precise/factual queries

## Implementation Patterns

### Entity Extraction with Quality Control

```python
from typing import List

# Assumed to be LangChain's parser; the original snippet did not show the import
from langchain_core.output_parsers import PydanticOutputParser
from pydantic import BaseModel, Field

class Entity(BaseModel):
    name: str = Field(description="The entity name")
    type: str = Field(description="Entity type: Person, Location, Organization")
    properties: dict = Field(default_factory=dict, description="Additional properties")

class Relationship(BaseModel):
    source: str = Field(description="Source entity name")
    target: str = Field(description="Target entity name")
    type: str = Field(description="Relationship type")
    properties: dict = Field(default_factory=dict)

class ExtractionResult(BaseModel):
    """Wrapper model: the parser expects one BaseModel class, not List[Entity]."""
    entities: List[Entity] = Field(default_factory=list)
    relationships: List[Relationship] = Field(default_factory=list)

# Structured output parser for consistent extraction
parser = PydanticOutputParser(pydantic_object=ExtractionResult)
```

### Graph Modeling Best Practices

```cypher
// Good - specific, performant relationships
MATCH (company:Company)-[:HAS_METRIC]->(metric:FinancialMetric)
WHERE metric.name CONTAINS 'revenue'
RETURN company.name, metric.name;

// Bad - generic relationship type filtered by property, which is slower
MATCH (company:Company)-[r:RELATED_TO]->(entity)
WHERE r.type = 'has_metric' AND entity.name CONTAINS 'revenue'
RETURN company.name, entity.name;

// Proper MERGE usage - match on the identifying key only, then set the rest
MERGE (c:Company {name: "Apple Inc"})
SET c.ticker = "AAPL", c.sector = "Technology";

// Bad - MERGE matches the whole pattern, so an existing "Apple Inc" node
// without these exact ticker and sector values is duplicated
MERGE (c:Company {name: "Apple Inc", ticker: "AAPL", sector: "Technology"});
```

### Text-to-Cypher Safety and Validation

```python
import re

# Literal Cypher braces are doubled ({{ }}) so that filling the template with
# str.format(schema=..., question=...) does not treat them as placeholders.
TEXT_TO_CYPHER_PROMPT = """
You are an expert at converting natural language to Cypher queries.
Use ONLY read-only operations: MATCH, RETURN, WHERE, ORDER BY, LIMIT.
NEVER use CREATE, DELETE, SET, MERGE, or any write operations.

Schema:
{schema}

Examples:
Question: "How many companies are in the database?"
Cypher: MATCH (c:Company) RETURN count(c) as total_companies

Question: "What financial metrics does Apple have?"
Cypher: MATCH (c:Company {{name: 'APPLE INC'}})-[:HAS_METRIC]->(m:FinancialMetric) RETURN m.name

Convert this question: {question}
Only return the Cypher query, nothing else.
"""

FORBIDDEN_KEYWORDS = ['CREATE', 'DELETE', 'SET', 'MERGE', 'REMOVE', 'DROP']
_FORBIDDEN_PATTERN = re.compile(r"\b(" + "|".join(FORBIDDEN_KEYWORDS) + r")\b", re.IGNORECASE)

def validate_cypher_safety(query: str) -> bool:
    """Return True if the query contains no write keyword as a whole word"""
    # Whole-word matching avoids false positives such as 'OFFSET' or 'created_at'.
    # This is a keyword filter, not a parser: it still rejects keywords inside strings.
    return _FORBIDDEN_PATTERN.search(query) is None
```

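Even when the prompt asks for the query only, a model may wrap its answer in a Markdown code fence or add a trailing semicolon. The following is a minimal sketch of normalizing the response before validating it; `extract_cypher` is a hypothetical helper, not part of any library mentioned here.

```python
import re

# The fence marker is built programmatically so that this example can itself
# be shown inside a fenced code block.
FENCE = "`" * 3
_FENCED_BLOCK = re.compile(
    FENCE + r"(?:cypher)?\s*(.*?)" + FENCE,
    re.DOTALL | re.IGNORECASE,
)


def extract_cypher(response: str) -> str:
    """Return the Cypher text from an LLM response, without fences or a trailing ';'."""
    match = _FENCED_BLOCK.search(response)
    query = match.group(1) if match else response
    return query.strip().rstrip(";").strip()


fenced = FENCE + "cypher\nMATCH (c:Company) RETURN count(c) AS total\n" + FENCE
cleaned = extract_cypher("Here is the query:\n" + fenced)
```

Cleaning the response first means the safety check and the database see exactly the same text.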
### Hybrid Retrieval Strategy

```python
import re

class HybridGraphRAGRetriever:
    def __init__(self, vector_retriever, vector_cypher_retriever, text2cypher_retriever):
        self.vector_retriever = vector_retriever
        self.vector_cypher_retriever = vector_cypher_retriever
        self.text2cypher_retriever = text2cypher_retriever

    def route_query(self, question: str) -> str:
        """Route questions to the appropriate retriever based on type"""
        q = question.lower()
        # Whole-word matching so 'count' does not match 'account' or 'country'
        if re.search(r"\b(how many|count|number of)\b", q):
            return 'text2cypher'
        elif re.search(r"\b(what|which|who)\b", q):
            return 'vector_cypher'
        else:
            return 'vector'

    def retrieve(self, question: str, top_k: int = 5):
        retriever_type = self.route_query(question)

        if retriever_type == 'text2cypher':
            return self.text2cypher_retriever.search(query_text=question)
        elif retriever_type == 'vector_cypher':
            return self.vector_cypher_retriever.search(query_text=question, top_k=top_k)
        else:
            return self.vector_retriever.search(query_text=question, top_k=top_k)
```

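Keyword routing picks one retriever up front, but the chosen retriever can still fail or return nothing. One possible complement, sketched below under simplifying assumptions, is to try retrievers in priority order and fall back to the next one. Here each retriever is modeled as a plain function returning a list of strings; the stub functions are hypothetical stand-ins for the real retriever objects.

```python
from typing import Callable, List, Sequence

Retriever = Callable[[str], List[str]]


def retrieve_with_fallback(question: str, retrievers: Sequence[Retriever]) -> List[str]:
    """Try each retriever in order and return the first non-empty result."""
    for retrieve in retrievers:
        try:
            results = retrieve(question)
        except Exception:
            continue  # for example, the generated Cypher failed to run
        if results:
            return results
    return []


# Hypothetical stand-ins for the three retrievers
def failing_text2cypher(question: str) -> List[str]:
    raise ValueError("generated Cypher was invalid")


def empty_vector_cypher(question: str) -> List[str]:
    return []


def vector_search(question: str) -> List[str]:
    return [f"chunk about: {question}"]


answer = retrieve_with_fallback(
    "What are Apple's main risks?",
    [failing_text2cypher, empty_vector_cypher, vector_search],
)
```

Ordering the list from most precise to most general keeps structured answers when they are available and still returns some context when they are not.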
### Error Handling and Robustness

```python
from typing import Optional

# llm_generate_cypher, validate_cypher_safety, and graph are assumed to be defined elsewhere
def robust_text_to_cypher(question: str, max_retries: int = 3) -> Optional[str]:
    """Generate Cypher with retry logic and validation; return None if every attempt fails"""
    prompt = question
    for attempt in range(max_retries):
        try:
            cypher = llm_generate_cypher(prompt)
            if not validate_cypher_safety(cypher):
                raise ValueError("Unsafe Cypher query detected")
            # EXPLAIN plans the query without running it; appending " LIMIT 1"
            # would break queries that already end with a LIMIT clause
            graph.query(f"EXPLAIN {cypher}")
            return cypher
        except Exception as e:
            # Feed the latest error back to the model on the next attempt
            prompt = f"[Previous attempt failed: {e}] {question}"
    return None
```

## Production Considerations

### Scaling Strategies
- **Small datasets** (< 100k nodes): Single Neo4j instance with vector indexes
- **Medium datasets** (100k - 1M nodes): Neo4j cluster with read replicas for retrieval
- **Large datasets** (> 1M nodes): Distributed setup with graph partitioning strategies

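### Caching Implementation
```python
import hashlib
import json
from functools import lru_cache
from typing import Dict, List

# graph, redis_client, and extract_entities are assumed to be defined elsewhere

# Cache Cypher query results in-process with an LRU cache. Parameters are passed
# as a JSON string because lru_cache requires hashable arguments.
@lru_cache(maxsize=1000)
def cached_graph_query(query: str, params_str: str):
    return graph.query(query, json.loads(params_str))

# Cache entity extraction results in Redis
def get_or_cache_entities(text_chunk: str) -> List[Dict]:
    # Built-in hash() is salted per process, so use a stable digest for shared keys
    digest = hashlib.sha256(text_chunk.encode("utf-8")).hexdigest()
    cache_key = f"entities:{digest}"
    cached = redis_client.get(cache_key)
    if cached:
        return json.loads(cached)

    entities = extract_entities(text_chunk)
    redis_client.setex(cache_key, 3600, json.dumps(entities))  # expire after one hour
    return entities
```
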
### Common Mistakes and Solutions

1. **Information Overload in Knowledge Graphs**
   - Problem: The LLM extracts too much irrelevant information
   - Solution: Use explicit constraints and few-shot prompts that include negative examples

2. **Slow Graph Traversals**
   - Problem: Queries traverse too many relationships
   - Solution: Add intermediate aggregation nodes and limit traversal depth

3. **Poor Entity Resolution**
   - Problem: The same entity is created multiple times under different names
   - Solution: Use existing graph data for entity resolution and similarity matching

4. **Assuming Binary Choice**
   - Problem: Treating GraphRAG vs Vector RAG as an either/or decision
   - Solution: Use hybrid approaches that combine both methods strategically

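As an illustration of the entity resolution advice, the sketch below matches a newly extracted company name against names already in the graph using normalization and string similarity. The suffix list, the 0.85 threshold, and the function names are assumptions chosen for the example; a production system might instead compare embeddings or use a dedicated resolution step.

```python
import re
from difflib import SequenceMatcher
from typing import Iterable, Optional

# Common legal suffixes to ignore when comparing company names (illustrative list)
_SUFFIXES = re.compile(
    r"\b(inc|incorporated|corp|corporation|co|ltd|llc)\b\.?",
    re.IGNORECASE,
)


def normalize_name(name: str) -> str:
    """Lowercase, drop legal suffixes and punctuation, collapse whitespace."""
    name = _SUFFIXES.sub("", name.lower())
    name = re.sub(r"[^a-z0-9 ]", " ", name)
    return " ".join(name.split())


def resolve_entity(
    candidate: str, existing_names: Iterable[str], threshold: float = 0.85
) -> Optional[str]:
    """Return the best-matching existing name, or None if nothing is close enough."""
    target = normalize_name(candidate)
    best_name, best_score = None, 0.0
    for existing in existing_names:
        score = SequenceMatcher(None, target, normalize_name(existing)).ratio()
        if score > best_score:
            best_name, best_score = existing, score
    return best_name if best_score >= threshold else None


existing = ["APPLE INC", "MICROSOFT CORP"]
match = resolve_entity("Apple Inc.", existing)
no_match = resolve_entity("Nvidia Corporation", existing)
```

When a match is found, the extracted entity can be merged into the existing node instead of creating a second one; when it is not, a new node is created.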
