- Documentation: Get Started
- LazyGraphRAG (affordable GraphRAG) is still in development (as of Nov 2025): see the tracking issue.
- GraphRAG: Automatically builds its own graph without any user input.
- Query mode:
- Global: community-scan approach; map-reduces over community summaries to answer corpus-wide questions (see the sketch below).
- Local: embedding lookups over entities, with only minor gains compared to plain RAG.
- DRIFT Search: enhanced local search that integrates community information into the query process.
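A minimal Python sketch of the global (map-reduce) idea, not the library's actual API: `llm` is a hypothetical prompt-in, text-out callable, and the parquet path is an assumption.

```python
import pandas as pd

def global_search(question: str, llm) -> str:
    """Map-reduce sketch: partial answers per community report, then combine."""
    reports = pd.read_parquet("output/community_reports.parquet")  # path assumed
    # Map: draft a partial answer from each community summary.
    partials = [
        llm(f"Answer '{question}' using only this report:\n{summary}")
        for summary in reports["summary"]
    ]
    # Reduce: merge the partial answers into one global answer.
    return llm(f"Combine these partial answers to '{question}':\n" + "\n\n".join(partials))
```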
- Indexing Pipeline
- DataShaper is used for creating graph pipelines.
- Parquet file-based storage: No specific database has been implemented for storage.
- LanceDB: Used for embedding lookups.
- Chunking strategy: The chunk size can be configured in the settings (default: 1200). No special methods are required for chunking.
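For intuition, token-window chunking with the defaults looks roughly like this (a sketch with `tiktoken`, assuming the default 1200-token size and 100-token overlap; not GraphRAG's exact implementation):

```python
import tiktoken

def chunk_text(text: str, size: int = 1200, overlap: int = 100) -> list[str]:
    """Split text into token windows of `size`, each sharing `overlap` tokens."""
    enc = tiktoken.get_encoding("cl100k_base")
    tokens = enc.encode(text)
    step = size - overlap
    return [
        enc.decode(tokens[start:start + size])
        for start in range(0, len(tokens), step)
    ]
```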
- Integration with Third-Party Libraries
- Web Application (backup under the `docs/app` directory)
- 2.1 Source Documents → Text Chunks
- 2.2 Text Chunks → Element Instances: multiple rounds of extraction (gleaning)
- 2.3 Element Instances → Element Summaries: extracted descriptions are summarized
- 2.4 Element Summaries → Graph Communities: Leiden clustering into a 2-level community hierarchy (lv0, lv1); see the sketch after this list
- 2.5 Graph Communities → Community Summaries: report-like summaries of each community
- 2.6 Community Summaries → Community Answers → Global Answer
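GraphRAG's community step uses a hierarchical Leiden implementation (graspologic). As a minimal stand-in for step 2.4, a flat Leiden partition with `igraph` + `leidenalg` on a toy graph:

```python
import igraph as ig
import leidenalg

# Toy entity graph: two dense triangles joined by a single bridge edge.
g = ig.Graph(edges=[(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)])
partition = leidenalg.find_partition(g, leidenalg.ModularityVertexPartition)
print(partition.membership)  # e.g. [0, 0, 0, 1, 1, 1]: one community per triangle
```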
GraphRAG v2 (2.0+) simplified the indexing workflow and output structure by removing DataShaper-based workflow naming conventions and consolidating related data.
```mermaid
flowchart TD
    A[["Raw Documents"]] --> B["documents.parquet: Document metadata"]
    A --> C["text_units.parquet: Chunked text"]
    C --> D["entities.parquet: Entities with embeddings"]
    C --> E["relationships.parquet: Entity relationships"]
    D --> F["communities.parquet: Hierarchical communities"]
    E --> F
    F --> G["community_reports.parquet: Community summaries"]
    C --> H["covariates.parquet: Claims and context"]
    style D fill:#e1f5ff
    style F fill:#fff4e1
    style G fill:#ffe1e1
```
Key Changes from v1 to v2:
- Simplified Output Files: Removed the `create_final_` prefix from all output files
- Consolidated Graph Data: Merged node properties (degree, x, y coordinates) into the entities table
- Enhanced Community Structure: Added a `children` field to communities for easier hierarchical traversal
- Renamed Fields: Changed `attributes` to `metadata` in the documents table
- Removed Redundancy: Eliminated the separate `nodes` table (data merged into `entities`)
v2 Output Files:
| File | Description | Key Fields |
|---|---|---|
| documents.parquet | Source document metadata | id, title, metadata, text_unit_ids |
| text_units.parquet | Text chunks with mappings | id, text, chunk, entity_ids, relationship_ids, document_ids |
| entities.parquet | Entities with graph properties | id, name, type, description, text_unit_ids, description_embedding, degree, x, y, frequency |
| relationships.parquet | Entity relationships | source, target, description, text_unit_ids, rank, weight |
| communities.parquet | Community hierarchy | id, level, title, relationship_ids, text_unit_ids, children, parent |
| community_reports.parquet | Community summaries | community, title, summary, full_content, rank, children |
| covariates.parquet | Extracted claims | id, subject_id, object_id, type, status, description, text_unit_ids |
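All outputs are plain Parquet, so they can be inspected directly with pandas (the `output/` directory name is an assumption; check your settings):

```python
import pandas as pd

# Load two of the v2 index outputs and peek at the key fields.
entities = pd.read_parquet("output/entities.parquet")
relationships = pd.read_parquet("output/relationships.parquet")
print(entities[["name", "type", "degree"]].head())
print(relationships[["source", "target", "weight"]].head())
```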
Migration from v1 to v2:
See docs/index_migration_to_v2.ipynb for the complete migration notebook.
```python
# Key transformation steps:
# 1. Rename: attributes → metadata in documents
# 2. Merge: node properties (degree, x, y) into entities
# 3. Add: children field to communities
# 4. Remove: create_final_* prefix from all files
```
- Hierarchy: Documents → Text units → Entities/Relationships → Communities → Community reports
- The output files: All files use simplified names without workflow prefixes.
```cmd
documents.parquet
text_units.parquet
entities.parquet
relationships.parquet
communities.parquet
community_reports.parquet
covariates.parquet
```
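A rough pandas sketch of the migration steps listed above (file and column names are taken from the tables in this section, the join key is an assumption, and the notebook remains the authoritative reference):

```python
import pandas as pd

docs = pd.read_parquet("create_final_documents.parquet")
nodes = pd.read_parquet("create_final_nodes.parquet")
entities = pd.read_parquet("create_final_entities.parquet")

# 1. Rename: attributes → metadata in documents.
docs = docs.rename(columns={"attributes": "metadata"})

# 2. Merge: node properties (degree, x, y) into entities (join key assumed to be id).
entities = entities.merge(nodes[["id", "degree", "x", "y"]], on="id", how="left")

# 4. Write out under the simplified v2 names (step 3, deriving the children
#    field from community parent links, is omitted in this sketch).
docs.to_parquet("documents.parquet")
entities.to_parquet("entities.parquet")
```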
```mermaid
flowchart TD
    A[["Raw Documents"]] --> B["create_base_text_units: Chunked text"]
    B --> C["create_base_extracted_entities: Entity extraction"]
    C --> D["create_summarized_entities: Entity summaries"]
    D --> E["create_base_entity_graph: Graph construction"]
    E --> F["create_final_entities: Entities with embeddings"]
    E --> G["create_final_nodes: Node properties"]
    E --> H["create_final_communities: Community structure"]
    E --> J["create_final_relationships: Relationships"]
    F --> I["join_text_units_to_entity_ids: Entity mappings"]
    J --> K["join_text_units_to_relationship_ids: Relationship mappings"]
    G --> L["create_final_community_reports: Community summaries"]
    B --> M["create_final_text_units: Text units"]
    M --> N["create_base_documents: Document structure"]
    N --> O["create_final_documents: Final documents"]
    G --> J
    J --> L
    I --> M
    K --> M
    style F fill:#e1f5ff
    style H fill:#fff4e1
    style L fill:#ffe1e1
    style O fill:#e1ffe1
```
v1 Workflow Steps:
| Step | Input | Output |
|---|---|---|
| create_base_text_units | Raw text documents | Chunked text |
| create_base_extracted_entities | Chunked text | GraphML XML with entities |
| create_summarized_entities | GraphML XML (entities) | GraphML XML with summaries |
| create_base_entity_graph | GraphML XML (summaries) | GraphML XML (base graph) |
| create_final_entities | GraphML XML (base graph) | Entity list with embeddings |
| create_final_nodes | GraphML XML (base graph) | Community ID, node properties |
| create_final_communities | GraphML XML (base graph) | Community ID, relationship IDs |
| join_text_units_to_entity_ids | Entity list with embeddings | text_unit_id → entity_ids mapping |
| create_final_relationships | GraphML XML + node data | Source, target, text_unit_id, rank |
| join_text_units_to_relationship_ids | Relationship data | text_unit_id → relationship_ids mapping |
| create_final_community_reports | Community + relationship data | AI-generated community summaries |
| create_final_text_units | Chunked text + mappings | text_unit_id, entity_ids, relationship_ids |
| create_base_documents | Text unit mappings | Document and text_unit_ids |
| create_final_documents | Document structure | Final documents with metadata |
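Several v1 steps exchange GraphML; if graph snapshots are enabled, an intermediate graph can be inspected with networkx (the file name here is an assumption):

```python
import networkx as nx

# Load an intermediate graph snapshot (path assumed; requires snapshots enabled).
G = nx.read_graphml("output/summarized_graph.graphml")
print(G.number_of_nodes(), G.number_of_edges())
# Each node carries the extracted attributes, e.g. the summarized description.
for node, data in list(G.nodes(data=True))[:3]:
    print(node, data.get("description", ""))
```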
- Hierarchy: Documents → Text units → Entity IDs, relationship IDs → Communities → Community reports
- The files with 'final' in their names: tables tagged 'final' are the pipeline's end results, and these are the ones used for rendering a graph.
```cmd
create_final_entities.parquet
create_final_relationships.parquet
create_final_documents.parquet
create_final_text_units.parquet
create_final_communities.parquet
create_final_community_reports.parquet
create_final_covariates.parquet
```
- Run `graphrag init` to create two files: `.env` and `settings.yaml`. See the samples in `.env.sample` and `settings.sample.yaml`.
```cmd
graphrag init
```
- `.env`:
```cmd
GRAPHRAG_API_KEY=<your-azure-openai-key>
```
- `settings.yaml`:
```yaml
models:
  default_chat_model:
    type: chat
    api_base: https://<instance>.openai.azure.com
    api_version: <your-api-version>
    deployment_name: <azure_model_deployment_name>
  default_embedding_model:
    type: embedding
    api_base: https://<instance>.openai.azure.com
    api_version: <your-api-version>
    deployment_name: <azure_embedding_deployment_name>
```
- Workflows are expressed as sequences of steps, which we call verbs. Each step has a verb name and a configuration object. In DataShaper, these verbs model relational concepts such as SELECT, DROP, JOIN, etc. Each verb transforms an input data table, and that table is passed down the pipeline.
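The verb idea in miniature, as a conceptual Python sketch rather than DataShaper's actual API (the verb names and config shape are assumptions):

```python
import pandas as pd

# Hypothetical verb registry mirroring the relational-verbs idea.
VERBS = {
    "select": lambda df, cfg: df[cfg["columns"]],
    "drop": lambda df, cfg: df.drop(columns=cfg["columns"]),
}

def run_pipeline(table: pd.DataFrame, steps: list[dict]) -> pd.DataFrame:
    """Apply each step's verb to the table and pass the result down the pipeline."""
    for step in steps:
        table = VERBS[step["verb"]](table, step["config"])
    return table

df = pd.DataFrame({"a": [1, 2], "b": [3, 4], "c": [5, 6]})
out = run_pipeline(df, [
    {"verb": "select", "config": {"columns": ["a", "b"]}},
    {"verb": "drop", "config": {"columns": ["b"]}},
])
print(out)  # only column "a" remains
```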
- A claim might not be explicitly stated in the source text but can be inferred from the context. For example, if a text mentions that “Gustave Eiffel designed a famous landmark in Paris,” we can deduce the claim that “The Eiffel Tower is located in Paris” based on our knowledge that the Eiffel Tower is the famous landmark designed by Gustave Eiffel. This inferred information helps provide a more comprehensive understanding of the relationships between entities.
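Extracted claims land in covariates.parquet; as a mental model, one row looks roughly like this sketch (fields mirror the table above; the example values are illustrative):

```python
from dataclasses import dataclass, field

@dataclass
class Claim:
    """One covariate row, mirroring the covariates.parquet fields."""
    id: str
    subject_id: str                # e.g. entity id for "The Eiffel Tower"
    object_id: str                 # e.g. entity id for "Paris"
    type: str                      # e.g. "LOCATED_IN"
    status: str                    # e.g. "TRUE", "FALSE", or "SUSPECTED"
    description: str               # "The Eiffel Tower is located in Paris"
    text_unit_ids: list[str] = field(default_factory=list)  # supporting chunks
```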
- The quality of chunking depends on the chunk size: with longer chunks, a single extraction pass misses entities, so the number of entities extracted differs significantly by chunk size. To maximize entity extraction, multiple rounds of gleaning are performed (see `CONTINUE_PROMPT` & `LOOP_PROMPT` in src).
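The gleaning loop in outline, as a hedged sketch: `llm` is a hypothetical callable standing in for a chat session that keeps history, and the prompt strings only paraphrase `CONTINUE_PROMPT`/`LOOP_PROMPT`:

```python
def extract_with_gleaning(chunk: str, llm, max_gleanings: int = 1) -> list[str]:
    """One extraction pass plus up to `max_gleanings` follow-up rounds."""
    results = [llm(f"Extract all entities and relationships from:\n{chunk}")]
    for _ in range(max_gleanings):
        # Stand-in for CONTINUE_PROMPT: push the model to add what it missed.
        results.append(llm("MANY entities were missed in the last extraction. Add them."))
        # Stand-in for LOOP_PROMPT: stop once the model says nothing is missing.
        if llm("Were any entities still missed? Answer YES or NO.") != "YES":
            break
    return results
```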
- Graph rank: r(G) = n - c, where n is the number of vertices (nodes) in the graph and c is the number of connected components.
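A quick check of the formula on a toy graph with networkx:

```python
import networkx as nx

# Two components: a triangle (3 nodes) plus one isolated edge (2 nodes).
G = nx.Graph([(0, 1), (1, 2), (0, 2), (3, 4)])
n = G.number_of_nodes()                    # n = 5
c = nx.number_connected_components(G)      # c = 2
print(n - c)                               # rank r(G) = 3
```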