Description
This pull request introduces an optional workflow called merge_entities, which can be run after the extract_graph workflow. It aims to merge duplicate or near-duplicate entities (e.g., car and cars, or PCA and principal component analysis) in the entity and relationship tables.
Motivation
Currently, GraphRAG may extract entities that are semantically similar but not identical. These duplicates increase the number of sparse or fragmented nodes in the knowledge graph and may negatively affect community detection and other downstream tasks.
By merging these entities, the graph becomes more semantically compact and meaningful, with improved structure and potentially better community coherence.
I created a graph about the soldering process. Without merging entities, "Increased board complexity" was a separate fragment and no community report was created for it. After merging entities, it is connected to the main node "soldering" and a community is created.
Proposed Changes
Add a new optional merge_entities workflow
Add config for the merge_entities workflow (e.g. enable: true/false, ...)
Add workflow to default workflows
Add merge_entities prompt
Add a JSON log file of the LLM output to the output folder
Checklist
I would really appreciate your feedback. If you think this is a good feature, I will work on the documentation and unit tests.
Here are some examples of merged entities:
SOLDER
Merged from: SOLDER, MOLTEN SOLDER, SOLDER JOINTS, SOLDER JOINT, SOLDERED JOINT
CLEANING
Merged from: CLEANING, CLEANING PROCESSES, CLEANING PROCESS
WAVE SOLDERING
Merged from: WAVE SOLDERING, CS (WAVE SOLDERING) PROCESS
MACHINE SOLDERING
Merged from: MACHINE SOLDERING, SOLDERING MACHINE
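To make the effect of merge groups like the ones above concrete, here is a minimal sketch of how such groups could be applied to the entity and relationship tables. It is not the code in this PR: the function names, the plain-dict row layout, and the field names (`title`, `description`, `source`, `target`, `weight`) are assumptions chosen for illustration, and the choices to sum weights and drop self-loops are just one possible policy.

```python
from collections import defaultdict

# Merge groups copied from two of the examples above: canonical name -> variants.
# In the real workflow these would come from the LLM output; here they are hard-coded.
MERGE_GROUPS = {
    "SOLDER": ["SOLDER", "MOLTEN SOLDER", "SOLDER JOINTS", "SOLDER JOINT", "SOLDERED JOINT"],
    "CLEANING": ["CLEANING", "CLEANING PROCESSES", "CLEANING PROCESS"],
}


def build_alias_map(groups):
    """Flatten {canonical: [variants]} into {variant: canonical}."""
    alias = {}
    for canonical, variants in groups.items():
        for variant in variants:
            alias[variant] = canonical
    return alias


def merge_entity_rows(entities, alias):
    """Collapse entity rows that share a canonical title, keeping distinct descriptions."""
    merged = {}
    for row in entities:
        title = alias.get(row["title"], row["title"])
        slot = merged.setdefault(title, {"title": title, "descriptions": []})
        if row["description"] not in slot["descriptions"]:
            slot["descriptions"].append(row["description"])
    return list(merged.values())


def merge_relationship_rows(relationships, alias):
    """Rewrite endpoints to canonical titles, then combine duplicate edges."""
    weights = defaultdict(float)
    for row in relationships:
        source = alias.get(row["source"], row["source"])
        target = alias.get(row["target"], row["target"])
        if source == target:
            continue  # assumption: edges that become self-loops after merging are dropped
        weights[(source, target)] += row["weight"]  # assumption: weights are summed
    return [
        {"source": s, "target": t, "weight": w} for (s, t), w in weights.items()
    ]
```

With this sketch, an edge from MOLTEN SOLDER to some node X and an edge from SOLDER to X collapse into a single SOLDER to X edge, which is the mechanism by which previously isolated fragments can end up attached to the main component.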