Description
A number of users have asked how to add new content to an existing index without needing to re-run the entire process. This is a feature we are planning; we are in the design stage now to make sure we take an efficient approach.
As it stands, new content can already be added to a GraphRAG index without repeating every model call, because we rely heavily on a cache to avoid duplicate requests to the model API. This is very efficient for the pipeline stages that are atomic and have no upstream dependencies. For example, if you add new documents to a folder, we will not re-chunk the existing documents or re-run entity and relationship extraction on them; we will simply fetch the processed content from the cache and pass it on. Only the new documents will be processed and have new entities and relationships extracted. Downstream of this, the graph construction process will need to recreate the graph to include the new nodes and edges, and communities will be recomputed - resulting in re-summarization, etc. You can get a sense of this process, and of which downstream steps may be re-processed, by looking at the data flow diagram.
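To make the cache behavior concrete, here is a minimal sketch of a content-hash-keyed extraction cache. The function names, the cache layout, and the `call_model_api` stub are illustrative assumptions, not GraphRAG's actual implementation:

```python
import hashlib
import json
from pathlib import Path

CACHE_DIR = Path(".cache/extraction")  # illustrative location, not GraphRAG's real layout

def call_model_api(chunk_text: str) -> dict:
    # Stand-in for the real model call; in the actual pipeline this is the
    # expensive entity/relationship extraction request.
    return {"entities": [], "relationships": [], "source_len": len(chunk_text)}

def cache_key(chunk_text: str, prompt_version: str) -> str:
    # Key on chunk content plus prompt version: unchanged chunks always hit
    # the cache, while a prompt change invalidates old results.
    return hashlib.sha256(f"{prompt_version}:{chunk_text}".encode()).hexdigest()

def extract_entities(chunk_text: str, prompt_version: str = "v1") -> dict:
    path = CACHE_DIR / f"{cache_key(chunk_text, prompt_version)}.json"
    if path.exists():
        # Existing document chunk: reuse the cached model output, no API call.
        return json.loads(path.read_text())
    result = call_model_api(chunk_text)  # new chunk: pay for one model call
    CACHE_DIR.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(result))
    return result
```

Because the key depends only on the chunk content (and prompt version), adding new documents never invalidates existing entries, which is exactly why the extraction stage is cheap when re-running over a grown folder.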
Describe the solution you'd like
An ideal solution would be to add a new command to GraphRAG, such as `update`, that can be run against new data to augment an existing index. Considerations here include evaluating new entities to determine whether they can be added to an existing community, and deciding when a community has been altered enough to constitute a "drift" that requires recomputing. We could also analyze which communities have actually been edited, so that we skip summarization for those that haven't changed (a sketch of this placement-and-tracking idea follows below).
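As a rough illustration of that first consideration, here is a sketch of placing new entities into existing communities by a neighbor-majority vote, while tracking which communities become "dirty" and which entities cannot be placed. This is a hypothetical helper written against networkx, not a committed design:

```python
from collections import Counter

import networkx as nx

def place_new_entities(
    graph: nx.Graph,
    communities: dict[str, int],
    new_entities: list[str],
) -> tuple[set[int], list[str]]:
    """Assign each new entity to the community most of its neighbors belong to.

    Assumes `new_entities` have already been added to `graph` with their
    extracted edges, and `communities` maps entity -> community id from the
    last full Leiden run. Returns the set of communities whose membership
    changed (their summaries are now stale) and the entities that could
    not be placed.
    """
    dirty: set[int] = set()
    unplaced: list[str] = []
    for entity in new_entities:
        neighbor_comms = Counter(
            communities[n] for n in graph.neighbors(entity) if n in communities
        )
        if neighbor_comms:
            best = neighbor_comms.most_common(1)[0][0]
            communities[entity] = best
            dirty.add(best)  # membership changed, so this summary is stale
        else:
            unplaced.append(entity)  # no placed neighbors; candidate for re-running Leiden
    return dirty, unplaced
```

The `dirty` set is what would drive selective re-summarization, and `unplaced` feeds the drift thresholds discussed under Approach below.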
Additional context
We also need to consider the types of analysis incremental ingest can enable beyond just "updates". For example, daily ingest of news with thoughtful graph construction/annotation could allow for delta analysis, supporting questions like "what happened with person x in the last 24 hours?" or "catch me up on the news themes this week".
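None of this metadata exists today, but as a sketch of what it could look like: if each relationship were annotated with an ingest timestamp at indexing time, a time-windowed question could be scoped to a recent subgraph. The `ingested_at` attribute name is an assumption for illustration:

```python
from datetime import datetime, timedelta, timezone

import networkx as nx

def recent_subgraph(graph: nx.Graph, window: timedelta) -> nx.Graph:
    """Keep only relationships whose hypothetical `ingested_at` annotation
    falls inside the window, e.g. edges extracted from the last day's news."""
    cutoff = datetime.now(timezone.utc) - window
    epoch = datetime.min.replace(tzinfo=timezone.utc)  # default for untagged edges
    recent = [
        (u, v)
        for u, v, data in graph.edges(data=True)
        if data.get("ingested_at", epoch) >= cutoff
    ]
    return graph.edge_subgraph(recent).copy()

# "What happened with person x in the last 24 hours?" could then operate on:
#   recent_subgraph(g, timedelta(hours=24))
```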
Some desired types of editing users have described in other issues:
- Adding new documents
- Removing old documents
- Editing the graph itself
Scope
For now we are going to limit the scope of this feature to just an incremental index update to append content, and not worry about removal, manual graph editing, or the metadata tagging that would be required to do delta-style queries.
Approach
Adding a little more detail here on the approach we've discussed. It largely echoes the ideas above, but I'll repeat them for clarity:
- We will create a new `graphrag.append` command to run updates that add content. The reason for a new command is so that the original `graphrag.index` command is predictable in its behavior, i.e., users know that communities will always be recomputed, so they don't have to worry about model drift.
- The append command will try to minimize community recomputes so that summarization is not performed again. If certain thresholds are met, a recompute may be required, so the worst case degrades to the same performance as a normal indexing run.
- The first efficiency optimization will be to attempt to place all new entities into an existing community rather than re-running Leiden and triggering updates for everything.
- We will only run summarization on those communities whose membership has changed, i.e., communities with new entity members should be re-summarized so their summaries account for the new content.
- We will establish user-configurable thresholds to determine when Leiden must be re-run, such as the number of new entities that don't find an existing community, or possibly a measure of the modularity change of the graph (TBD). See the sketch after this list.
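To show how those thresholds might fit together, here is a sketch that reuses the hypothetical `place_new_entities` output from the sketch above. The threshold names and default values are placeholders, and networkx's modularity function stands in for whatever measure we would actually adopt:

```python
import networkx as nx
from networkx.algorithms.community import modularity

# Placeholder defaults; these would be the user-configurable settings.
MAX_UNPLACED_ENTITIES = 50
MAX_MODULARITY_DROP = 0.05

def needs_full_recompute(
    graph: nx.Graph,
    communities: dict[str, int],
    unplaced: list[str],
    baseline_modularity: float,
) -> bool:
    # Threshold 1: too many new entities found no existing community.
    if len(unplaced) > MAX_UNPLACED_ENTITIES:
        return True
    # Threshold 2: incremental placement degraded partition quality too far.
    groups: dict[int, set[str]] = {}
    for node, comm in communities.items():
        groups.setdefault(comm, set()).add(node)
    placed = graph.subgraph(communities)  # modularity requires a full partition
    current = modularity(placed, groups.values())
    return (baseline_modularity - current) > MAX_MODULARITY_DROP
```

If either check trips, the append command would fall back to a full Leiden run plus re-summarization, which is the worst-case degradation to normal indexing performance described above.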