feat: Add Top-K and min co-occurrence filters to NLP edge extraction #2273
Open
nahisaho wants to merge 1 commit into microsoft:main
Conversation
The NLP-based graph extraction (`_extract_edges` in `build_noun_graph.py`) uses `itertools.combinations()` to create co-occurrence edges between all noun phrases in each text chunk. With entity-dense corpora (e.g. scientific text averaging ~60 entities per 800-token chunk), this produces C(60,2) = 1,770 pairs per chunk, leading to 65-70x more relationships than LLM-based extraction and paradoxically making the "fast/lazy" NLP mode more expensive than "standard" LLM mode.

This commit adds two configurable filters to `_extract_edges()`:

1. `max_entities_per_chunk` (default: 0 = disabled): when > 0, only the K most globally frequent entities per text chunk are paired, capping edges at C(K,2) instead of C(N,2). With K=15 on a 20-paper materials-science corpus, this reduced relationships from 120,287 to 2,660 (a 97.8% reduction) while improving query quality compared to Standard mode.
2. `min_co_occurrence` (default: 1 = no filtering): when > 1, edges appearing in fewer text units are discarded as likely coincidental co-occurrences. In testing, ~57.5% of edges appeared in only 1 chunk.

Both parameters are exposed through `settings.yaml` via `extract_graph_nlp.max_entities_per_chunk` and `extract_graph_nlp.min_co_occurrence`, with backward-compatible defaults that preserve existing behavior.

Includes:
- Config: `defaults.py`, `extract_graph_nlp_config.py`
- Pipeline: `extract_graph_nlp.py` workflow, `build_noun_graph.py`
- Tests: 15 unit tests for filtering logic
- Semversioner: patch change document

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
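The pair counts quoted above follow directly from the binomial coefficient, and can be checked in a few lines (the numbers below reproduce the C(60,2) and C(15,2) figures from the description):

```python
from itertools import combinations
from math import comb

# 60 entities in one chunk -> C(60, 2) unordered co-occurrence pairs
pairs = list(combinations(range(60), 2))
assert len(pairs) == comb(60, 2) == 1770

# Capping each chunk at its K=15 most frequent entities shrinks this to C(15, 2)
assert comb(15, 2) == 105  # roughly a 94% per-chunk reduction
```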
nahisaho pushed a commit to nahisaho/graphrag-hybrid-installer that referenced this pull request on Mar 8, 2026:
Features:
- Interactive installer for Microsoft GraphRAG with hybrid NLP extraction
- scispaCy + GiNZA + domain dictionary integration
- NLP edge optimization patch for `build_noun_graph.py` (Top-K + co-occurrence filter)
- Multi-provider support: OpenAI / Azure OpenAI / Ollama
- MCP Server for Claude Desktop / VS Code Copilot integration
- Domain dictionary builder for specialized corpora

The NLP edge optimization patch addresses the O(N²) relationship explosion in GraphRAG v3.0.6's Lazy/Fast mode, reducing relationships by 97.8% and costs by 89.6% while maintaining query quality.

See also: microsoft/graphrag#2273

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Summary
The NLP-based graph extraction in `build_noun_graph.py` uses `itertools.combinations()` to create co-occurrence edges between all noun phrases in each text chunk. With entity-dense corpora (e.g., scientific/technical text), this O(N²) all-pairs algorithm produces a massive number of edges that overwhelm downstream processing and paradoxically make the "fast/lazy" NLP mode more expensive than the LLM-based "standard" mode.

Problem
Root Cause
`_extract_edges()` calls `combinations(sorted(set(titles)), 2)` on every text unit. With scientific text averaging ~60 noun-phrase entities per 800-token chunk, this produces C(60,2) = 1,770 pairs per chunk.

Impact (measured on a 20-paper materials-science corpus, 153 chunks)
The NLP mode generates 65× more relationships than LLM extraction, causing the "fast" pipeline to be 55% more expensive than Standard despite using no LLM for entity extraction itself. The cost comes from downstream `community_reports` generation, which must summarize the inflated graph.

Why `prune_graph` does not fix this

The existing `prune_graph` workflow step applies PMI-based pruning, but:
- moderate pruning (`min_edge_weight_pct=40`) still leaves ~72K edges
- aggressive pruning (`min_edge_weight_pct=80`, `max_node_degree_std=2.0`) over-prunes to just 6 edges

The problem must be addressed at the source: during edge construction, not after.
Solution
This PR adds two configurable parameters to `_extract_edges()`:

1. `max_entities_per_chunk` (default: 0 = disabled)

When > 0, only the K most globally-frequent entities per text chunk are paired, capping edges at C(K,2) instead of C(N,2).
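The Top-K idea can be sketched as follows. This is not the PR's actual code, just an illustrative standalone function (`top_k_edges` is a hypothetical name): rank entities by global frequency across all chunks, keep only the K most frequent within each chunk, then pair.

```python
from collections import Counter
from itertools import combinations


def top_k_edges(chunks: list[list[str]], k: int) -> list[tuple[str, str]]:
    """Pair only the k globally most frequent entities in each chunk.

    Sketch of the max_entities_per_chunk filter; k <= 0 disables the
    cap, mirroring the PR's backward-compatible default of 0.
    """
    # Global frequency of each entity (counted once per chunk it appears in)
    freq = Counter(e for chunk in chunks for e in set(chunk))
    edges: list[tuple[str, str]] = []
    for chunk in chunks:
        entities = sorted(set(chunk))
        if k > 0 and len(entities) > k:
            # Keep the k most globally frequent, then restore sorted order
            entities = sorted(sorted(entities, key=lambda e: -freq[e])[:k])
        edges.extend(combinations(entities, 2))
    return edges
```

With `k=0` every chunk still produces all C(N,2) pairs, so existing behavior is unchanged; with a small `k`, per-chunk edge counts are capped at C(k,2).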
2. `min_co_occurrence` (default: 1 = no filtering)

When > 1, edges appearing in fewer text units are discarded as likely coincidental co-occurrences. In testing, ~57.5% of edges appeared in only 1 chunk.
Why these defaults?
Both defaults preserve exact backward compatibility: existing users see zero behavioral change. Users experiencing the edge explosion can opt in by setting these parameters in `settings.yaml`.

Results
Tested with 20 materials-science papers (153 text chunks), gpt-4o-mini + text-embedding-3-small:
With `max_entities_per_chunk=15`, `min_co_occurrence=2`

Top-K parameter sweep
K=15 emerged as the Pareto-optimal choice: the best query quality at the lowest cost, outperforming even Standard mode on a local search benchmark.
Changes
- `config/defaults.py`: add `max_entities_per_chunk=0` and `min_co_occurrence=1` to `ExtractGraphNLPDefaults`
- `config/models/extract_graph_nlp_config.py`: new fields on `ExtractGraphNLPConfig`
- `index/workflows/extract_graph_nlp.py`: pass the new parameters through to `build_noun_graph()`
- `index/operations/build_noun_graph/build_noun_graph.py`: filtering logic in `_extract_edges()`
- `tests/unit/config/utils.py`: config test utilities updated
- `tests/unit/indexing/operations/test_build_noun_graph.py`: 15 unit tests for the filtering logic
- `.semversioner/next-release/`: patch change document

Backward Compatibility
- `max_entities_per_chunk=0` disables Top-K; `min_co_occurrence=1` keeps all edges
- The existing test (`test_extract_graph_nlp`) continues to pass with 1,147 entities and 29,442 relationships
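Per the description, the parameters live under `extract_graph_nlp` in `settings.yaml`; opting in might look like the following (values taken from the benchmark configuration above, exact YAML layout assumed):

```yaml
extract_graph_nlp:
  max_entities_per_chunk: 15   # pair only the 15 most frequent entities per chunk (0 = disabled)
  min_co_occurrence: 2         # drop pairs seen in only one text unit (1 = keep all)
```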