feat: intra-file slicing, incremental safety, and cross-file stitch #1326
Closed
Const011 wants to merge 2 commits into
Add file_slice.py for heading-aware markdown splitting and adaptive retry bisection; wire it from llm.py without duplicating slice logic. Abort incremental extract when re-extraction fails instead of pruning into an empty subgraph; stitch references edges from changed files onto the existing graph so incremental updates stay discoverable. Includes OpenRouter/Gemma backend fixes and tests. Co-authored-by: Cursor <cursoragent@cursor.com>
Collaborator
Thanks @Const011 — there's genuinely good engineering here (the
Rebase onto current v8, fix the import, reconcile with #1344/#1350, and ideally split; happy to review the pieces.
Author
Closing to rebase onto current v8 and split into focused PRs per review. New PRs will reference this work.
This was referenced Jun 17, 2026
Motivation
Graphify's incremental `extract` path is the practical way to maintain a knowledge graph as a corpus grows file by file. In production use we hit three related failures: large documents silently overflowing the model context, incremental runs destroying existing graph data when re-extraction failed, and incremental updates producing isolated subgraphs that query tools could not reach from the rest of the graph.
This PR addresses all three without requiring a full corpus re-extract on every change.
Problem 1: Silent truncation on large documents
Symptom: Semantic extraction on large `.md` files appeared to "succeed" (exit 0) but returned empty or invalid JSON, especially under token pressure (local Ollama, tight budgets, or long docs).
Root cause: `_pack_chunks_by_tokens` treated each file as atomic. A single file larger than the chunk budget was sent whole; the model truncated output (`finish_reason="length"`) or returned unparseable JSON. `_parse_llm_json` then yielded `{"nodes": [], ...}` with no hard failure, so downstream merge proceeded as if extraction worked.
Fix: New module `graphify/file_slice.py`:
- Splits oversized text files (`.md`, `.mdx`, `.txt`, `.rst`) into `FileSlice` units at heading / paragraph boundaries before packing.
- Adaptive retry bisection (`split_unit_for_retry`, `split_chunk_for_retry`).
- `llm.py` only orchestrates: `expand_oversized_files()` → pack → extract; slice logic lives in `file_slice.py`.
This matches how full runs already group related files by directory for cross-file edges, but fixes the case where one file alone exceeds the budget.
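The slicing and retry-bisection idea can be illustrated with a minimal sketch. This is not the code in `graphify/file_slice.py`: the `Slice` class, `estimate_tokens` (a rough four-characters-per-token guess), `split_blocks`, `slice_file`, and `bisect_for_retry` are all hypothetical names chosen for illustration, and only ATX-style `#` headings are recognised.

```python
from __future__ import annotations

import re
from dataclasses import dataclass

HEADING_RE = re.compile(r"^#{1,6}\s")  # ATX markdown headings only (assumption)


@dataclass
class Slice:
    """Hypothetical stand-in for the PR's FileSlice unit."""
    path: str
    text: str


def estimate_tokens(text: str) -> int:
    # Crude assumption: roughly 4 characters per token.
    return max(1, len(text) // 4)


def split_blocks(text: str) -> list[str]:
    """Split text into blocks that begin at a heading or after a blank line."""
    blocks: list[str] = []
    current: list[str] = []
    for line in text.splitlines(keepends=True):
        starts_section = bool(HEADING_RE.match(line))
        after_blank = bool(current) and current[-1].strip() == ""
        if current and (starts_section or (after_blank and line.strip())):
            blocks.append("".join(current))
            current = []
        current.append(line)
    if current:
        blocks.append("".join(current))
    return blocks


def slice_file(path: str, text: str, budget: int) -> list[Slice]:
    """Greedily pack blocks into slices of at most `budget` estimated tokens.

    A single block larger than the budget is kept whole in this sketch.
    """
    slices: list[Slice] = []
    buf = ""
    for block in split_blocks(text):
        if buf and estimate_tokens(buf + block) > budget:
            slices.append(Slice(path, buf))
            buf = ""
        buf += block
    if buf:
        slices.append(Slice(path, buf))
    return slices


def bisect_for_retry(unit: Slice) -> list[Slice]:
    """Halve a slice at its middle block boundary for a retry."""
    blocks = split_blocks(unit.text)
    if len(blocks) < 2:
        return [unit]  # no clean boundary left to split on
    mid = len(blocks) // 2
    return [
        Slice(unit.path, "".join(blocks[:mid])),
        Slice(unit.path, "".join(blocks[mid:])),
    ]
```

Splitting only at block boundaries means the slices concatenate back to the original text, which keeps the retry path lossless. Headings inside fenced code blocks are not special-cased here; a real implementation would likely need to handle them.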
Problem 2: Incremental update silently shrank the graph (critical)
Symptom: After editing one markdown file and running incremental `extract`, nodes for that file disappeared entirely; unrelated files could survive but the changed region was gone. Exit code was still 0.
Root cause: Incremental merge always pruned changed files before inserting fresh extraction (`prune_sources = deleted + changed`). If the LLM chunk "completed" but produced zero nodes (invalid JSON, connection blip, truncation), merge still ran: old nodes pruned, nothing replaced.
Fix: In `__main__.py` (incremental mode only):
- Detect changed files that are missing from the fresh extraction (`paths_missing_from_extraction` in `build.py`).
- Do not write `graph.json` or update the manifest on failure; the existing graph is preserved.
- […] `prune_sources` (changed code files are still pruned; the AST is local and reliable).
This converts a silent data-loss bug into a loud, recoverable error.
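The guard described above can be sketched as a check that runs before any pruning. The names below (`IncrementalExtractError`, `missing_from_extraction`, `plan_prune`) and the node shape (a dict with a `source_file` key) are assumptions for illustration; the PR's actual helper is `paths_missing_from_extraction` in `build.py`, whose signature is not shown here.

```python
from __future__ import annotations


class IncrementalExtractError(RuntimeError):
    """Raised when a changed file came back from extraction with no nodes."""


def missing_from_extraction(changed: list[str], nodes: list[dict]) -> list[str]:
    """Return changed paths that no freshly extracted node claims as its source."""
    covered = {node.get("source_file") for node in nodes}
    return [path for path in changed if path not in covered]


def plan_prune(deleted: list[str], changed: list[str], nodes: list[dict]) -> list[str]:
    """Decide what to prune, refusing to proceed if any changed file is uncovered.

    Raising before the prune list is built means the caller never reaches
    the code that rewrites the graph or the manifest.
    """
    missing = missing_from_extraction(changed, nodes)
    if missing:
        raise IncrementalExtractError(
            "re-extraction produced no nodes for: " + ", ".join(sorted(missing))
        )
    return sorted(set(deleted) | set(changed))
```

A caller would catch `IncrementalExtractError`, report the affected paths, and exit non-zero without touching the existing graph on disk.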
Problem 3: Incremental updates were not discoverable (isolated subgraph)
Symptom: Re-extracting one changed doc produced entities and edges inside that file's chunk only. Mentions of symbols in other files (e.g. `calculateWorkingTimeFromTeamsCalendar` referencing a calendar doc already in the graph) did not create edges to existing nodes. BFS/`query`/`path` could not reach the updated region from the rest of the corpus, breaking the main "add files one at a time" workflow.
Root cause: By design, incremental semantic extraction sends only changed files to the LLM (unlike full runs that co-pack same-directory files). The model has no stable node IDs for the rest of the graph and often hallucinates paths instead of wiring references.
Fix: New module `graphify/stitch.py`, a deterministic post-merge stitch pass (incremental only):
- After `build_merge`, scan each changed file on disk for backtick identifiers and path-like markdown links.
- Add `references` edges from an anchor node in the changed file to external targets (`source` = referencer, `target` = referenced; this matches graphify's edge-direction rules and works with undirected BFS/`path`).
- […] `source_file` on fresh nodes, use `new_node_ids` from the extraction and exclude nodes correctly placed under a different on-disk file.
Why this approach (and not alternatives)?
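To make the trade-off concrete, the stitch pass described in the fix above could be sketched roughly as follows. This is an illustration, not the contents of `graphify/stitch.py`: the function names, the node dict shape (`id`, `label`, `source_file`), and the `relation` key on edges are assumptions, and link targets are compared verbatim rather than path-normalised.

```python
from __future__ import annotations

import re

BACKTICK_RE = re.compile(r"`([^`\n]+)`")           # inline code spans
MD_LINK_RE = re.compile(r"\[[^\]]*\]\(([^)\s#]+)")  # target of [text](target)


def find_mentions(text: str) -> tuple[set[str], set[str]]:
    """Collect backtick identifiers and markdown link targets from a file's text."""
    return set(BACKTICK_RE.findall(text)), set(MD_LINK_RE.findall(text))


def stitch_edges(changed_path: str, text: str, nodes: list[dict], anchor_id: str) -> list[dict]:
    """Build `references` edges from the anchor node to nodes in other files.

    `nodes` is assumed to be the existing graph's node list; nodes that
    belong to the changed file itself are skipped.
    """
    labels, links = find_mentions(text)
    edges: list[dict] = []
    seen: set[str] = set()
    for node in nodes:
        if node.get("source_file") == changed_path:
            continue  # only stitch outward, never within the changed file
        hit = node.get("label") in labels or node.get("source_file") in links
        if hit and node["id"] not in seen:
            seen.add(node["id"])
            edges.append(
                {"source": anchor_id, "target": node["id"], "relation": "references"}
            )
    return edges
```

Because the pass is plain string matching against labels and paths already in the graph, it needs no model call and gives the same result on every run.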
Stitch is deterministic, cheap, and testable: it reuses path-normalisation from `build.py` and label indexing patterns from `symbol_resolution.py`.
OpenRouter / OpenAI-compatible backends
The `openai` backend now respects:
- `OPENAI_BASE_URL` (e.g. `https://openrouter.ai/api/v1`); previously hardcoded to `api.openai.com`
- `OPENROUTER_API_KEY` as an accepted key alongside `OPENAI_API_KEY`
This makes OpenRouter usable without a separate backend entry. Related small fixes: `GOOGLE_BYOK` for Gemini, and skipping `reasoning_effort` on Gemma models that reject it (400 from Google API).
Files touched
- `graphify/file_slice.py`, `graphify/llm.py` (calls only), `tests/test_file_slice.py`
- `graphify/build.py`, `graphify/__main__.py`, `tests/test_build.py`
- `graphify/stitch.py`, `graphify/__main__.py`, `tests/test_stitch.py`
- `tests/test_chunking.py`
Test plan
- `pytest tests/test_file_slice.py tests/test_stitch.py tests/test_build.py tests/test_chunking.py` (60 tests)
- […] `references` (e.g. array-field doc → calendar symbols), verified with `graphify path`