feat: intra-file slicing, incremental safety, and cross-file stitch by Const011 · Pull Request #1326 · Graphify-Labs/graphify

Const011 · 2026-06-15T18:10:11Z

Motivation

Graphify's incremental extract path is the practical way to maintain a knowledge graph as a corpus grows file-by-file. In production use we hit three related failures: large documents silently overflowing the model context, incremental runs destroying existing graph data when re-extraction failed, and incremental updates producing isolated subgraphs that query tools could not reach from the rest of the graph.

This PR addresses all three without requiring a full corpus re-extract on every change.

Problem 1: Silent truncation on large documents

Symptom: Semantic extraction on large .md files appeared to "succeed" (exit 0) but returned empty or invalid JSON — especially under token pressure (local Ollama, tight budgets, or long docs).

Root cause: _pack_chunks_by_tokens treated each file as atomic. A single file larger than the chunk budget was sent whole; the model truncated output (finish_reason="length") or returned unparseable JSON. _parse_llm_json then yielded {"nodes": [], ...} with no hard failure, so downstream merge proceeded as if extraction worked.

Fix: New module graphify/file_slice.py

Splits oversized splittable text (.md, .mdx, .txt, .rst) into FileSlice units at heading / paragraph boundaries before packing.
Adaptive retry can bisect a slice when a chunk still overflows context (split_unit_for_retry, split_chunk_for_retry).
llm.py only orchestrates: expand_oversized_files() → pack → extract; slice logic lives in file_slice.py.

This matches how full runs already group related files by directory for cross-file edges — but fixes the case where one file alone exceeds the budget.
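A minimal sketch of the slicing idea, not the actual `file_slice.py` code. The helper names (`estimate_tokens`, `slice_file`, `bisect_slice`), the `FileSlice` fields, and the 4-characters-per-token heuristic are assumptions made for illustration. It cuts at markdown headings first, falls back to paragraph boundaries for oversized sections, and offers a line-boundary bisection for retries.

```python
import re
from dataclasses import dataclass

HEADING = re.compile(r"^(?=#{1,6} )", re.MULTILINE)  # zero-width: just before a heading
PARAGRAPH_END = re.compile(r"(?<=\n\n)")             # zero-width: just after a blank line


@dataclass
class FileSlice:
    path: str
    start: int  # character offset of this slice in the original file
    text: str


def estimate_tokens(text: str) -> int:
    # Assumption: roughly 4 characters per token; a real tokenizer would differ.
    return max(1, len(text) // 4)


def slice_file(path: str, text: str, budget: int) -> list[FileSlice]:
    """Split text into slices that fit `budget` estimated tokens where possible."""
    if estimate_tokens(text) <= budget:
        return [FileSlice(path, 0, text)]
    blocks: list[str] = []
    for section in HEADING.split(text):
        if not section:
            continue
        if estimate_tokens(section) <= budget:
            blocks.append(section)
        else:
            # Section alone is too large: fall back to paragraph boundaries.
            blocks.extend(p for p in PARAGRAPH_END.split(section) if p)
    # Greedily re-pack adjacent blocks up to the budget.
    packed: list[str] = []
    current = ""
    for block in blocks:
        if current and estimate_tokens(current + block) > budget:
            packed.append(current)
            current = ""
        current += block
    if current:
        packed.append(current)
    slices, offset = [], 0
    for chunk in packed:
        slices.append(FileSlice(path, offset, chunk))
        offset += len(chunk)
    return slices


def bisect_slice(s: FileSlice) -> list[FileSlice]:
    """Retry fallback: cut at the last line boundary before the midpoint."""
    mid = len(s.text) // 2
    cut = (s.text.rfind("\n", 0, mid) + 1) or mid
    if cut == 0:
        return [s]  # too short to split further
    return [
        FileSlice(s.path, s.start, s.text[:cut]),
        FileSlice(s.path, s.start + cut, s.text[cut:]),
    ]
```

Because both split patterns are zero-width, concatenating the slices reproduces the file exactly. A single paragraph larger than the budget still comes out as one oversized slice, which is the case the bisection fallback is meant to cover. This sketch would also treat a `# ` line inside a fenced code block as a heading.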

Problem 2: Incremental update silently shrank the graph (critical)

Symptom: After editing one markdown file and running incremental extract, nodes for that file disappeared entirely; unrelated files could survive but the changed region was gone. Exit code was still 0.

Root cause: Incremental merge always pruned changed files before inserting fresh extraction (prune_sources = deleted + changed). If the LLM chunk "completed" but produced zero nodes (invalid JSON, connection blip, truncation), merge still ran: old nodes pruned, nothing replaced.

Fix: In __main__.py (incremental mode only):

After semantic re-extract, abort with exit 1 if any uncached changed file has no nodes/edges in the fresh result, or if any chunk raised (paths_missing_from_extraction in build.py).
Do not write graph.json or update the manifest on failure — existing graph is preserved.
Only include successfully re-extracted semantic files in prune_sources (changed code files still pruned — AST is local and reliable).

This converts a silent data-loss bug into a loud, recoverable error.
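The guard can be sketched roughly as follows. This illustrates the rule described above and is not the code in `build.py` or `__main__.py`: `missing_from_extraction` is a stand-in for the check named `paths_missing_from_extraction` (whose real signature is not shown here), and `ExtractionIncomplete`, `plan_prune_sources`, and the `source_file` key on edges are assumptions.

```python
class ExtractionIncomplete(RuntimeError):
    """Raised so the caller can exit 1 before graph.json or the manifest is written."""


def missing_from_extraction(changed_paths, extraction):
    """Return changed paths that contributed no node or edge to the fresh result."""
    seen = {n.get("source_file") for n in extraction.get("nodes", [])}
    seen |= {e.get("source_file") for e in extraction.get("edges", [])}
    return sorted(p for p in changed_paths if p not in seen)


def plan_prune_sources(deleted, changed_code, changed_semantic, extraction, chunk_errors=0):
    """Decide which sources may be pruned, or abort if re-extraction was incomplete."""
    missing = missing_from_extraction(changed_semantic, extraction)
    if chunk_errors or missing:
        raise ExtractionIncomplete(
            f"{chunk_errors} chunk error(s); no fresh data for: {', '.join(missing) or 'n/a'}"
        )
    # Reaching this point means every changed semantic file was re-extracted,
    # so pruning its old nodes cannot leave a hole in the graph.
    return sorted(set(deleted) | set(changed_code) | set(changed_semantic))
```

The key design choice is ordering: the check runs before anything is pruned or written, so a failed run leaves the previous graph untouched and can simply be retried.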

Problem 3: Incremental updates were not discoverable (isolated subgraph)

Symptom: Re-extracting one changed doc produced entities and edges inside that file's chunk only. Mentions of symbols in other files (e.g. `calculateWorkingTimeFromTeamsCalendar` referencing a calendar doc already in the graph) did not create edges to existing nodes. BFS/query/path could not reach the updated region from the rest of the corpus — breaking the main "add files one at a time" workflow.

Root cause: By design, incremental semantic extraction sends only changed files to the LLM (unlike full runs that co-pack same-directory files). The model has no stable node IDs for the rest of the graph and often hallucinates paths instead of wiring references.

Fix: New module graphify/stitch.py — deterministic post-merge stitch pass (incremental only):

After build_merge, scan each changed file on disk for backtick identifiers and path-like markdown links.
Resolve symbols against the full merged graph (unique label match; skip ambiguous names — same conservatism as AST cross-file resolution).
Add references edges from an anchor node in the changed file to external targets (source = referencer, target = referenced — matches graphify's edge-direction rules and works with undirected BFS/path).
When the LLM mis-attributes source_file on fresh nodes, use new_node_ids from the extraction and exclude nodes correctly placed under a different on-disk file.
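The symbol-resolution step above can be sketched like this. It is a simplified illustration, not the contents of `stitch.py`: the function name `stitch_references`, the node keys `id` and `label`, and the edge key `relation` are assumptions, and markdown-link handling is left out.

```python
import re

# Backtick-quoted identifiers such as `calculateWorkingTimeFromTeamsCalendar`.
BACKTICK_IDENT = re.compile(r"`([A-Za-z_][\w.]*)`")


def stitch_references(text, anchor_id, nodes, own_ids):
    """Return `references` edges from anchor_id to uniquely labelled external nodes."""
    by_label = {}
    for node in nodes:
        by_label.setdefault(node["label"], []).append(node["id"])

    edges, seen = [], set()
    for name in BACKTICK_IDENT.findall(text):
        targets = by_label.get(name, [])
        if len(targets) != 1:
            continue  # unknown or ambiguous label: skip rather than guess
        target = targets[0]
        if target in own_ids or target in seen:
            continue  # not external, or already linked
        seen.add(target)
        # source = referencer, target = referenced
        edges.append({"source": anchor_id, "target": target, "relation": "references"})
    return edges
```

Skipping every label that maps to more than one node trades recall for precision: a missed edge can be recovered by a later full re-extract, whereas a wrong edge would silently corrupt path queries.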

Why this approach (and not alternatives)?

| Alternative | Why we did not rely on it alone |
| --- | --- |
| Full corpus re-extract | Correct but expensive; defeats incremental purpose |
| Pack neighbor files into LLM context | Higher token cost; still no guaranteed stable IDs |
| LLM-only "wire edges" pass | Extra cost + variance; we already know explicit mentions from source text |
| Bidirectional / rewrite unchanged files | Out of scope; outbound stitch + undirected search is enough for discoverability |

Stitch is deterministic, cheap, and testable — it reuses path-normalisation from build.py and label indexing patterns from symbol_resolution.py.

OpenRouter / OpenAI-compatible backends

The openai backend now respects:

OPENAI_BASE_URL (e.g. https://openrouter.ai/api/v1) — previously hardcoded to api.openai.com
OPENROUTER_API_KEY as an accepted key alongside OPENAI_API_KEY

This makes OpenRouter usable without a separate backend entry. Related small fixes: GOOGLE_BYOK for Gemini, and skip reasoning_effort on Gemma models that reject it (400 from Google API).
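For illustration, the environment lookup might look like the sketch below. The function name `resolve_openai_settings` and the precedence of `OPENAI_API_KEY` over `OPENROUTER_API_KEY` are assumptions; only the two variable names and the OpenRouter URL come from the description above.

```python
import os

DEFAULT_BASE_URL = "https://api.openai.com/v1"


def resolve_openai_settings(env=None):
    """Return (base_url, api_key) for an OpenAI-compatible endpoint."""
    env = os.environ if env is None else env
    # Fall back to the stock OpenAI endpoint when no override is set.
    base_url = (env.get("OPENAI_BASE_URL") or DEFAULT_BASE_URL).rstrip("/")
    # Assumed precedence: an explicit OpenAI key wins over an OpenRouter key.
    api_key = env.get("OPENAI_API_KEY") or env.get("OPENROUTER_API_KEY")
    if not api_key:
        raise RuntimeError("set OPENAI_API_KEY or OPENROUTER_API_KEY")
    return base_url, api_key
```

With `OPENAI_BASE_URL=https://openrouter.ai/api/v1` and only `OPENROUTER_API_KEY` set, this returns the OpenRouter endpoint and key, so no separate backend entry is needed.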

Files touched

| Area | Files |
| --- | --- |
| Intra-file slicing | `graphify/file_slice.py`, `graphify/llm.py` (calls only), `tests/test_file_slice.py` |
| Incremental safety | `graphify/build.py`, `graphify/__main__.py`, `tests/test_build.py` |
| Cross-file stitch | `graphify/stitch.py`, `graphify/__main__.py`, `tests/test_stitch.py` |
| Chunking regression | `tests/test_chunking.py` |

Test plan

pytest tests/test_file_slice.py tests/test_stitch.py tests/test_build.py tests/test_chunking.py (60 tests)
Incremental smoke (8 markdown files, Gemma via Google API): failed re-extract aborts without data loss; successful re-extract + stitch adds cross-file references (e.g. array-field doc → calendar symbols), verified with graphify path
Upstream CI on Linux

Add file_slice.py for heading-aware markdown splitting and adaptive retry bisection; wire it from llm.py without duplicating slice logic. Abort incremental extract when re-extraction fails instead of pruning into an empty subgraph; stitch references edges from changed files onto the existing graph so incremental updates stay discoverable. Includes OpenRouter/Gemma backend fixes and tests.

Co-authored-by: Cursor <cursoragent@cursor.com>

safishamsi · 2026-06-17T10:08:09Z

Thanks @Const011 — there's genuinely good engineering here (the file_slice module + adaptive-retry refactor and the conservative stitch pass are nicely done). But it can't merge in its current form:

Its own test suite is red. The v8 merge in this branch botched the tests/test_build.py conflict and dropped dedupe_edges, dedupe_nodes from the import line → 3 NameError failures. (The functions exist in build.py; just the import is broken.)
Stale / conflicting with merged work. It branched before #1317 ("update edge count is non-deterministic across build modes (multi-edges not collapsed)") and now conflicts in __main__.py with #1344 ("fix(build_merge): replace re-extracted files instead of accumulating stale edges", where build_merge auto-prunes re-extracted files) and #1350 ("fix: harden incremental no-cluster updates", the no-cluster no-op short-circuit). The incremental-safety prune_sources model needs to be reconciled with #1344's auto-replace, not stacked on top of it.
Semantics change. It reroutes the --no-cluster incremental path through build_merge/NetworkX, which collapses parallel edges — changing the documented no-cluster contract. Please make that explicit/gated.
Scope. Three independent features + an unrelated OpenRouter/BYOK backend change in one +1180 PR. Strongly recommend splitting: (A) file-slice + retry refactor + backend, (B) incremental-safety abort, (C) cross-file stitch (the novel piece deserves isolated review).

Rebase onto current v8, fix the import, reconcile with #1344/#1350, and ideally split — happy to review the pieces.

Const011 · 2026-06-17T18:03:06Z

Closing to rebase onto current v8 and split into focused PRs per review. New PRs will reference this work.

Const011 and others added 2 commits June 15, 2026 22:09

Merge branch 'v8' into feat/incremental-stitch-file-slice

a50ea6b

Const011 closed this Jun 17, 2026

This was referenced Jun 17, 2026

feat: intra-file markdown slicing and adaptive retry (1/3) #1369

Closed

feat: abort incremental update on failed or empty re-extraction (2/3) #1370

Closed

feat: cross-file reference stitch for incremental updates (3/3) #1371

Closed

Const011 wants to merge 2 commits into Graphify-Labs:v8 from Const011:feat/incremental-stitch-file-slice
