feat: intra-file slicing, incremental safety, and cross-file stitch #1326

Closed
Const011 wants to merge 2 commits into Graphify-Labs:v8 from Const011:feat/incremental-stitch-file-slice

Conversation

@Const011 commented Jun 15, 2026


Motivation

Graphify's incremental extract path is the practical way to maintain a knowledge graph as a corpus grows file-by-file. In production use we hit three related failures: large documents silently overflowing the model context, incremental runs destroying existing graph data when re-extraction failed, and incremental updates producing isolated subgraphs that query tools could not reach from the rest of the graph.

This PR addresses all three without requiring a full corpus re-extract on every change.


Problem 1: Silent truncation on large documents

Symptom: Semantic extraction on large .md files appeared to "succeed" (exit 0) but returned empty or invalid JSON — especially under token pressure (local Ollama, tight budgets, or long docs).

Root cause: _pack_chunks_by_tokens treated each file as atomic. A single file larger than the chunk budget was sent whole; the model truncated output (finish_reason="length") or returned unparseable JSON. _parse_llm_json then yielded {"nodes": [], ...} with no hard failure, so downstream merge proceeded as if extraction worked.

Fix: New module graphify/file_slice.py

  • Splits oversized splittable text (.md, .mdx, .txt, .rst) into FileSlice units at heading / paragraph boundaries before packing.
  • Adaptive retry can bisect a slice when a chunk still overflows context (split_unit_for_retry, split_chunk_for_retry).
  • llm.py only orchestrates: expand_oversized_files() → pack → extract; slice logic lives in file_slice.py.

This matches how full runs already group related files by directory for cross-file edges — but fixes the case where one file alone exceeds the budget.
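
A minimal sketch of the slicing idea, not the code in graphify/file_slice.py: blocks are packed greedily up to a token budget, a trailing heading is carried into the next slice so it stays with its body, and a slice can be bisected at a paragraph boundary for retry. The `FileSlice` fields, `estimate_tokens` (a crude 4-characters-per-token guess), `slice_text`, and `bisect_slice` are all assumptions made for illustration.

```python
import re
from dataclasses import dataclass


@dataclass
class FileSlice:
    # Hypothetical shape; the real FileSlice fields are not shown in this PR text.
    path: str
    text: str


def estimate_tokens(text: str) -> int:
    # Assumption: roughly 4 characters per token.
    return max(1, len(text) // 4)


def slice_text(path: str, text: str, budget_tokens: int) -> list[FileSlice]:
    """Pack paragraph blocks into slices that stay within budget_tokens."""
    blocks = [b.strip() for b in re.split(r"\n\s*\n", text) if b.strip()]
    groups: list[list[str]] = []
    current: list[str] = []
    for block in blocks:
        candidate = "\n\n".join(current + [block])
        if current and estimate_tokens(candidate) > budget_tokens:
            # Carry a trailing heading forward so it is not separated from its body.
            carry = []
            if len(current) > 1 and current[-1].startswith("#"):
                carry = [current.pop()]
            groups.append(current)
            current = carry
        current.append(block)
    if current:
        groups.append(current)
    return [FileSlice(path, "\n\n".join(g)) for g in groups]


def bisect_slice(s: FileSlice) -> list[FileSlice]:
    """Halve a slice at a paragraph boundary; return it unchanged if it has one block."""
    blocks = s.text.split("\n\n")
    if len(blocks) < 2:
        return [s]
    mid = len(blocks) // 2
    return [
        FileSlice(s.path, "\n\n".join(blocks[:mid])),
        FileSlice(s.path, "\n\n".join(blocks[mid:])),
    ]
```

A single block that is itself larger than the budget still comes out as one oversized slice in this sketch; handling that case is where a retry-time bisection (or a finer split) would have to take over.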


Problem 2: Incremental update silently shrank the graph (critical)

Symptom: After editing one markdown file and running incremental extract, nodes for that file disappeared entirely; unrelated files could survive but the changed region was gone. Exit code was still 0.

Root cause: Incremental merge always pruned changed files before inserting fresh extraction (prune_sources = deleted + changed). If the LLM chunk "completed" but produced zero nodes (invalid JSON, connection blip, truncation), merge still ran: old nodes pruned, nothing replaced.

Fix: In __main__.py (incremental mode only):

  1. After semantic re-extract, abort with exit 1 if any uncached changed file has no nodes/edges in the fresh result, or if any chunk raised (paths_missing_from_extraction in build.py).
  2. Do not write graph.json or update the manifest on failure — existing graph is preserved.
  3. Only include successfully re-extracted semantic files in prune_sources (changed code files still pruned — AST is local and reliable).

This converts a silent data-loss bug into a loud, recoverable error.
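
The guard described above can be sketched roughly as follows. Only the name `paths_missing_from_extraction` comes from this PR; its signature, the `source_file` key on nodes and edges, the `ExtractionIncomplete` exception, and `plan_prune_sources` are assumptions for illustration.

```python
class ExtractionIncomplete(RuntimeError):
    """Raised when a changed file produced nothing; a caller would map this to exit 1."""


def paths_missing_from_extraction(changed_paths, extraction):
    # Assumed signature: return changed paths that contributed no nodes and no edges.
    seen = {n.get("source_file") for n in extraction.get("nodes", [])}
    seen |= {e.get("source_file") for e in extraction.get("edges", [])}
    return sorted(p for p in changed_paths if p not in seen)


def plan_prune_sources(deleted, changed_semantic, changed_code, extraction):
    """Decide what to prune, refusing to proceed if any semantic file came back empty."""
    missing = paths_missing_from_extraction(changed_semantic, extraction)
    if missing:
        # Abort before any write: the existing graph and manifest stay untouched.
        raise ExtractionIncomplete("no nodes/edges for: " + ", ".join(missing))
    # Code files are pruned unconditionally because AST extraction is local.
    return sorted(set(deleted) | set(changed_semantic) | set(changed_code))
```

The point of raising before computing the prune list is ordering: nothing that removes old nodes runs until the replacement data is known to exist.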


Problem 3: Incremental updates were not discoverable (isolated subgraph)

Symptom: Re-extracting one changed doc produced entities and edges inside that file's chunk only. Mentions of symbols in other files (e.g. `calculateWorkingTimeFromTeamsCalendar` referencing a calendar doc already in the graph) did not create edges to existing nodes. BFS/query/path could not reach the updated region from the rest of the corpus — breaking the main "add files one at a time" workflow.

Root cause: By design, incremental semantic extraction sends only changed files to the LLM (unlike full runs that co-pack same-directory files). The model has no stable node IDs for the rest of the graph and often hallucinates paths instead of wiring references.

Fix: New module graphify/stitch.py — deterministic post-merge stitch pass (incremental only):

  • After build_merge, scan each changed file on disk for backtick identifiers and path-like markdown links.
  • Resolve symbols against the full merged graph (unique label match; skip ambiguous names — same conservatism as AST cross-file resolution).
  • Add references edges from an anchor node in the changed file to external targets (source = referencer, target = referenced — matches graphify's edge-direction rules and works with undirected BFS/path).
  • When the LLM mis-attributes source_file on fresh nodes, use new_node_ids from the extraction and exclude nodes correctly placed under a different on-disk file.
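
As a rough illustration of the resolution rule in the list above (not the code in graphify/stitch.py), the sketch below scans text for backtick identifiers only, resolves each against a label index, and emits a references edge when exactly one node outside the changed file carries that label. The node keys `id`, `label`, `source_file`, the edge key `relation`, and the function name `stitch_references` are assumptions; markdown-link handling and the `new_node_ids` correction are omitted.

```python
import re
from collections import defaultdict

# Matches identifiers wrapped in single backticks, e.g. `fooBar` or `pkg.mod`.
BACKTICK_IDENT = re.compile(r"`([A-Za-z_][\w.]*)`")


def stitch_references(text, anchor_id, changed_file, nodes):
    """Return references edges from anchor_id to uniquely-labelled external nodes."""
    by_label = defaultdict(list)
    for node in nodes:
        by_label[node["label"]].append(node)

    edges, linked = [], set()
    for name in BACKTICK_IDENT.findall(text):
        matches = by_label.get(name, [])
        if len(matches) != 1:
            continue  # unknown or ambiguous label: skip rather than guess
        target = matches[0]
        if target["source_file"] == changed_file or target["id"] in linked:
            continue  # only link outward, and only once per target
        linked.add(target["id"])
        edges.append({"source": anchor_id, "target": target["id"], "relation": "references"})
    return edges
```

Skipping ambiguous labels trades recall for precision: a missed edge leaves the graph as it was, while a wrong edge would mislead path queries.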

Why this approach (and not alternatives)?

| Alternative | Why we did not rely on it alone |
| --- | --- |
| Full corpus re-extract | Correct but expensive; defeats incremental purpose |
| Pack neighbor files into LLM context | Higher token cost; still no guaranteed stable IDs |
| LLM-only "wire edges" pass | Extra cost + variance; we already know explicit mentions from source text |
| Bidirectional / rewrite unchanged files | Out of scope; outbound stitch + undirected search is enough for discoverability |
Stitch is deterministic, cheap, and testable — it reuses path-normalisation from build.py and label indexing patterns from symbol_resolution.py.


OpenRouter / OpenAI-compatible backends

The openai backend now respects:

  • OPENAI_BASE_URL (e.g. https://openrouter.ai/api/v1) — previously hardcoded to api.openai.com
  • OPENROUTER_API_KEY as an accepted key alongside OPENAI_API_KEY

This makes OpenRouter usable without a separate backend entry. Related small fixes: GOOGLE_BYOK for Gemini, and skip reasoning_effort on Gemma models that reject it (400 from Google API).
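
A hedged sketch of how such environment handling might look; it is not the code in llm.py. The default URL with a `/v1` suffix, the precedence of `OPENAI_API_KEY` over `OPENROUTER_API_KEY`, the substring check for Gemma models, and both function names are assumptions.

```python
import os
from typing import Mapping, Optional

# Assumed default; the PR only says the host was previously hardcoded to api.openai.com.
DEFAULT_BASE_URL = "https://api.openai.com/v1"


def resolve_openai_backend(env: Optional[Mapping[str, str]] = None) -> tuple[str, str]:
    """Return (base_url, api_key) for an OpenAI-compatible endpoint."""
    env = os.environ if env is None else env
    base_url = env.get("OPENAI_BASE_URL") or DEFAULT_BASE_URL
    # Assumed precedence: an explicit OpenAI key wins over the OpenRouter key.
    api_key = env.get("OPENAI_API_KEY") or env.get("OPENROUTER_API_KEY")
    if not api_key:
        raise RuntimeError("set OPENAI_API_KEY or OPENROUTER_API_KEY")
    return base_url.rstrip("/"), api_key


def request_kwargs(model: str, reasoning_effort: Optional[str] = None) -> dict:
    """Build request arguments, omitting reasoning_effort for Gemma models."""
    kwargs = {"model": model}
    # Assumption: Gemma models are recognisable by name.
    if reasoning_effort and "gemma" not in model.lower():
        kwargs["reasoning_effort"] = reasoning_effort
    return kwargs
```

Passing `env` explicitly keeps the resolution testable without mutating the process environment.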


Files touched

| Area | Files |
| --- | --- |
| Intra-file slicing | graphify/file_slice.py, graphify/llm.py (calls only), tests/test_file_slice.py |
| Incremental safety | graphify/build.py, graphify/__main__.py, tests/test_build.py |
| Cross-file stitch | graphify/stitch.py, graphify/__main__.py, tests/test_stitch.py |
| Chunking regression | tests/test_chunking.py |

Test plan

  • pytest tests/test_file_slice.py tests/test_stitch.py tests/test_build.py tests/test_chunking.py (60 tests)
  • Incremental smoke (8 markdown files, Gemma via Google API): failed re-extract aborts without data loss; successful re-extract + stitch adds cross-file references (e.g. array-field doc → calendar symbols), verified with graphify path
  • Upstream CI on Linux

Const011 and others added 2 commits June 15, 2026 22:09
Add file_slice.py for heading-aware markdown splitting and adaptive retry
bisection; wire it from llm.py without duplicating slice logic. Abort
incremental extract when re-extraction fails instead of pruning into an empty
subgraph; stitch references edges from changed files onto the existing graph
so incremental updates stay discoverable. Includes OpenRouter/Gemma backend
fixes and tests.

Co-authored-by: Cursor <cursoragent@cursor.com>
@safishamsi

Collaborator

Thanks @Const011 — there's genuinely good engineering here (the file_slice module + adaptive-retry refactor and the conservative stitch pass are nicely done). But it can't merge in its current form:

  1. Its own test suite is red. The v8 merge in this branch botched the tests/test_build.py conflict and dropped dedupe_edges, dedupe_nodes from the import line → 3 NameError failures. (The functions exist in build.py; just the import is broken.)
  2. Stale / conflicting with merged work. It branched before #1317 ("update edge count is non-deterministic across build modes (multi-edges not collapsed)") and now conflicts with #1344 ("fix(build_merge): replace re-extracted files instead of accumulating stale edges"; build_merge auto-prunes re-extracted files) and #1350 ("fix: harden incremental no-cluster updates"; no-cluster no-op short-circuit) in __main__.py. The incremental-safety prune_sources model needs to be reconciled with #1344's auto-replace, not stacked on top of it.
  3. Semantics change. It reroutes the --no-cluster incremental path through build_merge/NetworkX, which collapses parallel edges — changing the documented no-cluster contract. Please make that explicit/gated.
  4. Scope. Three independent features + an unrelated OpenRouter/BYOK backend change in one +1180 PR. Strongly recommend splitting: (A) file-slice + retry refactor + backend, (B) incremental-safety abort, (C) cross-file stitch (the novel piece deserves isolated review).

Rebase onto current v8, fix the import, reconcile with #1344/#1350, and ideally split — happy to review the pieces.

@Const011

Author

Closing to rebase onto current v8 and split into focused PRs per review. New PRs will reference this work.
