Silent data loss: doc files with the same name in different directories produce colliding node IDs across extraction chunks #1504

Description

@sub4biz

graphify version: 0.8.50
Affected invocation: graphify extract . on any multi-directory corpus large enough to split into multiple chunks
Severity: High — data is silently lost with no warning, error, or log entry


Summary

When a corpus is large enough to split into multiple LLM extraction chunks, doc files with the same filename in different subdirectories produce identical node IDs from separate chunk runs. deduplicate_entities() silently discards one of the colliding nodes (first-writer-wins), so one directory's data is dropped without any indication. The user sees a complete-looking graph with fewer nodes than expected and no error message.


Minimal reproduction

my-corpus/
├── module-a/
│   └── docs/
│       └── README.md   ← "Booking Model — confirm, refund operations"
└── module-b/
    └── docs/
        └── README.md   ← "Booking Model — cancel, reschedule operations"

Run on a corpus large enough to split chunks (see "Trigger threshold" below):

graphify extract .

Expected: graph contains nodes from both module-a and module-b.
Actual: graph contains nodes from only one directory; the other is silently dropped.

To force splitting with this small corpus:

graphify extract . --token-budget 400

Root cause (verified from source)

1. Stem rules produce non-unique IDs

The system prompt (llm.py, line ~390) instructs models:

stem = filename without extension

This means project-a/models/README.md and project-b/models/README.md both get the stem readme, so every ID derived from either file starts with readme_. The model then generates IDs like readme_confirm and readme_cancel: one shared namespace for two different files.

The system prompt also instructs models to produce lowercase-only IDs, so even if filenames differ only by case (README.md vs readme.md), both yield readme_ after the model applies the lowercase constraint.
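A minimal sketch of why the bare-stem rule cannot be unique, assuming the stem is simply the lowercased filename without its extension. The helper name `bare_stem` is hypothetical and is not part of graphify:

```python
from pathlib import PurePosixPath


def bare_stem(path: str) -> str:
    """Hypothetical helper: filename without extension, lowercased."""
    return PurePosixPath(path).stem.lower()


paths = [
    "project-a/models/README.md",
    "project-b/models/README.md",
    "project-b/models/readme.md",
]
stems = {p: bare_stem(p) for p in paths}

# All three paths map to the same stem, so IDs built as f"{stem}_{concept}"
# share one namespace regardless of the directory the file came from.
for path, stem in stems.items():
    print(f"{path} -> {stem}_")
```

The directory component never enters the stem, so nothing in the ID can tell the two files apart.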

2. deduplicate_entities() silently discards one of the colliding nodes

After all chunk results are merged into a flat list, the CLI calls:

# __main__.py line ~4737 — dedup=True is hardcoded
G = _build([merged], dedup=True, ...)

build() runs deduplicate_entities() before build_from_json(). Inside deduplicate_entities() (dedup.py, lines 222–228):

# Pre-deduplicate: keep first occurrence of each id
seen_ids: dict[str, dict] = {}
for node in nodes:
    nid = node.get("id", "")
    if nid and nid not in seen_ids:
        seen_ids[nid] = node
unique_nodes = list(seen_ids.values())

First-writer-wins: when two chunks produce the same ID, the node from whichever chunk appears first in the merged list survives; the other is silently dropped — no warning, no counter, no log entry. Which chunk comes first depends on the backend:

  • ollama and claude-cli backends: forced to max_concurrency=1 (sequential) — chunks merge in sorted(by_dir) order, so alphabetically earlier directories always win.
  • All other backends (deepseek, gemini, kimi, openai, …): ThreadPoolExecutor with as_completed() — chunks merge in API response-completion order, which is nondeterministic. Which project loses data can vary between runs of the same corpus.

build_from_json() then calls G.add_node() on the already-deduplicated list — it does not see duplicates at this point for the collision case. (The build() docstring's note about "last extraction's attributes win" refers to the separate intentional AST-vs-semantic overwrite, not to cross-chunk ID collisions.)

Note on G.add_node() and data loss: G.add_node() with an existing ID does overwrite silently, but for the cross-chunk collision case the data loss happens one step earlier, in deduplicate_entities(). Both layers are silent.
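To make the effect concrete, here is a small sketch that wraps the first-occurrence loop quoted above in a function and feeds it two hand-built chunk results. The node dicts and the name `pre_dedup` are illustrative assumptions, not graphify output or API:

```python
def pre_dedup(nodes: list[dict]) -> list[dict]:
    """Same first-occurrence logic as the dedup.py excerpt quoted above."""
    seen_ids: dict[str, dict] = {}
    for node in nodes:
        nid = node.get("id", "")
        if nid and nid not in seen_ids:
            seen_ids[nid] = node
    return list(seen_ids.values())


# Hypothetical chunk outputs: two different files, one colliding ID.
chunk_a = [{"id": "readme_booking_record", "source_file": "project-a/README.md"}]
chunk_b = [{"id": "readme_booking_record", "source_file": "project-b/README.md"}]

# Sequential backends merge in sorted directory order: project-a survives.
sequential = pre_dedup(chunk_a + chunk_b)
# Thread-pool backends merge in completion order: if chunk_b finishes first,
# project-b survives instead.
reordered = pre_dedup(chunk_b + chunk_a)

print(sequential[0]["source_file"])
print(reordered[0]["source_file"])
```

In both orderings one node disappears and nothing is reported; only the identity of the loser changes.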

3. Chunking puts same-named files in separate chunks

_pack_chunks_by_tokens (llm.py, lines ~1454–1497) groups files by parent directory and packs greedily within the token budget. project-a/models/ and project-b/models/ are separate directory buckets, so on a large corpus they land in separate chunks and receive independent LLM calls with no shared context.

4. In-chunk disambiguation disappears in multi-chunk

When a model processes a single chunk containing both project-a/models/README.md and project-b/models/README.md, it can see the duplicate paths and self-disambiguate (e.g., readme_md_pending_b). Split across two chunks, each model call sees only one file, has no reason to disambiguate, and produces the same IDs. Single-chunk runs appear clean; multi-chunk runs silently lose data.

5. No post-generation validation — models violate the spec in both directions

The spec (extraction-spec.md) states explicitly:

"CRITICAL: never append chunk numbers, sequence numbers, or any suffix to an ID (no _c1, _c2, _chunk2, etc.)"

Yet models routinely add disambiguation suffixes (_b, _2, _) when they see duplicate file paths in one chunk. graphify never validates the returned IDs against this rule. The result is a dangerous inversion:

  • Single-chunk run: model sees both project-a/README.md and project-b/README.md simultaneously → adds a forbidden _b suffix to one → run appears clean
  • Multi-chunk run (the real scenario): model sees only one file → no duplicate visible → no suffix added → both files produce the same bare ID → silent overwrite

The "self-correction" a model applies in single-chunk is the exact behavior the spec forbids, and it disappears precisely when it would be needed. A test on a small corpus (one chunk) gives false confidence that the graph is correct.

Beyond suffix violations, models deviate from the stem rule in multiple other ways, all silently:

| Model | API model ID | Actual ID strategy | Spec-compliant? | Multi-chunk risk |
|---|---|---|---|---|
| DeepSeek V3 | deepseek-chat | {filename}_{extension}_* (includes extension) | No | Collide: same ext, different dirs |
| DeepSeek v4-flash | deepseek-v4-flash | Non-deterministic. Run 1: full path (project_a_*, safe); run 2: bare stem (readme_*, collides) | No | Unpredictable: same corpus, different runs, different outcome; a "good" graph is not reproducible |
| llama-4-scout-17b | meta-llama/llama-4-scout-17b-16e-instruct | Bare concept names only (confirm, cancel), no file context at all | No | Critical: any shared concept name → overwrite |

Label normalization deviation (observed in DeepSeek V3 multi-chunk run): models split CamelCase identifiers in labels even when the source text uses a single token. Observed: source text "BookingService" → node label "Booking Service" (space inserted at word boundary). The ID readme_booking_service is the same either way, but the label diverges from the original.

This creates a secondary failure mode that compounds the ID collision: _norm_label in dedup.py maps "BookingService" → "bookingservice" and "Booking Service" → "booking service", which are different keys. If two chunks each extract the same CamelCase concept but one produces "BookingService" and the other "Booking Service", the label-similarity passes in deduplicate_entities() will not merge them even if they survive the ID dedup step. Two fragmented nodes remain in the graph for what is semantically one entity.
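The divergence can be sketched as follows. `norm_label_simple` is only a stand-in for the behaviour described above (lowercase and trim); the real _norm_label may do more. `norm_label_camel_aware` is a hypothetical alternative that splits CamelCase before lowercasing, shown purely to illustrate one way the two spellings could be brought to the same key:

```python
import re


def norm_label_simple(label: str) -> str:
    """Stand-in for the described behaviour: lowercase and trim only."""
    return label.strip().lower()


def norm_label_camel_aware(label: str) -> str:
    """Hypothetical alternative: insert a space at lower->Upper boundaries,
    collapse whitespace, then lowercase."""
    spaced = re.sub(r"(?<=[a-z0-9])(?=[A-Z])", " ", label.strip())
    return re.sub(r"\s+", " ", spaced).lower()


for label in ("BookingService", "Booking Service"):
    print(repr(label), "->", repr(norm_label_simple(label)),
          "|", repr(norm_label_camel_aware(label)))
```

With the simple normaliser the two labels yield different keys; with the CamelCase-aware variant both yield "booking service". Whether such a change is appropriate for graphify's other label passes is a separate question.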

graphify has no mechanism to detect or reject any of these deviations. There is no ID format check, no cross-chunk deduplication guard, no label format validation, and no warning when a node is silently dropped.


Trigger threshold

The default --token-budget is 60,000 tokens. Splitting occurs when the corpus exceeds roughly:

| File type | Processing | Est. tokens/unit | Units to fill one chunk |
|---|---|---|---|
| Small README (~500 chars) | read as-is | ~165 | ~363 |
| Typical README (~2 000 chars) | read as-is | ~540 | ~111 |
| Large .md/.txt/.rst (SKILL.md, AGENTS.md, wiki, ~10K chars) | sliced into FileSlice units ≤ 20K chars | ~2 540 per slice | ~23 |
| Any .md/.txt/.rst ≥ 20 000 chars (capped at _FILE_CHAR_CAP) | each slice = 1 unit | ~5 040 per slice | ~11 |
| PDF (any size) | text extracted, then truncated, not sliced | ~5 040 (cap) | ~11 |
| .docx / .xlsx | converted to .md sidecar first, then sliced | ~5 040 per slice | ~11 |
| .html, .yaml, .yml | read as-is, truncated, not sliced | ~5 040 (cap) | ~11 |
| Images (.png, .jpg, .webp, .gif) | vision or text ref | 1 600 (_IMAGE_TOKEN_ESTIMATE) | ~21 (hard-capped at _MAX_IMAGES_PER_CHUNK = 20) |

Constants from source: _FILE_CHAR_CAP = 20_000, _CHARS_PER_TOKEN = 4, _PER_FILE_OVERHEAD_CHARS = 160, _IMAGE_TOKEN_ESTIMATE = 1_600, _MAX_IMAGES_PER_CHUNK = 20.
Sliceable types (file_slice.py): .md, .mdx, .markdown, .txt, .rst — all others are truncated.
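The text-file rows of the table can be reproduced with a few lines of arithmetic. This is a sketch of the chars/4 fallback estimate only, assuming the per-unit cost is (capped chars + per-file overhead) / chars per token, which is inferred from the table figures rather than copied from the source. The function names and `DEFAULT_TOKEN_BUDGET` are hypothetical; actual runs use the tiktoken-based estimate and will differ somewhat:

```python
_FILE_CHAR_CAP = 20_000
_CHARS_PER_TOKEN = 4
_PER_FILE_OVERHEAD_CHARS = 160
DEFAULT_TOKEN_BUDGET = 60_000  # hypothetical name for the --token-budget default


def estimate_tokens(chars: int) -> int:
    """Fallback estimate: capped content plus per-file overhead, in tokens."""
    capped = min(chars, _FILE_CHAR_CAP)
    return (capped + _PER_FILE_OVERHEAD_CHARS) // _CHARS_PER_TOKEN


def units_per_chunk(chars: int, budget: int = DEFAULT_TOKEN_BUDGET) -> int:
    """How many units of this size fit into one chunk."""
    return budget // estimate_tokens(chars)


for chars in (500, 2_000, 10_000, 20_000):
    print(chars, estimate_tokens(chars), units_per_chunk(chars))
```

One more unit than the last column is what forces the split, which is why the trigger scenarios listed later start at 12 PDFs or 12 large .md files.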

Note on large .md files: A single 100 000-char file (a large wiki, SKILL.md, API reference) produces 5 slices × ~5 040 tokens = ~25 200 tokens — consuming ~42% of the default budget before any other file is processed. Two such files from sibling directories will almost certainly split into separate chunks.

Note on PDFs: Large PDFs are truncated after the first ~20 000 characters of extracted text — the rest is silently dropped regardless of page count. Additionally, a PDF with an on-disk size > 50 MiB is silently skipped entirely (extract_pdf_text returns "").

Note on sliced files and intra-file collisions: A single large .md file sliced into FileSlice units can also collide with itself if its slices land in separate chunks: each chunk independently generates IDs for concepts that appear in multiple sections (e.g., a concept referenced in the intro and again in the conclusion), and only one of the resulting nodes survives. With dedup=True the later node is dropped by the ID pre-dedup in deduplicate_entities(), so its attributes never reach G.add_node().

This is not just a "huge monorepo" problem. Realistic trigger scenarios:

  • 12+ PDFs in different subdirectories (tech specs, academic papers, design documents)
  • 12+ large .md files (SKILL.md, AGENTS.md, wiki articles, API references with ≥ 20K chars)
  • 2 wiki-style index .md files of 100K chars each: together they create ~10 slices (~50 400 tokens, about 84% of the default budget), nearly filling one chunk and forcing a split with any subsequent file
  • 21+ images anywhere in the corpus — _MAX_IMAGES_PER_CHUNK = 20 forces a split regardless of token budget
  • Any corpus with reduced --token-budget (e.g. --token-budget 4000 → threshold drops to ~8 typical docs)

Empirical verification — confirmed multi-chunk data loss

Two independent LLM runs on a purpose-built corpus confirmed actual data loss.

Reproduction corpus

Both README files share the same heading # Booking Service, the same BookingService entity, and the same Data Model concepts (ConfirmationCode, BookingRecord). All project-identifying text was removed from headings so models have no content-based hint to disambiguate.

.graphifyignore (place at corpus root):

verify.py
verify-out/
verify-out-*/
RUN.md

project-a/README.md:

# Booking Service

This document describes the core booking service.

## Overview

The BookingService handles all reservation workflows.
It coordinates with the PaymentGateway and the InventoryManager to complete orders.

## Core Operations

### CreateBooking
Initiates a new booking record. Validates availability via InventoryManager,
charges the customer via PaymentGateway, and returns a ConfirmationCode.

### ConfirmBooking
Transitions a booking from PENDING to CONFIRMED state. Sends confirmation
email via NotificationService. Updates the AuditLog.

### RefundPolicy
Full refunds are issued if cancellation occurs more than 72 hours before
the booking start time. The RefundProcessor handles all fund returns.
Partial refunds are calculated by the PricingEngine.

## Data Model

- BookingRecord: id, customer_id, status, created_at, updated_at
- ConfirmationCode: unique alphanumeric token, 8 characters
- BookingStatus: PENDING | CONFIRMED | CANCELLED | COMPLETED

## Error Handling

- BookingConflictError: raised when InventoryManager reports no availability
- PaymentFailureError: raised when PaymentGateway rejects the transaction
- InvalidStatusTransition: raised when state machine receives illegal input

## Dependencies

- InventoryManager: checks and reserves slots
- PaymentGateway: processes credit card and wallet transactions
- NotificationService: sends emails and push notifications
- PricingEngine: calculates base price and applicable discounts
- AuditLog: append-only record of all booking state changes

project-a/INTERNAL.md:

# Booking Service — Internal Notes

Short internal reference for the team.

## Key contacts

- Lead: Alice Chen (booking logic, payment integration)
- Backend: Bob Torres (infrastructure)

## Known quirks

- InventoryManager has a 500ms cache TTL — stale reads possible under high load
- PricingEngine rounds down to nearest cent — audit carefully

## AlphaSpecificConcept

This concept is unique to Project Alpha and should appear only in the Alpha graph.
It must not be lost or merged with Project Beta data.

project-b/README.md:

# Booking Service

This document describes the core booking service.

## Overview

The BookingService handles all reservation workflows.
It coordinates with the ReservationEngine and the CapacityPlanner to complete orders.

## Core Operations

### CreateBooking
Opens a new booking slot. Verifies capacity via CapacityPlanner,
charges the customer via BillingModule, and returns a ConfirmationCode.

### CancelBooking
Transitions a booking from ACTIVE to CANCELLED state. Triggers refund
flow via RefundEngine. Notifies customer via MessageBroker. Updates EventLog.

### ReschedulePolicy
Bookings may be rescheduled up to 3 times without penalty. The
ReschedulingCoordinator validates new slot availability. A RescheduleFee
applies after the third rescheduling attempt.

## Data Model

- BookingRecord: id, tenant_id, state, opened_at, closed_at
- ConfirmationCode: unique alphanumeric token, 8 characters
- BookingState: ACTIVE | CANCELLED | RESCHEDULED | EXPIRED

## Error Handling

- CapacityExceededError: raised when CapacityPlanner finds no open slots
- BillingFailureError: raised when BillingModule cannot charge the tenant
- RescheduleLimit: raised when tenant exceeds maximum rescheduling count

## Dependencies

- CapacityPlanner: checks and locks available slots
- BillingModule: processes payments and invoices
- MessageBroker: delivers emails and SMS notifications
- ReschedulingCoordinator: orchestrates slot swaps
- EventLog: immutable event stream for all booking lifecycle changes

Both runs used --token-budget 400, producing 3 chunks (verified by run output and by running _pack_chunks_by_tokens directly with actual tiktoken estimates — _TOKENIZER is tiktoken cl100k_base, not the chars/4 fallback):

  • Chunk 1: project-a/README.md (334 tokens) — generates readme_* IDs
  • Chunk 2: project-a/INTERNAL.md (147 tokens) — generates internal_* IDs
  • Chunk 3: project-b/README.md (348 tokens) — generates readme_* IDs

At --token-budget 600, only 2 chunks are produced (334+147=481 ≤ 600, both project-a files together), but the collision between Chunks 1 and 3 (both named README.md) still occurs. Raising the budget delays but does not prevent the split once the corpus grows.

Chunks 1 and 3 each process a file named README.md in isolation, with no shared context → both produce identical readme_* namespaces independently.
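The chunk counts above can be checked with a simplified model of greedy packing. This is not the real _pack_chunks_by_tokens: the function name `pack_chunks` is hypothetical, and the sketch assumes directory buckets are visited in sorted order and that a new chunk starts whenever the next file would exceed the budget. The token counts are the ones reported above:

```python
from collections import defaultdict
from pathlib import PurePosixPath


def pack_chunks(file_tokens: dict[str, int], budget: int) -> list[list[str]]:
    """Simplified model: walk directory buckets in sorted order and start a
    new chunk whenever adding the next file would exceed the budget."""
    by_dir: dict[str, list[str]] = defaultdict(list)
    for path in file_tokens:
        by_dir[str(PurePosixPath(path).parent)].append(path)

    chunks: list[list[str]] = []
    current: list[str] = []
    used = 0
    for directory in sorted(by_dir):
        for path in by_dir[directory]:
            cost = file_tokens[path]
            if current and used + cost > budget:
                chunks.append(current)
                current, used = [], 0
            current.append(path)
            used += cost
    if current:
        chunks.append(current)
    return chunks


files = {
    "project-a/README.md": 334,
    "project-a/INTERNAL.md": 147,
    "project-b/README.md": 348,
}
for budget in (400, 600):
    print(budget, pack_chunks(files, budget))
```

At both budgets the two README.md files end up in different chunks, which is the condition the collision needs.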

Run 1 — meta-llama/llama-4-scout-17b-16e-instruct via Groq (budget=400)

export OLLAMA_BASE_URL="https://api.groq.com/openai/v1"
export OLLAMA_API_KEY="<GROQ_API_KEY>"
export OLLAMA_MODEL="meta-llama/llama-4-scout-17b-16e-instruct"
export GRAPHIFY_MAX_OUTPUT_TOKENS="8000"
graphify extract ./verify-collision --backend ollama --token-budget 400 --out ./verify-out-groq
graphify output: 3 chunks → Deduplicated 0 node(s) → 32 nodes

Node dump for project-b/README.md: 9 nodes (expected ~14–16).

| Shared concept (both READMEs) | In graph | Source file attributed |
|---|---|---|
| BookingService | MISSING | |
| CreateBooking | MISSING | |
| ConfirmationCode | MISSING | |
| BookingRecord | MISSING | |

graphify output: Deduplicated 0 node(s) — no label collisions, no counter, no warning. The silent drop is invisible in every log and output stream.

Run 2 — DeepSeek V3 (deepseek-chat) via DeepSeek API (budget=400)

export DEEPSEEK_API_KEY="<DEEPSEEK_API_KEY>"
export GRAPHIFY_DEEPSEEK_MODEL="deepseek-chat"
graphify extract ./verify-collision --backend deepseek --token-budget 400 --out ./verify-out-v3
graphify output: 3 chunks → Deduplicated 1 node(s) → 31 nodes

Deduplicated 1 node(s) comes from deduplicate_entities() label-similarity passes (dedup.py line 401), unrelated to the ID collision. The two InventoryManager nodes with different IDs and different source_file values are explicitly protected from cross-file label merging (dedup.py lines 364-368) and both survive in the graph. The actual merged pair is unknown. deepseek backend uses a thread pool (as_completed), so chunk completion order is nondeterministic — which version survives an ID collision depends on API response timing, not alphabetical directory order.

| Shared concept (both READMEs) | In graph | Source file attributed |
|---|---|---|
| ConfirmationCode | YES | project-b/README.md |
| BookingRecord | YES | project-b/README.md |
| CreateBooking | YES | project-a/README.md |

Both ConfirmationCode and BookingRecord exist in both project-a and project-b README content. Each chunk independently generated readme_confirmation_code / readme_booking_record. deduplicate_entities() kept one and silently dropped the other. The surviving source_file attribution is nondeterministic — it depends on which chunk's output arrived first in the merged list. No warning was emitted.

Both runs: zero indication of data loss in any graphify output, log, or counter.
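Since graphify reports nothing, the check has to be done on the resulting nodes. The sketch below shows one way to do that comparison, assuming the graph's nodes have already been loaded into a list of dicts with "label" and "source_file" keys; the loading step depends on the output format and is omitted. The function names are hypothetical, and the sample data merely mirrors the Run 2 table rather than being a real dump:

```python
from collections import defaultdict


def attribution_by_label(nodes: list[dict]) -> dict[str, set[str]]:
    """Map each node label to the set of source files attributed to it."""
    seen: dict[str, set[str]] = defaultdict(set)
    for node in nodes:
        label = node.get("label", "")
        if label:
            seen[label].add(node.get("source_file", ""))
    return dict(seen)


def missing_attributions(
    nodes: list[dict], expected: dict[str, set[str]]
) -> dict[str, set[str]]:
    """For each expected label, list the files that contributed no node."""
    actual = attribution_by_label(nodes)
    report: dict[str, set[str]] = {}
    for label, files in expected.items():
        missing = files - actual.get(label, set())
        if missing:
            report[label] = missing
    return report


# Illustrative data mirroring the Run 2 table above.
nodes = [
    {"label": "ConfirmationCode", "source_file": "project-b/README.md"},
    {"label": "BookingRecord", "source_file": "project-b/README.md"},
    {"label": "CreateBooking", "source_file": "project-a/README.md"},
]
both = {"project-a/README.md", "project-b/README.md"}
expected = {name: both for name in ("ConfirmationCode", "BookingRecord", "CreateBooking")}

for label, missing in missing_attributions(nodes, expected).items():
    print(label, "has no node from", sorted(missing))
```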


Models tested and behaviour (single-chunk baseline)

All tested models produced IDs that would collide in multi-chunk. Tested on a two-module corpus (project-a/models/README.md and project-b/models/README.md), two independent runs each:

| Model | API model ID | ID strategy observed (single-chunk) | Multi-chunk risk |
|---|---|---|---|
| DeepSeek V3 | deepseek-chat | readme_md_*, notes_txt_* (bare stem + extension, consistent across runs) | COLLIDE: same IDs from both chunks |
| DeepSeek v4-flash | deepseek-v4-flash | Non-deterministic: run 1 → full path (project_a_*, safe); run 2 → bare stem (readme_*, collides) | UNPREDICTABLE: same corpus can produce a safe graph on one run and a colliding graph on the next |
| llama-4-scout-17b | meta-llama/llama-4-scout-17b-16e-instruct | Bare concept names only (confirm, cancel), no file path context at all | CRITICAL: any shared concept name across files maps to the same ID |

DeepSeek v4-flash non-determinism is a compounding risk: the model changed its ID strategy between two runs on the identical corpus (same files, same budget). A graph that looks correct today can silently have collisions after adding files or re-running, with no indication in the output. This makes the bug intermittent and hard to detect.


Why existing safeguards don't help

  • deduplicate_entities() label passes (dedup.py lines ~233–376): after the ID pre-dedup (first-writer-wins), passes 1 and 2 merge nodes by normalised label similarity, emitting "Deduplicated N node(s).". This is the source of those messages in our test runs. Operates on label text, not IDs; cross-file pairs with identical labels are explicitly blocked (dedup.py lines 364-368). Unrelated to the ID collision.
  • deduplicate_by_label (build.py line 422): a separate function that also merges by normalised label, emitting "Deduplicated N node(s) by label.". In v0.8.50 this function is defined but never called anywhere in the CLI pipeline — it is dead code. Does not protect against ID collisions.
  • normalize_id() (ids.py line ~40: return s.strip("_").casefold()): used only for edge endpoint remapping (norm_to_id dict). Not applied to node IDs at graph-build time.
  • Ghost detection (build.py lines ~207–276): merges semantic nodes into their AST canonical twins. Only applies to AST vs. semantic merge — does not detect two semantic nodes with the same ID.

What IS protected (recommended monorepo workflow)

The documented multi-subfolder workflow (references/github-and-merge.md) is not affected: running graphify extract per subfolder and then graphify merge-graphs calls prefix_graph_for_global(G, repo_tag), which rewrites all node IDs to repo_tag::original_id before nx.compose. This makes collisions impossible.
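Why the prefix makes the merge safe can be shown without networkx, using plain dicts as stand-ins for the two per-subfolder graphs. This is an illustration of the `repo_tag::original_id` scheme only; `prefix_ids` is a hypothetical helper and not the real prefix_graph_for_global:

```python
def prefix_ids(nodes: dict[str, dict], repo_tag: str) -> dict[str, dict]:
    """Rewrite every node ID to 'repo_tag::original_id'."""
    return {f"{repo_tag}::{nid}": attrs for nid, attrs in nodes.items()}


# Two per-subfolder graphs that use the same bare ID.
graph_a = {"readme_booking_record": {"source_file": "project-a/README.md"}}
graph_b = {"readme_booking_record": {"source_file": "project-b/README.md"}}

# Without prefixes the union keeps a single entry; with prefixes it keeps both.
unprefixed = {**graph_a, **graph_b}
merged = {**prefix_ids(graph_a, "project-a"), **prefix_ids(graph_b, "project-b")}

print(len(unprefixed), len(merged))
print(sorted(merged))
```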

The bug only affects users who run graphify extract . (or graphify extract <root>) directly on a root that contains multiple sibling subdirectories — which is the natural single-command usage.


Suggested fix

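The ID collision should at minimum be detected and reported at the point where the node is actually dropped (the ID pre-dedup in deduplicate_entities()) and, as a second line of defence, at G.add_node. One or more of: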

  1. Compute stems library-side, don't delegate to the LLM (most robust): graphify knows all file paths before extraction. Compute a collision-free stem deterministically and pass it as an explicit attribute in the file wrapper:

    <untrusted_source path="module-a/docs/README.md" stem="module_a_docs_readme">

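    System prompt: "Use the stem= attribute as your ID prefix, exactly as given." The stem is derived from the full relative path ("_".join(relative_path.with_suffix("").parts).replace("-","_").lower()), which distinguishes same-named files in different directories. Rare residual clashes (for example a-b/c.md vs a/b-c.md, or README.md vs readme.md in one directory) can be detected library-side before extraction, because all paths are known up front. This eliminates the collision regardless of model behavior and removes the dependency on any model following a stem rule correctly.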

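  2. Detect and warn at graph build time: before G.add_node, check if the ID already exists and the incoming node's source_file differs from the existing node's source_file. Emit a warning: "WARNING: node '{id}' from {new_file} overwrites node from {existing_file}: possible ID collision across chunks". Because deduplicate_entities() drops the duplicate first when dedup=True, the same check is needed in its ID pre-dedup loop; otherwise G.add_node never sees the collision.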

  3. Make the stem uniqueness-preserving: update the system prompt stem rule to include enough path depth to differentiate files in sibling subdirectories. A two-level stem ({grandparent_dir}_{parent_dir}_{filename_without_ext}) would cover the common pattern.

  4. Post-generation validation: after collecting all chunk results and before deduplicate_entities() runs (and therefore before any G.add_node call), scan for duplicate IDs with different source_file values and raise or log them as a batch.
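Options 1 and 4 could be sketched roughly as below. All three function names are hypothetical and none of this is existing graphify code; the sketch assumes relative POSIX-style paths and node dicts carrying "id" and "source_file" keys:

```python
from collections import defaultdict
from pathlib import PurePosixPath


def path_stem(relative_path: str) -> str:
    """Stem built from the full relative path, e.g. module_a_docs_readme."""
    parts = PurePosixPath(relative_path).with_suffix("").parts
    return "_".join(parts).replace("-", "_").lower()


def find_stem_clashes(paths: list[str]) -> dict[str, list[str]]:
    """Stems shared by more than one path (rare, but possible)."""
    by_stem: dict[str, list[str]] = defaultdict(list)
    for path in paths:
        by_stem[path_stem(path)].append(path)
    return {stem: ps for stem, ps in by_stem.items() if len(ps) > 1}


def find_id_collisions(nodes: list[dict]) -> dict[str, set[str]]:
    """IDs claimed by nodes from more than one source_file."""
    files_by_id: dict[str, set[str]] = defaultdict(set)
    for node in nodes:
        nid = node.get("id", "")
        if nid:
            files_by_id[nid].add(node.get("source_file", ""))
    return {nid: files for nid, files in files_by_id.items() if len(files) > 1}


print(path_stem("module-a/docs/README.md"))
print(path_stem("module-b/docs/README.md"))

merged_nodes = [
    {"id": "readme_booking_record", "source_file": "project-a/README.md"},
    {"id": "readme_booking_record", "source_file": "project-b/README.md"},
    {"id": "internal_alpha_specific_concept", "source_file": "project-a/INTERNAL.md"},
]
for nid, files in find_id_collisions(merged_nodes).items():
    print(f"WARNING: node '{nid}' produced by {sorted(files)}")
```

The scan has to run on the merged list before any deduplication; afterwards the duplicates are already gone. `find_stem_clashes` covers the residual cases where two distinct paths flatten to the same stem, so they can be reported or disambiguated before any LLM call is made.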


Question for maintainer

deduplicate_by_label (build.py line 422) is defined but never called anywhere in the CLI pipeline in v0.8.50 — it appears to be dead code. Its message "Deduplicated N node(s) by label." is never emitted; the "Deduplicated N node(s)." messages in actual runs come from deduplicate_entities() in dedup.py.

Is this intentional (function kept for external API use?) or an unintended regression? If it was meant to run as part of the build pipeline, its absence may be a separate bug — and if it did run, it would not protect against the ID collision described here (it operates on label text, and cross-file nodes with identical labels are explicitly blocked from merging at dedup.py lines 364-368).
