Silent data loss: doc files with the same name in different directories produce colliding node IDs across extraction chunks

**graphify version:** 0.8.50  
**Affected invocation:** `graphify extract .` on any multi-directory corpus large enough to split into multiple chunks  
**Severity:** High — data is silently lost with no warning, error, or log entry

---

## Summary

When a corpus is large enough to split into multiple LLM extraction chunks, doc files with the same filename in different subdirectories produce **identical node IDs** from separate chunk runs. `deduplicate_entities()` silently discards one of the colliding nodes (first-writer-wins), so one directory's data is dropped without any indication. The user sees a complete-looking graph with fewer nodes than expected and no error message.

---

## Minimal reproduction

```
my-corpus/
├── module-a/
│   └── docs/
│       └── README.md   ← "Booking Model — confirm, refund operations"
└── module-b/
    └── docs/
        └── README.md   ← "Booking Model — cancel, reschedule operations"
```

Run on a corpus large enough to split chunks (see "Trigger threshold" below):

```bash
graphify extract .
```

**Expected:** graph contains nodes from both `module-a` and `module-b`.  
**Actual:** graph contains nodes from only **one** directory; the other is silently dropped.

To force splitting with this small corpus:
```bash
graphify extract . --token-budget 400
```

---

## Root cause (verified from source)

### 1. Stem rules produce non-unique IDs

The system prompt (`llm.py`, line ~390) instructs models:

> `stem = filename without extension`

This means `project-a/models/README.md` and `project-b/models/README.md` both get the stem `readme`, so every ID generated from either file starts with `readme_`. The model then generates IDs like `readme_confirm` and `readme_cancel`: one shared namespace for two different files.

The system prompt also instructs models to produce **lowercase-only IDs**, so even if filenames differ only by case (`README.md` vs `readme.md`), both yield `readme_` after the model applies the lowercase constraint.
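
A minimal sketch of the collision, assuming the stem rule is applied literally (filename without extension, lowercased). The helper names `bare_stem` and `stem_collisions` are illustrative and do not exist in graphify:

```python
from collections import defaultdict
from pathlib import PurePosixPath


def bare_stem(path: str) -> str:
    """Filename without extension, lowercased (the rule quoted above)."""
    return PurePosixPath(path).stem.lower()


def stem_collisions(paths):
    """Group paths by bare stem; return only stems shared by 2+ files."""
    groups = defaultdict(list)
    for p in paths:
        groups[bare_stem(p)].append(p)
    return {stem: ps for stem, ps in groups.items() if len(ps) > 1}


corpus = [
    "project-a/models/README.md",
    "project-b/models/README.md",
    "project-b/models/readme.md",   # differs only by case
    "project-a/models/NOTES.txt",
]
print(stem_collisions(corpus))
```

Running a check like this over the file list before extraction would show which files are at risk, independent of any model behaviour.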

### 2. `deduplicate_entities()` silently discards one of the colliding nodes

After all chunk results are merged into a flat list, the CLI calls:

```python
# __main__.py line ~4737 — dedup=True is hardcoded
G = _build([merged], dedup=True, ...)
```

`build()` runs `deduplicate_entities()` **before** `build_from_json()`. Inside `deduplicate_entities()` (`dedup.py`, lines 222–228):

```python
# Pre-deduplicate: keep first occurrence of each id
seen_ids: dict[str, dict] = {}
for node in nodes:
    nid = node.get("id", "")
    if nid and nid not in seen_ids:
        seen_ids[nid] = node
unique_nodes = list(seen_ids.values())
```

**First-writer-wins**: when two chunks produce the same ID, the node from whichever chunk appears first in the merged list survives; the other is **silently dropped** — no warning, no counter, no log entry. Which chunk comes first depends on the backend:
- **`ollama` and `claude-cli` backends**: forced to `max_concurrency=1` (sequential) — chunks merge in `sorted(by_dir)` order, so alphabetically earlier directories always win.
- **All other backends** (`deepseek`, `gemini`, `kimi`, `openai`, …): `ThreadPoolExecutor` with `as_completed()` — chunks merge in API **response-completion order**, which is nondeterministic. Which project loses data can vary between runs of the same corpus.

`build_from_json()` then calls `G.add_node()` on the already-deduplicated list — it does not see duplicates at this point for the collision case. (The `build()` docstring's note about "last extraction's attributes win" refers to the separate intentional AST-vs-semantic overwrite, not to cross-chunk ID collisions.)

**Note on `G.add_node()` and data loss:** `G.add_node()` with an existing ID does overwrite silently, but for the cross-chunk collision case the data loss happens one step earlier, in `deduplicate_entities()`. Both layers are silent.
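
To make the first-writer-wins behaviour concrete, here is a standalone sketch that mirrors the excerpt above but also returns the dropped nodes. The function name `pre_dedup` and the two toy chunks are illustrative only:

```python
def pre_dedup(nodes):
    """First occurrence of each ID wins, as in the excerpt above.
    Unlike the original, the dropped nodes are returned as well."""
    seen = {}
    dropped = []
    for node in nodes:
        nid = node.get("id", "")
        if not nid:
            continue  # nodes without an ID are skipped, as in the original
        if nid in seen:
            dropped.append(node)
        else:
            seen[nid] = node
    return list(seen.values()), dropped


chunk_a = [{"id": "readme_booking_record", "source_file": "project-a/README.md"}]
chunk_b = [{"id": "readme_booking_record", "source_file": "project-b/README.md"}]

# Sequential backends: chunks arrive in sorted directory order.
kept_ab, dropped_ab = pre_dedup(chunk_a + chunk_b)
# Threaded backends: the other order is equally possible.
kept_ba, dropped_ba = pre_dedup(chunk_b + chunk_a)

print(kept_ab[0]["source_file"], "| dropped:", dropped_ab[0]["source_file"])
print(kept_ba[0]["source_file"], "| dropped:", dropped_ba[0]["source_file"])
```

The same two inputs yield a different surviving `source_file` depending only on merge order, which is why the loss is deterministic on sequential backends and nondeterministic on threaded ones.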

### 3. Chunking puts same-named files in separate chunks

`_pack_chunks_by_tokens` (`llm.py`, lines ~1454–1497) groups files by **parent directory** and packs greedily within the token budget. `project-a/models/` and `project-b/models/` are separate directory buckets, so on a large corpus they land in separate chunks and receive independent LLM calls with no shared context.
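
The packing behaviour can be approximated with the sketch below. It is a simplified stand-in written for this report, not graphify's actual `_pack_chunks_by_tokens`; the function name `pack_by_directory` is hypothetical, and the token counts are the ones measured for the reproduction corpus later in this issue:

```python
from collections import defaultdict
from pathlib import PurePosixPath


def pack_by_directory(files, budget):
    """files: list of (path, estimated_tokens) pairs.
    Buckets files by parent directory, then packs greedily in sorted
    directory order, starting a new chunk when the budget would be exceeded."""
    by_dir = defaultdict(list)
    for path, tokens in files:
        by_dir[str(PurePosixPath(path).parent)].append((path, tokens))

    chunks, current, used = [], [], 0
    for directory in sorted(by_dir):
        for path, tokens in by_dir[directory]:
            if current and used + tokens > budget:
                chunks.append(current)
                current, used = [], 0
            current.append(path)
            used += tokens
    if current:
        chunks.append(current)
    return chunks


files = [
    ("project-a/README.md", 334),
    ("project-a/INTERNAL.md", 147),
    ("project-b/README.md", 348),
]
for budget in (400, 600, 1000):
    print(budget, pack_by_directory(files, budget))
```

Under these assumptions, budgets of 400 and 600 both place the two `README.md` files in different chunks; only a budget large enough to hold the whole corpus puts them in front of the same model call.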

### 4. In-chunk disambiguation disappears in multi-chunk

When a model processes a single chunk containing both `project-a/models/README.md` and `project-b/models/README.md`, it can see the duplicate paths and self-disambiguate (e.g., `readme_md_pending_b`). Split across two chunks, each model call sees only one file, has no reason to disambiguate, and produces the same IDs. **Single-chunk runs appear clean; multi-chunk runs silently lose data.**

### 5. No post-generation validation — models violate the spec in both directions

The spec (`extraction-spec.md`) states explicitly:
> *"CRITICAL: never append chunk numbers, sequence numbers, or any suffix to an ID (no `_c1`, `_c2`, `_chunk2`, etc.)"*

Yet models routinely add disambiguation suffixes (`_b`, `_2`, `_`) when they see duplicate file paths in one chunk. graphify never validates the returned IDs against this rule. The result is a dangerous inversion:

- **Single-chunk run:** model sees both `project-a/README.md` and `project-b/README.md` simultaneously → adds a forbidden `_b` suffix to one → run appears clean
- **Multi-chunk run (the real scenario):** model sees only one file → no duplicate visible → no suffix added → both files produce the same bare ID → one of the two nodes is silently dropped

The "self-correction" a model applies in single-chunk is the exact behavior the spec forbids, and it **disappears** precisely when it would be needed. A test on a small corpus (one chunk) gives false confidence that the graph is correct.

Beyond suffix violations, models deviate from the stem rule in multiple other ways, all silently:

| Model | API model ID | Actual ID strategy | Spec-compliant? | Multi-chunk risk |
|---|---|---|---|---|
| DeepSeek V3 | `deepseek-chat` | `{filename}_{extension}_*` (includes extension) | No | Collide — same ext, different dirs |
| DeepSeek v4-flash | `deepseek-v4-flash` | **Non-deterministic** — run 1: full path (`project_a_*`, safe); run 2: bare stem (`readme_*`, collides) | No | **Unpredictable** — same corpus, different runs, different outcome; a "good" graph is not reproducible |
| llama-4-scout-17b | `meta-llama/llama-4-scout-17b-16e-instruct` | Bare concept names only (`confirm`, `cancel`) — no file context at all | No | **Critical** — any shared concept name → overwrite |

**Label normalization deviation (observed in DeepSeek V3 multi-chunk run):** models split CamelCase identifiers in labels even when the source text uses a single token. Observed: source text `"BookingService"` → node label `"Booking Service"` (space inserted at word boundary). The ID `readme_booking_service` is the same either way, but the label diverges from the original.

This creates a secondary failure mode that compounds the ID collision: `_norm_label` in `dedup.py` treats `"BookingService"` → `"bookingservice"` and `"Booking Service"` → `"booking service"` as **different keys**. If two chunks each extract the same CamelCase concept but one produces `"BookingService"` and the other `"Booking Service"`, the label-similarity passes in `deduplicate_entities()` will not merge them even if they survive the ID dedup step — two fragmented nodes remain in the graph for what is semantically one entity.
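
The mismatch can be illustrated with a hypothetical comparison key that ignores word-boundary spacing. `spacing_insensitive_key` is not part of graphify; it is only a sketch of one way the two spellings could be made to compare equal:

```python
import re


def spacing_insensitive_key(label: str) -> str:
    """Casefold, then remove whitespace, punctuation and underscores,
    so 'BookingService' and 'Booking Service' map to the same key."""
    return re.sub(r"[\W_]+", "", label.casefold())


a, b = "BookingService", "Booking Service"
print(a.casefold() == b.casefold())
print(spacing_insensitive_key(a) == spacing_insensitive_key(b))
```

A key this aggressive can also merge labels that are genuinely different (for example `Re-sign` and `Resign`), so it would need to stay behind the existing cross-file guard rather than replace it.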

graphify has no mechanism to detect or reject any of these deviations. There is no ID format check, no cross-chunk ID collision guard, no label format validation, and no warning when a node is silently dropped.

---

## Trigger threshold

The default `--token-budget` is 60,000 tokens. Splitting occurs when the corpus exceeds roughly:

| File type | Processing | Est. tokens/unit | Units to fill one chunk |
|---|---|---|---|
| Small README (~500 chars) | read as-is | ~165 | ~363 |
| Typical README (~2 000 chars) | read as-is | ~540 | ~111 |
| Large `.md`/`.txt`/`.rst` (SKILL.md, AGENTS.md, wiki, ~10K chars) | **sliced** into `FileSlice` units ≤ 20K chars | ~2 540 per slice | **~23** |
| Any `.md`/`.txt`/`.rst` ≥ 20 000 chars (capped at `_FILE_CHAR_CAP`) | each slice = 1 unit | ~5 040 per slice | **~11** |
| PDF (any size) | text extracted, then truncated — **not sliced** | ~5 040 (cap) | **~11** |
| `.docx` / `.xlsx` | converted to `.md` sidecar first, then sliced | ~5 040 per slice | **~11** |
| `.html`, `.yaml`, `.yml` | read as-is, truncated — **not sliced** | ~5 040 (cap) | **~11** |
| Images (`.png`, `.jpg`, `.webp`, `.gif`) | vision or text ref | 1 600 (`_IMAGE_TOKEN_ESTIMATE`) | **20** (hard cap `_MAX_IMAGES_PER_CHUNK = 20`; the 21st image forces a split, although the token budget alone would fit ~37) |

Constants from source: `_FILE_CHAR_CAP = 20_000`, `_CHARS_PER_TOKEN = 4`, `_PER_FILE_OVERHEAD_CHARS = 160`, `_IMAGE_TOKEN_ESTIMATE = 1_600`, `_MAX_IMAGES_PER_CHUNK = 20`.  
Sliceable types (`file_slice.py`): `.md`, `.mdx`, `.markdown`, `.txt`, `.rst` — all others are truncated.
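
The text-file rows of the table follow from these constants. The sketch below reproduces the arithmetic using the chars/4 fallback estimate (a real run may use the tokenizer instead, so actual counts can differ); the function names are illustrative:

```python
FILE_CHAR_CAP = 20_000
CHARS_PER_TOKEN = 4
PER_FILE_OVERHEAD_CHARS = 160
DEFAULT_TOKEN_BUDGET = 60_000


def estimated_tokens(chars: int) -> int:
    """Fallback estimate for one unit (file or slice), capped at FILE_CHAR_CAP."""
    capped = min(chars, FILE_CHAR_CAP)
    return (capped + PER_FILE_OVERHEAD_CHARS) // CHARS_PER_TOKEN


def units_per_chunk(chars: int, budget: int = DEFAULT_TOKEN_BUDGET) -> int:
    """How many units of this size fit in one chunk before a split."""
    return budget // estimated_tokens(chars)


for chars in (500, 2_000, 10_000, 20_000):
    print(chars, estimated_tokens(chars), units_per_chunk(chars))
```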

**Note on large `.md` files:** A single 100 000-char file (a large wiki, SKILL.md, API reference) produces **5 slices × ~5 040 tokens = ~25 200 tokens** — consuming ~42% of the default budget before any other file is processed. Two such files from sibling directories will almost certainly split into separate chunks.

**Note on PDFs:** Large PDFs are truncated after the first ~20 000 characters of extracted text — the rest is silently dropped regardless of page count. Additionally, a PDF with an on-disk size > 50 MiB is silently skipped entirely (`extract_pdf_text` returns `""`).

**Note on sliced files and intra-file collisions:** A single large `.md` file sliced into `FileSlice` units can also trigger a collision _within itself_ if its slices land in separate chunks: each chunk independently generates IDs for concepts that appear in multiple sections (e.g., a concept referenced in the intro and again in the conclusion), and `deduplicate_entities()` keeps only the first node for each repeated ID, silently dropping the later chunk's version.

**This is not just a "huge monorepo" problem.** Realistic trigger scenarios:
- **12+ PDFs** in different subdirectories (tech specs, academic papers, design documents)
- **12+ large `.md` files** (SKILL.md, AGENTS.md, wiki articles, API references with ≥ 20K chars)
- **2 wiki-style index `.md` files** of ~100K chars each: together they create ~10 slices (~50 400 tokens), nearly filling one chunk and forcing a split as soon as the next large file is added
- **21+ images** anywhere in the corpus — `_MAX_IMAGES_PER_CHUNK = 20` forces a split regardless of token budget
- **Any corpus with reduced `--token-budget`** (e.g. `--token-budget 4000` → threshold drops to ~8 typical docs)

---

## Empirical verification — confirmed multi-chunk data loss

Two independent LLM runs on a purpose-built corpus confirmed actual data loss.

### Reproduction corpus

Both README files share the same heading `# Booking Service`, the same `BookingService` entity, and the same Data Model concepts (`ConfirmationCode`, `BookingRecord`). All project-identifying text was removed from headings so models have no content-based hint to disambiguate.

**`.graphifyignore`** (place at corpus root):
```
verify.py
verify-out/
verify-out-*/
RUN.md
```

**`project-a/README.md`:**
```markdown
# Booking Service

This document describes the core booking service.

## Overview

The BookingService handles all reservation workflows.
It coordinates with the PaymentGateway and the InventoryManager to complete orders.

## Core Operations

### CreateBooking
Initiates a new booking record. Validates availability via InventoryManager,
charges the customer via PaymentGateway, and returns a ConfirmationCode.

### ConfirmBooking
Transitions a booking from PENDING to CONFIRMED state. Sends confirmation
email via NotificationService. Updates the AuditLog.

### RefundPolicy
Full refunds are issued if cancellation occurs more than 72 hours before
the booking start time. The RefundProcessor handles all fund returns.
Partial refunds are calculated by the PricingEngine.

## Data Model

- BookingRecord: id, customer_id, status, created_at, updated_at
- ConfirmationCode: unique alphanumeric token, 8 characters
- BookingStatus: PENDING | CONFIRMED | CANCELLED | COMPLETED

## Error Handling

- BookingConflictError: raised when InventoryManager reports no availability
- PaymentFailureError: raised when PaymentGateway rejects the transaction
- InvalidStatusTransition: raised when state machine receives illegal input

## Dependencies

- InventoryManager: checks and reserves slots
- PaymentGateway: processes credit card and wallet transactions
- NotificationService: sends emails and push notifications
- PricingEngine: calculates base price and applicable discounts
- AuditLog: append-only record of all booking state changes
```

**`project-a/INTERNAL.md`:**
```markdown
# Booking Service — Internal Notes

Short internal reference for the team.

## Key contacts

- Lead: Alice Chen (booking logic, payment integration)
- Backend: Bob Torres (infrastructure)

## Known quirks

- InventoryManager has a 500ms cache TTL — stale reads possible under high load
- PricingEngine rounds down to nearest cent — audit carefully

## AlphaSpecificConcept

This concept is unique to Project Alpha and should appear only in the Alpha graph.
It must not be lost or merged with Project Beta data.
```

**`project-b/README.md`:**
```markdown
# Booking Service

This document describes the core booking service.

## Overview

The BookingService handles all reservation workflows.
It coordinates with the ReservationEngine and the CapacityPlanner to complete orders.

## Core Operations

### CreateBooking
Opens a new booking slot. Verifies capacity via CapacityPlanner,
charges the customer via BillingModule, and returns a ConfirmationCode.

### CancelBooking
Transitions a booking from ACTIVE to CANCELLED state. Triggers refund
flow via RefundEngine. Notifies customer via MessageBroker. Updates EventLog.

### ReschedulePolicy
Bookings may be rescheduled up to 3 times without penalty. The
ReschedulingCoordinator validates new slot availability. A RescheduleFee
applies after the third rescheduling attempt.

## Data Model

- BookingRecord: id, tenant_id, state, opened_at, closed_at
- ConfirmationCode: unique alphanumeric token, 8 characters
- BookingState: ACTIVE | CANCELLED | RESCHEDULED | EXPIRED

## Error Handling

- CapacityExceededError: raised when CapacityPlanner finds no open slots
- BillingFailureError: raised when BillingModule cannot charge the tenant
- RescheduleLimit: raised when tenant exceeds maximum rescheduling count

## Dependencies

- CapacityPlanner: checks and locks available slots
- BillingModule: processes payments and invoices
- MessageBroker: delivers emails and SMS notifications
- ReschedulingCoordinator: orchestrates slot swaps
- EventLog: immutable event stream for all booking lifecycle changes
```

Both runs used `--token-budget 400`, producing 3 chunks (verified by run output and by running `_pack_chunks_by_tokens` directly with actual tiktoken estimates — `_TOKENIZER` is tiktoken `cl100k_base`, not the chars/4 fallback):

- Chunk 1: `project-a/README.md` (334 tokens) — generates `readme_*` IDs
- Chunk 2: `project-a/INTERNAL.md` (147 tokens) — generates `internal_*` IDs
- Chunk 3: `project-b/README.md` (348 tokens) — generates `readme_*` IDs

At `--token-budget 600`, only 2 chunks are produced (334+147=481 ≤ 600, both project-a files together), but the collision between Chunks 1 and 3 (both named README.md) still occurs. Raising the budget delays but does not prevent the split once the corpus grows.

Chunks 1 and 3 each process a file named `README.md` in isolation, with no shared context → both produce identical `readme_*` namespaces independently.

### Run 1 — `meta-llama/llama-4-scout-17b-16e-instruct` via Groq (budget=400)

```bash
export OLLAMA_BASE_URL="https://api.groq.com/openai/v1"
export OLLAMA_API_KEY="<GROQ_API_KEY>"
export OLLAMA_MODEL="meta-llama/llama-4-scout-17b-16e-instruct"
export GRAPHIFY_MAX_OUTPUT_TOKENS="8000"
graphify extract ./verify-collision --backend ollama --token-budget 400 --out ./verify-out-groq
```

```
graphify output: 3 chunks → Deduplicated 0 node(s) → 32 nodes
```

Node dump for `project-b/README.md`: **9 nodes** (expected ~14–16).

| Shared concept (both READMEs) | In graph | Source file attributed |
|---|---|---|
| BookingService | **MISSING** | — |
| CreateBooking | **MISSING** | — |
| ConfirmationCode | **MISSING** | — |
| BookingRecord | **MISSING** | — |

graphify output: **`Deduplicated 0 node(s)`** — no label collisions, no counter, no warning. The silent drop is invisible in every log and output stream.

### Run 2 — DeepSeek V3 (`deepseek-chat`) via DeepSeek API (budget=400)

```bash
export DEEPSEEK_API_KEY="<DEEPSEEK_API_KEY>"
export GRAPHIFY_DEEPSEEK_MODEL="deepseek-chat"
graphify extract ./verify-collision --backend deepseek --token-budget 400 --out ./verify-out-v3
```

```
graphify output: 3 chunks → Deduplicated 1 node(s) → 31 nodes
```

`Deduplicated 1 node(s)` comes from the label-similarity passes in `deduplicate_entities()` (`dedup.py` line 401) and is unrelated to the ID collision. The two `InventoryManager` nodes have different IDs and different `source_file` values, so they are explicitly protected from cross-file label merging (`dedup.py` lines 364-368) and both survive in the graph. The pair that was actually merged was not identified. The `deepseek` backend uses a thread pool (`as_completed`), so chunk completion order is nondeterministic: which version survives an ID collision depends on API response timing, not alphabetical directory order.

| Shared concept (both READMEs) | In graph | Source file attributed |
|---|---|---|
| ConfirmationCode | YES | `project-b/README.md` |
| BookingRecord | YES | `project-b/README.md` |
| CreateBooking | YES | `project-a/README.md` |

Both `ConfirmationCode` and `BookingRecord` exist in both project-a and project-b README content. Each chunk independently generated `readme_confirmation_code` / `readme_booking_record`. `deduplicate_entities()` kept one and silently dropped the other. The surviving `source_file` attribution is nondeterministic — it depends on which chunk's output arrived first in the merged list. No warning was emitted.

**Both runs: zero indication of data loss in any graphify output, log, or counter.**

---

## Models tested and behaviour (single-chunk baseline)

Every tested model produced, in at least one run, IDs that would collide in a multi-chunk run. Tested on a two-module corpus (`project-a/models/README.md` and `project-b/models/README.md`), two independent runs each:

| Model | API model ID | ID strategy observed (single-chunk) | Multi-chunk risk |
|---|---|---|---|
| DeepSeek V3 | `deepseek-chat` | `readme_md_*`, `notes_txt_*` — bare stem + extension, consistent across runs | **COLLIDE** — same IDs from both chunks |
| DeepSeek v4-flash | `deepseek-v4-flash` | **Non-deterministic**: run 1 → full path (`project_a_*`, safe); run 2 → bare stem (`readme_*`, collides) | **UNPREDICTABLE** — same corpus can produce a safe graph on one run and a colliding graph on the next |
| llama-4-scout-17b | `meta-llama/llama-4-scout-17b-16e-instruct` | Bare concept names only (`confirm`, `cancel`) — no file path context at all | **CRITICAL** — any shared concept name across files maps to the same ID |

**DeepSeek v4-flash non-determinism** is a compounding risk: the model changed its ID strategy between two runs on the identical corpus (same files, same budget). A graph that looks correct today can silently have collisions after adding files or re-running, with no indication in the output. This makes the bug intermittent and hard to detect.

---

## Why existing safeguards don't help

- **`deduplicate_entities()` label passes** (`dedup.py` lines ~233–376): after the ID pre-dedup (first-writer-wins), passes 1 and 2 merge nodes by normalised label similarity, emitting `"Deduplicated N node(s)."`. This is the source of those messages in our test runs. Operates on label text, not IDs; cross-file pairs with identical labels are explicitly blocked (dedup.py lines 364-368). Unrelated to the ID collision.
- **`deduplicate_by_label`** (`build.py` line 422): a separate function that also merges by normalised label, emitting `"Deduplicated N node(s) by label."`. **In v0.8.50 this function is defined but never called anywhere in the CLI pipeline** — it is dead code. Does not protect against ID collisions.
- **`normalize_id()`** (`ids.py` line ~40: `return s.strip("_").casefold()`): used only for edge endpoint remapping (`norm_to_id` dict). Not applied to node IDs at graph-build time.
- **Ghost detection** (`build.py` lines ~207–276): merges semantic nodes into their AST canonical twins. Only applies to AST vs. semantic merge — does not detect two semantic nodes with the same ID.

---

## What IS protected (recommended monorepo workflow)

The documented multi-subfolder workflow (`references/github-and-merge.md`) is **not affected**: running `graphify extract` per subfolder and then `graphify merge-graphs` calls `prefix_graph_for_global(G, repo_tag)`, which rewrites all node IDs to `repo_tag::original_id` before `nx.compose`. This makes collisions impossible.

The bug only affects users who run `graphify extract .` (or `graphify extract <root>`) directly on a root that contains multiple sibling subdirectories — which is the natural single-command usage.
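
Why prefixing protects the merge workflow can be shown with a plain-dict stand-in. This is not graphify's `prefix_graph_for_global`; the function `prefix_ids` and the edge keys `source` and `target` are assumptions made for the sketch:

```python
def prefix_ids(nodes, edges, repo_tag):
    """Return copies of nodes and edges with every ID rewritten to
    'repo_tag::original_id', the scheme described above."""
    def tag(nid):
        return f"{repo_tag}::{nid}"

    new_nodes = [{**n, "id": tag(n["id"])} for n in nodes]
    new_edges = [
        {**e, "source": tag(e["source"]), "target": tag(e["target"])}
        for e in edges
    ]
    return new_nodes, new_edges


nodes = [{"id": "readme_booking_record"}, {"id": "readme_confirmation_code"}]
edges = [{"source": "readme_booking_record", "target": "readme_confirmation_code"}]

a_nodes, a_edges = prefix_ids(nodes, edges, "project-a")
b_nodes, b_edges = prefix_ids(nodes, edges, "project-b")

ids_a = {n["id"] for n in a_nodes}
ids_b = {n["id"] for n in b_nodes}
print(sorted(ids_a | ids_b))
print("overlap:", ids_a & ids_b)
```

Identical per-repo IDs stay distinct after tagging, so the union of the two graphs keeps all four nodes.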

---

## Suggested fix

The ID collision should at minimum be detected and reported before `deduplicate_entities()` discards a node. One or more of:

1. **Compute stems library-side, don't delegate to the LLM (most robust):** graphify knows all file paths before extraction. Compute a collision-free stem deterministically and pass it as an explicit attribute in the file wrapper:
   ```xml
   <untrusted_source path="module-a/docs/README.md" stem="module_a_docs_readme">
   ```
   System prompt: `"Use the stem= attribute as your ID prefix, exactly as given."` The stem is derived from the full relative path (`"_".join(relative_path.with_suffix("").parts).replace("-","_").lower()`), so two files can collide only if their paths differ solely by extension, case, or `-` versus `_`; graphify can check for that case before extraction. This removes the cross-directory collision regardless of model behavior and removes the dependency on any model following a stem rule correctly.

2. **Detect and warn at dedup/build time:** in the ID pre-dedup pass of `deduplicate_entities()` (where cross-chunk collisions are actually discarded), and again before `G.add_node`, check whether the ID already exists and the incoming node's `source_file` differs from the existing node's `source_file`. Emit a warning such as: `"WARNING: node '{id}' from {new_file} collides with node from {existing_file}; possible ID collision across chunks"`.

3. **Make the stem uniqueness-preserving:** update the system prompt stem rule to include enough path depth to differentiate files in sibling subdirectories. A two-level stem (`{grandparent_dir}_{parent_dir}_{filename_without_ext}`) would cover the common pattern.

4. **Post-generation validation:** after collecting all chunk results and before `deduplicate_entities()` runs, scan for duplicate IDs with different `source_file` values and raise or log them as a batch.
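
Options 1 and 4 could look roughly like the sketch below. Both helpers (`path_stem`, `find_id_collisions`) are hypothetical and written for this report; they assume nodes are dicts carrying `id` and `source_file`, as in the `dedup.py` excerpt:

```python
import re
from collections import defaultdict
from pathlib import PurePosixPath


def path_stem(relative_path: str) -> str:
    """Option 1: derive the ID prefix from the full relative path."""
    parts = PurePosixPath(relative_path).with_suffix("").parts
    joined = "_".join(parts).lower()
    # Collapse anything that is not a-z or 0-9 into a single underscore.
    return re.sub(r"[^0-9a-z]+", "_", joined).strip("_")


def find_id_collisions(nodes):
    """Option 4: map each ID to its source files when more than one
    distinct source_file produced it."""
    sources = defaultdict(set)
    for node in nodes:
        nid = node.get("id", "")
        if nid:
            sources[nid].add(node.get("source_file", ""))
    return {nid: sorted(s) for nid, s in sources.items() if len(s) > 1}


print(path_stem("module-a/docs/README.md"))
print(path_stem("module-b/docs/README.md"))

merged = [
    {"id": "readme_booking_record", "source_file": "project-a/README.md"},
    {"id": "internal_key_contacts", "source_file": "project-a/INTERNAL.md"},
    {"id": "readme_booking_record", "source_file": "project-b/README.md"},
]
for nid, files in find_id_collisions(merged).items():
    print(f"WARNING: node '{nid}' produced by {files}")
```

Running the collision scan on the merged list before the ID pre-dedup pass would turn the silent drop into a reported one even if the stem rule itself is left unchanged.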

---

## Question for maintainer

`deduplicate_by_label` (`build.py` line 422) is defined but never called anywhere in the CLI pipeline in v0.8.50 — it appears to be dead code. Its message `"Deduplicated N node(s) by label."` is never emitted; the `"Deduplicated N node(s)."` messages in actual runs come from `deduplicate_entities()` in `dedup.py`.

Is this intentional (function kept for external API use?) or an unintended regression? If it was meant to run as part of the build pipeline, its absence may be a separate bug — and if it did run, it would not protect against the ID collision described here (it operates on label text, and cross-file nodes with identical labels are explicitly blocked from merging at `dedup.py` lines 364-368).
