Path-based IDs still collide after #1504 fix: `normalize_id` collapses non-word sequences to `_` #1522

Hi @safishamsi, first, thank you for the quick turnaround on #1504 and #1509. The full-path stem approach is exactly the right direction, and the `_semantic_id_remap()` migration utility is a nice touch.

While smoke-testing v0.9.0 I noticed that the collision problem may be wider than the original issue. The fix correctly uses the full repo-relative path, but `normalize_id` collapses any contiguous sequence of non-word characters (`[^\w]+`) into a single `_`, and then collapses consecutive underscores. That affects a large set of characters valid in directory and file names: `-`, `.`, ` `, `!`, `(`, `@`, `#`, `~`, `+` and others on Linux/macOS, and a similar set on Windows (where only `\/:*?"<>|` are forbidden). Any combination of these in path segments can produce the same ID even after the fix.
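To make the behavior concrete, here is a minimal stand-in reconstructed only from the description above. `normalize_id_sketch` is a hypothetical name and not the project's actual `normalize_id`; the lowercasing and the stripping of leading/trailing underscores are assumptions on my part.

```python
import re

def normalize_id_sketch(text: str) -> str:
    """Hypothetical stand-in for normalize_id, based on the described behavior."""
    collapsed = re.sub(r"[^\w]+", "_", text)   # any run of non-word chars becomes "_"
    collapsed = re.sub(r"_+", "_", collapsed)  # consecutive underscores are merged
    return collapsed.strip("_").lower()        # assumption: trimmed and lowercased

# Four distinct stems, one resulting ID.
for stem in ("foo_bar/baz", "foo/bar_baz", "foo-bar/baz", "foo.bar/baz"):
    print(f"{stem:14} -> {normalize_id_sketch(stem)}")
```

Every line printed ends in `foo_bar_baz`, which is the collision described in the rest of this report.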

**Cross-directory collisions (slash vs other separators):**
```
foo_bar/baz.md    →  foo_bar/baz   →  foo_bar_baz
foo/bar_baz.md    →  foo/bar_baz   →  foo_bar_baz   ← slash vs underscore

my-module/baz.md  →  my-module/baz →  my_module_baz
my_module/baz.md  →  my_module/baz →  my_module_baz ← dash vs underscore

my module/baz.md  →  my module/baz →  my_module_baz ← space vs underscore (space matches [^\w]+, becomes _)

utils.helper/baz.md  →  utils.helper/baz  →  utils_helper_baz
utils/helper_baz.md  →  utils/helper_baz  →  utils_helper_baz ← dot vs slash vs underscore
```

**Same-directory collisions (dot/dash in filename vs underscore):**

`with_suffix("")` removes only the last extension, so intermediate dots survive into the stem:

```
foo.bar.ts  →  foo.bar  →  foo_bar
foo_bar.ts  →  foo_bar  →  foo_bar   ← collision, same directory
foo-bar.ts  →  foo-bar  →  foo_bar   ← collision, same directory
```

Three different files in the same folder produce identical IDs.
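The `with_suffix("")` behavior can be checked in isolation with `pathlib`. This sketch assumes the stem is derived through `pathlib`; `stem_sketch` is a hypothetical helper, not the project's `_file_stem()`.

```python
from pathlib import PurePosixPath

def stem_sketch(name: str) -> str:
    """Drop only the final extension, as with_suffix("") does."""
    return PurePosixPath(name).with_suffix("").as_posix()

for name in ("foo.bar.ts", "foo_bar.ts", "foo-bar.ts"):
    print(f"{name} -> {stem_sketch(name)}")
# foo.bar.ts -> foo.bar
# foo_bar.ts -> foo_bar
# foo-bar.ts -> foo-bar
```

The three stems are still distinct at this point; they only become identical once the non-word characters are collapsed.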

**Consecutive non-word characters:**

A contiguous run of non-word characters (`[^\w]+`) collapses to a single `_`, so quantity doesn't matter either:
```
foo--bar/baz.md  →  foo_bar_baz
foo-bar/baz.md   →  foo_bar_baz   ← collision (double dash vs single)
foo_bar/baz.md   →  foo_bar_baz   ← collision
```

All of these are realistic paths in Python packages, Go modules, and JS monorepos. A few concrete examples:

| Path pair | Language | ID collision |
|---|---|---|
| `my_package/utils.py` vs `my/package_utils.py` | Python | `my_package_utils` |
| `my-service/handler.go` vs `my_service/handler.go` | Go | `my_service_handler` |
| `components/foo.test.ts` vs `components/foo_test.ts` | TypeScript | `components_foo_test` |
| `v1_api/schema.go` vs `v1/api_schema.go` | Go | `v1_api_schema` |
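The table can be reproduced with a short script that runs each path through an approximation of the pipeline (strip the last extension, then collapse non-word runs) and groups the results. `path_to_id_sketch` is a hypothetical reconstruction from the behavior described in this issue, not the project's code.

```python
import re
from collections import defaultdict
from pathlib import PurePosixPath

def path_to_id_sketch(path: str) -> str:
    """Hypothetical pipeline: drop last extension, then lossy normalization."""
    stem = PurePosixPath(path).with_suffix("").as_posix()
    collapsed = re.sub(r"_+", "_", re.sub(r"[^\w]+", "_", stem))
    return collapsed.strip("_").lower()

PATHS = [
    "my_package/utils.py", "my/package_utils.py",
    "my-service/handler.go", "my_service/handler.go",
    "components/foo.test.ts", "components/foo_test.ts",
    "v1_api/schema.go", "v1/api_schema.go",
]

# Group source paths by the ID they map to.
by_id = defaultdict(list)
for path in PATHS:
    by_id[path_to_id_sketch(path)].append(path)

for node_id, sources in by_id.items():
    print(f"{node_id}: {sources}")
```

Eight input paths yield four IDs, each shared by two files.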

For languages where extraction is purely AST-based (Go, Python, JS/TS), the collision is especially silent: there is no LLM output layer that could hint something went wrong — nodes are just quietly merged with last-writer-wins semantics.

The root issue is that `normalize_id` is designed to normalize arbitrary strings to a safe `[a-z0-9_]` format — a lossy operation that isn't suitable for generating unique keys from structured paths. Using it downstream of `_file_stem()` undoes the uniqueness guarantee that the fix is trying to establish.

The fix likely requires generating path-based IDs independently of `normalize_id`, since the two have conflicting goals: path IDs need to preserve structural distinctness, while `normalize_id` is lossy by design. This would mean another breaking ID change, but making it now in 0.9.x costs much less than making it later: the longer it waits, the more users will have built graphs in the current v0.9.0 format. A clean break now is better than silent data loss compounding over time.
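As one possible direction (only a sketch, not a proposal for the exact format), an escaping scheme can stay within `[a-z0-9_]` while remaining reversible, so two different stems can never share an ID. `encode_path_id` and `decode_path_id` are hypothetical names; the choice to escape every character outside `[a-z0-9]` as `_` plus two hex digits per UTF-8 byte is my assumption.

```python
SAFE = frozenset("abcdefghijklmnopqrstuvwxyz0123456789")

def encode_path_id(stem: str) -> str:
    """Escape everything outside [a-z0-9] as "_xx" per UTF-8 byte (reversible)."""
    out = []
    for ch in stem:
        if ch in SAFE:
            out.append(ch)
        else:
            out.extend(f"_{byte:02x}" for byte in ch.encode("utf-8"))
    return "".join(out)

def decode_path_id(node_id: str) -> str:
    """Inverse of encode_path_id; its existence is what rules out collisions."""
    raw = bytearray()
    i = 0
    while i < len(node_id):
        if node_id[i] == "_":
            raw.append(int(node_id[i + 1:i + 3], 16))  # "_xx" -> one byte
            i += 3
        else:
            raw.append(ord(node_id[i]))                # literal [a-z0-9]
            i += 1
    return raw.decode("utf-8")

print(encode_path_id("foo_bar/baz"))  # foo_5fbar_2fbaz
print(encode_path_id("foo/bar_baz"))  # foo_2fbar_5fbaz
```

The trade-off is readability: IDs get longer and less pleasant to look at. A readable lossy prefix plus a short hash of the original path would be another option, at the cost of a strict uniqueness guarantee.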

Regardless of the approach taken, it would be worth emitting a warning (or failing fast) when two different source paths produce the same node ID at build time. The migration warning added in 3999dbc covers the legacy-format case well, but a collision detector for the new format would catch any remaining edge cases explicitly rather than letting them silently corrupt the graph.
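A build-time check along these lines could be quite small. The sketch below is independent of how IDs are generated: it takes any path-to-ID function and reports IDs that more than one source path maps to. All names here (`find_id_collisions`, `check_id_collisions`, `NodeIdCollisionError`, the `strict` flag) are hypothetical and not part of the project.

```python
import re
import warnings
from collections import defaultdict
from typing import Callable, Iterable

class NodeIdCollisionError(ValueError):
    """Raised in strict mode when two source paths share a node ID."""

def find_id_collisions(paths: Iterable[str], make_id: Callable[[str], str]) -> dict[str, list[str]]:
    """Return {node_id: [paths...]} for every ID produced by more than one path."""
    sources = defaultdict(set)
    for path in paths:
        sources[make_id(path)].add(path)
    return {nid: sorted(ps) for nid, ps in sources.items() if len(ps) > 1}

def check_id_collisions(paths, make_id, strict=False):
    """Warn for each collision, or raise on the first one when strict=True."""
    collisions = find_id_collisions(paths, make_id)
    for node_id, colliding in collisions.items():
        message = f"node ID {node_id!r} is shared by: {', '.join(colliding)}"
        if strict:
            raise NodeIdCollisionError(message)
        warnings.warn(message, stacklevel=2)
    return collisions

# Demo with a deliberately lossy ID function (stand-in, not normalize_id itself).
lossy = lambda p: re.sub(r"[^\w]+", "_", p)
found = find_id_collisions(["foo-bar/baz", "foo_bar/baz", "other/file"], lossy)
print(found)  # {'foo_bar_baz': ['foo-bar/baz', 'foo_bar/baz']}
```

Warning by default keeps existing builds working, while a strict mode would let CI fail fast on a collision.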

Worth noting that `_semantic_id_remap()` would need the same update when re-deriving IDs from `source_file` attributes, otherwise the migration step produces mismatched keys.
