Path-based IDs still collide after #1504 fix: normalize_id collapses non-word sequences to _ #1522

Description

@sub4biz

Hi @safishamsi, first, thank you for the quick turnaround on #1504 and #1509. The full-path stem approach is exactly the right direction, and the _semantic_id_remap() migration utility is a nice touch.

While smoke-testing v0.9.0 I noticed that the collision problem is wider than the original issue. The fix correctly uses the full repo-relative path, but normalize_id collapses any contiguous sequence of non-word characters ([^\w]+) into a single _, and then collapses consecutive underscores. Because _ is itself a word character, a literal underscore in a name is indistinguishable from a separator that was replaced by _. That affects a large set of characters valid in directory and file names: -, ., space, !, (, @, #, ~, + and others on Linux/macOS, and a similar set on Windows (where only \/:*?"<>| are forbidden). Any combination of these in path segments can produce the same ID even after the fix.

Cross-directory collisions (slash vs other separators):

foo_bar/baz.md    →  foo_bar/baz   →  foo_bar_baz
foo/bar_baz.md    →  foo/bar_baz   →  foo_bar_baz   ← slash vs underscore

my-module/baz.md  →  my-module/baz →  my_module_baz
my_module/baz.md  →  my_module/baz →  my_module_baz ← dash vs underscore

my module/baz.md  →  my module/baz →  my_module_baz ← space vs underscore (space matches [^\w]+, becomes _)

utils.helper/baz.md  →  utils.helper/baz  →  utils_helper_baz
utils/helper_baz.md  →  utils/helper_baz  →  utils_helper_baz ← dot vs slash vs underscore

Same-directory collisions (dot/dash in filename vs underscore):

with_suffix("") removes only the last extension, so intermediate dots survive into the stem:

foo.bar.ts  →  foo.bar  →  foo_bar
foo_bar.ts  →  foo_bar  →  foo_bar   ← collision, same directory
foo-bar.ts  →  foo-bar  →  foo_bar   ← collision, same directory

Three different files in the same folder produce identical IDs.
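
The with_suffix("") behaviour can be checked in isolation with the standard library. This is a minimal sketch, assuming _file_stem() relies on pathlib in roughly this way; strip_last_suffix is a hypothetical helper name used only for illustration:

```python
from pathlib import PurePosixPath


def strip_last_suffix(name: str) -> str:
    """Drop only the final extension, as pathlib's with_suffix("") does."""
    return PurePosixPath(name).with_suffix("").as_posix()


for name in ("foo.bar.ts", "foo_bar.ts", "foo-bar.ts"):
    # The intermediate dot in "foo.bar.ts" is still present in the stem.
    print(f"{name:12} -> {strip_last_suffix(name)}")
```

The three stems are still distinct at this point; they only become identical once the non-word characters are replaced.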

Consecutive non-word characters:

A contiguous run of non-word characters ([^\w]+) collapses to a single _, so quantity doesn't matter either:

foo--bar/baz.md  →  foo_bar_baz
foo-bar/baz.md   →  foo_bar_baz   ← collision (double dash vs single)
foo_bar/baz.md   →  foo_bar_baz   ← collision
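
The examples above can be reproduced with a short script. This is a sketch of the behaviour described in this issue, not the project's actual code: normalize_id_sketch and path_id_sketch are hypothetical stand-ins, and the lowercasing and stripping of leading/trailing underscores are assumptions.

```python
import re
from collections import defaultdict
from pathlib import PurePosixPath


def normalize_id_sketch(raw: str) -> str:
    """Approximate the described normalization (assumed, not the real code)."""
    collapsed = re.sub(r"[^\w]+", "_", raw)    # any run of non-word chars -> "_"
    collapsed = re.sub(r"_+", "_", collapsed)  # collapse consecutive underscores
    return collapsed.strip("_").lower()        # assumption: trim and lowercase


def path_id_sketch(path: str) -> str:
    """Full repo-relative path, minus the last extension, then normalized."""
    stem = PurePosixPath(path).with_suffix("").as_posix()
    return normalize_id_sketch(stem)


paths = [
    "foo_bar/baz.md", "foo/bar_baz.md", "foo-bar/baz.md", "foo--bar/baz.md",
    "my-module/baz.md", "my_module/baz.md", "my module/baz.md",
    "utils.helper/baz.md", "utils/helper_baz.md",
    "foo.bar.ts", "foo_bar.ts", "foo-bar.ts",
]

# Group the source paths by the ID they end up with.
by_id = defaultdict(list)
for p in paths:
    by_id[path_id_sketch(p)].append(p)

for node_id, sources in by_id.items():
    print(f"{node_id}: {sources}")
```

Every group printed by this script contains more than one path, which is the collision described above.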

All of these are realistic paths in Python packages, Go modules, JS monorepos — a few concrete examples:

Path pair                                          Language     ID collision
my_package/utils.py vs my/package_utils.py         Python       my_package_utils
my-service/handler.go vs my_service/handler.go     Go           my_service_handler
components/foo.test.ts vs components/foo_test.ts   TypeScript   components_foo_test
v1_api/schema.go vs v1/api_schema.go               Go           v1_api_schema

For languages where extraction is purely AST-based (Go, Python, JS/TS), the collision is especially silent: there is no LLM output layer that could hint something went wrong — nodes are just quietly merged with last-writer-wins semantics.

The root issue is that normalize_id is designed to normalize arbitrary strings to a safe [a-z0-9_] format — a lossy operation that isn't suitable for generating unique keys from structured paths. Using it downstream of _file_stem() undoes the uniqueness guarantee that the fix is trying to establish.

The fix likely requires rethinking how path-based IDs are generated independently of normalize_id, since the two have conflicting goals: one needs to preserve structural distinctness, the other is designed to be lossy. This would mean another breaking ID change — but fixing it now in 0.9.x is much less costly than later: the longer it waits, the more users will have built graphs with the current v0.9.0 format. A clean break now is better than silent data loss compounding over time.
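
To make the "conflicting goals" point concrete, here is one possible direction, offered only as a sketch and not as the intended fix: keep the readable normalized prefix, but append a short digest of the un-normalized stem so that structurally different paths stay distinct. The function name, the 8-character digest length, and the choice of SHA-1 are all assumptions made for illustration.

```python
import hashlib
import re
from pathlib import PurePosixPath


def unique_path_id_sketch(path: str, digest_len: int = 8) -> str:
    """Readable lossy prefix plus a digest of the exact, un-normalized stem."""
    stem = PurePosixPath(path).with_suffix("").as_posix()
    # Lossy, human-readable part (same approximation of normalize_id as described).
    readable = re.sub(r"_+", "_", re.sub(r"[^\w]+", "_", stem)).strip("_").lower()
    # Distinguishing part: derived from the stem before any characters are replaced.
    digest = hashlib.sha1(stem.encode("utf-8")).hexdigest()[:digest_len]
    return f"{readable}_{digest}"


for p in ("my-module/baz.md", "my_module/baz.md", "my module/baz.md"):
    print(f"{p:18} -> {unique_path_id_sketch(p)}")
```

A truncated digest is not strictly collision-free, so a scheme like this would still benefit from an explicit collision check at build time. A fully reversible escaping of non-word characters is another option, at the cost of less readable IDs.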

Regardless of the approach taken, it would be worth emitting a warning (or failing fast) when two different source paths produce the same node ID at build time. The migration warning added in 3999dbc covers the legacy-format case well, but a collision detector for the new format would catch any remaining edge cases explicitly rather than letting them silently corrupt the graph.
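
A build-time collision detector could be as small as the following sketch. It is independent of how IDs are generated: make_id is whatever function maps a source path to a node ID. The names find_id_collisions and check_id_collisions, and the strict flag, are hypothetical and not part of the existing codebase.

```python
import warnings
from collections import defaultdict
from typing import Callable, Dict, Iterable, List


def find_id_collisions(
    paths: Iterable[str], make_id: Callable[[str], str]
) -> Dict[str, List[str]]:
    """Return {node_id: [source paths]} for every ID shared by 2+ distinct paths."""
    by_id = defaultdict(set)
    for path in paths:
        by_id[make_id(path)].add(path)
    return {nid: sorted(srcs) for nid, srcs in by_id.items() if len(srcs) > 1}


def check_id_collisions(
    paths: Iterable[str], make_id: Callable[[str], str], strict: bool = False
) -> Dict[str, List[str]]:
    """Warn about each collision, or raise on the first one when strict=True."""
    collisions = find_id_collisions(paths, make_id)
    for node_id, sources in collisions.items():
        msg = (
            f"node ID {node_id!r} is produced by {len(sources)} "
            f"source paths: {', '.join(sources)}"
        )
        if strict:
            raise ValueError(msg)
        warnings.warn(msg, stacklevel=2)
    return collisions
```

Running the check once over all source paths before any nodes are written would turn a silent last-writer-wins merge into an explicit warning or error.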

Worth noting that _semantic_id_remap() would need the same update when re-deriving IDs from source_file attributes, otherwise the migration step produces mismatched keys.
