Hi @safishamsi — first, thank you for the quick turnaround on #1504 and #1509 — the full-path stem approach is exactly the right direction and the _semantic_id_remap() migration utility is a nice touch.
While smoke-testing v0.9.0 I noticed that the collision problem may be wider than the original issue. The fix correctly uses the full repo-relative path, but normalize_id collapses any contiguous run of non-word characters ([^\w]+) into a single _, and then collapses consecutive underscores. That affects a large set of characters that are valid in directory and file names: -, ., space, !, (, @, #, ~, + and others on Linux/macOS, and a similar set on Windows (where only \/:*?"<>| are forbidden). The path separator / is collapsed the same way. Any combination of these in path segments can therefore produce the same ID even after the fix.
Cross-directory collisions (slash vs other separators):
foo_bar/baz.md → foo_bar/baz → foo_bar_baz
foo/bar_baz.md → foo/bar_baz → foo_bar_baz ← slash vs underscore
my-module/baz.md → my-module/baz → my_module_baz
my_module/baz.md → my_module/baz → my_module_baz ← dash vs underscore
my module/baz.md → my module/baz → my_module_baz ← space vs underscore (space matches [^\w]+, becomes _)
utils.helper/baz.md → utils.helper/baz → utils_helper_baz
utils/helper_baz.md → utils/helper_baz → utils_helper_baz ← dot vs slash vs underscore
Same-directory collisions (dot/dash in filename vs underscore):
with_suffix("") removes only the last extension, so intermediate dots survive into the stem:
foo.bar.ts → foo.bar → foo_bar
foo_bar.ts → foo_bar → foo_bar ← collision, same directory
foo-bar.ts → foo-bar → foo_bar ← collision, same directory
Three different files in the same folder produce identical IDs.
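The stems above can be checked with pathlib alone. This is a minimal sketch that uses only the standard library; it does not call _file_stem() itself, so it only illustrates the with_suffix("") behaviour described here.

```python
from pathlib import PurePosixPath

names = ["foo.bar.ts", "foo_bar.ts", "foo-bar.ts"]

# with_suffix("") drops only the final extension (".ts"),
# so the inner dot in "foo.bar.ts" stays in the stem.
stems = [str(PurePosixPath(name).with_suffix("")) for name in names]

for name, stem in zip(names, stems):
    print(f"{name} -> {stem}")
```

The three stems are still distinct at this point (foo.bar, foo_bar, foo-bar); they only become identical once the dot and the dash are rewritten to underscores.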
Consecutive non-word characters:
A contiguous run of non-word characters ([^\w]+) collapses to a single _, so quantity doesn't matter either:
foo--bar/baz.md → foo_bar_baz
foo-bar/baz.md → foo_bar_baz ← collision (double dash vs single)
foo_bar/baz.md → foo_bar_baz ← collision
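All three collision classes can be reproduced with a few lines. The function below is a stand-in written from the behaviour described in this comment (collapse [^\w]+ to _, collapse repeated underscores, lowercase); the name normalize_id_sketch is hypothetical and the real normalize_id may differ in details.

```python
import re


def normalize_id_sketch(text: str) -> str:
    """Stand-in for normalize_id, based only on the description above."""
    collapsed = re.sub(r"[^\w]+", "_", text)   # any run of non-word chars -> "_"
    collapsed = re.sub(r"_+", "_", collapsed)  # runs of underscores -> "_"
    return collapsed.lower()


# Six distinct path stems (extension already removed).
stems = [
    "foo_bar/baz",
    "foo/bar_baz",
    "foo--bar/baz",
    "foo-bar/baz",
    "foo.bar/baz",
    "foo bar/baz",
]

ids = {stem: normalize_id_sketch(stem) for stem in stems}
for stem, node_id in ids.items():
    print(f"{stem!r} -> {node_id}")
```

Under these assumptions every one of the six stems maps to foo_bar_baz.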
All of these are realistic paths in Python packages, Go modules, JS monorepos — a few concrete examples:
| Path pair | Language | ID collision |
| --- | --- | --- |
| my_package/utils.py vs my/package_utils.py | Python | my_package_utils |
| my-service/handler.go vs my_service/handler.go | Go | my_service_handler |
| components/foo.test.ts vs components/foo_test.ts | TypeScript | components_foo_test |
| v1_api/schema.go vs v1/api_schema.go | Go | v1_api_schema |
For languages where extraction is purely AST-based (Go, Python, JS/TS), the collision is especially silent: there is no LLM output layer that could hint something went wrong — nodes are just quietly merged with last-writer-wins semantics.
The root issue is that normalize_id is designed to normalize arbitrary strings to a safe [a-z0-9_] format — a lossy operation that isn't suitable for generating unique keys from structured paths. Using it downstream of _file_stem() undoes the uniqueness guarantee that the fix is trying to establish.
The fix likely requires rethinking how path-based IDs are generated independently of normalize_id, since the two have conflicting goals: one needs to preserve structural distinctness, the other is designed to be lossy. This would mean another breaking ID change — but fixing it now in 0.9.x is much less costly than later: the longer it waits, the more users will have built graphs with the current v0.9.0 format. A clean break now is better than silent data loss compounding over time.
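To make "preserve structural distinctness" concrete, here is one possible direction, offered as a sketch rather than a proposal for the actual design. It keeps the output inside [a-z0-9_] but escapes every other character (including _ and /) as its hex code point between underscores, so the mapping can be reversed. The names encode_path_id and decode_path_id are hypothetical and are not part of the project.

```python
import re

_SAFE_CHAR = re.compile(r"[a-z0-9]")


def encode_path_id(rel_path: str) -> str:
    """Escape everything outside [a-z0-9] as _<hex code point>_."""
    parts = []
    for ch in rel_path:
        if _SAFE_CHAR.fullmatch(ch):
            parts.append(ch)
        else:
            parts.append(f"_{ord(ch):x}_")
    return "".join(parts)


def decode_path_id(node_id: str) -> str:
    """Inverse of encode_path_id; raises ValueError on malformed input."""
    parts = []
    i = 0
    while i < len(node_id):
        if node_id[i] == "_":
            end = node_id.index("_", i + 1)  # closing underscore of the escape
            parts.append(chr(int(node_id[i + 1:end], 16)))
            i = end + 1
        else:
            parts.append(node_id[i])
            i += 1
    return "".join(parts)


print(encode_path_id("foo_bar/baz"))  # foo_5f_bar_2f_baz
print(encode_path_id("foo/bar_baz"))  # foo_2f_bar_5f_baz
```

Because the encoding round-trips, two different paths cannot share an ID. The obvious cost is readability; a readable normalized ID with a short hash of the raw path appended would be another option, trading strict uniqueness for collision resistance.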
Regardless of the approach taken, it would be worth emitting a warning (or failing fast) when two different source paths produce the same node ID at build time. The migration warning added in 3999dbc covers the legacy-format case well, but a collision detector for the new format would catch any remaining edge cases explicitly rather than letting them silently corrupt the graph.
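A build-time check along these lines could be quite small. The sketch below is independent of how IDs end up being generated: find_id_collisions is a hypothetical helper that takes any path-to-ID function, and demo_id is only a stand-in that mimics the current behaviour as described in this comment.

```python
import re
from collections import defaultdict
from pathlib import PurePosixPath
from typing import Callable, Iterable


def find_id_collisions(
    paths: Iterable[str], make_id: Callable[[str], str]
) -> dict[str, list[str]]:
    """Return {node_id: [paths]} for every ID produced by more than one path."""
    by_id: dict[str, list[str]] = defaultdict(list)
    for path in sorted(set(paths)):  # de-duplicate so a repeated path is not a collision
        by_id[make_id(path)].append(path)
    return {node_id: group for node_id, group in by_id.items() if len(group) > 1}


def demo_id(path: str) -> str:
    """Stand-in ID function: drop the last extension, then collapse non-word runs."""
    stem = str(PurePosixPath(path).with_suffix(""))
    return re.sub(r"_+", "_", re.sub(r"[^\w]+", "_", stem)).lower()


paths = ["my-service/handler.go", "my_service/handler.go", "v1/types.go"]
collisions = find_id_collisions(paths, demo_id)
for node_id, group in collisions.items():
    print(f"node ID {node_id!r} is shared by: {', '.join(group)}")
```

The build could then either log each entry as a warning or raise when the returned dict is non-empty, depending on how strict the default should be.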
Note that _semantic_id_remap() would need the same update when it re-derives IDs from source_file attributes; otherwise the migration step would produce mismatched keys.