fix: _is_sensitive silently drops topic notes (token-economics-of-recall.md flagged as a secret)#1169

Closed
edudatsuj45 wants to merge 1 commit into
Graphify-Labs:v8from
edudatsuj45:fix-sensitive-topic-slugs

Conversation

@edudatsuj45

Problem

The generic keyword patterns in _SENSITIVE_PATTERNS flag any filename containing token/secret/password/credential as a standalone word — so prose documents whose descriptive slug merely mentions the topic are silently dropped from the graph:

| Filename | Current | Reality |
| --- | --- | --- |
| token-economics-of-recall.md | skipped as sensitive | a note about LLM token costs |
| password-policy-discussion.md | skipped as sensitive | a design discussion |

This is the remaining failure class after the #436 → #718 → #920 lineage. Hit in the wild: an Obsidian-style memory vault where a note on token economics vanished from the graph with no visible warning (skipped_sensitive is returned but nothing surfaces it, which is exactly the silent-data-loss failure mode described in #718's closing observation).

Fix

Split the two generic keyword patterns out of _SENSITIVE_PATTERNS and only count a match when the keyword is load-bearing in the filename:

  • the keyword ends the stem (api_token.txt, oauth_token.json, github-personal-access-token.txt) — secret stores name their contents, and the content noun is the head of the compound, which comes last in English; or
  • the stem has ≤ 2 words (token.txt, token_config.yaml, secret_handler.txt).

A keyword buried mid-phrase in a ≥ 3-word slug is a topic word, not a credential store. The specific patterns (.pem/.env/id_rsa/.netrc/aws_credentials/_SENSITIVE_DIRS) are unchanged and still always apply.

The end-of-stem check runs before word counting, so multi-word keywords survive their own separator: my_private_key.txt is still flagged even though splitting on _ would break private_key apart. Leading dots are stripped before stem extraction so dotfiles like .token keep their keyword.
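
For readers skimming without the diff, the rule above can be sketched roughly as follows. This is an illustration only, not the code in this PR: the helper name `keyword_is_load_bearing`, the regexes, and the separator set are assumptions made for the example, and the specific patterns (.pem/.env/id_rsa/…) and _SENSITIVE_DIRS are deliberately left out.

```python
import re
from pathlib import PurePath

# Hypothetical keyword alternation (not the project's actual pattern list).
# An optional trailing "s" covers plurals such as tokens.txt.
_KW = r"(?:credential|secret|password|token|private[-_]key)s?"

# Keyword sits at the very end of the stem, as a standalone word.
_ENDS_WITH_KW = re.compile(rf"(?:^|[^a-z0-9]){_KW}$")
# A single word that is exactly a keyword (or its plural).
_IS_KW = re.compile(_KW)
# Anything that is not a letter or digit separates words in a slug.
_SEPARATORS = re.compile(r"[^a-z0-9]+")


def keyword_is_load_bearing(filename: str) -> bool:
    """Return True when a generic keyword likely names the file's contents."""
    # Strip leading dots first so ".token" keeps its keyword as the stem.
    name = filename.lower().lstrip(".")
    stem = PurePath(name).stem

    # 1. End-of-stem check runs before word splitting, so a multi-word
    #    keyword such as private_key is matched whole.
    if _ENDS_WITH_KW.search(stem):
        return True

    # 2. Otherwise only short stems (<= 2 words) count; a keyword buried in
    #    a longer slug is treated as a topic word.
    words = [w for w in _SEPARATORS.split(stem) if w]
    return len(words) <= 2 and any(_IS_KW.fullmatch(w) for w in words)


if __name__ == "__main__":
    for sample in ("token-economics-of-recall.md", "api_token.txt", ".token"):
        print(sample, keyword_is_load_bearing(sample))
```

Under these assumptions, token-economics-of-recall.md falls through both checks (four words, keyword not last), while api_token.txt and my_private_key.txt are caught by the end-of-stem check and token_config.yaml by the short-stem check.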

Behavior table

| Filename | Before | After |
| --- | --- | --- |
| token-economics-of-recall.md | flagged ❌ | clean ✅ |
| password-policy-discussion.md | flagged ❌ | clean ✅ |
| api_token.txt / oauth_token.json (#920) | flagged | flagged |
| token.txt / tokens.txt / .token | flagged | flagged |
| token_config.yaml / secret_handler.txt (#920) | flagged | flagged |
| github-personal-access-token.txt | flagged | flagged |
| my_private_key.txt | flagged | flagged |
| passwords.py / credentials.json | flagged | flagged |
| tokenizer.py / tokenize.py (#718) | clean | clean |
| .env / server.pem / id_rsa / .ssh/… | flagged | flagged |

Tests

All 13 existing test_sensitive_* contracts pass unchanged; adds 6 regression tests covering the topic-slug false positive plus the dotfile, plural, end-of-long-name, and multi-word-keyword edge cases.

tests/test_detect.py: 19 passed (sensitive subset)
tests/test_detect.py: 109 passed, 6 failed* , 1 skipped
* pre-existing WinError 1314 symlink-privilege failures on Windows without Developer Mode, unrelated to this change

🤖 Generated with Claude Code

…f-recall.md

The generic keyword patterns (credential/secret/password/token) flagged any
filename containing the keyword as a standalone word, silently dropping prose
documents whose descriptive slug merely mentions the topic:

  token-economics-of-recall.md   -> skipped as sensitive (a note ABOUT tokens)
  password-policy-discussion.md  -> skipped as sensitive

Follow-up to the Graphify-Labs#436 -> Graphify-Labs#718 -> Graphify-Labs#920 lineage: the remaining failure class is
keyword-as-topic-word in multi-word descriptive filenames.

Fix: split the generic keyword patterns out of _SENSITIVE_PATTERNS and only
count a match when the keyword is load-bearing in the name:

  - the keyword ends the stem (api_token.txt, github-personal-access-token.txt,
    oauth_token.json) - secret stores name their contents, and the content
    noun is the head of the compound, which comes last; or
  - the stem has <= 2 words (token.txt, token_config.yaml, secret_handler.txt).

A keyword buried mid-phrase in a >= 3-word slug is a topic word, not a
credential store. Specific patterns (.pem/.env/id_rsa/.netrc/aws_credentials)
are unchanged and still always apply.

All existing contracts preserved: api_token.txt, oauth_token.json, token.txt,
token_config.yaml, secret_handler.txt, passwords.py, credentials.json still
flagged; tokenizer.py / tokenize.py still clean. Adds 6 regression tests
including dotfile (.token), plural (tokens.txt), and multi-word-keyword
(my_private_key.txt) edge cases.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
safishamsi added a commit that referenced this pull request Jun 7, 2026
#1170 — replace nohup with cross-platform Python detach in git hooks.
Git for Windows MSYS has no nohup so post-commit/post-checkout hooks
silently failed. Now uses subprocess.Popen with DETACHED_PROCESS |
CREATE_NEW_PROCESS_GROUP on Windows, start_new_session=True on POSIX.
Quoting-safe (argv list). Fixes #1161.

#1169 — fix _is_sensitive false positives on topic-mentioning filenames.
token-economics-of-recall.md and password-policy-discussion.md were
silently dropped as secrets. Generic keywords (token/secret/password)
now only fire when the keyword ends the filename stem or the stem is
≤2 words. Specific patterns (.env/.pem/id_rsa etc.) remain unconditional.

#1165 — fix multi-word endpoint resolution in _score_nodes.
graphify path "AuthService" "UserRepo" never fired the exact-match bonus
because per-token comparison never equalled the full label. Now joins
normalized tokens and compares against the full label and its tokenized
form. O(1) per node, affects query_graph and shortest_path uniformly.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@safishamsi

Collaborator

Landed in a8dbbe5. Generic keywords (token/secret/password) now only fire when the keyword ends the filename stem or the stem is ≤2 words. Specific patterns (.env/.pem/id_rsa etc.) remain unconditional. token-economics-of-recall.md passes through; tokens.txt is still caught. Thanks!

@safishamsi safishamsi closed this Jun 7, 2026