Skip to content

Contain symlinked extraction inputs#1613

Closed
Tok6Flow0 wants to merge 1 commit into
Graphify-Labs:v8from
Tok6Flow0:codex/contain-symlink-extraction
Closed

Contain symlinked extraction inputs#1613
Tok6Flow0 wants to merge 1 commit into
Graphify-Labs:v8from
Tok6Flow0:codex/contain-symlink-extraction

Conversation

@Tok6Flow0

Copy link
Copy Markdown
Contributor

Summary

  • make symlink directory following explicit opt-in instead of auto-enabled
  • require resolved corpus paths to stay under the scan root before detection, AST collection, and LLM/image extraction reads
  • skip out-of-root symlink targets while preserving in-root symlink support

Security impact

A corpus can contain symlinks that resolve outside the selected scan root. Previously, direct symlinked directories could be followed automatically and symlinked files could be read by default. If semantic extraction is enabled, those outside-root file contents can enter the extraction prompt or image payload. This patch treats the scan root as a containment boundary before reading or sending file contents.

Tests

  • python -m pytest tests/test_detect.py tests/test_image_vision.py tests/test_extract.py -q
    • 256 passed

safishamsi added a commit that referenced this pull request Jul 2, 2026
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@safishamsi

Copy link
Copy Markdown
Collaborator

Merged into v8 as 009a98b (your authorship). Verified the containment: an in-corpus symlink whose target resolves outside the scan root is skipped at detection (and the AST/LLM read boundaries), while in-root files extract normally and symlink-following stays explicit opt-in. Good security hardening. Full suite 2841. Ships next release.

@safishamsi safishamsi closed this Jul 2, 2026
nokternol added a commit to nokternol/graphify that referenced this pull request Jul 4, 2026
…ify-Labs#1613)

`explain` resolved to `_find_node`'s first match unconditionally,
discarding every other candidate with no signal — the exact failure
mode Graphify-Labs#1445 fixed for `query`'s BFS seeding, present all along in
`explain`'s own resolution step but never addressed. Confirmed live:
`explain "genres"` silently returned an unrelated, degree-1 Storybook
literal while `GenresResponse` (the actually relevant node per this
repo's own Trial 1 ground truth) sat one precedence tier down,
discarded — not a same-tier tie but a lone weak exact match masking a
real prefix match, since `_find_node`'s tiers apply strict precedence
(exact beats prefix beats substring) with nothing to catch a weak
winner.

`_find_node`'s internal tiers are now exposed via `_find_node_tiers`
(source_exact, exact, prefix, substring, plus whether the existing
file-level-node heuristic already resolved a multi-entry source_exact
tier). `explain` uses this to catch two ambiguity shapes instead of
picking `matches[0]` blindly:

- same-tier ties: more than one candidate in the top precedence tier.
- precedence collapse: exactly one candidate in the top tier, but it is
  weakly connected (degree <= 1) and a lower tier has candidates that
  are fully hidden by tier precedence alone.

Either case prints a numbered candidate list (label, source, degree)
instead of guessing, and exits cleanly; `--force` reproduces the old
behavior for scripts that want it. A degree-dominance escape hatch
(mirrors `path`'s existing top-vs-runner-up gap check, applied to
degree) keeps same-tier ties from over-firing: when one candidate's
degree dominates the runner-up, it resolves directly with no prompt,
found necessary during live verification against a real repo where a
DI container type name legitimately appears as 3 distinct exact-tier
nodes (the real definition plus 2 unrelated per-file type-annotation
references) — a real, dominant winner, not a genuine tie.

Full suite (2764 tests, same 1 pre-existing unrelated failure as prior
commits) and ruff pass. Verified live against a real repo's graph.json:
`explain "genres"` now surfaces both candidates; `explain "Cradle"` and
`explain "identityResolutionJob"` (previously-clean cases) are
unaffected; `explain "identity linking"` (a genuine no-match) is
unaffected.
nokternol added a commit to nokternol/graphify that referenced this pull request Jul 4, 2026
…raphify-Labs#1614)

Port of Graphify-Labs#1613's `explain` fix into `path`'s endpoint resolution.
`path` already detected a close top-vs-runner-up score gap on either
endpoint, but only printed a stderr warning and then silently
proceeded with the top-scored candidate regardless — the same
silently-wrong-with-no-actionable-signal failure Graphify-Labs#1445 and Graphify-Labs#1613 fixed
elsewhere. Confirmed live against the real repro:
`path "filterRegistry" "useMediaLookups"` has `filterRegistry.ts`
(degree 21, the real module) and `filterRegistry.test.ts` (degree 8)
tied at the exact same score — a real ambiguity a stable sort tiebreak
happened to resolve "correctly" this time, with nothing to say it
might not next time. Both endpoints were individually resolving fine
in isolation; the misleading result came from silently accepting a
coin-flip on one of them.

Endpoint resolution now uses the same two-part gate as Graphify-Labs#1613: a close
score gap is treated as ambiguous and prints a numbered candidate list
(label, source, degree, score) instead of guessing, unless one
candidate's degree dominates the runner-up (mirrors Graphify-Labs#1613's
degree-dominance escape hatch, applied here to the pre-existing
score-gap check rather than tier membership). `--force` bypasses the
guard entirely, reproducing pre-Graphify-Labs#1614 behavior.

Full suite (2768 tests, same 1 pre-existing unrelated failure as prior
commits) and ruff pass. Verified live: the repro now lists both tied
candidates instead of silently picking one; `--force` reproduces the
old misleading result for comparison; a genuinely unambiguous path
(identityResolutionJob -> plexProvider) resolves directly with no
false-positive prompt.

Scope: this fixes endpoint disambiguation, not the separate issue that
a correctly-resolved shortest path can still route through irrelevant
test-infrastructure edges when that happens to be graph-theoretically
shortest — that's a distinct, unaddressed problem in path's traversal
itself, not its endpoint resolution.
nokternol added a commit to nokternol/graphify that referenced this pull request Jul 4, 2026
…iling (Graphify-Labs#1616)

`graphify explain "<phrase>"` treats its whole argument as one string that
must match/prefix/substring a single node's label as a whole — so a genuine
natural-language phrase (e.g. "critic score aggregation") returns "No node
matching found" even when every individual word exists on a real, relevant
node, because no node label ever literally contains the entire multi-word
phrase. This silently dead-ends on exactly the query shape `explain` is
otherwise suggested for, with no fallback and no signal that anything went
wrong (worse than noise: a hard, silent zero).

When the tiered lookup finds nothing and the phrase has more than one token,
`explain` now falls back to the same per-token bag-of-words scoring `query`
already uses (`_score_nodes`) and lists the top candidates by term overlap,
in the same numbered-candidate format the existing ambiguity guard (Graphify-Labs#1613)
uses, instead of a bare dead end. A genuine single-word miss is unaffected —
gated on token count, since a one-word probe would score identically to the
substring tier already tried and has nothing new to find.

Regression tests: multi-word phrase with real term overlap surfaces
candidates and excludes unrelated nodes; multi-word phrase with zero overlap
still gets the honest original message; single-word miss is byte-identical
to prior behavior. Full suite (2766 tests, 1 pre-existing unrelated failure)
and ruff pass. Verified live against a real repo's graph.json: both
previously-zero `explain` queries now surface their real target
(`ratingsAggregation.ts`, `backdrops.handler.ts`) instead of nothing.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants