Skip to content

feat(threat-detect): injection/exfil detection with severity tiers + entropy guard (#156)#180

Merged
edheltzel merged 2 commits into
mainfrom
worktree-feat+156-injection-exfil-detection
Jun 24, 2026
Merged

feat(threat-detect): injection/exfil detection with severity tiers + entropy guard (#156)#180
edheltzel merged 2 commits into
mainfrom
worktree-feat+156-injection-exfil-detection

Conversation

@edheltzel

@edheltzel edheltzel commented Jun 24, 2026

Copy link
Copy Markdown
Owner

Closes #156. A detect-and-surface content-threat layer beside scrub()SECRET_PATTERNS / scrub() are untouched. Canonical hooks/lib/threat-detect.ts re-exported via src/lib/threat-detect.ts (mirrors the #50/#157 seam — one pattern source for CLI + MCP + hooks).

Ruling: FLAG ONLY — never mutates, never blocks (RedTeam NO-GO + Ed)

The original entropy tier redacted/blocked anonymous high-entropy tokens. RedTeam proved this corrupts public, non-secret tokens — you can't tell a random secret from a random public ID by entropy + shape (base64url incl. the universal JWT header eyJhbGciOi…, base58 IPFS CIDs, long nanoids all look identical). Ed ruled: demote to flag everywhere. The layer now only surfaces findings; every persisted record is byte-identical on every path, and no write is ever blocked. (Net: mostly deletion — −308/+216.)

What it does

  • 11 injection/exfil regexes (instruction-override ×3, role-hijack ×3, env-exfil ×2, dotenv/netrc/ssh reads ×3) → flag findings. File-read patterns require a shell verb governing the path; env-exfil requires curl/wget + a KEY/TOKEN/SECRET-named interpolation — so prose mentioning ~/.ssh, a .env file, or DB_URL stays clean.
  • Anonymous high-entropy token → flag finding (closes the feat(hooks): minimal write-safety guard — Part of #51 (P0) #128 residual as a surfaced signal, not a redaction). Structural guards (exclude +/=/base64, length 28–72, pure-hex SHAs, UUID, 3-class + ≥3 digits, letter-run, Shannon ≥4.0) are now flag-noise reduction, not a safety boundary.

Surfacing per path (no content injection)

  • session (extract-core): findings in result.threats
  • import-legacy (CLI): non-fatal [WARN] log
  • memory_add (MCP): non-fatal stderr log; the record is stored exactly as provided (no block)

Precision is now total

Nothing mutates or blocks, so a false positive is at worst a spurious flag. Verified by tests:

  • ~/.ssh / .env / DB_URL / quoted "ignore previous instructions" → clean or flag, never altered
  • git SHA / md5 / sha256 / UUID / base64 image blob / camelCase identifier → not even flagged (noise reduction)
  • PUBLIC tokens — universal JWT header, base58 IPFS CIDv0, length-31 nanoid, base64url — are flagged as pure flags (only {category, span}, no action/replacement text) → persist byte-identical, never block
  • deletion locked: no scanForThreats / no redact-block surface is exported

Accepted tradeoff (module header)

Anonymous high-entropy secrets are flagged, not redacted; anchored known-prefix secrets remain scrub()'s job on the scrub paths (unchanged).

Gate

bun run lint clean · bun run build clean · bun test 1164 pass / 0 fail (63 new). Measured-entropy assertions pin the detection boundary.

Flagged out-of-scope gap

memory_add does not call scrub() today — out of scope for #156 (#51 territory); recommend a follow-up issue.

Do not merge — awaiting Themis's re-attack.

…entropy guard (#156)

Add a SEPARATE content-threat detection layer beside scrub() (write-safety.ts
is untouched: SECRET_PATTERNS and scrub() unchanged). Canonical, self-contained
hooks/lib/threat-detect.ts re-exported via src/lib/threat-detect.ts (mirrors the
#50/#157 seam) — one pattern source consumed by CLI, MCP, and hooks.

Precision is earned structurally (this is the #137-class false-positive
minefield): all 11 injection/exfil regexes are flag-tier ONLY, so an
over-matching regex can never mutate or block a legit note — at most a benign
annotation. The only mutating tier is redact (anchorless high-entropy token,
closing the PR #128 residual); the only blocking action is redact-tier on the
explicit memory_add path.

- 11 flag-tier regexes: instruction-override, role-hijack, env-exfil,
  dotenv/netrc/ssh reads. File-read patterns require a shell verb governing the
  path and env-exfil requires curl/wget + a secret-named interpolation, so prose
  mentioning ~/.ssh, a .env file, or DB_URL stays clean.
- Anchorless high-entropy token (redact-tier) via a layered FP-guard chain:
  exclude +/= (base64/data-URI), length 28-72, pure-hex (git SHA/md5/sha256),
  UUID, non-3-class / digit-sparse strings, word-structured identifiers (long
  single-case letter run), then a Shannon-entropy floor. The threshold is pinned
  by MEASURED fixture values in the test corpus, not a magic number.
- Per-path policy: session + import-legacy redact (never block — no block-storms);
  memory_add blocks a suspected bare credential with a fixable error.
- Wired AFTER scrub() on the two scrub paths; memory_add wired so the block tier
  works (memory_add still lacks scrub() — out of scope for #156, follow-up).

Tests: tests/hooks/threat-detect.test.ts (TP + must-not-block corpus, measured
entropy assertions, per-path policy, flag-is-byte-identical) and
tests/lib/threat-detect-single-source.test.ts (src re-export === hooks canonical).
…ate/block (#156)

RedTeam NO-GO + Ed's ruling: you cannot distinguish a random SECRET from a
random PUBLIC identifier by entropy + shape — base64url (incl. the universal JWT
header eyJhbGciOi…), base58 IPFS CIDs, and long nanoids all look identical to an
anonymous secret. Redacting/blocking on that signal silently corrupts public,
non-secret content. So the entire threat-detect layer becomes DETECT-AND-SURFACE:
it never mutates persisted text and never blocks a write.

Mostly deletion:
- Remove the redact tier (span replacement / [THREAT-REDACTED] marker) and the
  block tier (write rejection) entirely. scanForThreats, resolveAction, the
  per-path policy matrix, and the IngestionPath/ResolvedThreat/ThreatScanResult/
  ThreatAction/ThreatSeverity types are gone.
- KEEP detection unchanged: detectThreats(text) still returns flag findings
  (category + span) for the 11 injection/exfil regexes AND anonymous
  high-entropy tokens. The entropy structural guards remain as flag-noise
  reduction (not a safety boundary now).
- Surfacing per path (no content injection): extract-core returns findings in
  result.threats; import-legacy logs a non-fatal [WARN]; memory_add logs a
  non-fatal stderr line. Every persisted record is byte-identical on every path.

Accepted tradeoff (documented in the module header): anonymous high-entropy
secrets are FLAGGED, not redacted; anchored known-prefix secrets remain handled
by scrub() on the scrub paths (unchanged). scrub()/SECRET_PATTERNS untouched.

Tests: base64url token reclassified to flag-and-persist-unchanged; added the
literal JWT header, a base58 IPFS CIDv0, and a length-31 nanoid as
flag-only/persist-byte-identical cases; locked the deletion (no scanForThreats /
no redact-block surface). bun test green, lint + build clean.
@edheltzel edheltzel merged commit 84f0b0a into main Jun 24, 2026
2 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Add injection/exfil detection to the content scanner (split from #50)

1 participant