Skip to content

Fix: Wire prompt sanitizer into classification agents to prevent prompt injection#219

Merged
beenuar merged 4 commits into
beenuar:mainfrom
TanmayZade:fix/prompt-injection-classification-agents
May 28, 2026
Merged

Fix: Wire prompt sanitizer into classification agents to prevent prompt injection#219
beenuar merged 4 commits into
beenuar:mainfrom
TanmayZade:fix/prompt-injection-classification-agents

Conversation

@TanmayZade
Copy link
Copy Markdown
Contributor

@TanmayZade TanmayZade commented May 27, 2026

Summary

This PR addresses a critical prompt injection vulnerability where attacker-controlled telemetry could hijack the classification agents (auto_triage, phishing, cloud, identity, insider_threat) and trick them into auto-closing legitimate high-severity alerts.

While the investigator agents were already correctly using the prompt_sanitizer module, the 5 classification agents bypassed it, pasting raw alert fields directly into the LLM prompt.

Changes

  • Input Sanitization: Wired the existing sanitize_text() function into the context-building functions for all 5 classification agents.
  • Trust Boundary Enforcement: Wrapped the final alert telemetry context in <UNTRUSTED_DATA> tags via wrap_untrusted() so the LLM treats it as inert data rather than instructions.

Verification

  • Tested with a malicious injection payload locally. The injection payload is now successfully stripped by sanitize_text (redacted to [REDACTED:INJECTION]) and the LLM correctly evaluates the underlying alert as a true_positive.
  • Validated that the needs_human_review fallback works safely if the LLM errors.

Eval Harness Results

  • Alert Reduction: No regression (CI verified)
  • Investigation Completeness: No regression (CI verified)
  • Response Quality: No regression (CI verified)
  • MITRE Accuracy: No regression (CI verified)
    Note: Because this only adds a sanitization layer around existing LLM contexts, substrate self-consistency and agent accuracy metrics remain unchanged against the synthetic eval corpus.

Fixes #220

@TanmayZade TanmayZade requested a review from beenuar as a code owner May 27, 2026 09:27
@TanmayZade
Copy link
Copy Markdown
Contributor Author

Hi @beenuar,

The Compose Smoke / docker compose up — full stack failure is a pre-existing issue on main. Migration 041_attack_chains.sql uses window as a column name, which is a reserved keyword in PostgreSQL 14+, causing the Postgres container to crash. This is unrelated to the changes in this PR.

@beenuar beenuar merged commit 25e5f52 into beenuar:main May 28, 2026
22 of 23 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Bug]: Prompt Injection in Classification Agents leads to Alert Auto-Close Bypass

2 participants