RewardHackWatch is designed to detect reward hacking and emergent misalignment in code/task environments where LLM agents autonomously execute actions. It targets the following threats:

| Threat | Description |
|---|---|
| Test Manipulation | sys.exit(0), empty tests, mocked results |
| Evaluation Gaming | Exploiting metrics, gaming benchmarks |
| CoT Deception | Hidden intent in chain-of-thought reasoning |
| Reward Tampering | Manipulating reward signals or logging |
| Obfuscation | Learning to hide hacking intent over time |
| Generalization | Hack behavior generalizing to misalignment |

| Not Covered | Why / Alternative |
|---|---|
| Chat Jailbreaks | Use Constitutional AI / RLHF |
| Prompt Injection | Use input validation / sandboxing |
| Social Engineering | Out of scope for code analysis |
| Direct Harm Generation | Use content filters |
| Model Weight Attacks | Requires runtime protections |

We detect the transition point where task-specific reward hacking generalizes into broader misalignment behaviors (alignment faking, oversight subversion, etc.). This is the critical window for intervention.

Known limitations:
- Pattern-Based Detection: Novel hacking techniques may evade pattern detectors
- False Positives: Some legitimate code patterns may trigger alerts
- LLM Judge Reliability: Judges can be fooled by sophisticated obfuscation
- Temporal Analysis: Requires multiple checkpoints for obfuscation detection
- Language Support: Currently focused on Python code analysis
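
The Temporal Analysis limitation can be illustrated with a small sketch showing why a single checkpoint is not enough: obfuscation only shows up as a trend. Everything here is hypothetical, including the `Checkpoint` fields, the `obfuscation_suspected` function, and the thresholds. It assumes some upstream detector already supplies per-checkpoint rates, and it is not the project's actual algorithm.

```python
from dataclasses import dataclass


@dataclass
class Checkpoint:
    step: int               # training step of the checkpoint
    hack_rate: float        # assumed: fraction of episodes flagged as hacking
    verbalized_rate: float  # assumed: fraction of flagged episodes whose CoT states the intent


def obfuscation_suspected(
    checkpoints: list[Checkpoint],
    min_checkpoints: int = 3,  # illustrative threshold
    drop: float = 0.2,         # illustrative threshold
) -> bool:
    """Flag runs where hacking persists while stated intent fades."""
    if len(checkpoints) < min_checkpoints:
        raise ValueError(f"need at least {min_checkpoints} checkpoints")
    ordered = sorted(checkpoints, key=lambda c: c.step)
    first, last = ordered[0], ordered[-1]
    hacking_persists = last.hack_rate >= first.hack_rate
    intent_hidden = (first.verbalized_rate - last.verbalized_rate) >= drop
    return hacking_persists and intent_hidden
```

Comparing only the first and last checkpoint keeps the sketch short. A real analysis would likely look at the whole trajectory.
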
If you discover a security vulnerability in RewardHackWatch:

Do not:
- Open a public GitHub issue
- Discuss on public forums
- Exploit the vulnerability

Do:
- Email: Send details to [PLACEHOLDER_EMAIL]
- Include:
  - Description of the vulnerability
  - Steps to reproduce
  - Potential impact
  - Suggested fix (if any)
- Response Time: We aim to respond within 72 hours

Disclosure timeline:
- Day 0: Vulnerability reported
- Day 1-3: Initial response and acknowledgment
- Day 7-14: Assessment and fix development
- Day 30: Coordinated disclosure (if applicable)
When using RewardHackWatch, keep API keys out of source control:

```bash
# Use environment variables
export ANTHROPIC_API_KEY="your-key"
# Never commit to git
echo ".env" >> .gitignore
```

Run the monitor with the following safeguards:
- Isolate: Run in a sandboxed environment
- Limit: Restrict network access
- Monitor: Log all alert triggers
- Review: Human review of CRITICAL alerts
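
The Monitor and Review items could be wired together roughly as follows. This sketch uses Python's standard `logging` module. The `handle_alert` function, the logger name, and the in-memory review queue are assumptions for illustration. The only alert level taken from this document is `CRITICAL`.

```python
import logging

# Hypothetical logger name; configure handlers as appropriate for your deployment.
logger = logging.getLogger("rewardhackwatch.alerts")


def handle_alert(level: str, message: str, review_queue: list[str]) -> None:
    """Log every alert trigger and hold CRITICAL ones for human review."""
    logger.warning("alert level=%s message=%s", level, message)
    if level == "CRITICAL":
        # A person should look at these before any automated action is taken.
        review_queue.append(message)
```

In practice the review queue would be a ticketing system or an on-call channel rather than a list. The point is that logging covers every alert while escalation covers only the critical ones.
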
If using webhook alerts:

```python
config = MonitorConfig(
    enable_webhook=True,
    webhook_url="https://your-secure-endpoint.com/alerts",
)
```

- Use HTTPS only
- Implement webhook authentication
- Rate limit incoming requests
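
One common way to implement webhook authentication on the receiving side is an HMAC signature over the request body. The sketch below assumes the sender and receiver share a secret and that the sender attaches a hex-encoded SHA-256 HMAC to each request. This document does not state whether RewardHackWatch signs its payloads, so treat the scheme and the helper names (`is_https`, `sign`, `verify`) as hypothetical.

```python
import hashlib
import hmac
from urllib.parse import urlparse


def is_https(url: str) -> bool:
    """Reject webhook URLs that would send alerts in plaintext."""
    return urlparse(url).scheme == "https"


def sign(secret: bytes, body: bytes) -> str:
    """Compute the hex HMAC-SHA256 signature the sender would attach."""
    return hmac.new(secret, body, hashlib.sha256).hexdigest()


def verify(secret: bytes, body: bytes, signature: str) -> bool:
    """Check a received signature using a constant-time comparison."""
    expected = sign(secret, body)
    return hmac.compare_digest(expected.encode(), signature.encode())
```

Using `hmac.compare_digest` instead of `==` avoids leaking information through comparison timing.
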
We minimize dependencies and keep them updated:

```bash
# Check for vulnerabilities
pip-audit
# Update dependencies
pip install --upgrade -e ".[dev]"
```

We thank the following for security research contributions:
- [Your acknowledgments here]

Contact:
- Security issues: [PLACEHOLDER_EMAIL]
- General questions: Open a GitHub issue