Runtime detector for reward hacking and misalignment in LLM agents (89.7% F1 on 5,391 trajectories).
RewardHackWatch is an open-source reward hacking detection tool for LLM agents. It detects when AI agents learn to game their reward signals and tracks whether these behaviors generalize to broader misalignment patterns like alignment faking and sabotage.
Built on findings from Anthropic's November 2025 paper showing that reward hacking correlates with emergent misalignment.
RewardHackWatch: Runtime Detection of Reward Hacking and Misalignment Generalization in LLM Agents
We present RewardHackWatch, a multi-layer detection system achieving 89.7% F1 score on 5,391 real agent trajectories, with a novel RMGI metric for tracking hack-to-misalignment transitions.
| Metric | Value |
|---|---|
| F1 Score | 89.7% |
| Precision | 89.7% |
| Recall | 89.7% |
| Accuracy | 99.3% |
| 5-Fold CV | 87.4% +/- 2.9% |
| Statistical Significance | p < 0.001 vs baselines |
Validated on 5,391 real MALT trajectories.
- Reward Hacking Detection - Fine-tuned DistilBERT classifier with high precision
- Misalignment Tracking - Novel RMGI metric tracks hack-to-misalignment transitions
- Pattern Analysis - 45 regex patterns for interpretability and forensic analysis
- LLM Judge Comparison - Claude API and local Llama for cross-validation
- Real-time Monitoring - FastAPI server and Streamlit dashboard for production use
- CPU Optimized - ~50ms inference latency, no GPU required
# From source
git clone https://github.com/aerosta/rewardhackwatch.git
cd rewardhackwatch
pip install -e .
# With all extras
pip install -e ".[dev,dashboard]"The trained model is hosted on HuggingFace:
# Using the library (recommended)
from rewardhackwatch import RewardHackDetector
detector = RewardHackDetector() # Auto-downloads from HuggingFace# Or load directly with Transformers
from transformers import AutoTokenizer, AutoModelForSequenceClassification
tokenizer = AutoTokenizer.from_pretrained("aerosta/rewardhackwatch")
model = AutoModelForSequenceClassification.from_pretrained("aerosta/rewardhackwatch")Note: Model uses 0.02 threshold, not 0.5. See HuggingFace page for details.
from rewardhackwatch import RewardHackDetector
detector = RewardHackDetector()
# Analyze agent trajectory
result = detector.analyze({
"cot_traces": ["Let me bypass the test by calling sys.exit(0)..."],
"code_outputs": ["import sys\nsys.exit(0)"]
})
print(f"Risk Level: {result.risk_level}")
print(f"ML Score: {result.ml_score:.3f}")
print(f"Detections: {len(result.detections)}")# Analyze a trajectory file
rewardhackwatch analyze trajectory.json
# Scan a directory
rewardhackwatch scan ./trajectories/
# Start monitoring server
rewardhackwatch serve --port 8000
# Launch dashboard
rewardhackwatch dashboard
# Check version
rewardhackwatch version# Start server
uvicorn rewardhackwatch.api.main:app --port 8000
# Health check
curl http://localhost:8000/status
# Analyze trajectory
curl -X POST http://localhost:8000/analyze \
-H "Content-Type: application/json" \
-d '{"cot_traces": ["sys.exit(0) bypasses tests"], "code_outputs": ["sys.exit(0)"]}'
# Batch analysis
curl -X POST http://localhost:8000/analyze/batch \
-H "Content-Type: application/json" \
-d '{"trajectories": [...]}'Interactive Streamlit dashboard for real-time monitoring with 5 tabs:
rewardhackwatch dashboard
# Or directly:
streamlit run rewardhackwatch/dashboard/app.py| Tab | Description |
|---|---|
| Timeline | Real-time hack and misalignment scores with RMGI tracking |
| Alerts | Flagged events with severity levels and timestamps |
| CoT Viewer | Chain-of-thought analysis with highlighted concerns |
| Judge Comparison | Claude vs Llama judge outputs side-by-side |
| Quick Analysis | Paste trajectory for instant analysis |
RMGI tracks the correlation between reward hacking and misalignment behaviors over trajectory steps. The transition from reward hacking to generalized misalignment is detected when RMGI exceeds 0.7.
Multi-layer detection architecture combining pattern matching (interpretability), ML classification (primary signal), and RMGI tracking.
The ML classifier achieves 89.7% F1, significantly outperforming pattern-only (4.9%) and keyword baselines (0.1%).
Detection performance by hack type. Test manipulation achieves 100% F1, while mock exploits show 0% F1 due to data scarcity (addressed in v1.1).
Precision, recall, and F1 as functions of classification threshold. Optimal threshold is 0.02, calibrated for 3.6% base rate.
Model calibration analysis showing well-calibrated probabilities (Brier score: 0.033, ECE: 0.027).
rewardhackwatch/
core/
detectors/ # Pattern (45 regex) + ML (DistilBERT) detection
judges/ # Claude and Llama LLM judges
trackers/ # RMGI generalization tracking, PELT changepoint
training/ # DistilBERT model training pipeline
api/ # FastAPI REST server + WebSocket
dashboard/ # Streamlit visualization (5 tabs)
cli.py # Command line interface
models/ # Local cache (~254MB, hosted on HuggingFace)
Trajectory Input
|
v
+------------------+
| Detection Layer |
+------------------+
| - ML Classifier | <-- Primary signal (89.7% F1)
| - Pattern Detect | <-- Interpretability layer
| - AST Analyzer |
+--------+---------+
|
v
+------------------+
| Judgment Layer |
+------------------+
| - Claude Judge | <-- Cloud, high accuracy
| - Llama Judge | <-- Local, cost-efficient
+--------+---------+
|
v
+------------------+
| Tracking Layer |
+------------------+
| - RMGI Metric | <-- Correlation tracking
| - PELT Detection | <-- Changepoint detection
+------------------+
|
v
Alert/Report
The Reward-Misalignment Generalization Index detects when reward hacking begins correlating with misalignment:
RMGI(i, W) = correlation(H[i-W+1:i], M[i-W+1:i])
Where:
- H = hack score time series (from ML classifier)
- M = misalignment score time series (from LLM judges)
- W = sliding window size (default: 10)
Transition detected when:
- RMGI > 0.7 (correlation threshold)
- Mean hack score > 0.5
- Mean misalignment score > 0.5
- All conditions hold for 5+ consecutive steps
Note: RMGI measures correlation, not causation. High scores indicate elevated risk warranting human review.
# LLM Judges
export ANTHROPIC_API_KEY="your-key" # For Claude judge
export OLLAMA_HOST="http://localhost:11434" # For Llama judge
# Detection thresholds
export RHW_HACK_THRESHOLD="0.02" # Calibrated for 3.6% base rate
export RHW_RMGI_WINDOW="10"
export RHW_RMGI_THRESHOLD="0.7"The default threshold (0.02) is calibrated for our dataset. For production deployment:
from rewardhackwatch import RewardHackDetector
detector = RewardHackDetector()
# Calibrate on your clean data
detector.calibrate_threshold(clean_trajectories, percentile=99)- Technical Report - Full methodology and experiments
- RMGI Methodology - Generalization index specification
- Architecture - System design with Mermaid diagrams
- Related Work - Literature review
- References - Citations
| Method | F1 | Precision | Recall |
|---|---|---|---|
| Keyword Matching | 0.1% | 0.1% | 5.1% |
| Regex Patterns | 4.9% | 2.6% | 38.5% |
| BoW + LogReg | 7.0% | 4.0% | 28.2% |
| DistilBERT (Ours) | 89.7% | 89.7% | 89.7% |
| Window | Threshold | Recall | FPR |
|---|---|---|---|
| 10 | 0.7 | 70% | 4.3% |
| 10 | 0.6 | 70% | 5.7% |
| 20 | 0.7 | 54.5% | 2.9% |
| Configuration | F1 | Notes |
|---|---|---|
| ML Only | 89.7% | Best performance |
| ML + Pattern (OR) | 23.3% | Patterns add false positives |
| Pattern Only | 4.9% | Low precision |
Key finding: Pattern detector serves interpretability, not primary detection.
- Core detection system with 89.7% F1
- RMGI transition tracking (70% recall @ 4.3% FPR)
- 45 regex patterns for interpretability
- Dual LLM judges (Claude + Llama)
- Streamlit dashboard (5 tabs)
- FastAPI + CLI
- HuggingFace model release (aerosta/rewardhackwatch)
- Improved mock exploit coverage (synthetic data via red-teaming)
- Full mypy type annotations
- Dynamic threshold calibration API
- Multi-language support
- Adversarial robustness testing
- Activation-level detection
- Multi-agent scenarios
- Mock Exploits: 0% F1 due to data scarcity (planned fix in v1.1 via synthetic data generation)
- Threshold Sensitivity: Requires recalibration for different base rates
- Correlation vs Causation: RMGI detects correlation, not causal links
- Covert Misalignment: May miss fully hidden deceptive reasoning
See paper Limitations section for details.
- GitHub Issues - Bug reports
- GitHub Discussions - Questions & ideas
MIT License - see LICENSE for details.
Copyright (c) 2025 Aerosta
@article{aerosta2025rewardhackwatch,
title={RewardHackWatch: Runtime Detection of Reward Hacking and
Misalignment Generalization in LLM Agents},
author={Aerosta},
journal={manuscript},
year={2025},
url={https://github.com/aerosta/rewardhackwatch}
}Based on research from:
- Anthropic: Natural emergent misalignment from reward hacking (Nov 2025)
- OpenAI: Monitoring reasoning models for misbehavior (Mar 2025)
- Betley et al.: Emergent misalignment (2025)
- Denison et al.: Sycophancy to subterfuge (2024)
Built for AI safety researchers and engineers
reward hacking, reward hacking detection, LLM safety, AI alignment, misalignment detection, RLHF safety, reward tampering, specification gaming, AI safety tools, machine learning safety, language model monitoring, agent safety, reward model, alignment research












