-
Notifications
You must be signed in to change notification settings - Fork 1
Adversarial Testing
Abhishek Gahlot edited this page Mar 27, 2026
·
1 revision
DeepGym includes tools to catch reward hacking before you waste GPU hours training against a broken verifier.
In RL training, the model optimizes whatever signal you give it. If your verifier has a loophole -- say it only checks output format and not correctness -- the model will find it. Adversarial testing catches these ahead of time.
| Strategy | What it does | What it catches |
|---|---|---|
empty |
Submits empty/null code | Verifiers that don't check for missing functions |
hardcoded |
Hardcodes expected outputs | Verifiers with predictable test cases |
trivial |
Returns placeholder values |
return 0 passing edge cases |
overflow |
Sends NaN, Inf, huge numbers | Missing type validation |
pattern |
Mirrors test structure | Verifiers that leak test data |
llm_attack |
Uses Claude to craft adversarial code | Subtle logical flaws |
deepgym audit \
--task "Write coin_change(coins, amount)..." \
--verifier verifier.py \
--strategies empty hardcoded trivial overflow pattern \
--jsonfrom deepgym.adversarial import AdversarialTester
from deepgym import load_environment, DeepGym
dg = DeepGym(mode='local')
env = load_environment('coin_change')
tester = AdversarialTester(dg, pass_threshold=0.5) # score >= 0.5 = exploit
report = tester.test(env, strategies=['empty', 'hardcoded', 'trivial'])
print(f'Attacks run: {report.attacks_run}')
print(f'Exploits found: {report.exploits_found}')
print(f'Robust: {report.is_robust}')
for attack in report.results:
status = 'EXPLOITED' if attack.exploited else 'blocked'
print(f' {attack.strategy}: {status} (score={attack.score:.2f})')RewardAuditor goes deeper -- it analyzes verifier source code for patterns that suggest vulnerability:
from deepgym.reward_qa import RewardAuditor
auditor = RewardAuditor(dg)
report = auditor.audit(
env,
verifier_id='v1',
benchmark='my-benchmark',
strategies=['empty', 'hardcoded', 'llm_attack'],
persist=True,
db_path=Path('~/.deepgym/exploits.db'),
)
print(f'Risk level: {report.risk_level}') # low / medium / high / critical
print(f'Risk score: {report.risk_score:.2f}')
print(f'Patterns: {report.patterns}')
for rec in report.recommendations:
print(f' - {rec}')| Pattern | What it means |
|---|---|
static-inputs |
No randomization in test cases |
few-test-cases |
Too few assertions |
output-only-check |
Doesn't inspect implementation |
no-type-validation |
Accepts wrong types |
predictable-tests |
Test cases are guessable |
structure-leakage |
Test structure exposed to solution |
| Level | Score | Action |
|---|---|---|
| Low | < 0.35 | Verifier is solid |
| Medium | 0.35 - 0.65 | Review recommendations |
| High | 0.65 - 0.85 | Fix before training |
| Critical | >= 0.85 | Don't use for training |
Persist audit results to SQLite for tracking across verifier versions:
report = auditor.audit(env, verifier_id='v2', benchmark='coin_change', persist=True)