Adversarial Testing

DeepGym includes tools to catch reward hacking before you waste GPU hours training against a broken verifier.

Why bother

In RL training, the model optimizes whatever signal you give it. If your verifier has a loophole -- say it only checks output format and not correctness -- the model will find it. Adversarial testing catches these ahead of time.

Attack strategies

Strategy	What it does	What it catches
`empty`	Submits empty/null code	Verifiers that don't check for missing functions
`hardcoded`	Hardcodes expected outputs	Verifiers with predictable test cases
`trivial`	Returns placeholder values	`return 0` passing edge cases
`overflow`	Sends NaN, Inf, huge numbers	Missing type validation
`pattern`	Mirrors test structure	Verifiers that leak test data
`llm_attack`	Uses Claude to craft adversarial code	Subtle logical flaws

Running an audit

CLI

deepgym audit \
  --task "Write coin_change(coins, amount)..." \
  --verifier verifier.py \
  --strategies empty hardcoded trivial overflow pattern \
  --json

Python

from deepgym.adversarial import AdversarialTester
from deepgym import load_environment, DeepGym

dg = DeepGym(mode='local')
env = load_environment('coin_change')

tester = AdversarialTester(dg, pass_threshold=0.5)  # score >= 0.5 = exploit
report = tester.test(env, strategies=['empty', 'hardcoded', 'trivial'])

print(f'Attacks run: {report.attacks_run}')
print(f'Exploits found: {report.exploits_found}')
print(f'Robust: {report.is_robust}')

for attack in report.results:
    status = 'EXPLOITED' if attack.exploited else 'blocked'
    print(f'  {attack.strategy}: {status} (score={attack.score:.2f})')

Verifier auditing

RewardAuditor goes deeper -- it analyzes verifier source code for patterns that suggest vulnerability:

from deepgym.reward_qa import RewardAuditor

auditor = RewardAuditor(dg)
report = auditor.audit(
    env,
    verifier_id='v1',
    benchmark='my-benchmark',
    strategies=['empty', 'hardcoded', 'llm_attack'],
    persist=True,
    db_path=Path('~/.deepgym/exploits.db'),
)

print(f'Risk level: {report.risk_level}')    # low / medium / high / critical
print(f'Risk score: {report.risk_score:.2f}')
print(f'Patterns: {report.patterns}')
for rec in report.recommendations:
    print(f'  - {rec}')

Detected patterns

Pattern	What it means
`static-inputs`	No randomization in test cases
`few-test-cases`	Too few assertions
`output-only-check`	Doesn't inspect implementation
`no-type-validation`	Accepts wrong types
`predictable-tests`	Test cases are guessable
`structure-leakage`	Test structure exposed to solution

Risk levels

Level	Score	Action
Low	< 0.35	Verifier is solid
Medium	0.35 - 0.65	Review recommendations
High	0.65 - 0.85	Fix before training
Critical	>= 0.85	Don't use for training

Exploit database

Persist audit results to SQLite for tracking across verifier versions:

report = auditor.audit(env, verifier_id='v2', benchmark='coin_change', persist=True)

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Adversarial Testing

Adversarial Testing

Why bother

Attack strategies

Running an audit

CLI

Python

Verifier auditing

Detected patterns

Risk levels

Exploit database

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Clone this wiki locally