Skip to content

Adversarial Testing

Abhishek Gahlot edited this page Mar 27, 2026 · 1 revision

Adversarial Testing

DeepGym includes tools to catch reward hacking before you waste GPU hours training against a broken verifier.

Why bother

In RL training, the model optimizes whatever signal you give it. If your verifier has a loophole -- say it only checks output format and not correctness -- the model will find it. Adversarial testing catches these ahead of time.

Attack strategies

Strategy What it does What it catches
empty Submits empty/null code Verifiers that don't check for missing functions
hardcoded Hardcodes expected outputs Verifiers with predictable test cases
trivial Returns placeholder values return 0 passing edge cases
overflow Sends NaN, Inf, huge numbers Missing type validation
pattern Mirrors test structure Verifiers that leak test data
llm_attack Uses Claude to craft adversarial code Subtle logical flaws

Running an audit

CLI

deepgym audit \
  --task "Write coin_change(coins, amount)..." \
  --verifier verifier.py \
  --strategies empty hardcoded trivial overflow pattern \
  --json

Python

from deepgym.adversarial import AdversarialTester
from deepgym import load_environment, DeepGym

dg = DeepGym(mode='local')
env = load_environment('coin_change')

tester = AdversarialTester(dg, pass_threshold=0.5)  # score >= 0.5 = exploit
report = tester.test(env, strategies=['empty', 'hardcoded', 'trivial'])

print(f'Attacks run: {report.attacks_run}')
print(f'Exploits found: {report.exploits_found}')
print(f'Robust: {report.is_robust}')

for attack in report.results:
    status = 'EXPLOITED' if attack.exploited else 'blocked'
    print(f'  {attack.strategy}: {status} (score={attack.score:.2f})')

Verifier auditing

RewardAuditor goes deeper -- it analyzes verifier source code for patterns that suggest vulnerability:

from deepgym.reward_qa import RewardAuditor

auditor = RewardAuditor(dg)
report = auditor.audit(
    env,
    verifier_id='v1',
    benchmark='my-benchmark',
    strategies=['empty', 'hardcoded', 'llm_attack'],
    persist=True,
    db_path=Path('~/.deepgym/exploits.db'),
)

print(f'Risk level: {report.risk_level}')    # low / medium / high / critical
print(f'Risk score: {report.risk_score:.2f}')
print(f'Patterns: {report.patterns}')
for rec in report.recommendations:
    print(f'  - {rec}')

Detected patterns

Pattern What it means
static-inputs No randomization in test cases
few-test-cases Too few assertions
output-only-check Doesn't inspect implementation
no-type-validation Accepts wrong types
predictable-tests Test cases are guessable
structure-leakage Test structure exposed to solution

Risk levels

Level Score Action
Low < 0.35 Verifier is solid
Medium 0.35 - 0.65 Review recommendations
High 0.65 - 0.85 Fix before training
Critical >= 0.85 Don't use for training

Exploit database

Persist audit results to SQLite for tracking across verifier versions:

report = auditor.audit(env, verifier_id='v2', benchmark='coin_change', persist=True)

Clone this wiki locally