
Verifier Protocol

Abhishek Gahlot edited this page Mar 27, 2026 · 1 revision


Verifiers receive model-generated code, run tests, and produce a structured JSON result.

JSON output

Every verifier prints JSON to stdout:

```json
{
  "schema_version": "1.0",
  "score": 0.85,
  "passed": true,
  "details": "17/20 tests passed",
  "reward_components": {
    "correctness": 0.85,
    "efficiency": 0.92
  },
  "metrics": {
    "execution_time_ms": 142,
    "memory_mb": 24
  },
  "seed": 42,
  "truncated": false,
  "error_type": null,
  "cases": [
    {
      "id": "test_0",
      "passed": true,
      "score": 1.0,
      "input_summary": "coins=[1,2,5] amount=11",
      "expected_summary": "3",
      "actual_summary": "3",
      "execution_time_ms": 2.1
    },
    {
      "id": "test_1",
      "passed": false,
      "score": 0.0,
      "input_summary": "coins=[2] amount=3",
      "expected_summary": "-1",
      "actual_summary": "2",
      "error": "Wrong answer"
    }
  ]
}
```
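Consumers of this output can enforce the contract before trusting a score. A minimal sketch (the function name and error handling here are illustrative, not part of DeepGym):

```python
import json

def parse_verifier_output(raw: str) -> dict:
    """Parse a verifier's stdout and enforce the two required fields."""
    result = json.loads(raw)
    if "score" not in result or "passed" not in result:
        raise ValueError("verifier output missing required 'score'/'passed'")
    # Clamp score to [0.0, 1.0], as the protocol specifies.
    result["score"] = max(0.0, min(1.0, float(result["score"])))
    result["passed"] = bool(result["passed"])
    return result

out = parse_verifier_output('{"score": 1.7, "passed": true, "details": "ok"}')
print(out["score"])  # 1.0 after clamping
```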

Required fields

| Field    | Type  | Constraint         |
|----------|-------|--------------------|
| `score`  | float | 0.0 - 1.0, clamped |
| `passed` | bool  |                    |

Optional fields

| Field               | Type               | What it is                                          |
|---------------------|--------------------|-----------------------------------------------------|
| `schema_version`    | str                | Always `"1.0"`                                      |
| `details`           | str                | Human-readable summary                              |
| `reward_components` | dict[str, float]   | Named sub-scores for shaped rewards                 |
| `metrics`           | dict               | Execution metrics                                   |
| `seed`              | int                | Random seed used (reproducibility)                  |
| `truncated`         | bool               | `true` if the run timed out                         |
| `error_type`        | str                | One of `timeout`, `verifier_error`, `sandbox_error` |
| `cases`             | list[CaseResult]   | Per-test breakdown                                  |

Exit codes

| Code | Meaning                                                    |
|------|------------------------------------------------------------|
| 0    | Solution passed                                            |
| 1    | Solution failed (the verifier ran fine; the solution didn't) |
| 2    | The verifier itself errored                                |
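These exit codes can be interpreted mechanically by whatever harness launches the verifier. A self-contained sketch (the toy verifier written here is illustrative; real ones follow Method 2 below):

```python
import json
import os
import subprocess
import sys
import tempfile
import textwrap

# Minimal sketch: write a toy verifier to disk, run it the way a harness
# would, and interpret the exit code per the table above.
verifier_src = textwrap.dedent("""
    import json, sys
    result = {"score": 1.0, "passed": True}
    print(json.dumps(result))
    sys.exit(0 if result["passed"] else 1)
""")
with tempfile.TemporaryDirectory() as tmp:
    vpath = os.path.join(tmp, "verifier.py")
    with open(vpath, "w") as f:
        f.write(verifier_src)
    proc = subprocess.run([sys.executable, vpath, "solution.py"],
                          capture_output=True, text=True)
    if proc.returncode == 2:
        # The verifier itself crashed: don't score the solution at all.
        raise RuntimeError(f"verifier crashed: {proc.stderr}")
    result = json.loads(proc.stdout)  # exit codes 0 and 1 both emit JSON
print(proc.returncode, result["passed"])  # 0 True
```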

Writing verifiers

Method 1: inline function body

Write just the body. DeepGym wraps it into `_run_verifier(solution_path, test_cases_path=None)`.

Return a float, bool, or dict. The wrapper normalizes it to JSON.

```python
env = Environment(
    task='Write `factorial(n)` ...',
    verifier_code='''
import importlib.util
spec = importlib.util.spec_from_file_location("sol", solution_path)
mod = importlib.util.module_from_spec(spec)
spec.loader.exec_module(mod)

cases = [(0, 1), (1, 1), (5, 120), (10, 3628800)]
passed = sum(1 for n, expected in cases if mod.factorial(n) == expected)
return passed / len(cases)
''',
)
```

Return types:

- `bool`: `True` -> score 1.0, `False` -> score 0.0
- `float`: used as the score directly (clamped to 0.0-1.0)
- `dict`: must include `score` and `passed`; everything else is optional
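The wrapper's normalization can be pictured roughly like this (a sketch only; the real wrapper is internal to DeepGym, and treating a score of exactly 1.0 as `passed` is an assumption here):

```python
# Hypothetical sketch of how a Method-1 return value maps to the JSON schema.
def normalize(value):
    if isinstance(value, bool):
        # bool must be checked before int/float: bool is a subclass of int.
        return {"score": 1.0 if value else 0.0, "passed": value}
    if isinstance(value, (int, float)):
        score = max(0.0, min(1.0, float(value)))
        return {"score": score, "passed": score == 1.0}  # assumption: 1.0 => passed
    if isinstance(value, dict):
        if "score" not in value or "passed" not in value:
            raise ValueError("dict results must include 'score' and 'passed'")
        return value
    raise TypeError(f"unsupported return type: {type(value).__name__}")

print(normalize(0.85))  # {'score': 0.85, 'passed': False}
print(normalize(True))  # {'score': 1.0, 'passed': True}
```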

Method 2: standalone verifier file

A full Python script, run as `python verifier.py <solution_path>`:

```python
#!/usr/bin/env python3
import importlib.util
import json
import sys
import random

def verify(solution_path, test_cases_path=None):
    spec = importlib.util.spec_from_file_location('solution', solution_path)
    solution = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(solution)

    seed = 42
    rng = random.Random(seed)

    # fixed test cases
    test_cases = [(0, 1), (1, 1), (5, 120), (10, 3628800)]

    # randomized test cases (deterministic via seed)
    for _ in range(16):
        n = rng.randint(0, 15)
        expected = 1
        for i in range(2, n + 1):
            expected *= i
        test_cases.append((n, expected))

    cases = []
    passed = 0
    for i, (n, expected) in enumerate(test_cases):
        try:
            actual = solution.factorial(n)
            ok = actual == expected
        except Exception as e:
            actual = str(e)
            ok = False

        passed += ok
        cases.append({
            'id': f'test_{i}',
            'passed': ok,
            'score': 1.0 if ok else 0.0,
            'input_summary': f'n={n}',
            'expected_summary': str(expected),
            'actual_summary': str(actual),
        })

    score = passed / len(test_cases)
    return {
        'schema_version': '1.0',
        'score': score,
        'passed': score == 1.0,
        'details': f'{passed}/{len(test_cases)} passed',
        'cases': cases,
        'seed': seed,
    }

if __name__ == '__main__':
    result = verify(sys.argv[1])
    print(json.dumps(result))
    sys.exit(0 if result['passed'] else 1)
```
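The randomized cases in this script are reproducible because they come from a private `random.Random(seed)` rather than the global RNG. Same seed, same test set, every run:

```python
import random

def gen_cases(seed: int = 42, n_random: int = 16):
    # Same pattern as the verifier above: a private, seeded RNG.
    rng = random.Random(seed)
    return [rng.randint(0, 15) for _ in range(n_random)]

# Two independent calls with the same seed produce identical inputs.
print(gen_cases(42) == gen_cases(42))  # True
```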

Shaped rewards

Use `reward_components` for a richer training signal than pass/fail:

```python
return {
    'score': 0.85,
    'passed': False,
    'reward_components': {
        'correctness': 0.8,
        'efficiency': 0.9,
        'style': 1.0,
    },
}
```

Then in your training loop:

```python
result = dg.run(env, solution)
if result.reward_components:
    correctness = result.reward_components.get('correctness', 0.0)
    efficiency = result.reward_components.get('efficiency', 0.0)
    reward = 0.7 * correctness + 0.3 * efficiency
```

Per-test cases

The `cases` field carries most of the training value. Instead of a single 0 or 1, your model learns which specific tests it got right:

```python
result = dg.run(env, solution)
if result.cases:
    for case in result.cases:
        print(f"{case.id}: {'PASS' if case.passed else 'FAIL'} "
              f"(input: {case.input_summary})")
```

Or through the reward function:

```python
from deepgym.integrations.reward import RewardFunction

reward_fn = RewardFunction(env)
per_test = reward_fn.per_test_rewards([solution])
# [{'test_0': 1.0, 'test_1': 0.0, 'test_2': 1.0, 'overall': 0.67}]
```
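Those per-test dicts can then feed a denser objective than the overall score alone. A small sketch, assuming the list-of-dicts shape shown in the comment above:

```python
# per_test as returned above: one dict per solution, test_* keys plus 'overall'
per_test = [{"test_0": 1.0, "test_1": 0.0, "test_2": 1.0, "overall": 0.67}]

for row in per_test:
    test_scores = [v for k, v in row.items() if k.startswith("test_")]
    dense = sum(test_scores) / len(test_scores)
    print(round(dense, 2))  # 0.67
```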

Tips

  1. Use deterministic seeds. GRPO needs reproducibility. Same solution, same score, every time.
  2. 15-20 test cases is a good range. Too few and the verifier is exploitable. Too many and scoring is slow.
  3. Mix fixed and randomized tests. Fixed catches known edge cases. Random prevents hardcoding.
  4. Return per-test cases. Denser reward signal.
  5. Catch exceptions in tests. A crashing verifier should return score 0, not bring down the pipeline.
  6. Run adversarial testing. Use `deepgym audit` before training. See Adversarial Testing.
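Tip 5 cuts both ways: per-test exceptions are caught inside the loop (as in Method 2 above), while verifier-level failures should still emit valid JSON and exit with code 2. A sketch, with an illustrative stub standing in for the real `verify()`:

```python
import json

def verify(solution_path):
    # Stand-in for the real verify() from Method 2 above.
    return {"score": 1.0, "passed": True, "details": "stub"}

def main(argv) -> int:
    try:
        result = verify(argv[1])
        print(json.dumps(result))
        return 0 if result["passed"] else 1
    except Exception as e:
        # The verifier itself broke: report it, score nothing, exit 2.
        print(json.dumps({
            "score": 0.0,
            "passed": False,
            "error_type": "verifier_error",
            "details": f"verifier crashed: {e}",
        }))
        return 2

print(main(["verifier.py", "solution.py"]))  # 0
```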
