Verifier Protocol
Abhishek Gahlot edited this page Mar 27, 2026
Verifiers receive model-generated code, run tests, and produce a structured JSON result.
Every verifier prints JSON to stdout:
```json
{
  "schema_version": "1.0",
  "score": 0.85,
  "passed": true,
  "details": "17/20 tests passed",
  "reward_components": {
    "correctness": 0.85,
    "efficiency": 0.92
  },
  "metrics": {
    "execution_time_ms": 142,
    "memory_mb": 24
  },
  "seed": 42,
  "truncated": false,
  "error_type": null,
  "cases": [
    {
      "id": "test_0",
      "passed": true,
      "score": 1.0,
      "input_summary": "coins=[1,2,5] amount=11",
      "expected_summary": "3",
      "actual_summary": "3",
      "execution_time_ms": 2.1
    },
    {
      "id": "test_1",
      "passed": false,
      "score": 0.0,
      "input_summary": "coins=[2] amount=3",
      "expected_summary": "-1",
      "actual_summary": "2",
      "error": "Wrong answer"
    }
  ]
}
```

Required fields:

| Field | Type | Constraint |
|---|---|---|
| `score` | float | 0.0 - 1.0, clamped |
| `passed` | bool | |
Optional fields:

| Field | Type | What it is |
|---|---|---|
| `schema_version` | str | Always `"1.0"` |
| `details` | str | Human-readable summary |
| `reward_components` | dict[str, float] | Named sub-scores for shaped rewards |
| `metrics` | dict | Execution metrics |
| `seed` | int | Random seed used (reproducibility) |
| `truncated` | bool | `true` if timed out |
| `error_type` | str | `timeout`, `verifier_error`, `sandbox_error` |
| `cases` | list[CaseResult] | Per-test breakdown |
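The field rules above can be sketched as a small validator. This is a hypothetical helper for illustration only (`normalize_result` is not a DeepGym API), and the defaults it fills are the schema defaults listed in the tables:

```python
def normalize_result(raw: dict) -> dict:
    """Clamp and fill a verifier result per the schema above.

    Hypothetical helper; DeepGym's own normalization may differ.
    """
    result = dict(raw)
    # score is required and clamped to [0.0, 1.0]
    result['score'] = max(0.0, min(1.0, float(result['score'])))
    # passed is required and must be a bool
    result['passed'] = bool(result['passed'])
    # optional fields get schema defaults
    result.setdefault('schema_version', '1.0')
    result.setdefault('truncated', False)
    result.setdefault('error_type', None)
    result.setdefault('cases', [])
    return result


# an out-of-range score gets clamped rather than rejected
normalize_result({'score': 1.7, 'passed': 1})  # score becomes 1.0
```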
Exit codes:

| Code | Meaning |
|---|---|
| 0 | Solution passed |
| 1 | Solution failed (verifier ran fine, solution didn't) |
| 2 | Verifier itself errored |
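If you run a verifier by hand, the exit codes above map naturally onto subprocess handling. A minimal sketch, assuming a `verifier.py` that follows this protocol (the helper name and timeout are illustrative, not DeepGym APIs):

```python
import json
import subprocess
import sys


def run_verifier(verifier_path, solution_path):
    # Run the verifier as its own process, per the protocol
    proc = subprocess.run(
        [sys.executable, verifier_path, solution_path],
        capture_output=True, text=True, timeout=60,
    )
    if proc.returncode == 2:
        # Verifier itself errored: stdout may not contain valid JSON
        raise RuntimeError(f'verifier error: {proc.stderr}')
    # Exit codes 0 and 1 both mean the verifier ran and printed JSON
    return json.loads(proc.stdout)
```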
Write just the body. DeepGym wraps it into `_run_verifier(solution_path, test_cases_path=None)`. Return a `float`, `bool`, or `dict`; the wrapper normalizes it to JSON.
```python
env = Environment(
    task='Write `factorial(n)` ...',
    verifier_code='''
import importlib.util

spec = importlib.util.spec_from_file_location("sol", solution_path)
mod = importlib.util.module_from_spec(spec)
spec.loader.exec_module(mod)

cases = [(0, 1), (1, 1), (5, 120), (10, 3628800)]
passed = sum(1 for n, expected in cases if mod.factorial(n) == expected)
return passed / len(cases)
''',
)
```

Return types:

- `bool`: `True` -> score 1.0, `False` -> score 0.0
- `float`: used as the score directly (clamped to 0.0-1.0)
- `dict`: must have `score` and `passed`; the rest is optional
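The normalization the wrapper applies can be sketched like this. It is a hypothetical reimplementation of the rules above, not DeepGym's actual code; in particular, how `passed` is derived from a bare float return is not specified by the protocol, so the rule below is an assumption:

```python
def normalize_return(value):
    """Turn a body-style verifier's return value into a result dict.

    Sketch of the return-type rules above; not DeepGym's actual wrapper.
    """
    # check bool before float/int: isinstance(True, int) is True
    if isinstance(value, bool):
        return {'score': 1.0 if value else 0.0, 'passed': value}
    if isinstance(value, (int, float)):
        score = max(0.0, min(1.0, float(value)))
        # ASSUMPTION: pass only on a perfect score; the real rule is unspecified
        return {'score': score, 'passed': score == 1.0}
    if isinstance(value, dict):
        if 'score' not in value or 'passed' not in value:
            raise ValueError("dict results need 'score' and 'passed'")
        return value
    raise TypeError(f'unsupported return type: {type(value).__name__}')
```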
A full Python script that runs as `python verifier.py <solution_path>`:

```python
#!/usr/bin/env python3
import importlib.util
import json
import random
import sys


def verify(solution_path, test_cases_path=None):
    spec = importlib.util.spec_from_file_location('solution', solution_path)
    solution = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(solution)

    seed = 42
    rng = random.Random(seed)

    # fixed test cases
    test_cases = [(0, 1), (1, 1), (5, 120), (10, 3628800)]

    # randomized test cases (deterministic via seed)
    for _ in range(16):
        n = rng.randint(0, 15)
        expected = 1
        for i in range(2, n + 1):
            expected *= i
        test_cases.append((n, expected))

    cases = []
    passed = 0
    for i, (n, expected) in enumerate(test_cases):
        try:
            actual = solution.factorial(n)
            ok = actual == expected
        except Exception as e:
            actual = str(e)
            ok = False
        passed += ok
        cases.append({
            'id': f'test_{i}',
            'passed': ok,
            'score': 1.0 if ok else 0.0,
            'input_summary': f'n={n}',
            'expected_summary': str(expected),
            'actual_summary': str(actual),
        })

    score = passed / len(test_cases)
    return {
        'schema_version': '1.0',
        'score': score,
        'passed': score == 1.0,
        'details': f'{passed}/{len(test_cases)} passed',
        'cases': cases,
        'seed': seed,
    }


if __name__ == '__main__':
    result = verify(sys.argv[1])
    print(json.dumps(result))
    sys.exit(0 if result['passed'] else 1)
```

Use `reward_components` for a richer training signal than pass/fail:
```python
return {
    'score': 0.85,
    'passed': False,
    'reward_components': {
        'correctness': 0.8,
        'efficiency': 0.9,
        'style': 1.0,
    },
}
```

Then in your training loop:

```python
result = dg.run(env, solution)
if result.reward_components:
    correctness = result.reward_components.get('correctness', 0.0)
    efficiency = result.reward_components.get('efficiency', 0.0)
    reward = 0.7 * correctness + 0.3 * efficiency
```

The `cases` field is where the real value is. Instead of a single 0 or 1, your model learns which specific tests it got right:
```python
result = dg.run(env, solution)
if result.cases:
    for case in result.cases:
        print(f"{case.id}: {'PASS' if case.passed else 'FAIL'} "
              f"(input: {case.input_summary})")
```

Or through the reward function:

```python
from deepgym.integrations.reward import RewardFunction

reward_fn = RewardFunction(env)
per_test = reward_fn.per_test_rewards([solution])
# [{'test_0': 1.0, 'test_1': 0.0, 'test_2': 1.0, 'overall': 0.67}]
```

Best practices:

- Use deterministic seeds. GRPO needs reproducibility: same solution, same score, every time.
- 15-20 test cases is a good range. Too few and the verifier is exploitable; too many and scoring is slow.
- Mix fixed and randomized tests. Fixed tests catch known edge cases; random tests prevent hardcoding.
- Return per-test `cases` for a denser reward signal.
- Catch exceptions in tests. A crashing verifier should return score 0, not bring down the pipeline.
- Run adversarial testing with `deepgym audit` before training. See Adversarial Testing.
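The determinism point is easy to check mechanically before training. A sketch of such a pre-flight check (`assert_deterministic` is a hypothetical helper, not a DeepGym API; it assumes the verifier is a standalone script following this protocol):

```python
import json
import subprocess
import sys


def assert_deterministic(verifier_path, solution_path, runs=3):
    """Run a verifier several times and check the score never changes.

    A simple pre-training sanity check; hypothetical, not a DeepGym API.
    """
    scores = set()
    for _ in range(runs):
        proc = subprocess.run(
            [sys.executable, verifier_path, solution_path],
            capture_output=True, text=True,
        )
        scores.add(json.loads(proc.stdout)['score'])
    # a seeded verifier must produce exactly one distinct score
    assert len(scores) == 1, f'non-deterministic scores: {scores}'
```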