Verifier Protocol
Abhishek Gahlot edited this page Mar 27, 2026
Verifiers receive model-generated code, run tests, and produce a structured JSON result.
Every verifier prints JSON to stdout:
```json
{
  "schema_version": "1.0",
  "score": 0.85,
  "passed": true,
  "details": "17/20 tests passed",
  "reward_components": {
    "correctness": 0.85,
    "efficiency": 0.92
  },
  "metrics": {
    "execution_time_ms": 142,
    "memory_mb": 24
  },
  "seed": 42,
  "truncated": false,
  "error_type": null,
  "cases": [
    {
      "id": "test_0",
      "passed": true,
      "score": 1.0,
      "input_summary": "coins=[1,2,5] amount=11",
      "expected_summary": "3",
      "actual_summary": "3",
      "execution_time_ms": 2.1
    },
    {
      "id": "test_1",
      "passed": false,
      "score": 0.0,
      "input_summary": "coins=[2] amount=3",
      "expected_summary": "-1",
      "actual_summary": "2",
      "error": "Wrong answer"
    }
  ]
}
```

Required fields:

| Field | Type | Constraint |
|---|---|---|
| `score` | float | 0.0 - 1.0, clamped |
| `passed` | bool | |
Optional fields:

| Field | Type | What it is |
|---|---|---|
| `schema_version` | str | Always `"1.0"` |
| `details` | str | Human-readable summary |
| `reward_components` | dict[str, float] | Named sub-scores for shaped rewards |
| `metrics` | dict | Execution metrics |
| `seed` | int | Random seed used (reproducibility) |
| `truncated` | bool | `true` if timed out |
| `error_type` | str | `timeout`, `verifier_error`, `sandbox_error` |
| `cases` | list[CaseResult] | Per-test breakdown |
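The field rules above can be sketched as a small validator. This is a hypothetical helper for illustration only (`normalize_result` is not a DeepGym API), and the defaults it fills are the schema defaults listed in the tables:

```python
def normalize_result(raw: dict) -> dict:
    """Clamp and fill a verifier result per the schema above.

    Hypothetical helper; DeepGym's own normalization may differ.
    """
    result = dict(raw)
    # score is required and clamped to [0.0, 1.0]
    result['score'] = max(0.0, min(1.0, float(result['score'])))
    # passed is required and must be a bool
    result['passed'] = bool(result['passed'])
    # optional fields get schema defaults
    result.setdefault('schema_version', '1.0')
    result.setdefault('truncated', False)
    result.setdefault('error_type', None)
    result.setdefault('cases', [])
    return result


# an out-of-range score gets clamped rather than rejected
normalize_result({'score': 1.7, 'passed': 1})  # score becomes 1.0
```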
Exit codes:

| Code | Meaning |
|---|---|
| 0 | Solution passed |
| 1 | Solution failed (verifier ran fine, solution didn't) |
| 2 | Verifier itself errored |
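If you run a verifier by hand, the exit codes above map naturally onto subprocess handling. A minimal sketch, assuming a `verifier.py` that follows this protocol (the helper name and timeout are illustrative, not DeepGym APIs):

```python
import json
import subprocess
import sys


def run_verifier(verifier_path, solution_path):
    # Run the verifier as its own process, per the protocol
    proc = subprocess.run(
        [sys.executable, verifier_path, solution_path],
        capture_output=True, text=True, timeout=60,
    )
    if proc.returncode == 2:
        # Verifier itself errored: stdout may not contain valid JSON
        raise RuntimeError(f'verifier error: {proc.stderr}')
    # Exit codes 0 and 1 both mean the verifier ran and printed JSON
    return json.loads(proc.stdout)
```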
Write just the body. DeepGym wraps it into `_run_verifier(solution_path, test_cases_path=None)`. Return a `float`, `bool`, or `dict`; the wrapper normalizes it to JSON.
```python
env = Environment(
    task='Write `factorial(n)` ...',
    verifier_code='''
import importlib.util

spec = importlib.util.spec_from_file_location("sol", solution_path)
mod = importlib.util.module_from_spec(spec)
spec.loader.exec_module(mod)

cases = [(0, 1), (1, 1), (5, 120), (10, 3628800)]
passed = sum(1 for n, expected in cases if mod.factorial(n) == expected)
return passed / len(cases)
''',
)
```

Return types:

- `bool`: `True` -> score 1.0, `False` -> score 0.0
- `float`: used as the score directly (clamped to 0.0-1.0)
- `dict`: must have `score` and `passed`; the rest is optional
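The normalization the wrapper applies can be sketched like this. It is a hypothetical reimplementation of the rules above, not DeepGym's actual code; in particular, how `passed` is derived from a bare float return is not specified by the protocol, so the rule below is an assumption:

```python
def normalize_return(value):
    """Turn a body-style verifier's return value into a result dict.

    Sketch of the return-type rules above; not DeepGym's actual wrapper.
    """
    # check bool before float/int: isinstance(True, int) is True
    if isinstance(value, bool):
        return {'score': 1.0 if value else 0.0, 'passed': value}
    if isinstance(value, (int, float)):
        score = max(0.0, min(1.0, float(value)))
        # ASSUMPTION: pass only on a perfect score; the real rule is unspecified
        return {'score': score, 'passed': score == 1.0}
    if isinstance(value, dict):
        if 'score' not in value or 'passed' not in value:
            raise ValueError("dict results need 'score' and 'passed'")
        return value
    raise TypeError(f'unsupported return type: {type(value).__name__}')
```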
A full Python script that runs as `python verifier.py <solution_path>`:

```python
#!/usr/bin/env python3
import importlib.util
import json
import random
import sys


def verify(solution_path, test_cases_path=None):
    spec = importlib.util.spec_from_file_location('solution', solution_path)
    solution = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(solution)

    seed = 42
    rng = random.Random(seed)

    # fixed test cases
    test_cases = [(0, 1), (1, 1), (5, 120), (10, 3628800)]

    # randomized test cases (deterministic via seed)
    for _ in range(16):
        n = rng.randint(0, 15)
        expected = 1
        for i in range(2, n + 1):
            expected *= i
        test_cases.append((n, expected))

    cases = []
    passed = 0
    for i, (n, expected) in enumerate(test_cases):
        try:
            actual = solution.factorial(n)
            ok = actual == expected
        except Exception as e:
            actual = str(e)
            ok = False
        passed += ok
        cases.append({
            'id': f'test_{i}',
            'passed': ok,
            'score': 1.0 if ok else 0.0,
            'input_summary': f'n={n}',
            'expected_summary': str(expected),
            'actual_summary': str(actual),
        })

    score = passed / len(test_cases)
    return {
        'schema_version': '1.0',
        'score': score,
        'passed': score == 1.0,
        'details': f'{passed}/{len(test_cases)} passed',
        'cases': cases,
        'seed': seed,
    }


if __name__ == '__main__':
    result = verify(sys.argv[1])
    print(json.dumps(result))
    sys.exit(0 if result['passed'] else 1)
```

Use `reward_components` for a richer training signal than pass/fail:
```python
return {
    'score': 0.85,
    'passed': False,
    'reward_components': {
        'correctness': 0.8,
        'efficiency': 0.9,
        'style': 1.0,
    },
}
```

Then in your training loop:

```python
result = dg.run(env, solution)
if result.reward_components:
    correctness = result.reward_components.get('correctness', 0.0)
    efficiency = result.reward_components.get('efficiency', 0.0)
    reward = 0.7 * correctness + 0.3 * efficiency
```

The `cases` field is where the real value is. Instead of a single 0 or 1, your model learns which specific tests it got right:
```python
result = dg.run(env, solution)
if result.cases:
    for case in result.cases:
        print(f"{case.id}: {'PASS' if case.passed else 'FAIL'} "
              f"(input: {case.input_summary})")
```

Or through the reward function:

```python
from deepgym.integrations.reward import RewardFunction

reward_fn = RewardFunction(env)
per_test = reward_fn.per_test_rewards([solution])
# [{'test_0': 1.0, 'test_1': 0.0, 'test_2': 1.0, 'overall': 0.67}]
```

Best practices:

- Use deterministic seeds. GRPO needs reproducibility: same solution, same score, every time.
- 15-20 test cases is a good range. Too few and the verifier is exploitable; too many and scoring is slow.
- Mix fixed and randomized tests. Fixed tests catch known edge cases; random tests prevent hardcoding.
- Return per-test `cases` for a denser reward signal.
- Catch exceptions in tests. A crashing verifier should return score 0, not bring down the pipeline.
- Run adversarial testing with `deepgym audit` before training. See Adversarial Testing.
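The determinism point is easy to check mechanically before training. A sketch of such a pre-flight check (`assert_deterministic` is a hypothetical helper, not a DeepGym API; it assumes the verifier is a standalone script following this protocol):

```python
import json
import subprocess
import sys


def assert_deterministic(verifier_path, solution_path, runs=3):
    """Run a verifier several times and check the score never changes.

    A simple pre-training sanity check; hypothetical, not a DeepGym API.
    """
    scores = set()
    for _ in range(runs):
        proc = subprocess.run(
            [sys.executable, verifier_path, solution_path],
            capture_output=True, text=True,
        )
        scores.add(json.loads(proc.stdout)['score'])
    # a seeded verifier must produce exactly one distinct score
    assert len(scores) == 1, f'non-deterministic scores: {scores}'
```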