Merged

32 commits
ea8d2de
attempt 1
mikasenghaas Jan 29, 2026
394dc00
stateful load/save
mikasenghaas Jan 29, 2026
8642aac
functional
mikasenghaas Jan 29, 2026
c5cebdb
simpler
mikasenghaas Jan 29, 2026
4b0ea24
remove old stuff
mikasenghaas Jan 29, 2026
ef51ce6
less git diff
mikasenghaas Jan 29, 2026
3999037
fix
mikasenghaas Jan 29, 2026
e73288a
update toml config
mikasenghaas Jan 29, 2026
c6d50a5
refactor to use callbacks consistently
mikasenghaas Feb 2, 2026
2cf9e62
correct usage of callbacks
mikasenghaas Feb 2, 2026
a94a622
deprecate use_tqdm
mikasenghaas Feb 2, 2026
a854aba
add docs
mikasenghaas Feb 2, 2026
c03d7e2
fix group increments and progress init
mikasenghaas Feb 2, 2026
9ffd82b
fix error rate by computing in metadata
mikasenghaas Feb 2, 2026
6b36e9e
to not trigger assert
mikasenghaas Feb 2, 2026
0f6ec75
remove hf ref
mikasenghaas Feb 2, 2026
2b171c1
do not show tqdm in gepa
mikasenghaas Feb 2, 2026
8c34eb3
Merge remote-tracking branch 'origin/main' into resume-evals
hallerite Feb 5, 2026
721fb30
fix(eval): harden resume by tolerating partial JSONL tail and validat…
hallerite Feb 5, 2026
a486abd
fix style
hallerite Feb 5, 2026
15d2e21
allow increased num_examples
hallerite Feb 6, 2026
49fd285
Fix typo: 'evaluaton' -> 'evaluation' in resume log message
cursoragent Feb 6, 2026
800d891
Remove unused self.logger from GenerateOutputsBuilder
cursoragent Feb 6, 2026
05089df
Reuse metadata from build_metadata() instead of calling it twice per …
cursoragent Feb 6, 2026
c588afd
Make eval `--resume` optional and auto-detect latest incomplete run (…
willccbb Feb 6, 2026
355b998
mc
willccbb Feb 6, 2026
78f31b6
Fix append handling corrupt outputs
willccbb Feb 6, 2026
59c02f1
Fix resume append corruption
willccbb Feb 6, 2026
2d2737f
Fix resume output appending
willccbb Feb 6, 2026
eb54360
Fix resume append and typing errors
willccbb Feb 6, 2026
723c4bd
set path create time directly
mikasenghaas Feb 6, 2026
e6276bc
use -R shorthand for resume, -i for independent scoring
mikasenghaas Feb 6, 2026
54 changes: 52 additions & 2 deletions docs/evaluation.md
@@ -11,6 +11,7 @@ This section explains how to run evaluations with Verifiers environments. See [E
- [Evaluation Scope](#evaluation-scope)
- [Concurrency](#concurrency)
- [Output and Saving](#output-and-saving)
- [Resuming Evaluations](#resuming-evaluations)
- [Environment Defaults](#environment-defaults)
- [Multi-Environment Evaluation](#multi-environment-evaluation)
- [TOML Configuration](#toml-configuration)
@@ -124,6 +125,7 @@ Multiple rollouts per example enable metrics like pass@k and help measure varian
| `--max-concurrent-generation` | — | same as `-c` | Concurrent generation requests |
| `--max-concurrent-scoring` | — | same as `-c` | Concurrent scoring requests |
| `--no-interleave-scoring` | `-N` | false | Disable interleaved scoring |
| `--independent-scoring` | `-i` | false | Score each rollout individually instead of by group |
| `--max-retries` | — | 0 | Retries per rollout on transient `InfraError` |

By default, scoring runs interleaved with generation. Use `--no-interleave-scoring` to score all rollouts after generation completes.
@@ -138,12 +140,60 @@ The `--max-retries` flag enables automatic retry with exponential backoff when r
| `--tui` | `-u` | false | Use alternate screen mode (TUI) for display |
| `--debug` | `-d` | false | Disable Rich display; use normal logging and tqdm progress |
| `--save-results` | `-s` | false | Save results to disk |
| `--save-every` | `-f` | -1 | Save checkpoint every N rollouts |
| `--resume [PATH]` | `-R` | — | Resume from a previous run (auto-detects the latest matching incomplete run if PATH is omitted) |
| `--state-columns` | `-C` | — | Extra state columns to save (comma-separated) |
| `--save-to-hf-hub` | `-H` | false | Push results to Hugging Face Hub |
| `--hf-hub-dataset-name` | `-D` | — | Dataset name for HF Hub |

Results are saved to `./outputs/evals/{env_id}--{model}/` as a Hugging Face dataset.
Results are saved to `./outputs/evals/{env_id}--{model}/{run_id}/`, containing:

- `results.jsonl` — rollout outputs, one per line
- `metadata.json` — evaluation configuration and aggregate metrics

### Resuming Evaluations

Long-running evaluations can be interrupted and resumed using checkpointing. When `--save-results` is enabled, results are saved incrementally after each completed group of rollouts. Use `--resume` to continue from where you left off. Pass a path to resume a specific run, or omit the path to auto-detect the latest incomplete matching run.

**Running with checkpoints:**

```bash
prime eval run my-env -n 1000 -s
```

With `-s` (save results) enabled, partial results are written to disk after each group completes. If the evaluation is interrupted, the output directory will contain all completed rollouts up until the interruption.

**Resuming from a checkpoint:**

```bash
prime eval run my-env -n 1000 -s --resume ./environments/my_env/outputs/evals/my-env--openai--gpt-4.1-mini/abc12345
```

When a resume path is provided, it must point to a valid evaluation results directory containing both `results.jsonl` and `metadata.json`. With `--resume` and no path, verifiers scans the environment/model output directory and picks the most recent incomplete run that matches `env_id`, `model`, and `rollouts_per_example` and whose saved `num_examples` is less than or equal to the current run's. When resuming:

1. Existing completed rollouts are loaded from the checkpoint
2. Remaining rollouts are computed from the example IDs and group size
3. Only incomplete rollouts are executed
4. New results are appended to the existing checkpoint

If all rollouts are already complete, the evaluation returns immediately with the existing results.
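
The remaining-rollout computation in step 2 can be sketched as follows. This is a simplified illustration, assuming each saved output line records its `example_id`; `remaining_rollouts` is a hypothetical helper, not the actual implementation:

```python
from collections import Counter


def remaining_rollouts(
    completed: list[dict],
    num_examples: int,
    rollouts_per_example: int,
) -> list[tuple[int, int]]:
    """Return (example_id, rollout_index) pairs that still need to run."""
    done = Counter(o["example_id"] for o in completed)
    todo = []
    for example_id in range(num_examples):
        # Schedule only the rollouts this example is still missing.
        for rollout_index in range(done[example_id], rollouts_per_example):
            todo.append((example_id, rollout_index))
    return todo
```

For example, with `rollouts_per_example=2` and two saved rollouts for example 0 but one for example 1, only example 1's second rollout (and any untouched examples) would be scheduled.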

**Configuration compatibility:**

When resuming, the current run configuration should match the original run. Mismatches in parameters like `--model`, `--env-args`, or `--rollouts-per-example` can lead to undefined behavior. For reliable results, resume with the same configuration used to create the checkpoint; the one safe change is increasing `--num-examples` when you need additional examples beyond the original target.
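
A compatibility check along these lines is what the tests in this PR exercise (they expect a `ValueError` matching "metadata mismatch"); the exact fields compared here are an assumption, and `check_resume_compatible` is a hypothetical name:

```python
def check_resume_compatible(saved: dict, current: dict) -> None:
    """Raise ValueError if a saved run cannot be resumed with the current config."""
    # These parameters must match exactly between the two runs.
    for key in ("env_id", "model", "rollouts_per_example"):
        if saved.get(key) != current.get(key):
            raise ValueError(
                f"metadata mismatch on {key!r}: "
                f"saved {saved.get(key)!r} != current {current.get(key)!r}"
            )
    # num_examples may grow (to add examples beyond the original target) but not shrink.
    if current.get("num_examples", 0) < saved.get("num_examples", 0):
        raise ValueError("metadata mismatch: num_examples is smaller than the saved run's")
```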

**Example workflow:**

```bash
# Start a large evaluation with checkpointing
prime eval run math-python -n 500 -r 3 -s

# If interrupted, find the run directory
ls ./environments/math_python/outputs/evals/math-python--openai--gpt-4.1-mini/

# Resume from the checkpoint
prime eval run math-python -n 500 -r 3 -s \
--resume ./environments/math_python/outputs/evals/math-python--openai--gpt-4.1-mini/abc12345
```

The `--state-columns` flag allows saving environment-specific state fields that your environment stores during rollouts:

3 changes: 1 addition & 2 deletions docs/reference.md
@@ -598,11 +598,10 @@ class EvalConfig(BaseModel):
independent_scoring: bool = False
extra_env_kwargs: dict = {}
max_retries: int = 0
print_results: bool = False
verbose: bool = False
state_columns: list[str] | None = None
save_results: bool = False
save_every: int = -1
resume_path: Path | None = None
save_to_hf_hub: bool = False
hf_hub_dataset_name: str | None = None
```
32 changes: 32 additions & 0 deletions tests/test_environment_extra.py
@@ -12,6 +12,7 @@
from __future__ import annotations

import asyncio
import json
from typing import Callable

import pytest
@@ -222,3 +223,34 @@ def test_make_dataset_basic_without_tools(make_metadata, make_output):
results = GenerateOutputs(outputs=[make_output()], metadata=make_metadata())
ds = build_dataset(results)
assert len(ds) == 1 and "foo" in ds.column_names


@pytest.mark.asyncio
async def test_generate_resume_raises_on_metadata_mismatch(
tmp_path, mock_openai_client, make_dummy_env, make_input
):
env = make_dummy_env(mock_openai_client)

results_path = tmp_path / "resume"
results_path.mkdir()
(results_path / "results.jsonl").write_text("", encoding="utf-8")
(results_path / "metadata.json").write_text(
json.dumps(
{
"env_id": env.env_id,
"model": "test-model",
"num_examples": 2,
"rollouts_per_example": 1,
}
),
encoding="utf-8",
)

inputs = [make_input(example_id=0)]
with pytest.raises(ValueError, match="metadata mismatch"):
await env.generate(
inputs=inputs,
client=mock_openai_client,
model="test-model",
results_path=results_path,
)
94 changes: 94 additions & 0 deletions tests/test_eval_cli.py
@@ -1,5 +1,7 @@
import argparse
import os
import tempfile
import time
from pathlib import Path
from types import SimpleNamespace

@@ -42,6 +44,7 @@ def _run_cli(monkeypatch, overrides, capture_all_configs: bool = False):
"no_interleave_scoring": False,
"state_columns": [],
"save_results": False,
"resume": None,
"save_every": -1,
"save_to_hf_hub": False,
"hf_hub_dataset_name": "",
@@ -459,3 +462,94 @@ def test_load_toml_config_invalid_global_field():
f.flush()
with pytest.raises(ValueError):
load_toml_config(Path(f.name))


def test_cli_resume_explicit_path(monkeypatch, run_cli, tmp_path: Path):
"""--resume with explicit path sets resume_path."""
resume_dir = tmp_path / "resume"
resume_dir.mkdir(parents=True)
(resume_dir / "results.jsonl").write_text("", encoding="utf-8")
(resume_dir / "metadata.json").write_text("{}", encoding="utf-8")

captured = run_cli(
monkeypatch,
{
"resume": str(resume_dir),
},
)

assert captured["configs"][0].resume_path == resume_dir


def test_cli_resume_auto_detects_latest_incomplete(
monkeypatch, run_cli, tmp_path: Path
):
"""--resume with no path auto-detects latest matching incomplete run."""
env_id = "dummy-env"
model = "gpt-4.1-mini"
run_base = tmp_path / "outputs" / "evals" / f"{env_id}--{model.replace('/', '--')}"
old_run = run_base / "oldrun"
new_run = run_base / "newrun"
old_run.mkdir(parents=True)
new_run.mkdir(parents=True)

metadata = (
'{"env_id":"dummy-env","model":"gpt-4.1-mini",'
'"num_examples":4,"rollouts_per_example":1}'
)
(old_run / "metadata.json").write_text(metadata, encoding="utf-8")
(new_run / "metadata.json").write_text(metadata, encoding="utf-8")

(old_run / "results.jsonl").write_text('{"example_id":0}\n', encoding="utf-8")
(new_run / "results.jsonl").write_text(
'{"example_id":0}\n{"example_id":1}\n', encoding="utf-8"
)
now = time.time()
os.utime(old_run, (now, now))
os.utime(new_run, (now + 1, now + 1))

monkeypatch.chdir(tmp_path)
captured = run_cli(
monkeypatch,
{
"resume": True,
"num_examples": 4,
"rollouts_per_example": 1,
"env_dir_path": str(tmp_path / "environments"),
},
)

assert captured["configs"][0].resume_path is not None
assert captured["configs"][0].resume_path.resolve() == new_run.resolve()


def test_cli_toml_resume_false_disables_global_resume(monkeypatch, run_cli):
"""Per-eval resume=false overrides global resume=true in TOML configs."""
with tempfile.NamedTemporaryFile(suffix=".toml", delete=False, mode="w") as f:
f.write(
"resume = true\n"
"\n"
"[[eval]]\n"
'env_id = "env-a"\n'
"\n"
"[[eval]]\n"
'env_id = "env-b"\n'
"resume = false\n"
)
f.flush()
captured = run_cli(
monkeypatch,
{
"env_id_or_config": f.name,
"num_examples": 1,
"rollouts_per_example": 1,
"env_dir_path": "./environments",
},
)

configs = captured["configs"]
assert len(configs) == 2
assert configs[0].env_id == "env-a"
assert configs[0].resume_path is None
assert configs[1].env_id == "env-b"
assert configs[1].resume_path is None
94 changes: 94 additions & 0 deletions tests/test_path_utils.py
@@ -0,0 +1,94 @@
import os
from pathlib import Path

from verifiers.utils.path_utils import (
find_latest_incomplete_eval_results_path,
is_valid_eval_results_path,
)


def test_find_latest_incomplete_eval_results_path_picks_newest_matching(
tmp_path: Path, monkeypatch
):
env_id = "dummy-env"
model = "openai/gpt-4.1-mini"
runs_dir = (
tmp_path
/ "outputs"
/ "evals"
/ f"{env_id}--{model.replace('/', '--')}"
)

old_run = runs_dir / "11111111"
new_run = runs_dir / "22222222"
complete_run = runs_dir / "33333333"
for run in [old_run, new_run, complete_run]:
run.mkdir(parents=True)

metadata = (
'{"env_id":"dummy-env","model":"openai/gpt-4.1-mini",'
'"num_examples":4,"rollouts_per_example":1}'
)
for run in [old_run, new_run, complete_run]:
(run / "metadata.json").write_text(metadata, encoding="utf-8")

(old_run / "results.jsonl").write_text('{"example_id":0}\n', encoding="utf-8")
(new_run / "results.jsonl").write_text(
'{"example_id":0}\n{"example_id":1}\n', encoding="utf-8"
)
(complete_run / "results.jsonl").write_text(
'{"example_id":0}\n{"example_id":1}\n{"example_id":2}\n{"example_id":3}\n',
encoding="utf-8",
)

os.utime(old_run, (1, 1))
os.utime(new_run, (2, 2))
os.utime(complete_run, (3, 3))

monkeypatch.chdir(tmp_path)

result = find_latest_incomplete_eval_results_path(
env_id=env_id,
model=model,
num_examples=4,
rollouts_per_example=1,
env_dir_path=str(tmp_path / "environments"),
)

assert result is not None
assert result.resolve() == new_run.resolve()


def test_find_latest_incomplete_eval_results_path_returns_none_when_no_match(
tmp_path: Path, monkeypatch
):
monkeypatch.chdir(tmp_path)

result = find_latest_incomplete_eval_results_path(
env_id="dummy-env",
model="openai/gpt-4.1-mini",
num_examples=4,
rollouts_per_example=1,
env_dir_path=str(tmp_path / "environments"),
)
assert result is None


def test_is_valid_eval_results_path_requires_files(tmp_path: Path):
run_dir = tmp_path / "run"
run_dir.mkdir()

(run_dir / "results.jsonl").mkdir()
(run_dir / "metadata.json").mkdir()

assert not is_valid_eval_results_path(run_dir)


def test_is_valid_eval_results_path_accepts_expected_layout(tmp_path: Path):
run_dir = tmp_path / "run"
run_dir.mkdir()

(run_dir / "results.jsonl").write_text("", encoding="utf-8")
(run_dir / "metadata.json").write_text("{}", encoding="utf-8")

assert is_valid_eval_results_path(run_dir)