Make eval `--resume` optional and auto-detect latest incomplete run by willccbb · Pull Request #842 · PrimeIntellect-ai/verifiers

willccbb · 2026-02-06T12:59:04Z

Motivation

Allow resuming evaluations without always supplying an explicit results path by making --resume accept an optional path and auto-detect the most recent incomplete matching run.
Keep existing explicit-path behavior and preserve TOML backward compatibility for resume_path.

Description

Replace CLI flag --resume-path with --resume [PATH] so --resume <path> validates and resumes the provided directory while --resume (no path) triggers auto-detection.
Add auto-resume utilities in verifiers/utils/path_utils.py: get_eval_runs_dir, _count_saved_rollouts, and find_latest_incomplete_eval_results_path. Together they locate per-env run directories, count completed rollouts, and pick the newest incomplete run that matches env_id, model, and rollouts_per_example and has a compatible num_examples.
Wire auto-detection into the CLI flow in verifiers/scripts/eval.py, including explicit-path validation and logging; keep resume_path in the produced EvalConfig for downstream code.
Accept a resume field in load_toml_config and map the legacy resume_path field to it (resume_path -> resume) for backward compatibility.
Update docs in docs/evaluation.md to document --resume [PATH] and the no-path auto-resume behavior.
Add tests: CLI tests updated in tests/test_eval_cli.py (explicit path + auto-detect cases) and new unit tests in tests/test_path_utils.py for candidate selection logic.
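
The selection rule described for find_latest_incomplete_eval_results_path can be sketched roughly as follows. This is an illustration only, not the code in the PR: the function names, the metadata.json / results.jsonl file names, the metadata keys, and the reading of "num_examples compatibility" as "saved num_examples must not exceed the requested value" are all assumptions.

```python
import json
from pathlib import Path
from typing import Optional


def count_jsonl_lines(path: Path) -> int:
    """Count non-empty lines; a simplified stand-in for the PR's rollout counter."""
    if not path.is_file():
        return 0
    with path.open() as f:
        return sum(1 for line in f if line.strip())


def pick_latest_incomplete_run(
    runs_dir: Path,
    env_id: str,
    model: str,
    num_examples: int,
    rollouts_per_example: int,
) -> Optional[Path]:
    """Return the newest matching run directory that still has rollouts missing."""
    if not runs_dir.is_dir():
        return None
    candidates = []
    for run_dir in runs_dir.iterdir():
        meta_file = run_dir / "metadata.json"  # assumed file name
        if not run_dir.is_dir() or not meta_file.is_file():
            continue
        try:
            meta = json.loads(meta_file.read_text())
        except json.JSONDecodeError:
            continue  # unreadable metadata: not a usable candidate
        if meta.get("env_id") != env_id or meta.get("model") != model:
            continue
        if meta.get("rollouts_per_example") != rollouts_per_example:
            continue
        saved_examples = meta.get("num_examples")
        # Assumed compatibility rule: the saved run may be smaller, never larger.
        if not isinstance(saved_examples, int) or saved_examples > num_examples:
            continue
        saved = count_jsonl_lines(run_dir / "results.jsonl")  # assumed file name
        if saved < num_examples * rollouts_per_example:
            candidates.append(run_dir)
    # Newest by directory mtime; None when nothing qualifies.
    return max(candidates, key=lambda p: p.stat().st_mtime, default=None)
```

Sorting by directory mtime keeps the sketch short, but it also means the result depends on filesystem timestamp resolution when two runs are created close together.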

Testing

Ran style checks: uv run ruff check --fix verifiers/scripts/eval.py verifiers/utils/path_utils.py verifiers/utils/eval_utils.py tests/test_eval_cli.py tests/test_path_utils.py and the fixes completed successfully.
Ran unit tests: uv run pytest tests/test_eval_cli.py tests/test_path_utils.py -q and both test files passed.

Note

Medium Risk
Touches evaluation checkpoint/resume and output path selection; incorrect matching or rollout counting could resume the wrong run or skip/redo work, though changes are well-covered by new tests.

Overview
Adds a new prime eval resume mode by replacing --resume-path with --resume [PATH]: passing a path still validates and resumes that run, while --resume with no path now auto-detects the most recent incomplete matching run.
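
An optional-value flag of this shape is commonly built with argparse's nargs="?". The sketch below shows that general technique, not the parser in verifiers/scripts/eval.py; the program name, the describe helper, and the use of True as the "no path given" sentinel are assumptions made for illustration.

```python
import argparse
from pathlib import Path

parser = argparse.ArgumentParser(prog="eval-sketch")  # hypothetical program name
parser.add_argument(
    "--resume",
    nargs="?",      # the value is optional
    const=True,     # used when the flag is given without a value
    default=None,   # used when the flag is absent
    metavar="PATH",
    help="resume from PATH, or auto-detect the latest incomplete run if PATH is omitted",
)


def describe(argv):
    """Classify an argument list into one of three resume modes."""
    args = parser.parse_args(argv)
    if args.resume is None:
        return ("fresh", None)      # flag absent: start a new run
    if args.resume is True:
        return ("auto", None)       # bare --resume: auto-detect
    return ("explicit", Path(args.resume))  # --resume PATH: use that directory
```

With this setup, describe([]) yields the "fresh" mode, describe(["--resume"]) the "auto" mode, and describe(["--resume", "some/dir"]) the "explicit" mode with the given path.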

Implements auto-detection in path_utils (run directory discovery, rollout counting, newest-incomplete selection) and wires it into verifiers/scripts/eval.py, including TOML support (resume plus backward-compatible resume_path) and stricter results-dir validation. Updates evaluation docs and extends tests to cover explicit resume, auto-detect, and TOML override behavior.
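
The backward-compatible TOML handling mentioned above could look something like the following, applied to a table that has already been parsed from TOML. The helper name is hypothetical, and the precedence rule (an explicit resume key wins over a legacy resume_path key, including resume = false) is an assumption rather than something this summary states.

```python
from typing import Any


def normalize_resume_field(raw: dict[str, Any]) -> dict[str, Any]:
    """Return a copy of a parsed TOML table with `resume_path` folded into `resume`."""
    cfg = dict(raw)
    legacy = cfg.pop("resume_path", None)
    # Assumed precedence: keep an explicit `resume` value (even False) and
    # only fall back to the legacy key when `resume` is not set at all.
    if "resume" not in cfg and legacy is not None:
        cfg["resume"] = legacy
    return cfg
```

Checking for key presence rather than truthiness is what lets resume = false survive the mapping instead of being overwritten by a legacy path.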

Written by Cursor Bugbot for commit 08d1c16.

verifiers/scripts/eval.py

verifiers/utils/path_utils.py

cursor

Cursor Bugbot has reviewed your changes and found 2 potential issues.

cursor · 2026-02-06T13:33:04Z

verifiers/utils/path_utils.py

+                )
+                break
+            count += 1
+    return count


Rollout counting diverges from load logic on corruption

Low Severity

_count_saved_rollouts unconditionally breaks on the first JSONDecodeError, assuming every malformed line is trailing. The existing load_outputs in save_utils.py explicitly checks whether valid lines exist after a malformed one, and raises for mid-file corruption. This inconsistency means find_latest_incomplete_eval_results_path can select a run with mid-file corruption (since _count_saved_rollouts undercounts), but the actual resume via load_outputs then crashes with an unhandled JSONDecodeError.
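
One way to bring the counter in line with the behavior attributed to load_outputs is to tolerate malformed lines only when no valid record follows them. The sketch below illustrates that idea under assumptions: count_rollouts_strict is a hypothetical name, and raising ValueError for mid-file corruption is a choice made here, not the project's actual error handling.

```python
import json
from pathlib import Path


def count_rollouts_strict(results_file: Path) -> int:
    """Count valid JSONL records, tolerating malformed lines only at the tail."""
    count = 0
    first_bad_line = None
    with results_file.open() as f:
        for line_no, line in enumerate(f, start=1):
            if not line.strip():
                continue  # ignore blank lines
            try:
                json.loads(line)
            except json.JSONDecodeError:
                if first_bad_line is None:
                    first_bad_line = line_no
                continue
            if first_bad_line is not None:
                # A valid record after a malformed one is mid-file corruption.
                raise ValueError(
                    f"{results_file}: malformed record at line {first_bad_line} "
                    f"is followed by valid data at line {line_no}"
                )
            count += 1
    return count
```

A caller doing auto-detection could catch the error and skip that run directory, so a corrupted run is never selected for resume.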

cursor · 2026-02-06T13:33:04Z

tests/test_eval_cli.py

+    )
+
+    assert captured["configs"][0].resume_path is not None
+    assert captured["configs"][0].resume_path.resolve() == new_run.resolve()


Flaky test missing explicit directory timestamp control

Medium Severity

test_cli_resume_auto_detects_latest_incomplete relies on new_run having a later st_mtime than old_run but never sets explicit timestamps. On filesystems with coarse time granularity (e.g. 1-second resolution), both directories can share the same mtime, making the sort order non-deterministic and the assertion on new_run flaky. The analogous test in test_path_utils.py correctly uses os.utime to guarantee ordering.
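
A minimal sketch of the os.utime approach the comment refers to, assuming a test that only needs two directories with a known mtime ordering. The make_run_dir helper and the timestamp values are illustrative and not taken from the test suite.

```python
import os
import tempfile
from pathlib import Path


def make_run_dir(parent: Path, name: str, mtime: float) -> Path:
    """Create a directory and pin its access and modification times."""
    run_dir = parent / name
    run_dir.mkdir()
    # Write any files into run_dir before this call: adding directory
    # entries afterwards would update the directory's mtime again.
    os.utime(run_dir, (mtime, mtime))
    return run_dir


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    old_run = make_run_dir(root, "old", 1_700_000_000)
    new_run = make_run_dir(root, "new", 1_700_000_100)
    # The ordering no longer depends on filesystem timestamp granularity.
    newest = max(root.iterdir(), key=lambda p: p.stat().st_mtime)
    assert newest == new_run
```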

* attempt 1
* stateful load/save
* functional
* simpler
* remove old stuff
* less git diff
* fix
* update toml config
* refactor to use callbacks consistently
* correct usage of callbacks
* deprecate use_tqdm
* add docs
* fix group increments and progress init
* fix error rate by computing in metadata
* to not trigger assert
* remove hf ref
* do not show tqdm in gepa
* fix(eval): harden resume by tolerating partial JSONL tail and validating metadata
* fix style
* allow increased num_examples
* Fix typo: 'evaluaton' -> 'evaluation' in resume log message
  Co-authored-by: will brown <willccbb@users.noreply.github.com>
* Remove unused self.logger from GenerateOutputsBuilder
  The constructor created self.logger but it was never used in any method. The module-level logger is used elsewhere in the file for all logging.
  Co-authored-by: will brown <willccbb@users.noreply.github.com>
* Reuse metadata from build_metadata() instead of calling it twice per iteration
  The build_metadata() method was called twice per iteration in the as_completed loop: once to pass to on_progress, and again to save. Since build_metadata() computes averages over all accumulated outputs, this duplication was wasteful. Now the metadata computed for on_progress is reused for the save operation.
  Co-authored-by: will brown <willccbb@users.noreply.github.com>
* Make eval `--resume` optional and auto-detect latest incomplete run (#842)
* Add optional --resume auto-detection for eval runs
* Fix resume=false handling and dedupe output path resolution
* Harden eval results path validation to require files
* Fix append handling corrupt outputs
* Fix resume append corruption
* Fix resume output appending
* Fix resume append and typing errors
* set path create time directly
* use -R shorthand for resume, -i for independent scoring

---------

Co-authored-by: hallerite <git@hallerite.com>
Co-authored-by: Cursor Agent <cursoragent@cursor.com>
Co-authored-by: will brown <willccbb@users.noreply.github.com>
Co-authored-by: will brown <williambrown97@gmail.com>

Add optional --resume auto-detection for eval runs

dd40b23

willccbb added the codex label Feb 6, 2026 — with ChatGPT Codex Connector

cursor bot reviewed Feb 6, 2026

verifiers/scripts/eval.py (outdated, resolved)

verifiers/utils/path_utils.py (resolved)

willccbb added 2 commits February 6, 2026 05:13

Fix resume=false handling and dedupe output path resolution

36ac949

Harden eval results path validation to require files

08d1c16

willccbb merged commit c588afd into resume-evals Feb 6, 2026
2 of 3 checks passed

cursor bot reviewed Feb 6, 2026

willccbb merged 3 commits into resume-evals from codex/make-resume-path-optional-and-auto-detect
