Make eval --resume optional and auto-detect latest incomplete run #842

Merged
willccbb merged 3 commits into resume-evals from codex/make-resume-path-optional-and-auto-detect
Feb 6, 2026

@willccbb commented Feb 6, 2026

Motivation

  • Allow resuming evaluations without always supplying an explicit results path by making --resume accept an optional path and auto-detect the most recent incomplete matching run.
  • Keep existing explicit-path behavior and preserve TOML backward compatibility for resume_path.

Description

  • Replace CLI flag --resume-path with --resume [PATH] so --resume <path> validates and resumes the provided directory while --resume (no path) triggers auto-detection.
  • Add auto-resume utilities in verifiers/utils/path_utils.py: get_eval_runs_dir, _count_saved_rollouts, and find_latest_incomplete_eval_results_path. Together they locate per-env run directories, count completed rollouts, and pick the newest incomplete run that matches on env_id, model, and rollouts_per_example and has a compatible num_examples.
  • Wire auto-detection into the CLI flow in verifiers/scripts/eval.py, including explicit-path validation and logging; keep resume_path in the produced EvalConfig for downstream code.
  • Accept legacy TOML field by allowing resume in load_toml_config and mapping resume_path -> resume for backwards compatibility.
  • Update docs in docs/evaluation.md to document --resume [PATH] and the no-path auto-resume behavior.
  • Add tests: CLI tests updated in tests/test_eval_cli.py (explicit path + auto-detect cases) and new unit tests in tests/test_path_utils.py for candidate selection logic.
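
The `--resume [PATH]` behavior described above can be sketched with `argparse`, where `nargs="?"` makes the value optional. This is a minimal illustration, not the code in verifiers/scripts/eval.py: the sentinel `AUTO_RESUME`, the parser name, and the helper `resolve_resume` are hypothetical names introduced here.

```python
import argparse
from pathlib import Path

# Hypothetical sentinel meaning "--resume was passed without a path".
AUTO_RESUME = "__auto__"


def build_parser() -> argparse.ArgumentParser:
    """Build a tiny parser with an optional-value --resume flag (illustrative only)."""
    parser = argparse.ArgumentParser(prog="eval-sketch")
    parser.add_argument(
        "--resume",
        nargs="?",          # the value may be omitted
        const=AUTO_RESUME,  # used when the flag appears with no value
        default=None,       # used when the flag is absent
        metavar="PATH",
        help="Resume from PATH, or auto-detect the latest incomplete run if PATH is omitted.",
    )
    return parser


def resolve_resume(value):
    """Classify the parsed value as a fresh run, auto-detection, or an explicit path."""
    if value is None:
        return ("fresh", None)
    if value == AUTO_RESUME:
        return ("auto", None)
    return ("explicit", Path(value))
```

With this shape, omitting the flag, passing `--resume` alone, and passing `--resume <path>` give three distinguishable states, so explicit-path validation and auto-detection can branch cleanly afterwards.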

Testing

  • Ran style checks: uv run ruff check --fix verifiers/scripts/eval.py verifiers/utils/path_utils.py verifiers/utils/eval_utils.py tests/test_eval_cli.py tests/test_path_utils.py and the fixes completed successfully.
  • Ran unit tests: uv run pytest tests/test_eval_cli.py tests/test_path_utils.py -q and both test files passed.

Note

Medium Risk
Touches evaluation checkpoint/resume and output path selection; incorrect matching or rollout counting could resume the wrong run or skip/redo work, though changes are well-covered by new tests.

Overview
Adds a new prime eval resume mode by replacing --resume-path with --resume [PATH]: passing a path still validates and resumes that run, while --resume with no path now auto-detects the most recent incomplete matching run.

Implements auto-detection in path_utils (run directory discovery, rollout counting, newest-incomplete selection) and wires it into verifiers/scripts/eval.py, including TOML support (resume plus backward-compatible resume_path) and stricter results-dir validation. Updates evaluation docs and extends tests to cover explicit resume, auto-detect, and TOML override behavior.
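
A rough sketch of the newest-incomplete selection summarized above, under stated assumptions: each run directory is assumed to hold a `metadata.json` and a `results.jsonl` (hypothetical file names), "compatible" `num_examples` is assumed to mean the saved value does not exceed the requested one, and `pick_latest_incomplete_run` is an illustrative name rather than the function in path_utils.

```python
import json
from pathlib import Path


def count_nonblank_lines(path: Path) -> int:
    """Count non-blank lines; a missing file counts as zero saved rollouts."""
    if not path.is_file():
        return 0
    with path.open(encoding="utf-8") as f:
        return sum(1 for line in f if line.strip())


def pick_latest_incomplete_run(runs_dir, env_id, model, rollouts_per_example, num_examples):
    """Return the most recently modified matching run that still has work left, or None."""
    runs_dir = Path(runs_dir)
    if not runs_dir.is_dir():
        return None
    candidates = []
    for run in runs_dir.iterdir():
        meta_file = run / "metadata.json"  # assumed layout
        if not run.is_dir() or not meta_file.is_file():
            continue
        try:
            meta = json.loads(meta_file.read_text(encoding="utf-8"))
        except json.JSONDecodeError:
            continue  # unreadable metadata: not a usable candidate
        if meta.get("env_id") != env_id or meta.get("model") != model:
            continue
        if meta.get("rollouts_per_example") != rollouts_per_example:
            continue
        saved_examples = meta.get("num_examples")
        # Assumption: a run saved with more examples than requested is incompatible.
        if not isinstance(saved_examples, int) or saved_examples > num_examples:
            continue
        done = count_nonblank_lines(run / "results.jsonl")
        if done < num_examples * rollouts_per_example:
            candidates.append(run)
    if not candidates:
        return None
    # Newest by directory modification time.
    return max(candidates, key=lambda p: p.stat().st_mtime)
```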

Written by Cursor Bugbot for commit 08d1c16.

@willccbb merged commit c588afd into resume-evals on Feb 6, 2026
2 of 3 checks passed

@cursor (bot) left a comment

Cursor Bugbot has reviewed your changes and found 2 potential issues.

)
break
count += 1
return count

Rollout counting diverges from load logic on corruption

Low Severity

_count_saved_rollouts unconditionally breaks on the first JSONDecodeError, assuming every malformed line is trailing. The existing load_outputs in save_utils.py explicitly checks whether valid lines exist after a malformed one, and raises for mid-file corruption. This inconsistency means find_latest_incomplete_eval_results_path can select a run with mid-file corruption (since _count_saved_rollouts undercounts), but the actual resume via load_outputs then crashes with an unhandled JSONDecodeError.
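
One way to keep counting consistent with the load behavior described in this comment is to tolerate malformed lines only at the tail and raise when a valid line follows a malformed one. The sketch below illustrates that idea; it is not the code in path_utils.py or save_utils.py, and `count_rollouts_strict` is a hypothetical name.

```python
import json
from pathlib import Path


def count_rollouts_strict(results_file: Path) -> int:
    """Count valid JSONL records, tolerating malformed lines only at the end of the file."""
    count = 0
    first_bad_line = None
    with Path(results_file).open(encoding="utf-8") as f:
        for lineno, line in enumerate(f, start=1):
            if not line.strip():
                continue
            try:
                json.loads(line)
            except json.JSONDecodeError:
                # Remember where corruption started and keep scanning.
                if first_bad_line is None:
                    first_bad_line = lineno
                continue
            if first_bad_line is not None:
                # A valid record after a malformed one means mid-file corruption.
                raise ValueError(
                    f"{results_file}: malformed line {first_bad_line} is followed by "
                    f"valid data at line {lineno}"
                )
            count += 1
    return count
```

A caller doing candidate selection could catch the `ValueError` and skip that run, so a directory that would fail on resume is never chosen.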

)

assert captured["configs"][0].resume_path is not None
assert captured["configs"][0].resume_path.resolve() == new_run.resolve()

Flaky test missing explicit directory timestamp control

Medium Severity

test_cli_resume_auto_detects_latest_incomplete relies on new_run having a later st_mtime than old_run but never sets explicit timestamps. On filesystems with coarse time granularity (e.g. 1-second resolution), both directories can share the same mtime, making the sort order non-deterministic and the assertion on new_run flaky. The analogous test in test_path_utils.py correctly uses os.utime to guarantee ordering.
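
The `os.utime` approach mentioned here can be sketched as follows. The helper names `make_run_dir` and `newest` are hypothetical and stand in for whatever the real tests use; the point is only that explicit timestamps make mtime-based ordering deterministic.

```python
import os
from pathlib import Path


def make_run_dir(parent: Path, name: str, mtime: float) -> Path:
    """Create a run directory and pin its access/modification times explicitly."""
    run = Path(parent) / name
    run.mkdir(parents=True)
    # Set timestamps last: creating files inside the directory afterwards
    # would update its mtime again.
    os.utime(run, (mtime, mtime))
    return run


def newest(dirs):
    """Pick the directory with the largest modification time."""
    return max(dirs, key=lambda p: p.stat().st_mtime)
```

In a test, creating `old_run` and `new_run` with mtimes a fixed number of seconds apart removes any dependence on the filesystem's timestamp granularity.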

mikasenghaas added a commit that referenced this pull request Feb 6, 2026
* attempt 1

* stateful load/save

* functional

* simpler

* remove old stuff

* less git diff

* fix

* update toml config

* refactor to use callbacks consistently

* correct usage of callbacks

* deprecate use_tqdm

* add docs

* fix group increments and progress init

* fix error rate by computing in metadata

* to not trigger assert

* remove hf ref

* do not show tqdm in gepa

* fix(eval): harden resume by tolerating partial JSONL tail and validating metadata

* fix style

* allow increased num_examples

* Fix typo: 'evaluaton' -> 'evaluation' in resume log message

Co-authored-by: will brown <willccbb@users.noreply.github.com>

* Remove unused self.logger from GenerateOutputsBuilder

The constructor created self.logger but it was never used in any method.
The module-level logger is used elsewhere in the file for all logging.

Co-authored-by: will brown <willccbb@users.noreply.github.com>

* Reuse metadata from build_metadata() instead of calling it twice per iteration

The build_metadata() method was called twice per iteration in the
as_completed loop—once to pass to on_progress, and again to save.
Since build_metadata() computes averages over all accumulated outputs,
this duplication was wasteful. Now the metadata computed for on_progress
is reused for the save operation.

Co-authored-by: will brown <willccbb@users.noreply.github.com>

* Make eval `--resume` optional and auto-detect latest incomplete run (#842)

* Add optional --resume auto-detection for eval runs

* Fix resume=false handling and dedupe output path resolution

* Harden eval results path validation to require files

* Fix append handling corrupt outputs

* Fix resume append corruption

* Fix resume output appending

* Fix resume append and typing errors

* set path create time directly

* use -R shorthand for resume, -i for independent scoring

---------

Co-authored-by: hallerite <git@hallerite.com>
Co-authored-by: Cursor Agent <cursoragent@cursor.com>
Co-authored-by: will brown <willccbb@users.noreply.github.com>
Co-authored-by: will brown <williambrown97@gmail.com>