Make eval --resume optional and auto-detect latest incomplete run #842
Cursor Bugbot has reviewed your changes and found 2 potential issues.
)
break
count += 1
return count
Rollout counting diverges from load logic on corruption
Low Severity
_count_saved_rollouts unconditionally breaks on the first JSONDecodeError, assuming every malformed line is trailing. The existing load_outputs in save_utils.py explicitly checks whether valid lines exist after a malformed one, and raises for mid-file corruption. This inconsistency means find_latest_incomplete_eval_results_path can select a run with mid-file corruption (since _count_saved_rollouts undercounts), but the actual resume via load_outputs then crashes with an unhandled JSONDecodeError.
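One way to keep counting aligned with the load behavior described above is to tolerate malformed lines only when nothing valid follows them. The sketch below is hypothetical: `count_saved_rollouts` is an illustrative stand-in, not the PR's `_count_saved_rollouts`, and raising `ValueError` is an assumption about how mid-file corruption might be reported.

```python
import json
from pathlib import Path


def count_saved_rollouts(results_file: Path) -> int:
    """Count valid JSONL records, tolerating only a malformed tail.

    Hypothetical helper: it mirrors the behavior the review attributes to
    load_outputs (reject mid-file corruption), not the actual code.
    """
    count = 0
    malformed_at = None  # 1-based line number of the first malformed line
    with results_file.open("r", encoding="utf-8") as f:
        for line_no, line in enumerate(f, start=1):
            if not line.strip():
                continue  # blank lines are neither records nor corruption
            try:
                json.loads(line)
            except json.JSONDecodeError:
                if malformed_at is None:
                    malformed_at = line_no
                continue
            if malformed_at is not None:
                # A valid record after a malformed one means the damage is
                # mid-file, not a partially written tail.
                raise ValueError(
                    f"{results_file}: malformed JSON on line {malformed_at} "
                    "is followed by valid records"
                )
            count += 1
    return count
```

With this shape, a caller selecting resume candidates could catch the error and skip the run instead of picking it and failing later during the actual load.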
)

assert captured["configs"][0].resume_path is not None
assert captured["configs"][0].resume_path.resolve() == new_run.resolve()
Flaky test missing explicit directory timestamp control
Medium Severity
test_cli_resume_auto_detects_latest_incomplete relies on new_run having a later st_mtime than old_run but never sets explicit timestamps. On filesystems with coarse time granularity (e.g. 1-second resolution), both directories can share the same mtime, making the sort order non-deterministic and the assertion on new_run flaky. The analogous test in test_path_utils.py correctly uses os.utime to guarantee ordering.
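A minimal sketch of the `os.utime` approach the comment refers to is shown here. The helper names `set_mtime` and `make_ordered_runs` and the timestamp values are illustrative assumptions, not code from the test suite.

```python
import os
from pathlib import Path


def set_mtime(path: Path, timestamp: float) -> None:
    """Pin atime and mtime so ordering never depends on the wall clock."""
    os.utime(path, (timestamp, timestamp))


def make_ordered_runs(base: Path) -> tuple[Path, Path]:
    """Create two run directories with a guaranteed mtime ordering."""
    old_run = base / "old_run"
    new_run = base / "new_run"
    old_run.mkdir(parents=True)
    new_run.mkdir(parents=True)
    # Set timestamps last: creating files inside a directory afterwards
    # would bump its mtime again.
    set_mtime(old_run, 1_700_000_000)
    set_mtime(new_run, 1_700_000_100)
    return old_run, new_run
```

In a pytest test this could be called with `tmp_path` after the run directories have been populated, so the 100-second gap survives even on filesystems with 1-second resolution.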
* attempt 1
* stateful load/save
* functional
* simpler
* remove old stuff
* less git diff
* fix
* update toml config
* refactor to use callbacks consistently
* correct usage of callbacks
* deprecate use_tqdm
* add docs
* fix group increments and progress init
* fix error rate by computing in metadata
* to not trigger assert
* remove hf ref
* do not show tqdm in gepa
* fix(eval): harden resume by tolerating partial JSONL tail and validating metadata
* fix style
* allow increased num_examples
* Fix typo: 'evaluaton' -> 'evaluation' in resume log message

  Co-authored-by: will brown <willccbb@users.noreply.github.com>
* Remove unused self.logger from GenerateOutputsBuilder

  The constructor created self.logger but it was never used in any method. The module-level logger is used elsewhere in the file for all logging.

  Co-authored-by: will brown <willccbb@users.noreply.github.com>
* Reuse metadata from build_metadata() instead of calling it twice per iteration

  The build_metadata() method was called twice per iteration in the as_completed loop: once to pass to on_progress, and again to save. Since build_metadata() computes averages over all accumulated outputs, this duplication was wasteful. Now the metadata computed for on_progress is reused for the save operation.

  Co-authored-by: will brown <willccbb@users.noreply.github.com>
* Make eval `--resume` optional and auto-detect latest incomplete run (#842)
  * Add optional --resume auto-detection for eval runs
  * Fix resume=false handling and dedupe output path resolution
  * Harden eval results path validation to require files
  * Fix append handling corrupt outputs
  * Fix resume append corruption
  * Fix resume output appending
  * Fix resume append and typing errors
  * set path create time directly
  * use -R shorthand for resume, -i for independent scoring

---------

Co-authored-by: hallerite <git@hallerite.com>
Co-authored-by: Cursor Agent <cursoragent@cursor.com>
Co-authored-by: will brown <willccbb@users.noreply.github.com>
Co-authored-by: will brown <williambrown97@gmail.com>


Motivation
* Make `--resume` accept an optional path and auto-detect the most recent incomplete matching run.
* [...] `resume_path`.

Description
* Replace `--resume-path` with `--resume [PATH]` so `--resume <path>` validates and resumes the provided directory while `--resume` (no path) triggers auto-detection.
* Add helpers in `verifiers/utils/path_utils.py`: `get_eval_runs_dir`, `_count_saved_rollouts`, and `find_latest_incomplete_eval_results_path`, which locate per-env run directories, count completed rollouts, and pick the newest incomplete run matching `env_id`, `model`, `rollouts_per_example`, and `num_examples` compatibility.
* Wire auto-detection into `verifiers/scripts/eval.py`, including explicit-path validation and logging; keep `resume_path` in the produced `EvalConfig` for downstream code.
* Accept `resume` in `load_toml_config` and map `resume_path` -> `resume` for backwards compatibility.
* Update `docs/evaluation.md` to document `--resume [PATH]` and the no-path auto-resume behavior.
* Extend `tests/test_eval_cli.py` (explicit path + auto-detect cases) and add new unit tests in `tests/test_path_utils.py` for candidate selection logic.

Testing
* Ran `uv run ruff check --fix verifiers/scripts/eval.py verifiers/utils/path_utils.py verifiers/utils/eval_utils.py tests/test_eval_cli.py tests/test_path_utils.py` and the fixes completed successfully.
* Ran `uv run pytest tests/test_eval_cli.py tests/test_path_utils.py -q` and both test files passed.

Codex Task
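The candidate selection outlined in the Description can be sketched roughly as follows. This is an illustration under stated assumptions, not the PR's `find_latest_incomplete_eval_results_path`. The file names `metadata.json` and `results.jsonl`, the metadata keys, and the reading of "`num_examples` compatibility" as "the saved run requested no more examples than the current one" are all assumptions, and rollouts are counted as non-blank lines for brevity.

```python
import json
from pathlib import Path
from typing import Optional


def find_latest_incomplete(
    runs_dir: Path,
    env_id: str,
    model: str,
    rollouts_per_example: int,
    num_examples: int,
) -> Optional[Path]:
    """Return the newest matching run dir that still has rollouts missing."""
    if not runs_dir.is_dir():
        return None
    candidates = [p for p in runs_dir.iterdir() if p.is_dir()]
    # Newest first, by directory modification time.
    candidates.sort(key=lambda p: p.stat().st_mtime, reverse=True)
    for run in candidates:
        meta_file = run / "metadata.json"  # assumed file name
        results_file = run / "results.jsonl"  # assumed file name
        if not (meta_file.is_file() and results_file.is_file()):
            continue
        try:
            meta = json.loads(meta_file.read_text(encoding="utf-8"))
        except json.JSONDecodeError:
            continue
        if meta.get("env_id") != env_id or meta.get("model") != model:
            continue
        if meta.get("rollouts_per_example") != rollouts_per_example:
            continue
        # Assumed compatibility rule: the saved run may have asked for
        # fewer examples than the current one, but not more.
        saved_examples = meta.get("num_examples")
        if not isinstance(saved_examples, int) or saved_examples > num_examples:
            continue
        with results_file.open(encoding="utf-8") as f:
            saved = sum(1 for line in f if line.strip())
        if saved < num_examples * rollouts_per_example:
            return run
    return None
```

Sorting by `st_mtime` is what makes explicit timestamps matter in tests: two directories created within the same clock tick would otherwise tie.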
Note
Medium Risk
Touches evaluation checkpoint/resume and output path selection; incorrect matching or rollout counting could resume the wrong run or skip/redo work, though changes are well-covered by new tests.
Overview
Adds a new `prime eval` resume mode by replacing `--resume-path` with `--resume [PATH]`: passing a path still validates and resumes that run, while `--resume` with no path now auto-detects the most recent incomplete matching run.

Implements auto-detection in `path_utils` (run directory discovery, rollout counting, newest-incomplete selection) and wires it into `verifiers/scripts/eval.py`, including TOML support (`resume` plus backward-compatible `resume_path`) and stricter results-dir validation. Updates evaluation docs and extends tests to cover explicit resume, auto-detect, and TOML override behavior.

Written by Cursor Bugbot for commit 08d1c16.
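A flag that takes an optional value, as `--resume [PATH]` does, is commonly expressed in `argparse` with `nargs="?"` and a `const` sentinel. The sketch below shows that general pattern only; `build_parser`, `interpret_resume`, the `eval-sketch` program name, and the use of `True` as the sentinel are illustrative assumptions, not the code in `verifiers/scripts/eval.py`.

```python
import argparse
from pathlib import Path
from typing import Optional, Union


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="eval-sketch")
    parser.add_argument(
        "--resume",
        "-R",
        nargs="?",  # the flag may appear with or without a value
        const=True,  # value used when the flag appears with no path
        default=None,  # value used when the flag is absent
        metavar="PATH",
        help="Resume a run; with no PATH, auto-detect the latest incomplete run.",
    )
    return parser


def interpret_resume(value: Union[None, bool, str]) -> tuple[bool, Optional[Path]]:
    """Map the parsed value to (should_resume, explicit_path)."""
    if value is None:
        return (False, None)  # flag absent: start a fresh run
    if value is True:
        return (True, None)  # bare flag: caller should auto-detect
    return (True, Path(value))  # explicit path: caller should validate it
```

For example, `interpret_resume(build_parser().parse_args(["--resume"]).resume)` yields `(True, None)`, which a caller could treat as the signal to run auto-detection.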