Conversation
Cursor Bugbot has reviewed your changes and found 2 potential issues.
```python
# use env_state.total for actual resolved values
total_rollouts = env_state.total
num_examples = total_rollouts // config.rollouts_per_example
n = f"{num_examples}x{config.rollouts_per_example} ({total_rollouts} rollouts)"
```
Confusing negative values in TUI summary for early failures
Low Severity
When num_examples = -1 (meaning "all examples"), the TUI initializes total = config.num_examples * config.rollouts_per_example which produces a negative value (e.g., -3). If the evaluation fails before the on_start callback is invoked (e.g., during environment loading), env_state.total remains negative. In print_final_summary, the calculation num_examples = total_rollouts // config.rollouts_per_example produces negative values, resulting in confusing display output like "-1x3 (-3 rollouts)" in the summary table.
Additional Locations (1)
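One way to address this is to guard the summary formatting against the negative sentinel. The sketch below is hypothetical: `format_rollout_summary` is an illustrative helper (not a function in the PR), though the arithmetic mirrors the snippet above.

```python
def format_rollout_summary(total_rollouts: int, rollouts_per_example: int) -> str:
    """Format an 'NxM (T rollouts)' summary line.

    Guards the sentinel case where num_examples = -1 ("all examples")
    left env_state.total negative because the run failed before the
    on_start callback resolved the real total.
    """
    if total_rollouts < 0 or rollouts_per_example <= 0:
        return "n/a (run failed before start)"
    num_examples = total_rollouts // rollouts_per_example
    return f"{num_examples}x{rollouts_per_example} ({total_rollouts} rollouts)"
```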
```python
metrics = {
    name: metrics_accum[name] / completed for name in metrics_accum
}
error_rate = error_accum / completed
```
TUI reward average calculation differs from original tqdm
Medium Severity
The TUI's on_progress callback calculates the reward average by dividing reward_accum by completed (total states), but the original tqdm progress bar in environment.py divides by reward_count (only states with non-None rewards). When some states lack rewards (e.g., due to scoring errors), the TUI displays an incorrect, understated average. For example, if 8 of 10 states have reward=0.5 and 2 have reward=None, the original shows 0.5 but the TUI shows 0.4. The same issue affects the metrics averaging.
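The divergence described above can be demonstrated in isolation. `average_rewards` is a hypothetical helper, not code from the PR; it only shows the averaging rule the original tqdm path uses (divide by the count of non-None rewards, not by total completed states).

```python
def average_rewards(rewards: list) -> float:
    """Average only states that actually have a reward, matching the
    original tqdm behavior: None rewards (e.g. scoring errors) are
    excluded from both numerator and denominator."""
    scored = [r for r in rewards if r is not None]
    return sum(scored) / len(scored) if scored else 0.0
```

With 8 states at reward 0.5 and 2 at None, this yields 0.5, whereas dividing by all 10 completed states yields the understated 0.4.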
Description
This PR implements a live-updating TUI for multi-env logs (#734) behind an opt-in `--tui` flag. This will be especially useful for large prod-scale evals with many envs. It is designed so that the default `vf-eval` is affected as little as possible. The only "structural" change is the introduction of the callback pattern into `Environment.generate` to handle per-env events (such as metric aggregation, logs, etc.).
Type of Change
Testing
Ran `uv run pytest` locally.
Checklist
Additional Notes
Note
Adds an opt-in live TUI for multi-environment evaluations and introduces callback-based progress reporting.
- Adds `--tui` flag in `vf-eval` to render a live Rich UI; falls back to standard output on non-TTYs. Implements `run_evaluations_tui` and `verifiers/utils/eval_tui.py`.
- `Environment.generate`/`evaluate` accept `on_start`, `on_progress`, `on_log` (types added in `verifiers/types.py`); tqdm disabled when callbacks provided; emits log on final save.
- Passes `--tui` through `verifiers/scripts/eval.py`; adds `use_tqdm` to `EvalConfig`; updates example TOMLs (debug, single-turn, duplicate-env).
- Documents `--tui` in `docs/evaluation.md`; adjusts tests to new evaluation call signature and args.

Written by Cursor Bugbot for commit e2da252. This will update automatically on new commits.
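The callback-based progress reporting summarized above can be sketched as follows. This is a toy stand-in, not the actual `Environment.generate` API: the callback signatures and the `generate` function here are assumptions for illustration (the real types live in `verifiers/types.py`).

```python
from typing import Callable, Optional

# Hypothetical callback shapes: on_start receives the resolved rollout
# total; on_progress receives (completed_count, running_average_reward).
OnStart = Callable[[int], None]
OnProgress = Callable[[int, float], None]

def generate(
    num_rollouts: int,
    on_start: Optional[OnStart] = None,
    on_progress: Optional[OnProgress] = None,
) -> list[float]:
    """Toy stand-in for callback-driven generation: when callbacks are
    provided, they replace tqdm-style progress output."""
    if on_start is not None:
        on_start(num_rollouts)
    rewards: list[float] = []
    for i in range(num_rollouts):
        rewards.append(1.0)  # placeholder "rollout" producing a reward
        if on_progress is not None:
            on_progress(i + 1, sum(rewards) / len(rewards))
    return rewards
```

A TUI can then subscribe by passing closures that update its widgets, while the default path simply passes no callbacks.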