eval tui #735

Merged

mikasenghaas merged 70 commits into main from eval-tui on Jan 21, 2026

Conversation

@mikasenghaas (Member) commented on Jan 15, 2026

Description

This PR implements a live-updating TUI for multi-env logs (#734) behind an opt-in --tui flag. It will be especially useful for large prod-scale evals with many envs. The design leaves the default vf-eval path as untouched as possible: the only structural change is introducing a callback pattern into Environment.generate to handle per-env events (metric aggregation, logs, etc.).

uv run vf-eval configs/eval/debug.toml --tui
[Screenshots: three views of the live TUI, Jan 15, 2026]
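The callback pattern described above might look roughly like the sketch below. The hook names (on_start, on_progress, on_log) come from the PR summary; their signatures and this toy generate function are illustrative, not the actual Environment.generate implementation.

```python
from typing import Callable, Optional

# Hypothetical callback signatures modeled on the hooks this PR adds;
# the exact types in verifiers/types.py may differ.
OnStart = Callable[[int], None]            # total rollouts to run
OnProgress = Callable[[int, float], None]  # completed count, running reward avg
OnLog = Callable[[str], None]              # free-form log line

def generate(
    rollouts: list[float],
    on_start: Optional[OnStart] = None,
    on_progress: Optional[OnProgress] = None,
    on_log: Optional[OnLog] = None,
) -> float:
    """Toy stand-in for Environment.generate: runs rollouts and reports
    per-env events through the optional callbacks (tqdm would be
    disabled whenever callbacks are provided, as the PR notes)."""
    if on_start:
        on_start(len(rollouts))
    total = 0.0
    for i, reward in enumerate(rollouts, start=1):
        total += reward
        if on_progress:
            on_progress(i, total / i)
    if on_log:
        on_log(f"finished {len(rollouts)} rollouts")
    return total / len(rollouts)
```

A TUI can then subscribe to these hooks without the environment knowing anything about rendering.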

Type of Change

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Documentation update
  • Test improvement

Testing

  • All existing tests pass when running uv run pytest locally.
  • New tests have been added to cover the changes

Checklist

  • My code follows the style guidelines of this project as outlined in AGENTS.md
  • I have performed a self-review of my own code
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • Any dependent changes have been merged and published

Additional Notes


Note

Adds an opt-in live TUI for multi-environment evaluations and introduces callback-based progress reporting.

  • New TUI mode: --tui flag in vf-eval to render a live Rich UI; falls back to standard output on non-TTYs. Implements run_evaluations_tui and verifiers/utils/eval_tui.py.
  • Callback hooks: Environment.generate/evaluate accept on_start, on_progress, on_log (types added in verifiers/types.py); tqdm disabled when callbacks provided; emits log on final save.
  • CLI and config updates: Wire --tui through verifiers/scripts/eval.py; add use_tqdm to EvalConfig; update example TOMLs (debug, single-turn, duplicate-env).
  • Docs/tests: Document --tui in docs/evaluation.md; adjust tests to new evaluation call signature and args.
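The "falls back to standard output on non-TTYs" behavior from the first bullet can be sketched as a small dispatch helper. The function name and return values here are illustrative; the real CLI would call run_evaluations_tui in the TUI branch.

```python
import io
import sys

def select_mode(use_tui: bool, stream=sys.stdout) -> str:
    """Pick the rendering mode for vf-eval: use the live TUI only when
    --tui was passed AND the output stream is an interactive terminal;
    otherwise fall back to the standard (tqdm/stdout) path."""
    if use_tui and stream.isatty():
        return "tui"       # would dispatch to run_evaluations_tui(...)
    return "standard"      # plain output path, safe when piped/redirected

# A pipe or StringIO is not a TTY, so --tui silently falls back:
print(select_mode(True, io.StringIO()))  # → standard
```

Taking the stream as a parameter keeps the fallback testable without a real terminal.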

Written by Cursor Bugbot for commit e2da252. This will update automatically on new commits.

@mikasenghaas mikasenghaas mentioned this pull request Jan 15, 2026
13 tasks
@mikasenghaas mikasenghaas changed the base branch from main to multi-env-evals January 15, 2026 17:28
@mikasenghaas mikasenghaas force-pushed the eval-tui branch 3 times, most recently from 05501cb to ff11125 Compare January 15, 2026 20:11
@mikasenghaas mikasenghaas requested a review from willccbb January 15, 2026 20:46
@mikasenghaas mikasenghaas force-pushed the eval-tui branch 2 times, most recently from e0d255f to 74f323f Compare January 16, 2026 10:36
cursor[bot]

This comment was marked as outdated.

@willccbb willccbb changed the base branch from multi-env-evals to main January 21, 2026 07:15
@willccbb willccbb marked this pull request as ready for review January 21, 2026 07:15

@mikasenghaas mikasenghaas merged commit cb2e80f into main Jan 21, 2026
6 checks passed
cursor[bot] left a comment

Cursor Bugbot has reviewed your changes and found 2 potential issues.


# use env_state.total for actual resolved values
total_rollouts = env_state.total
num_examples = total_rollouts // config.rollouts_per_example
n = f"{num_examples}x{config.rollouts_per_example} ({total_rollouts} rollouts)"

Confusing negative values in TUI summary for early failures

Low Severity

When num_examples = -1 (meaning "all examples"), the TUI initializes total = config.num_examples * config.rollouts_per_example which produces a negative value (e.g., -3). If the evaluation fails before the on_start callback is invoked (e.g., during environment loading), env_state.total remains negative. In print_final_summary, the calculation num_examples = total_rollouts // config.rollouts_per_example produces negative values, resulting in confusing display output like "-1x3 (-3 rollouts)" in the summary table.
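A minimal fix sketch for this issue, assuming the sentinel semantics described above (num_examples = -1 meaning "all examples"); the function name and fallback string are hypothetical, not the PR's actual code:

```python
def format_summary_n(total_rollouts: int, rollouts_per_example: int) -> str:
    """Format the 'NxM (T rollouts)' summary cell, guarding against the
    negative totals left behind when evaluation fails before on_start
    resolves the real count (num_examples = -1 means "all examples")."""
    if total_rollouts < 0:
        # Sentinel never resolved: avoid printing "-1x3 (-3 rollouts)".
        return "all examples (not started)"
    num_examples = total_rollouts // rollouts_per_example
    return f"{num_examples}x{rollouts_per_example} ({total_rollouts} rollouts)"

print(format_summary_n(-3, 3))  # → all examples (not started)
print(format_summary_n(30, 3))  # → 10x3 (30 rollouts)
```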

Additional Locations (1)


metrics = {
name: metrics_accum[name] / completed for name in metrics_accum
}
error_rate = error_accum / completed

TUI reward average calculation differs from original tqdm

Medium Severity

The TUI's on_progress callback calculates the reward average by dividing reward_accum by completed (total states), but the original tqdm progress bar in environment.py divides by reward_count (only states with non-None rewards). When some states lack rewards (e.g., due to scoring errors), the TUI displays an incorrect, understated average. For example, if 8 of 10 states have reward=0.5 and 2 have reward=None, the original shows 0.5 but the TUI shows 0.4. The same issue affects the metrics averaging.
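The fix implied by this report is to divide by the number of scored states, as the original tqdm path does. A minimal sketch (the helper name is illustrative):

```python
def reward_average(rewards: list):
    """Average only over states that actually have a reward, matching
    the original tqdm behavior (divide by reward_count, not by the
    total number of completed states)."""
    scored = [r for r in rewards if r is not None]
    if not scored:
        return None  # nothing scored yet
    return sum(scored) / len(scored)

# The example from the report: 8 states at 0.5, 2 unscored.
rewards = [0.5] * 8 + [None] * 2
print(reward_average(rewards))  # → 0.5 (dividing by all 10 would give 0.4)
```

The same len-of-scored denominator applies to each entry in metrics_accum.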

