eval tui #735

Merged

mikasenghaas merged 70 commits into main from eval-tui on Jan 21, 2026

Conversation

@mikasenghaas (Member) commented on Jan 15, 2026

Description

This PR implements a live-updating TUI for multi-env logs (#734) behind an opt-in --tui flag. It will be especially useful for large prod-scale evals with many envs. The design leaves the default vf-eval path as untouched as possible: the only structural change is introducing a callback pattern into Environment.generate to handle per-env events (metric aggregation, logs, etc.).

uv run vf-eval configs/eval/debug.toml --tui
[Screenshots: three views of the live TUI, Jan 15, 2026]
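The callback pattern described above might look roughly like the sketch below. The hook names (on_start, on_progress, on_log) come from the PR summary; their signatures and this toy generate function are illustrative, not the actual Environment.generate implementation.

```python
from typing import Callable, Optional

# Hypothetical callback signatures modeled on the hooks this PR adds;
# the exact types in verifiers/types.py may differ.
OnStart = Callable[[int], None]            # total rollouts to run
OnProgress = Callable[[int, float], None]  # completed count, running reward avg
OnLog = Callable[[str], None]              # free-form log line

def generate(
    rollouts: list[float],
    on_start: Optional[OnStart] = None,
    on_progress: Optional[OnProgress] = None,
    on_log: Optional[OnLog] = None,
) -> float:
    """Toy stand-in for Environment.generate: runs rollouts and reports
    per-env events through the optional callbacks (tqdm would be
    disabled whenever callbacks are provided, as the PR notes)."""
    if on_start:
        on_start(len(rollouts))
    total = 0.0
    for i, reward in enumerate(rollouts, start=1):
        total += reward
        if on_progress:
            on_progress(i, total / i)
    if on_log:
        on_log(f"finished {len(rollouts)} rollouts")
    return total / len(rollouts)
```

A TUI can then subscribe to these hooks without the environment knowing anything about rendering.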

Type of Change

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Documentation update
  • Test improvement

Testing

  • All existing tests pass when running uv run pytest locally.
  • New tests have been added to cover the changes

Checklist

  • My code follows the style guidelines of this project as outlined in AGENTS.md
  • I have performed a self-review of my own code
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • Any dependent changes have been merged and published

Additional Notes


Note

Adds an opt-in live TUI for multi-environment evaluations and introduces callback-based progress reporting.

  • New TUI mode: --tui flag in vf-eval to render a live Rich UI; falls back to standard output on non-TTYs. Implements run_evaluations_tui and verifiers/utils/eval_tui.py.
  • Callback hooks: Environment.generate/evaluate accept on_start, on_progress, on_log (types added in verifiers/types.py); tqdm disabled when callbacks provided; emits log on final save.
  • CLI and config updates: Wire --tui through verifiers/scripts/eval.py; add use_tqdm to EvalConfig; update example TOMLs (debug, single-turn, duplicate-env).
  • Docs/tests: Document --tui in docs/evaluation.md; adjust tests to new evaluation call signature and args.
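The "falls back to standard output on non-TTYs" behavior from the first bullet can be sketched as a small dispatch helper. The function name and return values here are illustrative; the real CLI would call run_evaluations_tui in the TUI branch.

```python
import io
import sys

def select_mode(use_tui: bool, stream=sys.stdout) -> str:
    """Pick the rendering mode for vf-eval: use the live TUI only when
    --tui was passed AND the output stream is an interactive terminal;
    otherwise fall back to the standard (tqdm/stdout) path."""
    if use_tui and stream.isatty():
        return "tui"       # would dispatch to run_evaluations_tui(...)
    return "standard"      # plain output path, safe when piped/redirected

# A pipe or StringIO is not a TTY, so --tui silently falls back:
print(select_mode(True, io.StringIO()))  # → standard
```

Taking the stream as a parameter keeps the fallback testable without a real terminal.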

Written by Cursor Bugbot for commit e2da252. This will update automatically on new commits.

@mikasenghaas mikasenghaas mentioned this pull request Jan 15, 2026
13 tasks
@mikasenghaas mikasenghaas changed the base branch from main to multi-env-evals January 15, 2026 17:28
@mikasenghaas mikasenghaas force-pushed the eval-tui branch 3 times, most recently from 05501cb to ff11125 Compare January 15, 2026 20:11
@mikasenghaas mikasenghaas requested a review from willccbb January 15, 2026 20:46
@mikasenghaas mikasenghaas force-pushed the eval-tui branch 2 times, most recently from e0d255f to 74f323f Compare January 16, 2026 10:36
cursor[bot]

This comment was marked as outdated.

@willccbb willccbb changed the base branch from multi-env-evals to main January 21, 2026 07:15
@willccbb willccbb marked this pull request as ready for review January 21, 2026 07:15

@mikasenghaas mikasenghaas merged commit cb2e80f into main Jan 21, 2026
6 checks passed
cursor[bot] left a comment

Cursor Bugbot has reviewed your changes and found 2 potential issues.


# use env_state.total for actual resolved values
total_rollouts = env_state.total
num_examples = total_rollouts // config.rollouts_per_example
n = f"{num_examples}x{config.rollouts_per_example} ({total_rollouts} rollouts)"

Confusing negative values in TUI summary for early failures

Low Severity

When num_examples = -1 (meaning "all examples"), the TUI initializes total = config.num_examples * config.rollouts_per_example which produces a negative value (e.g., -3). If the evaluation fails before the on_start callback is invoked (e.g., during environment loading), env_state.total remains negative. In print_final_summary, the calculation num_examples = total_rollouts // config.rollouts_per_example produces negative values, resulting in confusing display output like "-1x3 (-3 rollouts)" in the summary table.
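A minimal fix sketch for this issue, assuming the sentinel semantics described above (num_examples = -1 meaning "all examples"); the function name and fallback string are hypothetical, not the PR's actual code:

```python
def format_summary_n(total_rollouts: int, rollouts_per_example: int) -> str:
    """Format the 'NxM (T rollouts)' summary cell, guarding against the
    negative totals left behind when evaluation fails before on_start
    resolves the real count (num_examples = -1 means "all examples")."""
    if total_rollouts < 0:
        # Sentinel never resolved: avoid printing "-1x3 (-3 rollouts)".
        return "all examples (not started)"
    num_examples = total_rollouts // rollouts_per_example
    return f"{num_examples}x{rollouts_per_example} ({total_rollouts} rollouts)"

print(format_summary_n(-3, 3))  # → all examples (not started)
print(format_summary_n(30, 3))  # → 10x3 (30 rollouts)
```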

Additional Locations (1)


metrics = {
name: metrics_accum[name] / completed for name in metrics_accum
}
error_rate = error_accum / completed

TUI reward average calculation differs from original tqdm

Medium Severity

The TUI's on_progress callback calculates the reward average by dividing reward_accum by completed (total states), but the original tqdm progress bar in environment.py divides by reward_count (only states with non-None rewards). When some states lack rewards (e.g., due to scoring errors), the TUI displays an incorrect, understated average. For example, if 8 of 10 states have reward=0.5 and 2 have reward=None, the original shows 0.5 but the TUI shows 0.4. The same issue affects the metrics averaging.
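The fix implied by this report is to divide by the number of scored states, as the original tqdm path does. A minimal sketch (the helper name is illustrative):

```python
def reward_average(rewards: list):
    """Average only over states that actually have a reward, matching
    the original tqdm behavior (divide by reward_count, not by the
    total number of completed states)."""
    scored = [r for r in rewards if r is not None]
    if not scored:
        return None  # nothing scored yet
    return sum(scored) / len(scored)

# The example from the report: 8 states at 0.5, 2 unscored.
rewards = [0.5] * 8 + [None] * 2
print(reward_average(rewards))  # → 0.5 (dividing by all 10 would give 0.4)
```

The same len-of-scored denominator applies to each entry in metrics_accum.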

