
Log eventloop lag during vf-eval #687

Merged
willccbb merged 5 commits into main from eventloop-lag
Jan 6, 2026

Conversation


@mikasenghaas mikasenghaas commented Jan 5, 2026

Description

Based on #686; do not merge before that PR lands.

Logs event loop lag during vf-eval as an environment-monitoring utility, to detect whether an environment overloads the main event loop and thereby degrades performance.
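For intuition, here is a minimal sketch of how an event loop lag monitor of this kind can work: schedule a short sleep repeatedly and record how much later than requested the loop actually wakes up. The class and method names (`EventLoopLagMonitor`, `get_lags`, the 0.1s `measure_interval`) mirror this PR, but the body is illustrative and may differ from the actual implementation in `async_utils`.

```python
import asyncio
import time

class EventLoopLagMonitor:
    """Illustrative sketch: periodically measure event loop scheduling lag."""

    def __init__(self, measure_interval=0.1):
        self.measure_interval = measure_interval
        self._lags = []
        self._task = None

    async def _run(self):
        while True:
            start = time.perf_counter()
            await asyncio.sleep(self.measure_interval)
            # Lag = how much later than requested the loop actually woke up.
            # A busy/blocked event loop delays the wakeup, inflating this value.
            self._lags.append(time.perf_counter() - start - self.measure_interval)

    def start(self):
        self._task = asyncio.get_running_loop().create_task(self._run())

    def stop(self):
        if self._task is not None:
            self._task.cancel()

    def get_lags(self):
        return list(self._lags)
```

Note that if the workload finishes before the first `measure_interval` elapses, `get_lags()` returns an empty list, which is exactly the edge case flagged by the bot review below.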

# low event loop lag
uv run vf-eval gsm8k -n1 -r1 -v

(screenshot: low event loop lag, 2026-01-05 6:42 PM)

# high event loop lag (>1k parallel reqs)
uv run vf-eval gsm8k -n-1 -r1 -c-1 -v

(screenshot: high event loop lag, 2026-01-05 6:41 PM)

Type of Change

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Documentation update
  • Test improvement

Testing

  • All existing tests pass when running uv run pytest locally.
  • New tests have been added to cover the changes

Checklist

  • My code follows the style guidelines of this project as outlined in AGENTS.md
  • I have performed a self-review of my own code
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • Any dependent changes have been merged and published

Additional Notes


Note

Introduces performance monitoring in evaluation and standardized metrics across environments.

  • Eval telemetry: New EventLoopLagMonitor (async_utils) integrated into run_evaluation and print_results to report event loop lag and detailed generation/scoring timings; adds print_time() formatter
  • Monitor rubrics: Auto-attached rubrics per env layer
    • MultiTurnEnv: num_turns
    • ToolEnv: total_tool_calls and per-tool call counts (replaces removed ToolRubric)
    • SandboxEnv: sandbox_ready_wait_time, sandbox_command_execution_time
    • PythonEnv: python_ready_wait_time
    • New Environment.add_rubric() stacks rubrics automatically
  • State & metrics plumbing: Sandbox/Python states now track ready wait and command durations; methods updated to record timings
  • API cleanups: Remove ToolRubric export/file; math_python stops manually adding it—ToolEnv now tracks tool metrics by default
  • Docs: New/expanded docs/evaluation.md, enriched docs/environments.md (monitor rubrics), updated navigation and RTD redirect
  • Tests: EnvGroup tests updated to include num_turns metric in expectations
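The rubric-stacking idea in the note above can be sketched as follows. This is a hypothetical reduction of `Environment.add_rubric()` to its essence; the real `verifiers` classes carry far more state, and the `Rubric`/`funcs` internals shown here are assumptions for illustration.

```python
class Rubric:
    def __init__(self, funcs=None):
        # Each rubric contributes a list of metric functions.
        self.funcs = list(funcs or [])

class Environment:
    def __init__(self, rubric=None):
        self.rubric = rubric or Rubric()

    def add_rubric(self, rubric):
        # Stack the new rubric's metrics onto the existing rubric, so each
        # env layer (MultiTurnEnv, ToolEnv, SandboxEnv, ...) can attach its
        # own monitor metrics automatically during __init__.
        self.rubric.funcs.extend(rubric.funcs)

def num_turns(state):  # illustrative monitor metric, like MultiTurnEnv's
    return state.get("turn", 0)

env = Environment()
env.add_rubric(Rubric(funcs=[num_turns]))
```

With this shape, each subclass constructor calls `add_rubric()` with its own monitor rubric, and the user-supplied rubric keeps accumulating metrics layer by layer.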

Written by Cursor Bugbot for commit 0d304d8. This will update automatically on new commits.

@mikasenghaas mikasenghaas force-pushed the eventloop-lag branch 2 times, most recently from 9b309db to 31f4e84 on January 5, 2026 17:43
@mikasenghaas mikasenghaas requested a review from willccbb January 5, 2026 17:47
@mikasenghaas mikasenghaas changed the base branch from main to log-timing January 5, 2026 17:50
@mikasenghaas mikasenghaas force-pushed the eventloop-lag branch 2 times, most recently from 0ba6b27 to 8552e76 on January 5, 2026 18:17
@mikasenghaas mikasenghaas marked this pull request as ready for review January 6, 2026 13:22
@mikasenghaas mikasenghaas changed the base branch from log-timing to main January 6, 2026 13:22
)
print(
f"event_loop_lag: med - {print_time(float(med_lag))}, p90 - {print_time(float(p90_lag))}, max - {print_time(float(max_lag))}"
)
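For context on the excerpt above, a `print_time()` duration formatter could look like the following sketch. The thresholds and output format are assumptions; only the function name comes from this PR.

```python
def print_time(seconds: float) -> str:
    """Hypothetical human-readable duration formatter (not the PR's actual code)."""
    if seconds < 1e-3:
        return f"{seconds * 1e6:.1f}µs"
    if seconds < 1.0:
        return f"{seconds * 1e3:.1f}ms"
    return f"{seconds:.2f}s"
```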

Empty event loop lags array causes crash

The code checks if event_loop_lags is not None: but doesn't handle the case where event_loop_lags is an empty list. If the evaluation completes faster than the 0.1-second measure_interval before any lag measurements are recorded, get_lags() returns an empty list. Calling np.max(), np.median(), or np.percentile() on an empty numpy array raises a ValueError because these reduction operations cannot handle zero-size arrays. The condition needs to also check that the list is non-empty before computing statistics.
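The guard the review asks for can be sketched like this. The helper name `summarize_lags` is made up for illustration; the key point is that a falsy check covers both `None` and the empty list before any numpy reduction runs.

```python
import numpy as np

def summarize_lags(event_loop_lags):
    # `not x` is True for both None and [], so the numpy reductions below
    # (which raise ValueError on zero-size arrays) are never reached empty.
    if not event_loop_lags:
        return None
    lags = np.asarray(event_loop_lags, dtype=float)
    return {
        "med": float(np.median(lags)),
        "p90": float(np.percentile(lags, 90)),
        "max": float(np.max(lags)),
    }
```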



@cursor cursor bot left a comment


Per-tool metrics not tracked for subclass-added tools

The ToolMonitorRubric is created in ToolEnv.__init__ before subclasses add their tools. When SandboxEnv or PythonEnv call super().__init__(), self.tools is empty, so ToolMonitorRubric captures an empty tool_names list. Tools like bash and python are added via add_tool() only after the rubric is created, so their per-tool call counts won't be tracked. This is a regression from the old pattern where ToolRubric was explicitly created after the environment was fully initialized in math_python.py.

verifiers/envs/tool_env.py#L77-L78

self.add_rubric(ToolMonitorRubric(tools=self.tools))

verifiers/envs/sandbox_env.py#L149-L150

)
self.add_rubric(SandboxMonitorRubric())

verifiers/envs/python_env.py#L201-L206

)
self.add_rubric(PythonMonitorRubric())
self.add_tool(
    self.python, args_to_skip=["sandbox_id", "sandbox_state", "python_state"]
)
self.remove_tool(self.bash)  # omit from agent tool list
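The init-ordering regression described above reduces to the following sketch. Class and method names mirror the report, but the bodies are illustrative; one possible fix (keeping the rubric's tool list in sync inside `add_tool`) is shown inline as a commented assumption, not as the PR's actual resolution.

```python
class ToolMonitorRubric:
    def __init__(self, tools):
        # Captures tool names at construction time only.
        self.tool_names = [t.__name__ for t in tools]

class ToolEnv:
    def __init__(self, tools=None):
        self.tools = list(tools or [])
        # The rubric created here sees only the tools passed to __init__;
        # at this point a subclass's tools have not been added yet.
        self.rubric = ToolMonitorRubric(tools=self.tools)

    def add_tool(self, tool):
        self.tools.append(tool)
        # Possible fix: also register late-added tools with the rubric,
        # so subclass tools like `bash`/`python` get per-tool counts.
        self.rubric.tool_names.append(tool.__name__)

def python():  # stand-in for a tool a subclass adds after super().__init__()
    pass

env = ToolEnv()
env.add_tool(python)
```

Without the sync in `add_tool`, `env.rubric.tool_names` would stay empty for subclass-added tools, which is exactly the regression the bot describes.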



@willccbb willccbb merged commit e6bb2cc into main Jan 6, 2026
8 checks passed