
Log eventloop lag during vf-eval #687

Merged
willccbb merged 5 commits into main from eventloop-lag
Jan 6, 2026

Conversation


@mikasenghaas mikasenghaas commented Jan 5, 2026

Description

Based on #686; do not merge before that PR lands.

Logs event loop lag during vf-eval as an environment-monitoring utility, to detect whether an environment overloads the main event loop and thereby degrades performance.
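For intuition, here is a minimal sketch of how an event loop lag monitor of this kind can work: schedule a short sleep repeatedly and record how much later than requested the loop actually wakes up. The class and method names (`EventLoopLagMonitor`, `get_lags`, the 0.1s `measure_interval`) mirror this PR, but the body is illustrative and may differ from the actual implementation in `async_utils`.

```python
import asyncio
import time

class EventLoopLagMonitor:
    """Illustrative sketch: periodically measure event loop scheduling lag."""

    def __init__(self, measure_interval=0.1):
        self.measure_interval = measure_interval
        self._lags = []
        self._task = None

    async def _run(self):
        while True:
            start = time.perf_counter()
            await asyncio.sleep(self.measure_interval)
            # Lag = how much later than requested the loop actually woke up.
            # A busy/blocked event loop delays the wakeup, inflating this value.
            self._lags.append(time.perf_counter() - start - self.measure_interval)

    def start(self):
        self._task = asyncio.get_running_loop().create_task(self._run())

    def stop(self):
        if self._task is not None:
            self._task.cancel()

    def get_lags(self):
        return list(self._lags)
```

Note that if the workload finishes before the first `measure_interval` elapses, `get_lags()` returns an empty list, which is exactly the edge case flagged by the bot review below.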

# low event loop lag
uv run vf-eval gsm8k -n1 -r1 -v

(screenshot: low event loop lag, 2026-01-05 6:42 PM)

# high event loop lag (>1k parallel reqs)
uv run vf-eval gsm8k -n-1 -r1 -c-1 -v

(screenshot: high event loop lag, 2026-01-05 6:41 PM)

Type of Change

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Documentation update
  • Test improvement

Testing

  • All existing tests pass when running uv run pytest locally.
  • New tests have been added to cover the changes

Checklist

  • My code follows the style guidelines of this project as outlined in AGENTS.md
  • I have performed a self-review of my own code
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • Any dependent changes have been merged and published

Additional Notes


Note

Introduces performance monitoring in evaluation and standardized metrics across environments.

  • Eval telemetry: New EventLoopLagMonitor (async_utils) integrated into run_evaluation and print_results to report event loop lag and detailed generation/scoring timings; adds print_time() formatter
  • Monitor rubrics: Auto-attached rubrics per env layer
    • MultiTurnEnv: num_turns
    • ToolEnv: total_tool_calls and per-tool call counts (replaces removed ToolRubric)
    • SandboxEnv: sandbox_ready_wait_time, sandbox_command_execution_time
    • PythonEnv: python_ready_wait_time
    • New Environment.add_rubric() stacks rubrics automatically
  • State & metrics plumbing: Sandbox/Python states now track ready wait and command durations; methods updated to record timings
  • API cleanups: Remove ToolRubric export/file; math_python stops manually adding it—ToolEnv now tracks tool metrics by default
  • Docs: New/expanded docs/evaluation.md, enriched docs/environments.md (monitor rubrics), updated navigation and RTD redirect
  • Tests: EnvGroup tests updated to include num_turns metric in expectations
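The rubric-stacking idea in the note above can be sketched as follows. This is a hypothetical reduction of `Environment.add_rubric()` to its essence; the real `verifiers` classes carry far more state, and the `Rubric`/`funcs` internals shown here are assumptions for illustration.

```python
class Rubric:
    def __init__(self, funcs=None):
        # Each rubric contributes a list of metric functions.
        self.funcs = list(funcs or [])

class Environment:
    def __init__(self, rubric=None):
        self.rubric = rubric or Rubric()

    def add_rubric(self, rubric):
        # Stack the new rubric's metrics onto the existing rubric, so each
        # env layer (MultiTurnEnv, ToolEnv, SandboxEnv, ...) can attach its
        # own monitor metrics automatically during __init__.
        self.rubric.funcs.extend(rubric.funcs)

def num_turns(state):  # illustrative monitor metric, like MultiTurnEnv's
    return state.get("turn", 0)

env = Environment()
env.add_rubric(Rubric(funcs=[num_turns]))
```

With this shape, each subclass constructor calls `add_rubric()` with its own monitor rubric, and the user-supplied rubric keeps accumulating metrics layer by layer.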

Written by Cursor Bugbot for commit 0d304d8. This will update automatically on new commits.

@mikasenghaas mikasenghaas force-pushed the eventloop-lag branch 2 times, most recently from 9b309db to 31f4e84 on January 5, 2026 17:43
@mikasenghaas mikasenghaas requested a review from willccbb January 5, 2026 17:47
@mikasenghaas mikasenghaas changed the base branch from main to log-timing January 5, 2026 17:50
@mikasenghaas mikasenghaas force-pushed the eventloop-lag branch 2 times, most recently from 0ba6b27 to 8552e76 on January 5, 2026 18:17
@mikasenghaas mikasenghaas marked this pull request as ready for review January 6, 2026 13:22
@mikasenghaas mikasenghaas changed the base branch from log-timing to main January 6, 2026 13:22
)
print(
f"event_loop_lag: med - {print_time(float(med_lag))}, p90 - {print_time(float(p90_lag))}, max - {print_time(float(max_lag))}"
)
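For context on the excerpt above, a `print_time()` duration formatter could look like the following sketch. The thresholds and output format are assumptions; only the function name comes from this PR.

```python
def print_time(seconds: float) -> str:
    """Hypothetical human-readable duration formatter (not the PR's actual code)."""
    if seconds < 1e-3:
        return f"{seconds * 1e6:.1f}µs"
    if seconds < 1.0:
        return f"{seconds * 1e3:.1f}ms"
    return f"{seconds:.2f}s"
```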

Empty event loop lags array causes crash

The code checks if event_loop_lags is not None: but doesn't handle the case where event_loop_lags is an empty list. If the evaluation completes faster than the 0.1-second measure_interval before any lag measurements are recorded, get_lags() returns an empty list. Calling np.max(), np.median(), or np.percentile() on an empty numpy array raises a ValueError because these reduction operations cannot handle zero-size arrays. The condition needs to also check that the list is non-empty before computing statistics.
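The guard the review asks for can be sketched like this. The helper name `summarize_lags` is made up for illustration; the key point is that a falsy check covers both `None` and the empty list before any numpy reduction runs.

```python
import numpy as np

def summarize_lags(event_loop_lags):
    # `not x` is True for both None and [], so the numpy reductions below
    # (which raise ValueError on zero-size arrays) are never reached empty.
    if not event_loop_lags:
        return None
    lags = np.asarray(event_loop_lags, dtype=float)
    return {
        "med": float(np.median(lags)),
        "p90": float(np.percentile(lags, 90)),
        "max": float(np.max(lags)),
    }
```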



@cursor cursor bot left a comment


Per-tool metrics not tracked for subclass-added tools

The ToolMonitorRubric is created in ToolEnv.__init__ before subclasses add their tools. When SandboxEnv or PythonEnv call super().__init__(), self.tools is empty, so ToolMonitorRubric captures an empty tool_names list. Tools like bash and python are added via add_tool() only after the rubric is created, so their per-tool call counts won't be tracked. This is a regression from the old pattern where ToolRubric was explicitly created after the environment was fully initialized in math_python.py.

verifiers/envs/tool_env.py#L77-L78

self.add_rubric(ToolMonitorRubric(tools=self.tools))

verifiers/envs/sandbox_env.py#L149-L150

)
self.add_rubric(SandboxMonitorRubric())

verifiers/envs/python_env.py#L201-L206

)
self.add_rubric(PythonMonitorRubric())
self.add_tool(
    self.python, args_to_skip=["sandbox_id", "sandbox_state", "python_state"]
)
self.remove_tool(self.bash)  # omit from agent tool list
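The init-ordering regression described above reduces to the following sketch. Class and method names mirror the report, but the bodies are illustrative; one possible fix (keeping the rubric's tool list in sync inside `add_tool`) is shown inline as a commented assumption, not as the PR's actual resolution.

```python
class ToolMonitorRubric:
    def __init__(self, tools):
        # Captures tool names at construction time only.
        self.tool_names = [t.__name__ for t in tools]

class ToolEnv:
    def __init__(self, tools=None):
        self.tools = list(tools or [])
        # The rubric created here sees only the tools passed to __init__;
        # at this point a subclass's tools have not been added yet.
        self.rubric = ToolMonitorRubric(tools=self.tools)

    def add_tool(self, tool):
        self.tools.append(tool)
        # Possible fix: also register late-added tools with the rubric,
        # so subclass tools like `bash`/`python` get per-tool counts.
        self.rubric.tool_names.append(tool.__name__)

def python():  # stand-in for a tool a subclass adds after super().__init__()
    pass

env = ToolEnv()
env.add_tool(python)
```

Without the sync in `add_tool`, `env.rubric.tool_names` would stay empty for subclass-added tools, which is exactly the regression the bot describes.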



@willccbb willccbb merged commit e6bb2cc into main Jan 6, 2026
8 checks passed