fix math rubric timeouts #831

Merged
willccbb merged 10 commits into main from fix-math-rubric2
Feb 6, 2026
mikasenghaas (Member) commented Feb 5, 2026

Description

Currently, the scoring timeout set in vf.MathRubric includes event-loop lag (e.g. the time it takes until verification even starts in the thread executor). As a result, the rubric falsely marks answers with 0 reward because the timeout can expire before verification has finished, or even started.

For example, in aime2024 with only 200 rollouts we see false timeouts: many rollouts finish simultaneously after 4K tokens, which produces a burst of scoring requests.

uv run vf-eval aime2024 -m deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B -b http://localhost:8000/v1 -n 20 -r 10 -c -1 -d -t 4096 -R
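The failure mode can be reproduced in isolation. The following is a minimal sketch, not the code in vf.MathRubric: `verify`, `score_with_outer_timeout`, and `burst_outer_timeout` are hypothetical names, and the sleep stands in for real parsing and verification. It assumes the timeout is applied with `asyncio.wait_for` around the executor call, so the clock starts at submission and time spent queued behind other tasks counts against it.

```python
import asyncio
import time
from concurrent.futures import ThreadPoolExecutor


def verify(duration: float) -> float:
    """Hypothetical stand-in for verification: sleeps, then reports a correct answer."""
    time.sleep(duration)
    return 1.0


async def score_with_outer_timeout(executor, duration: float, timeout: float) -> float:
    """The timeout clock starts at submission, so executor queueing delay is included."""
    loop = asyncio.get_running_loop()
    try:
        return await asyncio.wait_for(
            loop.run_in_executor(executor, verify, duration), timeout=timeout
        )
    except asyncio.TimeoutError:
        # The answer was never judged wrong; it simply waited too long in the queue.
        return 0.0


async def burst_outer_timeout(n_tasks=8, max_workers=2, duration=0.1, timeout=0.25):
    """Submit a burst of scoring requests to a small thread pool."""
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        return await asyncio.gather(
            *(score_with_outer_timeout(executor, duration, timeout) for _ in range(n_tasks))
        )


if __name__ == "__main__":
    # Every verification takes 0.1s, well under the 0.25s timeout,
    # yet the requests at the back of the queue are scored 0.0.
    print(asyncio.run(burst_outer_timeout()))
```

With two workers and eight requests, the last requests cannot start before the timeout has already passed, so they receive 0 reward even though each verification on its own is fast.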

This PR fixes this by using the asyncio timeout only as a very high upper bound to prevent infinite hangs; the actual timing logic is thread-internal and only starts when scoring actually starts. I confirmed this works by running aime2024 at Avg@32 (960 parallel rollouts) at 4K context without any false timeouts (got the expected score of ~11%).

uv run vf-eval aime2024 -m deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B -b http://localhost:8000/v1 -n 30 -r 32 -c -1 -d -t 8192 -R

Note that a verification that genuinely takes a long time is only cancelled after 120s (the hard timeout). This should happen rarely, and because the verification runs in a thread it should not block the remainder of the env execution.
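The approach described above can be sketched as follows. This is an illustration under stated assumptions rather than the merged implementation: `timed_verify`, `score`, `verify_fn`, and `burst_inner_timing` are hypothetical names, and only `HARD_TIMEOUT_SECONDS` (120s) and `timeout_seconds` come from this PR. The elapsed time is measured inside the worker thread, and `asyncio.wait_for` is kept only as a safety net.

```python
import asyncio
import time
from concurrent.futures import ThreadPoolExecutor

HARD_TIMEOUT_SECONDS = 120  # value from this PR; only guards against indefinite hangs


def timed_verify(verify_fn, timeout_seconds: float) -> float:
    """Runs in the worker thread, so the clock starts only when work starts."""
    start = time.perf_counter()
    correct = verify_fn()  # hypothetical callable wrapping parsing + verification
    elapsed = time.perf_counter() - start
    # A verification that exceeds its own budget scores 0 regardless of the result.
    return 1.0 if correct and elapsed <= timeout_seconds else 0.0


async def score(executor, verify_fn, timeout_seconds: float) -> float:
    """Queueing delay no longer counts; only the hard timeout wraps the executor call."""
    loop = asyncio.get_running_loop()
    try:
        return await asyncio.wait_for(
            loop.run_in_executor(executor, timed_verify, verify_fn, timeout_seconds),
            timeout=HARD_TIMEOUT_SECONDS,
        )
    except asyncio.TimeoutError:
        return 0.0


async def burst_inner_timing(n_tasks=8, max_workers=2, duration=0.1, timeout_seconds=0.25):
    """Same burst as a queue-bound workload: many requests, few workers."""

    def verify_fn() -> bool:
        time.sleep(duration)  # simulated verification work
        return True

    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        return await asyncio.gather(
            *(score(executor, verify_fn, timeout_seconds) for _ in range(n_tasks))
        )


if __name__ == "__main__":
    print(asyncio.run(burst_inner_timing()))               # fast verifications all pass
    print(asyncio.run(burst_inner_timing(1, duration=0.3)))  # slow verification scores 0.0
```

One consequence of this design is that a slow verification still occupies its worker until it returns, since Python threads cannot be forcibly stopped; the per-call budget only decides the reward.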

Type of Change

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Documentation update
  • Test improvement

Testing

  • All existing tests pass when running uv run pytest locally.
  • New tests have been added to cover the changes

Checklist

  • My code follows the style guidelines of this project as outlined in AGENTS.md
  • I have performed a self-review of my own code
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • Any dependent changes have been merged and published

Additional Notes


Note

Medium Risk
Changes scoring behavior and concurrency defaults for math verification; while aimed at reducing false negatives, it can affect throughput and timing/CPU characteristics under load and should be validated in large evaluations.

Overview
Fixes false timeouts in MathRubric by moving parsing/verification and elapsed-time measurement into the executor thread and only enforcing timeout_seconds based on that internal duration (so event-loop/executor queueing delay no longer zeros rewards).

Adds a hard HARD_TIMEOUT_SECONDS (120s) around the executor call to prevent indefinite hangs, bumps the default max_workers (10→50), and updates the timeout test to assert pass/fail behavior by timeout_seconds instead of wall-clock timing.

Written by Cursor Bugbot for commit 384487b.

@mikasenghaas mikasenghaas marked this pull request as ready for review February 5, 2026 15:09
@mikasenghaas mikasenghaas requested a review from willccbb February 5, 2026 15:09
@willccbb willccbb merged commit 036fff4 into main Feb 6, 2026
6 checks passed
