
add pass@k and pass^k metrics with configurable threshold#944

Open
hallerite wants to merge 6 commits into main from hallerite/passatk

Conversation


@hallerite hallerite commented Feb 20, 2026

Description

  • Add pass@k metric (at least one correct in k samples) using the unbiased estimator: 1 - C(n-c, k) / C(n, k)
  • Add pass^k metric (all k samples correct) using the unbiased estimator: C(c, k) / C(n, k)
  • Add passed_threshold as a first-class Environment property (default 1.0) controlling what reward counts as "correct" for both metrics
  • Display pass@k and pass^k in eval output (print_rewards) and TUI progress display
  • Surface both metrics in GenerateMetadata for downstream consumers
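The two unbiased estimators above can be sketched directly with binomial coefficients (a minimal illustration, not the PR's actual code; function names are assumed):

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: chance that at least one of k rollouts sampled
    without replacement from n rollouts (c of them correct) is correct."""
    # comb(n - c, k) counts the all-incorrect subsets of size k.
    return 1.0 - comb(n - c, k) / comb(n, k)

def pass_all_k(n: int, c: int, k: int) -> float:
    """Unbiased pass^k: chance that all k sampled rollouts are correct."""
    # comb(c, k) counts the all-correct subsets of size k.
    return comb(c, k) / comb(n, k)

# With n=4 rollouts, c=2 correct, sampling k=2:
# pass@2 = 1 - C(2,2)/C(4,2) = 5/6; pass^2 = C(2,2)/C(4,2) = 1/6
```

Since `math.comb(m, k)` returns 0 when k > m, the edge cases fall out naturally for k <= n: pass@k is 1.0 when fewer than k rollouts are incorrect, and pass^k is 0.0 when fewer than k are correct.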

Type of Change

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Documentation update
  • Test improvement

Testing

  • All existing tests pass when running uv run pytest locally.
  • New tests have been added to cover the changes

Checklist

  • My code follows the style guidelines of this project as outlined in AGENTS.md
  • I have performed a self-review of my own code
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • Any dependent changes have been merged and published

Additional Notes


Note

Medium Risk
Touches core evaluation/metadata plumbing and adds new computed fields that downstream consumers may assume exist, but changes are additive and covered by targeted unit tests.

Overview
Adds aggregate sampling robustness metrics to evaluation outputs by computing pass@k (at least one correct) and pass^k (all correct) over rollouts, using power-of-two k values and a configurable correctness threshold.

Introduces verifiers.utils.metric_utils.compute_pass_at_k, threads pass_threshold through Environment and GenerateOutputsBuilder, and persists pass_at_k/pass_all_k/pass_threshold in GenerateMetadata (plus docs updates). Eval reporting now prints these metrics and injects them into the TUI progress display; comprehensive tests cover estimator behavior, thresholds, and incomplete rollout groups.
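A rough sketch of how such a helper might combine the correctness threshold with power-of-two k values (the name and return shape here are assumptions, not the actual compute_pass_at_k signature):

```python
from math import comb

def pass_metrics(rewards: list[float], pass_threshold: float = 0.5):
    # Sketch only: count rollouts whose reward clears the threshold,
    # then evaluate both unbiased estimators at k = 1, 2, 4, ... <= n.
    n = len(rewards)
    c = sum(1 for r in rewards if r >= pass_threshold)
    pass_at_k: dict[str, float] = {}
    pass_all_k: dict[str, float] = {}
    k = 1
    while k <= n:
        # math.comb(m, k) is 0 when k > m, so edge cases fall out naturally.
        pass_at_k[str(k)] = 1.0 - comb(n - c, k) / comb(n, k)
        pass_all_k[str(k)] = comb(c, k) / comb(n, k)
        k *= 2
    return pass_at_k, pass_all_k
```

For four rollouts with rewards [1.0, 0.0, 1.0, 0.0] and the default 0.5 threshold, this yields pass@k/pass^k entries at k = 1, 2, and 4.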

Written by Cursor Bugbot for commit 72ef851. This will update automatically on new commits.

@willccbb (Member) commented:

@cursoragent change pass_hat_k to pass_all_k, change passed_threshold to pass_threshold, change pass_threshold default to 0.5

cursor bot commented Feb 22, 2026

  • Implemented: renamed pass_hat_k to pass_all_k across code, metadata, TUI, and tests.
  • Implemented: renamed passed_threshold to pass_threshold everywhere.
  • Changed default: pass_threshold now defaults to 0.5 (also applied to compute_pass_at_k’s default threshold).
  • Updated tests to reflect new names and default.

I ran the test suite with uv; several environment smoke tests failed due to missing PRIME_API_KEY (external dependency), but unit tests for the new metrics and builder paths pass locally. Changes are committed and pushed to hallerite/passatk.


@cursor cursor bot left a comment


Cursor Bugbot has reviewed your changes and found 1 potential issue.

Bugbot Autofix is OFF. To automatically fix reported issues with Cloud Agents, enable Autofix in the Cursor dashboard.

cursoragent and others added 2 commits February 23, 2026 01:02
…fields

Add missing fields to GenerateMetadata documentation:
- avg_error: float
- pass_at_k: dict[str, float]
- pass_all_k: dict[str, float]
- pass_threshold: float
- usage: TokenUsage | None

Add missing parameters to Environment.__init__ documentation:
- score_rollouts: bool = True
- pass_threshold: float = 0.5

Co-authored-by: will brown <willccbb@users.noreply.github.com>

Previously, examples with fewer outputs than rollouts_per_example were
skipped per-k via 'continue' but still counted in the denominator
(num_examples), implicitly contributing 0 and deflating the metric.
During incremental progress updates (especially with
independent_scoring=True), incomplete groups caused systematically
lower intermediate pass@k values.

Now only examples with exactly rollouts_per_example outputs are
included. Incomplete groups are excluded entirely, and if no complete
groups exist, empty dicts are returned.

Co-authored-by: will brown <willccbb@users.noreply.github.com>
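The fix described in this commit can be illustrated with a small filtering helper (hypothetical name; the real change lives inside the metric computation):

```python
def keep_complete_groups(groups: dict[str, list[float]],
                         rollouts_per_example: int) -> dict[str, list[float]]:
    # Only examples with exactly rollouts_per_example rollouts enter the
    # pass@k denominator; partially scored groups no longer contribute
    # implicit zeros during incremental progress updates.
    return {example_id: rewards
            for example_id, rewards in groups.items()
            if len(rewards) == rollouts_per_example}
```

If this filter leaves no complete groups, the metric computation returns empty dicts rather than a deflated average.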


3 participants