
add pass@k and pass^k metrics with configurable threshold#944

Open
hallerite wants to merge 6 commits into main from hallerite/passatk

Conversation


@hallerite hallerite commented Feb 20, 2026

Description

  • Add pass@k metric (at least one correct in k samples) using the unbiased estimator: 1 - C(n-c, k) / C(n, k)
  • Add pass^k metric (all k samples correct) using the unbiased estimator: C(c, k) / C(n, k)
  • Add passed_threshold as a first-class Environment property (default 1.0) controlling what reward counts as "correct" for both metrics
  • Display pass@k and pass^k in eval output (print_rewards) and TUI progress display
  • Surface both metrics in GenerateMetadata for downstream consumers
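The two unbiased estimators above can be sketched directly with binomial coefficients (a minimal illustration, not the PR's actual code; function names are assumed):

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: chance that at least one of k rollouts sampled
    without replacement from n rollouts (c of them correct) is correct."""
    # comb(n - c, k) counts the all-incorrect subsets of size k.
    return 1.0 - comb(n - c, k) / comb(n, k)

def pass_all_k(n: int, c: int, k: int) -> float:
    """Unbiased pass^k: chance that all k sampled rollouts are correct."""
    # comb(c, k) counts the all-correct subsets of size k.
    return comb(c, k) / comb(n, k)

# With n=4 rollouts, c=2 correct, sampling k=2:
# pass@2 = 1 - C(2,2)/C(4,2) = 5/6; pass^2 = C(2,2)/C(4,2) = 1/6
```

Since `math.comb(m, k)` returns 0 when k > m, the edge cases fall out naturally for k <= n: pass@k is 1.0 when fewer than k rollouts are incorrect, and pass^k is 0.0 when fewer than k are correct.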

Type of Change

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Documentation update
  • Test improvement

Testing

  • All existing tests pass when running uv run pytest locally.
  • New tests have been added to cover the changes

Checklist

  • My code follows the style guidelines of this project as outlined in AGENTS.md
  • I have performed a self-review of my own code
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • Any dependent changes have been merged and published

Additional Notes


Note

Medium Risk
Touches core evaluation/metadata plumbing and adds new computed fields that downstream consumers may assume exist, but changes are additive and covered by targeted unit tests.

Overview
Adds aggregate sampling robustness metrics to evaluation outputs by computing pass@k (at least one correct) and pass^k (all correct) over rollouts, using power-of-two k values and a configurable correctness threshold.

Introduces verifiers.utils.metric_utils.compute_pass_at_k, threads pass_threshold through Environment and GenerateOutputsBuilder, and persists pass_at_k/pass_all_k/pass_threshold in GenerateMetadata (plus docs updates). Eval reporting now prints these metrics and injects them into the TUI progress display; comprehensive tests cover estimator behavior, thresholds, and incomplete rollout groups.
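A rough sketch of how such a helper might combine the correctness threshold with power-of-two k values (the name and return shape here are assumptions, not the actual compute_pass_at_k signature):

```python
from math import comb

def pass_metrics(rewards: list[float], pass_threshold: float = 0.5):
    # Sketch only: count rollouts whose reward clears the threshold,
    # then evaluate both unbiased estimators at k = 1, 2, 4, ... <= n.
    n = len(rewards)
    c = sum(1 for r in rewards if r >= pass_threshold)
    pass_at_k: dict[str, float] = {}
    pass_all_k: dict[str, float] = {}
    k = 1
    while k <= n:
        # math.comb(m, k) is 0 when k > m, so edge cases fall out naturally.
        pass_at_k[str(k)] = 1.0 - comb(n - c, k) / comb(n, k)
        pass_all_k[str(k)] = comb(c, k) / comb(n, k)
        k *= 2
    return pass_at_k, pass_all_k
```

For four rollouts with rewards [1.0, 0.0, 1.0, 0.0] and the default 0.5 threshold, this yields pass@k/pass^k entries at k = 1, 2, and 4.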

Written by Cursor Bugbot for commit 72ef851. This will update automatically on new commits.

@willccbb (Member) commented:

@cursoragent change pass_hat_k to pass_all_k, change passed_threshold to pass_threshold, change pass_threshold default to 0.5

cursor bot commented Feb 22, 2026

  • Implemented: renamed pass_hat_k to pass_all_k across code, metadata, TUI, and tests.
  • Implemented: renamed passed_threshold to pass_threshold everywhere.
  • Changed default: pass_threshold now defaults to 0.5 (also applied to compute_pass_at_k’s default threshold).
  • Updated tests to reflect new names and default.

I ran the test suite with uv; several environment smoke tests failed due to missing PRIME_API_KEY (external dependency), but unit tests for the new metrics and builder paths pass locally. Changes are committed and pushed to hallerite/passatk.


@cursor cursor bot left a comment


Cursor Bugbot has reviewed your changes and found 1 potential issue.

Bugbot Autofix is OFF. To automatically fix reported issues with Cloud Agents, enable Autofix in the Cursor dashboard.

cursoragent and others added 2 commits February 23, 2026 01:02
…fields

Add missing fields to GenerateMetadata documentation:
- avg_error: float
- pass_at_k: dict[str, float]
- pass_all_k: dict[str, float]
- pass_threshold: float
- usage: TokenUsage | None

Add missing parameters to Environment.__init__ documentation:
- score_rollouts: bool = True
- pass_threshold: float = 0.5

Co-authored-by: will brown <willccbb@users.noreply.github.com>

Previously, examples with fewer outputs than rollouts_per_example were
skipped per-k via 'continue' but still counted in the denominator
(num_examples), implicitly contributing 0 and deflating the metric.
During incremental progress updates (especially with
independent_scoring=True), incomplete groups caused systematically
lower intermediate pass@k values.

Now only examples with exactly rollouts_per_example outputs are
included. Incomplete groups are excluded entirely, and if no complete
groups exist, empty dicts are returned.

Co-authored-by: will brown <willccbb@users.noreply.github.com>
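The fix described in this commit can be illustrated with a small filtering helper (hypothetical name; the real change lives inside the metric computation):

```python
def keep_complete_groups(groups: dict[str, list[float]],
                         rollouts_per_example: int) -> dict[str, list[float]]:
    # Only examples with exactly rollouts_per_example rollouts enter the
    # pass@k denominator; partially scored groups no longer contribute
    # implicit zeros during incremental progress updates.
    return {example_id: rewards
            for example_id, rewards in groups.items()
            if len(rewards) == rollouts_per_example}
```

If this filter leaves no complete groups, the metric computation returns empty dicts rather than a deflated average.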


3 participants