add pass@k and pass^k metrics with configurable threshold#944
Open
Conversation
Member
@cursoragent change pass_hat_k to pass_all_k, change passed_threshold to pass_threshold, change pass_threshold default to 0.5
I ran the test suite with uv; several environment smoke tests failed because PRIME_API_KEY was missing (external dependency), but unit tests for the new metrics and builder paths pass locally. Changes are committed and pushed to hallerite/passatk.
a382e79 to 092427f
Cursor Bugbot has reviewed your changes and found 1 potential issue.
…fields
Add missing fields to GenerateMetadata documentation:
- avg_error: float
- pass_at_k: dict[str, float]
- pass_all_k: dict[str, float]
- pass_threshold: float
- usage: TokenUsage | None
Add missing parameters to Environment.__init__ documentation:
- score_rollouts: bool = True
- pass_threshold: float = 0.5
Co-authored-by: will brown <willccbb@users.noreply.github.com>
Previously, examples with fewer outputs than rollouts_per_example were skipped per-k via 'continue' but still counted in the denominator (num_examples), implicitly contributing 0 and deflating the metric. During incremental progress updates (especially with independent_scoring=True), incomplete groups caused systematically lower intermediate pass@k values.
Now only examples with exactly rollouts_per_example outputs are included. Incomplete groups are excluded entirely, and if no complete groups exist, empty dicts are returned.
Co-authored-by: will brown <willccbb@users.noreply.github.com>
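The behavior described in this commit message can be sketched roughly as follows. This is an illustration, not the code from the PR: the function name `aggregate_pass_metrics`, the `rewards_by_example` input shape, the `reward >= pass_threshold` comparison, the string dict keys, and starting the power-of-two `k` values at 1 are all assumptions made for the example.

```python
from math import comb


def aggregate_pass_metrics(
    rewards_by_example: dict[int, list[float]],
    rollouts_per_example: int,
    pass_threshold: float = 0.5,
) -> tuple[dict[str, float], dict[str, float]]:
    """Average pass@k and pass^k over complete rollout groups only (hypothetical helper)."""
    n = rollouts_per_example
    # Keep only examples with exactly n outputs; incomplete groups are
    # dropped from both the numerator and the denominator.
    complete = [r for r in rewards_by_example.values() if len(r) == n]
    if not complete:
        return {}, {}

    # Number of "correct" rollouts per complete example (assumed: reward >= threshold).
    correct_counts = [sum(1 for r in rewards if r >= pass_threshold) for rewards in complete]

    pass_at_k: dict[str, float] = {}
    pass_all_k: dict[str, float] = {}
    k = 1
    while k <= n:  # assumed: k = 1, 2, 4, ... up to n
        at_sum = sum(1.0 - comb(n - c, k) / comb(n, k) for c in correct_counts)
        all_sum = sum(comb(c, k) / comb(n, k) for c in correct_counts)
        pass_at_k[str(k)] = at_sum / len(complete)
        pass_all_k[str(k)] = all_sum / len(complete)
        k *= 2
    return pass_at_k, pass_all_k
```

Because the denominator is `len(complete)` rather than the total number of examples, a group that is still being filled in during a progress update no longer pulls the intermediate averages toward zero.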
Description
- pass@k: 1 - C(n-c, k) / C(n, k)
- pass^k: C(c, k) / C(n, k)
- passed_threshold as a first-class Environment property (default 1.0) controlling what reward counts as "correct" for both metrics
- Metrics shown in eval reporting (print_rewards) and TUI progress display
- Metrics persisted in GenerateMetadata for downstream consumers
Type of Change
Testing
uv run pytest locally.
Checklist
Additional Notes
Note
Medium Risk
Touches core evaluation/metadata plumbing and adds new computed fields that downstream consumers may assume exist, but changes are additive and covered by targeted unit tests.
Overview
Adds aggregate sampling robustness metrics to evaluation outputs by computing pass@k (at least one correct) and pass^k (all correct) over rollouts, using power-of-two k values and a configurable correctness threshold.
Introduces verifiers.utils.metric_utils.compute_pass_at_k, threads pass_threshold through Environment and GenerateOutputsBuilder, and persists pass_at_k/pass_all_k/pass_threshold in GenerateMetadata (plus docs updates). Eval reporting now prints these metrics and injects them into the TUI progress display; comprehensive tests cover estimator behavior, thresholds, and incomplete rollout groups.
Written by Cursor Bugbot for commit 72ef851. This will update automatically on new commits.
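For readers unfamiliar with the two metrics, here is a minimal per-example sketch using the combinatorial estimators 1 - C(n-c, k) / C(n, k) and C(c, k) / C(n, k), where n is the number of rollouts and c the number counted as correct. It does not reproduce the signature of compute_pass_at_k; the helper names, the `reward >= pass_threshold` comparison, the string keys, and the choice to start k at 1 are assumptions for illustration.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Chance that at least one of k rollouts drawn without replacement from n is correct."""
    # comb(n - c, k) is 0 when k > n - c, so the result is 1.0 in that case.
    return 1.0 - comb(n - c, k) / comb(n, k)


def pass_all_k(n: int, c: int, k: int) -> float:
    """Chance that all k rollouts drawn without replacement from n are correct (pass^k)."""
    # comb(c, k) is 0 when k > c, so the result is 0.0 in that case.
    return comb(c, k) / comb(n, k)


def power_of_two_ks(n: int) -> list[int]:
    """Power-of-two k values up to n (assumed to start at 1)."""
    ks, k = [], 1
    while k <= n:
        ks.append(k)
        k *= 2
    return ks


def example_metrics(
    rewards: list[float], pass_threshold: float = 0.5
) -> tuple[dict[str, float], dict[str, float]]:
    """Compute both metrics for one example's rollout rewards (hypothetical helper)."""
    n = len(rewards)
    # Assumed: a rollout is "correct" when its reward meets or exceeds the threshold.
    c = sum(1 for r in rewards if r >= pass_threshold)
    ks = power_of_two_ks(n)
    at = {str(k): pass_at_k(n, c, k) for k in ks}
    all_ = {str(k): pass_all_k(n, c, k) for k in ks}
    return at, all_
```

For rewards `[1.0, 0.0, 1.0, 0.2]` with a threshold of 0.5, n = 4 and c = 2, so pass@2 is 1 - 1/6 and pass^2 is 1/6. The threshold matters most for environments with graded rewards: raising it reduces c and therefore lowers both metrics.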