
Conversation

@EdwardIrby (Member)

Summary

  • Adds optional confidenceIntervals fields to comparison metric schemas
  • CIs are computed via bootstrap sampling when strategy=statistical
  • Exposes uncertainty bounds for aggregate metrics (avgScore, passRate, latencyMean, avgPassAtK, avgPassExpK)
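The bootstrap sampling mentioned above can be sketched roughly as follows. This is an illustrative percentile-bootstrap implementation, not the actual API of src/graders/bootstrap.ts; the function name and defaults are assumptions.

```typescript
// Percentile-bootstrap confidence interval for the mean of a sample.
// Illustrative sketch only; not the real bootstrap.ts implementation.
function bootstrapCI(
  values: number[],
  iterations = 1000,
  confidence = 0.95,
): [number, number] {
  const means: number[] = []
  for (let i = 0; i < iterations; i++) {
    // Resample with replacement, then record the resample mean.
    let sum = 0
    for (let j = 0; j < values.length; j++) {
      sum += values[Math.floor(Math.random() * values.length)]
    }
    means.push(sum / values.length)
  }
  means.sort((a, b) => a - b)
  const alpha = (1 - confidence) / 2
  const lo = means[Math.floor(alpha * iterations)]
  const hi = means[Math.min(iterations - 1, Math.floor((1 - alpha) * iterations))]
  return [lo, hi]
}
```

Each resample mean necessarily falls within the range of the original sample, so the returned bounds are always contained in [min, max] of the input.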

Changes

  • New file: src/graders/bootstrap.ts — shared bootstrap utility with configurable iterations/confidence level
  • New file: src/graders/bootstrap.spec.ts — 17 unit tests for bootstrap utility
  • Schema updates: ConfidenceIntervalSchema and CI fields in QualityMetrics, PerformanceMetrics, TrialsCapabilityMetrics, TrialsReliabilityMetrics
  • Pipeline updates: CI computation in compare.ts and compare-trials.ts for statistical strategy
  • Markdown: Updated formatters to display 95% CI columns when present

Usage

bunx @plaited/agent-eval-harness compare \
  --strategy statistical \
  run1.jsonl run2.jsonl -o comparison.json

Output includes:

"quality": {
  "run1": {
    "avgScore": 0.85,
    "confidenceIntervals": {
      "avgScore": [0.82, 0.88],
      "passRate": [0.87, 0.93]
    }
  }
}
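One way to read output like the above: when the intervals for the same metric across two runs do not overlap, the difference is unlikely to be sampling noise. A minimal, hypothetical check (the tuple layout matches the JSON above; the function is not part of the harness):

```typescript
type CI = [number, number]

// True when two confidence intervals share no common region,
// suggesting a statistically meaningful difference between runs.
function intervalsDisjoint(a: CI, b: CI): boolean {
  return a[1] < b[0] || b[1] < a[0]
}
```

Overlapping intervals do not prove the runs are equivalent; they only mean the data is consistent with no difference.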

Test plan

  • bun run check passes (type/lint/format)
  • bun test passes (489 tests)
  • Manual verification with --strategy statistical

Closes #39
Closes #40

🤖 Generated with Claude Code

EdwardIrby and others added 2 commits January 30, 2026 06:36
Fixes permission check bypass where `exit 0` only stopped the check
step but allowed subsequent steps to continue running. Unauthorized
users could trigger Claude Code reviews by opening PRs.

Changes:
- Add step ID and output flags (authorized=true/false)
- Gate all subsequent steps with `if:` condition on authorization
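The gating pattern described above looks roughly like this. A minimal sketch, assuming illustrative step and output names; this is not the repository's actual workflow file:

```yaml
jobs:
  review:
    runs-on: ubuntu-latest
    steps:
      - name: Check permission
        id: check  # step ID so later steps can read its output
        run: |
          if [ "${{ github.event.pull_request.author_association }}" = "MEMBER" ]; then
            echo "authorized=true" >> "$GITHUB_OUTPUT"
          else
            echo "authorized=false" >> "$GITHUB_OUTPUT"
          fi
      - name: Run Claude Code review
        # Every subsequent step is gated; a bare `exit 0` in the check
        # step alone would not stop these from running.
        if: steps.check.outputs.authorized == 'true'
        run: echo "review runs only for authorized users"
```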

Closes #39

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Add optional confidenceIntervals fields to comparison metrics schemas,
computed via bootstrap sampling when strategy=statistical. This exposes
uncertainty bounds for aggregate metrics to help assess statistical
significance.

Changes:
- Add ConfidenceIntervalSchema and extend QualityMetrics, PerformanceMetrics,
  TrialsCapabilityMetrics, and TrialsReliabilityMetrics schemas
- Create shared bootstrap utility (src/graders/bootstrap.ts) with configurable
  iterations and confidence level
- Refactor compare-statistical.ts and trials-compare-statistical.ts to use
  shared bootstrap module
- Add CI computation in compare.ts and compare-trials.ts for statistical strategy
- Update markdown formatters to display 95% CI columns when present
- Add comprehensive unit tests for bootstrap utility

Closes #39
Closes #40

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
EdwardIrby and others added 3 commits January 30, 2026 07:32
- Rename `mean` to `median` in BootstrapResult for semantic clarity
  (the value is the 50th percentile of bootstrap means, not arithmetic mean)
- Extract duplicate `formatCI` function to shared bootstrap.ts module
- Remove orphaned TSDoc comment in compare.ts
- Add comprehensive integration tests for statistical strategy CI computation
- Move bootstrap.spec.ts to src/graders/tests/ for consistent organization
- Fix package.json script ordering (formatter cleanup)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
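The distinction behind the mean-to-median rename above can be sketched as follows (names are illustrative, not the BootstrapResult API): the reported point value is the 50th percentile of the bootstrap resample means, not the arithmetic mean of the raw sample.

```typescript
// Nearest-rank percentile of a pre-sorted array; p is in [0, 1].
function percentile(sorted: number[], p: number): number {
  const idx = Math.min(sorted.length - 1, Math.floor(p * sorted.length))
  return sorted[idx]
}

// Hypothetical bootstrap resample means, sorted ascending.
const bootstrapMeans = [0.81, 0.83, 0.85, 0.86, 0.9]
const median = percentile(bootstrapMeans, 0.5) // 50th percentile, hence "median"
```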
Document the new confidence intervals feature from PR #41:
- Add CI output examples to SKILL.md for both CaptureResult and TrialResult
- Update comparison-graders.md with detailed statistical strategy output
- Document markdown output format with 95% CI columns

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
@EdwardIrby EdwardIrby merged commit 9de36eb into main Jan 30, 2026
7 of 8 checks passed
@EdwardIrby EdwardIrby deleted the feat/add-confidence-interval branch January 30, 2026 15:49


Development

Successfully merging this pull request may close these issues.

  • Add confidence intervals to comparison output schema
  • security: fix workflow permission check bypass
