
Conversation

@EdwardIrby (Member)

Summary

  • Adds optional confidenceIntervals fields to comparison metric schemas
  • CIs are computed via bootstrap sampling when strategy=statistical
  • Exposes uncertainty bounds for aggregate metrics (avgScore, passRate, latencyMean, avgPassAtK, avgPassExpK)
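The bootstrap sampling mentioned above can be sketched roughly as follows. This is an illustrative percentile-bootstrap implementation, not the actual API of src/graders/bootstrap.ts; the function name and defaults are assumptions.

```typescript
// Percentile-bootstrap confidence interval for the mean of a sample.
// Illustrative sketch only; not the real bootstrap.ts implementation.
function bootstrapCI(
  values: number[],
  iterations = 1000,
  confidence = 0.95,
): [number, number] {
  const means: number[] = []
  for (let i = 0; i < iterations; i++) {
    // Resample with replacement, then record the resample mean.
    let sum = 0
    for (let j = 0; j < values.length; j++) {
      sum += values[Math.floor(Math.random() * values.length)]
    }
    means.push(sum / values.length)
  }
  means.sort((a, b) => a - b)
  const alpha = (1 - confidence) / 2
  const lo = means[Math.floor(alpha * iterations)]
  const hi = means[Math.min(iterations - 1, Math.floor((1 - alpha) * iterations))]
  return [lo, hi]
}
```

Each resample mean necessarily falls within the range of the original sample, so the returned bounds are always contained in [min, max] of the input.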

Changes

  • New file: src/graders/bootstrap.ts — shared bootstrap utility with configurable iterations/confidence level
  • New file: src/graders/bootstrap.spec.ts — 17 unit tests for bootstrap utility
  • Schema updates: ConfidenceIntervalSchema and CI fields in QualityMetrics, PerformanceMetrics, TrialsCapabilityMetrics, TrialsReliabilityMetrics
  • Pipeline updates: CI computation in compare.ts and compare-trials.ts for statistical strategy
  • Markdown: Updated formatters to display 95% CI columns when present

Usage

bunx @plaited/agent-eval-harness compare \
  --strategy statistical \
  run1.jsonl run2.jsonl -o comparison.json

Output includes:

"quality": {
  "run1": {
    "avgScore": 0.85,
    "confidenceIntervals": {
      "avgScore": [0.82, 0.88],
      "passRate": [0.87, 0.93]
    }
  }
}
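One way to read output like the above: when the intervals for the same metric across two runs do not overlap, the difference is unlikely to be sampling noise. A minimal, hypothetical check (the tuple layout matches the JSON above; the function is not part of the harness):

```typescript
type CI = [number, number]

// True when two confidence intervals share no common region,
// suggesting a statistically meaningful difference between runs.
function intervalsDisjoint(a: CI, b: CI): boolean {
  return a[1] < b[0] || b[1] < a[0]
}
```

Overlapping intervals do not prove the runs are equivalent; they only mean the data is consistent with no difference.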

Test plan

  • bun run check passes (type/lint/format)
  • bun test passes (489 tests)
  • Manual verification with --strategy statistical

Closes #39
Closes #40

🤖 Generated with Claude Code

EdwardIrby and others added 2 commits January 30, 2026 06:36
Fixes permission check bypass where `exit 0` only stopped the check
step but allowed subsequent steps to continue running. Unauthorized
users could trigger Claude Code reviews by opening PRs.

Changes:
- Add step ID and output flags (authorized=true/false)
- Gate all subsequent steps with `if:` condition on authorization
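The gating pattern described above looks roughly like this. A minimal sketch, assuming illustrative step and output names; this is not the repository's actual workflow file:

```yaml
jobs:
  review:
    runs-on: ubuntu-latest
    steps:
      - name: Check permission
        id: check  # step ID so later steps can read its output
        run: |
          if [ "${{ github.event.pull_request.author_association }}" = "MEMBER" ]; then
            echo "authorized=true" >> "$GITHUB_OUTPUT"
          else
            echo "authorized=false" >> "$GITHUB_OUTPUT"
          fi
      - name: Run Claude Code review
        # Every subsequent step is gated; a bare `exit 0` in the check
        # step alone would not stop these from running.
        if: steps.check.outputs.authorized == 'true'
        run: echo "review runs only for authorized users"
```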

Closes #39

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Add optional confidenceIntervals fields to comparison metrics schemas,
computed via bootstrap sampling when strategy=statistical. This exposes
uncertainty bounds for aggregate metrics to help assess statistical
significance.

Changes:
- Add ConfidenceIntervalSchema and extend QualityMetrics, PerformanceMetrics,
  TrialsCapabilityMetrics, and TrialsReliabilityMetrics schemas
- Create shared bootstrap utility (src/graders/bootstrap.ts) with configurable
  iterations and confidence level
- Refactor compare-statistical.ts and trials-compare-statistical.ts to use
  shared bootstrap module
- Add CI computation in compare.ts and compare-trials.ts for statistical strategy
- Update markdown formatters to display 95% CI columns when present
- Add comprehensive unit tests for bootstrap utility

Closes #39
Closes #40

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
EdwardIrby and others added 3 commits January 30, 2026 07:32
- Rename `mean` to `median` in BootstrapResult for semantic clarity
  (the value is the 50th percentile of bootstrap means, not arithmetic mean)
- Extract duplicate `formatCI` function to shared bootstrap.ts module
- Remove orphaned TSDoc comment in compare.ts
- Add comprehensive integration tests for statistical strategy CI computation
- Move bootstrap.spec.ts to src/graders/tests/ for consistent organization
- Fix package.json script ordering (formatter cleanup)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
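The distinction behind the mean-to-median rename above can be sketched as follows (names are illustrative, not the BootstrapResult API): the reported point value is the 50th percentile of the bootstrap resample means, not the arithmetic mean of the raw sample.

```typescript
// Nearest-rank percentile of a pre-sorted array; p is in [0, 1].
function percentile(sorted: number[], p: number): number {
  const idx = Math.min(sorted.length - 1, Math.floor(p * sorted.length))
  return sorted[idx]
}

// Hypothetical bootstrap resample means, sorted ascending.
const bootstrapMeans = [0.81, 0.83, 0.85, 0.86, 0.9]
const median = percentile(bootstrapMeans, 0.5) // 50th percentile, hence "median"
```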
Document the new confidence intervals feature from PR #41:
- Add CI output examples to SKILL.md for both CaptureResult and TrialResult
- Update comparison-graders.md with detailed statistical strategy output
- Document markdown output format with 95% CI columns

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
@EdwardIrby EdwardIrby merged commit 9de36eb into main Jan 30, 2026
7 of 8 checks passed
@EdwardIrby EdwardIrby deleted the feat/add-confidence-interval branch January 30, 2026 15:49


Development

Successfully merging this pull request may close these issues.

  • Add confidence intervals to comparison output schema
  • security: fix workflow permission check bypass
