Skill evals system by jeremy · Pull Request #239 · basecamp/basecamp-cli

jeremy · 2026-03-10T21:47:53Z

Summary

- Adds skill-evals/, an Anthropic-backed tool-calling harness that verifies SKILL.md teaches agents correct CLI usage
- 25 YAML cases across 5 tag groups: calling-convention (9), shortcuts (4), semantics (4), workflow (3), domain (5)
- Documents the assign/unassign shortcuts and reports overdue in SKILL.md (surfaced as coverage gaps by the evals)
- Adds a make skill-eval target to the root Makefile

How it works

Give an LLM the skill + a task + a basecamp tool. Intercept tool_use calls, return mock/generic responses, grade the trace with deterministic assertions (accept, reject, expect_sequence, max_commands).
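
The grading step described above could look roughly like the sketch below. It is an illustration, not the runner's actual code: the `grade` method, the hash keys as Ruby symbols, and the example commands are hypothetical. Only the four assertion names come from the description. Assumed semantics: every accept pattern must match some command, no reject pattern may match any command, expect_sequence patterns must match in order (other commands may sit in between), and max_commands caps the trace length.

```ruby
# Hypothetical grader. `trace` is the list of command strings the model
# passed to the tool; `assertions` holds the four assertion kinds.
# Returns a list of failure messages (empty means the trace passes).
def grade(trace, assertions)
  failures = []

  # accept: every pattern must match at least one command in the trace.
  Array(assertions[:accept]).each do |pattern|
    matched = trace.any? { |cmd| cmd.match?(pattern) }
    failures << "accept #{pattern.inspect} never matched" unless matched
  end

  # reject: no pattern may match any command.
  Array(assertions[:reject]).each do |pattern|
    hit = trace.find { |cmd| cmd.match?(pattern) }
    failures << "reject #{pattern.inspect} matched #{hit.inspect}" if hit
  end

  # expect_sequence: patterns must match in order; unrelated commands
  # are allowed between the matching ones.
  position = 0
  Array(assertions[:expect_sequence]).each do |pattern|
    offset = trace[position..].index { |cmd| cmd.match?(pattern) }
    if offset.nil?
      failures << "sequence step #{pattern.inspect} not found in order"
      break
    end
    position += offset + 1
  end

  # max_commands: upper bound on the number of tool calls.
  limit = assertions[:max_commands]
  if limit && trace.size > limit
    failures << "#{trace.size} commands exceeds max_commands #{limit}"
  end

  failures
end

# Example with made-up commands (the exact CLI syntax is an assumption).
trace = [
  "basecamp todos list --in 123",
  "basecamp todos complete 456 --in 123"
]
failures = grade(trace,
                 accept: [/todos complete 456/],
                 reject: [/recordings/],
                 expect_sequence: [/todos list/, /todos complete/],
                 max_commands: 3)
puts failures.empty? ? "PASS" : "FAIL: #{failures.join('; ')}"
```

Because every check is a regex match or a count, the same trace always gets the same verdict. Only the model's behavior varies between runs.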

./skill-evals/run                                    # all cases
./skill-evals/run cases/create-todo.yml              # single case
./skill-evals/run --model claude-haiku-4-5-20251001  # different model
./skill-evals/run --tag calling-convention           # tagged subset
./skill-evals/run --samples 3                        # majority vote
./skill-evals/run --save baseline                    # save snapshot
./skill-evals/run --compare baseline                 # diff against saved
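
For `--samples`, a majority vote over per-sample verdicts might be computed as in this sketch. The helper name is hypothetical, and treating ties and zero samples as failures is an assumption rather than confirmed runner behavior.

```ruby
# Hypothetical helper: `results` holds one boolean per sample
# (true = that sample's trace satisfied every assertion).
def majority_pass?(results)
  return false if results.empty? # assumption: zero samples never pass

  # Strict majority: a tie (e.g. 1 of 2) counts as a failure.
  results.count(true) > results.size / 2.0
end

puts majority_pass?([true, false, true]) ? "PASS" : "FAIL"
```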

Eval results (Haiku)

PASS agent-output-mode
PASS assign-todo
PASS checkin-answer
PASS complete-multiple
PASS complete-todo
PASS create-card
PASS create-comment
PASS create-doc
PASS create-message
PASS create-todo
PASS list-cards
PASS list-files
PASS list-messages
PASS list-todolists
PASS list-todos
PASS people-in-project
PASS project-scope
PASS recordings-browse
PASS reopen-todo
PASS reports-assigned
PASS reports-overdue
PASS schedule-create
PASS search
PASS url-then-comment
PASS webhook-create

25/25 passed

Test plan

ruby -c skill-evals/run — syntax OK
All 25 cases load and validate (regex patterns compile, project assertions present)
./skill-evals/run --model claude-haiku-4-5-20251001 — 25/25 pass
make fmt-check vet test test-e2e check-naming check-surface provenance-check tidy-check — all pass
Verify ANTHROPIC_API_KEY gating: ./skill-evals/run without key → clean abort
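
A clean abort on a missing key can be as small as the sketch below. The method name and message are hypothetical; the real runner's wording and structure may differ.

```ruby
# Hypothetical guard: returns the key, or exits with status 1 and a
# one-line message on stderr when it is unset or blank.
def require_api_key!(env = ENV)
  key = env["ANTHROPIC_API_KEY"].to_s.strip
  abort "ANTHROPIC_API_KEY is not set; skill evals need it to call the Anthropic API." if key.empty?
  key
end
```

`Kernel#abort` prints the message and raises `SystemExit`, so the user sees a single line instead of a stack trace from a failed HTTP call.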

Summary by cubic

Adds a new skill-evals/ harness to verify skills/basecamp/SKILL.md teaches correct basecamp CLI usage, plus a CI gate that runs these evals on PRs touching the skill or eval files.

New Features
- skill-evals/ Ruby runner intercepts Anthropic tool_use, mocks responses, and grades via accept, reject, expect_sequence, and max_commands. Supports --model, --skill, --tag, --samples, --save, --compare, --json, --verbose, and aborts on missing values for all flags.
- 25 YAML cases across five tag groups; create-doc case tightened to require the doc subcommand and body text.
- Make targets: root make skill-eval (default model claude-sonnet-4-20250514); in skill-evals: eval, eval-save, eval-compare.
- CI gate: runs when skills/basecamp/SKILL.md or skill-evals/** change; warns and skips if ANTHROPIC_API_KEY is missing.
- SKILL.md: documents reports overdue and assign/unassign; examples include --in <project>; clarifies cross-project exceptions.

Migration
- Requires ANTHROPIC_API_KEY and Ruby 3.3.
- Run make skill-eval (override with MODEL=<id>); save/compare baselines via make -C skill-evals eval-save NAME=<n> and eval-compare.
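
Comparing a run against a saved baseline amounts to a diff of per-case verdicts. The sketch below assumes a snapshot is a mapping from case name to "PASS" or "FAIL"; that shape, the "MISSING" marker, and the method name are hypothetical and may not match the runner's on-disk format.

```ruby
# Hypothetical comparison: list every case whose verdict differs between
# the baseline snapshot and the current run, including added or removed cases.
def compare_snapshots(baseline, current)
  (baseline.keys | current.keys).sort.filter_map do |name|
    before = baseline.fetch(name, "MISSING")
    after  = current.fetch(name, "MISSING")
    "#{name}: #{before} -> #{after}" unless before == after
  end
end

baseline = { "create-todo" => "PASS", "search" => "PASS" }
current  = { "create-todo" => "FAIL", "search" => "PASS" }
puts compare_snapshots(baseline, current)
```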

Written for commit fefab4a. Summary will update on new commits.

cubic-dev-ai

3 issues found across 30 files

Prompt for AI agents (unresolved issues)


Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.


<file name="skill-evals/cases/create-doc.yml">

<violation number="1" location="skill-evals/cases/create-doc.yml:4">
P2: Accept patterns don't verify the body text "Getting started with the API" is passed to the command. The eval will pass even if the agent omits the body entirely. Other create cases (e.g., `create-comment.yml`) match content in their accept patterns.</violation>
</file>

<file name="skill-evals/run">

<violation number="1" location="skill-evals/run:48">
P2: `argv.shift` returns `nil` when a flag like `--samples` is the last argument. `nil.to_i` silently produces `0`, causing every case to report FAIL with zero samples and no error message. Guard against missing flag values.</violation>

<violation number="2" location="skill-evals/run:72">
P2: `YAML.load_file` should be `YAML.safe_load_file` to make deserialization safety explicit. On Ruby < 3.1 `load_file` permits arbitrary object instantiation; `safe_load_file` is safe on all versions and signals intent.</violation>
</file>
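
The first skill-evals/run violation is worth seeing concretely: `[].shift` is `nil` and `nil.to_i` is `0`, so a trailing `--samples` silently becomes zero samples. One way to guard against it is sketched below. The helper names are hypothetical, and treating a following `--flag` as a missing value is a design assumption, not something the review asked for.

```ruby
# Hypothetical helper: take the value that follows a flag off argv and
# abort with a clear message when it is missing.
def flag_value!(argv, flag)
  value = argv.shift
  abort "#{flag} requires a value" if value.nil? || value.start_with?("--")
  value
end

# Parse --samples strictly: Integer(..., exception: false) returns nil for
# non-numeric input instead of silently producing 0 the way to_i does.
def parse_samples(argv)
  samples = Integer(flag_value!(argv, "--samples"), exception: false)
  abort "--samples must be a positive integer" unless samples&.positive?
  samples
end
```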

Copilot

Pull request overview

Introduces a new skill-evals/ harness that runs Anthropic tool-calling evaluations to verify skills/basecamp/SKILL.md teaches agents correct Basecamp CLI usage, and updates the skill doc to cover gaps found by the eval suite.

Changes:

Added a Ruby runner (skill-evals/run) that executes YAML-defined eval cases, captures tool traces, and grades them via deterministic assertions.
Added 25 YAML eval cases covering calling conventions, shortcuts, semantics, workflow, and domain scenarios.
Updated skills/basecamp/SKILL.md to document cross-project reports overdue and assign/unassign shortcuts; added a top-level make skill-eval target.

Reviewed changes

Copilot reviewed 29 out of 30 changed files in this pull request and generated 5 comments.

Show a summary per file

File	Description
skills/basecamp/SKILL.md	Documents `reports overdue` and adds assign/unassign examples in quick reference and todos section.
skill-evals/run	New Ruby eval runner that calls Anthropic Messages API, mocks tool results, and grades traces.
skill-evals/Makefile	Adds local make targets to run/save/compare eval runs.
Makefile	Adds `skill-eval` target delegating to `skill-evals/Makefile`.
skill-evals/results/.gitkeep	Ensures results directory is tracked.
skill-evals/cases/agent-output-mode.yml	Case ensuring agents use machine-readable output flags.
skill-evals/cases/assign-todo.yml	Case verifying assign shortcut usage.
skill-evals/cases/checkin-answer.yml	Case verifying check-in answer creation usage.
skill-evals/cases/complete-multiple.yml	Case verifying multi-ID completion shortcut usage.
skill-evals/cases/complete-todo.yml	Case verifying single todo completion shortcut usage.
skill-evals/cases/create-card.yml	Case verifying card creation calling convention.
skill-evals/cases/create-comment.yml	Case verifying comment creation calling convention.
skill-evals/cases/create-doc.yml	Case verifying document creation usage.
skill-evals/cases/create-message.yml	Case verifying message creation calling convention.
skill-evals/cases/create-todo.yml	Case verifying todo creation calling convention.
skill-evals/cases/list-cards.yml	Case verifying cards list calling convention.
skill-evals/cases/list-files.yml	Case verifying files list calling convention.
skill-evals/cases/list-messages.yml	Case verifying messages list calling convention.
skill-evals/cases/list-todolists.yml	Case verifying todolists list calling convention.
skill-evals/cases/list-todos.yml	Case verifying todos list calling convention.
skill-evals/cases/people-in-project.yml	Case verifying project-scoped people listing semantics.
skill-evals/cases/project-scope.yml	Case verifying project scoping behavior for overdue todos.
skill-evals/cases/recordings-browse.yml	Case verifying cross-project recordings browse semantics.
skill-evals/cases/reopen-todo.yml	Case verifying reopen shortcut usage.
skill-evals/cases/reports-assigned.yml	Case verifying cross-project assigned report usage.
skill-evals/cases/reports-overdue.yml	Case verifying cross-project overdue report usage.
skill-evals/cases/schedule-create.yml	Case verifying schedule entry creation usage.
skill-evals/cases/search.yml	Case verifying search semantics and rejecting recordings misuse.
skill-evals/cases/url-then-comment.yml	Case verifying workflow: parse URL then comment.
skill-evals/cases/webhook-create.yml	Case verifying webhook creation usage.

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: f7156f6451

Tool-calling evals that verify SKILL.md teaches correct CLI usage. Gives an LLM the skill + a task + a basecamp tool, intercepts tool_use calls, and grades traces with deterministic accept/reject/sequence assertions. Ruby, stdlib only, Anthropic API. Cases cover calling-convention (9), shortcuts (4), semantics (4), workflow (3), and domain (5) tag groups. All 25 pass on Haiku.

- safe_load_file instead of load_file for YAML deserialization safety
- Validate --samples/--save/--compare flag values, abort on missing args
- Show a failing sample (not best) in FAIL diagnostics and snapshots
- Verify body text in create-doc eval case
- Add --in <project> to assign/unassign examples in SKILL.md
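
For the YAML change, a case loader built on `YAML.safe_load_file` might look like this sketch. The `load_case` name and the mapping check are hypothetical; only the switch to `safe_load_file` comes from the commit message.

```ruby
require "yaml"

# Hypothetical loader: safe_load_file only builds plain data (strings,
# numbers, booleans, nil, arrays, hashes) and raises Psych::DisallowedClass
# if the file tries to instantiate any other Ruby object.
def load_case(path)
  data = YAML.safe_load_file(path)
  raise ArgumentError, "#{path}: expected a YAML mapping" unless data.is_a?(Hash)
  data
end
```

Eval cases are plain strings, lists, and numbers, so the restricted loader should lose nothing, and a malformed or hostile case file fails loudly.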

github-actions · 2026-03-10T22:07:47Z

Sensitive Change Detection (shadow mode)

This PR modifies control-plane files:

.github/workflows/test.yml

Shadow mode — this check is informational only. When activated, changes to these paths will require approval from a maintainer.

Copilot

Pull request overview

Copilot reviewed 30 out of 31 changed files in this pull request and generated 2 comments.

- Add abort guards for --model, --skill, --tag matching the existing pattern used by --samples/--save/--compare.
- Fix create-doc accept pattern to require `doc` subcommand, preventing false passes on invalid `docs create` (without the `doc` subcommand).
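
A tightened accept pattern of the kind described might look like the sketch below. The regex, the constant name, and the example command strings are all hypothetical. The real case file and the exact CLI argument layout are not shown in this PR conversation; only the body text "Getting started with the API" and the `doc` subcommand requirement are.

```ruby
# Hypothetical accept pattern: requires the `doc create` subcommand pair
# followed, later in the same command, by the expected body text.
CREATE_DOC_ACCEPT = /\bdoc\s+create\b.*Getting started with the API/

def accepts_create_doc?(command)
  command.match?(CREATE_DOC_ACCEPT)
end

# Made-up command strings; the argument layout is an assumption.
puts accepts_create_doc?('basecamp docs doc create "API Guide" "Getting started with the API" --in 123')
puts accepts_create_doc?('basecamp docs create "API Guide" "Getting started with the API" --in 123')
```

The `\bdoc\s+create` prefix rejects `docs create` because the `s` sits where whitespace is required, and the trailing literal rejects commands that omit the body.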

Document assign/unassign shortcuts and reports overdue in SKILL.md

4c8fcb5

These commands exist in the CLI but were missing from the skill. Surfaced by eval cases that failed before the skill documented them.

jeremy requested a review from a team as a code owner March 10, 2026 21:47

Copilot AI review requested due to automatic review settings March 10, 2026 21:47

github-actions bot added the skills Agent skills label Mar 10, 2026

Copilot started reviewing on behalf of jeremy March 10, 2026 21:48

cubic-dev-ai bot reviewed Mar 10, 2026

Copilot AI reviewed Mar 10, 2026

chatgpt-codex-connector bot reviewed Mar 10, 2026

jeremy added 2 commits March 10, 2026 15:00

jeremy force-pushed the evals branch from f7156f6 to 3a2310b Compare March 10, 2026 22:00

Add skill evals CI gate for PRs touching skill or eval files

2476e9f

Runs skill evals conditionally when skills/basecamp/SKILL.md or skill-evals/** change in a pull request. Gracefully skips if ANTHROPIC_API_KEY is not configured as a repository secret.

Copilot AI review requested due to automatic review settings March 10, 2026 22:07

github-actions bot added the ci CI/CD workflows label Mar 10, 2026

Copilot started reviewing on behalf of jeremy March 10, 2026 22:08

Copilot AI reviewed Mar 10, 2026

jeremy merged commit 1871606 into main Mar 10, 2026
26 checks passed

jeremy deleted the evals branch March 10, 2026 23:47

jeremy merged 5 commits into main from evals