Skill evals system #239

Merged
jeremy merged 5 commits into main from evals
Mar 10, 2026

@jeremy jeremy commented Mar 10, 2026

Summary

  • Adds skill-evals/ — Anthropic-backed tool-calling harness that verifies SKILL.md teaches agents correct CLI usage
  • 25 YAML cases across 5 tag groups: calling-convention (9), shortcuts (4), semantics (4), workflow (3), domain (5)
  • Documents assign/unassign shortcuts and reports overdue in SKILL.md (surfaced as coverage gaps by the evals)
  • Adds make skill-eval target to root Makefile

How it works

Give an LLM the skill + a task + a basecamp tool. Intercept tool_use calls, return mock/generic responses, grade the trace with deterministic assertions (accept, reject, expect_sequence, max_commands).
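As a rough illustration of the grading step described above, here is a minimal Ruby sketch of deterministic assertions applied to a captured trace. The case shape, the `grade` helper, and the command strings are all assumptions made for this example; the real schema in skill-evals/cases/ and the real runner logic may differ.

```ruby
# Hypothetical case shape; the real YAML schema may use different keys or semantics.
SAMPLE_CASE = {
  "accept"          => ["todos\\s+create\\b"],            # every pattern must match some command
  "reject"          => ["recordings\\b"],                 # no command may match any pattern
  "expect_sequence" => ["todolists", "todos\\s+create"],  # patterns must match in this order
  "max_commands"    => 3                                  # upper bound on trace length
}

# Returns a list of failure messages; an empty list means the trace passes.
def grade(trace, kase)
  failures = []

  Array(kase["accept"]).each do |pat|
    re = Regexp.new(pat)
    failures << "accept not matched: #{pat}" unless trace.any? { |cmd| cmd.match?(re) }
  end

  Array(kase["reject"]).each do |pat|
    re = Regexp.new(pat)
    failures << "reject matched: #{pat}" if trace.any? { |cmd| cmd.match?(re) }
  end

  # Each step must match a command after the previous step's match.
  idx = 0
  Array(kase["expect_sequence"]).each do |pat|
    re = Regexp.new(pat)
    found = trace[idx..].index { |cmd| cmd.match?(re) }
    if found.nil?
      failures << "sequence step not found: #{pat}"
      break
    end
    idx += found + 1
  end

  max = kase["max_commands"]
  failures << "too many commands: #{trace.size} > #{max}" if max && trace.size > max

  failures
end

# Illustrative command strings only, not verified CLI syntax.
trace = ["basecamp todolists list --in 123", "basecamp todos create \"Buy milk\" --in 123"]
p grade(trace, SAMPLE_CASE)  # => []
```

Because every check is a regex or a count over the trace, the same trace always grades the same way; only the model's tool calls vary between runs.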

./skill-evals/run                                    # all cases
./skill-evals/run cases/create-todo.yml              # single case
./skill-evals/run --model claude-haiku-4-5-20251001  # different model
./skill-evals/run --tag calling-convention           # tagged subset
./skill-evals/run --samples 3                        # majority vote
./skill-evals/run --save baseline                    # save snapshot
./skill-evals/run --compare baseline                 # diff against saved
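The `--samples 3` majority vote could be sketched as below. This is an assumption about the rule, not the runner's actual code: here a case passes only when strictly more than half of its samples pass, so ties and an empty sample set both count as FAIL.

```ruby
# Assumed rule: strict majority of samples must pass; ties and zero samples fail.
def majority_pass?(sample_results)
  passes = sample_results.count(true)
  passes > sample_results.size / 2.0
end

p majority_pass?([true, false, true])  # => true
p majority_pass?([true, false])        # => false
p majority_pass?([])                   # => false
```

Under this rule a run with zero samples can never pass, which is consistent with the review note that `--samples` parsed as `0` makes every case report FAIL.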

Eval results (Haiku)

PASS agent-output-mode
PASS assign-todo
PASS checkin-answer
PASS complete-multiple
PASS complete-todo
PASS create-card
PASS create-comment
PASS create-doc
PASS create-message
PASS create-todo
PASS list-cards
PASS list-files
PASS list-messages
PASS list-todolists
PASS list-todos
PASS people-in-project
PASS project-scope
PASS recordings-browse
PASS reopen-todo
PASS reports-assigned
PASS reports-overdue
PASS schedule-create
PASS search
PASS url-then-comment
PASS webhook-create

25/25 passed

Test plan

  • ruby -c skill-evals/run — syntax OK
  • All 25 cases load and validate (regex patterns compile, project assertions present)
  • ./skill-evals/run --model claude-haiku-4-5-20251001 — 25/25 pass
  • make fmt-check vet test test-e2e check-naming check-surface provenance-check tidy-check — all pass
  • Verify ANTHROPIC_API_KEY gating: ./skill-evals/run without key → clean abort
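A clean abort on a missing key, as in the last item of the test plan, might look like the following sketch. The helper name and the message wording are hypothetical; the real runner's message and exit status may differ.

```ruby
# Hypothetical helper: return the API key, or exit with a short message if it is unset.
def require_api_key!(env = ENV)
  key = env["ANTHROPIC_API_KEY"].to_s.strip
  # Kernel#abort prints to stderr and exits with status 1, without a backtrace.
  abort "ANTHROPIC_API_KEY is not set; export it before running skill evals" if key.empty?
  key
end
```

Calling this before any case is loaded means a missing key fails fast instead of surfacing later as an HTTP error from the API.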

Summary by cubic

Adds a new skill-evals/ harness to verify skills/basecamp/SKILL.md teaches correct basecamp CLI usage, plus a CI gate that runs these evals on PRs touching the skill or eval files.

  • New Features

    • skill-evals/ Ruby runner intercepts Anthropic tool_use, mocks responses, and grades via accept, reject, expect_sequence, and max_commands. Supports --model, --skill, --tag, --samples, --save, --compare, --json, --verbose, and aborts on missing values for all flags.
    • 25 YAML cases across five tag groups; create-doc case tightened to require the doc subcommand and body text.
    • Make targets: root make skill-eval (default model claude-sonnet-4-20250514); in skill-evals: eval, eval-save, eval-compare.
    • CI gate: runs when skills/basecamp/SKILL.md or skill-evals/** change; warns and skips if ANTHROPIC_API_KEY is missing.
    • SKILL.md: documents reports overdue and assign/unassign; examples include --in <project>; clarifies cross-project exceptions.
  • Migration

    • Requires ANTHROPIC_API_KEY and Ruby 3.3.
    • Run make skill-eval (override with MODEL=<id>); save/compare baselines via make -C skill-evals eval-save NAME=<n> and eval-compare.

Written for commit fefab4a. Summary will update on new commits.

These commands exist in the CLI but were missing from the skill.
Surfaced by eval cases that failed before the skill documented them.
@jeremy jeremy requested a review from a team as a code owner March 10, 2026 21:47
Copilot AI review requested due to automatic review settings March 10, 2026 21:47
@github-actions github-actions bot added the skills Agent skills label Mar 10, 2026

@cubic-dev-ai cubic-dev-ai bot left a comment

3 issues found across 30 files

Prompt for AI agents (unresolved issues)

Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.


<file name="skill-evals/cases/create-doc.yml">

<violation number="1" location="skill-evals/cases/create-doc.yml:4">
P2: Accept patterns don't verify the body text "Getting started with the API" is passed to the command. The eval will pass even if the agent omits the body entirely. Other create cases (e.g., `create-comment.yml`) match content in their accept patterns.</violation>
</file>

<file name="skill-evals/run">

<violation number="1" location="skill-evals/run:48">
P2: `argv.shift` returns `nil` when a flag like `--samples` is the last argument. `nil.to_i` silently produces `0`, causing every case to report FAIL with zero samples and no error message. Guard against missing flag values.</violation>

<violation number="2" location="skill-evals/run:72">
P2: `YAML.load_file` should be `YAML.safe_load_file` to make deserialization safety explicit. On Ruby < 3.1 `load_file` permits arbitrary object instantiation; `safe_load_file` is safe on all versions and signals intent.</violation>
</file>
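One possible guard for the missing flag value issue is sketched below. The helper names are hypothetical and the real option parsing in skill-evals/run is not shown in this PR conversation, so treat this as an illustration of the failure mode and one way to close it.

```ruby
# The failure mode: a flag given last leaves argv.shift returning nil, and nil.to_i is 0.
p nil.to_i  # => 0

# Hypothetical helper: take the value after a flag, or abort if it is missing.
def flag_value!(argv, flag)
  value = argv.shift
  abort "#{flag} requires a value" if value.nil? || value.start_with?("--")
  value
end

# Parses only --samples for brevity; other arguments are ignored in this sketch.
def parse_samples(argv)
  samples = 1
  until argv.empty?
    case argv.shift
    when "--samples"
      raw = flag_value!(argv, "--samples")
      samples = Integer(raw, exception: false)  # nil for non-numeric input
      abort "--samples must be a positive integer, got #{raw.inspect}" unless samples&.positive?
    end
  end
  samples
end

p parse_samples(%w[--samples 3])  # => 3
```

Using `Integer(raw, exception: false)` instead of `to_i` also rejects non-numeric input such as `--samples abc`, which `to_i` would silently turn into `0`.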

Reply with feedback, questions, or to request a fix. Tag @cubic-dev-ai to re-run a review.


Copilot AI left a comment

Pull request overview

Introduces a new skill-evals/ harness that runs Anthropic tool-calling evaluations to verify skills/basecamp/SKILL.md teaches agents correct Basecamp CLI usage, and updates the skill doc to cover gaps found by the eval suite.

Changes:

  • Added a Ruby runner (skill-evals/run) that executes YAML-defined eval cases, captures tool traces, and grades them via deterministic assertions.
  • Added 25 YAML eval cases covering calling conventions, shortcuts, semantics, workflow, and domain scenarios.
  • Updated skills/basecamp/SKILL.md to document cross-project reports overdue and assign/unassign shortcuts; added a top-level make skill-eval target.

Reviewed changes

Copilot reviewed 29 out of 30 changed files in this pull request and generated 5 comments.

Show a summary per file
| File | Description |
|---|---|
| skills/basecamp/SKILL.md | Documents reports overdue and adds assign/unassign examples in quick reference and todos section. |
| skill-evals/run | New Ruby eval runner that calls Anthropic Messages API, mocks tool results, and grades traces. |
| skill-evals/Makefile | Adds local make targets to run/save/compare eval runs. |
| Makefile | Adds skill-eval target delegating to skill-evals/Makefile. |
| skill-evals/results/.gitkeep | Ensures results directory is tracked. |
| skill-evals/cases/agent-output-mode.yml | Case ensuring agents use machine-readable output flags. |
| skill-evals/cases/assign-todo.yml | Case verifying assign shortcut usage. |
| skill-evals/cases/checkin-answer.yml | Case verifying check-in answer creation usage. |
| skill-evals/cases/complete-multiple.yml | Case verifying multi-ID completion shortcut usage. |
| skill-evals/cases/complete-todo.yml | Case verifying single todo completion shortcut usage. |
| skill-evals/cases/create-card.yml | Case verifying card creation calling convention. |
| skill-evals/cases/create-comment.yml | Case verifying comment creation calling convention. |
| skill-evals/cases/create-doc.yml | Case verifying document creation usage. |
| skill-evals/cases/create-message.yml | Case verifying message creation calling convention. |
| skill-evals/cases/create-todo.yml | Case verifying todo creation calling convention. |
| skill-evals/cases/list-cards.yml | Case verifying cards list calling convention. |
| skill-evals/cases/list-files.yml | Case verifying files list calling convention. |
| skill-evals/cases/list-messages.yml | Case verifying messages list calling convention. |
| skill-evals/cases/list-todolists.yml | Case verifying todolists list calling convention. |
| skill-evals/cases/list-todos.yml | Case verifying todos list calling convention. |
| skill-evals/cases/people-in-project.yml | Case verifying project-scoped people listing semantics. |
| skill-evals/cases/project-scope.yml | Case verifying project scoping behavior for overdue todos. |
| skill-evals/cases/recordings-browse.yml | Case verifying cross-project recordings browse semantics. |
| skill-evals/cases/reopen-todo.yml | Case verifying reopen shortcut usage. |
| skill-evals/cases/reports-assigned.yml | Case verifying cross-project assigned report usage. |
| skill-evals/cases/reports-overdue.yml | Case verifying cross-project overdue report usage. |
| skill-evals/cases/schedule-create.yml | Case verifying schedule entry creation usage. |
| skill-evals/cases/search.yml | Case verifying search semantics and rejecting recordings misuse. |
| skill-evals/cases/url-then-comment.yml | Case verifying workflow: parse URL then comment. |
| skill-evals/cases/webhook-create.yml | Case verifying webhook creation usage. |

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: f7156f6451


jeremy added 2 commits March 10, 2026 15:00
Tool-calling evals that verify SKILL.md teaches correct CLI usage.
Gives an LLM the skill + a task + a basecamp tool, intercepts tool_use
calls, and grades traces with deterministic accept/reject/sequence
assertions. Ruby, stdlib only, Anthropic API.

Cases cover calling-convention (9), shortcuts (4), semantics (4),
workflow (3), and domain (5) tag groups. All 25 pass on Haiku.

- safe_load_file instead of load_file for YAML deserialization safety
- Validate --samples/--save/--compare flag values, abort on missing args
- Show a failing sample (not best) in FAIL diagnostics and snapshots
- Verify body text in create-doc eval case
- Add --in <project> to assign/unassign examples in SKILL.md

Runs skill evals conditionally when skills/basecamp/SKILL.md or
skill-evals/** change in a pull request. Gracefully skips if
ANTHROPIC_API_KEY is not configured as a repository secret.
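The switch to `safe_load_file` mentioned in the commit notes above can be illustrated with a small loader. The `load_case` name and the top-level mapping check are assumptions for this sketch; only `YAML.safe_load_file` itself is taken from the PR.

```ruby
require "yaml"

# Hypothetical loader: parse a case file with the restricted (safe) YAML loader.
# safe_load_file only builds plain data types (strings, numbers, arrays, hashes, ...)
# and raises Psych::DisallowedClass for anything else, such as Ruby symbols or objects.
def load_case(path)
  data = YAML.safe_load_file(path)
  raise ArgumentError, "#{path}: expected a mapping at top level" unless data.is_a?(Hash)
  data
end
```

Since eval cases are plain strings, lists, and integers, the restricted loader should not reject anything a well-formed case needs, and it makes the intent explicit regardless of the Ruby version's `load_file` default.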
Copilot AI review requested due to automatic review settings March 10, 2026 22:07
@github-actions github-actions bot added the ci CI/CD workflows label Mar 10, 2026

github-actions bot commented Mar 10, 2026

Sensitive Change Detection (shadow mode)

This PR modifies control-plane files:

  • .github/workflows/test.yml

Shadow mode — this check is informational only. When activated, changes to these paths will require approval from a maintainer.

Copilot AI left a comment

Pull request overview

Copilot reviewed 30 out of 31 changed files in this pull request and generated 2 comments.

- Add abort guards for --model, --skill, --tag matching the existing
  pattern used by --samples/--save/--compare.
- Fix create-doc accept pattern to require `doc` subcommand, preventing
  false passes on invalid `docs create` (without the `doc` subcommand).
@jeremy jeremy merged commit 1871606 into main Mar 10, 2026
26 checks passed
@jeremy jeremy deleted the evals branch March 10, 2026 23:47
Labels

ci (CI/CD workflows), skills (Agent skills)

2 participants