These commands exist in the CLI but were missing from the skill. Surfaced by eval cases that failed before the skill documented them.
3 issues found across 30 files
Prompt for AI agents (unresolved issues)
Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.
<file name="skill-evals/cases/create-doc.yml">
<violation number="1" location="skill-evals/cases/create-doc.yml:4">
P2: Accept patterns don't verify the body text "Getting started with the API" is passed to the command. The eval will pass even if the agent omits the body entirely. Other create cases (e.g., `create-comment.yml`) match content in their accept patterns.</violation>
</file>
<file name="skill-evals/run">
<violation number="1" location="skill-evals/run:48">
P2: `argv.shift` returns `nil` when a flag like `--samples` is the last argument. `nil.to_i` silently produces `0`, causing every case to report FAIL with zero samples and no error message. Guard against missing flag values.</violation>
<violation number="2" location="skill-evals/run:72">
P2: `YAML.load_file` should be `YAML.safe_load_file` to make deserialization safety explicit. On Ruby < 3.1 `load_file` permits arbitrary object instantiation; `safe_load_file` is safe on all versions and signals intent.</violation>
</file>
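A minimal sketch of the guard requested for `skill-evals/run:48`, assuming the runner parses flags by shifting values off `argv`. The names `flag_value!` and `parse_args` and the default of one sample are hypothetical, not taken from the runner.

```ruby
# Sketch only: flag_value! and parse_args are hypothetical names,
# not the runner's actual code.

# Return the value that follows a flag, or abort with a clear message.
# A following "--flag" is treated as a missing value.
def flag_value!(argv, flag)
  value = argv.shift
  abort "#{flag} requires a value" if value.nil? || value.start_with?("--")
  value
end

def parse_args(argv)
  argv = argv.dup
  opts = { samples: 1 } # default of 1 is an assumption
  until argv.empty?
    arg = argv.shift
    case arg
    when "--samples"
      # Integer(..., exception: false) returns nil for non-numeric input,
      # unlike String#to_i, which silently returns 0.
      n = Integer(flag_value!(argv, arg), exception: false)
      abort "--samples must be a positive integer" unless n&.positive?
      opts[:samples] = n
    when "--model"
      opts[:model] = flag_value!(argv, arg)
    else
      abort "unknown argument: #{arg}"
    end
  end
  opts
end
```

With this shape, `parse_args(%w[--samples])` exits with status 1 and a message instead of running every case with zero samples.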
Pull request overview
Introduces a new skill-evals/ harness that runs Anthropic tool-calling evaluations to verify skills/basecamp/SKILL.md teaches agents correct Basecamp CLI usage, and updates the skill doc to cover gaps found by the eval suite.
Changes:
- Added a Ruby runner (`skill-evals/run`) that executes YAML-defined eval cases, captures tool traces, and grades them via deterministic assertions.
- Added 25 YAML eval cases covering calling conventions, shortcuts, semantics, workflow, and domain scenarios.
- Updated `skills/basecamp/SKILL.md` to document cross-project `reports overdue` and assign/unassign shortcuts; added a top-level `make skill-eval` target.
Reviewed changes
Copilot reviewed 29 out of 30 changed files in this pull request and generated 5 comments.
| File | Description |
|---|---|
| skills/basecamp/SKILL.md | Documents reports overdue and adds assign/unassign examples in quick reference and todos section. |
| skill-evals/run | New Ruby eval runner that calls Anthropic Messages API, mocks tool results, and grades traces. |
| skill-evals/Makefile | Adds local make targets to run/save/compare eval runs. |
| Makefile | Adds skill-eval target delegating to skill-evals/Makefile. |
| skill-evals/results/.gitkeep | Ensures results directory is tracked. |
| skill-evals/cases/agent-output-mode.yml | Case ensuring agents use machine-readable output flags. |
| skill-evals/cases/assign-todo.yml | Case verifying assign shortcut usage. |
| skill-evals/cases/checkin-answer.yml | Case verifying check-in answer creation usage. |
| skill-evals/cases/complete-multiple.yml | Case verifying multi-ID completion shortcut usage. |
| skill-evals/cases/complete-todo.yml | Case verifying single todo completion shortcut usage. |
| skill-evals/cases/create-card.yml | Case verifying card creation calling convention. |
| skill-evals/cases/create-comment.yml | Case verifying comment creation calling convention. |
| skill-evals/cases/create-doc.yml | Case verifying document creation usage. |
| skill-evals/cases/create-message.yml | Case verifying message creation calling convention. |
| skill-evals/cases/create-todo.yml | Case verifying todo creation calling convention. |
| skill-evals/cases/list-cards.yml | Case verifying cards list calling convention. |
| skill-evals/cases/list-files.yml | Case verifying files list calling convention. |
| skill-evals/cases/list-messages.yml | Case verifying messages list calling convention. |
| skill-evals/cases/list-todolists.yml | Case verifying todolists list calling convention. |
| skill-evals/cases/list-todos.yml | Case verifying todos list calling convention. |
| skill-evals/cases/people-in-project.yml | Case verifying project-scoped people listing semantics. |
| skill-evals/cases/project-scope.yml | Case verifying project scoping behavior for overdue todos. |
| skill-evals/cases/recordings-browse.yml | Case verifying cross-project recordings browse semantics. |
| skill-evals/cases/reopen-todo.yml | Case verifying reopen shortcut usage. |
| skill-evals/cases/reports-assigned.yml | Case verifying cross-project assigned report usage. |
| skill-evals/cases/reports-overdue.yml | Case verifying cross-project overdue report usage. |
| skill-evals/cases/schedule-create.yml | Case verifying schedule entry creation usage. |
| skill-evals/cases/search.yml | Case verifying search semantics and rejecting recordings misuse. |
| skill-evals/cases/url-then-comment.yml | Case verifying workflow: parse URL then comment. |
| skill-evals/cases/webhook-create.yml | Case verifying webhook creation usage. |
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: f7156f6451
Tool-calling evals that verify SKILL.md teaches correct CLI usage. Gives an LLM the skill + a task + a basecamp tool, intercepts tool_use calls, and grades traces with deterministic accept/reject/sequence assertions. Ruby, stdlib only, Anthropic API. Cases cover calling-convention (9), shortcuts (4), semantics (4), workflow (3), and domain (5) tag groups. All 25 pass on Haiku.
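The grading step described above could be sketched roughly as follows. This is an illustration under assumptions, not the runner's code: `EvalCase` and `grade` are hypothetical names, a trace is assumed to be an array of command strings, every `accept` pattern is assumed to need at least one matching command, and `expect_sequence` is assumed to mean "in order, not necessarily adjacent".

```ruby
# Hypothetical case shape; the real YAML schema may differ.
EvalCase = Struct.new(:accept, :reject, :expect_sequence, :max_commands,
                      keyword_init: true)

# Return a list of failure messages; an empty list means the trace passes.
def grade(trace, kase)
  failures = []

  # accept: each pattern must match at least one command in the trace.
  Array(kase.accept).each do |pat|
    unless trace.any? { |cmd| cmd.match?(pat) }
      failures << "no command matched accept #{pat.inspect}"
    end
  end

  # reject: no command may match any of these patterns.
  Array(kase.reject).each do |pat|
    hit = trace.find { |cmd| cmd.match?(pat) }
    failures << "#{hit.inspect} matched reject #{pat.inspect}" if hit
  end

  # expect_sequence: patterns must match in order, gaps allowed.
  cursor = 0
  Array(kase.expect_sequence).each do |pat|
    offset = trace[cursor..].index { |cmd| cmd.match?(pat) }
    if offset.nil?
      failures << "sequence step #{pat.inspect} not found in order"
      break
    end
    cursor += offset + 1
  end

  # max_commands: cap on the number of tool calls in the trace.
  if kase.max_commands && trace.size > kase.max_commands
    failures << "used #{trace.size} commands, limit is #{kase.max_commands}"
  end

  failures
end
```

Because every check is a plain pattern match or count, the same trace always grades the same way; only the model's sampling varies between runs.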
- safe_load_file instead of load_file for YAML deserialization safety
- Validate --samples/--save/--compare flag values, abort on missing args
- Show a failing sample (not best) in FAIL diagnostics and snapshots
- Verify body text in create-doc eval case
- Add --in <project> to assign/unassign examples in SKILL.md
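To illustrate the first item, here is a small sketch of what safe loading rejects. It uses `YAML.safe_load` on strings so it stays self-contained; `safe_load_file` applies the same restrictions to a file path. The case fields shown (`task`, `accept`) are assumptions for illustration, not the actual case schema.

```ruby
require "yaml"

# Illustrative case document; the field names are assumptions.
plain = <<~CASE
  task: "Create a doc"
  accept:
    - "doc create"
CASE

# A document that asks the loader to instantiate a Ruby object.
tagged = "--- !ruby/object:Object {}\n"

# Plain scalars, arrays, and hashes load normally.
data = YAML.safe_load(plain)

# Object tags are refused unless the class is explicitly permitted.
rejected = nil
begin
  YAML.safe_load(tagged)
rescue Psych::DisallowedClass => e
  rejected = e.message
end
```

Eval cases only need strings, lists, and maps, so nothing is lost by refusing object tags.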
Runs skill evals conditionally when skills/basecamp/SKILL.md or skill-evals/** change in a pull request. Gracefully skips if ANTHROPIC_API_KEY is not configured as a repository secret.
Sensitive Change Detection (shadow mode)
This PR modifies control-plane files:
Pull request overview
Copilot reviewed 30 out of 31 changed files in this pull request and generated 2 comments.
- Add abort guards for --model, --skill, --tag matching the existing pattern used by --samples/--save/--compare.
- Fix create-doc accept pattern to require `doc` subcommand, preventing false passes on invalid `docs create` (without the `doc` subcommand).
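A rough sketch of how a tightened accept check might behave. The patterns and the `accepted?` helper are hypothetical, and the sample command strings in the usage note are illustrative only; the actual patterns in `create-doc.yml` and the exact CLI syntax are not shown in this thread.

```ruby
# Hypothetical patterns; the real create-doc.yml may use different ones.
ACCEPT = [
  /\bdoc\s+create\b/,             # requires the `doc` subcommand before `create`
  /Getting started with the API/  # requires the body text from the task
].freeze

# A command passes only if every accept pattern matches it.
def accepted?(command)
  ACCEPT.all? { |pat| command.match?(pat) }
end
```

Under these assumptions, a command containing `docs create` without the `doc` subcommand fails the first pattern, because the word boundary plus required whitespace does not match `docs`, and a command that omits the body text fails the second.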
Summary
- `skill-evals/`: Anthropic-backed tool-calling harness that verifies SKILL.md teaches agents correct CLI usage
- Document `assign`/`unassign` shortcuts and `reports overdue` in SKILL.md (surfaced as coverage gaps by the evals)
- Add `make skill-eval` target to root Makefile
How it works
Give an LLM the skill + a task + a `basecamp` tool. Intercept `tool_use` calls, return mock/generic responses, grade the trace with deterministic assertions (`accept`, `reject`, `expect_sequence`, `max_commands`).
Eval results (Haiku)
Test plan
- `ruby -c skill-evals/run`: syntax OK
- `./skill-evals/run --model claude-haiku-4-5-20251001`: 25/25 pass
- `make fmt-check vet test test-e2e check-naming check-surface provenance-check tidy-check`: all pass
- `ANTHROPIC_API_KEY` gating: `./skill-evals/run` without key → clean abort
Summary by cubic
Adds a new `skill-evals/` harness to verify `skills/basecamp/SKILL.md` teaches correct `basecamp` CLI usage, plus a CI gate that runs these evals on PRs touching the skill or eval files.
New Features
- `skill-evals/` Ruby runner intercepts Anthropic `tool_use`, mocks responses, and grades via `accept`, `reject`, `expect_sequence`, and `max_commands`. Supports `--model`, `--skill`, `--tag`, `--samples`, `--save`, `--compare`, `--json`, `--verbose`, and aborts on missing values for all flags.
- `create-doc` case tightened to require the `doc` subcommand and body text.
- `make skill-eval` (default model `claude-sonnet-4-20250514`); in `skill-evals`: `eval`, `eval-save`, `eval-compare`.
- CI gate runs when `skills/basecamp/SKILL.md` or `skill-evals/**` change; warns and skips if `ANTHROPIC_API_KEY` is missing.
- `SKILL.md`: documents `reports overdue` and `assign`/`unassign`; examples include `--in <project>`; clarifies cross-project exceptions.
Migration
- Requires `ANTHROPIC_API_KEY` and Ruby 3.3.
- Run `make skill-eval` (override with `MODEL=<id>`); save/compare baselines via `make -C skill-evals eval-save NAME=<n>` and `eval-compare`.
Written for commit fefab4a. Summary will update on new commits.