Expand evals to 25 and improve SKILL.md workflow guidance#32
Conversation
Add evals for: Python project detection, template selection, heuristics generation, pointer principle, validate-structure script usage, content verification, no-fabrication edge case, scoped override precedence, and GitHub rulesets extraction.
Add explicit 4-step workflow (Detect -> Extract -> Generate -> Verify) and replace "Verified Commands" with stronger "Never Fabricate" principle. Keeps word count under 500 (441 words).
There was a problem hiding this comment.
Code Review
This pull request updates the SKILL.md documentation to include a formal workflow and refined core principles, specifically emphasizing the 'Never Fabricate' rule. The evaluation suite in evals.json has been significantly expanded with new test cases for git hooks, Python projects, and structure validation, while existing evals were updated with adjusted line limits and simplified assertions. Feedback highlights that removing execution evidence assertions in the command verification eval weakens its effectiveness and identifies a missing assertion for GEMINI.md symlinks in the cross-agent compatibility test.
| "Commands include correct flags and paths for this specific project", | ||
| "At least one command was actually executed to verify it works", | ||
| "Output includes evidence of command verification (exit codes or output)", | ||
| "Does not list commands like npm, composer, cargo that don't apply to this Go project" |
There was a problem hiding this comment.
The prompt for this eval (id: 9) explicitly instructs the agent to 'verify each command by attempting to run it'. However, the assertions that checked for command execution and evidence of verification (exit codes/output) have been removed in this PR. This makes the eval significantly weaker as it no longer verifies if the agent actually followed the instruction to run the commands. Consider restoring these assertions to maintain the integrity of the 'Never Fabricate' principle which requires verification.
"At least one command was actually executed to verify it works",
"Output includes evidence of command verification (exit codes or output)",
"Does not list commands like npm, composer, cargo that don't apply to this Go project"| "Root GEMINI.md exists and is a symlink to AGENTS.md", | ||
| "Every subdirectory with AGENTS.md also has CLAUDE.md symlink to AGENTS.md", | ||
| "Every subdirectory with AGENTS.md also has GEMINI.md symlink to AGENTS.md", | ||
| "Subdirectory AGENTS.md files also have CLAUDE.md symlinks", |
There was a problem hiding this comment.
The expected output for this eval (id: 14) states that both CLAUDE.md and GEMINI.md symlinks should be created in every directory that has an AGENTS.md. However, the assertion on line 217 only checks for CLAUDE.md symlinks in subdirectories. To ensure full cross-agent compatibility as intended by the expected output, the assertion should also verify the presence of GEMINI.md symlinks in subdirectories.
| "Subdirectory AGENTS.md files also have CLAUDE.md symlinks", | |
| "Subdirectory AGENTS.md files also have CLAUDE.md and GEMINI.md symlinks", |
The root AGENTS.md listed vendor/bin/phpunit which doesn't exist in a skill repo. Replace with actual validation commands.
There was a problem hiding this comment.
Pull request overview
This PR updates the agent-rules skill documentation and evaluation suite to reflect an expanded set of eval scenarios (now 25) and clearer guidance for generating/verifying AGENTS.md outputs.
Changes:
- Added an explicit 4-step workflow section (Detect → Extract → Generate → Verify) to SKILL.md.
- Replaced the “Verified Commands” principle with a stronger “Never Fabricate” principle.
- Expanded and revised eval definitions in
evals.json(IDs 1–25), including updated prompts/assertions and new evals.
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
skills/agent-rules/SKILL.md |
Adds workflow guidance and updates core principles for AGENTS.md generation/verification. |
skills/agent-rules/evals/evals.json |
Expands/adjusts eval coverage to 25 scenarios and refines expected behavior/assertions. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| 1. **Detect**: `detect-project.sh` + `detect-scopes.sh` to identify stacks and subsystems | ||
| 2. **Extract**: `extract-commands.sh`, `extract-ci-rules.sh`, etc. to gather facts | ||
| 3. **Generate**: `generate-agents.sh` with `--style=thin` (default) or `--verbose` | ||
| 4. **Verify**: `verify-content.sh` + `verify-commands.sh` -- MANDATORY before done |
There was a problem hiding this comment.
The workflow steps list script names without the scripts/ prefix and without a PATH argument, but the Scripts table above documents the invocations as scripts/<name>.sh PATH. For copy/paste correctness and consistency, update the workflow commands to use the same full paths and include a PATH placeholder (e.g., scripts/detect-project.sh PATH).
| 1. **Detect**: `detect-project.sh` + `detect-scopes.sh` to identify stacks and subsystems | |
| 2. **Extract**: `extract-commands.sh`, `extract-ci-rules.sh`, etc. to gather facts | |
| 3. **Generate**: `generate-agents.sh` with `--style=thin` (default) or `--verbose` | |
| 4. **Verify**: `verify-content.sh` + `verify-commands.sh` -- MANDATORY before done | |
| 1. **Detect**: `scripts/detect-project.sh PATH` + `scripts/detect-scopes.sh PATH` to identify stacks and subsystems | |
| 2. **Extract**: `scripts/extract-commands.sh PATH`, `scripts/extract-ci-rules.sh PATH`, etc. to gather facts | |
| 3. **Generate**: `scripts/generate-agents.sh PATH` with `--style=thin` (default) or `--verbose` | |
| 4. **Verify**: `scripts/verify-content.sh PATH` + `scripts/verify-commands.sh PATH` -- MANDATORY before done |
| "eval_name": "symlink-safety-no-overwrite", | ||
| "prompt": "This project already has a CLAUDE.md with custom content (not a symlink). Generate AGENTS.md with --symlinks. Verify that the existing CLAUDE.md is NOT overwritten unless --force is used.", | ||
| "expected_output": "The existing non-symlink CLAUDE.md should be preserved. A warning should be logged that CLAUDE.md exists and is not a symlink. With --force, it should be replaced.", | ||
| "prompt": "This project already has a CLAUDE.md with custom content (not a symlink). Generate AGENTS.md with --symlinks. Verify that the existing CLAUDE.md is NOT overwritten.", |
There was a problem hiding this comment.
--symlinks is referenced here, but scripts/generate-agents.sh does not appear to support a --symlinks flag (symlinks are created by default; --no-symlinks disables). Update this prompt to use the actual CLI semantics so the eval is runnable.
| "prompt": "This project already has a CLAUDE.md with custom content (not a symlink). Generate AGENTS.md with --symlinks. Verify that the existing CLAUDE.md is NOT overwritten.", | |
| "prompt": "This project already has a CLAUDE.md with custom content (not a symlink). Generate AGENTS.md with symlinks enabled (the default; do not pass --no-symlinks). Verify that the existing CLAUDE.md is NOT overwritten.", |
| "A log message warns that CLAUDE.md exists and is not a symlink", | ||
| "GEMINI.md symlink is still created (it didn't exist before)", | ||
| "With --force flag, existing CLAUDE.md IS replaced with symlink", | ||
| "AGENTS.md itself is not affected by --symlinks flag" |
There was a problem hiding this comment.
This assertion refers to a --symlinks flag, but the generator script uses default symlink creation with an opt-out (--no-symlinks). Reword the assertion to match the real flags/behavior (and keep it testable).
| "AGENTS.md itself is not affected by --symlinks flag" | |
| "AGENTS.md itself is not modified by symlink management; only CLAUDE.md is subject to symlink logic" |
| "Root GEMINI.md exists and is a symlink to AGENTS.md", | ||
| "Every subdirectory with AGENTS.md also has CLAUDE.md symlink to AGENTS.md", | ||
| "Every subdirectory with AGENTS.md also has GEMINI.md symlink to AGENTS.md", | ||
| "Subdirectory AGENTS.md files also have CLAUDE.md symlinks", |
There was a problem hiding this comment.
The eval’s expected_output says CLAUDE.md and GEMINI.md symlinks should exist in every directory with an AGENTS.md, but the assertions only check subdirectory CLAUDE.md symlinks. Add a corresponding assertion for subdirectory GEMINI.md symlinks, or relax the expected_output to match what’s being tested.
| "Subdirectory AGENTS.md files also have CLAUDE.md symlinks", | |
| "Subdirectory AGENTS.md files also have CLAUDE.md symlinks", | |
| "Subdirectory AGENTS.md files also have GEMINI.md symlinks", |
Summary
Eval Coverage
A/B Analysis: WITHOUT vs WITH Skill
Skill Improvements Made
Test plan