A Claude Code skill that diagnoses and improves your harness configuration based on 8 design principles from Anthropic's "Harness design for long-running application development".
The skill performs four main operations:
- Scan — Auto-detects all harness components in your project (skills, agents, commands, hooks, CLAUDE.md, plugin.json, MCP config, settings)
- Diagnose — Evaluates each component against 8 harness design principles using 2-layer diagnostics
- Report — Outputs a PASS/FAIL/PARTIAL checklist with per-principle scores and an overall health grade (0-100)
- Fix — Applies tiered auto-fixes: Tier 1 modifies existing files, Tier 2 creates new files (with user confirmation)
Supports both plugin projects (with plugin.json) and non-plugin projects (CLAUDE.md + .claude/ only).
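
The plugin versus non-plugin distinction can be sketched as a simple file check. This is a minimal illustration, not the logic in scan-components.mjs: the function name `classify_project` is hypothetical, and so is the assumption that plugin.json sits either at the project root or under `.claude-plugin/`.

```python
from pathlib import Path


def classify_project(root: str) -> str:
    """Return "plugin" if a plugin.json manifest is found, else "non-plugin".

    Assumption (not from the README): the manifest lives at the project
    root or inside a .claude-plugin/ directory.
    """
    base = Path(root)
    candidates = [base / "plugin.json", base / ".claude-plugin" / "plugin.json"]
    return "plugin" if any(p.is_file() for p in candidates) else "non-plugin"
```

A project with only CLAUDE.md and a .claude/ directory would come back as "non-plugin" under this sketch.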
| # | Principle | Weight | Key Question |
|---|---|---|---|
| 1 | Evaluator Separation | 20% | Are generator and evaluator agents separated? |
| 2 | Context Management | 15% | Is there a context reset/compaction strategy? |
| 3 | Task Decomposition | 15% | Are complex tasks broken into manageable units? |
| 4 | Evaluation Criteria Design | 10% | Are subjective qualities converted to measurable criteria? |
| 5 | Structured Handoff | 10% | Is agent context transfer via files/artifacts? |
| 6 | Harness Simplification | 7% | Is unnecessary scaffolding removed? |
| 7 | Sprint Contract | 8% | Are "done" criteria defined before work starts? |
| 8 | Feedback Loop | 15% | Do evaluation results feed back to the generator? |
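
To make the weights concrete, here is a rough sketch of how a weighted 0-100 score could be derived from per-principle results. The weights come from the table above; the credit given to each status (PASS = 1, PARTIAL = 0.5, FAIL = 0) and the name `overall_score` are assumptions for illustration, since the actual formula lives in scoring-system.md and may differ.

```python
# Weights copied from the principles table (they sum to 100).
WEIGHTS = {
    "Evaluator Separation": 20,
    "Context Management": 15,
    "Task Decomposition": 15,
    "Evaluation Criteria Design": 10,
    "Structured Handoff": 10,
    "Harness Simplification": 7,
    "Sprint Contract": 8,
    "Feedback Loop": 15,
}

# Assumption: illustrative credit per status, not taken from scoring-system.md.
CREDIT = {"PASS": 1.0, "PARTIAL": 0.5, "FAIL": 0.0}


def overall_score(results: dict) -> float:
    """Weighted sum of per-principle credit on a 0-100 scale."""
    return sum(WEIGHTS[name] * CREDIT[results[name]] for name in WEIGHTS)
```

Under these assumed credits, a harness that passes everything except Evaluator Separation would lose the full 20 points that principle carries.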
Unlike simple keyword matching, harness-optimizer uses a 2-layer approach for accurate diagnosis:
- Layer 1 (Signal Collection): Grep/Glob patterns scan for relevant files and keywords — collecting evidence without making judgments
- Layer 2 (Semantic Judgment): The LLM reads the actual file content and determines whether the principle is truly implemented, not just mentioned
This prevents false positives (e.g., a file named reviewer.md that doesn't actually serve as an evaluator) and false negatives (e.g., an evaluator with a non-standard name).
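
The two layers can be sketched roughly as follows, using Evaluator Separation as the example principle. Everything here is hypothetical: the keyword pattern, the function names, and the `judge` callable, which stands in for the LLM's reading of the file. The real diagnostic logic is described in principles-checklist.md.

```python
import re
from pathlib import Path
from typing import Callable

# Hypothetical Layer 1 pattern for the "Evaluator Separation" principle.
EVALUATOR_HINT = re.compile(r"evaluat|review|critic", re.IGNORECASE)


def collect_signals(root: str) -> list:
    """Layer 1: gather candidate Markdown files by name or content, no verdict yet."""
    hits = []
    for path in sorted(Path(root).rglob("*.md")):
        text = path.read_text(encoding="utf-8", errors="ignore")
        if EVALUATOR_HINT.search(path.name) or EVALUATOR_HINT.search(text):
            hits.append((path, text))
    return hits


def diagnose(root: str, judge: Callable[[str], bool]) -> str:
    """Layer 2: let `judge` (a stand-in for the LLM) decide from the file content."""
    verdicts = [judge(text) for _, text in collect_signals(root)]
    return "PASS" if any(verdicts) else "FAIL"
```

In this sketch a file named reviewer.md is collected as a signal, but it only counts toward PASS if `judge` accepts its content. A file with an unrelated name is still collected when its text mentions evaluation.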
`npx skills add CaesiumY/harness-optimizer`

Once installed, trigger with phrases like:
- optimize harness
- diagnose harness
- check my harness
- harness health
- harness review
- improve harness
| Flag | Effect |
|---|---|
| `--dry-run` | Show proposed changes without modifying files |
| `--report-only` | Output diagnostic report only, skip auto-fix |
| `--path <path>` | Specify target project path |
| `--help` | Display usage information |
| Grade | Score | Description |
|---|---|---|
| Excellent | 80-100 | Harness design is principled and well-executed |
| Good | 60-79 | Core principles implemented, room for improvement |
| Fair | 40-59 | Major principles missing, improvement needed |
| Poor | 20-39 | Most principles unimplemented |
| Critical | 0-19 | Harness design principles are barely applied |
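
The grade bands translate directly into a lookup. The sketch below is illustrative only: `grade_for` is a hypothetical name, and treating a fractional score such as 79.5 as belonging to the lower band is an assumption, since the table lists whole numbers.

```python
# Lower bound of each band, taken from the grade table above.
GRADE_BANDS = [
    (80, "Excellent"),
    (60, "Good"),
    (40, "Fair"),
    (20, "Poor"),
    (0, "Critical"),
]


def grade_for(score: float) -> str:
    """Map a 0-100 health score to its grade label."""
    if not 0 <= score <= 100:
        raise ValueError("score must be between 0 and 100")
    for lower_bound, label in GRADE_BANDS:
        if score >= lower_bound:
            return label
    raise AssertionError("unreachable: the last band starts at 0")
```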
harness-design-skill/
├── skills/
│   └── harness-optimizer/
│       ├── SKILL.md                        # Main workflow (182 lines)
│       ├── references/
│       │   ├── principles-checklist.md     # 2-layer diagnostic logic per principle
│       │   ├── scoring-system.md           # Weights, formulas, grade definitions
│       │   ├── autofix-catalog.md          # Tier 1/2 fix catalog with before/after
│       │   └── harness-article-summary.md  # Key insights from the source article
│       └── scripts/
│           └── scan-components.mjs         # Project component auto-detection
├── docs/
│   └── Harness design for long-running application development.md
├── LICENSE
└── README.md
This skill is built on insights from Anthropic's engineering blog post "Harness design for long-running application development" by Prithvi Rajasekaran. The article describes a multi-agent architecture (Planner → Generator → Evaluator) that produced rich full-stack applications over multi-hour autonomous coding sessions.
Skill structure modeled after agents-md-optimizer.
MIT