A validator-first skill library and external eval harness for Poolside's models.
Use it to make skills gradeable, measure them with pool, and optimize their
instructions with GEPA without changing the grader.
In this repo, a skill is a contract: SKILL.md prose, a deterministic output
schema, an executable validator, eval cases including an adversarial case, and
run evidence. Prompt-pack-only skills do not merge. The short form of the loop
is check -> eval -> optimize: run local contract checks, replay eval cases,
then search for better skill instructions without changing the grader. Once
credentials are available, replace smoke and dry-run commands with the live
per-skill suite and GEPA run, then review the proposal before accepting
anything.
The worked example is ci-log-reducer: its validation score moved from 0.694 to 0.837 to 0.939 across two GEPA rounds. Those numbers are internal and directional only, not publishable lift claims; see docs/eval-methodology.md section 7.
New here? docs/getting-started.md walks the full loop, and docs/concepts.md defines the vocabulary, including Laguna, arms, gold replay, and GEPA.
From the repo root, each of these commands should run and print a version:
uv --version
bun --version
pool --version
To hand the repo to an agent for the fastest live success path, tell it:
Read AGENTS.md, then follow docs/prompts/first-success-pool-run.md.
Use ci-log-reducer unless I name a different skill.
For a local readiness pass that does not call a model:
bun ui/bench.ts doctor
bun ui/bench.ts capabilities
uv run scripts/check_skill_structure.py
uv run scripts/check_schemas.py
uv run scripts/check_validator_robustness.py
uv run scripts/check_eval_cases.py
uv run harness/runner/run_eval.py --suite evals/suites/smoke.json --dry-run --replay
uv run harness/optimize/gepa_skill.py --skill ci-log-reducer --smoke
Known-good versions used while checking these docs: Python 3.11+, uv 0.11.21,
bun 1.3.14, and pool 1.0.5. uv run ... reads pyproject.toml; this repo
is configured as a script-only project with package = false, so there is no
package install step.
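The readiness pass listed above can be scripted so that it stops at the first failing command. This is a minimal sketch, assuming it is run from the repo root with uv and bun on PATH; `run_readiness` and its `runner` parameter are hypothetical names used for illustration and are not part of the repo.

```python
import shlex
import subprocess
from typing import Callable, Optional, Sequence

# Commands copied from the local readiness pass in this README; none calls a model.
READINESS_COMMANDS = [
    "bun ui/bench.ts doctor",
    "bun ui/bench.ts capabilities",
    "uv run scripts/check_skill_structure.py",
    "uv run scripts/check_schemas.py",
    "uv run scripts/check_validator_robustness.py",
    "uv run scripts/check_eval_cases.py",
    "uv run harness/runner/run_eval.py --suite evals/suites/smoke.json --dry-run --replay",
    "uv run harness/optimize/gepa_skill.py --skill ci-log-reducer --smoke",
]


def _default_runner(argv: Sequence[str]) -> int:
    """Run one command and return its exit code."""
    return subprocess.run(list(argv)).returncode


def run_readiness(
    runner: Callable[[Sequence[str]], int] = _default_runner,
) -> Optional[str]:
    """Run each command in order; return the first failing command, or None."""
    for command in READINESS_COMMANDS:
        if runner(shlex.split(command)) != 0:
            return command
    return None
```

Calling `run_readiness()` with no arguments executes the real commands; passing a custom `runner` makes the ordering logic testable without bun or uv installed.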
- A skill library for reusable Laguna behaviors, including CI log reduction, task scoping, and repo mapping.
- An eval harness that compares model runs with and without a skill, using deterministic workspace artifacts rather than subjective judgment.
- GEPA-based skill optimization that rewrites selected skill authoring components and scores candidates against frozen validators.
- Eval-case generation for growing the test corpus, with generated cases quarantined until human review.
- A local workbench for the skills catalog, workflow catalog, eval runs, node-level grades, optimization runs, proposals, and trace review.
- Smithers workflow experiments where Pool executes workflow nodes and skills can be installed per node.
A publishable skill has:
- SKILL.md with clear trigger and boundary instructions.
- A JSON output schema in schemas/.
- A validator in scripts/validate_*.ts.
- At least three eval cases under skills/<name>/evals/<case-id>/, including one adversarial case.
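The layout requirements above can be approximated with a few filesystem checks. This is a rough sketch only, not the logic of scripts/check_skill_structure.py: `missing_contract_parts` is a hypothetical helper, it assumes schema files end in `.json`, and it does not try to detect the adversarial case because this README does not say how one is marked.

```python
from pathlib import Path
from typing import List


def _has_match(directory: Path, pattern: str) -> bool:
    """True when the directory exists and contains at least one matching file."""
    return directory.is_dir() and any(directory.glob(pattern))


def missing_contract_parts(skill_dir: Path) -> List[str]:
    """Return the layout gaps for one skill directory; empty means none found."""
    gaps = []
    if not (skill_dir / "SKILL.md").is_file():
        gaps.append("SKILL.md")
    # Assumption: output schemas are stored as *.json files.
    if not _has_match(skill_dir / "schemas", "*.json"):
        gaps.append("JSON output schema in schemas/")
    if not _has_match(skill_dir / "scripts", "validate_*.ts"):
        gaps.append("validator in scripts/validate_*.ts")
    evals = skill_dir / "evals"
    case_dirs = [p for p in evals.iterdir() if p.is_dir()] if evals.is_dir() else []
    if len(case_dirs) < 3:
        gaps.append("at least three eval cases under evals/")
    return gaps
```

For example, `missing_contract_parts(Path("skills/ci-log-reducer"))` would return an empty list if all four layout pieces are present; the repo check scripts remain the authority on whether a skill actually passes.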
skill-generate can draft a structure-valid skill, but generated drafts are not publishable until they pass the eval-case gates.
Publish-ready skills:
- ci-log-reducer reduces a failing CI or test log to .laguna/ci-log-summary.json.
- laguna-task-contract turns a broad engineering request into a bounded worker or router contract.
- repo-map writes .laguna/repo-map.json, an evidence-backed map of a repository.
Skills with rough evals:
- bead-selector writes .laguna/bead-selection.json for optional Beads selection workflows; see docs/beads.md for the checkout boundary.
- workspace-inventory writes .laguna/workspace-inventory.json. The dedicated suite at evals/suites/skill-workspace-inventory.json covers six cases: flat workspaces, nested Python and Rust workspaces, and two adversarial "good-failure" cases (.laguna listed in entries, shallow-only counts on a Go monorepo). The validator enforces schema, entries-match-tree, lexicographic sorting of entries[], recursive directory file counts, and total_files. Eval numbers are internal/directional.
Experimental imports:
- ce-plan is an imported prompt-style planning skill with a synthetic bootstrap contract and a 12-case experimental plan-quality corpus. It is committed as Pass 6 evidence, not as a reviewed publishable Laguna skill.
Plan of record: docs/plans/laguna-skills-v0-2026-06-10.md.
doctor reports tool availability plus basic skill-contract, eval-suite, and WIP
coverage checks without starting the web server. bench.ts writes JSON to
stdout on success and JSON to stderr on errors; use
bun ui/bench.ts help <command> or bun ui/bench.ts commands for the current
CLI contract.
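Because bench.ts reports through JSON on either stream, a caller can pick the stream to parse from the outcome. The wrapper below is a sketch under one assumption that this README does not state: that success corresponds to exit code 0. `run_json_cli` is a hypothetical name.

```python
import json
import subprocess
from typing import Any, Sequence, Tuple


def run_json_cli(argv: Sequence[str]) -> Tuple[bool, Any]:
    """Run a CLI that prints JSON to stdout on success and to stderr on error."""
    proc = subprocess.run(list(argv), capture_output=True, text=True)
    ok = proc.returncode == 0  # assumption: exit code 0 means success
    stream = proc.stdout if ok else proc.stderr
    return ok, json.loads(stream)
```

A call such as `run_json_cli(["bun", "ui/bench.ts", "doctor"])` would then return a success flag plus the parsed payload; `json.loads` raises if the chosen stream is not valid JSON, which surfaces contract drift early.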
Run the local workbench when you want the browser UI:
bun ui/server.ts # http://127.0.0.1:4319/workflows.html
bun ui/bench.ts help # JSON help for the agent CLI
Workbench details live in ui/README.md.
Smithers is installed for this repo as a root workflow pack under .smithers/.
Agents should use it for durable multi-step, long-running, approval-gated, or
parallel work:
bunx smithers-orchestrator workflow doctor --format md
bunx smithers-orchestrator workflow list --format md
bunx smithers-orchestrator starters --format md
The project-scoped Smithers command skills live under .agents/skills/, with
detected-agent symlink mirrors under .claude/skills/, .goose/skills/, and
.openhands/skills/. The MCP registration is .mcp.json. Details and the
PoolAgent experiment path are in docs/smithers.md.
Run the checks that should be green now:
uv run scripts/check_skill_structure.py
uv run scripts/check_schemas.py
uv run scripts/check_validator_robustness.py
Run eval-case validation when you are working on case coverage:
uv run scripts/check_eval_cases.py
Current state: check_eval_cases.py is expected to pass for the v0 bundle:
ci-log-reducer, laguna-task-contract, repo-map, bead-selector, and
workspace-inventory all carry the required minimum cases (including
adversarial cases). It will fail for any future WIP skill that lacks coverage.
Repo check scripts exit 0 when checks pass, 1 for check violations, and 2 for argument
or usage errors. Use --json when another tool needs a repo-check-result.v1 payload on
stdout. The payload includes schema_version, tool, status, counts,
violation_count, and violations[] entries with path, check, and message.
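A tool consuming the --json output might combine the exit code with the payload fields named above. This is a minimal sketch: `summarize_check_result` is a hypothetical helper, and it skips parsing for exit code 2 because the payload shape on usage errors is not described here.

```python
import json
from typing import List

# Exit codes documented for the repo check scripts.
EXIT_MEANINGS = {
    0: "checks passed",
    1: "check violations",
    2: "argument or usage error",
}


def summarize_check_result(exit_code: int, stdout_text: str) -> List[str]:
    """Turn an exit code plus a --json payload into short report lines."""
    lines = [EXIT_MEANINGS.get(exit_code, f"unexpected exit code {exit_code}")]
    if exit_code not in (0, 1):
        # Assumption: only pass/violation runs are known to emit a payload.
        return lines
    payload = json.loads(stdout_text)
    lines.append(f"{payload['tool']}: {payload['violation_count']} violation(s)")
    for violation in payload["violations"]:
        lines.append(
            f"{violation['path']} [{violation['check']}] {violation['message']}"
        )
    return lines
```

Keying the report on the exit code first means a usage error is never mistaken for a clean run with zero violations.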
Use this when the source skill lives outside this repo or has no eval corpus yet:
bun ui/bench.ts eval-case-generate --skill /path/to/external-skill --no-lm-skeleton
bun ui/bench.ts eval-case-generate --skill /path/to/external-skill --n 3
bun ui/bench.ts eval-case-generate --skill <name-or-path> --validate-only runs/generate/<name>/<stamp>/candidates/<case-id>
bun ui/bench.ts eval-case-generate --skill <name-or-path> --promote runs/generate/<name>/<stamp>/candidates/<case-id>
--skill accepts a repo skill name, an external skill directory, or a
SKILL.md path. Path mode imports the full skill directory into
skills/<name> when the repo copy is missing. Prompt-style skills missing
Laguna contracts get a synthetic bootstrap schema and validator so the first
cases can be reviewed mechanically. Treat that synthetic contract as a starter
scaffold only; build reviewed functional cases before reading GEPA results as
skill-performance evidence.
Generated cases stay quarantined under runs/generate/ until --promote copies
them into skills/<skill>/evals/ and updates the per-skill suite.
Full details: docs/external-skill-bootstrap.md.
Dry run validates fixtures, materialization, manifest shape, and gold replay
without calling pool:
uv run harness/runner/run_eval.py --suite evals/suites/smoke.json --dry-run --replay
Live runs require Poolside CLI auth through POOLSIDE_TOKEN or
~/.config/poolside/credentials.json:
bun ui/bench.ts eval-run --suite evals/suites/smoke.json --arm xs_with_skill
bun ui/bench.ts eval-runs
Run outputs land under runs/<suite>/<case>/<arm>/. Eval numbers are internal
and directional; do not publish lift claims from them.
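Before a live run it can help to confirm that one of the two documented credential sources is present. The sketch below only checks presence, not validity; `poolside_auth_source` is a hypothetical helper, and checking the environment variable before the file is an assumption about precedence, not documented pool behavior.

```python
import os
from pathlib import Path
from typing import Mapping, Optional


def poolside_auth_source(
    env: Mapping[str, str] = os.environ,
    home: Optional[Path] = None,
) -> Optional[str]:
    """Return which documented credential source is present, or None."""
    if env.get("POOLSIDE_TOKEN"):
        return "POOLSIDE_TOKEN"
    credentials = (home or Path.home()) / ".config" / "poolside" / "credentials.json"
    if credentials.is_file():
        return str(credentials)
    return None
```

If the function returns `None`, stay on the dry-run and smoke commands until credentials are configured.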
GEPA mutates selected skill authoring components and grades candidates against
frozen eval cases, schemas, and validators. By default the mutable component is
SKILL.md; --components references adds references/**. For large imported
prompt skills, prefer a small reference/supplement component over full-SKILL.md
mutation so the optimizer has a narrow target.
Provider-backed reflection uses LiteLLM environment keys such as
OPENROUTER_API_KEY or ANTHROPIC_API_KEY; --reflection-pool-agent uses the
authenticated pool model-selector path instead.
uv run harness/optimize/gepa_skill.py --skill <name> --smoke
uv run harness/optimize/gepa_skill.py --skill <name> --baseline-only
uv run harness/optimize/gepa_skill.py --skill <name> --max-metric-calls 60
uv run harness/optimize/gepa_skill.py --skill <name> \
  --reflection-lm openrouter/openai/gpt-5.4 \
  --reflection-reasoning-effort medium
uv run harness/optimize/gepa_skill.py --skill <name> \
  --reflection-pool-agent anthropic/claude-4.5-sonnet
uv run harness/optimize/gepa_skill.py --skill <name> \
  --max-candidate-bytes-over-seed 2500 \
  --reject-broad-artifact-overrides
Gate failures score zero before any pool spend. Outputs land under
runs/optimize/<skill>/<stamp>/; promotion is manual through diff review or:
bun ui/bench.ts optimize-propose --skill <name> --run-dir runs/optimize/<name>/<stamp>
Full details: docs/gepa-optimization.md.
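The idea of gating a candidate before any model spend can be illustrated with a size budget like the one --max-candidate-bytes-over-seed describes. This is an illustration of the pattern, not the implementation in gepa_skill.py: `gated_score` is a hypothetical function, and measuring size as UTF-8 bytes with a strict "greater than" comparison is an assumption.

```python
from typing import Callable


def gated_score(
    seed_text: str,
    candidate_text: str,
    max_bytes_over_seed: int,
    evaluate: Callable[[str], float],
) -> float:
    """Score a candidate, skipping evaluation when it outgrows the seed budget."""
    # Assumption: size is measured in UTF-8 bytes relative to the seed text.
    growth = len(candidate_text.encode("utf-8")) - len(seed_text.encode("utf-8"))
    if growth > max_bytes_over_seed:
        return 0.0  # gate failure: evaluate() is never called, so nothing is spent
    return evaluate(candidate_text)
```

Keeping the gate in front of `evaluate` is what makes a rejected candidate free: the expensive call only happens for candidates that already satisfy the cheap structural checks.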
The standalone review app flattens run directories into traces and serves a local annotation UI:
uv run harness/review/extract_traces.py
uv run harness/review/serve.py # http://127.0.0.1:8765
Use --demo for synthetic traces. Optional LLM judging is a reading aid, not a
metric, and requires OPENROUTER_API_KEY.
skills/            # publishable skill sources
  <name>/          # SKILL.md, schemas/, scripts/, evals/<case-id>/
  _shared/         # shared validator-result helper for TypeScript validators
schemas/common/    # shared JSON schemas
evals/suites/      # suite definitions
harness/           # Python eval runner and review tools
scripts/           # repo checks and install helper
docs/              # docs index, getting started, authoring guide, eval method, plans, spikes
ui/                # local workbench
plans/             # workbench redesign implementation plans (done; see plans/README.md)
experiments/       # spikes with their own setup, e.g. smithers-pool
.resources/        # design handoff, investigations, and decision-register source material
runs/              # eval and review output, gitignored
- docs/README.md: documentation index, organized by audience.
- docs/getting-started.md: first-session walkthrough, offline steps first.
- docs/concepts.md: glossary and the offline-vs-credentials command matrix.
- docs/authoring-guide.md: binding skill authoring rules.
- evals/README.md: case folder format and gold replay.
- schemas/common/README.md: shared schema contracts.
- docs/eval-methodology.md: arm matrix, isolation, metrics, and reporting policy.
index.html, skill.html, workflows.html, and styles.css are prototype pages. index.html and skill.html are static catalog mockups; some cards and metrics are illustrative. workflows.html is the workbench shell and needs bun ui/server.ts for live data.