adaptive_skill evolves an agent's skill library by analyzing trajectories from a batch of tasks — without ever seeing ground-truth labels, test results, or pass/fail signals during evolution.
Most evolution algorithms rely on a labeled evaluation signal: the agent solves tasks, a judge tells it which answers were right, and the evolver learns from that feedback. Adaptive Skill removes the label dependency entirely. The evolver only sees what the agent did (commands, errors, loops, outputs), never whether it succeeded. An LLM judge estimates success from behavior, and the evolver uses those proxy signals to decide which skills to create or refine.
This makes Adaptive Skill suitable for domains where:
- Ground-truth evaluation is expensive, slow, or unavailable during evolution.
- The agent operates in open-ended environments (terminal, CLI, system administration).
- You want to evolve continuously on live traffic without waiting for labeled feedback.
flowchart TD
A["Batch of Tasks"] --> B["Agent Solves Tasks<br/>(black-box execution)"]
B --> C["Collect Trajectories<br/>(tool calls, outputs, errors)"]
C --> D{"trajectory_only<br/>mode?"}
D -- "Yes" --> E["Trajectory Signal Extraction<br/>(errors, loops, timeouts, commands)"]
D -- "No" --> F["Standard Feedback<br/>(pass/fail, score)"]
E --> G["LLM Judge Scoring<br/>(0-10 per trajectory)"]
G --> H["Build Evolution Prompt<br/>(signals + verdicts + current skills)"]
F --> H
H --> I["Evolver LLM<br/>(meta-learning agent with bash access)"]
I --> J{"Skill Budget<br/>Exceeded?"}
J -- "No" --> K["Create New Skills<br/>from failure patterns"]
J -- "Yes" --> L["Refine Existing Skills<br/>with new patterns"]
K --> M["Gating Validation<br/>(holdout tasks)"]
L --> M
M --> N{"Accepted?"}
N -- "Yes" --> O["Commit Mutation<br/>(git tag evo-N)"]
N -- "No" --> P["Rollback via git"]
O --> Q["Agent Reloads<br/>Workspace"]
P --> Q
Q --> A
The agent processes a batch of tasks in a black-box manner. Each task produces a trajectory: the full sequence of tool calls, commands, outputs, and errors the agent encountered.
Instead of collecting labeled feedback, the algorithm extracts behavioral signals from each trajectory:
flowchart LR
T["Raw Trajectory"] --> S1["Signal Extraction"]
S1 --> S2["n_turns, n_tool_calls"]
S1 --> S3["n_errors, n_timeouts"]
S1 --> S4["tools_used frequency"]
S1 --> S5["repeated_commands (loops)"]
S1 --> S6["submitted? submit_value"]
S1 --> S7["error_snippets"]
T --> C1["Trajectory Compression"]
C1 --> C2["First 3 commands (approach)"]
C1 --> C3["All errors + context"]
C1 --> C4["Detected loops (cmd x3+)"]
C1 --> C5["Last 3 commands (resolution)"]
T --> J1["LLM Judge"]
J1 --> J2["score: 0-10"]
J1 --> J3["category: build, debug, ..."]
J1 --> J4["outcome: one sentence"]
J1 --> J5["failure_reason: specific cause"]
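The signal extraction and compression branches above can be sketched as follows. This is a minimal illustration, not the code in prompts.py: the step schema (dicts with `tool`, `command`, `error`, `timed_out` keys), the `submit` tool name, and the function names are all assumptions made for the example.

```python
from collections import Counter
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class TrajectorySignals:
    n_turns: int = 0
    n_tool_calls: int = 0
    n_errors: int = 0
    n_timeouts: int = 0
    tools_used: dict = field(default_factory=dict)
    repeated_commands: dict = field(default_factory=dict)
    submitted: bool = False
    submit_value: Optional[str] = None
    error_snippets: list = field(default_factory=list)


def extract_signals(steps, loop_threshold=3, snippet_chars=200):
    """Summarize a trajectory given as a list of step dicts (hypothetical schema)."""
    signals = TrajectorySignals(n_turns=len(steps))
    tools, commands = Counter(), Counter()
    for step in steps:
        tool = step.get("tool")
        if tool is None:
            continue  # a turn with no tool call
        signals.n_tool_calls += 1
        tools[tool] += 1
        if tool == "submit":  # assumed name of the submission tool
            signals.submitted = True
            signals.submit_value = step.get("command")
            continue
        if step.get("command"):
            commands[step["command"].strip()] += 1
        if step.get("timed_out"):
            signals.n_timeouts += 1
        if step.get("error"):
            signals.n_errors += 1
            signals.error_snippets.append(step["error"][:snippet_chars])
    signals.tools_used = dict(tools)
    # A command issued loop_threshold or more times counts as a loop.
    signals.repeated_commands = {
        cmd: n for cmd, n in commands.items() if n >= loop_threshold
    }
    return signals


def compress_trajectory(steps, signals, head=3, tail=3):
    """Keep only the opening approach, errors, loops, and closing commands."""
    commands = [
        s["command"].strip()
        for s in steps
        if s.get("command") and s.get("tool") != "submit"
    ]
    return {
        "approach": commands[:head],
        "errors": signals.error_snippets,
        "loops": signals.repeated_commands,
        "resolution": commands[-tail:],
    }
```

Counting exact command strings is the simplest possible loop detector; a real implementation may normalize arguments or whitespace more aggressively.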
The LLM Judge acts as a proxy evaluator. It reads the compressed trajectory and estimates:
| Field | Description |
|---|---|
| `score` (0-10) | 0 = complete failure, 5 = partial progress, 10 = likely solved |
| `category` | Task type (build, debug, data-science, security, system-admin, ...) |
| `outcome` | One-sentence description of what happened |
| `failure_reason` | Specific thing that went wrong (if score < 7) |
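A verdict with these four fields might be parsed and validated along the following lines. The sketch assumes the judge replies with a JSON object; the `JudgeVerdict` class and `parse_verdict` function are hypothetical names, and the actual response format used in prompts.py may differ.

```python
import json
from dataclasses import dataclass
from typing import Optional


@dataclass
class JudgeVerdict:
    score: int
    category: str
    outcome: str
    failure_reason: Optional[str] = None


def parse_verdict(raw: str) -> JudgeVerdict:
    """Parse a judge reply, assumed to be a JSON object, into a verdict."""
    data = json.loads(raw)
    # Clamp the score into the documented 0-10 range.
    score = max(0, min(10, int(data["score"])))
    # failure_reason is only meaningful when score < 7.
    reason = (data.get("failure_reason") or None) if score < 7 else None
    return JudgeVerdict(
        score=score,
        category=str(data.get("category", "unknown")),
        outcome=str(data.get("outcome", "")),
        failure_reason=reason,
    )
```

Clamping and dropping the failure reason for high scores keep downstream sorting and grouping predictable even when the judge output is slightly off-spec.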
The evolver LLM receives all signals and verdicts, plus the current skill library, and decides what to mutate. This is where the skill budget controls growth.
flowchart TD
subgraph Input
V["Judge Verdicts<br/>(score, category, failure_reason)"]
CS["Current Skills<br/>(name, description, content)"]
B["Skill Budget<br/>(max_skills, e.g. 5)"]
end
V --> SORT["Sort by judge score<br/>(lowest first)"]
SORT --> FILTER["Filter: score < 7<br/>(FAILED or PARTIAL)"]
FILTER --> GROUP["Group failures by<br/>category + failure_reason"]
GROUP --> CHECK{"Pattern has<br/>2+ failed tasks?"}
CHECK -- "No" --> SKIP["Skip<br/>(not a pattern)"]
CHECK -- "Yes" --> BUDGET{"current_skills<br/>< max_skills?"}
BUDGET -- "Yes (budget remaining)" --> NEW["Create New Skill<br/>targeting this failure category"]
BUDGET -- "No (budget reached)" --> REFINE["Refine Existing Skill<br/>add new patterns from failures"]
NEW --> QUALITY["Quality Gate:<br/>- kebab-case name<br/>- clear WHEN description<br/>- domain knowledge only<br/>- max 2000 chars<br/>- verification steps"]
REFINE --> QUALITY
QUALITY --> WRITE["Write skill via bash tool<br/>to skills/SKILL.md"]
The skill budget (max_skills, default 5) prevents unbounded skill library growth:
| State | Behavior |
|---|---|
| `current < max_skills` | Evolver may create new skills for uncovered failure categories |
| `current >= max_skills` | No new skills allowed; the evolver must refine existing skills instead |
The budget forces the evolver to produce general, high-coverage skills rather than one-off fixes. As the library fills up, new failure patterns must be folded into existing skills that cover the closest category.
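The sort, filter, group, and budget steps can be illustrated with a small deterministic sketch. In the real system the evolver LLM makes these decisions and can group similar failure reasons loosely; here grouping is by exact match, and `plan_mutations` and its parameters are hypothetical names chosen for the example.

```python
from collections import defaultdict


def plan_mutations(verdicts, n_current_skills, max_skills=5,
                   fail_below=7, min_pattern_size=2):
    """Turn judge verdicts into create/refine actions under a skill budget.

    verdicts: list of dicts with "score", "category", "failure_reason".
    """
    # Keep failed or partial trajectories, lowest score first.
    failures = sorted(
        (v for v in verdicts if v["score"] < fail_below),
        key=lambda v: v["score"],
    )
    # Group by (category, failure_reason); exact match is a simplification.
    groups = defaultdict(list)
    for v in failures:
        groups[(v["category"], v["failure_reason"])].append(v)

    plan = []
    n_skills = n_current_skills
    for (category, reason), members in groups.items():
        if len(members) < min_pattern_size:
            continue  # a single failure is not treated as a pattern
        if n_skills < max_skills:
            action = "create"
            n_skills += 1  # each new skill consumes budget
        else:
            action = "refine"
        plan.append({
            "action": action,
            "category": category,
            "failure_reason": reason,
            "n_tasks": len(members),
        })
    return plan
```

With four existing skills and a budget of five, the first qualifying pattern yields a `create` action and every later one falls back to `refine`.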
Each skill is a SKILL.md file with YAML frontmatter:
---
name: build-legacy-c-projects
description: >
  When building legacy C/C++ projects that fail with missing GUI/X11
  dependencies or outdated Makefiles.
---
## Steps
1. Check for optional GUI dependencies (X11, SDL, ncurses) and disable them
   via configure flags or Makefile edits.
2. ...
## Verification
- `make` completes with exit code 0
- Binary exists in expected output path

Skills are loaded on demand by the agent via read_skill(name): the agent sees skill names and descriptions in its system prompt and decides which to read for a given task.
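The mechanical parts of the quality gate (kebab-case name, a WHEN-style description, a size limit, and verification steps) could be checked with something like the sketch below. `check_skill` is a hypothetical helper; applying the 2000-character limit to the body rather than the whole file is an assumption, and the "domain knowledge only" criterion is left to the evolver because it cannot be checked mechanically.

```python
import re

# Lowercase alphanumeric words joined by single hyphens.
KEBAB_CASE = re.compile(r"^[a-z0-9]+(?:-[a-z0-9]+)*$")


def check_skill(name, description, body, max_chars=2000):
    """Return a list of quality-gate problems; an empty list means it passes."""
    problems = []
    if not KEBAB_CASE.match(name):
        problems.append("name is not kebab-case")
    if not description.strip().lower().startswith("when"):
        problems.append("description does not state WHEN the skill applies")
    if len(body) > max_chars:  # assumption: the limit applies to the body
        problems.append(f"body exceeds {max_chars} characters")
    if "## Verification" not in body:
        problems.append("missing '## Verification' section")
    return problems
```

For the example skill above, `check_skill("build-legacy-c-projects", "When building legacy C/C++ projects ...", body)` returns an empty list as long as the body contains its `## Verification` section.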
After the evolver mutates the workspace, the gating strategy validates the mutation:
flowchart LR
MUT["Mutated Workspace"] --> HOLDOUT["Run Agent on<br/>Holdout Tasks"]
HOLDOUT --> SCORE["Compute avg score"]
SCORE --> THRESH{"avg >= threshold?"}
THRESH -- "Yes" --> ACCEPT["Accept mutation<br/>git tag evo-N"]
THRESH -- "No" --> REJECT["Reject mutation<br/>git rollback"]
Holdout tasks are sampled from the benchmark (default 20% holdout ratio). If the mutated agent regresses on holdout tasks, the entire mutation is rolled back via git.
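A minimal sketch of the holdout split and the accept/reject decision follows. The function names are hypothetical, the source does not state a default acceptance threshold (so it is a required argument here), and the git tag and rollback steps are deliberately left out.

```python
import random


def split_holdout(tasks, holdout_ratio=0.2, seed=0):
    """Shuffle tasks and reserve a fraction of them for gating validation."""
    rng = random.Random(seed)  # fixed seed keeps the split reproducible
    shuffled = list(tasks)
    rng.shuffle(shuffled)
    n_holdout = max(1, round(len(shuffled) * holdout_ratio))
    return shuffled[n_holdout:], shuffled[:n_holdout]


def gate(holdout_scores, threshold):
    """Accept the mutation only if the average holdout score meets the threshold."""
    if not holdout_scores:
        return False  # no evidence, so reject (an assumption of this sketch)
    return sum(holdout_scores) / len(holdout_scores) >= threshold
```

A caller would run the mutated agent on the holdout tasks, pass the resulting scores to `gate`, and then either tag the commit or roll the workspace back depending on the result.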
The agent reloads its workspace (skills, prompts, memory) and the loop repeats. Convergence is tracked via EGL (Evolutionary Generality Loss):
EGL = (new_skills_created / total_tasks_solved) * 1000
When EGL stays below a threshold (default 0.05) for a configurable window (default 3 cycles), the evolution is considered converged — the agent has stabilized and is no longer discovering new failure patterns.
flowchart TD
START(["Start: Base Agent<br/>(0 skills, generic prompt)"]) --> CYCLE["Evolution Cycle N"]
subgraph CYCLE["Evolution Cycle"]
direction TB
SOLVE["1. Solve batch<br/>(agent runs tasks)"]
OBS["2. Observe<br/>(extract trajectory signals,<br/>LLM judge scores)"]
EVOLVE["3. Evolve<br/>(create/refine skills<br/>under budget)"]
GATE["4. Gate<br/>(holdout validation)"]
RELOAD["5. Reload workspace"]
SOLVE --> OBS --> EVOLVE --> GATE --> RELOAD
end
RELOAD --> CONV{"EGL converged<br/>or max_cycles?"}
CONV -- "No" --> CYCLE
CONV -- "Yes" --> END(["End: Evolved Agent<br/>(N targeted skills)"])
Key config fields for Adaptive Skill (set via YAML or EvolveConfig):
| Parameter | Default | Description |
|---|---|---|
| `trajectory_only` | `True` | Only show trajectories to evolver (no labels) |
| `max_skills` | `5` | Skill budget: maximum number of skills allowed |
| `evolve_skills` | `True` | Allow skill creation/modification |
| `evolve_prompts` | `True` | Allow system prompt edits |
| `evolve_memory` | `True` | Allow memory updates |
| `protect_skills` | `False` | If True, existing skills are read-only (only new creation allowed) |
| `solver_proposed` | `False` | If True, the solver agent proposes draft skills for the evolver to generalize |
| `prompt_only` | `False` | If True, only system prompt mutations are allowed (no skills) |
| `batch_size` | `10` | Number of tasks per evolution cycle |
| `holdout_ratio` | `0.2` | Fraction of tasks reserved for gating validation |
| `egl_threshold` | `0.05` | EGL convergence threshold |
| `egl_window` | `3` | Number of consecutive cycles EGL must stay below threshold |
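For reference, the defaults in the table can be mirrored in a small dataclass. `AdaptiveSkillSettings` is a hypothetical stand-in written for illustration only; it is not the library's `EvolveConfig`, whose actual fields and validation may differ, and the range checks are assumptions of this sketch.

```python
from dataclasses import dataclass


@dataclass
class AdaptiveSkillSettings:
    """Hypothetical mirror of the documented defaults (not EvolveConfig)."""
    trajectory_only: bool = True
    max_skills: int = 5
    evolve_skills: bool = True
    evolve_prompts: bool = True
    evolve_memory: bool = True
    protect_skills: bool = False
    solver_proposed: bool = False
    prompt_only: bool = False
    batch_size: int = 10
    holdout_ratio: float = 0.2
    egl_threshold: float = 0.05
    egl_window: int = 3

    def __post_init__(self):
        # Basic sanity checks; these ranges are assumptions of the sketch.
        if self.max_skills < 1:
            raise ValueError("max_skills must be at least 1")
        if self.batch_size < 1:
            raise ValueError("batch_size must be at least 1")
        if not 0.0 <= self.holdout_ratio < 1.0:
            raise ValueError("holdout_ratio must be in [0, 1)")
        if self.egl_window < 1:
            raise ValueError("egl_window must be at least 1")


# Override only what differs from the defaults.
settings = AdaptiveSkillSettings(max_skills=8, batch_size=20)
```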
Why no labels? In open-ended terminal tasks, ground-truth evaluation can be expensive (spinning up Docker environments, running test suites). By judging from trajectories alone, the evolver can run continuously without waiting for evaluation infrastructure.
Why a skill budget? Without a budget, the evolver tends to create narrow, task-specific skills that don't generalize. The budget forces consolidation: the design assumes that five well-crafted category skills serve the agent better than twenty fragmented ones.
Why an LLM judge? The judge provides a structured signal (score + category + failure reason) that the evolver can sort, filter, and group. Raw trajectories are noisy; the judge distills them into actionable patterns.
| File | Role |
|---|---|
| `engine.py` | `AdaptiveSkillEngine`: orchestrates the step/evolve loop |
| `prompts.py` | Prompt templates, trajectory compression, LLM judge |
| `gating.py` | Holdout validation strategy |
| `egl.py` | EGL computation and convergence check |
| `tools.py` | Bash tool spec and LLM provider factory |