Adaptive Skill — In-Situation Evolution Without Labels

adaptive_skill evolves an agent's skill library by analyzing trajectories from a batch of tasks — without ever seeing ground-truth labels, test results, or pass/fail signals during evolution.

Core Idea

Most evolution algorithms rely on a labeled evaluation signal: the agent solves tasks, a judge tells it which answers were right, and the evolver learns from that feedback. Adaptive Skill removes the label dependency entirely. The evolver only sees what the agent did (commands, errors, loops, outputs), never whether it succeeded. An LLM judge estimates success from behavior, and the evolver uses those proxy signals to decide which skills to create or refine.

This makes Adaptive Skill suitable for domains where:

Ground-truth evaluation is expensive, slow, or unavailable during evolution.
The agent operates in open-ended environments (terminal, CLI, system administration).
You want to evolve continuously on live traffic without waiting for labeled feedback.

Algorithm Overview

flowchart TD
    A["Batch of Tasks"] --> B["Agent Solves Tasks<br/>(black-box execution)"]
    B --> C["Collect Trajectories<br/>(tool calls, outputs, errors)"]
    C --> D{"trajectory_only<br/>mode?"}
    D -- "Yes" --> E["Trajectory Signal Extraction<br/>(errors, loops, timeouts, commands)"]
    D -- "No" --> F["Standard Feedback<br/>(pass/fail, score)"]
    E --> G["LLM Judge Scoring<br/>(0-10 per trajectory)"]
    G --> H["Build Evolution Prompt<br/>(signals + verdicts + current skills)"]
    F --> H
    H --> I["Evolver LLM<br/>(meta-learning agent with bash access)"]
    I --> J{"Skill Budget<br/>Reached?"}
    J -- "No" --> K["Create New Skills<br/>from failure patterns"]
    J -- "Yes" --> L["Refine Existing Skills<br/>with new patterns"]
    K --> M["Gating Validation<br/>(holdout tasks)"]
    L --> M
    M --> N{"Accepted?"}
    N -- "Yes" --> O["Commit Mutation<br/>(git tag evo-N)"]
    N -- "No" --> P["Rollback via git"]
    O --> Q["Agent Reloads<br/>Workspace"]
    P --> Q
    Q --> A

Detailed Flow

Phase 1 — Solve

The agent processes a batch of tasks in a black-box manner. Each task produces a trajectory: the full sequence of tool calls, commands, outputs, and errors the agent encountered.

Phase 2 — Observe (Trajectory-Only)

Instead of collecting labeled feedback, the algorithm extracts behavioral signals from each trajectory:

flowchart LR
    T["Raw Trajectory"] --> S1["Signal Extraction"]
    S1 --> S2["n_turns, n_tool_calls"]
    S1 --> S3["n_errors, n_timeouts"]
    S1 --> S4["tools_used frequency"]
    S1 --> S5["repeated_commands (loops)"]
    S1 --> S6["submitted? submit_value"]
    S1 --> S7["error_snippets"]

    T --> C1["Trajectory Compression"]
    C1 --> C2["First 3 commands (approach)"]
    C1 --> C3["All errors + context"]
    C1 --> C4["Detected loops (cmd x3+)"]
    C1 --> C5["Last 3 commands (resolution)"]

    T --> J1["LLM Judge"]
    J1 --> J2["score: 0-10"]
    J1 --> J3["category: build, debug, ..."]
    J1 --> J4["outcome: one sentence"]
    J1 --> J5["failure_reason: specific cause"]
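
The signal extraction and compression branches above can be sketched in a few lines of Python. This is a minimal illustration, not the code in `prompts.py`: the `Step` schema, the `"submit"` tool name, and the 200-character error snippet limit are assumptions made for the example, and `n_turns` is omitted because the source does not say how a turn is delimited.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class Step:
    """One tool call in a trajectory (hypothetical schema)."""
    tool: str
    command: str
    output: str = ""
    error: str = ""          # empty string means the call did not error
    timed_out: bool = False


def extract_signals(steps, loop_threshold=3):
    """Compute label-free behavioral signals from a list of Step objects."""
    command_counts = Counter(s.command for s in steps)
    submits = [s for s in steps if s.tool == "submit"]
    return {
        "n_tool_calls": len(steps),
        "n_errors": sum(1 for s in steps if s.error),
        "n_timeouts": sum(1 for s in steps if s.timed_out),
        "tools_used": dict(Counter(s.tool for s in steps)),
        # A command issued loop_threshold or more times is treated as a loop.
        "repeated_commands": {
            c: n for c, n in command_counts.items() if n >= loop_threshold
        },
        "submitted": bool(submits),
        "submit_value": submits[-1].command if submits else None,
        "error_snippets": [s.error[:200] for s in steps if s.error],
    }


def compress(steps, signals, head=3, tail=3):
    """Keep the opening approach, every error, detected loops, and the ending."""
    return {
        "first_commands": [s.command for s in steps[:head]],
        "errors": [(s.command, s.error[:200]) for s in steps if s.error],
        "loops": sorted(signals["repeated_commands"]),
        "last_commands": [s.command for s in steps[-tail:]],
    }
```

Neither function looks at a pass/fail label; everything is derived from what the agent did.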

The LLM Judge acts as a proxy evaluator. It reads the compressed trajectory and estimates:

Field	Description
`score` (0-10)	0 = complete failure, 5 = partial progress, 10 = likely solved
`category`	Task type (build, debug, data-science, security, system-admin, ...)
`outcome`	One-sentence description of what happened
`failure_reason`	Specific thing that went wrong (if score < 7)
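
As a rough sketch of how a verdict with these four fields might be handled downstream, the snippet below parses a judge reply into a small dataclass. It assumes the judge replies with a single JSON object, which the source does not state; `Verdict` and `parse_verdict` are hypothetical names.

```python
import json
from dataclasses import dataclass


@dataclass
class Verdict:
    score: int
    category: str
    outcome: str
    failure_reason: str = ""


def parse_verdict(raw: str) -> Verdict:
    """Parse a judge reply (assumed to be a JSON object) into a Verdict."""
    data = json.loads(raw)
    # Clamp to the documented 0-10 range in case the judge drifts outside it.
    score = max(0, min(10, int(data["score"])))
    # failure_reason is only kept for trajectories scored below 7.
    reason = str(data.get("failure_reason") or "") if score < 7 else ""
    return Verdict(score, str(data["category"]), str(data["outcome"]), reason)
```

Clamping and dropping the failure reason for high scores keep later sorting and grouping steps from tripping over malformed judge output.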

Phase 3 — Evolve (Skill Mutation Under Budget)

The evolver LLM receives all signals and verdicts, plus the current skill library, and decides what to mutate. This is where the skill budget controls growth.

flowchart TD
    subgraph Input
        V["Judge Verdicts<br/>(score, category, failure_reason)"]
        CS["Current Skills<br/>(name, description, content)"]
        B["Skill Budget<br/>(max_skills, e.g. 5)"]
    end

    V --> SORT["Sort by judge score<br/>(lowest first)"]
    SORT --> FILTER["Filter: score < 7<br/>(FAILED or PARTIAL)"]
    FILTER --> GROUP["Group failures by<br/>category + failure_reason"]

    GROUP --> CHECK{"Pattern has<br/>2+ failed tasks?"}
    CHECK -- "No" --> SKIP["Skip<br/>(not a pattern)"]
    CHECK -- "Yes" --> BUDGET{"current_skills<br/>< max_skills?"}

    BUDGET -- "Yes (budget remaining)" --> NEW["Create New Skill<br/>targeting this failure category"]
    BUDGET -- "No (budget reached)" --> REFINE["Refine Existing Skill<br/>add new patterns from failures"]

    NEW --> QUALITY["Quality Gate:<br/>- kebab-case name<br/>- clear WHEN description<br/>- domain knowledge only<br/>- max 2000 chars<br/>- verification steps"]
    REFINE --> QUALITY
    QUALITY --> WRITE["Write skill via bash tool<br/>to skills/SKILL.md"]
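
The sort, filter, group, and budget steps in the flowchart can be approximated deterministically as follows. In the described system this decision is made by the evolver LLM, so treat this as a sketch of the decision rule only; `plan_mutations` is a hypothetical helper, and grouping by exact string equality of `failure_reason` is a simplification.

```python
from collections import defaultdict


def plan_mutations(verdicts, current_skills, max_skills=5,
                   fail_below=7, min_pattern=2):
    """Return a list of (action, category, failure_reason, task_ids) tuples.

    verdicts: iterable of dicts with task_id, score, category, failure_reason.
    """
    # Lowest scores first, keeping only FAILED or PARTIAL trajectories.
    failures = sorted(
        (v for v in verdicts if v["score"] < fail_below),
        key=lambda v: v["score"],
    )
    groups = defaultdict(list)
    for v in failures:
        groups[(v["category"], v["failure_reason"])].append(v["task_id"])

    plan = []
    n_skills = len(current_skills)
    for (category, reason), task_ids in groups.items():
        if len(task_ids) < min_pattern:
            continue  # a single failure is not treated as a pattern
        if n_skills < max_skills:
            plan.append(("create", category, reason, task_ids))
            n_skills += 1  # each planned skill consumes budget
        else:
            plan.append(("refine", category, reason, task_ids))
    return plan
```

With an empty library a recurring failure yields a `create` action; once the library holds `max_skills` entries the same failure yields `refine`.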

Skill Budget

The skill budget (max_skills, default 5) prevents unbounded skill library growth:

State	Behavior
`current < max_skills`	Evolver may create new skills for uncovered failure categories
`current >= max_skills`	No new skills allowed. Evolver must refine existing skills instead

The budget forces the evolver to produce general, high-coverage skills rather than one-off fixes. As the library fills up, new failure patterns must be folded into existing skills that cover the closest category.

What the Evolver Writes

Each skill is a SKILL.md file with YAML frontmatter:

---
name: build-legacy-c-projects
description: >
  When building legacy C/C++ projects that fail with missing GUI/X11
  dependencies or outdated Makefiles.
---

## Steps
1. Check for optional GUI dependencies (X11, SDL, ncurses) and disable them
   via configure flags or Makefile edits.
2. ...

## Verification
- `make` completes with exit code 0
- Binary exists in expected output path

Skills are loaded on demand by the agent via read_skill(name) — the agent sees skill names and descriptions in its system prompt and decides which to read for a given task.
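
Some of the quality-gate criteria listed in Phase 3 are mechanical and could be checked before a skill is accepted. The sketch below covers only those: frontmatter presence, a kebab-case name, a description field, the 2000-character limit, and a Verification section. `check_skill` is a hypothetical helper, applying the limit to the whole file is an assumption, and the "clear WHEN description" and "domain knowledge only" criteria need judgment that a regex cannot supply.

```python
import re

KEBAB = re.compile(r"^[a-z0-9]+(?:-[a-z0-9]+)*$")


def check_skill(text, max_chars=2000):
    """Return a list of quality-gate problems; an empty list means it passes."""
    match = re.match(r"^---\n(.*?)\n---\n", text, re.DOTALL)
    if not match:
        return ["missing YAML frontmatter"]
    front = match.group(1)

    problems = []
    name = re.search(r"^name:\s*(\S+)\s*$", front, re.MULTILINE)
    if not name or not KEBAB.match(name.group(1)):
        problems.append("name must be kebab-case")
    if not re.search(r"^description:", front, re.MULTILINE):
        problems.append("missing description")
    if len(text) > max_chars:
        problems.append(f"longer than {max_chars} chars")
    if not re.search(r"^## Verification\b", text, re.MULTILINE):
        problems.append("missing verification steps")
    return problems
```

Returning a list of problems rather than a boolean makes it easy to feed the reasons back to the evolver for another attempt.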

Phase 4 — Gate (Optional)

After the evolver mutates the workspace, the gating strategy validates the mutation:

flowchart LR
    MUT["Mutated Workspace"] --> HOLDOUT["Run Agent on<br/>Holdout Tasks"]
    HOLDOUT --> SCORE["Compute avg score"]
    SCORE --> THRESH{"avg >= threshold?"}
    THRESH -- "Yes" --> ACCEPT["Accept mutation<br/>git tag evo-N"]
    THRESH -- "No" --> REJECT["Reject mutation<br/>git rollback"]

Holdout tasks are sampled from the benchmark (default 20% holdout ratio). If the average holdout score of the mutated agent falls below the threshold, the entire mutation is rolled back via git.
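
A minimal sketch of the gate, assuming a seeded random split and a caller-supplied score threshold (the source does not name a config field for it). The git commands are returned rather than executed and are one plausible way to realize "tag on accept, roll back on reject"; the actual commands used by `gating.py` are not documented here.

```python
import random


def split_holdout(tasks, holdout_ratio=0.2, seed=0):
    """Shuffle tasks deterministically and reserve a fraction for gating."""
    shuffled = list(tasks)
    random.Random(seed).shuffle(shuffled)
    n_holdout = max(1, round(len(shuffled) * holdout_ratio))
    return shuffled[n_holdout:], shuffled[:n_holdout]


def gate(holdout_scores, threshold, cycle):
    """Accept when the mean holdout score reaches the threshold.

    Returns (accepted, commands); commands are described, not run.
    """
    if not holdout_scores:
        raise ValueError("gating needs at least one holdout score")
    avg = sum(holdout_scores) / len(holdout_scores)
    if avg >= threshold:
        return True, [
            ["git", "add", "-A"],
            ["git", "commit", "-m", f"evolution cycle {cycle}"],
            ["git", "tag", f"evo-{cycle}"],
        ]
    # Assumed rollback: discard tracked edits, then remove new untracked files.
    return False, [
        ["git", "reset", "--hard", "HEAD"],
        ["git", "clean", "-fd"],
    ]
```

Keeping the decision separate from the side effects makes the acceptance rule easy to test without a repository.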

Phase 5 — Reload & Converge (Optional)

The agent reloads its workspace (skills, prompts, memory) and the loop repeats. Convergence is tracked via EGL (Evolutionary Generality Loss):

EGL = (new_skills_created / total_tasks_solved) * 1000

When EGL stays below a threshold (default 0.05) for a configurable window (default 3 cycles), the evolution is considered converged — the agent has stabilized and is no longer discovering new failure patterns.
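
The formula and the windowed check translate directly into code. This is a sketch, not the contents of `egl.py`; whether `total_tasks_solved` is counted per cycle or cumulatively is not stated in the source, so the function simply takes both counts as arguments.

```python
def egl(new_skills_created, total_tasks_solved):
    """EGL = (new_skills_created / total_tasks_solved) * 1000."""
    if total_tasks_solved <= 0:
        raise ValueError("total_tasks_solved must be positive")
    return new_skills_created / total_tasks_solved * 1000


def has_converged(egl_history, threshold=0.05, window=3):
    """True when the last `window` EGL values are all below the threshold."""
    if len(egl_history) < window:
        return False
    return all(value < threshold for value in egl_history[-window:])
```

Taken literally, one new skill over 10 tasks gives an EGL of 100, far above the default threshold of 0.05. Under per-cycle counting with the default batch size, the check therefore passes only after three consecutive cycles in which no new skill is created.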

End-to-End Lifecycle

flowchart TD
    START(["Start: Base Agent<br/>(0 skills, generic prompt)"]) --> CYCLE

    subgraph CYCLE["Evolution Cycle"]
        direction TB
        SOLVE["1. Solve batch<br/>(agent runs tasks)"]
        OBS["2. Observe<br/>(extract trajectory signals,<br/>LLM judge scores)"]
        EVOLVE["3. Evolve<br/>(create/refine skills<br/>under budget)"]
        GATE["4. Gate<br/>(holdout validation)"]
        RELOAD["5. Reload workspace"]
        SOLVE --> OBS --> EVOLVE --> GATE --> RELOAD
    end

    RELOAD --> CONV{"EGL converged<br/>or max_cycles?"}
    CONV -- "No" --> CYCLE
    CONV -- "Yes" --> END(["End: Evolved Agent<br/>(N targeted skills)"])
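
The outer loop in the diagram reduces to a small driver with two exit conditions. In this sketch `run_cycle` is a hypothetical callable standing in for the five phases of one cycle, and the `max_cycles` default of 10 is an assumption; the diagram names `max_cycles` but the source gives no default for it.

```python
def run_evolution(run_cycle, max_cycles=10, egl_threshold=0.05, egl_window=3):
    """Repeat cycles until EGL stays below the threshold or max_cycles is hit.

    run_cycle(n) performs solve, observe, evolve, gate, and reload for
    cycle n and returns that cycle's EGL value.
    """
    history = []
    for n in range(1, max_cycles + 1):
        history.append(run_cycle(n))
        recent = history[-egl_window:]
        if len(recent) == egl_window and all(v < egl_threshold for v in recent):
            return n, "converged"
    return max_cycles, "max_cycles"
```

The return value records both how many cycles ran and which condition ended the run, which is useful when comparing configurations.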

Configuration

Key config fields for Adaptive Skill (set via YAML or EvolveConfig):

Parameter	Default	Description
`trajectory_only`	`True`	Only show trajectories to evolver (no labels)
`max_skills`	`5`	Skill budget — max number of skills allowed
`evolve_skills`	`True`	Allow skill creation/modification
`evolve_prompts`	`True`	Allow system prompt edits
`evolve_memory`	`True`	Allow memory updates
`protect_skills`	`False`	If True, existing skills are read-only (only new creation allowed)
`solver_proposed`	`False`	If True, the solver agent proposes draft skills for the evolver to generalize
`prompt_only`	`False`	If True, only system prompt mutations are allowed (no skills)
`batch_size`	`10`	Number of tasks per evolution cycle
`holdout_ratio`	`0.2`	Fraction of tasks reserved for gating validation
`egl_threshold`	`0.05`	EGL convergence threshold
`egl_window`	`3`	Number of consecutive cycles EGL must stay below threshold
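
For orientation, the table can be mirrored as a plain dataclass. The class below is an illustrative stand-in named `EvolveConfigSketch`, not the real `EvolveConfig`; only the field names and defaults come from the table, and the sanity checks are assumptions.

```python
from dataclasses import dataclass


@dataclass
class EvolveConfigSketch:
    """Illustrative mirror of the documented defaults."""
    trajectory_only: bool = True
    max_skills: int = 5
    evolve_skills: bool = True
    evolve_prompts: bool = True
    evolve_memory: bool = True
    protect_skills: bool = False
    solver_proposed: bool = False
    prompt_only: bool = False
    batch_size: int = 10
    holdout_ratio: float = 0.2
    egl_threshold: float = 0.05
    egl_window: int = 3

    def __post_init__(self):
        # Basic sanity checks; the real class may validate differently.
        if self.max_skills < 0:
            raise ValueError("max_skills must be non-negative")
        if not 0.0 <= self.holdout_ratio < 1.0:
            raise ValueError("holdout_ratio must be in [0, 1)")
        if self.egl_window < 1:
            raise ValueError("egl_window must be at least 1")


# Example: raise the skill budget while keeping every other default.
config = EvolveConfigSketch(max_skills=8)
```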

Key Design Decisions

Why no labels? In open-ended terminal tasks, ground-truth evaluation can be expensive (spinning up Docker environments, running test suites). By judging from trajectories alone, the evolver can run continuously without waiting for evaluation infrastructure.

Why a skill budget? Without a budget, the evolver tends to create narrow, task-specific skills that don't generalize. The budget forces consolidation, on the premise that five well-crafted category skills serve the agent better than twenty fragmented ones.

Why an LLM judge? The judge provides a structured signal (score + category + failure reason) that the evolver can sort, filter, and group. Raw trajectories are noisy; the judge distills them into actionable patterns.

Source Files

File	Role
`engine.py`	`AdaptiveSkillEngine` — orchestrates the step/evolve loop
`prompts.py`	Prompt templates, trajectory compression, LLM judge
`gating.py`	Holdout validation strategy
`egl.py`	EGL computation and convergence check
`tools.py`	Bash tool spec and LLM provider factory
