Self-Improvement Loop

A general-purpose evolutionary code improvement engine powered by Claude Code. Given any GitHub repository and a measurable goal, it spawns N parallel AI agent pairs that iteratively improve the codebase through a tournament selection model. The loop continues autonomously until the goal is met or a stop condition fires.

What This Does

You point it at a repository and define a measurable objective (e.g., "improve test pass rate to 95%", "reduce inference latency below 50ms"). The system then:

Researches the codebase to identify improvement opportunities
Generates N independent improvement hypotheses in parallel, each with a concrete plan
Executes each plan in an isolated git worktree so experiments don't interfere
Benchmarks every change against a sealed evaluation (agents cannot modify the benchmark)
Selects the best-performing change via tournament and merges it
Records every result — winners and losers alike — as institutional memory that informs future iterations

The core invariant: every improvement is benchmarked, every result is recorded, and only the best change advances.

Once started, the loop runs fully autonomously; no human intervention is needed between iterations. The system stops when the target is reached, a plateau is detected, the max iteration count is hit, or a circuit breaker fires after repeated failures.

Setup

Clone a target repo into want_to_improve/
Define your goal in docs/user_defined/goal.md
Set benchmark_command in docs/user_defined/settings.json
Configure guardrails in docs/user_defined/harness.md
Run python3 docs/user_defined/initial_setup.py (or let the orchestrator walk you through it)

Architecture

CLAUDE.md (loop controller)
  ├── researcher         — analyzes codebase, finds improvement opportunities
  ├── planner (×N)       — generates improvement hypotheses
  │     ├── plan-creator     — structures plan documents
  │     ├── plan-architect   — reviews architectural soundness
  │     └── plan-critic      — enforces harness rules
  ├── executor (×N)      — implements plans in isolated worktrees
  └── github-manager     — picks winner, merges, records history

Each iteration:

Research — analyze the repo and past results
Plan — N agents each propose a different improvement hypothesis
Review — architect and critic validate each plan
Execute — approved plans run in parallel, each in its own worktree
Select — benchmark all results, best one merges (tournament selection)
Record — every result (win or lose) becomes institutional memory
Repeat — until the goal is reached or a plateau is detected

Project Structure

CLAUDE.md                        # Orchestrator entry point (loop controller)
claude/
  agents/
    si-researcher/               # Codebase analysis + research briefs
    si-planner/                  # Hypothesis generation
      skills/
        si-plan-creator/         # Structures plan documents
        si-plan-architect/       # Reviews architectural soundness
        si-plan-critic/          # Enforces harness rules
    si-executor/                 # Experiment runner in isolated worktrees
    si-github-manager/           # Tournament selection, merge, branch management
  skills/
    si-goal-clarifier/           # Interactive goal definition
    si-benchmark-builder/        # Benchmark creation wizard
docs/
  user_defined/                  # Your config: goal, harness, settings, setup
  agent_defined/                 # Runtime state: iteration history, research briefs
  theory/                        # Design docs, data contracts
scripts/
  validate.sh                    # Sealed file + schema validation
  plot_progress.py               # Progress visualization
want_to_improve/                 # Target repo (cloned during setup)
tracking_history/                # Raw benchmark data + progress chart

Key Concepts

Tournament Selection — N parallel experiments per iteration, single winner merges. All candidates benchmark against the same baseline commit, so scores are directly comparable. Ties are broken by preferring fewer lines changed, then by lower executor ID.
Archive Tag Management — losing experiment branches are tagged as archive/round_{n}_executor_{id} for traceability. Tags accumulate over time; set max_archive_tags in settings.json to enable automatic pruning of the oldest tags.
Institutional Memory — every result (win or lose) is recorded with structured failure analysis. Planners must read the full history before proposing new hypotheses, preventing the system from rediscovering dead ends.
One Hypothesis Per Plan — each plan tests exactly one idea. If the benchmark improves, you know why. If it regresses, you know what to revert. Multi-hypothesis plans are rejected by the critic.
Approach Family Taxonomy — every plan is tagged with a category (architecture, training_config, data, optimization, etc.). The system tracks which families are working and prevents overexploitation of a single family (max 3 consecutive wins from the same family).
Harness Rules — enforced by a critic agent before execution: one hypothesis per plan (H001), no repeating the same approach family 3x (H002), diversity within each round (H003). Custom rules can be added.
Sealed Evaluation — benchmark code is marked read-only via sealed_files in settings. validate.sh enforces this with both git diff checks and SHA-256 hash verification. Agents cannot game the metric.
Research-Driven Planning — a dedicated researcher agent explores the codebase, checks open issues, searches papers, and produces a ranked research brief before planners start. User-provided ideas in idea.md take priority.
Plateau Detection — auto-stops when improvement stays below plateau_threshold for plateau_window consecutive iterations.
Circuit Breaker — halts after circuit_breaker_threshold consecutive no-winner iterations, which indicates a systemic problem that needs human review.
Resumability — the system tracks within-iteration progress in iteration_state.json. If interrupted at any step, it resumes from exactly where it left off without re-running completed work.
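
The tournament and tie-break rules described above can be sketched in Python. This is a minimal illustration, not the project's implementation: the `Candidate` record, its field names (`executor_id`, `score`, `lines_changed`), and the rule that a winner must strictly beat the baseline score are assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Candidate:
    """Hypothetical record for one executor's benchmarked result."""
    executor_id: int
    score: float
    lines_changed: int


def select_winner(
    candidates: Sequence[Candidate],
    baseline_score: float,
    direction: str = "higher_is_better",
) -> Optional[Candidate]:
    """Return the tournament winner, or None if no candidate beats the baseline."""
    sign = 1.0 if direction == "higher_is_better" else -1.0
    # Assumption: only candidates that strictly improve on the baseline are eligible.
    eligible = [c for c in candidates if sign * (c.score - baseline_score) > 0]
    if not eligible:
        return None
    # Best score first; ties go to fewer lines changed, then to the lower executor ID.
    return min(
        eligible,
        key=lambda c: (-sign * c.score, c.lines_changed, c.executor_id),
    )


round_results = [
    Candidate(executor_id=1, score=91.0, lines_changed=120),
    Candidate(executor_id=2, score=91.0, lines_changed=40),
    Candidate(executor_id=3, score=88.5, lines_changed=10),
]
winner = select_winner(round_results, baseline_score=90.0)
print(winner)
```

Here executors 1 and 2 tie on score, so the smaller diff (executor 2) wins. A round in which `select_winner` returns `None` would count as a no-winner iteration for the circuit breaker.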

How It Works (Detailed)

The Loop

while goal not met:
    1. Read goal + history + harness rules
    2. Researcher explores repo → produces research brief
    3. N planners each write 1 plan (1 hypothesis each)
    4. Critic validates each plan against harness rules
    5. N executors run approved plans in parallel (isolated worktrees)
    6. Tournament: best benchmark score wins → merge to improve/{goal_slug} branch
    7. Record everything (winners + losers + lessons)
    8. Update visualization (progress chart)
    9. Check stop conditions

Git Strategy

The system uses a fork-based branch-per-experiment model. All git operations (branches, worktrees, merges) happen inside want_to_improve/ (the forked repo clone), not in the self-improvement project root.

improve/{goal_slug} — accumulation branch. Only winning changes merge here. git log shows a clean history of improvements with scores. Pushed to the fork after each winner for backup.
experiment/round_{n}_executor_{id} — short-lived branches for each experiment. Created via git worktree add for full isolation.
archive/round_{n}_executor_{id} — losing branches are tagged before deletion so commits remain reachable.
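
The branch lifecycle above maps onto a handful of standard git commands. The sketch below only builds the commands as argument lists so they can be inspected before running. The helper names, the `improve/latency` base ref, and the worktree directory layout are hypothetical, and the cleanup order (tag, remove worktree, delete branch) is an assumption about how the archive step could be done.

```python
from typing import List


def experiment_branch(round_n: int, executor_id: int) -> str:
    """Branch name for one experiment, following the naming scheme above."""
    return f"experiment/round_{round_n}_executor_{executor_id}"


def archive_tag(round_n: int, executor_id: int) -> str:
    """Tag name that keeps a losing experiment's commits reachable."""
    return f"archive/round_{round_n}_executor_{executor_id}"


def create_worktree_cmd(
    round_n: int, executor_id: int, base_ref: str, worktree_dir: str
) -> List[str]:
    # `git worktree add -b <branch> <path> <commit-ish>` creates the branch
    # at base_ref and checks it out in its own directory.
    branch = experiment_branch(round_n, executor_id)
    return ["git", "worktree", "add", "-b", branch, worktree_dir, base_ref]


def archive_loser_cmds(
    round_n: int, executor_id: int, worktree_dir: str
) -> List[List[str]]:
    branch = experiment_branch(round_n, executor_id)
    return [
        # Tag first, so the commits stay reachable after the branch is gone.
        ["git", "tag", archive_tag(round_n, executor_id), branch],
        ["git", "worktree", "remove", "--force", worktree_dir],
        ["git", "branch", "-D", branch],
    ]


# Hypothetical round 2, executor 1, branching from an accumulation branch.
worktree = "../worktrees/round_2_executor_1"
create_cmd = create_worktree_cmd(2, 1, "improve/latency", worktree)
cleanup_cmds = archive_loser_cmds(2, 1, worktree)
print(" ".join(create_cmd))
for cmd in cleanup_cmds:
    print(" ".join(cmd))
```

Each list can be passed to `subprocess.run(cmd, cwd="want_to_improve", check=True)`, which keeps every git operation inside the target repo clone rather than the project root.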

Stop Conditions

Condition	When
Target reached	`best_score` meets or exceeds `target_value` (or meets or falls below it when `benchmark_direction` indicates that lower scores are better)
Plateau	Improvement < `plateau_threshold` for `plateau_window` consecutive iterations
Max iterations	`iterations` >= `max_iterations`
Circuit breaker	`circuit_breaker_threshold` consecutive iterations with no winner
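
As a rough sketch, the four conditions in the table could be evaluated as follows. The function name, the `score_history` and `no_winner_streak` bookkeeping, the order in which the conditions are checked, and the reading of `plateau_threshold` as an absolute (not relative) gain are all assumptions; only the settings keys come from this README.

```python
from typing import List, Optional


def check_stop_conditions(
    score_history: List[float],
    no_winner_streak: int,
    settings: dict,
) -> Optional[str]:
    """Return the name of the stop condition that fired, or None to keep going.

    score_history[0] is the baseline score; score_history[k] is the best
    score after iteration k (hypothetical bookkeeping for this sketch).
    """
    higher = settings["benchmark_direction"] == "higher_is_better"
    sign = 1.0 if higher else -1.0
    iterations = len(score_history) - 1
    best_score = score_history[-1]

    if sign * (best_score - settings["target_value"]) >= 0:
        return "target_reached"
    if no_winner_streak >= settings["circuit_breaker_threshold"]:
        return "circuit_breaker"

    window = settings["plateau_window"]
    if iterations >= window:
        recent = score_history[-(window + 1):]
        # Per-iteration gain, oriented so that positive always means "better".
        gains = [sign * (b - a) for a, b in zip(recent, recent[1:])]
        if all(g < settings["plateau_threshold"] for g in gains):
            return "plateau"

    if iterations >= settings["max_iterations"]:
        return "max_iterations"
    return None


settings = {
    "benchmark_direction": "higher_is_better",
    "max_iterations": 50,
    "plateau_threshold": 0.01,
    "plateau_window": 3,
    "target_value": 95.0,
    "circuit_breaker_threshold": 3,
}
print(check_stop_conditions([80.0, 85.0, 90.0], 0, settings))
print(check_stop_conditions([80.0, 85.0, 90.0, 96.0], 0, settings))
```

An iteration with no winner leaves the best score unchanged, so it also counts as a zero-gain iteration for the plateau check. When the two windows are equal, checking the circuit breaker first reports the more specific cause.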

Configuration

All configuration lives in docs/user_defined/settings.json:

{
  "number_of_agents": 3,
  "benchmark_command": "python run_eval.py",
  "benchmark_direction": "higher_is_better",
  "max_iterations": 50,
  "plateau_threshold": 0.01,
  "plateau_window": 3,
  "target_value": 95.0,
  "sealed_files": ["benchmark/eval.py"],
  "circuit_breaker_threshold": 3
}
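
To illustrate what the `sealed_files` entry protects, the following sketch shows the hash-verification idea in Python. It is not a port of validate.sh: the function names and the in-memory `recorded` mapping are hypothetical, and where the real system stores its reference hashes is not specified here.

```python
import hashlib
import json
import tempfile
from pathlib import Path
from typing import Dict, List


def sha256_of(path: Path) -> str:
    """Hex-encoded SHA-256 digest of a file's contents."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def snapshot_sealed(repo_root: Path, sealed_files: List[str]) -> Dict[str, str]:
    """Record a reference hash for every sealed file."""
    return {rel: sha256_of(repo_root / rel) for rel in sealed_files}


def modified_sealed_files(repo_root: Path, recorded: Dict[str, str]) -> List[str]:
    """Return sealed paths that are missing or whose hash no longer matches."""
    changed = []
    for rel, expected in recorded.items():
        path = repo_root / rel
        if not path.is_file() or sha256_of(path) != expected:
            changed.append(rel)
    return changed


# Demonstration in a throwaway directory standing in for the target repo.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    settings = json.loads('{"sealed_files": ["benchmark/eval.py"]}')
    target = root / "benchmark" / "eval.py"
    target.parent.mkdir(parents=True)
    target.write_text("print('score: 1.0')\n")

    recorded = snapshot_sealed(root, settings["sealed_files"])
    untouched = modified_sealed_files(root, recorded)

    target.write_text("print('score: 100.0')\n")  # simulated tampering
    tampered = modified_sealed_files(root, recorded)

print(untouched, tampered)
```

A hash check of this kind catches edits to a sealed file's contents even if the working tree's git state has been altered, which is presumably why it is paired with the git diff check.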

Data Contracts

All inter-agent communication follows strict JSON schemas defined in docs/theory/data_contracts.md:

Schema	Producer	Consumer
Plan Document	planner	critic, executor
Benchmark Result	executor	github-manager
Research Brief	researcher	planners
Iteration History	orchestrator	planners, researcher
Merge Report	github-manager	orchestrator
Visualization Data	orchestrator	plot_progress.py
Iteration State	orchestrator	orchestrator (resume)

Inspired By

autoresearch — sealed evaluation + git-as-state-machine
Orze — decentralized orchestration + research agent + circuit breaker
oh-my-claudecode — multi-agent orchestration layer for Claude Code

License

MIT
