Like autoresearch, but with guardrails.
Install · Quick Start · How It Works · Examples · Docs
繁體中文 | English
Try an idea, measure it, keep what works, discard what doesn't — and the agent can't cheat.
Autonomous experiment loops where the agent can't game the metric. Crucible enforces file-level access control (editable / readonly / hidden), validates metrics, and manages git history automatically. The agent writes code; the platform controls everything else.
- Python 3.10+
- uv — Python package manager
  ```bash
  # macOS / Linux
  curl -LsSf https://astral.sh/uv/install.sh | sh

  # or via Homebrew
  brew install uv
  ```
- Git — the platform uses git for version control of experiments
- Claude Code — the `claude` CLI must be installed and authenticated

  ```bash
  # Install
  npm install -g @anthropic-ai/claude-code

  # Authenticate (follow the prompts)
  claude
  ```
```bash
# Install as a global CLI tool
uv tool install autocrucible

# Or install from a local clone
git clone https://github.com/suzuke/autocrucible.git
uv tool install ./crucible
```

Verify:

```bash
crucible --help
```

To update:

```bash
# From PyPI
uv tool install autocrucible --force

# From local source (after pulling changes)
uv tool install ./crucible --force
```

To work on Crucible itself:

```bash
git clone https://github.com/suzuke/autocrucible.git
cd crucible
uv sync                  # install in local .venv
uv run crucible --help   # run from source
uv run pytest            # run tests
```

Crucible auto-detects your system locale. To override:

```bash
export CRUCIBLE_LANG=zh-TW   # Traditional Chinese
export CRUCIBLE_LANG=en      # English (default)
```

```bash
# From example
crucible new ~/my-project -e optimize-sorting
cd ~/my-project
crucible run --tag run1
crucible run --tag run1 --max-iterations 5   # stop after 5 iterations

# Check results
crucible status --tag run1
crucible history --tag run1
crucible postmortem --tag run1

# Continue from best result
crucible run --tag run2
```

See `crucible new . --list` for all examples, or `crucible wizard` for AI-generated projects.
If your experiment needs third-party packages (`numpy`, `torch`, etc.), install them with `uv sync` in the project directory.
```bash
crucible validate
crucible validate --stability --runs 5   # check metric variance
```

```bash
crucible -v run --tag run1   # debug-level output
```

```bash
crucible run --tag run1 --profile         # track token usage per iteration
crucible postmortem --tag run1 --tokens   # analyze after run
```

Shows prompt section breakdown, cache hit rates, and per-iteration timing. See docs/PROFILING.md for details.
```
crucible run --tag run1
              │
              ▼
┌─────────────────────────────────┐
│ 1. Assemble prompt              │  instructions + history + state
│ 2. Claude Agent SDK             │  agent reads/edits files
│ 3. Guard rails                  │  validate edits
│ 4. Git commit                   │  snapshot the change
│ 5. Run experiment               │  python evaluate.py > run.log
│ 6. Parse metric                 │  grep '^metric:' run.log
│ 7. Keep or discard              │  improved? keep : reset
│ 8. Loop                         │
└─────────────────────────────────┘
```
- Agent: Uses the Claude Agent SDK with a tool allowlist (Read, Edit, Write, Glob, Grep). The agent can read files, make targeted edits, and search the codebase — but cannot execute arbitrary commands.
- Environment: If your project has a `.venv/`, crucible automatically activates it when running experiment commands, so `python3 evaluate.py` uses the correct interpreter and packages.
- Git: Every attempt is committed. Improvements advance the branch; failures are tagged and reset, preserving the diff for analysis.
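The eval convention in steps 5–6 can be sketched as a minimal, hypothetical `evaluate.py`. Everything here (the `solution_sort` stand-in, the `ops_per_sec` metric) is illustrative, not the bundled harness — the only contract is the final `metric:` line on stdout:

```python
# Minimal sketch of the evaluate.py convention: a fixed harness runs the
# agent-edited code, checks correctness, and prints a "metric:" line that
# the platform greps out of run.log. Names here are illustrative.
import time

def solution_sort(data):
    # Stand-in for the agent-editable solution.py entry point.
    return sorted(data)

def evaluate():
    data = list(range(10_000, 0, -1))
    start = time.perf_counter()
    result = solution_sort(data)
    elapsed = time.perf_counter() - start
    assert result == sorted(data)    # correctness gate before scoring
    return len(data) / elapsed       # ops_per_sec, higher is better

if __name__ == "__main__":
    print(f"metric: {evaluate():.2f}")
```

Because the harness is hidden and the metric is parsed from its output, the agent can only improve the number by improving the code it is allowed to edit.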
Bundled examples to get started quickly. Create a project from any example:
```bash
crucible new ~/my-project -e <example-name>
```

| Example | Metric | Direction | Description |
|---|---|---|---|
| **Algorithms** | | | |
| `optimize-sorting` | `ops_per_sec` | maximize | Pure Python sorting throughput optimization |
| `optimize-pathfind` | `nodes_explored` | minimize | Grid pathfinding — showcases beam strategy |
| `optimize-hash` | `uniformity_score` | maximize | Hash function optimization for uniform distribution |
| `optimize-tsp` | `total_distance` | minimize | Travelling Salesman Problem — 200 cities route optimization |
| **ML / Data Science** | | | |
| `optimize-regression` | `val_mse` | minimize | Synthetic regression with nonlinear interactions |
| `optimize-classifier` | `val_accuracy` | maximize | Numpy-only neural network on 8-class dataset |
| `optimize-quantize` | `score` | maximize | Post-training quantization — accuracy × compression tradeoff |
| `optimize-lm` | `val_bpb` | minimize | Language model — minimize validation bits per byte |
| **Game AI** | | | |
| `optimize-gomoku` | `win_rate` | maximize | AlphaZero-style Gomoku agent training |
| `optimize-snake` | `avg_score` | maximize | Snake AI heuristic search (no dependencies) |
| `optimize-2048` | `avg_score` | maximize | 2048 game-playing AI over 20 seeded games |
| **Compression / Encoding** | | | |
| `optimize-compress` | `compression_ratio` | maximize | Lossless text compression (no zlib/gzip allowed) |
| `optimize-tokenizer` | `tokens_per_char` | minimize | BPE-style tokenizer compression for English text |
| `optimize-cipher` | `throughput` | maximize | Substitution cipher — showcases restart strategy |
| **Numerical / Scientific** | | | |
| `optimize-monte-carlo` | `error` | minimize | Monte Carlo integration — showcases stability validation |
| `optimize-rl-policy` | `mean_reward` | maximize | Pendulum swing-up controller via reinforcement learning |
| **Prompt Engineering** | | | |
| `optimize-prompt-format` | `accuracy` | maximize | System prompt optimization for format conversion tasks |
| `optimize-prompt-logic` | `accuracy` | maximize | System prompt optimization for logic reasoning |
| `optimize-prompt-math` | `accuracy` | maximize | System prompt optimization for math word problems |
| **Code / Text** | | | |
| `optimize-codegen` | `score` | maximize | Code generator — correctness × speed ratio |
| `optimize-regex` | `f1_score` | maximize | Regex pattern optimization for email classification |
A showcase example where the agent builds a lossless text compressor from scratch:
```bash
crucible new ~/compress -e optimize-compress
cd ~/compress
crucible run --tag run1
```

Starting from a baseline RLE compressor (0.51x — worse than no compression), the agent typically:
- Iter 1: Implements LZ77 + Huffman → ~2.63x
- Iter 2: Adds optimal parsing DP + symbol remapping → ~2.81x (beats zlib's 2.65x)
- Iter 3+: Context modeling, arithmetic coding → 3.0x+
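For intuition, a run-length-encoding baseline in the same spirit as the example's starting point might look like this (illustrative code, not the bundled baseline):

```python
# Toy RLE codec: each run becomes a (count, byte) pair. On plain English
# text most runs have length 1, so output is ~2x the input — which is why
# the agent must move to LZ77/Huffman-style coding to beat ratio 1.0.
def rle_compress(data: bytes) -> bytes:
    out = bytearray()
    i = 0
    while i < len(data):
        byte = data[i]
        run = 1
        while i + run < len(data) and data[i + run] == byte and run < 255:
            run += 1
        out += bytes([run, byte])   # each run costs 2 output bytes
        i += run
    return bytes(out)

def rle_decompress(blob: bytes) -> bytes:
    out = bytearray()
    for i in range(0, len(blob), 2):
        out += bytes([blob[i + 1]]) * blob[i]
    return bytes(out)

text = b"aaaabbbcc hello world"
packed = rle_compress(text)
assert rle_decompress(packed) == text           # lossless round-trip
print(f"ratio: {len(text) / len(packed):.2f}")  # < 1.0 on plain text
```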
Three examples that demonstrate the v0.5.0 search strategy and stability features:
Monte Carlo integration of ∫₀¹ x² dx. Each run uses different random samples, so the metric varies by ~30–40% between runs — exactly the scenario that makes single-run evaluation unreliable.
```bash
crucible new ~/mc -e optimize-monte-carlo
cd ~/mc
crucible validate        # detects CV ~36% > 5%, auto-writes evaluation.repeat: 3
crucible run --tag mc-v1 # now each iteration runs 3× and reports median
```

The stability check prevents the agent from chasing noise: without `evaluation.repeat`, a "lucky" run looks like an improvement even when nothing changed.
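A toy reproduction of the effect (sample count and names are illustrative, not the bundled harness):

```python
# Illustrative noisy metric in the spirit of this example: estimate
# ∫₀¹ x² dx = 1/3 from a few random samples, so the error metric varies
# from run to run even though the code never changes.
import random
import statistics

def single_run(seed, n=200):
    rng = random.Random(seed)
    estimate = sum(rng.random() ** 2 for _ in range(n)) / n
    return abs(estimate - 1 / 3)   # metric: absolute error, lower is better

# Single runs scatter widely across seeds...
errors = [single_run(seed) for seed in range(20)]
cv = statistics.stdev(errors) / statistics.mean(errors)
print(f"CV across single runs: {cv:.0%}")

# ...while a median of 3 repeats (what evaluation.repeat: 3 reports)
# is a much steadier number to compare between iterations.
median_metric = statistics.median(single_run(s) for s in (101, 102, 103))
print(f"median of 3 repeats: {median_metric:.4f}")
```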
Substitution cipher on 1 MB of text. The loop-based baseline can be optimized (list comprehension, caching) to ~55 MB/s — but `str.translate()` runs at 200+ MB/s and is a completely different approach that greedy search won't reach on its own.
```bash
crucible new ~/cipher -e optimize-cipher
cd ~/cipher
crucible run --tag cipher-v1
```

With `plateau_threshold: 4`, after 4 stagnant iterations the platform resets to the original code and injects full history. The agent sees "loop optimizations reached ceiling" and explores `str.translate()` — a ~4× breakthrough.
Key insight: Restart is not "retry". The code resets, but the agent retains full history and knows exactly which directions are exhausted.
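The gap between the two algorithm families can be seen in a few lines (toy key and input sizes, illustrative rather than the bundled harness; shown with the `bytes` variant of translate):

```python
# Per-byte Python loop vs bytes.translate, which applies the same
# 256-entry substitution table in C — the two families this example
# forces the agent to cross between.
import time

key = bytes((i * 7 + 3) % 256 for i in range(256))  # toy substitution key
table = bytes.maketrans(bytes(range(256)), key)
text = b"the quick brown fox" * 50_000              # ~1 MB of plaintext

def loop_cipher(data: bytes) -> bytes:
    return bytes(key[b] for b in data)              # baseline-style loop

start = time.perf_counter()
slow = loop_cipher(text)
loop_time = time.perf_counter() - start

start = time.perf_counter()
fast = text.translate(table)
translate_time = time.perf_counter() - start

assert slow == fast                                 # identical ciphertext
print(f"translate speedup: {loop_time / translate_time:.0f}x")
```

No amount of micro-tuning of `loop_cipher` closes that gap, which is exactly why a plateau-triggered restart is needed to jump families.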
BFS pathfinding on 100 random 20×20 grids. BFS visits ~40–70% of grid cells; A* with Manhattan heuristic visits ~10–20%; jump-point search is even more efficient.
```bash
crucible new ~/pathfind -e optimize-pathfind
cd ~/pathfind
crucible run --tag pathfind-v1
```

With `beam_width: 3`, three independent branches explore different algorithm families. Each beam sees a compact summary of what others tried — if beam-0 found bidirectional BFS and beam-1 found A*, beam-2 won't waste iterations reimplementing them.
Key insight: Beam is serial (one agent at a time, cost proportional to iterations). The advantage is exploration breadth, not speed.
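The contrast between the algorithm families the beams explore can be sketched on a single open grid (toy setup, not the example's 100-grid harness):

```python
# BFS expands nodes blindly in all directions; A* with a Manhattan
# heuristic (tie-breaking toward deeper nodes) heads for the goal and
# expands far fewer. Toy 20×20 open grid, corner to corner.
import heapq
from collections import deque

SIZE = 20
START, GOAL = (0, 0), (SIZE - 1, SIZE - 1)

def neighbors(cell):
    x, y = cell
    for nx, ny in ((x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)):
        if 0 <= nx < SIZE and 0 <= ny < SIZE:
            yield (nx, ny)

def bfs_nodes():
    seen, queue, explored = {START}, deque([START]), 0
    while queue:
        cell = queue.popleft()
        explored += 1
        if cell == GOAL:
            return explored
        for nxt in neighbors(cell):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)

def astar_nodes():
    def h(c):
        return abs(c[0] - GOAL[0]) + abs(c[1] - GOAL[1])
    frontier = [(h(START), 0, START)]   # (f, -g, cell): prefer deeper ties
    best, explored = {START: 0}, 0
    while frontier:
        _, neg_g, cell = heapq.heappop(frontier)
        g = -neg_g
        explored += 1
        if cell == GOAL:
            return explored
        for nxt in neighbors(cell):
            if g + 1 < best.get(nxt, 10**9):
                best[nxt] = g + 1
                heapq.heappush(frontier, (g + 1 + h(nxt), -(g + 1), nxt))
    return explored

print(f"BFS explored {bfs_nodes()} nodes, A* explored {astar_nodes()}")
```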
```
my-experiment/
├── .crucible/
│   ├── config.yaml          # What to optimize, how to run, what to measure
│   └── program.md           # Instructions for the LLM agent
├── solution.py              # Code the agent modifies (editable)
├── evaluate.py              # Fixed harness that measures the metric (hidden)
├── pyproject.toml           # Experiment dependencies (NOT crucible itself)
├── results-{tag}.jsonl      # Auto-generated experiment log (per run)
├── run.log                  # Latest experiment output
└── logs/                    # Per-iteration logs
    └── iter-1/
        ├── agent.txt        # Agent reasoning
        └── run.log          # Experiment output
```
Crucible is installed as a global CLI tool — it is NOT a dependency of your experiment project. Your project's `pyproject.toml` only lists experiment-specific packages (`numpy`, `torch`, etc.).
- Config Reference — all YAML fields, eval convention, git strategy, guard rails
- Token Profiling — track prompt composition, cache efficiency, and timing per iteration
- FAQ — local optima, single metric, parallel agents, safety, monitoring
- Changelog — version history and release notes