A reproducible platform for evolving large-language-model prompts, one small, auditable step at a time.
PromptCritical is a data‑driven, experiment‑oriented toolchain that breeds and evaluates prompts for LLMs. It automates the cycle of:
vary → evaluate → select
so you can focus on defining fitness metrics and mutation strategies, not on plumbing. A bootstrap command seeds the initial gen-0 population, after which the main loop can begin.
PromptCritical stands out by pairing a multi-model judging architecture, in which evolved prompts face a panel of independent LLM “critics” rather than a single, potentially biased scorer, with an immutable, hash-addressed prompt database that captures every ancestral mutation and evaluation in cryptographically verifiable form. This provenance layer acts as a time machine: you can rewind any prompt’s lineage, replay an entire evolutionary run bit for bit, or fork experiments without losing auditability. Together, the contest-style fitness and the tamper-evident record make PromptCritical reproducible, bias-resistant, and production-friendly in a field still dominated by ad-hoc scripts and opaque metrics.
Get your first experiment running in two commands:
# 1. Create a new experiment skeleton in a directory named "my-exp"
pcrit init my-exp
# 2. Ingest the seed prompts and create generation 0
pcrit bootstrap my-exp

You now have a complete, runnable experiment in the `my-exp` directory. See the Usage Guide for the next steps (`evolve`, `vary`, `evaluate`, `select`).
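After bootstrapping, the experiment directory looks roughly like this (a sketch; the exact nesting is an assumption, but every name appears in the glossary below):

my-exp/
├── bootstrap.edn        ; created by init, read by bootstrap
├── seeds/               ; hand-crafted seed prompts, ingested by bootstrap
└── generations/
    └── gen-000/
        └── population/  ; the initial object-prompts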
| Ingredient | Purpose |
|---|---|
| Polylith workspace | Re‑usable components, clean boundaries, lightning‑fast incremental tests |
| Immutable Prompt DB | Atomic, hash‑verified store with per‑file lock‑healing |
| Git as temporal database | Second layer of tamper-detection, allows experiment branching and backtracking |
| Failter integration | Runs large‑scale prompt contests and collects scores |
| Evolution engine | Varies, evaluates, and selects prompts to improve fitness |
Using git for population snapshots is attractive because:
- Every generation is a commit with full diff history
- You can branch for experimental evolution strategies
- Merge conflicts become meaningful (competing evolutionary pressures)
- You get distributed replication of your entire evolutionary history for free
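For example, a hypothetical session (the branch name and the one-commit-per-generation granularity are illustrative, not a documented pcrit convention):

cd my-exp
git log --oneline                # one commit per generation
git checkout -b try-tournament   # fork the run to trial a different selection policy
git diff main -- generations/    # compare evolutionary histories across branches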
For more information, see:
- OVERVIEW describes what PromptCritical is for and how it works.
- BACKGROUND explains what you need to know to use PromptCritical and gives links to where to learn it.
- DESIGN is for developers to understand the structure.
- RISKS outlines the risks that must be addressed.
Prompt engineering still feels like folklore. PromptCritical’s long-term mission is to turn it into a data-driven, evolutionary workflow:
- Store every prompt immutably with lineage, hashes, and timestamps.
- Run controlled experiments that score those prompts on real tasks (Latency / Cost / Accuracy / Consistency).
- Breed the next generation—mutate, crossover, and select—using the recorded scores as fitness.
- Repeat automatically, producing prompts that keep pace with new LLM releases and changing task definitions.
The project follows Polylith conventions, organizing the codebase into re-usable components and runnable bases. This ensures logic is reusable by any interface (e.g., the CLI or a future web service).
workspace/
├── components/
│ ├── command/ ; Reusable, high-level user commands
│ ├── experiment/ ; Defines the logical experiment context
│ ├── expdir/ ; Manages experiment directory layout
│ ├── pdb/ ; The immutable prompt database
│ ├── pop/ ; Population domain model & analysis
│ ├── reports/ ; Generates human-readable CSV reports
│ ├── results/ ; Parses canonical contest results from JSON
│ ├── config/ ; Runtime configuration (EDN → map)
│ ├── log/ ; Structured logging facade
│ └── test-helper/ ; Shared utilities for testing
└── bases/
└── cli/ ; `pcrit` command‑line entry point
| Command | Status | Description |
|---|---|---|
| `bootstrap` | ✅ | Initializes an experiment, ingests seed prompts, and creates gen-0. |
| `evaluate` | ✅ | Runs the active population in a contest and collects results. |
| `evolve` | ✅ | Automates the vary → evaluate → select loop for N generations. |
| `init` | ✅ | Creates a new, minimal experiment skeleton directory. |
| `select` | ✅ | Creates a new generation of survivors based on evaluation scores. |
| `stats` | ✅ | Displays cost and score statistics for a contest or generation. |
| `vary` | ✅ | Adds new prompt variations to the current generation's population. |
| Term | Notes | Meaning |
|---|---|---|
| `init` | Creates the initial experiment files. | One-time step that scaffolds a runnable experiment directory, including the `seeds/` folder, `bootstrap.edn`, and default configurations. |
| `bootstrap` | Creates gen-0. | One-time step that ingests prompts from `seeds/` into the prompt database, creates named links, and populates gen-0 with the initial set of object-prompts. |
| `vary` | Mutates the current generation. | Generates new candidate prompts by mutating or recombining existing ones, adding them to the current population directory (`generations/gen-000/population`, …). Now supports multiple strategies such as `:refine` and `:crossover`. |
| `evaluate` | Runs scoring but does not decide winners. | Orchestrates a Failter contest for every prompt in the current population and collects the raw fitness metrics into `failter-report.json`. |
| contest | Contest = noun; evaluate = verb/command. | A single Failter run that scores a set of prompts on a target document. It is the core operation inside `evaluate`. |
| `select` | Selection strategy is pluggable. Creates a new generation. | Picks the top-performing prompts according to `failter-report.json` and creates a new generation folder populated with symlinks to the survivors. Now supports policies such as top-N and tournament. |
| `stats` | An analysis command. Does not mutate state. | Reads one or more `failter-report.json` files and displays aggregated statistics about cost and performance scores. |
| population (`generations/gen-NNN/population/`) | See Directory Layout section. | Folder tree that holds every generation’s prompt files. Each generation gets its own numbered sub-directory. |
| experiment directory (`expdir/`) | Portable & reproducible. | Root folder that bundles prompt generations, results, Failter specs, and metadata for a single evolutionary run. |
| `failter-report.json` | Failter produces this. | Canonical filename for evaluation output: a JSON array of objects, each representing a prompt's performance metrics and metadata. |
| template placeholders | Only these two names are recognized by the templater. | Literal strings substituted when a prompt is rendered: `{{INPUT_TEXT}}` (the evaluation text corpus) and `{{OBJECT_PROMPT}}` (a prompt being operated on). |
| seed prompt | Seeds are version-controlled. | Hand-crafted prompt placed in `seeds/` that kicks off `bootstrap`. |
Use this table as the single source of truth when writing docs, code comments, or CLI help.
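For example, a hypothetical seed prompt for the `vary` step could embed the `{{OBJECT_PROMPT}}` placeholder; the wording is illustrative, not a shipped seed:

Improve the following prompt. Preserve its intent, but make the
instructions more specific and remove ambiguity:

{{OBJECT_PROMPT}}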
The project has undergone a significant architectural refactoring into a clean Polylith structure. With the completion of the v0.4 milestone, the evolutionary engine is now equipped with more sophisticated, pluggable operators for selection and mutation, mitigating the risk of premature convergence on suboptimal prompts.
- Tournament Selection: The `select` command now supports a `tournament-k=N` policy, which helps preserve population diversity by giving lower-scoring prompts a chance to survive into the next generation.
- Crossover Mutation: The `vary` command can be configured to use a `:crossover` strategy, which breeds the top two performers of a generation to create a new hybrid offspring.
- Automated `evolve` Loop: The high-level `evolve` command composes these steps into a fully automated loop, allowing multi-generation experiments to be run with a single command.
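A sketch of a multi-generation run; the flags shown are assumptions about the CLI syntax, not confirmed options, so check the CLI help before copying:

# Hypothetical flags, for illustration only.
pcrit evolve my-exp --generations 5 --select tournament-k=3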
PromptCritical does not implement scoring or judging itself. Instead, it treats Failter as a black-box experiment runner:
- We generate a `spec.yml` file that defines the entire contest for Failter.
- We shell out to the single, idempotent `failter run --spec <path>` command.
- We parse the resulting `failter-report.json` to gather fitness data for the `select` and `stats` steps.
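In shell terms, each `evaluate` step reduces to something like this (the paths are illustrative):

# pcrit writes the contest definition, then delegates all scoring:
failter run --spec my-exp/spec.yml
# The report is a JSON array of per-prompt result objects:
jq length my-exp/failter-report.json   # e.g., count the prompts scored (requires jq)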
| Milestone | New Capability |
|---|---|
| v0.2 | DONE: Implement core commands (`init`, `stats`, etc.). |
| v0.3 | DONE: Automated `evolve` command that composes the v0.2 commands. |
| v0.4 | DONE: Advanced selection & mutation operators (tournament, crossover). |
| v0.5 | Surrogate critic to pre-filter variants before Failter. |
| v0.6 | Experiment recipes (EDN/YAML) and CLI replayability. |
| v0.7 | Reporting dashboard (pcrit.web base). |
| v1.0 | Distributed workers, advanced semantic validators. |
- Clone and run the tests:
  git clone https://github.com/pragsmike/promptcritical
  cd promptcritical
  make test

- Read the design documents in the `docs/` directory.
- Hack on the next milestone. PRs welcome!
PromptCritical: because great prompts shouldn’t be accidental.