An @enchanter-ai product — algorithm-driven, agent-managed, self-learning.
Code review for AI-assisted development that catches runtime failures compile-time checks miss.
6 sub-plugins. 5 engines. 3 slash commands. Bayesian per-developer preference. One command.
A PR adds `result = user_inputs[i] / n` with `n` coming from a JSON body. M1 Cousot Interval Propagation flags `n` as `[?, ?]`: unknown lower bound, possible zero. M2 Falleri Structural Diff confirms the assignment is new, not refactored. M5 Bounded Subprocess Dry-Run synthesizes a fuzzer input, executes the change in a `resource.setrlimit` sandbox, and observes `ZeroDivisionError`. M7 Zheng Pairwise Rubric judges: 5/10 Robustness, 2/10 Failure Resilience. M6 remembers that this developer consistently cares about divide-by-zero, so next time the prior floor is 0.72, not 0.50. Verdict: HOLD with specific finding. Zero false positives from style noise. Sylph posts the finding on the PR.

Time: under 5 seconds. Developer effort: read one finding, merge.
In plain English: Tests pass. Types check. The app still crashes at 3am. Lich runs the suspect lines in a sandbox and confirms the bug before warning you — so warnings mean something again.
Technically: M1 Cousot Interval Propagation propagates abstract ranges (interval + nullability + container-shape lattices) to flag division-by-zero and null-dereference suspects; M5 Bounded Subprocess Dry-Run executes flagged call sites in a stdlib-only resource.setrlimit sandbox to confirm or dismiss each suspicion before surfacing it. M6 Beta-Binomial Thompson sampling per (developer, rule) updates a per-developer preference posterior on every accept/reject, dropping rules the developer consistently ignores toward a 5% floor so the signal-to-noise ratio never collapses.
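The interval half of that description can be illustrated with a small sketch. It is not Lich's implementation: the `Interval` class and `flag_division` helper are hypothetical names, and the nullability and container-shape lattices are left out. It only shows the idea that a divisor whose abstract range admits zero makes the division suspect.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Interval:
    """Closed numeric range [lo, hi]; infinities stand for unknown bounds."""
    lo: float = -math.inf
    hi: float = math.inf

    def contains_zero(self) -> bool:
        return self.lo <= 0 <= self.hi

    def join(self, other: "Interval") -> "Interval":
        # Least upper bound: the smallest interval covering both inputs.
        return Interval(min(self.lo, other.lo), max(self.hi, other.hi))


def flag_division(divisor: Interval) -> bool:
    """Return True when the divisor's range admits zero, so `x / n` is suspect."""
    return divisor.contains_zero()


unknown = Interval()          # value read from a JSON body: [?, ?]
validated = Interval(1, 100)  # value after a hypothetical `1 <= n <= 100` guard
```

Here `flag_division(unknown)` is true and `flag_division(validated)` is false, which is the difference between a finding that goes on to the sandbox and one that is never raised.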
Lich takes its name from Twilight Forest — the first major boss, an undead sorcerer who tests challengers through phased spell-trials before allowing passage deeper into the dimension. Every PR is a supplicant at the gate; every engine is a test the code must survive before it ships.
The question this plugin answers: Is this code good?
- Teams who've accepted that LLMs ship runtime bugs no type checker catches (`x / n` with `n ∈ [?, ?]`) and want an automated reviewer that actually runs the code.
- Reviewers tired of style-noise from Copilot / Cursor / Qodo who want the tool to learn their preferences, not flood them.
- Engineers who care about the signal-to-noise ratio staying above 1 after six months of reviews, not just week one.
Not for:
- One-off scripts, experimental notebooks, or throwaway prototypes — Lich's sandbox runner costs time you won't save.
- Teams already satisfied with their linter / type-checker combo who don't have runtime-bug incidents in their retros.
- How It Works
- What Makes Lich Different
- The Full Lifecycle
- Install
- Quickstart
- 6 Sub-Plugins, 3 Agents, 5 Engines
- What You Get Per Review
- Roadmap
- The Science Behind Lich
- vs Everything Else
- Agent Conduct (12 Modules)
- Architecture
- Acknowledgments
- Versioning & release cadence
- Contributing
- Citation
- License
Lich runs a five-engine pipeline that treats code review as static suspicion → sandboxed confirmation → Bayesian preference weighting → rubric judgment. The premise: AI-assisted development ships two dominant bug classes that traditional review tools miss.
- Runtime failures that pass compile time. `x / n` type-checks in every language; `n = 0` crashes at runtime. Static type systems don't catch it; neither does `cargo check` / `tsc`. Humans catch it on review, LLMs miss it.
- Reviewer fatigue on noisy signals. GitHub Copilot, Cursor, and Qodo ship thousands of style suggestions, all at equal weight. Developers accept/reject without the tool learning. Over time, the signal-to-noise collapses and the reviewer disables the tool.
Lich addresses both. The M1 static flagger feeds the M5 sandboxed confirmer, which catches the first; M6 Bayesian preference accumulation per (developer, rule) addresses the second. No existing reviewer ships either at zero-external-dep weight; shipping both together is genuinely novel.
Source: docs/assets/pipeline.mmd · Regeneration command in docs/assets/README.md.
M1 Cousot Interval Propagation propagates abstract ranges (interval + nullability + container-shape lattices) across every assignment. A `/ n` operation is flagged when the divisor's interval includes zero. Then M5 Bounded Subprocess Dry-Run actually executes the change in a stdlib-only sandbox (resource.setrlimit + signal.alarm + subprocess isolation) and observes whether the bug reproduces. No other reviewer ships the static-suspicion → sandboxed-confirmation pipeline at zero-external-dep weight.
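A rough sketch of the confirmation step, assuming a Unix host and a snippet that can run on its own. The `dry_run` function and its return shape are hypothetical, and the wall-clock bound uses the `timeout` argument of `subprocess.run` rather than `signal.alarm`, so this is a simplification of what the paragraph describes, not Lich's runner.

```python
import resource
import subprocess
import sys


def _limit_cpu(seconds: int):
    """Build a preexec hook that caps CPU time in the child process."""
    def apply():
        _, hard = resource.getrlimit(resource.RLIMIT_CPU)
        # Never request more than the existing hard limit allows.
        soft = seconds if hard == resource.RLIM_INFINITY else min(seconds, hard)
        resource.setrlimit(resource.RLIMIT_CPU, (soft, hard))
    return apply


def dry_run(snippet: str, cpu_seconds: int = 2, wall_seconds: float = 5.0) -> dict:
    """Run `snippet` in a separate interpreter and report what was observed."""
    try:
        proc = subprocess.run(
            [sys.executable, "-I", "-c", snippet],  # -I: isolated mode
            capture_output=True,
            text=True,
            timeout=wall_seconds,
            preexec_fn=_limit_cpu(cpu_seconds),
        )
    except subprocess.TimeoutExpired:
        return {"confirmed": False, "timed_out": True, "exception": None}
    lines = proc.stderr.strip().splitlines()
    last = lines[-1] if lines else ""
    # The final traceback line has the form "ExceptionName: message".
    exception = last.split(":", 1)[0] if proc.returncode != 0 and last else None
    return {"confirmed": exception is not None, "timed_out": False, "exception": exception}
```

Calling `dry_run("n = 0\nresult = 10 / n")` reports `ZeroDivisionError` as the observed exception, while the same snippet with `n = 2` comes back unconfirmed, so the static suspicion would be dismissed instead of surfaced.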
Every accept/reject on a rule updates a Beta-Binomial posterior per (developer, rule). After 20 rejections of "use pathlib instead of os.path" from a developer who works on legacy Python 2 code, the posterior for that surface rule drops from 0.50 → 0.08. The 5% minimum floor keeps the rule alive for edge cases. Thompson sampling preserves exploration. The result: the tool learns which signals this specific developer cares about — instead of rubber-stamp-then-disable collapse.
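The arithmetic behind that example can be sketched as follows. The Beta(2, 2) prior is an assumption: the README does not state the prior, and this one is chosen only because it starts at 0.50 and lands near 0.08 after 20 rejections. `RulePosterior` and `sample_weight` are hypothetical names.

```python
import random
from dataclasses import dataclass

FLOOR = 0.05  # minimum weight, per the 5% floor described above


@dataclass
class RulePosterior:
    """Beta(alpha, beta) belief that this developer accepts a given rule."""
    alpha: float = 2.0  # assumed prior pseudo-count of accepts
    beta: float = 2.0   # assumed prior pseudo-count of rejects

    def update(self, accepted: bool) -> None:
        # Conjugate update: each observation adds one pseudo-count.
        if accepted:
            self.alpha += 1
        else:
            self.beta += 1

    def mean(self) -> float:
        return self.alpha / (self.alpha + self.beta)

    def sample_weight(self, rng: random.Random) -> float:
        # Thompson sampling: draw from the posterior, then apply the floor.
        return max(FLOOR, rng.betavariate(self.alpha, self.beta))


posterior = RulePosterior()
for _ in range(20):
    posterior.update(accepted=False)
```

After the loop the posterior is Beta(2, 22), whose mean is 2/24, roughly 0.08. Sampling rather than using the mean is what keeps some exploration alive: an occasional high draw lets a mostly rejected rule resurface.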
M7 Zheng Pairwise Rubric runs the judge twice with position-swapped inputs and reports Kappa — a measure of how consistent the LLM judge is with itself. If Kappa drops below 0.6, the verdict is flagged unstable and falls back to a rules-only decision. No other LLM-based reviewer reports inter-judge reliability.
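For readers unfamiliar with the statistic, here is a minimal sketch of Cohen's Kappa over two passes of a judge. How Lich turns rubric scores into comparable labels is not described here, so treating each pass as a list of categorical verdicts is an assumption, as are the function and variable names.

```python
from collections import Counter

KAPPA_THRESHOLD = 0.6  # below this the verdict is treated as unstable


def cohens_kappa(first: list[str], second: list[str]) -> float:
    """Cohen's Kappa for two equal-length sequences of categorical labels."""
    if len(first) != len(second) or not first:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(first)
    # Observed agreement: fraction of items given the same label by both passes.
    observed = sum(a == b for a, b in zip(first, second)) / n
    freq_first, freq_second = Counter(first), Counter(second)
    # Chance agreement: product of the two passes' label frequencies, summed.
    expected = sum(
        (freq_first[label] / n) * (freq_second[label] / n)
        for label in freq_first.keys() | freq_second.keys()
    )
    if expected == 1.0:
        return 1.0  # both passes used a single identical label throughout
    return (observed - expected) / (1 - expected)


original_order = ["DEPLOY", "DEPLOY", "HOLD", "HOLD"]
swapped_order = ["DEPLOY", "DEPLOY", "HOLD", "DEPLOY"]
kappa = cohens_kappa(original_order, swapped_order)
unstable = kappa < KAPPA_THRESHOLD
```

In this toy case the two passes agree on three of four items, but chance agreement is already 0.5, so Kappa comes out at 0.5 and the verdict would be flagged unstable.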
Lich defers security-lane findings to Hydra (CWE classification, pattern databases) and change-classification to Crow (Bayesian trust scoring per file). The three cooperate: Lich catches code quality, Hydra catches security, Crow catches unexpected-change risk. One PR, three orthogonal verdicts, no duplicate work.
A review flows top-to-bottom through five stages. M1 Cousot Interval Propagation (lich-core) propagates abstract ranges over the changed hunks, flagging suspicious assignments and divisions. M2 Falleri Structural Diff (lich-core) clusters the changes by AST edit distance, so a 200-line rename collapses to one finding. M5 Bounded Subprocess Dry-Run (lich-sandbox) sandbox-executes each flagged hunk and observes runtime behavior. M6 Bayesian Preference Accumulation (lich-preference) weights findings by this developer's per-rule posterior. M7 Zheng Pairwise Rubric Judgment (lich-rubric) scores the aggregate along a 5-axis rubric and routes the verdict through lich-verdict (DEPLOY / HOLD / FAIL).
Source: docs/assets/lifecycle.mmd · Regeneration command in docs/assets/README.md.
Every stage is autonomous; the developer surface is pull (/lich-review), not push.
Lich ships as a 6-sub-plugin marketplace. One meta-plugin — full — lists all six as dependencies, so a single install pulls in the whole pipeline.
In Claude Code (recommended):
/plugin marketplace add enchanter-ai/lich
/plugin install full@lich
Claude Code resolves the dependency list and installs all 6 sub-plugins. Verify with /plugin list.
Want to cherry-pick? Individual sub-plugins are still installable — e.g. /plugin install lich-core@lich if you only want the M1+M2 static surface. Sandbox-less / preference-less modes degrade gracefully; Lich falls back to rules-only verdicts when an engine is missing.
git clone https://github.com/enchanter-ai/lich
cd lich
./scripts/bootstrap.sh # canonical first command; installs vis sibling

Without ./scripts/bootstrap.sh, conduct imports will silently miss and Claude Code's @-loader will fail-soft. Always bootstrap first.
| Sub-plugin | Owns | Trigger | Agent |
|---|---|---|---|
| lich-core | M1 Cousot Interval + M2 Falleri Structural Diff | skill-invoked | static-surface (Sonnet) |
| lich-sandbox | M5 Bounded Subprocess Dry-Run | skill-invoked | sandbox-runner (Sonnet) |
| lich-preference | M6 Bayesian Preference Accumulation | hook-driven (PostToolUse) | preference-learner (Haiku) |
| lich-rubric | M7 Zheng Pairwise Rubric Judgment | skill-invoked | rubric-judge (Sonnet) |
| lich-python | Python AST adapter | skill-invoked | — |
| lich-typescript | TypeScript AST adapter | skill-invoked | — |
Slash commands:
| Command | Function | Agent tier |
|---|---|---|
| `/lich-review <scope>` | On-demand deep review aggregating M1-M7 | Sonnet |
| `/lich-explain <finding_id>` | Walk through why M1/M5/M7 flagged a specific finding | Sonnet |
| `/lich-disable <rule_id>` | Permanent rule suppression with quarterly auto-reprompt | Haiku |
Write/Edit events flow through four journals — one per review-pipeline engine — and converge on the enchanted-mcp bus and the developer query surface. Color maps engines to journals: blue = lich-core (M1+M2 static suspicion) · red = lich-sandbox (M5 runtime confirmation) · purple = lich-preference (M6 Bayesian learning) · yellow = lich-rubric (M7 judgment).
Source: docs/assets/state-flow.mmd · Regeneration command in docs/assets/README.md.
plugins/lich-core/state/
├── findings.jsonl M1+M2 flagged hunks with interval + diff cluster metadata
└── metrics.jsonl per-scan timing + hunk counts
plugins/lich-sandbox/state/
├── executions.jsonl M5 sandbox runs with exit code, rlimit hit, observed exceptions
└── metrics.jsonl sandbox run counts + avg latency
plugins/lich-preference/state/
├── posteriors.json per-(developer, rule) Beta-Binomial α/β parameters
├── learnings.json cross-session preference accumulation (α=0.05)
└── metrics.jsonl accept/reject events
plugins/lich-rubric/state/
├── verdicts.jsonl M7 5-axis scores + Kappa reliability per review
└── metrics.jsonl rubric invocation metrics
Every review produces a JSONL row in lich-rubric/state/verdicts.jsonl with the 5-axis rubric scores (Robustness, Specificity, Clarity, Failure Resilience, Determinism), the Cohen's Kappa reliability number, and the final verdict (DEPLOY / HOLD / FAIL).
Tracked in docs/ROADMAP.md and the shared ecosystem map. For upcoming work specific to Lich, see issues tagged roadmap.
Every Lich engine is built on a formal mathematical model. Full derivations in docs/science/README.md.
| ID | Name | Plugin | Algorithm |
|---|---|---|---|
| M1 | Cousot Interval Propagation | lich-core | Abstract interpretation over interval + nullability + container-shape lattices with threshold widening |
| M2 | Falleri Structural Diff | lich-core | GumTree two-phase AST matching (top-down hash + bottom-up Dice) |
| M5 | Bounded Subprocess Dry-Run | lich-sandbox | Stdlib resource.setrlimit + signal.alarm + subprocess sandbox (Unix-only) |
| M6 | Bayesian Preference Accumulation | lich-preference | Beta-Binomial Thompson sampling per (developer, rule) with 5% minimum floor |
| M7 | Zheng Pairwise Rubric Judgment | lich-rubric | 5-axis rubric + position-swap debiasing + Cohen's Kappa reliability |
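As a small illustration of the "bottom-up Dice" entry for M2: in GumTree-style matching, two container nodes are compared by how many of their descendants were already matched to each other in the top-down phase. The sketch below uses hypothetical node ids and a plain dict for the mapping. It shows the coefficient only, not the full matching algorithm or Lich's code.

```python
def dice_similarity(left_desc: set[str], right_desc: set[str], mapping: dict[str, str]) -> float:
    """Share of descendants already matched to each other, in [0, 1]."""
    total = len(left_desc) + len(right_desc)
    if total == 0:
        return 0.0
    # Count left descendants whose mapped partner sits under the right node.
    common = sum(1 for node in left_desc if mapping.get(node) in right_desc)
    return 2 * common / total


# Hypothetical node ids: three of four descendants on each side were matched top-down.
left = {"a1", "a2", "a3", "a4"}
right = {"b1", "b2", "b3", "b5"}
matched = {"a1": "b1", "a2": "b2", "a3": "b3"}
score = dice_similarity(left, right, matched)
```

With three shared matches out of eight descendants in total, the score is 0.75. A matcher would compare that against a similarity threshold to decide whether the two parents correspond.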
Defining engine: M5 Bounded Subprocess Dry-Run — the static-suspicion → sandboxed-confirmation pipeline is the novel moat no existing reviewer ships at zero-external-dep weight.
Phase 2 adds M3 Yamaguchi Property-Graph Traversal, M4 Type-Reflected Invariant Synthesis, Schleimer Winnowing Clone Detection, O'Hearn Separation-Logic Bi-Abduction, and Cohort Similarity Borrowing.
Honest comparison against adjacent tools. Marks ✓ only where the feature is present and production-ready.
| Feature | Lich | GitHub Copilot | Cursor | Qodo Merge |
|---|---|---|---|---|
| Catches runtime-only bugs via sandboxed confirmation | ✓ | — | — | — |
| Per-developer Bayesian preference posterior | ✓ | — | — | — |
| Inter-judge reliability (Cohen's Kappa) reported | ✓ | — | — | — |
| Zero external runtime deps | ✓ | — | — | — |
| Markdown-file rule customization | ✓ | ✓ | ✓ | ✓ |
| Auto-generated PR comments | via Sylph | ✓ | ✓ | ✓ |
| Cross-plugin signal routing (Hydra, Crow, Pech) | ✓ | — | — | — |
Every skill inherits a reusable behavioral contract from shared/vis/conduct/ — loaded once into CLAUDE.md, applied across all plugins. This is how Claude acts inside Lich: deterministic, surgical, verifiable. Not a suggestion; a contract.
| Module | What it governs |
|---|---|
| discipline.md | Coding conduct: think-first, simplicity, surgical edits, goal-driven loops |
| context.md | Attention-budget hygiene, U-curve placement, checkpoint protocol |
| verification.md | Independent checks, baseline snapshots, dry-run for destructive ops |
| delegation.md | Subagent contracts, tool whitelisting, parallel vs. serial rules |
| failure-modes.md | 14-code taxonomy for accumulated-learning logs |
| tool-use.md | Tool-choice hygiene, error payload contract, parallel-dispatch rules |
| formatting.md | Per-target format (XML / Markdown sandwich / minimal / few-shot), prefill + stop sequences |
| skill-authoring.md | SKILL.md frontmatter discipline, discovery test |
| hooks.md | Advisory-only hooks, injection over denial, fail-open |
| precedent.md | Log self-observed failures to state/precedent-log.md; consult before risky steps |
| tier-sizing.md | Prompt verbosity scales inversely with model tier; Haiku needs mechanical steps, Opus runs on intent |
| web-fetch.md | External URL handling: cache, dedup, budget; WebFetch is Haiku-tier-only |
Interactive architecture explorer with sub-plugin diagrams, agent cards, and data flow:
docs/architecture/ — auto-generated from the codebase. Run python docs/architecture/generate.py to regenerate.
Architecture diagrams are auto-generated from source-of-truth (plugin.json, hooks.json, SKILL.md frontmatter). Never hand-edited. The full synthesized architecture is at docs/architecture/lich-architecture.md.
Lich builds on substrate laid by others:
- Claude Code (Anthropic) — the plugin surface this work extends.
- Keep a Changelog — CHANGELOG convention.
- Semantic Versioning — versioning contract.
- Contributor Covenant — Code of Conduct.
- repostatus.org — status badge.
- Citation File Format — citation metadata.
- Conventional Commits — commit convention.
Lich follows Semantic Versioning. Breaking changes land on major bumps only; the CHANGELOG flags them explicitly. Release cadence is opportunistic — tags land when accumulated fixes or features justify a cut, not on a fixed schedule. Migration notes between majors live in docs/upgrading.md.
See CONTRIBUTING.md.
If you use this project in research or derivative work, please cite it:
@software{lich_2026,
title = {Lich},
author = {{Klaiderman}},
year = {2026},
url = {https://github.com/enchanter-ai/lich}
}

See CITATION.cff for additional formats (APA, MLA, EndNote).
MIT — see LICENSE.
Lich is the code-review layer (Phase 3, pre-release) — it runs a five-engine pipeline (M1 Cousot · M2 Falleri · M5 Bounded Subprocess · M6 Bayesian Preference · M7 Zheng Rubric) over a diff and emits an advisory verdict per finding. Upstream, Crow's trust score tells Lich which changes are worth the sandbox spend. Downstream, Sylph surfaces Lich findings on the PR body at /sylph:pr time.
Lich does not observe every edit (Crow's lane — Lich runs on demand, not on every Write), orchestrate PRs (Sylph's lane), track cost (Pech's lane), or scan for security patterns (Hydra's lane). It decides whether a change is correct, not whether it's trusted or safe.
Lich joins Crow and Sylph in the Hollow-Knight cluster — three HK entities for three related dev-surface concerns. See docs/ecosystem.md § Data Flow Between Plugins for the full map.
