I'm trying to make ai agents behave less like coding assistants and more like cross-functional product teams.
This package brings Shape Up methodology to AI coding agents. Instead of open-ended feature lists, you shape proposals with clear scope boundaries, validate them through adversarial critique, and generate typed technical scopes ready for autonomous execution.
Built by a former product manager and developer, for AI agents and humans. I've led product teams, taught product leadership, advised product strategy, and written code across the full stack. stelow is that experience, systematized — no conference-room theory, no abstract architecture. Lessons from live products, shipped features, real teams, and real codebases. More about my background.
🎯 "Measure thrice, cut once" - applies to product decisions, not just code.
Key differentiators:
- Shape Up methodology for AI agents - IN/OUT scope boundaries, appetite-driven sizing, risk analysis, focused scoping. Every proposal is a shaped bet, not a wishlist.
- Appetite × Review Mode stage control - Two orthogonal dimensions control the full workflow: how deep to prepare (Appetite: Lean / Core / Complete) and which gates run (Review Mode: Auto / Product Spec Gate / Product Spec + Interface Gates / Product Spec + Interface + Scopes / Product Spec + Interface + Tech Review). The cascade propagates automatically through critique depth, supervisor use, verification rigor, and gate requirements - no manual stage skipping needed.
- Adversarial plan critique - Plans are reviewed for gaps, risks, and assumptions by parallel (fresh context) reviewers, not just approved in chat.
- Visual review gate - Plannotator opens the full plan for point-by-point comments before implementation, not a rubber-stamp approval.
- Appetite-scaled interface exploration - 1, 3, or 5 ASCII archetypes plus hybrid depending on scope depth - no coded mockups wasted.
- Product domain libraries - 8 domains auto-detected from your language (Pricing, Trust, Ads, Promotions, Open Source, Health, Marketplace, Business Models).
- Typed technical scopes - feature, spike, optimize, test-* with dependency mapping and sequencing for autonomous execution.
- Acceptance-based scope execution - each scope is delegated with a contract (criteria, verify commands, stop rules). On acceptance-native harnesses (e.g. pi-subagents), the child self-corrects in the same context before returning. On other harnesses, the parent re-delegates with feedback until criteria pass or max iterations exhaust.
- Audit gap-to-scope loop — post-execution audit classifies gaps (FIXED / DOCUMENTED / ESCALATED). ESCALATED gaps become new scopes in the tracking file.
/sw-nextenforces the loop: when pending scopes exist at the Audit phase, it blocks completion and resets to Execution. The cycle repeats until no scopes remain pending. - Bidirectional product ↔ tech flow — tech constraints and opportunities inform product decisions before execution. Tech Preview uses cymbal for appetite-gated codebase recon; Alignment Check catches product-vs-tech misalignment with mode-dependent resolution (auto or user-flagged).
- Stack-matched skills + fresh docs — during execution setup, the workflow discovers skills (via
npx skills) optimized for the chosen tech stack and fetches current library docs (viactx7). Both skip if already installed or unavailable. Skills install in project scope only, after user confirmation. - Real-time TUI tracking - see workflow state as it progresses through all stages.
- Pulse — autonomous inbox processing — background cron-driven system periodically checks your inbox and auto-creates workflows with
review_mode=Auto(no gates, no questions). Items needing human review skip Pulse and land in the interactive inbox for manual triage, preventing silent loops on ambiguous requests.
- Why stelow
- 🎚️ Appetite & Review Mode
- 🔄 Process
- 📋 Skills
- 🚀 Quick Start
- 📦 Installation
- External Dependencies
- 🎮 Commands
- 📡 Pulse — Autonomous Inbox Processing
- Setup per CLI
- 🖥️ Visual & TUI Integrations
- 📁 Artifact Directory
- 📖 Evidence & Limitations
- About the Author
- License
- 📞 Support
"Let's go slow to go fast: invest time in thorough planning to gain speed and deliver value in execution."
Traditional AI development: "Here's what I want. Start coding."
With stelow: The user just says:
/sw-start "Here's what I want to build"
And the workflow begins asking questions, exploring scope, shaping the proposal, reviewing for gaps, getting visual approval, and only then generating typed technical scopes for execution.
Critique → Gate → Scope sequencing. Execution (stage 12) only runs after all three pass. Lighter review modes (Auto/Product Spec Gate) skip some gates; the full path is there when you need it.
Building products with AI agents often leads to:
- Scope creep and unclear boundaries - defining what not to build is harder than what to build
- Plans without adversarial review - no one questions assumptions before coding begins
- Technical work before business validation - shipping features that shouldn't exist
- No systematic testing for AI-generated code - AI writes fast, but also writes wrong
- Generic workflows missing product-specific insights - pricing, trust, ads, and launch strategy are product decisions, not code decisions
A structured workflow that makes AI think like a product manager:
- ✅ Measure thrice, cut once - shapes proposals with IN/OUT boundaries BEFORE coding
- ✅ Strategic exploration - Job To Be Done, Opportunity Mapping, Evolutionary Principles, Market Analysis, and Product Discovery knowledge integrated
- ✅ Adversarial critique - reviews every plan for gaps, risks, and assumptions
- ✅ Visual review gate - Plannotator opens the full plan for point-by-point comments (not just chat)
- ✅ Interface exploration in ASCII art - visualize 5 different approaches in seconds, no coding wasted, then LLM creates a hybrid version combining the best points for the context
- ✅ Domain libraries - auto-detects 8 product domains (Pricing, Trust, Ads, etc.) from your language
- ✅ Technical scope mapping - breaks down into typed scopes, maps dependencies, sequences execution
- ✅ AI-aware testing strategy - for software products, with coverage targets, CI gates, and contextual evaluation of mutation testing for critical paths
- ✅ Greenfield & Brownfield - works for new products and existing product evolution
- 24 sub-skills organized into 4 layers - orchestrator + strategies + workflow stages + tactics
- Part of a broader ecosystem of 25 skills within the project (plus additional skills from other packages in the user's agent environment)
- Real-time TUI tracking with visual status overlay (
/sw-status) - Gate approval via Plannotator - review, comment, approve or reject before implementation
- Typed scopes for autonomous execution (feature, spike, test-*, optimize)
The workflow is controlled by two orthogonal dimensions: Appetite (declared by the human) and Review Mode (declared by the human). Appetite controls scope/exploration depth. Review Mode controls which gates, questions, and approvals are active.
Appetite is the scope and exploration budget - how much product depth the human wants prepared before execution.
Appetite is a constraint, not an estimate. Unlike traditional estimation (which asks "how long will this take?"), appetite asks "how much is this worth?" before the work is defined. This forces scope cuts to fit the budget - the budget never expands.
| Appetite | What it means | Scope depth | Interface exploration | Supervisor | Testing | Best for |
|---|---|---|---|---|---|---|
| Lean | Validate an idea fast. Minimal scope ceremony. | 1 minimal feature, 1-2 scopes | 1 suggested interface; no alternative exploration | Low sensitivity | Smoke tests + critical-path unit tests; a11y lint/static if UI exists | Idea validation, spike, throwaway prototype |
| Core (default) | Standard product feature. Enough depth to catch obvious gaps. | Main JTBD, 3-5 scopes | 3 interface archetypes explored + 1 hybrid recommendation | Medium sensitivity | Unit tests + integration tests for external seams; a11y codebase/browserless audit if UI exists | Most features, bug fixes, small improvements |
| Complete | Multi-feature or high-risk product work. | 8-15 scopes, full edge mapping | 5 interface archetypes explored + 1 hybrid recommendation | High sensitivity | Unit + integration + behavior/e2e + security tests; live a11y audit if UI exists | Critical features, high-risk changes, production releases |
Cut policy implied by appetite:
| Appetite | What to cut first |
|---|---|
| Lean | Edge cases, secondary flows, alternative strategies, non-critical integrations. Keep only the happy path. |
| Core | Low-value variants. Keep the main JTBD, obvious edge cases, and one alternative only if it changes the core flow. |
| Complete | Cut nothing unless impossible. Keep full edge case mapping, multiple implementation strategies, and domain context. |
The Shape Up stage runs a mechanical check (scope count, spec size) and writes a preliminary appetite_fit in the spec frontmatter. The Plan Critique stage validates it via its fresh-context feasibility reviewer (see cali-product-plan-critique checklists — Scope Fit dimension). This uses the existing 5-reviewer infrastructure instead of adding a dedicated subagent.
appetite_fit |
Meaning |
|---|---|
fits |
Proposal fits within appetite - proceed as shaped |
cuts_needed |
Proposal almost fits but needs targeted cuts (LLM suggests what; human decides) |
reshape |
Proposal fundamentally exceeds appetite - must be reshaped before continuing |
This is not an estimate. The LLM does not estimate effort - it checks whether the shaped design fits the human's declared budget. If it doesn't fit, the LLM proposes cuts or reshaping, never an appetite extension. The final decision is always human.
All three appetites benefit from appetite_fit validation by the Plan Critique's fresh-context feasibility reviewer — this uses the existing 5-reviewer infrastructure, no dedicated subagent needed. The Shape Up stage provides only a preliminary mechanical check (scope count, spec size). This aligns appetite_fit with the workflow's convention: all critical evaluations use fresh context via the Plan Critique stage.
Critique and Gate are Review Mode controls, not Appetite controls. Product Critique and Plannotator Gate are governed by Review Mode: Auto skips gates; all other modes run the configured gates. Appetite changes the depth of the shaped proposal, interface exploration, supervisor sensitivity, and test scope breadth — not whether quality gates exist.
Appetite-specific execution budget:
| Area | Lean | Core | Complete |
|---|---|---|---|
| Spec + scopes | ~1 page; 1-2 scopes; one direct implementation path | ~3 pages; 3-5 scopes; 1-2 implementation alternatives with brief rationale | ~8+ pages; 8-15 scopes; 3-5 alternatives with trade-offs |
| Cut policy | Cut edge cases, secondary flows, alternative strategies, non-critical integrations. Keep the happy path. | Cut low-value variants. Keep main JTBD, obvious edge cases, and one alternative only if it changes the core flow. | Cut nothing unless impossible. Keep full edge mapping, multiple strategies, and domain context. |
| Interface exploration | 1 suggested interface only | 3 archetypes explored + 1 hybrid recommendation | 5 archetypes explored + 1 hybrid recommendation |
| Supervisor | Low sensitivity | Medium sensitivity | High sensitivity |
| Testing | Smoke tests + critical-path unit tests | Unit tests + integration tests for external seams | Unit + integration + behavior/e2e + security tests |
| Quality baseline | Build/test/lint/typecheck always; a11y lint/static if UI exists | Build/test/lint/typecheck always; a11y codebase/browserless audit if UI exists | Build/test/lint/typecheck always; live a11y audit if UI exists |
Review Mode controls the breadth of human review — which gates, questions, and approvals are active. Unlike Appetite (depth of scope), Review Mode determines the level of human oversight during the workflow.
Review Mode is set explicitly during the setup phase via ask_user_question. It is NOT auto-detected.
| Review Mode | Plannotator Gates | Interface | IN/OUT Confirmation | Tech Approval | Best for |
|---|---|---|---|---|---|
| Auto | None | LLM decides | LLM decides | Auto | Throwaway prototype, quick validation, spike |
| Product Spec Gate | 1 pre-tech | LLM decides | LLM decides | Auto | Standard feature, bug fix, small improvement |
| Product Spec + Interface Gates | 1 pre-tech + Int.Gate | User chooses | LLM decides | Auto | Feature where interface matters |
| Product Spec + Interface + Scopes | Gate + Int.Gate | User chooses | User confirms | Auto | Critical feature, product with domain context |
| Product Spec + Interface + Tech Review | Gate + Int.Gate + Plan.Gate | User chooses | User confirms | Gate + tech Qs | Full pipeline, high-risk changes, production |
| Product Spec + Interface + Tech Review + Code Diff | Gate + Int.Gate + Plan.Gate + Diff.Gate | User chooses | User confirms | Gate + tech Qs + code diff | Maximum oversight, critical infrastructure |
Key rules:
- Auto: No gates, no Plannotator, no questions. LLM decides everything. Quickest path.
- Product Spec Gate: One Plannotator gate (spec-product visual approval before tech planning). AI resolves all gaps. Interface auto-generated, no choice. No IN/OUT confirmation.
- Product Spec + Interface Gates: Product spec gate + interface gate. User chooses between generated interface alternatives. AI resolves trivial gaps, asks about moderate/critical.
- Product Spec + Interface + Scopes: All product gates active (pre-tech + scope IN/OUT + int-gate). User confirms boundaries. Tech approval uses Auto.
- Product Spec + Interface + Tech Review: Everything in product review + tech plan goes through Plannotator gate + user answers technical questions.
- Product Spec + Interface + Tech Review + Code Diff: All the above + Plannotator code diff review on the working tree after verification. Maximum human oversight for critical changes.
Review Mode controls WHAT runs (breadth) → Which gates are active
Appetite controls HOW DEEP it runs → Scope depth per gate
| Lean | Core | Complete | |
|---|---|---|---|
| Auto | No gates. Fastest path: smaller spec, minimal verify. | No gates. Standard planning depth, standard verify. | No gates. Deep planning, full verify. |
| Product Spec + Interface + Scopes | 2 gates (Gate + Int.Gate). User confirms IN/OUT. | 2 gates + IN/OUT confirmation. Full workflow. | 2 gates + all questions. No shortcuts. |
| Product Spec + Interface + Tech Review + Code Diff | 4 gates + plan-gate + diff-gate. Full review. | 4 gates + all questions. Max oversight. | 4 gates + all questions + code diff review. No shortcuts. |
Examples:
Lean + Auto→ Fastest path: no gates, no questions, no Plannotator. LLM decides scope. Interface runs automatically with 1 suggested interface. (~6 stages)Core + Product Spec Gate→ Standard feature: 1 Plannotator gate (pre-tech), interface runs automatically with 3 interfaces + hybrid. (~10 stages)Core + Product Spec + Interface Gates→ Feature where interface matters: 1 Plannotator gate + user chooses among 3 interfaces + hybrid. (~8 stages)Complete + Product Spec + Interface + Tech Review→ Critical feature: 3 Plannotator gates + all questions. Interface explores all 5 archetypes + hybrid. No shortcuts. (~17 stages)Complete + Product Spec + Interface + Tech Review + Code Diff→ Maximum oversight: 4 Plannotator gates + code diff review. All questions. All archetypes. (~17 stages)
Product ideas vary widely in scope and risk. A throwaway prototype should not require the same planning depth as a critical production feature. The Appetite × Review Mode cascade system ensures:
- Lean appetite limits scope and exploration - smaller spec, fewer scopes, one interface suggestion, and critical-path tests only.
- Complete appetite expands exploration and verification - full edge mapping, all 5 interface archetypes + hybrid, behavior/e2e tests, security tests, and live a11y audit when UI exists.
- Auto review mode skips Plannotator - for lightweight validations where visual review is overkill
- Product Spec + Interface + Scopes review mode enforces strategy - JTBD, Opportunity Mapping, etc. run before shaping if product context exists
This is an appetite-first design: the human's declaration of review budget propagates automatically through all stages - no estimation step required.
The workflow has 3 conceptual phases (17 stages total), from idea triage to post-execution audit. See the Stage Index in the orchestrator skill for the complete stage map with auto-chain rules and flow diagram.
Stages 0-11 — From raw idea through shaped proposal, adversarial critique, visual gate approval, interface exploration, to typed technical plan. Stages 12 — Tech plan gate (conditional). Stages 13+ — Execution onward.
Traditional planning is linear: product spec → tech spec. stelow adds two feedback loops that let tech constraints and opportunities inform product decisions before execution:
-
Tech Preview — Before shaping the product spec, a lightweight codebase analysis runs (via cymbal, when available) to surface existing architecture, entry points, hotspots, and constraints. This prevents shaping features that conflict with the codebase reality. Depth is appetite-gated. Additionally searches existing features by workflow name/topic to avoid duplicating or conflicting with what already exists.
-
Codebase Feature Recon — Before tech planning generates typed scopes, a deeper cymbal investigation runs: searches for related modules, maps references (who connects to what), and analyzes impact (what breaks if changed). Depth varies by appetite — see table below.
-
Alignment Check — After tech planning generates typed scopes, a bidirectional check compares the tech plan against the product spec. If tech reveals constraints that change the product scope, the LLM classifies alignment and acts per Review Mode: Auto/Product Spec Gate auto-updates the product spec; Product Spec + Interface Gates and above ask the user. This catches "tech discovered too late" before any code is written.
| Appetite | Tech Preview (shaping) | Codebase Feature Recon (planning) | Alignment Check |
|---|---|---|---|
| Lean | cymbal search --text by workflow name |
cymbal search --text — verify existence |
Quick feasibility |
| Core | Structure overview (entry points, hotspots) + feature search | search + cymbal refs — find connections |
Standard IN/OUT vs feasibility |
| Complete | Structure + impact analysis (blast radius) + feature search | search + refs + cymbal impact — blast radius |
Deep: each scope's ACs vs codebase |
Greenfield skips all codebase analysis (no code to inspect).
If cymbal is not installed, falls back to find + git log — no cross-references or impact data.
| Review Mode | Alignment Check behavior |
|---|---|
| Auto/Product Spec Gate | Auto-resolve. Updates spec-product if needed. No questions. |
| Product Spec + Interface Gates | Auto-resolve if aligned; flags user if misaligned. |
| Product Spec + Interface + Tech Review / +Code Diff | Always shows diff, asks user to choose update/ignore/reshape. |
These loops are appetite- and mode-respecting by design — they inherit the same two-axis control as the rest of the workflow. No new mechanism needed.
Stages 13-14 — Autonomous scope execution via acceptance contracts: each scope is delegated with criteria, verify commands, and stop rules. Self-correction is harness-dependent - native acceptance loops (pi-subagents) let the child fix gaps in the same context; other harnesses use parent-controlled re-delegation. Optimization scopes use benchmark-driven iteration. Scope completion is gated - /sw-next blocks advance to Verification if any scopes remain incomplete.
Stage 14 — Verification (tests, code review, UI audit). Stage 15 — Code diff review gate (conditional). Stage 16 — Execution critique (scope fidelity, NFR coverage, edge cases, docs, test quality). The audit classifies gaps as FIXED / DOCUMENTED / ESCALATED. ESCALATED gaps become new scopes. /sw-next detects pending scopes at the Audit phase and loops back to Execution.
All 25 skills are flat in skills/ directory, ready for ~/.agents/skills/. They're organized into 4 layers plus 1 complementary skill.
Each skill is fully self-contained - the installer copies the complete directory tree including its own references/cli-tools/, references/, and stages/ files. This means:
- ✅ Skills work standalone - invoke any sub-skill (e.g.,
cali-product-shape-up,cali-product-plan-critique) independently of the orchestrator - ✅ Portable across CLIs - Pi, Claude Code, Codex, OpenCode all reference skills by name (
~/.agents/skills/) - ✅ References resolve locally - every
references/cli-tools/*.mdpath is relative to the skill's own directory - ❌ Not in
~/.agents/skills/? Use./install.shornpx skills add calionauta/stelow -g
| Skill | Purpose |
|---|---|
stelow |
Coordinates the multi-stage workflow (Setup → Context → Shape → Critique → Gate → Scope → Interface → Int.Gate → Selection → Planning → Plan.Gate → Execution → Verification → Diff.Gate → Audit) |
| Skill | Purpose |
|---|---|
cali-product-job-to-be-done |
Job To Be Done - understand what job users hire the product to do |
cali-product-discovery |
Product discovery and validation |
cali-product-opportunity-mapping |
Map opportunities to see where to focus |
cali-product-multi-method-market-analysis |
Multi-method market analysis |
cali-product-evolutionary-principles |
Evolutionary principles for sustainable development |
| Skill | Purpose |
|---|---|
cali-product-shape-up |
Shape Up planning + Tech Preview (appetite-gated codebase recon via cymbal) — surfaces codebase reality before product decisions |
cali-product-interface-alternatives |
Interface alternatives exploration (1/3/5 archetypes by appetite) |
cali-product-plan-critique |
Product plan gap analysis (flows, states, affordances, data, system, compositional quality, feasibility); mode-dependent resolution |
cali-product-codebase-critique |
Codebase structural critique (architecture, performance, AI slop) |
cali-product-ux-critique |
Full UX/UI audit (accessibility, Nielsen heuristics, personas, AI slop) |
cali-product-tech-planning |
Technical scope generation + Alignment Check (mode-gated bidirectional product↔tech feedback loop) |
cali-product-testing-ai-code |
AI-aware testing strategy with contextual mutation testing evaluation |
cali-product-testing-execution |
Post-implementation testing protocol |
cali-product-scope-executor |
Autonomous scope execution via acceptance contracts - child self-corrects (harness-dependent), parent evaluates final result |
cali-product-execution-critique |
Post-execution audit - classifies gaps as FIXED/DOCUMENTED/ESCALATED; ESCALATED gaps become new scopes |
| Skill | Purpose |
|---|---|
cali-product-ads |
Advertising and growth channels |
cali-product-business-models |
Business model canvas and options |
cali-product-health |
Product health metrics |
cali-product-marketplace-playbook |
Marketplace dynamics |
cali-product-open-source |
Open source strategy |
cali-product-pricing |
Pricing strategy and tactics |
cali-product-promotions |
Promotions and campaigns |
cali-product-trust-building |
Trust-building mechanisms |
| Skill | Purpose |
|---|---|
cali-product-coding-standards |
Self-contained coding standards - KISS, DRY, LoB, SoC, Fail Fast, YAGNI, file/function size limits |
This package works across multiple coding agents - not just pi.dev. See the compatibility table in Installation to pick your path.
| Your situation | Recommended command | What you get |
|---|---|---|
| New to CLIs (no Node, no agent) | curl -fsSL https://raw.githubusercontent.com/.../setup.sh | sh |
Node.js + pi.dev + all extensions + 25 skills |
| Already use pi.dev | git clone ... && ./install.sh |
25 skills + TUI overlay + slash commands |
| Use OpenCode / Claude Code / Codex | git clone ... && ./install.sh |
25 skills + command files (no TUI) |
| Any CLI (skills only) | npx skills add calionauta/stelow -g |
25 skills + cross-CLI support |
/sw-start auto-detects what kind of request you're making:
/sw-start "reduce complexity of the codebase"
# → Detected as: Refactor
# → Pipeline: Planning → Execution → Verification → Audit
# → Skips Shape Up, Interface, all Gates
/sw-start "fix login crash when email is empty"
# → Detected as: Bugfix
# → Pipeline: Planning → Execution → Verification → Audit
/sw-start "create a new invoicing platform"
# → Detected as: New Product
# → Full pipeline: Setup → Shape → ... → Execution → AuditIf detection is ambiguous or incorrect, you can change the category before the workflow starts. This prevents token waste from running the full Shape Up pipeline on a simple bugfix.
/sw-resume checks for git changes before resuming a paused workflow. If files changed while paused, it warns you and asks for confirmation before proceeding.
See docs/INSTALLATION.md for detailed options.
Per-agent configuration files (commands, install scripts) are in cli-agents/.
Not every feature works on every CLI. Here's what to expect:
| Feature | pi.dev | OpenCode | Claude Code | Codex |
|---|---|---|---|---|
| Skills (all 25) | ✅ | ✅ | ✅ | ✅ |
/sw-start command |
✅ Slash commands | ✅ Via sw-*.md files |
✅ Via command files | ✅ Via command files |
| TUI overlay (real-time status) | ✅ Native extension | ❌ | ❌ | ❌ |
| Plannotator visual gate | ✅ Extension | |||
| Deep hooks (events, gates) | ✅ Extension | ❌ | ❌ | ❌ |
Bottom line: The 25 skills work identically on every CLI - they run the full Shape Up workflow, generate plans, critique, scopes, everything. The deep integration features (real-time TUI overlay, slash commands, lifecycle hooks, Plannotator gate) are native to pi.dev, which has the extension system to support them. Two CLI-agnostic surfaces also read workflow state from
.stelow/files on disk: the Muxy.app webview panel (macOS terminal multiplexer) and the Herdr split-pane TUI plugin (terminal multiplexer). Both work with any CLI. All CLIs can still complete the workflow; on non-Pi CLIs it happens via chat and command files rather than extensions.
stelow is designed to be self-contained — the 25 skills + installer cover the full workflow. Some features optionally integrate with external tools for enhanced capability. Every external dependency has a documented fallback.
| Dependency | Required? | Used by | Install method | Fallback if absent |
|---|---|---|---|---|
| cymbal | Optional | Tech Preview, Codebase Feature Recon, Alignment Check | brew install 1broseidon/tap/cymbal (macOS), or go install / binary release |
Basic find + git log — no cross-references or impact data |
| npx skills | Optional | Stack-matched skill discovery during execution setup | Part of Node.js ecosystem (npx bundled with npm) |
Skip — workflow runs without stack-matched skills |
| ctx7 | Optional | Current library doc fetching during execution setup | npx @vedanth/context7 (auto-install via npx) |
Skip — docs not fetched (less informed execution) |
| sem | Optional | Entity-level diff in Execution Critique (functions, types, methods instead of raw lines); enhanced changelog + bump detection in releases | brew install sem (macOS), or go install github.com/bcongdon/sem@latest |
git diff — raw line-level only, no structural awareness |
| plannotator | Optional | Visual review gate annotation | Pi: @plannotator/pi-extension · OpenCode: @plannotator/opencode · Claude Code: @backnotprop/plannotator · Codex: built-in hook |
Manual review with approval receipt file — no structured annotation |
| safe-change (pi-agent-codebase-workflows) | Optional | Pre-execution code safety checks | npx skills add Prinova/pi-agent-codebase-workflows -g (works on Pi, OpenCode, Claude Code, Codex) |
Skip — pre-execution check omitted |
| Subagents (built-in to all CLIs) | Optional | Parallel reviewer orchestration during Plan Critique | subagent({ agent, task }) — built-in on Pi/OpenCode/Claude Code/Codex |
Sequential execution — slower, same outcome (single-context review) |
| pi-subagents | Optional (Pi only) | Advanced subagent features (fork/fresh context semantics, parent-child contracts) | npm:pi-subagents |
Use the CLI's built-in subagent() instead — same outcome, fewer features |
| pi-intercom | Optional (Pi only) | Session-to-session coordination | npm:pi-intercom |
Skip — no intercom capability |
| pi-supervisor | Optional (Pi only) | Conversation supervision during execution | npm:pi-supervisor |
Skip — no supervision; rely on stages-guard for invariant enforcement |
| Muxy.app + stelow Muxy extension | Optional (macOS) | Webview panel showing workflow state with phase progress and quick actions | Install Muxy.app, then load extension from integrations/muxy/stelow/ |
No webview — read .stelow/ files directly or use Herdr split-pane TUI |
| herdr + stelow plugin | Optional | Split-pane TUI showing workflow state with click-to-drill | herdr plugin install calionauta/stelow |
No TUI — read .stelow/ files directly or use Muxy webview panel |
Design principle: stelow is harness-agnostic. Zero external tools are required to run the full product workflow. Each optional integration enhances a specific phase but never blocks progress. The installer (./install.sh) auto-installs Pi npm packages when Pi is detected — other tools (cymbal, ctx7) remain user-managed.
For every external tool above, the workflow teaches the agent the specific fallback strategy in skills/stelow-product-orchestrator/references/cli-tools/<tool>.md. When a tool is unavailable, the orchestrator instructs the agent to use harness-native capabilities (built-in subagent(), git grep, terminal-based review with approval receipts) rather than skipping the workflow step entirely. Degraded capability is the trade-off — see the Fallback column above for what you lose without each tool.
One command, everything included. Pick this if you don't have pi.dev yet.
curl -fsSL https://raw.githubusercontent.com/calionauta/stelow/main/setup.sh | shWhat gets installed (in order):
| Step | Component | Details | Works on |
|---|---|---|---|
| 1 | Node.js | v20+ via Homebrew (macOS) or nvm (Linux/Windows) | - |
| 2 | pi.dev | @earendil-works/pi-coding-agent via npm |
pi.dev |
| 3 | Pi extensions | 12 npm packages: pi-subagents, pi-skillful, pi-intercom, pi-supervisor, @plannotator/pi-extension, pi-rewind, pi-powerline-footer, plus 5 harness tooling packages | pi.dev only |
| 4 | Skills (25) | stelow orchestrator + 24 subskills, copied to ~/.agents/skills/ |
All CLIs ✅ |
| 5 | Settings | theme, model defaults, skill shortcuts in ~/.pi/agent/settings.json |
pi.dev |
| 6 | cymbal | codebase navigation via brew install 1broseidon/tap/cymbal (macOS) or go install (Linux). Skipped gracefully if brew/Go absent |
macOS, Linux |
| 7 | ctx7 | library docs fetcher via npx @vedanth/context7 (interactive OAuth — prompts the user) |
All CLIs |
| 8 | safe-change | pre-planning regression check via npx skills add PrinNova/pi-agent-codebase-workflows -g |
All CLIs |
| 9 | Herdr plugin | stelow split-pane TUI installed via herdr plugin install calionauta/stelow — only if herdr CLI is on PATH |
All CLIs (via Herdr) |
| 10 | Muxy detection | detects /Applications/Muxy.app or muxy binary; prints install link if absent (cannot auto-install — Muxy is macOS-only, distributed via GitHub releases) |
macOS |
| 11 | Pulse (optional) | copies Pulse scripts to project's .stelow/pulse/ and creates inbox. Or run standalone: ./scripts/setup-pulse.sh (no pi required — works in CI/CD or before pi is installed) |
All CLIs (cron/launchd/systemd/Task Scheduler) |
Not using pi.dev? Skills land in
~/.agents/skills/and work on OpenCode, Claude Code, and Codex too. You just won't get the Pi-only extensions or TUI overlay. The workflow itself runs fine.Muxy.app can't be auto-installed. It's a macOS-only app (SwiftUI + libghostty), open-source under MIT license, distributed via GitHub releases. Path A detects whether it's present and tells you how to install if not. Once installed, load the stelow extension from
integrations/muxy/stelow/.
git clone https://github.com/calionauta/stelow.git
cd stelow
./install.shThe installer auto-detects your CLIs and installs skills + extensions + slash commands. Skills go to ~/.agents/skills/.
The skills are the core of this project - they work on any agent (Pi, OpenCode, Claude Code, Codex).
git clone https://github.com/calionauta/stelow.git
cd stelow
./install.shThe installer detects your CLI and installs skills + command files. No extensions, no TUI - just the 25 skills that run the workflow.
Or, with npx (no clone needed):
npx skills add calionauta/stelow -gThis installs all 25 skills to ~/.agents/skills/ - works on any CLI.
For CLI-specific setup (OpenCode config, Claude Code plugin, Codex plugin), see docs/INSTALLATION.md.
For per-CLI commands, required npm packages, third-party skills, and updates, see docs/INSTALLATION.md.
For toolchain dependencies (TypeScript, Vitest), see package.json.
This project distributes exclusively via GitHub (no npm) — see docs/SECURITY.md for rationale.
| Command | Description |
|---|---|
/sw-start [idea] |
Start new workflow. Auto-detects intent type and routes to appropriate stage pipeline. If called without arguments, reads from inbox (.stelow/inbox/items.md). If the input contains multiple items, auto-runs triage (group) + select (pick one). |
/sw-status |
Show active workflow phase list, stage progress, and scopes. |
/sw-next |
Advance to next stage. Auto-completes workflow on last phase. |
/sw-pause |
Pause active workflow (keeps state for resume). |
/sw-resume [name=] |
Resume paused/in-progress workflow. Checks git drift before resuming. |
/sw-abort [name=] |
Abort and archive active workflow. |
/sw-archive [name=] |
Archive completed or inactive workflow. |
/sw-unarchive name= |
Restore archived workflow to paused state. |
/sw-status |
Display current phase, progress, and scope status. |
/sw-ls [all|archived] |
List workflows in current project (or all projects). |
/sw-setphase phase=N |
Jump to specific phase by index. |
/sw-info [name=] |
Print workflow path, current stage, and copy-pasteable cd + /sw-resume commands. |
/sw-rename <name> |
Rename active workflow. |
/sw-complete |
Force-complete active workflow. |
/sw-inbox [add|remove|clear|history] |
View or manage deferred inbox items. |
/sw-pulse |
Manage autonomous inbox processing (see Pulse section below). |
/sw-doctor [--fix] |
Diagnose workflow health. Detects zombie workflows, index mismatches, orphaned entries. |
/sw-unlock |
Disable stage guard for current session (debug only). |
All 17 commands work in Pi natively (via
pi.registerCommand()). In OpenCode, Claude Code, Codex, 15 commands work via Skill delegation — the.mdfiles incli-agents/<cli>/commands/sw-*.mdinvoke/skill:stelow-product-orchestrator <command>and route through the orchestrator. The exceptions are/sw-inboxand/sw-pulse, which are markedpiOnlybecause they operate on filesystem state with native TUI notifications; in non-Pi CLIs, the agent falls back to reading files directly.
Pulse is a background system that periodically checks your inbox and creates workflows automatically — no interactive session needed. It runs on a timer (cron, launchd, systemd, Task Scheduler) and processes items with review_mode=Auto (no gates, no questions, no Plannotator).
cron/launchd/systemd (every 30m)
→ pulse.sh / pulse.ps1
→ checks inbox (.stelow/inbox/items.md)
→ runs `pi --print` with triage prompt
→ creates workflow(s) with review_mode=Auto
→ logs provenance to .stelow/inbox/history.jsonl
| Command | Purpose |
|---|---|
/sw-pulse status |
Show pulse state (paused, inbox count, last run) |
/sw-pulse pause |
Pause automatic processing |
/sw-pulse resume |
Resume automatic processing |
/sw-pulse process |
Force immediate processing |
/sw-pulse log [n] |
Show last N log entries |
| Flag / Env | Default | Description |
|---|---|---|
--max-items N |
10 |
Items per cycle. 0 = uncapped (all). 1 = one at a time |
--force |
— | Skip pause + user-activity checks |
--dry-run |
— | Preview without executing |
PULSE_MODEL |
— | Optional. Override harness's configured model for pi --print. If unset, uses whatever the user's harness is configured with (no hardcoded default). |
PULSE_TIMEOUT |
120 |
Max seconds for pi --print |
PULSE_USER_ACTIVITY_MINUTES |
15 |
Skip if user modified stelow.json recently |
Marking items for human review: Prefix an inbox item with [human-in-the-loop] (or [hitl]) — Pulse skips it entirely. Use for items that need human judgement (pricing, partnership, strategy). Items without the marker are processed automatically.
Conflict prevention: Pulse detects active user sessions (modified stelow.json mtime + interactive pi process) and skips automatically. Lock file prevents concurrent runs.
Setup guides: See .stelow/pulse/SETUP.md for macOS (launchd), Linux (systemd/cron), and Windows (Task Scheduler + PowerShell).
Getting the scripts: The stelow extension auto-copies Pulse scripts to .stelow/pulse/ on the first /sw-pulse invocation. To pre-stage (no pi required, useful for CI/CD or before the extension is installed): ./scripts/setup-pulse.sh [--project-dir DIR] [--dry-run].
When working on software projects, trigger the product workflow:
- Trigger: Use
/skill stelow - Execute: Only after visual review gate (Plannotator approval)
| CLI | File |
|---|---|
| Pi | ~/.pi/agent/AGENTS.md |
| OpenCode | ~/.config/opencode/AGENTS.md or project AGENTS.md |
| Claude Code | ~/.claude/CLAUDE.md or project CLAUDE.md |
| Codex | ~/.codex/AGENTS.md or project AGENTS.md |
Two CLI-agnostic surfaces read workflow state from .stelow/ files on disk and present it alongside your terminal. Pick one or both — they share no code and don't require each other.
| Surface | Host | UI model |
|---|---|---|
| Muxy webview panel | Muxy.app (macOS terminal multiplexer) | WKWebView docked/floating panel with HTML/CSS/JS |
| Herdr split-pane TUI | Herdr (terminal multiplexer) | Rust+ratatui TUI in split pane (placement = "split") |
| Muxy plugin | Herdr Extension |
|---|---|
![]() |
![]() |
Herdr Extension
Both integrations share the same workflow state (.stelow/), both work with any CLI (Pi, OpenCode, Claude Code, Codex), neither requires pi.dev.
Requires Muxy.app + the stelow extension loaded. Muxy is a macOS terminal multiplexer — think tmux with a native Mac UI: project-based terminal workspaces, tabs, splits, and custom panels. The webview panel is a Muxy plugin surface, not a web app or a Pi feature.
The panel shows:
- Current phase and progress
- Phase artifacts and outputs
- Upcoming tasks
- Quick actions
Install: Muxy.app (macOS-only) + the stelow Muxy extension at integrations/muxy/stelow/. To load the extension in Muxy: open Extensions modal → Create, pick the folder, and Muxy auto-detects the built dist/. See Muxy's Get started guide for the official workflow.
Requires Herdr + the stelow plugin installed. Herdr is a terminal multiplexer — tmux-style persistence, mouse-native panes, agent state tracking, CLI + socket API. The TUI plugin runs as a Rust binary in a split pane, the same model as
herdr-file-viewer.
The TUI shows:
- Current stage (Discovery → Shape Up → Tech Planning → ...)
- Per-stage status (✓ done, ▶ active, · pending, ! blocked)
- Drill-down: stage → project → scope → task
- Quick action invocation via
herdr plugin action invoke
Keybinds: prefix+w toggle · Tab/j/k next/prev workflow · r refresh · ? help · q/Esc quit. Detail card shows prompt + current stage + scope; click workflow rows to select.
Install: herdr plugin install calionauta/stelow (or herdr plugin link integrations/herdr/stelow/ for local dev after cargo build --release). Source under integrations/herdr/stelow/. See herdr's plugin docs for the official install/link workflow.
All workflow artifacts are stored in:
<project>/.stelow/
| Subdirectory | Contents |
|---|---|
input/ |
User's original idea |
stages/ |
Stage outputs (JTBD, Shape Up, Interface, etc.) |
plans/ |
Technical plans and specs |
reviews/ |
Plannotator feedback |
scopes/ |
Typed execution scopes |
logs/ |
Workflow execution logs |
This workflow is grounded in empirical evidence from the 2025-2026 AI agent research boom. Every architectural decision - from parallel subagent orchestration to cross-session learning - is backed by peer-reviewed papers, open-source tools, and industry benchmarks.
| Practice | Source | Evidence | Where We Implement |
|---|---|---|---|
| Parallel orchestration | CAID (Geng & Neubig, CMU, 2026) | +26.7% accuracy using git-worktree isolation + dependency DAG | 5 parallel reviewers + consolidator during plan critique |
| Cross-session learning | Cat (Liu et al., Beihang, 2025); Memory Transfer (Kim et al., KAIST, 2026) | Context as callable tool; +3.7% via abstract memory pools | Session knowledge from past cycles read during workflow setup |
| Output validation guards | Stage-Gate Agentic (PDMA, 2026); Phaselock (2026) | AI agents with gates reduce execution failures; 80 enforceable rules | Shape Up output guard + Tech Planning validation guard |
| Context isolation | Clean Context Pattern (Agent Factory, 2026); GAM (Zhejiang U., 2026) | Fresh context per agent outperforms shared pipelines; write isolation prevents contamination | subagents.md - context:"fresh" per subagent; disk-based artifacts |
| Visual review gate | Plannotator (backnotprop, 2025); Placement Theory (Tian Pan, 2026) | Browser-based plan annotation with structured feedback loop | Plannotator gate active when Review Mode > Auto; skipped in Auto |
| Intra-step recovery | Try-Heal-Retry (Nweke, 2026); PALADIN (Chaudhary et al., 2025) | 89.68% recovery rate via annotated failure trajectories | subagents.md - Retry 1× + skip with logged error per subagent |
| Parallel review isolation | CooperBench (Khatua et al., 2026) | 2-agent cooperation → 25% success vs 50% solo; monotonic decline from 68% (2 agents) to 30% (4 agents) | Plan Critique uses fresh-context subagents with zero inter-agent communication and independent file outputs |
| Communication topology limits | clawRxiv 2604.00736 (2026) | Overhead grows quadratically: C(n)=0.023n²+0.04n; 50% at n=7; agents inflate 34% when aware of peers | Max 4-5 parallel subagents (n≤5 optimal zone); no message passing between agents — each writes independent file |
| Research vs code parallelism | Co-Coder (Yang et al., 2026) | Parallel speedup requires cohesion-aware partitioning (+14% pass rate, 2.10× speedup); naive file-parallel = worse than sequential | Research/review tasks are naturally cohesion-free; code execution defaults to sequential; parallel scope execution is opt-in with file-overlap guard |
| Metric-driven optimization | ReflexGrad (Kadu et al., 2025); ReliabilityBench (Gupta et al., 2026) | +40pp lift via dual-process routing; standardized reliability measurement | optimization scopes routed to optimization goals (subagent + acceptance) |
| Acceptance-based execution | Pattern inspired by Try-Heal-Retry (Nweke, 2026) and PALADIN (Chaudhary et al., 2025) | Self-correction in same context outperforms fresh re-delegation | Scope executor delegates with acceptance contract - child self-corrects (harness-dependent) before parent evaluates |
| Audit gap-to-scope loop | Pattern inspired by Agentic Debugging (Zhang et al., 2025) | Multi-agent feedback loops improve fix rate | Audit classifies gaps → ESCALATED become new scopes → /sw-next enforces loop back to Execution |
Research parallelism, not code parallelism. All subagent parallelism in stelow is research and review — Plan Critique (4-5 parallel reviewers), Strategic Context (N skill executors), Interface exploration (5 proposals). Every reviewer receives fresh context, writes to an independent file, and communicates zero with other agents. No message passing, no shared mutable state, no concurrent code edits. This avoids the "curse of coordination" deliberately: CooperBench (2026) shows 2-agent cooperation achieves only 25% success vs 50% solo, with monotonic decline as agents increase (68% → 46% → 30% from 2→3→4 agents). Communication overhead scales quadratically (clawRxiv 2604.00736: C(n)=0.023n², 50% of tokens lost to coordination at n=7 agents). For code execution, stelow defaults to sequential scope execution — each scope runs in a single agent turn, no parallel file edits. Parallel scope execution is opt-in, guarded by the execution stage's explicit file-overlap check: if scopes modify the same files, stelow warns and recommends sequential execution or git worktree isolation with merge instructions. The rationale is documented in the execution stage: "If conflicts are extensive, consider sequential execution instead." Research parallelism shows consistent gains (+26.7% accuracy, CAID 2026); code parallelism on shared files degrades quality (CooperBench 2026). Stelow uses each pattern where evidence supports it.
Even with these guardrails, the AI agent still exhibits predictable failure modes. This workflow is a tool for amplifying human judgment, not a substitute for it.
How to read this table: Each row is honest about what the workflow can and cannot do. Every mitigation has a corresponding "not solved" assessment. Read both before deciding whether this workflow helps your context.
| # | Limitation | Impact | What the workflow tries to do | Why it's not solved |
|---|---|---|---|---|
| 1 | Context rot - compliance with own rules drops from ~73% (turn 5) to ~33% (turn 16) in long sessions | Gamage 2026, 4,416 trials, 12 models/8 providers. Replicated by Liu et al. 2023 "Lost in the Middle". | Subagents use context: "fresh". Ordered-execution-goal creates isolated scope execution. Execution stage has explicit "Context Rot Check" re-reading plan from disk. |
Reduced but not solved. The orchestrator itself can forget its own rules in long sessions spanning multiple stages. The core transformer limitation (U-shaped attention curve) remains intrinsic. |
| 2 | Confabulated research references - Agents cite nonexistent papers or books (~11-57% hallucination rate across models) | arXiv 2604.03173 - 10 models/3 databases/69K citation instances | Claim verification via Lessons Learned cross-referencing during setup. | Caught by structure, not guaranteed. Multi-model consensus (≥3 LLMs citing same work) yields 95.6% accuracy, but the workflow doesn't enforce this. |
| 3 | Silent wrong answers - Cross-task state leakage produces plausible but incorrect outputs | UCC (arXiv 2604.01350), 2026 | Write isolation per subagent; clean context pattern | Mitigated by isolation, not by detection. No mechanism to detect when contamination happens despite isolation. |
| 4 | Overconfidence in estimates - AI systematically underestimates implementation complexity | Agentic Overconfidence (ICLR 2026) - all tested agents exhibit agentic overconfidence | Appetite is declared by human as a constraint, not estimated by the LLM. The LLM only checks appetite_fit (fits/cuts_needed/reshape). No estimation step. |
Addressed by design - appetite is a constraint, not an estimate. The human sets the budget before shaping. The LLM checks fit, not effort. But the human still needs to set appetite honestly. |
| 5 | Approval gate fatigue - Users can desensitize to visual gates and approve without scrutiny | Tian Pan Apr 2026 - HITL queues have dynamics | Plannotator requires active annotations (deletions, comments, labels). Auto/Product Spec Gate review modes skip gates entirely when appropriate. | Delayed, not prevented. Review Mode selection helps reduce unnecessary gates, but if the human always picks Complete+Product Spec + Interface + Scopes, fatigue still sets in. |
| 6 | 80% Problem - AI ships the happy path (CRUD, main flow) but omits error handling, observability, security, retry, rollback, edge cases | Osmani Jan 2026 (coined the term); GitClear 2025 | Tech Planning requires NFRs per scope. Acceptance contracts can include NFR criteria (if the plan specifies them). Audit classifies omissions as gaps - ESCALATED ones become new scopes. | Partially mitigated, not solved. NFRs must be in the plan to appear in the contract. Audit classification depends on the LLM - misclassification means gaps slip through. Same model evaluates both stages. |
| 7 | Model dependency - Claude Opus, Gemini Flash, GPT-4o produce significantly different quality | Veracode 2025 - 45% of AI-generated code contains flaws across 100+ models; Anthropic Jan 2026 - RCT: AI-assisted devs score 17% lower on comprehension tests | Every artifact tracks generated_by: {model_name} in frontmatter. Gate stage shows provenance before Plannotator review. |
Transparency, not mitigation. Knowing the model helps calibrate expectations, but it doesn't fix the quality gap. The comprehension penalty (Anthropic 2026) affects users regardless. |
| 8 | Constraint decay - AI progressively violates its own self-imposed rules over time | arXiv 2026 (Constraint Decay) - structural constraints drift in backend code generation; HORIZON - agents break on long-horizon tasks | Context rot rules explicitly warn about this. "No patching in degraded context" rule blocks the most common decay pattern. | Same root cause as context rot. The warning helps, but stopping a session mid-flow is disruptive and users rarely do it. |
| 9 | Code hallucination - AI invents APIs, functions, or contracts that don't exist (~20% of failures) | CloudAPIBench - 20.41% of failures are hallucinated APIs; Code LLM failures | Verification stage runs the test suite, which catches some hallucinated APIs. | Caught by tests, not by the workflow. If tests don't exist (or are also hallucinated), neither Verification nor Critique detects it. |
| 10 | Shallow review trap - same LLM that wrote the code also reviews it | Ox Security 2025 - 300+ repos, 10 anti-patterns, AI code in production with critical flaws | Verification uses context: "fresh" subagent reviewers - same model but fresh session context. |
Automatic via context: "fresh" - fresh context restores full rule awareness lost to context rot (~33% rule adherence at turn 16 vs ~73% at turn 5). True cross-model independence offers marginal additional benefit. |
| 11 | Expertise cliff - AI fails in mature codebases with implicit conventions, undocumented architecture | Tian Pan Mai 2026; METR 2025 RCT - experienced devs 19% slower with AI | Domain libraries and structured specs help surface some conventions. Execution Critique checks for broken refs and anti-patterns. | Not addressed. This workflow was designed for greenfield or well-documented features. If your codebase has 10 years of undocumented architecture decisions, the AI will violate them. |
| 12 | Plan staleness - plans generated against one snapshot; by execution time, target has changed | Superpowers Issue #989 - parallel sessions cause spec/plan staleness | Git diff check before scope execution detects if target files changed since plan creation. | Staleness detected but not auto-resolved. Only detects file-level changes, not semantic staleness. LLM decides whether staleness matters - no forced re-plan. |
| 13 | Pipeline memory loss - no cross-session memory of own failure patterns | Flamehaven 2026 - cross-session memory, MICA governance schema | Execution Critique saves lessons from each cycle. Setup stage automatically reads past lessons with forced reflection. | Captured and injected, but not verified. Same model that made mistakes reads the lessons. Context rot can still cause mid-session forgetting. Cannot auto-verify lesson adherence. |
| 14 | Code complexity growth - AI-generated code increases complexity over time | Cursor Study (MSR 2026) - static analysis warnings +30%, code complexity +41% after month 2 | Execution Critique includes anti-pattern detection (god functions >100 lines, global mutable state). Optional Code Quality Gate with static analysis. | Caught too late. Complexity analysis happens after code is written. No mechanism to prevent complexity during generation - only flag it after. |
| 15 | Activity ≠ productivity - more PRs, more commits does not mean more value delivered | METR 2025 RCT - 19% slower for experienced devs; Faros AI 2025 - 9% more tasks, 0% DORA improvement | Appetite system anchors scope size to human attention budget. OUT/IN scoping keeps proposals focused. Execution Critique includes "close without follow-up" as valid outcome. | Honest assessment: Appetite system mitigates scope bloat, but requires human to set appetite honestly. appetite_fit is validated by the Plan Critique stage's fresh-context feasibility reviewer (reusing existing 5-reviewer infrastructure). The appetite system is new - its real-world effectiveness is not yet measured. |
| 16 | Coordination overhead — adding agents to shared-state coding tasks degrades quality | CooperBench 2026 — 2-agent cooperation: 25% success vs 50% solo; clawRxiv 2604.00736 — overhead hits 50% of tokens at n=7 agents | Parallelism limited to research/review tasks with fresh context, zero inter-agent communication, and independent file outputs. Code execution defaults to sequential. Parallel scope execution is opt-in: the execution stage checks for overlapping file targets and warns + recommends sequential or git worktree isolation. | Addressed by design — research-only parallelism and file-overlap guard. If users force parallel scope execution on overlapping files despite the warning, these failure modes re-emerge. The file-overlap check is a heuristic (file paths) rather than semantic dependency analysis. |
- Every artifact is a draft. Treat spec-product.md, spec-tech.md, critique reports, and interface proposals as first drafts that need human eyes.
- Results vary by model and codebase. A small model generating a plan for a mature codebase is a recipe for failure - regardless of how structured the workflow is.
- Human review is required. The workflow catches structural gaps (missing scopes, contradictory requirements, some untested edge cases). It does NOT catch logic errors in individual lines, security flaws in business logic, or nuanced architectural trade-offs - those need you.
We don't claim to solve product planning. We claim to structure the thinking so you catch more before you code. The rest is still up to you.
Research sourced May 2026. All references are hyperlinked for verification.
This workflow wasn't designed in a vacuum. It comes from years inside real teams — as a developer, product manager, consultant, and leader across different organizations. The skills, patterns, and disciplines here were tested, broken, and rebuilt in live product environments and real codebases, not conference rooms.
- 🇧🇷 [e-book, Brazilian Portuguese] Inovação baseada em Jobs To Be Done (Innovation based on Jobs To Be Done)
- 🇧🇷 [e-book, Brazilian Portuguese] A Arte da Experimentação: Da Ideia ao Produto (The Art of Experimentation: From Idea to Product - Innovate with a simplified process and AI assistance)
- Former Developer — built products across the full stack before moving into product
- Former Product Manager at tech companies
- Product Consultant helping leaders with strategy and teams with processes
- Creator of Triple Track Agile - adds an opportunity mapping track to product cycles
- Developed Contornos - a social technology for decentralized decisions
| Site | Description |
|---|---|
| timeproduto.com.br | Product process divided into stages, with AI tools and prompts for each stage |
| espacocalionauta.substack.com | Blog exploring AI, organizational culture, daily philosophy, narrative practices, and product thinking - with published prompts and free e-books |
MIT

