feat(psm): Add Production Systems & MLOps module by DoanNgocCuong · Pull Request #2057 · bmad-code-org/BMAD-METHOD

DoanNgocCuong · 2026-03-18T18:36:31Z

Add new PSM module for production operations:

3 agents: SRE (Minh), Security (Hà), MLOps (Linh)
6 workflows: incident-response, production-readiness, security-audit, mlops-deployment, setup-new-service, quick-diagnose
Teams for party mode integration
Registered as community module in installer

Add new PSM module for production operations:

- 3 agents: SRE (Minh), Security (Hà), MLOps (Linh)
- 6 workflows: incident-response, production-readiness, security-audit, mlops-deployment, setup-new-service, quick-diagnose
- Teams for party mode integration
- Registered as community module in installer

Co-Authored-By: Claude Opus <noreply@anthropic.com>

coderabbitai · 2026-03-18T18:38:17Z

📝 Walkthrough

Introduces a new Production Systems & MLOps (PSM) module containing four agent definitions (MLOps, Security, SRE, Operations Team), five workflow specifications (Incident Response, MLOps Deployment, Production Readiness Review, Quick Diagnose, Security Audit) with supporting templates and documentation, and registers the module in the CLI external modules list.

Changes

| Cohort / File(s) | Summary |
|---|---|
| Module Configuration `src/psm/module.yaml`, `src/psm/config.yaml` | Establishes PSM module metadata, headers, descriptions, and default knowledge base path configuration. |
| Agent Definitions `src/psm/agents/mlops/mlops.agent.yaml`, `src/psm/agents/security/security.agent.yaml`, `src/psm/agents/sre/sre.agent.yaml`, `src/psm/teams/ops-team.yaml` | Defines four agents (MLOps Engineer, Security Engineer, SRE, Operations Team) with personas, communication styles, principles, and menu-driven workflow triggers. |
| SRE Documentation `src/psm/agents/sre/sre-sidecar/production-standards.md` | Outlines SRE production standards including critical rules, SLO-first approach, blameless postmortems, and severity classification. |
| Incident Response Workflow `src/psm/workflows/bmad-psm-incident-response/SKILL.md`, `bmad-skill-manifest.yaml`, `workflow.md`, `incident-postmortem.template.md` | Defines incident triage → diagnose → fix → postmortem flow with comprehensive postmortem template including timeline, root cause analysis, impact assessment, and action items. |
| MLOps Deployment Workflow `src/psm/workflows/bmad-psm-mlops-deployment/SKILL.md`, `bmad-skill-manifest.yaml`, `workflow.md` | Establishes three-phase deployment workflow: Model Validation, Deploy Model, Monitor, with validation gates and governance artifacts. |
| Production Readiness Workflow `src/psm/workflows/bmad-psm-production-readiness/SKILL.md`, `bmad-skill-manifest.yaml`, `workflow.md`, `production-readiness.template.md` | Outlines init checklist → deep review → final decision flow with 9-dimension scorecard (Reliability, Observability, Performance, Security, Capacity, Data, Runbooks, Dependencies, Rollback). |
| Quick Diagnose Workflow `src/psm/workflows/bmad-psm-quick-diagnose/SKILL.md`, `bmad-skill-manifest.yaml`, `workflow.md` | Defines lightweight production issue diagnostic workflow as simplified alternative to full production readiness review. |
| Security Audit Workflow `src/psm/workflows/bmad-psm-security-audit/SKILL.md`, `bmad-skill-manifest.yaml`, `workflow.md`, `security-audit-report.template.md` | Establishes scope & threat model → security assessment → security report flow with detailed audit report template covering authentication, API security, secrets management, encryption, and data protection. |
| CLI Module Registration `tools/cli/external-official-modules.yaml` | Registers new bmad-production-systems module as external official module with URL, module definition, and npm package references. |

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~30 minutes

Possibly related PRs

feat(skills): add type:skill manifest for verbatim directory copying #1851: Adds bmad skill manifests and SKILL.md workflow files that depend on the installer/manifest-generator utilities referenced in this PR's skill artifacts.
feat: added tea module as an external module #1430: Both PRs modify tools/cli/external-official-modules.yaml to register new external module entries.
chore(correct-course): convert workflow.yaml to unified workflow.md #1858: Related through unified workflow.md file patterns and agent workflow reference updates to point to workflow.md definitions.

Suggested reviewers

bmadcode
alexeyv

🚥 Pre-merge checks | ✅ 3

✅ Passed checks (3 passed)

| Check name | Status | Explanation |
|---|---|---|
| Title check | ✅ Passed | The title clearly and concisely summarizes the main change: adding a Production Systems & MLOps (PSM) module. |
| Description check | ✅ Passed | The description is related to the changeset, providing details about agents, workflows, teams, and module registration. |
| Docstring Coverage | ✅ Passed | No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check. |

coderabbitai

Actionable comments posted: 11

🧹 Nitpick comments (2)

src/psm/workflows/bmad-psm-mlops-deployment/workflow.md (1)
44-48: Note: Roles reference agents outside PSM module.

The "Data Scientist" and "DevOps" roles are mentioned but not defined as agents in the PSM module (which includes Minh/SRE, Hà/Security, Linh/MLOps). This is acceptable if these roles are fulfilled by humans or agents from other modules, but consider clarifying if specific agents should be used.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/psm/workflows/bmad-psm-mlops-deployment/workflow.md` around lines 44 -
48, Clarify that the "Data Scientist" and "DevOps" roles in the table are
external/human roles or map them to existing PSM agents (e.g., Minh/SRE,
Hà/Security, Linh/MLOps); update the workflow table or add a short note
immediately after the table stating whether these roles are fulfilled by
external teams/humans or by specific PSM agents and, if mapped, explicitly list
the agent-to-role mapping (for example: "Data Scientist -> Data Lead
(external/human)"; "DevOps -> Platform Eng (Minh/SRE) or external Platform
team") so readers know who is responsible.
src/psm/agents/sre/sre-sidecar/production-standards.md (1)
5-5: Minor: Use hyphen for compound adjective.

"User Specified" should be "User-Specified" when used as a compound adjective modifying "CRITICAL Rules".
✏️ Proposed fix
-## User Specified CRITICAL Rules - Supersedes General Rules
+## User-Specified CRITICAL Rules - Supersedes General Rules
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/psm/agents/sre/sre-sidecar/production-standards.md` at line 5, Update the
heading "User Specified CRITICAL Rules - Supersedes General Rules" to use a
hyphenated compound adjective by changing "User Specified" to "User-Specified"
so the header reads "User-Specified CRITICAL Rules - Supersedes General Rules";
ensure any other occurrences of this exact heading text in the document are
updated for consistency.

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/psm/module.yaml`:
- Around line 1-13: The new module definition (code: psm, name: "PSM: Production
Systems & MLOps", header/subheader, default_selected, and knowledge_base_path
keys) lacks matching documentation; add or update docs under docs/ (e.g.,
docs/production-knowledge or a new docs/psm.md) that describe the module
purpose, agent/workflow usage, configuration keys (explain
knowledge_base_path.prompt/default/result and default_selected behavior),
example usage and any CLI/config impacts so the new psm module surface is fully
documented for developers and users.

In
`@src/psm/workflows/bmad-psm-incident-response/incident-postmortem.template.md`:
- Line 98: The "Data Loss" line contains a malformed nested placeholder `{{None
| {{Description}}}}`; remove the inner braces and use a single placeholder or a
plain choice token instead (e.g. replace with `{{Description}}` if you want a
single variable, or `{{None | Description}}` if you intend a choice between None
and Description). Update the template line that currently reads "**Data Loss**:
{{None | {{Description}}}}" to one of those corrected forms so the placeholder
syntax is valid.

In `@src/psm/workflows/bmad-psm-incident-response/workflow.md`:
- Around line 36-41: The workflow references five missing step files
(step-01-triage.md, step-02-diagnose.md, step-03-fix.md, step-04-postmortem.md,
incident-response-summary.md) used by the workflow defined in workflow.md (the
steps/architecture and execution model sections); add these files into the skill
directory with the expected content for each step (triage, diagnose, fix,
postmortem, and final summary) so the references in workflow.md resolve, or if
any step is not applicable remove its reference from workflow.md to keep the
step list and execution model consistent; ensure the filenames exactly match
step-01-triage.md, step-02-diagnose.md, step-03-fix.md, step-04-postmortem.md
and incident-response-summary.md so Step-01-Triage, Step-02-Diagnose,
Step-03-Fix, Step-04-Postmortem and the incident response summary linkage work
correctly.
- Around line 1-163: Remove the estimated_duration field from the workflow
frontmatter (frontmatter symbol: estimated_duration in workflow.md for
W-INCIDENT-001 "Production Incident Response"); add a new documentation page
under docs/ describing the Incident Response workflow overview and the 4-step
process (triage, diagnose, fix, postmortem), severity scale definitions
(SEV1–SEV4) and escalation procedures, state tracking and incident-session
management (referencing incident_context keys such as incident_id, status,
current_step, timeline), integration with observability/monitoring tools, and
postmortem template usage; finally, update docs/reference/workflow-map.md to
include this PSM module workflow (W-INCIDENT-001 / "Production Incident
Response") and link to the new docs page.

In `@src/psm/workflows/bmad-psm-mlops-deployment/SKILL.md`:
- Around line 1-6: Add the new PSM module and its six workflows to the
docs/reference by updating three files: in docs/reference/modules.md add a PSM
entry for the module name "bmad-psm-mlops-deployment" including the module code
snippet, npm package link, and list the workflows (mlops-deployment,
incident-response, production-readiness, quick-diagnose, security-audit,
setup-new-service); in docs/reference/agents.md add the PSM agents and their
trigger phrases (e.g., "deploy model", "ML deployment", "model serving" for the
mlops-deployment workflow and corresponding triggers for the others); and in
docs/reference/workflow-map.md add entries for each PSM workflow or insert a
clear link to this SKILL.md as the external PSM reference so the workflow map
points to the new module.

In
`@src/psm/workflows/bmad-psm-production-readiness/production-readiness.template.md`:
- Around line 1-367: The Production Readiness Review template (template_name:
production-readiness-checklist in production-readiness.template.md) was added
but not listed in the central docs index; update the documentation by adding a
PSM workflows entry to docs/reference/workflow-map.md that references the new
PRR workflow and either create a docs/psm/production-readiness.md documenting
the 9-dimension checklist, template usage, and decision criteria or add a PSM
section in workflow-map.md linking to
src/psm/workflows/production-readiness.template.md; ensure the entry includes
the template name, created_date (2026-03-17), and a short description so the new
workflow appears in the central index.

In `@src/psm/workflows/bmad-psm-production-readiness/workflow.md`:
- Around line 1-92: Create a new docs page describing the PRR workflow
introduced by workflow_id PRR001 (workflow_name: "Production Readiness Review");
include the purpose, execution path (START → Step 01 Init Checklist → Step 02
Deep Review → Step 03 Final Decision), integration with the PSM skill framework,
role responsibilities (Lead: Minh (SRE), Subject Matter: Service Owner, Review
Committee), the 9 evaluation dimensions (Reliability, Observability,
Performance, Security, Capacity, Data, Runbooks, Dependencies, Rollback) with
evaluation criteria, the output format/usage of the Production Readiness
Checklist (template: production-readiness.template.md and outputFile pattern),
and decision types (GO/NO-GO/CONDITIONAL-GO) plus post-workflow actions; place
the doc under docs/ and link to the entry_point steps/step-01-init-checklist.md
so reviewers can navigate from the workflow to the documentation.

In `@src/psm/workflows/bmad-psm-quick-diagnose/workflow.md`:
- Line 5: The workflow's entry_point references missing step files so the
workflow cannot run; create the two step files named steps/step-01-gather.md and
steps/step-02-diagnose.md and implement the corresponding step content described
by the workflow (Gather Context and Diagnose & Fix) so the entry_point
(entry_point: steps/step-01-gather.md) and the 2-step execution path in the
workflow.md are satisfied; ensure the new files exist under a steps/ directory
and include the appropriate metadata and instructions matching the workflow's
described steps so the workflow can execute end-to-end.

In `@src/psm/workflows/bmad-psm-setup-new-service/workflow.md`:
- Around line 37-44: The workflow lists six step files (step-01-architecture.md,
step-02-api-database.md, step-03-build-deploy.md, step-04-reliability.md,
step-05-security-infra.md, step-06-readiness.md) that are not present; either
add those missing markdown files with the corresponding step content (matching
the headings in workflow.md: Step-01-Architecture through Step-06-Readiness)
into the PR, or remove/update the six step references in workflow.md so the
workflow only points to existing step files or a consolidated document; ensure
filenames and the step identifiers in workflow.md (e.g., Step-01-Architecture,
Step-02-API-Database, etc.) match exactly.
- Around line 1-116: Add a new "PSM / Production Systems Management" section to
docs/reference/workflow-map.md (place it under Phase 4: Implementation or as its
own Production Systems section) and list the six workflows from this diff:
bmad-psm-setup-new-service (Setup Production Service for BMAD),
bmad-psm-production-readiness, bmad-psm-incident-response,
bmad-psm-mlops-deployment, bmad-psm-security-audit, and bmad-psm-quick-diagnose;
for each entry include a one-line purpose and primary outputs (e.g.,
architecture docs, CI/CD pipelines, PRR checklist, runbooks, incident runbooks,
MLOps deployment artifacts, security audit reports, quick diagnostic playbooks)
following the same format and tone used by the existing BMM workflow entries so
the new workflows integrate consistently with the workflow-map structure.

In `@tools/cli/external-official-modules.yaml`:
- Around line 45-53: The entry for module 'bmad-production-systems' in
external-official-modules.yaml has a broken/non-public GitHub URL; either remove
the entire 'bmad-production-systems' block or update its 'url' to a correct,
publicly accessible repository and ensure 'module-definition' (src/module.yaml)
and 'npmPackage' (bmad-production-systems) match that repo; after updating,
validate the URL resolves (HTTP 200) and the module-definition path exists so
loading the external module will succeed.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 129ba29a-ddf5-4505-bc97-a8ee768ee290

📥 Commits

Reviewing files that changed from the base of the PR and between be555aa and a2df51c.

⛔ Files ignored due to path filters (2)

src/psm/module-help.csv is excluded by !**/*.csv
src/psm/teams/default-party.csv is excluded by !**/*.csv

📒 Files selected for processing (29)

src/psm/agents/mlops/mlops.agent.yaml
src/psm/agents/security/security.agent.yaml
src/psm/agents/sre/sre-sidecar/production-standards.md
src/psm/agents/sre/sre.agent.yaml
src/psm/config.yaml
src/psm/module.yaml
src/psm/teams/ops-team.yaml
src/psm/workflows/bmad-psm-incident-response/SKILL.md
src/psm/workflows/bmad-psm-incident-response/bmad-skill-manifest.yaml
src/psm/workflows/bmad-psm-incident-response/incident-postmortem.template.md
src/psm/workflows/bmad-psm-incident-response/workflow.md
src/psm/workflows/bmad-psm-mlops-deployment/SKILL.md
src/psm/workflows/bmad-psm-mlops-deployment/bmad-skill-manifest.yaml
src/psm/workflows/bmad-psm-mlops-deployment/workflow.md
src/psm/workflows/bmad-psm-production-readiness/SKILL.md
src/psm/workflows/bmad-psm-production-readiness/bmad-skill-manifest.yaml
src/psm/workflows/bmad-psm-production-readiness/production-readiness.template.md
src/psm/workflows/bmad-psm-production-readiness/workflow.md
src/psm/workflows/bmad-psm-quick-diagnose/SKILL.md
src/psm/workflows/bmad-psm-quick-diagnose/bmad-skill-manifest.yaml
src/psm/workflows/bmad-psm-quick-diagnose/workflow.md
src/psm/workflows/bmad-psm-security-audit/SKILL.md
src/psm/workflows/bmad-psm-security-audit/bmad-skill-manifest.yaml
src/psm/workflows/bmad-psm-security-audit/security-audit-report.template.md
src/psm/workflows/bmad-psm-security-audit/workflow.md
src/psm/workflows/bmad-psm-setup-new-service/SKILL.md
src/psm/workflows/bmad-psm-setup-new-service/bmad-skill-manifest.yaml
src/psm/workflows/bmad-psm-setup-new-service/workflow.md
tools/cli/external-official-modules.yaml

coderabbitai · 2026-03-18T18:53:07Z

src/psm/module.yaml

+code: psm
+name: "PSM: Production Systems & MLOps"
+header: "BMad Production Systems Module"
+subheader: "Production engineering workflows for incident response, production readiness, security, and MLOps."
+description: "AI-driven production engineering framework with SRE, Security, and MLOps agents."
+default_selected: false
+
+knowledge_base_path:
+  prompt:
+    - "Where is your production knowledge base? (folder with SKILL.md files)"
+    - "Leave default if you don't have one yet."
+  default: "docs/production-knowledge"
+  result: "{project-root}/{value}"


⚠️ Potential issue | 🟠 Major

Add corresponding docs/ updates for the new PSM module surface.

This change introduces new module capabilities/configuration, but no matching docs update is included here. Please add or update the relevant docs pages for module purpose, agent/workflow usage, and configuration keys.

As per coding guidelines: for src/**, “new features, changed behavior, renamed concepts, altered CLI flags, or modified configuration options should all be reflected in the relevant doc pages.”

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@src/psm/module.yaml` around lines 1 - 13, The new module definition (code: psm, name: "PSM: Production Systems & MLOps", header/subheader, default_selected, and knowledge_base_path keys) lacks matching documentation; add or update docs under docs/ (e.g., docs/production-knowledge or a new docs/psm.md) that describe the module purpose, agent/workflow usage, configuration keys (explain knowledge_base_path.prompt/default/result and default_selected behavior), example usage and any CLI/config impacts so the new psm module surface is fully documented for developers and users.
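
When documenting `knowledge_base_path`, a worked example of how the `result` template expands may help readers. The sketch below assumes the installer performs plain string substitution of `{project-root}` and `{value}`, which this PR does not confirm; `resolve_result` and the project root `/work/my-app` are hypothetical names used only for illustration.

```python
def resolve_result(template: str, value: str, project_root: str) -> str:
    """Expand the `{project-root}` and `{value}` placeholders in a result template."""
    # Assumption: plain string substitution; the real installer may behave differently.
    return template.replace("{project-root}", project_root).replace("{value}", value)


# Template and default value come from module.yaml; the project root is made up.
resolved = resolve_result(
    template="{project-root}/{value}",
    value="docs/production-knowledge",
    project_root="/work/my-app",
)
print(resolved)
```

Under that assumption, accepting the default at the prompt would yield a path inside the project, while a user-supplied folder would replace only the `{value}` part.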

coderabbitai · 2026-03-18T18:53:07Z

src/psm/workflows/bmad-psm-incident-response/incident-postmortem.template.md

+- **System Recovery**: {{SERVICE/METRIC}} took {{TIME}} to recover
+- **Cascading Effects**: {{SERVICE_X}} also affected due to {{reason}}
+- **On-call Load**: {{NUMBER}} pages, {{NUMBER}} escalations
+- **Data Loss**: {{None | {{Description}}}}


⚠️ Potential issue | 🟡 Minor

Fix malformed placeholder syntax in “Data Loss”.

The nested token {{None | {{Description}}}} is ambiguous. Use a single placeholder token or a plain-choice placeholder format.

Proposed fix

-- **Data Loss**: {{None | {{Description}}}}
+- **Data Loss**: {{NONE_OR_DESCRIPTION}}

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@src/psm/workflows/bmad-psm-incident-response/incident-postmortem.template.md` at line 98, The "Data Loss" line contains a malformed nested placeholder `{{None | {{Description}}}}`; remove the inner braces and use a single placeholder or a plain choice token instead (e.g. replace with `{{Description}}` if you want a single variable, or `{{None | Description}}` if you intend a choice between None and Description). Update the template line that currently reads "**Data Loss**: {{None | {{Description}}}}" to one of those corrected forms so the placeholder syntax is valid.
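
Nested placeholders like the one flagged above are easy to miss by eye, so a small lint pass over the templates could catch them. The following is a minimal sketch, not part of the PR or of any BMAD tooling; `find_nested_placeholders` is a hypothetical helper, and it only assumes that placeholders are written as `{{...}}`.

```python
import re

# Matches a `{{` that is followed by another `{{` before any closing `}}`.
NESTED_OPEN = re.compile(r"\{\{(?:(?!\}\}).)*?\{\{")


def find_nested_placeholders(text: str) -> list:
    """Return (line_number, line) pairs whose placeholders contain a nested `{{`."""
    hits = []
    for number, line in enumerate(text.splitlines(), start=1):
        if NESTED_OPEN.search(line):
            hits.append((number, line))
    return hits


# Two lines copied from incident-postmortem.template.md as quoted in this review.
sample = "\n".join([
    "- **On-call Load**: {{NUMBER}} pages, {{NUMBER}} escalations",
    "- **Data Loss**: {{None | {{Description}}}}",
])
for number, line in find_nested_placeholders(sample):
    print(f"line {number}: nested placeholder in {line!r}")
```

Only the "Data Loss" line is reported; lines with several well-formed placeholders side by side are left alone.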

coderabbitai · 2026-03-18T18:53:07Z

src/psm/workflows/bmad-psm-incident-response/workflow.md

+---
+workflow_id: W-INCIDENT-001
+workflow_name: Production Incident Response
+version: 6.2.0
+lead_agent: "SRE Minh"
+supporting_agents: ["Architect Khang", "Mary Analyst"]
+phase: "3-Run: Emergency Response & Recovery"
+created_date: 2026-03-17
+last_modified: 2026-03-17
+config_file: "_config/config.yaml"
+estimated_duration: "15 minutes to 2 hours (depending on severity)"
+outputFile: '{output_folder}/psm-artifacts/incident-{{project_name}}-{{date}}.md'
+---
+
+# Production Incident Response Workflow — BMAD Pattern
+
+## Metadata & Context
+
+**Goal**: Triage, diagnose, resolve production incidents through systematic diagnosis and apply fixes with verification. This is the most critical workflow - minimize MTTR (Mean Time To Recovery) while maintaining system stability.
+
+**Lead Team**:
+- SRE Minh (Incident Command, Recovery Orchestration)
+- Architect Khang (Root Cause Analysis, System-wide Impact)
+- Mary Analyst (Impact Assessment, Post-Incident Review)
+
+**Success Criteria**:
+- ✓ Incident severity classified within 5 minutes
+- ✓ Root cause identified within first triage pass
+- ✓ Fix applied and verified
+- ✓ System metrics returned to baseline
+- ✓ Incident postmortem documented with action items
+- ✓ Prevention measures identified
+
+## Workflow Overview
+
+This workflow goes through 4 atomic steps, each focusing on a different phase:
+
+1. **Step-01-Triage** → Gather initial info, assess severity, classify impact
+2. **Step-02-Diagnose** → Systematic diagnosis using observability data (logs, metrics, traces)
+3. **Step-03-Fix** → Apply fix, verify resolution, validate recovery
+4. **Step-04-Postmortem** → Document incident, identify action items, prevent recurrence
+
+## Configuration Loading
+
+Automatically loaded from `_config/config.yaml`:
+
+```yaml
+project_context:
+  organization: "[loaded from config]"
+  environment: "production"
+  incident_channel: "slack:#incidents"
+
+workflow_defaults:
+  communication_language: "Vietnamese-English"
+  severity_levels: ["SEV1", "SEV2", "SEV3", "SEV4"]
+  escalation_contacts: "[loaded from config]"
+  on_call_engineer: "[loaded from config]"
+```
+
+## Workflow Architecture - Micro-File Design
+
+BMAD pattern: each step is a separate file, loaded just-in-time. Workflow chain:
+
+```
+workflow.md (entry point)
+    ↓
+step-01-triage.md (classify severity, initial assessment)
+    ↓
+step-02-diagnose.md (root cause analysis)
+    ↓
+step-03-fix.md (apply fix, verify)
+    ↓
+step-04-postmortem.md (document, prevent)
+    ↓
+incident-response-summary.md (final output)
+```
+
+**Key Benefits**:
+- Single-step focus — engineer concentrates on one phase
+- Knowledge isolation — load only relevant SKILL docs per step
+- State tracking — save progress after each step
+- Easy resumption — if interrupted, restart from exact step
+
+## Skill References
+
+This workflow loads knowledge from:
+
+- **5.07 Reliability & Resilience** → Circuit breaker patterns, fallback strategies, timeout management
+- **5.08 Observability & Monitoring** → Structured logging, metrics queries, distributed tracing
+- **5.09 Error Handling & Recovery** → Error classification, graceful degradation patterns
+- **5.10 Production Readiness** → Incident prevention checklist, alerting setup
+- **5.14 Documentation & Runbooks** → Postmortem templates, incident reports
+
+## Execution Model
+
+### Entry Point Logic
+
+```
+1. Check if incident session exists
+   → If NEW incident: Start from step-01-triage.md
+   → If ONGOING: Load incident-session.yaml → continue from last completed step
+   → If RESOLVED: Load postmortem template
+
+2. For each step:
+   a) Load step-{N}-{name}.md
+   b) Load referenced SKILL files (auto-parse "Load:" directives)
+   c) Execute MENU [A][C] options
+   d) Save step output to step-{N}-output.md + incident-context.yaml
+   e) Move to next step or conclude
+
+3. Final: Generate incident report + postmortem in outputs folder
+```
+
+### State Tracking
+
+Incident session frontmatter tracks progress:
+
+```yaml
+incident_context:
+  incident_id: "INC-2026-03-17-001"
+  severity: "SEV1" | "SEV2" | "SEV3" | "SEV4"
+  status: "triage" → "diagnosing" → "recovering" → "resolved" → "postmortem"
+  affected_services: ["service-1", "service-2"]
+  started_at: "2026-03-17T14:30:00Z"
+  timeline:
+    detected_at: "2026-03-17T14:30:00Z"
+    triage_completed_at: "2026-03-17T14:35:00Z"
+    root_cause_identified_at: "2026-03-17T14:50:00Z"
+    fix_applied_at: "2026-03-17T15:10:00Z"
+    resolved_at: "2026-03-17T15:15:00Z"
+  current_step: "step-02-diagnose"
+  last_updated: "2026-03-17T14:50:00Z"
+  incident_commander: "SRE Minh"
+```
+
+## Mandatory Workflow Rules
+
+1. **Speed first** — Triage must complete in < 5 minutes
+2. **Root cause identification** — Must identify root cause before fix attempt
+3. **Verify before declaring resolved** — Check metrics + user reports
+4. **Document everything** — Every action logged for postmortem
+5. **Escalation protocol** — SEV1 → Page on-call architect immediately
+6. **Communication** — Update stakeholders every 5-10 minutes
+7. **No flying blind** — All fixes must reference observability data
+
+## Severity Scale
+
+- **SEV1** — Service completely down, revenue impact, > 1% users affected → Page all on-call
+- **SEV2** — Major degradation, significant users affected, partial functionality down
+- **SEV3** — Moderate impact, some users affected, workaround possible
+- **SEV4** — Minor issue, limited users, can defer to business hours
+
+## Navigation
+
+Choose how to start:
+
+- **[NEW-INC]** — Report new incident → Load step-01-triage
+- **[RESUME-INC]** — Continue existing incident (detect progress from incident-session.yaml)
+- **[ESCALATE]** — Escalate to on-call architect
+
+---
+
+**Report the incident status or choose [NEW-INC] to begin triage**
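
The `status` progression listed under State Tracking in the file above (triage, diagnosing, recovering, resolved, postmortem) reads like a linear state machine. As a rough sketch of how a tool might enforce it, assuming transitions are strictly one step forward (the workflow file does not say whether skipping or reverting is allowed), with `next_status` and `can_transition` as hypothetical helper names:

```python
from typing import Optional

# Order taken from the `status` line of `incident_context` in workflow.md.
STATUS_ORDER = ["triage", "diagnosing", "recovering", "resolved", "postmortem"]


def next_status(current: str) -> Optional[str]:
    """Return the status that follows `current`, or None at the end of the chain."""
    index = STATUS_ORDER.index(current)  # raises ValueError for unknown statuses
    return STATUS_ORDER[index + 1] if index + 1 < len(STATUS_ORDER) else None


def can_transition(current: str, target: str) -> bool:
    """Assumption: only a single forward step is valid; anything else is rejected."""
    return next_status(current) == target


print(next_status("triage"))
print(can_transition("triage", "resolved"))
```

A stricter or looser policy (for example, allowing a move from "recovering" back to "diagnosing" when a fix fails) would only change `can_transition`.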


⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Description: Check if incident-response documentation exists under docs/

# Search for incident-response or related documentation
fd -e md . docs/ -x grep -l "incident.response\|incident-response\|Incident Response" {} \;

# Check for severity scale documentation
fd -e md . docs/ -x grep -l "SEV1\|severity scale\|incident severity" {} \;

Repository: bmad-code-org/BMAD-METHOD

Length of output: 51

🏁 Script executed:

#!/bin/bash
# Check if docs/ directory exists and examine its structure
echo "=== Checking docs/ directory structure ==="
if [ -d "docs/" ]; then
  echo "docs/ exists. Top-level contents:"
  ls -la docs/ | head -20
  echo ""
  echo "=== All .md files in docs/ ==="
  find docs/ -name "*.md" -type f | head -30
else
  echo "docs/ directory does not exist"
fi
echo ""
echo "=== Source workflow file location ==="
ls -la src/psm/workflows/bmad-psm-incident-response/ 2>/dev/null || echo "Directory not found"

Repository: bmad-code-org/BMAD-METHOD

Length of output: 2204

🏁 Script executed:

#!/bin/bash
# Check workflow-map documentation and PSM references
echo "=== Contents of docs/reference/workflow-map.md ==="
cat docs/reference/workflow-map.md
echo ""
echo "=== Check for any PSM references in docs/ ==="
grep -r "PSM\|psm\|incident.*response" docs/ 2>/dev/null | head -20
echo ""
echo "=== Git info about the workflow file ==="
git log --oneline -n 5 -- src/psm/workflows/bmad-psm-incident-response/workflow.md 2>/dev/null || echo "No git history or file not tracked"

Repository: bmad-code-org/BMAD-METHOD

Length of output: 6760

🏁 Script executed:

#!/bin/bash
# Check for skill-validator rules and PSM module documentation
echo "=== Looking for skill-validator or related validation rules ==="
find tools/ -name "*skill*validator*" -o -name "*skill*" 2>/dev/null | head -20
echo ""
echo "=== Check PSM module structure ==="
find src/psm/ -type f -name "*.md" | head -20
echo ""
echo "=== Check if PSM is referenced in any docs ==="
grep -r "psm\|PSM" src/ --include="*.md" | grep -i "module\|description\|purpose" | head -10
echo ""
echo "=== Check workflow metadata for creation/modification info ==="
head -30 src/psm/workflows/bmad-psm-incident-response/workflow.md

Repository: bmad-code-org/BMAD-METHOD

Length of output: 2501

🏁 Script executed:

#!/bin/bash
# Read skill-validator.md to understand rules for legacy workflows
echo "=== Contents of tools/skill-validator.md ==="
cat tools/skill-validator.md

Repository: bmad-code-org/BMAD-METHOD

Length of output: 21986

Documentation for incident-response workflow is missing and required per coding guidelines.

This PR introduces a new Production Incident Response workflow for the PSM module. Per the coding guideline "Source file changed. Check whether documentation under docs/ needs a corresponding update", documentation must be added to document:

Incident Response workflow overview and the 4-step process (triage, diagnose, fix, postmortem)

Severity scale (SEV1-SEV4) definitions and escalation procedures

State tracking and incident session management

Integration with observability and monitoring tools

Postmortem template usage

Additionally, docs/reference/workflow-map.md should be updated to include the PSM module workflows.

Minor issue: Remove the estimated_duration field from the workflow.md frontmatter — per skill-validator rule SEQ-02, workflow files must not include time estimates.
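The frontmatter check described above can be sketched as a small shell helper. This is a minimal sketch, not the project's validator: the function names `frontmatter` and `has_time_estimate` are hypothetical, and it assumes the frontmatter is the text between the first two `---` lines of the file.

```shell
#!/bin/bash
# Hypothetical helper: print the frontmatter of a markdown file,
# i.e. the lines between the first two '---' delimiters.
frontmatter() {
  awk 'NR == 1 && $0 != "---" { exit }
       $0 == "---" { n++; if (n == 2) exit; next }
       n == 1 { print }' "$1"
}

# Hypothetical helper: exit 0 when the frontmatter declares estimated_duration.
# Occurrences in the document body are ignored.
has_time_estimate() {
  frontmatter "$1" | grep -q '^estimated_duration:'
}
```

Running `has_time_estimate src/psm/workflows/bmad-psm-incident-response/workflow.md && echo "remove estimated_duration"` would flag the file discussed in this comment, assuming the field is still present.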

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@src/psm/workflows/bmad-psm-incident-response/workflow.md` around lines 1 - 163, Remove the estimated_duration field from the workflow frontmatter (frontmatter symbol: estimated_duration in workflow.md for W-INCIDENT-001 "Production Incident Response"); add a new documentation page under docs/ describing the Incident Response workflow overview and the 4-step process (triage, diagnose, fix, postmortem), severity scale definitions (SEV1–SEV4) and escalation procedures, state tracking and incident-session management (referencing incident_context keys such as incident_id, status, current_step, timeline), integration with observability/monitoring tools, and postmortem template usage; finally, update docs/reference/workflow-map.md to include this PSM module workflow (W-INCIDENT-001 / "Production Incident Response") and link to the new docs page.

coderabbitai · 2026-03-18T18:53:07Z

src/psm/workflows/bmad-psm-incident-response/workflow.md

+This workflow goes through 4 atomic steps, each focusing on a different phase:
+
+1. **Step-01-Triage** → Gather initial info, assess severity, classify impact
+2. **Step-02-Diagnose** → Systematic diagnosis using observability data (logs, metrics, traces)
+3. **Step-03-Fix** → Apply fix, verify resolution, validate recovery
+4. **Step-04-Postmortem** → Document incident, identify action items, prevent recurrence


⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Description: Verify that all referenced step files exist for incident-response workflow
workflow_dir="src/psm/workflows/bmad-psm-incident-response"
echo "Checking for step files in ${workflow_dir}..."

# Check for steps directory
if [ -d "${workflow_dir}/steps" ]; then
  echo "✓ steps/ directory exists"
  ls -la "${workflow_dir}/steps/"
else
  echo "✗ steps/ directory not found"
fi

# Check for specific step files
for step in step-01-triage step-02-diagnose step-03-fix step-04-postmortem; do
  if [ -f "${workflow_dir}/steps/${step}.md" ]; then
    echo "✓ ${step}.md exists"
  else
    echo "✗ ${step}.md NOT FOUND"
  fi
done

# Check for SKILL.md and manifest
if [ -f "${workflow_dir}/SKILL.md" ]; then
  echo "✓ SKILL.md exists"
else
  echo "✗ SKILL.md NOT FOUND"
fi
if [ -f "${workflow_dir}/bmad-skill-manifest.yaml" ]; then
  echo "✓ bmad-skill-manifest.yaml exists"
else
  echo "✗ bmad-skill-manifest.yaml NOT FOUND"
fi

# Check for incident-postmortem template
if [ -f "${workflow_dir}/incident-postmortem.template.md" ]; then
  echo "✓ incident-postmortem.template.md exists"
else
  echo "✗ incident-postmortem.template.md NOT FOUND"
fi

Repository: bmad-code-org/BMAD-METHOD

Length of output: 388

🏁 Script executed:

#!/bin/bash # Get the full directory structure of the skill workflow_dir="src/psm/workflows/bmad-psm-incident-response" echo "=== Full directory structure of ${workflow_dir} ===" find "${workflow_dir}" -type f -name "*.md" -o -name "*.yaml" | sort echo "" echo "=== Directory tree ===" tree "${workflow_dir}" 2>/dev/null || find "${workflow_dir}" -type f | sed 's|[^/]*/| |g' | sort echo "" echo "=== Check SKILL.md content ===" if [ -f "${workflow_dir}/SKILL.md" ]; then head -30 "${workflow_dir}/SKILL.md" fi echo "" echo "=== Check manifest content ===" if [ -f "${workflow_dir}/bmad-skill-manifest.yaml" ]; then cat "${workflow_dir}/bmad-skill-manifest.yaml" fi

Repository: bmad-code-org/BMAD-METHOD

Length of output: 947

🏁 Script executed:

#!/bin/bash # Read the full workflow.md to understand step references cat -n "src/psm/workflows/bmad-psm-incident-response/workflow.md"

Repository: bmad-code-org/BMAD-METHOD

Length of output: 6882

🏁 Script executed:

#!/bin/bash # Check if incident-response-summary.md is also missing workflow_dir="src/psm/workflows/bmad-psm-incident-response" echo "=== Checking for incident-response-summary.md ===" if [ -f "${workflow_dir}/incident-response-summary.md" ]; then echo "✓ incident-response-summary.md exists" else echo "✗ incident-response-summary.md NOT FOUND" fi echo "" echo "=== Checking PR/git status for context ===" # Show recent git log for this directory (if available) git log --oneline -5 -- "${workflow_dir}" 2>/dev/null || echo "No git history available" echo "" echo "=== All referenced but missing files ===" echo "From workflow.md architecture (lines 64-76):" echo " ✗ step-01-triage.md" echo " ✗ step-02-diagnose.md" echo " ✗ step-03-fix.md" echo " ✗ step-04-postmortem.md" echo " ✗ incident-response-summary.md"

Repository: bmad-code-org/BMAD-METHOD

Length of output: 453

Add missing step files required by the workflow.

The workflow references five files (four step files and one summary file) that do not exist in the skill directory:

step-01-triage.md (referenced in lines 38, 67, 100)

step-02-diagnose.md (referenced in lines 39, 69, 131)

step-03-fix.md (referenced in lines 40, 71)

step-04-postmortem.md (referenced in lines 41, 73)

incident-response-summary.md (referenced in line 75)

The workflow architecture (lines 64–76) and execution model (lines 99–111) explicitly require these step files to be present. Without them, the skill cannot function. Include these files in the PR or remove the references if they are not applicable.
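A check like the one run above can be derived from the workflow file itself instead of hard-coding the step names. The following is a minimal sketch under stated assumptions: `referenced_files` and `missing_files` are hypothetical names, the file names are assumed to appear in lowercase in workflow.md (as the line references above suggest), and the files may live either next to workflow.md or in a `steps/` subdirectory.

```shell
#!/bin/bash
# Hypothetical helper: list the distinct step-NN-*.md names and the
# summary file name mentioned anywhere in a workflow file.
referenced_files() {
  grep -oE '(step-[0-9]+-[a-z0-9-]+|incident-response-summary)\.md' "$1" | sort -u
}

# Hypothetical helper: print every referenced file that is missing.
# Looks in <dir>/ and <dir>/steps/ (the layout is an assumption).
missing_files() {
  local workflow="$1" dir name
  dir=$(dirname "$workflow")
  referenced_files "$workflow" | while read -r name; do
    if [ ! -f "$dir/$name" ] && [ ! -f "$dir/steps/$name" ]; then
      echo "$name"
    fi
  done
}
```

Calling `missing_files src/psm/workflows/bmad-psm-incident-response/workflow.md` prints nothing once every referenced file has been added or every stale reference has been removed.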

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@src/psm/workflows/bmad-psm-incident-response/workflow.md` around lines 36 - 41, The workflow references five missing step files (step-01-triage.md, step-02-diagnose.md, step-03-fix.md, step-04-postmortem.md, incident-response-summary.md) used by the workflow defined in workflow.md (the steps/architecture and execution model sections); add these files into the skill directory with the expected content for each step (triage, diagnose, fix, postmortem, and final summary) so the references in workflow.md resolve, or if any step is not applicable remove its reference from workflow.md to keep the step list and execution model consistent; ensure the filenames exactly match step-01-triage.md, step-02-diagnose.md, step-03-fix.md, step-04-postmortem.md and incident-response-summary.md so Step-01-Triage, Step-02-Diagnose, Step-03-Fix, Step-04-Postmortem and the incident response summary linkage work correctly.

coderabbitai · 2026-03-18T18:53:07Z

src/psm/workflows/bmad-psm-mlops-deployment/SKILL.md

+---
+name: bmad-psm-mlops-deployment
+description: 'Deploy ML model to production with validation and monitoring. Use when the user says "deploy model" or "ML deployment" or "model serving"'
+---
+
+Follow the instructions in [workflow.md](workflow.md).


⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash set -euo pipefail # Verify whether docs mention the new PSM module/workflows/triggers. if [ -d docs ]; then rg -n --ignore-case \ 'psm|production systems|mlops|incident response|production readiness|security audit|setup new service|quick diagnose|bmad-psm-' \ docs else echo "docs/ directory not found" fi

Repository: bmad-code-org/BMAD-METHOD

Length of output: 51

🏁 Script executed:

# Check if docs/ directory exists and list its structure if [ -d docs ]; then echo "=== docs/ directory exists ===" find docs -type f -name "*.md" | head -20 echo "" echo "=== docs/ top-level structure ===" ls -la docs/ else echo "docs/ directory does not exist" fi

Repository: bmad-code-org/BMAD-METHOD

Length of output: 1479

🏁 Script executed:

# Check the PSM workflow directory structure if [ -d src/psm ]; then echo "=== src/psm/ directory exists ===" find src/psm -type f | sort else echo "src/psm/ directory does not exist" fi

Repository: bmad-code-org/BMAD-METHOD

Length of output: 1705

🏁 Script executed:

# Check if there are any workflow indexes or catalog files that list workflows find . -type f -name "*.md" -o -name "*.yaml" -o -name "*.yml" | xargs grep -l "workflow" | head -10

Repository: bmad-code-org/BMAD-METHOD

Length of output: 348

🏁 Script executed:

# Look for any README or documentation in src/psm area find src/psm -type f -name "README*" -o -name "INDEX*" -o -name "*index*"

Repository: bmad-code-org/BMAD-METHOD

Length of output: 51

🏁 Script executed:

# Check the main docs index cat docs/index.md

Repository: bmad-code-org/BMAD-METHOD

Length of output: 3762

🏁 Script executed:

# Check reference directory which might list workflows ls -la docs/reference/ cat docs/reference/*.md | head -100

Repository: bmad-code-org/BMAD-METHOD

Length of output: 6925

🏁 Script executed:

# Look for any mentions of workflows or modules in docs grep -r "workflow" docs/ --include="*.md" | grep -i "list\|index\|available\|commands"

Repository: bmad-code-org/BMAD-METHOD

Length of output: 5438

🏁 Script executed:

# Check if there are any how-to guides that might reference PSM or explain module discovery cat docs/how-to/get-answers-about-bmad.md

Repository: bmad-code-org/BMAD-METHOD

Length of output: 4587

🏁 Script executed:

# Check modules.md which should list all available modules cat docs/reference/modules.md

Repository: bmad-code-org/BMAD-METHOD

Length of output: 3867

🏁 Script executed:

# Check workflow-map.md which should list all workflows cat docs/reference/workflow-map.md

Repository: bmad-code-org/BMAD-METHOD

Length of output: 6581

Add PSM module and workflows to docs/reference/.

The new PSM module with six workflows (mlops-deployment, incident-response, production-readiness, quick-diagnose, security-audit, setup-new-service) is not documented in the reference section. Update:

docs/reference/modules.md — add PSM entry with module code, npm link, and workflow list

docs/reference/agents.md — add PSM agents and their triggers

docs/reference/workflow-map.md — add PSM workflows or link to external PSM reference

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@src/psm/workflows/bmad-psm-mlops-deployment/SKILL.md` around lines 1 - 6, Add the new PSM module and its six workflows to docs/reference by updating three files: in docs/reference/modules.md add a PSM module entry including the module code, npm package link, and the list of workflows (mlops-deployment, incident-response, production-readiness, quick-diagnose, security-audit, setup-new-service); in docs/reference/agents.md add the PSM agents and their trigger phrases (e.g., "deploy model", "ML deployment", "model serving" for the bmad-psm-mlops-deployment skill, and the corresponding triggers for the others); and in docs/reference/workflow-map.md add entries for each PSM workflow or insert a clear link to an external PSM reference so the workflow map points to the new module.

coderabbitai · 2026-03-18T18:53:07Z

src/psm/workflows/bmad-psm-production-readiness/workflow.md

+---
+workflow_id: PRR001
+workflow_name: Production Readiness Review
+description: Validate service is ready for production using comprehensive readiness checklist
+entry_point: steps/step-01-init-checklist.md
+phase: 3-run
+lead_agent: "Minh (SRE)"
+status: "active"
+created_date: 2026-03-17
+version: "1.0.0"
+estimated_duration: "2-3 hours"
+outputFile: '{output_folder}/psm-artifacts/prr-{{project_name}}-{{date}}.md'
+---
+
+# Workflow: Production Readiness Review (PRR)
+
+## Goal
+Validate and certify that a service meets production readiness standards across 9 key dimensions before deployment.
+
+## Overview
+
+This workflow systematically evaluates a service against production readiness criteria defined in the Production Systems BMAD skill framework. Using SRE expertise and architectural patterns, the workflow:
+
+1. **Initializes** the PRR process with service context and dimensional overview
+2. **Deep reviews** each dimension (reliability, observability, performance, security, capacity, data, runbooks, dependencies, rollback)
+3. **Renders final decision** with GO/NO-GO/CONDITIONAL-GO recommendation
+
+## Execution Path
+
+```
+START
+  ↓
+[Step 01] Init Checklist (Load framework, gather service context, present dimensions)
+  ↓
+[Step 02] Deep Review (Score each dimension, identify blockers, recommendations)
+  ↓
+[Step 03] Final Decision (Scorecard, decision, action items, DONE)
+  ↓
+END
+```
+
+## Key Roles
+
+| Role | Agent | Responsibility |
+|------|-------|-----------------|
+| Lead | Minh (SRE) | Navigate workflow, coordinate review, make final call |
+| Subject Matter | Service Owner | Provide service context, clarify architecture |
+| Review Committee | Arch, SecOps, MLOps | Contribute expertise on specific dimensions |
+
+## Dimensions Evaluated (9)
+
+1. **Reliability** — SLA/SLO definition, error budgets, failure modes, incident response
+2. **Observability** — Logging, metrics, tracing, dashboards, alerting
+3. **Performance** — Latency targets, throughput, P99 tail behavior, optimization opportunities
+4. **Security** — Auth/authz, secrets management, encryption, audit logging, compliance
+5. **Capacity** — Resource limits, scaling policies, burst capacity, cost projections
+6. **Data** — Schema versioning, backup/restore, data governance, retention policies
+7. **Runbooks** — Incident runbooks, operational playbooks, troubleshooting guides
+8. **Dependencies** — External services, internal libraries, database versioning, API contracts
+9. **Rollback** — Rollback strategy, canary deployment, feature flags, smoke tests
+
+## Input Requirements
+
+- **Service name and owner** — Which service are we evaluating?
+- **Current architecture** — High-level design, tech stack, topology
+- **Existing metrics/dashboards** — Links to monitoring, SLO definitions
+- **Known gaps/risks** — Already identified issues to address
+
+## Output Deliverable
+
+- **Production Readiness Checklist** (template: `production-readiness.template.md`)
+  - Scorecard with 9 dimensions (red/yellow/green)
+  - Blockers and recommendations per dimension
+  - Final GO/NO-GO/CONDITIONAL-GO decision
+  - Explicit action items with owners and deadlines
+
+## Success Criteria
+
+1. All 9 dimensions evaluated with clear rationale
+2. Blockers categorized as P0 (must fix) or P1 (should fix)
+3. Team alignment on decision (documented in PRR report)
+4. Action plan with clear accountability and timeline
+
+## Next Steps After Workflow
+
+- If **GO**: Proceed to deployment; document in CHANGELOG
+- If **NO-GO**: Reschedule PRR once blockers addressed; track in backlog
+- If **CONDITIONAL-GO**: Deploy with documented caveats; setup monitoring for risk areas
+
+---
+
+**Navigation**: [← Back to 3-run](../), [Next: Step 01 →](steps/step-01-init-checklist.md)


⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash # Description: Check if PRR workflow execution documentation exists under docs/ # Search for PRR workflow execution documentation fd -e md . docs/ -x grep -l "Production Readiness Review.*workflow\|PRR.*execution\|PRR.*step" {} \; # Check if workflow navigation is documented fd -e md . docs/ -x grep -l "workflow.*navigation\|step.*execution.*path" {} \;

Repository: bmad-code-org/BMAD-METHOD

Length of output: 51

Add documentation for PRR workflow execution and framework under docs/.

The Production Readiness Review workflow introduces new execution logic, role assignments, and a 9-dimension evaluation framework. Corresponding documentation should be added to docs/ explaining:

PRR workflow purpose, execution path, and integration with the PSM skill framework

The 9 dimensions (Reliability, Observability, Performance, Security, Capacity, Data, Runbooks, Dependencies, Rollback) and evaluation criteria

Role responsibilities (Lead/SRE, Subject Matter, Review Committee)

Output format and usage of the Production Readiness Checklist

Decision types (GO/NO-GO/CONDITIONAL-GO) and post-workflow action paths

Per coding guidelines: source file changes require corresponding documentation updates in docs/ for new features and frameworks.

🧰 Tools

🪛 LanguageTool

[style] ~25-~25: Consider a different adjective to strengthen your wording.
Context: ...e context and dimensional overview 2. Deep reviews each dimension (reliability, ...

(DEEP_PROFOUND)

[style] ~47-~47: This phrase is redundant. Consider using “Subject” to avoid wordiness.
Context: ... coordinate review, make final call | | Subject Matter | Service Owner | Provide service conte...

(SUBJECT_MATTER)

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@src/psm/workflows/bmad-psm-production-readiness/workflow.md` around lines 1 - 92, Create a new docs page describing the PRR workflow introduced by workflow_id PRR001 (workflow_name: "Production Readiness Review"); include the purpose, execution path (START → Step 01 Init Checklist → Step 02 Deep Review → Step 03 Final Decision), integration with the PSM skill framework, role responsibilities (Lead: Minh (SRE), Subject Matter: Service Owner, Review Committee), the 9 evaluation dimensions (Reliability, Observability, Performance, Security, Capacity, Data, Runbooks, Dependencies, Rollback) with evaluation criteria, the output format/usage of the Production Readiness Checklist (template: production-readiness.template.md and outputFile pattern), and decision types (GO/NO-GO/CONDITIONAL-GO) plus post-workflow actions; place the doc under docs/ and link to the entry_point steps/step-01-init-checklist.md so reviewers can navigate from the workflow to the documentation.

coderabbitai · 2026-03-18T18:53:07Z

src/psm/workflows/bmad-psm-quick-diagnose/workflow.md

+workflow_id: QD001
+workflow_name: Quick Diagnose
+description: Fast diagnosis of production issue with root cause and fix suggestion
+entry_point: steps/step-01-gather.md


⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash # Verify step files exist for quick-diagnose workflow WORKFLOW_DIR="src/psm/workflows/bmad-psm-quick-diagnose" echo "=== Checking step files ===" if [ -d "$WORKFLOW_DIR/steps" ]; then ls -la "$WORKFLOW_DIR/steps/" else echo "ERROR: steps/ directory does not exist" fi

Repository: bmad-code-org/BMAD-METHOD

Length of output: 135

🏁 Script executed:

cat -n src/psm/workflows/bmad-psm-quick-diagnose/workflow.md | head -100

Repository: bmad-code-org/BMAD-METHOD

Length of output: 2732

🏁 Script executed:

cd src/psm/workflows/bmad-psm-quick-diagnose && find . -type f && ls -la ../

Repository: bmad-code-org/BMAD-METHOD

Length of output: 675

The workflow cannot execute: required step files are missing.

The entry_point at line 5 references steps/step-01-gather.md, but the steps/ directory does not exist. The workflow describes a 2-step execution path (Gather Context → Diagnose & Fix) at lines 30-37, yet neither step file is present. Create the missing step files (steps/step-01-gather.md and steps/step-02-diagnose.md) for the workflow to be functional.
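The entry-point problem can be caught mechanically before review. Below is a minimal sketch, assuming the `entry_point` path is relative to the directory that contains workflow.md; the function names `entry_point` and `entry_point_exists` are hypothetical, and the parser simply takes the first `entry_point:` line it finds instead of parsing YAML properly.

```shell
#!/bin/bash
# Hypothetical helper: print the value of the first entry_point: line
# in a workflow file, with surrounding quotes removed (empty if absent).
entry_point() {
  sed -n 's/^entry_point:[[:space:]]*//p' "$1" | head -n 1 | tr -d "\"'"
}

# Hypothetical helper: exit 0 only when an entry point is declared and the
# file exists relative to the workflow's own directory (an assumption).
entry_point_exists() {
  local ep
  ep=$(entry_point "$1")
  [ -n "$ep" ] && [ -f "$(dirname "$1")/$ep" ]
}
```

For example, `entry_point_exists src/psm/workflows/bmad-psm-quick-diagnose/workflow.md || echo "entry point missing"` would report the gap described in this comment until steps/step-01-gather.md is added.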

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@src/psm/workflows/bmad-psm-quick-diagnose/workflow.md` at line 5, The workflow's entry_point references missing step files so the workflow cannot run; create the two step files named steps/step-01-gather.md and steps/step-02-diagnose.md and implement the corresponding step content described by the workflow (Gather Context and Diagnose & Fix) so the entry_point (entry_point: steps/step-01-gather.md) and the 2-step execution path in the workflow.md are satisfied; ensure the new files exist under a steps/ directory and include the appropriate metadata and instructions matching the workflow's described steps so the workflow can execute end-to-end.

coderabbitai · 2026-03-18T18:53:07Z

src/psm/workflows/bmad-psm-setup-new-service/workflow.md

+---
+workflow_id: W-SETUP-SVC-001
+workflow_name: Setup Production Service for BMAD
+version: 6.2.0
+lead_agent: "Architect Khang"
+supporting_agents: ["SRE Minh", "Mary Analyst"]
+phase: "1-Analysis → 2-Planning → 3-Solutioning → 4-Implementation"
+created_date: 2026-03-17
+last_modified: 2026-03-17
+config_file: "_config/config.yaml"
+estimated_duration: "12-20 hours"
+outputFile: '{output_folder}/psm-artifacts/service-setup-{{project_name}}-{{date}}.md'
+---
+
+# Setup Production Service Workflow — BMAD Pattern
+
+## Metadata & Context
+
+**Goal**: Build a production-grade service from scratch, with complete architecture, API design, deployment pipeline, reliability patterns, security, and production readiness.
+
+**Lead Team**:
+- SRE Minh (Reliability, Infrastructure, Operations)
+- Architect Khang (System Design, Technology Selection)
+- Mary Analyst (Requirements, Risk Assessment)
+
+**Success Criteria**:
+- ✓ Architecture design document approved
+- ✓ API contracts defined & validated
+- ✓ Database schema designed & indexed
+- ✓ CI/CD pipeline operational
+- ✓ Resilience & observability in place
+- ✓ Security & compliance verified
+- ✓ Production readiness checklist passed
+
+## Workflow Overview
+
+This workflow goes through 6 atomic steps, each focusing on a separate domain:
+
+1. **Step-01-Architecture** → Requirements + Architecture Pattern Selection
+2. **Step-02-API-Database** → API Design + Database Selection + Schema
+3. **Step-03-Build-Deploy** → CI/CD + Containerization + Testing Strategy
+4. **Step-04-Reliability** → Resilience Patterns + Observability + Error Handling
+5. **Step-05-Security-Infra** → Auth/Authz + Secrets + K8s Config
+6. **Step-06-Readiness** → PRR Checklist + Runbook + Go/No-Go Decision
+
+## Configuration Loading
+
+Automatically loaded from `_config/config.yaml`:
+
+```yaml
+project_context:
+  user_name: "[loaded from config]"
+  organization: "[loaded from config]"
+  environment: "production"
+
+workflow_defaults:
+  communication_language: "Vietnamese"
+  output_folder: "./outputs/setup-new-service-{service_name}"
+  timestamp: "2026-03-17"
+```
+
+## Execution Model
+
+### Entry Point Logic
+
+```
+1. Check if workflow.md exists in outputs folder
+   → If NEW: Start from step-01-architecture.md
+   → If RESUME: Load progress.yaml → auto-skip completed steps
+   → If PARTIAL: Load step-N-context.yaml → resume from step N
+
+2. For each step:
+   a) Load step-{N}-{name}.md
+   b) Load referenced SKILL files (auto-parse "Load:" directives)
+   c) Execute MENU [A][C] options
+   d) Save step output to step-{N}-output.md
+   e) Move to next step
+
+3. Final: Generate comprehensive outputs in outputs folder
+```
+
+### State Tracking
+
+Output document frontmatter tracks progress:
+
+```yaml
+workflow_progress:
+  step_01_architecture: "completed"
+  step_02_api_database: "completed"
+  step_03_build_deploy: "in_progress"
+  step_04_reliability: "pending"
+  step_05_security_infra: "pending"
+  step_06_readiness: "pending"
+  last_updated: "2026-03-17T14:30:00Z"
+  current_agent: "Architect Khang"
+```
+
+## Mandatory Workflow Rules
+
+1. **No skipping steps** — Each step must be executed in order
+2. **Validate assumptions** — Every decision must be documented
+3. **Cross-phase collaboration** — Architects + SRE + Analysts work together
+4. **Output artifacts** — Each step produces tangible output documents
+5. **Handoff protocol** — Context is transferred clearly between steps
+
+## Navigation
+
+Choose how to start:
+
+- **[NEW]** — Start a new workflow → Load step-01
+- **[RESUME]** — Return to a previously run workflow (detect progress)
+- **[SKIP-TO]** — Jump to a specific step (dev-only, requires confirmation)
+
+---
+
+**Continue by choosing [NEW] or [RESUME]**


⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash # Description: Check if setup-new-service documentation exists under docs/ # Search for setup-new-service or related documentation fd -e md . docs/ -x grep -l "setup.new.service\|setup-new-service\|Setup Production Service" {} \; # Check if PSM workflows are documented fd -e md . docs/ -x grep -l "PSM workflow\|Production Systems.*workflow" {} \;

Repository: bmad-code-org/BMAD-METHOD

Length of output: 51

🏁 Script executed:

# Check repository structure git ls-files | head -20 fd -type d -name docs -o -name documentation | head -10

Repository: bmad-code-org/BMAD-METHOD

Length of output: 842

🏁 Script executed:

# Check if docs/ exists and its contents ls -la docs/ 2>/dev/null | head -20 || echo "docs/ not found at root" find . -maxdepth 2 -type d -name docs -o -name doc 2>/dev/null

Repository: bmad-code-org/BMAD-METHOD

Length of output: 752

🏁 Script executed:

# Check if this workflow is a pre-existing file being converted or newly added git log --oneline --all -- "src/psm/workflows/bmad-psm-setup-new-service/workflow.md" 2>/dev/null | head -5 || echo "File history not available"

Repository: bmad-code-org/BMAD-METHOD

Length of output: 107

🏁 Script executed:

# Check for other workflow files and their documentation patterns fd -path "*/workflows/*" -name "workflow.md" | head -10

Repository: bmad-code-org/BMAD-METHOD

Length of output: 237

🏁 Script executed:

# Search for any documentation of PSM or setup workflows in the repo rg -l "psm|Production.*Service|setup.*service" --type md | grep -E "(docs/|README)" | head -10

Repository: bmad-code-org/BMAD-METHOD

Length of output: 51

🏁 Script executed:

# Check docs subdirectories for workflow-related content ls -la docs/explanation/ docs/how-to/ docs/reference/ 2>/dev/null | head -30

Repository: bmad-code-org/BMAD-METHOD

Length of output: 1622

🏁 Script executed:

# Search for any workflow documentation pattern rg -i "workflow" docs/ --type md 2>/dev/null | head -10

Repository: bmad-code-org/BMAD-METHOD

Length of output: 1359

🏁 Script executed:

# Check if this is a new file or pre-existing git show HEAD:src/psm/workflows/bmad-psm-setup-new-service/workflow.md 2>/dev/null | head -5 || echo "File does not exist in HEAD"

Repository: bmad-code-org/BMAD-METHOD

Length of output: 195

🏁 Script executed:

# Check for other PSM-related files and documentation fd "psm" src/ --type f -name "*.md" | head -10

Repository: bmad-code-org/BMAD-METHOD

Length of output: 299

🏁 Script executed:

# Check if there's a skill-validator.md referenced in guidelines cat tools/skill-validator.md 2>/dev/null | head -50 || echo "skill-validator.md not found"

Repository: bmad-code-org/BMAD-METHOD

Length of output: 2176

🏁 Script executed:

# Check if workflow-map.md exists ls -la docs/reference/workflow-map.md 2>/dev/null || echo "workflow-map.md not found"

Repository: bmad-code-org/BMAD-METHOD

Length of output: 148

🏁 Script executed:

# Check docs/reference directory contents ls -la docs/reference/

Repository: bmad-code-org/BMAD-METHOD

Length of output: 556

🏁 Script executed:

# Search for any pre-existing XML workflow versions fd "\.xml" src/psm/

Repository: bmad-code-org/BMAD-METHOD

Length of output: 51

🏁 Script executed:

# Check git log with different approach to see file creation/modification git log --all --pretty=format:"%h %s" -- "src/psm/workflows/bmad-psm-setup-new-service/" 2>/dev/null | head -20 || echo "No git history"

Repository: bmad-code-org/BMAD-METHOD

Length of output: 107

🏁 Script executed:

# Check if this is part of a broader conversion by looking at git diff or recent changes git diff --name-status HEAD~1 HEAD 2>/dev/null | grep psm || echo "No recent PSM changes"

Repository: bmad-code-org/BMAD-METHOD

Length of output: 90

🏁 Script executed:

# Check what's in docs/reference for workflow references cat docs/reference/workflow-map.md 2>/dev/null | head -50

Repository: bmad-code-org/BMAD-METHOD

Length of output: 3725

🏁 Script executed:

# Check if PSM workflows are documented in workflow-map.md grep -i "psm\|setup.*service" docs/reference/workflow-map.md

Repository: bmad-code-org/BMAD-METHOD

Length of output: 51

🏁 Script executed:

# Check the full workflow-map to see all documented workflows cat docs/reference/workflow-map.md

Repository: bmad-code-org/BMAD-METHOD

Length of output: 6581

🏁 Script executed:

# Check if this is a new workflow or if it's pre-existing git log --all --oneline -- "src/psm/workflows/bmad-psm-setup-new-service/" 2>/dev/null | wc -l

Repository: bmad-code-org/BMAD-METHOD

Length of output: 70

🏁 Script executed:

# Check what PSM workflows exist in src/psm/workflows/ fd "workflow.md" src/psm/workflows/

Repository: bmad-code-org/BMAD-METHOD

Length of output: 406

🏁 Script executed:

# Check if other PSM workflows are also undocumented (to see if this is systemic)
for file in src/psm/workflows/*/workflow.md; do
  workflow_name=$(basename $(dirname "$file"))
  grep -i "$workflow_name\|psm" docs/reference/workflow-map.md 2>/dev/null && echo "FOUND: $workflow_name" || echo "MISSING: $workflow_name"
done

Repository: bmad-code-org/BMAD-METHOD

Length of output: 280

🏁 Script executed:

# Check git history for workflow-map.md to see if PSM was ever documented git log --oneline -- docs/reference/workflow-map.md 2>/dev/null | head -10

Repository: bmad-code-org/BMAD-METHOD

Length of output: 107

🏁 Script executed:

# Check if there's a PSM section planned or mentioned in docs/ grep -r "PSM\|Production Systems" docs/ 2>/dev/null || echo "PSM not mentioned in docs"

Repository: bmad-code-org/BMAD-METHOD

Length of output: 94

🏁 Script executed:

# Check if other modules (like BMM) have their own documentation ls -la src/bmm/workflows/ 2>/dev/null | head -10

Repository: bmad-code-org/BMAD-METHOD

Length of output: 51

🏁 Script executed:

# See when the PSM module was added to the repo git log --all --name-status --pretty=format:"%h %s" -- "src/psm/" 2>/dev/null | head -20 || echo "No PSM history"

Repository: bmad-code-org/BMAD-METHOD

Length of output: 1087

Update documentation in docs/reference/workflow-map.md to include the PSM (Production Systems & MLOps) module and its six workflows.

The coding guidelines require that new features be reflected in corresponding documentation. The PSM module was newly added but completely lacks documentation in docs/reference/workflow-map.md. All six PSM workflows—including bmad-psm-setup-new-service—are missing from the workflow reference.

Add a new section to the workflow map (likely Phase 4: Implementation or a dedicated Production Systems section) with entries for:

bmad-psm-setup-new-service (Setup Production Service for BMAD)

bmad-psm-production-readiness

bmad-psm-incident-response

bmad-psm-mlops-deployment

bmad-psm-security-audit

bmad-psm-quick-diagnose

Include their purposes and primary outputs to match the documentation pattern used for BMM workflows.
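Coverage of the workflow map can be verified with a short script once the section is added. This is a sketch only: `undocumented_workflows` is a hypothetical helper, and it assumes a workflow counts as documented when its directory name appears verbatim anywhere in the map file.

```shell
#!/bin/bash
# Hypothetical helper: print each workflow directory name under $1
# (one containing a workflow.md) that is not mentioned in the map file $2.
undocumented_workflows() {
  local workflows_dir="$1" map_file="$2" file name
  for file in "$workflows_dir"/*/workflow.md; do
    [ -f "$file" ] || continue            # glob matched nothing
    name=$(basename "$(dirname "$file")")
    # -F treats the name as a fixed string, not a regular expression.
    grep -qF -- "$name" "$map_file" 2>/dev/null || echo "$name"
  done
}
```

Invoked as `undocumented_workflows src/psm/workflows docs/reference/workflow-map.md`, it should print nothing after all six entries listed above are present in the map.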

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@src/psm/workflows/bmad-psm-setup-new-service/workflow.md` around lines 1 - 116, Add a new "PSM / Production Systems & MLOps" section to docs/reference/workflow-map.md (place it under Phase 4: Implementation or as its own Production Systems section) and list the six workflows from this diff: bmad-psm-setup-new-service (Setup Production Service for BMAD), bmad-psm-production-readiness, bmad-psm-incident-response, bmad-psm-mlops-deployment, bmad-psm-security-audit, and bmad-psm-quick-diagnose; for each entry include a one-line purpose and primary outputs (e.g., architecture docs, CI/CD pipelines, PRR checklist, runbooks, incident runbooks, MLOps deployment artifacts, security audit reports, quick diagnostic playbooks) following the same format and tone used by the existing BMM workflow entries so the new workflows integrate consistently with the workflow-map structure.

coderabbitai · 2026-03-18T18:53:07Z

src/psm/workflows/bmad-psm-setup-new-service/workflow.md

+This workflow goes through 6 atomic steps, each focusing on a separate domain:
+
+1. **Step-01-Architecture** → Requirements + Architecture Pattern Selection
+2. **Step-02-API-Database** → API Design + Database Selection + Schema
+3. **Step-03-Build-Deploy** → CI/CD + Containerization + Testing Strategy
+4. **Step-04-Reliability** → Resilience Patterns + Observability + Error Handling
+5. **Step-05-Security-Infra** → Auth/Authz + Secrets + K8s Config
+6. **Step-06-Readiness** → PRR Checklist + Runbook + Go/No-Go Decision


⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Description: Verify that all referenced step files exist
workflow_dir="src/psm/workflows/bmad-psm-setup-new-service"
echo "Checking for step files in ${workflow_dir}..."
# Check for steps directory
if [ -d "${workflow_dir}/steps" ]; then
  echo "✓ steps/ directory exists"
  ls -la "${workflow_dir}/steps/"
else
  echo "✗ steps/ directory not found"
fi
# Check for specific step files
for step in step-01-architecture step-02-api-database step-03-build-deploy step-04-reliability step-05-security-infra step-06-readiness; do
  if [ -f "${workflow_dir}/steps/${step}.md" ]; then
    echo "✓ ${step}.md exists"
  else
    echo "✗ ${step}.md NOT FOUND"
  fi
done
# Check for SKILL.md and manifest
if [ -f "${workflow_dir}/SKILL.md" ]; then
  echo "✓ SKILL.md exists"
else
  echo "✗ SKILL.md NOT FOUND"
fi
if [ -f "${workflow_dir}/bmad-skill-manifest.yaml" ]; then
  echo "✓ bmad-skill-manifest.yaml exists"
else
  echo "✗ bmad-skill-manifest.yaml NOT FOUND"
fi

Repository: bmad-code-org/BMAD-METHOD

Length of output: 438

🏁 Script executed:

#!/bin/bash
# Search for any markdown files in the workflow directory
workflow_dir="src/psm/workflows/bmad-psm-setup-new-service"
echo "=== All markdown files in workflow directory ==="
find "${workflow_dir}" -name "*.md" -type f
echo ""
echo "=== Directory structure ==="
tree -L 2 "${workflow_dir}" 2>/dev/null || find "${workflow_dir}" -type f

Repository: bmad-code-org/BMAD-METHOD

Length of output: 382

Add missing step files or remove step references from the workflow.

The workflow references 6 steps (lines 37–44) but none of the corresponding step files exist in the repository:

step-01-architecture.md

step-02-api-database.md

step-03-build-deploy.md

step-04-reliability.md

step-05-security-infra.md

step-06-readiness.md

Either include these step files in the PR or update the workflow to reflect the actual implementation status.
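To keep the step list and the files on disk from drifting apart again, the expected filenames can be derived from the `Step-NN-Name` identifiers in workflow.md instead of being hard-coded. The sketch below assumes each step file is the lowercased identifier plus `.md` (for example `Step-01-Architecture` maps to `step-01-architecture.md`), which matches the list above; the function names `expected_step_files` and `missing_step_files` are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: derive expected step filenames from "Step-NN-Name" identifiers in a
# workflow file and list the ones missing from a steps directory.
# Assumption: file name = lowercased identifier + ".md".

# Print one expected filename per distinct step identifier in the workflow file.
expected_step_files() {
  grep -oE 'Step-[0-9]+(-[A-Za-z0-9]+)+' "$1" \
    | tr '[:upper:]' '[:lower:]' \
    | sort -u \
    | sed 's/$/.md/'
}

# Print expected step files that are absent from the steps directory.
# Returns 0 when all are present, 1 otherwise.
missing_step_files() {
  local workflow="$1" steps_dir="$2" missing=0 file
  for file in $(expected_step_files "$workflow"); do
    if [ ! -f "$steps_dir/$file" ]; then
      echo "$file"
      missing=1
    fi
  done
  return "$missing"
}
```

Called as `missing_step_files workflow.md steps` from inside the workflow directory, this prints any step file that workflow.md references but the directory does not contain, and its exit status could gate a CI check.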

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@src/psm/workflows/bmad-psm-setup-new-service/workflow.md` around lines 37 - 44, The workflow lists six step files (step-01-architecture.md, step-02-api-database.md, step-03-build-deploy.md, step-04-reliability.md, step-05-security-infra.md, step-06-readiness.md) that are not present; either add those missing markdown files with the corresponding step content (matching the headings in workflow.md: Step-01-Architecture through Step-06-Readiness) into the PR, or remove/update the six step references in workflow.md so the workflow only points to existing step files or a consolidated document; ensure filenames and the step identifiers in workflow.md (e.g., Step-01-Architecture, Step-02-API-Database, etc.) match exactly.

tools/cli/external-official-modules.yaml

PSM is now a standalone module at: https://github.com/DoanNgocCuong/bmad-module-production-systems

It's registered in external-official-modules.yaml for installer integration.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

DoanNgocCuong and others added 2 commits March 19, 2026 01:34

Merge branch 'main' into main

a2df51c

coderabbitai bot reviewed Mar 18, 2026


DoanNgocCuong closed this Mar 18, 2026

DoanNgocCuong reopened this Mar 18, 2026

DoanNgocCuong closed this Mar 18, 2026

	- Data Loss: {{None \| {{Description}}}}
	- Data Loss: {{NONE_OR_DESCRIPTION}}
