feat(psm): Add Production Systems & MLOps module #2057
DoanNgocCuong wants to merge 3 commits into bmad-code-org:main from
Conversation
Add new PSM module for production operations:
- 3 agents: SRE (Minh), Security (Hà), MLOps (Linh)
- 6 workflows: incident-response, production-readiness, security-audit, mlops-deployment, setup-new-service, quick-diagnose
- Teams for party mode integration
- Registered as community module in installer

Co-Authored-By: Claude Opus <noreply@anthropic.com>
📝 Walkthrough
Introduces a new Production Systems & MLOps (PSM) module containing four agent definitions (MLOps, Security, SRE, Operations Team), five workflow specifications (Incident Response, MLOps Deployment, Production Readiness Review, Quick Diagnose, Security Audit) with supporting templates and documentation, and registers the module in the CLI external modules list.
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~30 minutes
🚥 Pre-merge checks: ✅ 3 passed
Actionable comments posted: 11
🧹 Nitpick comments (2)
src/psm/workflows/bmad-psm-mlops-deployment/workflow.md (1)
44-48: Note: Roles reference agents outside PSM module.
The "Data Scientist" and "DevOps" roles are mentioned but not defined as agents in the PSM module (which includes Minh/SRE, Hà/Security, Linh/MLOps). This is acceptable if these roles are fulfilled by humans or agents from other modules, but consider clarifying if specific agents should be used.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/psm/workflows/bmad-psm-mlops-deployment/workflow.md` around lines 44 - 48, Clarify that the "Data Scientist" and "DevOps" roles in the table are external/human roles or map them to existing PSM agents (e.g., Minh/SRE, Hà/Security, Linh/MLOps); update the workflow table or add a short note immediately after the table stating whether these roles are fulfilled by external teams/humans or by specific PSM agents and, if mapped, explicitly list the agent-to-role mapping (for example: "Data Scientist -> Data Lead (external/human)"; "DevOps -> Platform Eng (Minh/SRE) or external Platform team") so readers know who is responsible.
src/psm/agents/sre/sre-sidecar/production-standards.md (1)
5-5: Minor: Use hyphen for compound adjective.
"User Specified" should be "User-Specified" when used as a compound adjective modifying "CRITICAL Rules".
✏️ Proposed fix
-## User Specified CRITICAL Rules - Supersedes General Rules
+## User-Specified CRITICAL Rules - Supersedes General Rules
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/psm/agents/sre/sre-sidecar/production-standards.md` at line 5, Update the heading "User Specified CRITICAL Rules - Supersedes General Rules" to use a hyphenated compound adjective by changing "User Specified" to "User-Specified" so the header reads "User-Specified CRITICAL Rules - Supersedes General Rules"; ensure any other occurrences of this exact heading text in the document are updated for consistency.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@src/psm/module.yaml`:
- Around line 1-13: The new module definition (code: psm, name: "PSM: Production
Systems & MLOps", header/subheader, default_selected, and knowledge_base_path
keys) lacks matching documentation; add or update docs under docs/ (e.g.,
docs/production-knowledge or a new docs/psm.md) that describe the module
purpose, agent/workflow usage, configuration keys (explain
knowledge_base_path.prompt/default/result and default_selected behavior),
example usage and any CLI/config impacts so the new psm module surface is fully
documented for developers and users.
In
`@src/psm/workflows/bmad-psm-incident-response/incident-postmortem.template.md`:
- Line 98: The "Data Loss" line contains a malformed nested placeholder `{{None
| {{Description}}}}`; remove the inner braces and use a single placeholder or a
plain choice token instead (e.g. replace with `{{Description}}` if you want a
single variable, or `{{None | Description}}` if you intend a choice between None
and Description). Update the template line that currently reads "**Data Loss**:
{{None | {{Description}}}}" to one of those corrected forms so the placeholder
syntax is valid.
In `@src/psm/workflows/bmad-psm-incident-response/workflow.md`:
- Around line 36-41: The workflow references five missing step files
(step-01-triage.md, step-02-diagnose.md, step-03-fix.md, step-04-postmortem.md,
incident-response-summary.md) used by the workflow defined in workflow.md (the
steps/architecture and execution model sections); add these files into the skill
directory with the expected content for each step (triage, diagnose, fix,
postmortem, and final summary) so the references in workflow.md resolve, or if
any step is not applicable remove its reference from workflow.md to keep the
step list and execution model consistent; ensure the filenames exactly match
step-01-triage.md, step-02-diagnose.md, step-03-fix.md, step-04-postmortem.md
and incident-response-summary.md so Step-01-Triage, Step-02-Diagnose,
Step-03-Fix, Step-04-Postmortem and the incident response summary linkage work
correctly.
- Around line 1-163: Remove the estimated_duration field from the workflow
frontmatter (frontmatter symbol: estimated_duration in workflow.md for
W-INCIDENT-001 "Production Incident Response"); add a new documentation page
under docs/ describing the Incident Response workflow overview and the 4-step
process (triage, diagnose, fix, postmortem), severity scale definitions
(SEV1–SEV4) and escalation procedures, state tracking and incident-session
management (referencing incident_context keys such as incident_id, status,
current_step, timeline), integration with observability/monitoring tools, and
postmortem template usage; finally, update docs/reference/workflow-map.md to
include this PSM module workflow (W-INCIDENT-001 / "Production Incident
Response") and link to the new docs page.
In `@src/psm/workflows/bmad-psm-mlops-deployment/SKILL.md`:
- Around line 1-6: Add the new PSM module and its six workflows to the
docs/reference by updating three files: in docs/reference/modules.md add a PSM
entry for the module name "bmad-psm-mlops-deployment" including the module code
snippet, npm package link, and list the workflows (mlops-deployment,
incident-response, production-readiness, quick-diagnose, security-audit,
setup-new-service); in docs/reference/agents.md add the PSM agents and their
trigger phrases (e.g., "deploy model", "ML deployment", "model serving" for the
mlops-deployment workflow and corresponding triggers for the others); and in
docs/reference/workflow-map.md add entries for each PSM workflow or insert a
clear link to this SKILL.md as the external PSM reference so the workflow map
points to the new module.
In
`@src/psm/workflows/bmad-psm-production-readiness/production-readiness.template.md`:
- Around line 1-367: The Production Readiness Review template (template_name:
production-readiness-checklist in production-readiness.template.md) was added
but not listed in the central docs index; update the documentation by adding a
PSM workflows entry to docs/reference/workflow-map.md that references the new
PRR workflow and either create a docs/psm/production-readiness.md documenting
the 9-dimension checklist, template usage, and decision criteria or add a PSM
section in workflow-map.md linking to
src/psm/workflows/production-readiness.template.md; ensure the entry includes
the template name, created_date (2026-03-17), and a short description so the new
workflow appears in the central index.
In `@src/psm/workflows/bmad-psm-production-readiness/workflow.md`:
- Around line 1-92: Create a new docs page describing the PRR workflow
introduced by workflow_id PRR001 (workflow_name: "Production Readiness Review");
include the purpose, execution path (START → Step 01 Init Checklist → Step 02
Deep Review → Step 03 Final Decision), integration with the PSM skill framework,
role responsibilities (Lead: Minh (SRE), Subject Matter: Service Owner, Review
Committee), the 9 evaluation dimensions (Reliability, Observability,
Performance, Security, Capacity, Data, Runbooks, Dependencies, Rollback) with
evaluation criteria, the output format/usage of the Production Readiness
Checklist (template: production-readiness.template.md and outputFile pattern),
and decision types (GO/NO-GO/CONDITIONAL-GO) plus post-workflow actions; place
the doc under docs/ and link to the entry_point steps/step-01-init-checklist.md
so reviewers can navigate from the workflow to the documentation.
In `@src/psm/workflows/bmad-psm-quick-diagnose/workflow.md`:
- Line 5: The workflow's entry_point references missing step files so the
workflow cannot run; create the two step files named steps/step-01-gather.md and
steps/step-02-diagnose.md and implement the corresponding step content described
by the workflow (Gather Context and Diagnose & Fix) so the entry_point
(entry_point: steps/step-01-gather.md) and the 2-step execution path in the
workflow.md are satisfied; ensure the new files exist under a steps/ directory
and include the appropriate metadata and instructions matching the workflow's
described steps so the workflow can execute end-to-end.
In `@src/psm/workflows/bmad-psm-setup-new-service/workflow.md`:
- Around line 37-44: The workflow lists six step files (step-01-architecture.md,
step-02-api-database.md, step-03-build-deploy.md, step-04-reliability.md,
step-05-security-infra.md, step-06-readiness.md) that are not present; either
add those missing markdown files with the corresponding step content (matching
the headings in workflow.md: Step-01-Architecture through Step-06-Readiness)
into the PR, or remove/update the six step references in workflow.md so the
workflow only points to existing step files or a consolidated document; ensure
filenames and the step identifiers in workflow.md (e.g., Step-01-Architecture,
Step-02-API-Database, etc.) match exactly.
- Around line 1-116: Add a new "PSM / Production Systems Management" section to
docs/reference/workflow-map.md (place it under Phase 4: Implementation or as its
own Production Systems section) and list the six workflows from this diff:
bmad-psm-setup-new-service (Setup Production Service for BMAD),
bmad-psm-production-readiness, bmad-psm-incident-response,
bmad-psm-mlops-deployment, bmad-psm-security-audit, and bmad-psm-quick-diagnose;
for each entry include a one-line purpose and primary outputs (e.g.,
architecture docs, CI/CD pipelines, PRR checklist, runbooks, incident runbooks,
MLOps deployment artifacts, security audit reports, quick diagnostic playbooks)
following the same format and tone used by the existing BMM workflow entries so
the new workflows integrate consistently with the workflow-map structure.
In `@tools/cli/external-official-modules.yaml`:
- Around line 45-53: The entry for module 'bmad-production-systems' in
external-official-modules.yaml has a broken/non-public GitHub URL; either remove
the entire 'bmad-production-systems' block or update its 'url' to a correct,
publicly accessible repository and ensure 'module-definition' (src/module.yaml)
and 'npmPackage' (bmad-production-systems) match that repo; after updating,
validate the URL resolves (HTTP 200) and the module-definition path exists so
loading the external module will succeed.
---
Nitpick comments:
In `@src/psm/agents/sre/sre-sidecar/production-standards.md`:
- Line 5: Update the heading "User Specified CRITICAL Rules - Supersedes General
Rules" to use a hyphenated compound adjective by changing "User Specified" to
"User-Specified" so the header reads "User-Specified CRITICAL Rules - Supersedes
General Rules"; ensure any other occurrences of this exact heading text in the
document are updated for consistency.
In `@src/psm/workflows/bmad-psm-mlops-deployment/workflow.md`:
- Around line 44-48: Clarify that the "Data Scientist" and "DevOps" roles in the
table are external/human roles or map them to existing PSM agents (e.g.,
Minh/SRE, Hà/Security, Linh/MLOps); update the workflow table or add a short
note immediately after the table stating whether these roles are fulfilled by
external teams/humans or by specific PSM agents and, if mapped, explicitly list
the agent-to-role mapping (for example: "Data Scientist -> Data Lead
(external/human)"; "DevOps -> Platform Eng (Minh/SRE) or external Platform
team") so readers know who is responsible.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 129ba29a-ddf5-4505-bc97-a8ee768ee290
⛔ Files ignored due to path filters (2)
src/psm/module-help.csv is excluded by !**/*.csv
src/psm/teams/default-party.csv is excluded by !**/*.csv
📒 Files selected for processing (29)
- src/psm/agents/mlops/mlops.agent.yaml
- src/psm/agents/security/security.agent.yaml
- src/psm/agents/sre/sre-sidecar/production-standards.md
- src/psm/agents/sre/sre.agent.yaml
- src/psm/config.yaml
- src/psm/module.yaml
- src/psm/teams/ops-team.yaml
- src/psm/workflows/bmad-psm-incident-response/SKILL.md
- src/psm/workflows/bmad-psm-incident-response/bmad-skill-manifest.yaml
- src/psm/workflows/bmad-psm-incident-response/incident-postmortem.template.md
- src/psm/workflows/bmad-psm-incident-response/workflow.md
- src/psm/workflows/bmad-psm-mlops-deployment/SKILL.md
- src/psm/workflows/bmad-psm-mlops-deployment/bmad-skill-manifest.yaml
- src/psm/workflows/bmad-psm-mlops-deployment/workflow.md
- src/psm/workflows/bmad-psm-production-readiness/SKILL.md
- src/psm/workflows/bmad-psm-production-readiness/bmad-skill-manifest.yaml
- src/psm/workflows/bmad-psm-production-readiness/production-readiness.template.md
- src/psm/workflows/bmad-psm-production-readiness/workflow.md
- src/psm/workflows/bmad-psm-quick-diagnose/SKILL.md
- src/psm/workflows/bmad-psm-quick-diagnose/bmad-skill-manifest.yaml
- src/psm/workflows/bmad-psm-quick-diagnose/workflow.md
- src/psm/workflows/bmad-psm-security-audit/SKILL.md
- src/psm/workflows/bmad-psm-security-audit/bmad-skill-manifest.yaml
- src/psm/workflows/bmad-psm-security-audit/security-audit-report.template.md
- src/psm/workflows/bmad-psm-security-audit/workflow.md
- src/psm/workflows/bmad-psm-setup-new-service/SKILL.md
- src/psm/workflows/bmad-psm-setup-new-service/bmad-skill-manifest.yaml
- src/psm/workflows/bmad-psm-setup-new-service/workflow.md
- tools/cli/external-official-modules.yaml
src/psm/module.yaml
Outdated
code: psm
name: "PSM: Production Systems & MLOps"
header: "BMad Production Systems Module"
subheader: "Production engineering workflows for incident response, production readiness, security, and MLOps."
description: "AI-driven production engineering framework with SRE, Security, and MLOps agents."
default_selected: false

knowledge_base_path:
  prompt:
    - "Where is your production knowledge base? (folder with SKILL.md files)"
    - "Leave default if you don't have one yet."
  default: "docs/production-knowledge"
  result: "{project-root}/{value}"
Add corresponding docs/ updates for the new PSM module surface.
This change introduces new module capabilities/configuration, but no matching docs update is included here. Please add or update the relevant docs pages for module purpose, agent/workflow usage, and configuration keys.
As per coding guidelines: for src/**, “new features, changed behavior, renamed concepts, altered CLI flags, or modified configuration options should all be reflected in the relevant doc pages.”
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/psm/module.yaml` around lines 1 - 13, The new module definition (code:
psm, name: "PSM: Production Systems & MLOps", header/subheader,
default_selected, and knowledge_base_path keys) lacks matching documentation;
add or update docs under docs/ (e.g., docs/production-knowledge or a new
docs/psm.md) that describe the module purpose, agent/workflow usage,
configuration keys (explain knowledge_base_path.prompt/default/result and
default_selected behavior), example usage and any CLI/config impacts so the new
psm module surface is fully documented for developers and users.
- **System Recovery**: {{SERVICE/METRIC}} took {{TIME}} to recover
- **Cascading Effects**: {{SERVICE_X}} also affected due to {{reason}}
- **On-call Load**: {{NUMBER}} pages, {{NUMBER}} escalations
- **Data Loss**: {{None | {{Description}}}}
Fix malformed placeholder syntax in “Data Loss”.
The nested token {{None | {{Description}}}} is ambiguous. Use a single placeholder token or a plain-choice placeholder format.
Proposed fix
-- **Data Loss**: {{None | {{Description}}}}
+- **Data Loss**: {{NONE_OR_DESCRIPTION}}
📝 Committable suggestion
-- **Data Loss**: {{None | {{Description}}}}
+- **Data Loss**: {{NONE_OR_DESCRIPTION}}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/psm/workflows/bmad-psm-incident-response/incident-postmortem.template.md`
at line 98, The "Data Loss" line contains a malformed nested placeholder `{{None
| {{Description}}}}`; remove the inner braces and use a single placeholder or a
plain choice token instead (e.g. replace with `{{Description}}` if you want a
single variable, or `{{None | Description}}` if you intend a choice between None
and Description). Update the template line that currently reads "**Data Loss**:
{{None | {{Description}}}}" to one of those corrected forms so the placeholder
syntax is valid.
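A quick pattern search can catch this class of template error before review. The sketch below is an illustration only and is not part of the PR: `find_nested_placeholders` is a hypothetical helper name, and the regular expression is a heuristic that assumes placeholders are written as `{{...}}` with no single `}` inside them.

```shell
#!/bin/bash
# Hypothetical helper (not part of the PR): list template lines where a second
# "{{" opens before the previous "{{" has been closed with "}}".
find_nested_placeholders() {
  local template_file="$1"
  # Match "{{", then any run of characters other than "}", then another "{{".
  # Prints "line_number:line" for each hit; exits non-zero when nothing matches.
  grep -nE '\{\{[^}]*\{\{' "$template_file"
}
```

Called with the path to `incident-postmortem.template.md`, the function should flag the "Data Loss" line as long as the nested token is still present, and print nothing once the line uses a single placeholder.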
---
workflow_id: W-INCIDENT-001
workflow_name: Production Incident Response
version: 6.2.0
lead_agent: "SRE Minh"
supporting_agents: ["Architect Khang", "Mary Analyst"]
phase: "3-Run: Emergency Response & Recovery"
created_date: 2026-03-17
last_modified: 2026-03-17
config_file: "_config/config.yaml"
estimated_duration: "15 minutes to 2 hours (depending on severity)"
outputFile: '{output_folder}/psm-artifacts/incident-{{project_name}}-{{date}}.md'
---

# Production Incident Response Workflow — BMAD Pattern

## Metadata & Context

**Goal**: Triage, diagnose, resolve production incidents through systematic diagnosis and apply fixes with verification. This is the most critical workflow - minimize MTTR (Mean Time To Recovery) while maintaining system stability.

**Lead Team**:
- SRE Minh (Incident Command, Recovery Orchestration)
- Architect Khang (Root Cause Analysis, System-wide Impact)
- Mary Analyst (Impact Assessment, Post-Incident Review)

**Success Criteria**:
- ✓ Incident severity classified within 5 minutes
- ✓ Root cause identified within first triage pass
- ✓ Fix applied and verified
- ✓ System metrics returned to baseline
- ✓ Incident postmortem documented with action items
- ✓ Prevention measures identified

## Workflow Overview

This workflow goes through 4 atomic steps, each focusing on a different phase:

1. **Step-01-Triage** → Gather initial info, assess severity, classify impact
2. **Step-02-Diagnose** → Systematic diagnosis using observability data (logs, metrics, traces)
3. **Step-03-Fix** → Apply fix, verify resolution, validate recovery
4. **Step-04-Postmortem** → Document incident, identify action items, prevent recurrence

## Configuration Loading

Automatically loaded from `_config/config.yaml`:

```yaml
project_context:
  organization: "[loaded from config]"
  environment: "production"
  incident_channel: "slack:#incidents"

workflow_defaults:
  communication_language: "Vietnamese-English"
  severity_levels: ["SEV1", "SEV2", "SEV3", "SEV4"]
  escalation_contacts: "[loaded from config]"
  on_call_engineer: "[loaded from config]"
```

## Workflow Architecture - Micro-File Design

BMAD pattern: each step is a separate file, loaded just-in-time. Workflow chain:

```
workflow.md (entry point)
    ↓
step-01-triage.md (classify severity, initial assessment)
    ↓
step-02-diagnose.md (root cause analysis)
    ↓
step-03-fix.md (apply fix, verify)
    ↓
step-04-postmortem.md (document, prevent)
    ↓
incident-response-summary.md (final output)
```

**Key Benefits**:
- Single-step focus — engineer concentrates on one phase
- Knowledge isolation — load only relevant SKILL docs per step
- State tracking — save progress after each step
- Easy resumption — if interrupted, restart from exact step

## Skill References

This workflow loads knowledge from:

- **5.07 Reliability & Resilience** → Circuit breaker patterns, fallback strategies, timeout management
- **5.08 Observability & Monitoring** → Structured logging, metrics queries, distributed tracing
- **5.09 Error Handling & Recovery** → Error classification, graceful degradation patterns
- **5.10 Production Readiness** → Incident prevention checklist, alerting setup
- **5.14 Documentation & Runbooks** → Postmortem templates, incident reports

## Execution Model

### Entry Point Logic

```
1. Check if incident session exists
   → If NEW incident: Start from step-01-triage.md
   → If ONGOING: Load incident-session.yaml → continue from last completed step
   → If RESOLVED: Load postmortem template

2. For each step:
   a) Load step-{N}-{name}.md
   b) Load referenced SKILL files (auto-parse "Load:" directives)
   c) Execute MENU [A][C] options
   d) Save step output to step-{N}-output.md + incident-context.yaml
   e) Move to next step or conclude

3. Final: Generate incident report + postmortem in outputs folder
```

### State Tracking

Incident session frontmatter tracks progress:

```yaml
incident_context:
  incident_id: "INC-2026-03-17-001"
  severity: "SEV1" | "SEV2" | "SEV3" | "SEV4"
  status: "triage" → "diagnosing" → "recovering" → "resolved" → "postmortem"
  affected_services: ["service-1", "service-2"]
  started_at: "2026-03-17T14:30:00Z"
  timeline:
    detected_at: "2026-03-17T14:30:00Z"
    triage_completed_at: "2026-03-17T14:35:00Z"
    root_cause_identified_at: "2026-03-17T14:50:00Z"
    fix_applied_at: "2026-03-17T15:10:00Z"
    resolved_at: "2026-03-17T15:15:00Z"
  current_step: "step-02-diagnose"
  last_updated: "2026-03-17T14:50:00Z"
  incident_commander: "SRE Minh"
```

## Mandatory Workflow Rules

1. **Speed first** — Triage must complete in < 5 minutes
2. **Root cause identification** — Must identify root cause before fix attempt
3. **Verify before declaring resolved** — Check metrics + user reports
4. **Document everything** — Every action logged for postmortem
5. **Escalation protocol** — SEV1 → Page on-call architect immediately
6. **Communication** — Update stakeholders every 5-10 minutes
7. **No flying blind** — All fixes must reference observability data

## Severity Scale

- **SEV1** — Service completely down, revenue impact, > 1% users affected → Page all on-call
- **SEV2** — Major degradation, significant users affected, partial functionality down
- **SEV3** — Moderate impact, some users affected, workaround possible
- **SEV4** — Minor issue, limited users, can defer to business hours

## Navigation

Choose how to start:

- **[NEW-INC]** — Report new incident → Load step-01-triage
- **[RESUME-INC]** — Continue existing incident (detect progress from incident-session.yaml)
- **[ESCALATE]** — Escalate to on-call architect

---

**Report the incident status or choose [NEW-INC] to begin triage**
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
# Description: Check if incident-response documentation exists under docs/
# Search for incident-response or related documentation
fd -e md . docs/ -x grep -l "incident.response\|incident-response\|Incident Response" {} \;
# Check for severity scale documentation
fd -e md . docs/ -x grep -l "SEV1\|severity scale\|incident severity" {} \;
Repository: bmad-code-org/BMAD-METHOD
Length of output: 51
🏁 Script executed:
#!/bin/bash
# Check if docs/ directory exists and examine its structure
echo "=== Checking docs/ directory structure ==="
if [ -d "docs/" ]; then
  echo "docs/ exists. Top-level contents:"
  ls -la docs/ | head -20
  echo ""
  echo "=== All .md files in docs/ ==="
  find docs/ -name "*.md" -type f | head -30
else
  echo "docs/ directory does not exist"
fi
echo ""
echo "=== Source workflow file location ==="
ls -la src/psm/workflows/bmad-psm-incident-response/ 2>/dev/null || echo "Directory not found"
Repository: bmad-code-org/BMAD-METHOD
Length of output: 2204
🏁 Script executed:
#!/bin/bash
# Check workflow-map documentation and PSM references
echo "=== Contents of docs/reference/workflow-map.md ==="
cat docs/reference/workflow-map.md
echo ""
echo "=== Check for any PSM references in docs/ ==="
grep -r "PSM\|psm\|incident.*response" docs/ 2>/dev/null | head -20
echo ""
echo "=== Git info about the workflow file ==="
git log --oneline -n 5 -- src/psm/workflows/bmad-psm-incident-response/workflow.md 2>/dev/null || echo "No git history or file not tracked"
Repository: bmad-code-org/BMAD-METHOD
Length of output: 6760
🏁 Script executed:
#!/bin/bash
# Check for skill-validator rules and PSM module documentation
echo "=== Looking for skill-validator or related validation rules ==="
find tools/ -name "*skill*validator*" -o -name "*skill*" 2>/dev/null | head -20
echo ""
echo "=== Check PSM module structure ==="
find src/psm/ -type f -name "*.md" | head -20
echo ""
echo "=== Check if PSM is referenced in any docs ==="
grep -r "psm\|PSM" src/ --include="*.md" | grep -i "module\|description\|purpose" | head -10
echo ""
echo "=== Check workflow metadata for creation/modification info ==="
head -30 src/psm/workflows/bmad-psm-incident-response/workflow.md
Repository: bmad-code-org/BMAD-METHOD
Length of output: 2501
🏁 Script executed:
#!/bin/bash
# Read skill-validator.md to understand rules for legacy workflows
echo "=== Contents of tools/skill-validator.md ==="
cat tools/skill-validator.md
Repository: bmad-code-org/BMAD-METHOD
Length of output: 21986
Documentation for incident-response workflow is missing and required per coding guidelines.
This PR introduces a new Production Incident Response workflow for the PSM module. Per the coding guideline "Source file changed. Check whether documentation under docs/ needs a corresponding update", documentation must be added to document:
- Incident Response workflow overview and the 4-step process (triage, diagnose, fix, postmortem)
- Severity scale (SEV1-SEV4) definitions and escalation procedures
- State tracking and incident session management
- Integration with observability and monitoring tools
- Postmortem template usage
Additionally, docs/reference/workflow-map.md should be updated to include the PSM module workflows.
Minor issue: Remove the estimated_duration field from the workflow.md frontmatter — per skill-validator rule SEQ-02, workflow files must not include time estimates.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/psm/workflows/bmad-psm-incident-response/workflow.md` around lines 1 -
163, Remove the estimated_duration field from the workflow frontmatter
(frontmatter symbol: estimated_duration in workflow.md for W-INCIDENT-001
"Production Incident Response"); add a new documentation page under docs/
describing the Incident Response workflow overview and the 4-step process
(triage, diagnose, fix, postmortem), severity scale definitions (SEV1–SEV4) and
escalation procedures, state tracking and incident-session management
(referencing incident_context keys such as incident_id, status, current_step,
timeline), integration with observability/monitoring tools, and postmortem
template usage; finally, update docs/reference/workflow-map.md to include this
PSM module workflow (W-INCIDENT-001 / "Production Incident Response") and link
to the new docs page.
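The frontmatter part of this finding lends itself to a mechanical check. The following is a minimal sketch, not the project's actual skill-validator: `frontmatter_has_key` is a hypothetical function, and it assumes the frontmatter is the block between the first two `---` lines and that keys start at column one.

```shell
#!/bin/bash
# Hypothetical check (not the real skill-validator): exit 0 when the YAML
# frontmatter, i.e. the lines between the first two "---" markers, contains
# the given top-level key; exit 1 otherwise.
frontmatter_has_key() {
  local key="$1" file="$2"
  awk -v key="$key" '
    # Count frontmatter fences; stop reading once the closing fence is seen.
    /^---[[:space:]]*$/ { fences++; if (fences == 2) exit; next }
    # Inside the frontmatter, look for a line that starts with "key:".
    fences == 1 && index($0, key ":") == 1 { found = 1; exit }
    END { exit (found ? 0 : 1) }
  ' "$file"
}
```

For example, `frontmatter_has_key estimated_duration src/psm/workflows/bmad-psm-incident-response/workflow.md` should succeed while the field is still present and fail after it is removed. Mentions of the key in the document body are deliberately ignored.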
This workflow goes through 4 atomic steps, each focusing on a different phase:

1. **Step-01-Triage** → Gather initial info, assess severity, classify impact
2. **Step-02-Diagnose** → Systematic diagnosis using observability data (logs, metrics, traces)
3. **Step-03-Fix** → Apply fix, verify resolution, validate recovery
4. **Step-04-Postmortem** → Document incident, identify action items, prevent recurrence
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
# Description: Verify that all referenced step files exist for incident-response workflow
workflow_dir="src/psm/workflows/bmad-psm-incident-response"
echo "Checking for step files in ${workflow_dir}..."
# Check for steps directory
if [ -d "${workflow_dir}/steps" ]; then
  echo "✓ steps/ directory exists"
  ls -la "${workflow_dir}/steps/"
else
  echo "✗ steps/ directory not found"
fi
# Check for specific step files
for step in step-01-triage step-02-diagnose step-03-fix step-04-postmortem; do
  if [ -f "${workflow_dir}/steps/${step}.md" ]; then
    echo "✓ ${step}.md exists"
  else
    echo "✗ ${step}.md NOT FOUND"
  fi
done
# Check for SKILL.md and manifest
if [ -f "${workflow_dir}/SKILL.md" ]; then
  echo "✓ SKILL.md exists"
else
  echo "✗ SKILL.md NOT FOUND"
fi
if [ -f "${workflow_dir}/bmad-skill-manifest.yaml" ]; then
  echo "✓ bmad-skill-manifest.yaml exists"
else
  echo "✗ bmad-skill-manifest.yaml NOT FOUND"
fi
# Check for incident-postmortem template
if [ -f "${workflow_dir}/incident-postmortem.template.md" ]; then
  echo "✓ incident-postmortem.template.md exists"
else
  echo "✗ incident-postmortem.template.md NOT FOUND"
fi
Repository: bmad-code-org/BMAD-METHOD
Length of output: 388
🏁 Script executed:
#!/bin/bash
# Get the full directory structure of the skill
workflow_dir="src/psm/workflows/bmad-psm-incident-response"
echo "=== Full directory structure of ${workflow_dir} ==="
find "${workflow_dir}" -type f -name "*.md" -o -name "*.yaml" | sort
echo ""
echo "=== Directory tree ==="
tree "${workflow_dir}" 2>/dev/null || find "${workflow_dir}" -type f | sed 's|[^/]*/| |g' | sort
echo ""
echo "=== Check SKILL.md content ==="
if [ -f "${workflow_dir}/SKILL.md" ]; then
head -30 "${workflow_dir}/SKILL.md"
fi
echo ""
echo "=== Check manifest content ==="
if [ -f "${workflow_dir}/bmad-skill-manifest.yaml" ]; then
cat "${workflow_dir}/bmad-skill-manifest.yaml"
fi
Repository: bmad-code-org/BMAD-METHOD
Length of output: 947
🏁 Script executed:
#!/bin/bash
# Read the full workflow.md to understand step references
cat -n "src/psm/workflows/bmad-psm-incident-response/workflow.md"
Repository: bmad-code-org/BMAD-METHOD
Length of output: 6882
🏁 Script executed:
#!/bin/bash
# Check if incident-response-summary.md is also missing
workflow_dir="src/psm/workflows/bmad-psm-incident-response"
echo "=== Checking for incident-response-summary.md ==="
if [ -f "${workflow_dir}/incident-response-summary.md" ]; then
echo "✓ incident-response-summary.md exists"
else
echo "✗ incident-response-summary.md NOT FOUND"
fi
echo ""
echo "=== Checking PR/git status for context ==="
# Show recent git log for this directory (if available)
git log --oneline -5 -- "${workflow_dir}" 2>/dev/null || echo "No git history available"
echo ""
echo "=== All referenced but missing files ==="
echo "From workflow.md architecture (lines 64-76):"
echo " ✗ step-01-triage.md"
echo " ✗ step-02-diagnose.md"
echo " ✗ step-03-fix.md"
echo " ✗ step-04-postmortem.md"
echo " ✗ incident-response-summary.md"
Repository: bmad-code-org/BMAD-METHOD
Length of output: 453
Add missing step files required by the workflow.
The workflow references 5 step files that do not exist in the skill directory:
- `step-01-triage.md` (referenced in lines 38, 67, 100)
- `step-02-diagnose.md` (referenced in lines 39, 69, 131)
- `step-03-fix.md` (referenced in lines 40, 71)
- `step-04-postmortem.md` (referenced in lines 41, 73)
- `incident-response-summary.md` (referenced in line 75)
The workflow architecture (lines 64–76) and execution model (lines 99–111) explicitly require these step files to be present. Without them, the skill cannot function. Include these files in the PR or remove the references if they are not applicable.
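A small reusable check could catch this class of gap before review. The sketch below is illustrative only: the function name `missing_steps` is hypothetical, and it assumes step files are referenced in `workflow.md` by names of the form `step-NN-name.md` and are expected under `steps/`. It does not cover non-step files such as `incident-response-summary.md`.

```shell
# Hypothetical helper (not part of the PR): print every step file that
# workflow.md mentions but that is absent from the steps/ directory.
# Assumption: references look like step-01-triage.md.
missing_steps() {
  local workflow_dir="$1"
  local name
  # Extract the unique step file names mentioned in workflow.md.
  grep -oE 'step-[0-9]+-[a-z0-9-]+\.md' "${workflow_dir}/workflow.md" | sort -u |
    while IFS= read -r name; do
      # Report only names with no matching file under steps/.
      [ -f "${workflow_dir}/steps/${name}" ] || echo "${name}"
    done
}
```

Calling `missing_steps src/psm/workflows/bmad-psm-incident-response` would print one missing file name per line, and nothing when every reference resolves.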
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/psm/workflows/bmad-psm-incident-response/workflow.md` around lines 36 -
41, The workflow references five missing step files (step-01-triage.md,
step-02-diagnose.md, step-03-fix.md, step-04-postmortem.md,
incident-response-summary.md) used by the workflow defined in workflow.md (the
steps/architecture and execution model sections); add these files into the skill
directory with the expected content for each step (triage, diagnose, fix,
postmortem, and final summary) so the references in workflow.md resolve, or if
any step is not applicable remove its reference from workflow.md to keep the
step list and execution model consistent; ensure the filenames exactly match
step-01-triage.md, step-02-diagnose.md, step-03-fix.md, step-04-postmortem.md
and incident-response-summary.md so Step-01-Triage, Step-02-Diagnose,
Step-03-Fix, Step-04-Postmortem and the incident response summary linkage work
correctly.
| --- | ||
| name: bmad-psm-mlops-deployment | ||
| description: 'Deploy ML model to production with validation and monitoring. Use when the user says "deploy model" or "ML deployment" or "model serving"' | ||
| --- | ||
|
|
||
| Follow the instructions in [workflow.md](workflow.md). |
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
set -euo pipefail
# Verify whether docs mention the new PSM module/workflows/triggers.
if [ -d docs ]; then
rg -n --ignore-case \
'psm|production systems|mlops|incident response|production readiness|security audit|setup new service|quick diagnose|bmad-psm-' \
docs
else
echo "docs/ directory not found"
fi
Repository: bmad-code-org/BMAD-METHOD
Length of output: 51
🏁 Script executed:
# Check if docs/ directory exists and list its structure
if [ -d docs ]; then
echo "=== docs/ directory exists ==="
find docs -type f -name "*.md" | head -20
echo ""
echo "=== docs/ top-level structure ==="
ls -la docs/
else
echo "docs/ directory does not exist"
fi
Repository: bmad-code-org/BMAD-METHOD
Length of output: 1479
🏁 Script executed:
# Check the PSM workflow directory structure
if [ -d src/psm ]; then
echo "=== src/psm/ directory exists ==="
find src/psm -type f | sort
else
echo "src/psm/ directory does not exist"
fi
Repository: bmad-code-org/BMAD-METHOD
Length of output: 1705
🏁 Script executed:
# Check if there are any workflow indexes or catalog files that list workflows
find . -type f -name "*.md" -o -name "*.yaml" -o -name "*.yml" | xargs grep -l "workflow" | head -10
Repository: bmad-code-org/BMAD-METHOD
Length of output: 348
🏁 Script executed:
# Look for any README or documentation in src/psm area
find src/psm -type f -name "README*" -o -name "INDEX*" -o -name "*index*"Repository: bmad-code-org/BMAD-METHOD
Length of output: 51
🏁 Script executed:
# Check the main docs index
cat docs/index.md
Repository: bmad-code-org/BMAD-METHOD
Length of output: 3762
🏁 Script executed:
# Check reference directory which might list workflows
ls -la docs/reference/
cat docs/reference/*.md | head -100
Repository: bmad-code-org/BMAD-METHOD
Length of output: 6925
🏁 Script executed:
# Look for any mentions of workflows or modules in docs
grep -r "workflow" docs/ --include="*.md" | grep -i "list\|index\|available\|commands"
Repository: bmad-code-org/BMAD-METHOD
Length of output: 5438
🏁 Script executed:
# Check if there are any how-to guides that might reference PSM or explain module discovery
cat docs/how-to/get-answers-about-bmad.md
Repository: bmad-code-org/BMAD-METHOD
Length of output: 4587
🏁 Script executed:
# Check modules.md which should list all available modules
cat docs/reference/modules.md
Repository: bmad-code-org/BMAD-METHOD
Length of output: 3867
🏁 Script executed:
# Check workflow-map.md which should list all workflows
cat docs/reference/workflow-map.md
Repository: bmad-code-org/BMAD-METHOD
Length of output: 6581
Add PSM module and workflows to docs/reference/.
The new PSM module with six workflows (mlops-deployment, incident-response, production-readiness, quick-diagnose, security-audit, setup-new-service) is not documented in the reference section. Update:
- `docs/reference/modules.md`: add PSM entry with module code, npm link, and workflow list
- `docs/reference/agents.md`: add PSM agents and their triggers
- `docs/reference/workflow-map.md`: add PSM workflows or link to external PSM reference
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/psm/workflows/bmad-psm-mlops-deployment/SKILL.md` around lines 1 - 6, Add
the new PSM module and its six workflows to the docs/reference by updating three
files: in docs/reference/modules.md add a PSM entry for the module name
"bmad-psm-mlops-deployment" including the module code snippet, npm package link,
and list the workflows (mlops-deployment, incident-response,
production-readiness, quick-diagnose, security-audit, setup-new-service); in
docs/reference/agents.md add the PSM agents and their trigger phrases (e.g.,
"deploy model", "ML deployment", "model serving" for the mlops-deployment
workflow and corresponding triggers for the others); and in
docs/reference/workflow-map.md add entries for each PSM workflow or insert a
clear link to this SKILL.md as the external PSM reference so the workflow map
points to the new module.
| --- | ||
| workflow_id: PRR001 | ||
| workflow_name: Production Readiness Review | ||
| description: Validate service is ready for production using comprehensive readiness checklist | ||
| entry_point: steps/step-01-init-checklist.md | ||
| phase: 3-run | ||
| lead_agent: "Minh (SRE)" | ||
| status: "active" | ||
| created_date: 2026-03-17 | ||
| version: "1.0.0" | ||
| estimated_duration: "2-3 hours" | ||
| outputFile: '{output_folder}/psm-artifacts/prr-{{project_name}}-{{date}}.md' | ||
| --- | ||
|
|
||
| # Workflow: Production Readiness Review (PRR) | ||
|
|
||
| ## Goal | ||
| Validate and certify that a service meets production readiness standards across 9 key dimensions before deployment. | ||
|
|
||
| ## Overview | ||
|
|
||
| This workflow systematically evaluates a service against production readiness criteria defined in the Production Systems BMAD skill framework. Using SRE expertise and architectural patterns, the workflow: | ||
|
|
||
| 1. **Initializes** the PRR process with service context and dimensional overview | ||
| 2. **Deep reviews** each dimension (reliability, observability, performance, security, capacity, data, runbooks, dependencies, rollback) | ||
| 3. **Renders final decision** with GO/NO-GO/CONDITIONAL-GO recommendation | ||
|
|
||
| ## Execution Path | ||
|
|
||
| ``` | ||
| START | ||
| ↓ | ||
| [Step 01] Init Checklist (Load framework, gather service context, present dimensions) | ||
| ↓ | ||
| [Step 02] Deep Review (Score each dimension, identify blockers, recommendations) | ||
| ↓ | ||
| [Step 03] Final Decision (Scorecard, decision, action items, DONE) | ||
| ↓ | ||
| END | ||
| ``` | ||
|
|
||
| ## Key Roles | ||
|
|
||
| | Role | Agent | Responsibility | | ||
| |------|-------|-----------------| | ||
| | Lead | Minh (SRE) | Navigate workflow, coordinate review, make final call | | ||
| | Subject Matter | Service Owner | Provide service context, clarify architecture | | ||
| | Review Committee | Arch, SecOps, MLOps | Contribute expertise on specific dimensions | | ||
|
|
||
| ## Dimensions Evaluated (9) | ||
|
|
||
| 1. **Reliability** — SLA/SLO definition, error budgets, failure modes, incident response | ||
| 2. **Observability** — Logging, metrics, tracing, dashboards, alerting | ||
| 3. **Performance** — Latency targets, throughput, P99 tail behavior, optimization opportunities | ||
| 4. **Security** — Auth/authz, secrets management, encryption, audit logging, compliance | ||
| 5. **Capacity** — Resource limits, scaling policies, burst capacity, cost projections | ||
| 6. **Data** — Schema versioning, backup/restore, data governance, retention policies | ||
| 7. **Runbooks** — Incident runbooks, operational playbooks, troubleshooting guides | ||
| 8. **Dependencies** — External services, internal libraries, database versioning, API contracts | ||
| 9. **Rollback** — Rollback strategy, canary deployment, feature flags, smoke tests | ||
|
|
||
| ## Input Requirements | ||
|
|
||
| - **Service name and owner** — Which service are we evaluating? | ||
| - **Current architecture** — High-level design, tech stack, topology | ||
| - **Existing metrics/dashboards** — Links to monitoring, SLO definitions | ||
| - **Known gaps/risks** — Already identified issues to address | ||
|
|
||
| ## Output Deliverable | ||
|
|
||
| - **Production Readiness Checklist** (template: `production-readiness.template.md`) | ||
| - Scorecard with 9 dimensions (red/yellow/green) | ||
| - Blockers and recommendations per dimension | ||
| - Final GO/NO-GO/CONDITIONAL-GO decision | ||
| - Explicit action items with owners and deadlines | ||
|
|
||
| ## Success Criteria | ||
|
|
||
| 1. All 9 dimensions evaluated with clear rationale | ||
| 2. Blockers categorized as P0 (must fix) or P1 (should fix) | ||
| 3. Team alignment on decision (documented in PRR report) | ||
| 4. Action plan with clear accountability and timeline | ||
|
|
||
| ## Next Steps After Workflow | ||
|
|
||
| - If **GO**: Proceed to deployment; document in CHANGELOG | ||
| - If **NO-GO**: Reschedule PRR once blockers addressed; track in backlog | ||
| - If **CONDITIONAL-GO**: Deploy with documented caveats; setup monitoring for risk areas | ||
|
|
||
| --- | ||
|
|
||
| **Navigation**: [← Back to 3-run](../), [Next: Step 01 →](steps/step-01-init-checklist.md) |
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
# Description: Check if PRR workflow execution documentation exists under docs/
# Search for PRR workflow execution documentation
fd -e md . docs/ -x grep -l "Production Readiness Review.*workflow\|PRR.*execution\|PRR.*step" {} \;
# Check if workflow navigation is documented
fd -e md . docs/ -x grep -l "workflow.*navigation\|step.*execution.*path" {} \;
Repository: bmad-code-org/BMAD-METHOD
Length of output: 51
Add documentation for PRR workflow execution and framework under docs/.
The Production Readiness Review workflow introduces new execution logic, role assignments, and a 9-dimension evaluation framework. Corresponding documentation should be added to docs/ explaining:
- PRR workflow purpose, execution path, and integration with the PSM skill framework
- The 9 dimensions (Reliability, Observability, Performance, Security, Capacity, Data, Runbooks, Dependencies, Rollback) and evaluation criteria
- Role responsibilities (Lead/SRE, Subject Matter, Review Committee)
- Output format and usage of the Production Readiness Checklist
- Decision types (GO/NO-GO/CONDITIONAL-GO) and post-workflow action paths
Per coding guidelines: source file changes require corresponding documentation updates in docs/ for new features and frameworks.
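To make the decision types concrete, such documentation could include a rule of thumb like the one sketched below. The mapping is an assumption for illustration: the workflow defines P0 (must fix) and P1 (should fix) blockers and three outcomes, but it does not state how blocker counts translate into a decision, and `prr_decision` is a hypothetical name.

```shell
# Hypothetical decision rule (an assumption, not taken from the PRR workflow):
#   any P0 blocker        -> NO-GO
#   only P1 blockers      -> CONDITIONAL-GO
#   no blockers at all    -> GO
prr_decision() {
  local p0_count="$1"
  local p1_count="$2"
  if [ "${p0_count}" -gt 0 ]; then
    echo "NO-GO"            # a must-fix blocker stops the launch
  elif [ "${p1_count}" -gt 0 ]; then
    echo "CONDITIONAL-GO"   # should-fix items ship with documented caveats
  else
    echo "GO"
  fi
}
```

For example, `prr_decision 0 2` prints `CONDITIONAL-GO`. A real review would likely weigh the per-dimension red/yellow/green scores as well, which this sketch leaves out.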
🧰 Tools
🪛 LanguageTool
[style] ~25-~25: Consider a different adjective to strengthen your wording.
Context: ...e context and dimensional overview 2. Deep reviews each dimension (reliability, ...
(DEEP_PROFOUND)
[style] ~47-~47: This phrase is redundant. Consider using “Subject” to avoid wordiness.
Context: ... coordinate review, make final call | | Subject Matter | Service Owner | Provide service conte...
(SUBJECT_MATTER)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/psm/workflows/bmad-psm-production-readiness/workflow.md` around lines 1 -
92, Create a new docs page describing the PRR workflow introduced by workflow_id
PRR001 (workflow_name: "Production Readiness Review"); include the purpose,
execution path (START → Step 01 Init Checklist → Step 02 Deep Review → Step 03
Final Decision), integration with the PSM skill framework, role responsibilities
(Lead: Minh (SRE), Subject Matter: Service Owner, Review Committee), the 9
evaluation dimensions (Reliability, Observability, Performance, Security,
Capacity, Data, Runbooks, Dependencies, Rollback) with evaluation criteria, the
output format/usage of the Production Readiness Checklist (template:
production-readiness.template.md and outputFile pattern), and decision types
(GO/NO-GO/CONDITIONAL-GO) plus post-workflow actions; place the doc under docs/
and link to the entry_point steps/step-01-init-checklist.md so reviewers can
navigate from the workflow to the documentation.
| workflow_id: QD001 | ||
| workflow_name: Quick Diagnose | ||
| description: Fast diagnosis of production issue with root cause and fix suggestion | ||
| entry_point: steps/step-01-gather.md |
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
# Verify step files exist for quick-diagnose workflow
WORKFLOW_DIR="src/psm/workflows/bmad-psm-quick-diagnose"
echo "=== Checking step files ==="
if [ -d "$WORKFLOW_DIR/steps" ]; then
ls -la "$WORKFLOW_DIR/steps/"
else
echo "ERROR: steps/ directory does not exist"
fi
Repository: bmad-code-org/BMAD-METHOD
Length of output: 135
🏁 Script executed:
cat -n src/psm/workflows/bmad-psm-quick-diagnose/workflow.md | head -100
Repository: bmad-code-org/BMAD-METHOD
Length of output: 2732
🏁 Script executed:
cd src/psm/workflows/bmad-psm-quick-diagnose && find . -type f && ls -la ../
Repository: bmad-code-org/BMAD-METHOD
Length of output: 675
The workflow cannot execute: required step files are missing.
The entry_point at line 5 references steps/step-01-gather.md, but the steps/ directory does not exist. The workflow describes a 2-step execution path (Gather Context → Diagnose & Fix) at lines 30-37, yet neither step file is present. Create the missing step files (steps/step-01-gather.md and steps/step-02-diagnose.md) for the workflow to be functional.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/psm/workflows/bmad-psm-quick-diagnose/workflow.md` at line 5, The
workflow's entry_point references missing step files so the workflow cannot run;
create the two step files named steps/step-01-gather.md and
steps/step-02-diagnose.md and implement the corresponding step content described
by the workflow (Gather Context and Diagnose & Fix) so the entry_point
(entry_point: steps/step-01-gather.md) and the 2-step execution path in the
workflow.md are satisfied; ensure the new files exist under a steps/ directory
and include the appropriate metadata and instructions matching the workflow's
described steps so the workflow can execute end-to-end.
| --- | ||
| workflow_id: W-SETUP-SVC-001 | ||
| workflow_name: Setup Production Service for BMAD | ||
| version: 6.2.0 | ||
| lead_agent: "Architect Khang" | ||
| supporting_agents: ["SRE Minh", "Mary Analyst"] | ||
| phase: "1-Analysis → 2-Planning → 3-Solutioning → 4-Implementation" | ||
| created_date: 2026-03-17 | ||
| last_modified: 2026-03-17 | ||
| config_file: "_config/config.yaml" | ||
| estimated_duration: "12-20 hours" | ||
| outputFile: '{output_folder}/psm-artifacts/service-setup-{{project_name}}-{{date}}.md' | ||
| --- | ||
|
|
||
| # Setup Production Service Workflow — BMAD Pattern | ||
|
|
||
| ## Metadata & Context | ||
|
|
||
| **Goal**: Build a production-grade service from scratch, complete with architecture, API design, deployment pipeline, reliability patterns, security, and production readiness. | ||
|
|
||
| **Lead Team**: | ||
| - SRE Minh (Reliability, Infrastructure, Operations) | ||
| - Architect Khang (System Design, Technology Selection) | ||
| - Mary Analyst (Requirements, Risk Assessment) | ||
|
|
||
| **Success Criteria**: | ||
| - ✓ Architecture design document approved | ||
| - ✓ API contracts defined & validated | ||
| - ✓ Database schema designed & indexed | ||
| - ✓ CI/CD pipeline operational | ||
| - ✓ Resilience & observability in place | ||
| - ✓ Security & compliance verified | ||
| - ✓ Production readiness checklist passed | ||
|
|
||
| ## Workflow Overview | ||
|
|
||
| This workflow proceeds through 6 atomic steps, each focused on its own domain: | ||
|
|
||
| 1. **Step-01-Architecture** → Requirements + Architecture Pattern Selection | ||
| 2. **Step-02-API-Database** → API Design + Database Selection + Schema | ||
| 3. **Step-03-Build-Deploy** → CI/CD + Containerization + Testing Strategy | ||
| 4. **Step-04-Reliability** → Resilience Patterns + Observability + Error Handling | ||
| 5. **Step-05-Security-Infra** → Auth/Authz + Secrets + K8s Config | ||
| 6. **Step-06-Readiness** → PRR Checklist + Runbook + Go/No-Go Decision | ||
|
|
||
| ## Configuration Loading | ||
|
|
||
| Loaded automatically from `_config/config.yaml`: | ||
|
|
||
| ```yaml | ||
| project_context: | ||
| user_name: "[loaded from config]" | ||
| organization: "[loaded from config]" | ||
| environment: "production" | ||
|
|
||
| workflow_defaults: | ||
| communication_language: "Vietnamese" | ||
| output_folder: "./outputs/setup-new-service-{service_name}" | ||
| timestamp: "2026-03-17" | ||
| ``` | ||
|
|
||
| ## Execution Model | ||
|
|
||
| ### Entry Point Logic | ||
|
|
||
| ``` | ||
| 1. Check if workflow.md exists in outputs folder | ||
| → If NEW: Start from step-01-architecture.md | ||
| → If RESUME: Load progress.yaml → auto-skip completed steps | ||
| → If PARTIAL: Load step-N-context.yaml → resume from step N | ||
|
|
||
| 2. For each step: | ||
| a) Load step-{N}-{name}.md | ||
| b) Load referenced SKILL files (auto-parse "Load:" directives) | ||
| c) Execute MENU [A][C] options | ||
| d) Save step output to step-{N}-output.md | ||
| e) Move to next step | ||
|
|
||
| 3. Final: Generate comprehensive outputs in outputs folder | ||
| ``` | ||
|
|
||
| ### State Tracking | ||
|
|
||
| Output document frontmatter tracks progress: | ||
|
|
||
| ```yaml | ||
| workflow_progress: | ||
| step_01_architecture: "completed" | ||
| step_02_api_database: "completed" | ||
| step_03_build_deploy: "in_progress" | ||
| step_04_reliability: "pending" | ||
| step_05_security_infra: "pending" | ||
| step_06_readiness: "pending" | ||
| last_updated: "2026-03-17T14:30:00Z" | ||
| current_agent: "Architect Khang" | ||
| ``` | ||
|
|
||
| ## Mandatory Workflow Rules | ||
|
|
||
| 1. **No skipping steps** — Each step must be executed in order | ||
| 2. **Validate assumptions** — Every decision must be documented | ||
| 3. **Cross-phase collaboration** — Architects + SRE + Analysts work together | ||
| 4. **Output artifacts** — Each step produces tangible output documents | ||
| 5. **Handoff protocol** — Context is handed off explicitly between steps | ||
|
|
||
| ## Navigation | ||
|
|
||
| Choose how to start: | ||
|
|
||
| - **[NEW]** — Start a new workflow → Load step-01 | ||
| - **[RESUME]** — Return to a previously run workflow (detect progress) | ||
| - **[SKIP-TO]** — Jump to a specific step (dev-only, requires confirmation) | ||
|
|
||
| --- | ||
|
|
||
| **Continue by choosing [NEW] or [RESUME]** |
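The State Tracking frontmatter quoted above implies a simple resume rule: continue from the first step whose status is not `completed`. A minimal sketch of that rule follows, assuming the progress map is stored as flat `key: "value"` lines in a file. The function name `next_step` and the file layout are assumptions for illustration, not part of the PR.

```shell
# Hypothetical helper (not part of the PR): print the first step key whose
# status is anything other than "completed". Prints nothing when all steps
# are done. Assumes lines such as:   step_03_build_deploy: "in_progress"
next_step() {
  local progress_file="$1"
  awk -F': *' '/^[ \t]*step_[0-9]+_/ {
      key = $1;    gsub(/[ \t]/, "", key)      # strip indentation
      status = $2; gsub(/[ \t"]/, "", status)  # strip quotes and padding
      if (status != "completed") { print key; exit }
  }' "${progress_file}"
}
```

With the example frontmatter from the workflow, `next_step` would print `step_03_build_deploy`, which is where a RESUME run would pick up.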
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
# Description: Check if setup-new-service documentation exists under docs/
# Search for setup-new-service or related documentation
fd -e md . docs/ -x grep -l "setup.new.service\|setup-new-service\|Setup Production Service" {} \;
# Check if PSM workflows are documented
fd -e md . docs/ -x grep -l "PSM workflow\|Production Systems.*workflow" {} \;
Repository: bmad-code-org/BMAD-METHOD
Length of output: 51
🏁 Script executed:
# Check repository structure
git ls-files | head -20
fd -type d -name docs -o -name documentation | head -10
Repository: bmad-code-org/BMAD-METHOD
Length of output: 842
🏁 Script executed:
# Check if docs/ exists and its contents
ls -la docs/ 2>/dev/null | head -20 || echo "docs/ not found at root"
find . -maxdepth 2 -type d -name docs -o -name doc 2>/dev/null
Repository: bmad-code-org/BMAD-METHOD
Length of output: 752
🏁 Script executed:
# Check if this workflow is a pre-existing file being converted or newly added
git log --oneline --all -- "src/psm/workflows/bmad-psm-setup-new-service/workflow.md" 2>/dev/null | head -5 || echo "File history not available"
Repository: bmad-code-org/BMAD-METHOD
Length of output: 107
🏁 Script executed:
# Check for other workflow files and their documentation patterns
fd -path "*/workflows/*" -name "workflow.md" | head -10
Repository: bmad-code-org/BMAD-METHOD
Length of output: 237
🏁 Script executed:
# Search for any documentation of PSM or setup workflows in the repo
rg -l "psm|Production.*Service|setup.*service" --type md | grep -E "(docs/|README)" | head -10
Repository: bmad-code-org/BMAD-METHOD
Length of output: 51
🏁 Script executed:
# Check docs subdirectories for workflow-related content
ls -la docs/explanation/ docs/how-to/ docs/reference/ 2>/dev/null | head -30
Repository: bmad-code-org/BMAD-METHOD
Length of output: 1622
🏁 Script executed:
# Search for any workflow documentation pattern
rg -i "workflow" docs/ --type md 2>/dev/null | head -10
Repository: bmad-code-org/BMAD-METHOD
Length of output: 1359
🏁 Script executed:
# Check if this is a new file or pre-existing
git show HEAD:src/psm/workflows/bmad-psm-setup-new-service/workflow.md 2>/dev/null | head -5 || echo "File does not exist in HEAD"
Repository: bmad-code-org/BMAD-METHOD
Length of output: 195
🏁 Script executed:
# Check for other PSM-related files and documentation
fd "psm" src/ --type f -name "*.md" | head -10
Repository: bmad-code-org/BMAD-METHOD
Length of output: 299
🏁 Script executed:
# Check if there's a skill-validator.md referenced in guidelines
cat tools/skill-validator.md 2>/dev/null | head -50 || echo "skill-validator.md not found"
Repository: bmad-code-org/BMAD-METHOD
Length of output: 2176
🏁 Script executed:
# Check if workflow-map.md exists
ls -la docs/reference/workflow-map.md 2>/dev/null || echo "workflow-map.md not found"
Repository: bmad-code-org/BMAD-METHOD
Length of output: 148
🏁 Script executed:
# Check docs/reference directory contents
ls -la docs/reference/
Repository: bmad-code-org/BMAD-METHOD
Length of output: 556
🏁 Script executed:
# Search for any pre-existing XML workflow versions
fd "\.xml" src/psm/
Repository: bmad-code-org/BMAD-METHOD
Length of output: 51
🏁 Script executed:
# Check git log with different approach to see file creation/modification
git log --all --pretty=format:"%h %s" -- "src/psm/workflows/bmad-psm-setup-new-service/" 2>/dev/null | head -20 || echo "No git history"
Repository: bmad-code-org/BMAD-METHOD
Length of output: 107
🏁 Script executed:
# Check if this is part of a broader conversion by looking at git diff or recent changes
git diff --name-status HEAD~1 HEAD 2>/dev/null | grep psm || echo "No recent PSM changes"
Repository: bmad-code-org/BMAD-METHOD
Length of output: 90
🏁 Script executed:
# Check what's in docs/reference for workflow references
cat docs/reference/workflow-map.md 2>/dev/null | head -50
Repository: bmad-code-org/BMAD-METHOD
Length of output: 3725
🏁 Script executed:
# Check if PSM workflows are documented in workflow-map.md
grep -i "psm\|setup.*service" docs/reference/workflow-map.md
Repository: bmad-code-org/BMAD-METHOD
Length of output: 51
🏁 Script executed:
# Check the full workflow-map to see all documented workflows
cat docs/reference/workflow-map.md
Repository: bmad-code-org/BMAD-METHOD
Length of output: 6581
🏁 Script executed:
# Check if this is a new workflow or if it's pre-existing
git log --all --oneline -- "src/psm/workflows/bmad-psm-setup-new-service/" 2>/dev/null | wc -l
Repository: bmad-code-org/BMAD-METHOD
Length of output: 70
🏁 Script executed:
# Check what PSM workflows exist in src/psm/workflows/
fd "workflow.md" src/psm/workflows/
Repository: bmad-code-org/BMAD-METHOD
Length of output: 406
🏁 Script executed:
# Check if other PSM workflows are also undocumented (to see if this is systemic)
for file in src/psm/workflows/*/workflow.md; do
workflow_name=$(basename $(dirname "$file"))
grep -i "$workflow_name\|psm" docs/reference/workflow-map.md 2>/dev/null && echo "FOUND: $workflow_name" || echo "MISSING: $workflow_name"
done
Repository: bmad-code-org/BMAD-METHOD
Length of output: 280
🏁 Script executed:
# Check git history for workflow-map.md to see if PSM was ever documented
git log --oneline -- docs/reference/workflow-map.md 2>/dev/null | head -10
Repository: bmad-code-org/BMAD-METHOD
Length of output: 107
🏁 Script executed:
# Check if there's a PSM section planned or mentioned in docs/
grep -r "PSM\|Production Systems" docs/ 2>/dev/null || echo "PSM not mentioned in docs"
Repository: bmad-code-org/BMAD-METHOD
Length of output: 94
🏁 Script executed:
# Check if other modules (like BMM) have their own documentation
ls -la src/bmm/workflows/ 2>/dev/null | head -10
Repository: bmad-code-org/BMAD-METHOD
Length of output: 51
🏁 Script executed:
# See when the PSM module was added to the repo
git log --all --name-status --pretty=format:"%h %s" -- "src/psm/" 2>/dev/null | head -20 || echo "No PSM history"
Repository: bmad-code-org/BMAD-METHOD
Length of output: 1087
Update documentation in docs/reference/workflow-map.md to include the PSM (Production Systems & MLOps) module and its six workflows.
The coding guidelines require that new features be reflected in corresponding documentation. The PSM module was newly added but completely lacks documentation in docs/reference/workflow-map.md. All six PSM workflows—including bmad-psm-setup-new-service—are missing from the workflow reference.
Add a new section to the workflow map (likely Phase 4: Implementation or a dedicated Production Systems section) with entries for:
- `bmad-psm-setup-new-service` (Setup Production Service for BMAD)
- `bmad-psm-production-readiness`
- `bmad-psm-incident-response`
- `bmad-psm-mlops-deployment`
- `bmad-psm-security-audit`
- `bmad-psm-quick-diagnose`
Include their purposes and primary outputs to match the documentation pattern used for BMM workflows.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/psm/workflows/bmad-psm-setup-new-service/workflow.md` around lines 1 -
116, Add a new "PSM / Production Systems Management" section to
docs/reference/workflow-map.md (place it under Phase 4: Implementation or as its
own Production Systems section) and list the six workflows from this diff:
bmad-psm-setup-new-service (Setup Production Service for BMAD),
bmad-psm-production-readiness, bmad-psm-incident-response,
bmad-psm-mlops-deployment, bmad-psm-security-audit, and bmad-psm-quick-diagnose;
for each entry include a one-line purpose and primary outputs (e.g.,
architecture docs, CI/CD pipelines, PRR checklist, runbooks, incident runbooks,
MLOps deployment artifacts, security audit reports, quick diagnostic playbooks)
following the same format and tone used by the existing BMM workflow entries so
the new workflows integrate consistently with the workflow-map structure.
| This workflow proceeds through 6 atomic steps, each focused on its own domain: | ||
|
|
||
| 1. **Step-01-Architecture** → Requirements + Architecture Pattern Selection | ||
| 2. **Step-02-API-Database** → API Design + Database Selection + Schema | ||
| 3. **Step-03-Build-Deploy** → CI/CD + Containerization + Testing Strategy | ||
| 4. **Step-04-Reliability** → Resilience Patterns + Observability + Error Handling | ||
| 5. **Step-05-Security-Infra** → Auth/Authz + Secrets + K8s Config | ||
| 6. **Step-06-Readiness** → PRR Checklist + Runbook + Go/No-Go Decision |
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
# Description: Verify that all referenced step files exist
workflow_dir="src/psm/workflows/bmad-psm-setup-new-service"
echo "Checking for step files in ${workflow_dir}..."
# Check for steps directory
if [ -d "${workflow_dir}/steps" ]; then
echo "✓ steps/ directory exists"
ls -la "${workflow_dir}/steps/"
else
echo "✗ steps/ directory not found"
fi
# Check for specific step files
for step in step-01-architecture step-02-api-database step-03-build-deploy step-04-reliability step-05-security-infra step-06-readiness; do
if [ -f "${workflow_dir}/steps/${step}.md" ]; then
echo "✓ ${step}.md exists"
else
echo "✗ ${step}.md NOT FOUND"
fi
done
# Check for SKILL.md and manifest
if [ -f "${workflow_dir}/SKILL.md" ]; then
echo "✓ SKILL.md exists"
else
echo "✗ SKILL.md NOT FOUND"
fi
if [ -f "${workflow_dir}/bmad-skill-manifest.yaml" ]; then
echo "✓ bmad-skill-manifest.yaml exists"
else
echo "✗ bmad-skill-manifest.yaml NOT FOUND"
fi
Repository: bmad-code-org/BMAD-METHOD
Length of output: 438
🏁 Script executed:
#!/bin/bash
# Search for any markdown files in the workflow directory
workflow_dir="src/psm/workflows/bmad-psm-setup-new-service"
echo "=== All markdown files in workflow directory ==="
find "${workflow_dir}" -name "*.md" -type f
echo ""
echo "=== Directory structure ==="
tree -L 2 "${workflow_dir}" 2>/dev/null || find "${workflow_dir}" -type f
Repository: bmad-code-org/BMAD-METHOD
Length of output: 382
Add missing step files or remove step references from the workflow.
The workflow references 6 steps (lines 37–44) but none of the corresponding step files exist in the repository:
- `step-01-architecture.md`
- `step-02-api-database.md`
- `step-03-build-deploy.md`
- `step-04-reliability.md`
- `step-05-security-infra.md`
- `step-06-readiness.md`
Either include these step files in the PR or update the workflow to reflect the actual implementation status.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/psm/workflows/bmad-psm-setup-new-service/workflow.md` around lines 37 -
44, The workflow lists six step files (step-01-architecture.md,
step-02-api-database.md, step-03-build-deploy.md, step-04-reliability.md,
step-05-security-infra.md, step-06-readiness.md) that are not present; either
add those missing markdown files with the corresponding step content (matching
the headings in workflow.md: Step-01-Architecture through Step-06-Readiness)
into the PR, or remove/update the six step references in workflow.md so the
workflow only points to existing step files or a consolidated document; ensure
filenames and the step identifiers in workflow.md (e.g., Step-01-Architecture,
Step-02-API-Database, etc.) match exactly.
PSM is now a standalone module at: https://github.com/DoanNgocCuong/bmad-module-production-systems
It's registered in external-official-modules.yaml for installer integration.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>