Enable operator SLO alerting in production (provision the PagerDuty + Slack Secrets) #33

Description

@stxkxs

The CR-state half of this issue is resolved by the kube-state-metrics addon work.
KSM now loads its customResourceState natively from the addon values
(addons/observability/kube-state-metrics/values.yaml, added in #44 and extended to the
full operator CRD surface in #55), so no ConfigMap mount or
--custom-resource-state-config-file wiring is needed. The duplicate copy in the operator
chart was removed in eks-agent-platform#48, which made the eks-gitops addon the single
source. The agent-* persona dashboards now read real kube_customresource_* series
(de-hollowed in #55).

Remaining scope

Set slo.alerting.enabled to true in addons/ai-platform/operator/values-production.yaml
once the following alerting Secrets exist in production:

  • pagerduty-platform
  • slack-webhook-{incidents,finance,ops,eng,platform}
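
For illustration, one of these Secrets might look like the sketch below. Only the Secret name comes from this issue. The namespace, the data key, and the placeholder value are hypothetical and must match whatever the chart's AlertmanagerConfig receivers actually reference. The brace shorthand above expands to five Slack Secrets: slack-webhook-incidents, slack-webhook-finance, slack-webhook-ops, slack-webhook-eng, and slack-webhook-platform.

```yaml
# Sketch only: namespace and data key are assumptions, not taken from the chart.
apiVersion: v1
kind: Secret
metadata:
  name: slack-webhook-incidents   # one of the names listed in this issue
  namespace: monitoring           # hypothetical; use the namespace the receivers expect
type: Opaque
stringData:
  url: "<slack-incoming-webhook-url>"   # hypothetical key; check the receiver's secret key reference
```

In practice these would more likely be provisioned through whatever secret-management path production already uses than committed as plain manifests.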

The operator chart's AlertmanagerConfig receivers reference these Secrets; the toggle
defaults to off and is documented in the chart NOTES and README. This is a
prod-enablement step gated on an external prerequisite (the Secrets), not a code change.
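
The enablement itself amounts to a one-key values override. A minimal sketch, assuming the chart reads the toggle at the slo.alerting.enabled path named above and that the rest of the file stays as it is:

```yaml
# addons/ai-platform/operator/values-production.yaml
# Enable only after pagerduty-platform and the slack-webhook-* Secrets exist in production.
slo:
  alerting:
    enabled: true   # chart default is off
```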

Note: in prod the in-cluster Alertmanager path is moot anyway — there's no
prometheus-operator/Alertmanager on the hub; SLO alerting runs through Grafana-managed
alert rules (GrafanaAlertRuleGroup → AMG). This toggle matters for the kx / kube-prometheus-stack
path. Confirm which alerting plane production actually uses before provisioning the Secrets.
