The CR-state half of this issue is resolved by the kube-state-metrics addon work:
KSM now loads its customResourceState natively from the addon values
(addons/observability/kube-state-metrics/values.yaml, added in #44, deepened to the
full operator CRD surface in #55) — no ConfigMap mount or --custom-resource-state-config-file
wiring is needed, and the duplicate copy in the operator chart was removed
(eks-agent-platform#48, which made the eks-gitops addon the single source). The agent-*
persona dashboards read real kube_customresource_* series (de-hollowed in #55).
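
For orientation, a minimal sketch of what values-driven customResourceState can look like. It assumes the addon passes values through to the upstream prometheus-community kube-state-metrics chart, which exposes customResourceState.enabled and customResourceState.config. The group, kind, and metric name are hypothetical placeholders, not the operator's actual CRDs, and the real file may nest these keys differently.

```yaml
# Sketch only: group/kind/metric names below are hypothetical placeholders.
customResourceState:
  enabled: true
  config:
    kind: CustomResourceStateMetrics
    spec:
      resources:
        - groupVersionKind:
            group: example.com        # hypothetical API group
            version: v1alpha1         # hypothetical version
            kind: Agent               # hypothetical kind
          metrics:
            - name: agent_info        # exposed as kube_customresource_agent_info
              help: "Identity labels for each Agent resource"
              each:
                type: Info
                info:
                  labelsFromPath:
                    name: [metadata, name]
                    namespace: [metadata, namespace]

# KSM also needs list/watch permission on the custom resources it reads.
rbac:
  extraRules:
    - apiGroups: ["example.com"]      # hypothetical API group
      resources: ["agents"]           # hypothetical plural
      verbs: ["list", "watch"]
```

With the default metric prefix, series defined this way surface under kube_customresource_*, which is the family the persona dashboards query.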
Remaining scope
Flip slo.alerting.enabled: true in addons/ai-platform/operator/values-production.yaml
once the alerting Secrets exist in production:
pagerduty-platform
slack-webhook-{incidents,finance,ops,eng,platform}
The operator chart's AlertmanagerConfig receivers reference these Secrets; the toggle defaults
to off and is documented in the chart's NOTES and README. This is a prod-enablement step
gated on an external prerequisite (the Secrets), not a code change.
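
As a sketch, the eventual change is a single value override, assuming the dotted key maps onto nested YAML in the usual Helm way. The surrounding contents of the production values file are not shown here, and the comments simply restate the Secret names listed above.

```yaml
# addons/ai-platform/operator/values-production.yaml (fragment, sketch)
# Prerequisite: these Secrets must already exist in production:
#   pagerduty-platform
#   slack-webhook-incidents
#   slack-webhook-finance
#   slack-webhook-ops
#   slack-webhook-eng
#   slack-webhook-platform
slo:
  alerting:
    enabled: true   # defaults to off; enable only after the Secrets are provisioned
```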
Note: in prod the in-cluster Alertmanager path is moot anyway — there's no
prometheus-operator/Alertmanager on the hub; SLO alerting runs through Grafana-managed
alert rules (GrafanaAlertRuleGroup → AMG). This toggle matters for the kx / kube-prometheus-stack
path. Confirm which alerting plane production actually uses before provisioning the Secrets.