From edb6e9a851779cb0880dd9d72912e22073fce71e Mon Sep 17 00:00:00 2001
From: Tim Allclair
Date: Thu, 23 Sep 2021 16:33:25 -0700
Subject: [PATCH] fixup! [PodSecurity] Update monitoring proposal

---
 keps/sig-auth/2579-psp-replacement/README.md | 18 ++++++++++++------
 keps/sig-auth/2579-psp-replacement/kep.yaml  |  2 ++
 2 files changed, 14 insertions(+), 6 deletions(-)

diff --git a/keps/sig-auth/2579-psp-replacement/README.md b/keps/sig-auth/2579-psp-replacement/README.md
index 5bddda721034..cff196aa3c63 100644
--- a/keps/sig-auth/2579-psp-replacement/README.md
+++ b/keps/sig-auth/2579-psp-replacement/README.md
@@ -846,7 +846,7 @@ _This section must be completed when targeting alpha to a release._
   of the following metrics mean the feature is not working as expected:

   * `pod_security_evaluations_total{decision=deny,mode=enforce}`
-  * `pod_security_evaluations_total{decision=error,mode=enforce}`
+  * `pod_security_errors_total`

 * **Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested?**

@@ -867,15 +867,21 @@ _This section must be completed when targeting alpha to a release._
 * **What are the SLIs (Service Level Indicators) an operator can use to determine
 the health of the service?**
   - [x] Metrics
-    - Metric name: `pod_security_evaluations_total`
+    - Metric name: `pod_security_evaluations_total`, `pod_security_errors_total`
     - Components exposing the metric: `kube-apiserver`

 * **What are the reasonable SLOs (Service Level Objectives) for the above SLIs?**
-  - `pod_security_evaluations_total{decision=error}`
+  - `pod_security_errors_total`
     - any rising count of these metrics indicates an unexpected problem evaluating the policy
-  - `pod_security_evaluations_total{decision=error,mode=enforce}`
+  - `pod_security_errors_total{fatal=true}`
     - any rising count of these metrics indicates an unexpected problem evaluating the policy that
       is preventing pod write requests
+  - `pod_security_errors_total{fatal=false}`,
+    `pod_security_evaluations_total{decision=deny,mode=enforce,level=restricted,version=latest}`
+    - a rising count of non-fatal errors indicates an error resolving namespace policies, which
+      causes PodSecurity to default to enforcing `restricted:latest`
+    - a corresponding rise in `restricted:latest` denials may indicate that these errors are
+      preventing pod write requests
   - `pod_security_evaluations_total{decision=deny,mode=enforce}`
     - a rising count indicates that the policy is preventing pod creation as intended, but is
       preventing a user or controller from successfully writing pods
@@ -958,8 +964,8 @@ details). For now, we leave it here.
   - Testing: unit testing on configuration validation

 - Enforce mode rejects pods because invalid level/version defaulted to `restricted` level
-  - Detection: rising `pod_security_evaluations_total{decision=error,mode=enforce}` metric counts
-  - Mitigations:
+  - Detection: rising `pod_security_errors_total{fatal=false}` metric counts
+  - Mitigations: fix the malformed labels
   - Diagnostics:
     - Locate audit logs containing `pod-security.kubernetes.io/error` annotations on affected requests
     - Locate namespaces with malformed level labels:
diff --git a/keps/sig-auth/2579-psp-replacement/kep.yaml b/keps/sig-auth/2579-psp-replacement/kep.yaml
index fcd47daac3b5..479c7af0eba4 100644
--- a/keps/sig-auth/2579-psp-replacement/kep.yaml
+++ b/keps/sig-auth/2579-psp-replacement/kep.yaml
@@ -53,3 +53,5 @@ disable-supported: true
 # The following PRR answers are required at beta release
 metrics:
   - pod_security_evaluations_total
+  - pod_security_exemptions_total
+  - pod_security_errors_total