[PodSecurity] Update monitoring proposal #2990

Merged 3 commits on Oct 5, 2021
Changes from 2 commits
70 changes: 56 additions & 14 deletions keps/sig-auth/2579-psp-replacement/README.md
@@ -598,31 +598,67 @@ coverage of unit tests.

### Monitoring

A single metric will be added to track policy evaluations against pods and [templated pods].
[Namespace evaluations](#namespace-policy-update-warnings) are not counted.
Three metrics will be introduced:

```
pod_security_evaluations_total
```

This metric will be added to track policy evaluations against pods and [templated pods].
[Namespace evaluations](#namespace-policy-update-warnings) are not counted.
The metric will only be incremented when the policy check is actually performed. In other words,
this metric will not be incremented if any of the following are true:

- Ignored resource types, subresources, or workload resources without a pod template
- Update requests that are out of scope (see [Updates](#updates) above)
- Exempt requests
- Errors that make policy evaluation impossible

The metric will use the following labels:

1. `decision {exempt, allow, deny, error}` - The policy decision. Error is reserved for panics or
other errors in policy evaluation. Update requests that are out of scope (see [Updates](#updates)
above) are not counted.
1. `decision {allow, deny}` - The policy decision. `allow` is only recorded with `enforce` mode.
3. `policy_level {privileged, baseline, restricted}` - The policy level that the request was
evaluated against.
4. `policy_version {v1.X, v1.Y, latest, future}` - The policy version that was used for the evaluation.
Explicit versions less than or equal to the build of the API server or webhook are recorded in the form `v1.x` (e.g. `v1.22`).
Explicit versions greater than the build of the API server or webhook (which are evaluated as `latest`) are recorded as `future`.
Explicit use of the `latest` version or implicit use by omitting a version or specifying an unparseable version will be recorded as `latest`.
5. `mode {enforce, warn, audit}` - The type of evaluation mode being recorded. Note that a single
request can increment this metric 3 times, once for each mode. If this admission controller is
enabled, every create request and in-scope update request will at least increment the
`enforce` total. Privileged evaluations for warn and audit modes are not counted.
request can increment this metric 3 times, once for each mode. `audit` and `warn` mode metrics
are only incremented for violations. If this admission controller is enabled, every
evaluated request will at least increment the `enforce` total.
Comment on lines +627 to +629 (Member):

if someone wanted to figure out the proportion of allowed/denied audit or warn requests, they'd now have to compare the number of denied audit or warn requests to the total number of mode=enforce requests, right? that could be ok, but is non-obvious

Member Author:

Yeah. We could have a separate metric for tracking total evaluations, but that seems unnecessary. I agree it's non-obvious, but maybe it's something we can just add to the playbook...

6. `request_operation {create, update}` - The operation of the request being checked.
7. `resource {pod, controller}` - Whether the request object is a Pod, or a [templated
pod](#podtemplate-resources) resource.
8. `subresource {ephemeralcontainers}` - The subresource, when relevant & in scope.
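
The label rules above can be sketched as follows. This is a minimal Python illustration under stated assumptions, not the PodSecurity implementation: the names `policy_version_label`, `record_evaluation`, and `mode_total`, the tuple-keyed `Counter`, and the input shapes are all hypothetical, and the `subresource` label is left out for brevity. The last lines show the comparison raised in review: because `allow` is never recorded for `warn` or `audit`, their violation counts are compared against the `enforce` total.

```python
import re
from collections import Counter

# Hypothetical in-memory stand-in for pod_security_evaluations_total.
# Key order: decision, policy_level, policy_version, mode, request_operation, resource.
evaluations = Counter()


def policy_version_label(version, build_minor):
    """Map a configured version to the policy_version label value."""
    match = re.fullmatch(r"v1\.(\d+)", version or "")
    if match is None:
        return "latest"  # explicit latest, omitted, or unparseable
    if int(match.group(1)) > build_minor:
        return "future"  # newer than this build, evaluated as latest
    return version


def record_evaluation(policies, violations, build_minor, operation, resource):
    """Count one evaluated request.

    policies maps mode -> (level, version); violations maps mode -> bool.
    """
    for mode in ("enforce", "warn", "audit"):
        if mode != "enforce" and not violations[mode]:
            continue  # warn and audit are only counted for violations
        level, version = policies[mode]
        decision = "deny" if violations[mode] else "allow"
        label = policy_version_label(version, build_minor)
        evaluations[(decision, level, label, mode, operation, resource)] += 1


def mode_total(mode, decision=None):
    """Sum counts for a mode, optionally restricted to one decision."""
    return sum(
        count
        for key, count in evaluations.items()
        if key[3] == mode and decision in (None, key[0])
    )


# One create request: allowed by enforce, violates warn, passes audit.
record_evaluation(
    policies={
        "enforce": ("baseline", "v1.22"),
        "warn": ("restricted", "latest"),
        "audit": ("restricted", None),
    },
    violations={"enforce": False, "warn": True, "audit": False},
    build_minor=22,
    operation="create",
    resource="pod",
)

# Share of evaluated requests that triggered a warning, measured against
# the enforce total because warn "allow" decisions are never recorded.
warn_ratio = mode_total("warn", "deny") / mode_total("enforce")
```

In this sketch a single request increments the counter at most three times, and only the `enforce` series is guaranteed to move.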

```
pod_security_exemptions_total
```

This metric will be added to track requests that are considered exempt. Ignored resources and out of
scope requests do not count towards the total. Errors encountered before the exemption logic will
not be counted as exempt.

The metric will use the following labels. The definitions match the label definitions above.

1. `request_operation {create, update}`
2. `resource {pod, controller}`
3. `subresource {ephemeralcontainers}`
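
As an illustration of which requests reach the exemption counter, here is a small Python sketch. The `Request` fields are hypothetical flags standing in for checks the admission controller performs, and the ordering of the ignored and out-of-scope checks relative to the error check is an assumption; the text above only states that errors raised before the exemption logic are not counted as exempt.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class Request:
    # Hypothetical flags; the real checks inspect the admission request.
    ignored: bool = False
    out_of_scope: bool = False
    early_error: bool = False
    exempt: bool = False
    operation: str = "create"
    resource: str = "pod"
    subresource: str = ""


# Hypothetical stand-in for pod_security_exemptions_total.
# Key order: request_operation, resource, subresource.
exemptions = Counter()


def count_exemption(req):
    """Return True if the request was counted as exempt."""
    if req.ignored or req.out_of_scope:
        return False  # never counted by any of the three metrics' exempt path
    if req.early_error:
        return False  # failed before the exemption logic ran
    if not req.exempt:
        return False  # proceeds to policy evaluation instead
    exemptions[(req.operation, req.resource, req.subresource)] += 1
    return True


results = [
    count_exemption(Request(exempt=True)),
    count_exemption(Request(exempt=True, ignored=True)),
    count_exemption(Request(exempt=True, early_error=True)),
    count_exemption(Request(exempt=True, operation="update", resource="controller")),
]
```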

```
pod_security_errors_total
```

This metric will be added to track errors encountered during request evaluation.

The metric will use the following labels. Except for `fatal`, the definitions match the label definitions above.

1. `fatal {true, false}` - Whether the error prevented evaluation (short-circuit deny). If
`fatal=false` then the latest restricted profile may be used to evaluate the pod.
2. `request_operation {create, update}`
3. `resource {pod, controller}`
4. `subresource {ephemeralcontainers}`
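
The difference between the two `fatal` values can be sketched in Python. This is an assumption-laden illustration, not the actual code path: `resolve_policy`, `record_fatal`, and the validity checks are hypothetical, and treating a malformed version the same way as a malformed level is an assumption. A non-fatal error is recorded and evaluation continues against `restricted:latest`; a fatal error is recorded and the request is denied without evaluation.

```python
import re
from collections import Counter

# Hypothetical stand-in for pod_security_errors_total.
# Key order: fatal, request_operation, resource, subresource.
errors = Counter()

VALID_LEVELS = {"privileged", "baseline", "restricted"}


def resolve_policy(level, version, operation, resource, subresource=""):
    """Return the (level, version) to enforce for a namespace policy.

    Malformed input is recorded as a non-fatal error and falls back to
    restricted:latest (assumed behavior for this sketch).
    """
    version_ok = version == "latest" or re.fullmatch(r"v1\.\d+", version or "")
    if level in VALID_LEVELS and version_ok:
        return level, version
    errors[("false", operation, resource, subresource)] += 1
    return "restricted", "latest"


def record_fatal(operation, resource, subresource=""):
    """Record an error that prevents evaluation and short-circuit to deny."""
    errors[("true", operation, resource, subresource)] += 1
    return "deny"


good = resolve_policy("baseline", "v1.22", "create", "pod")
fallback = resolve_policy("basline", "v1.22", "create", "pod")  # typo in level
denied = record_fatal("update", "controller")
```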

### Audit Annotations

@@ -810,7 +846,7 @@ _This section must be completed when targeting alpha to a release._
of the following metrics mean the feature is not working as expected:

* `pod_security_evaluations_total{decision=deny,mode=enforce}`
* `pod_security_evaluations_total{decision=error,mode=enforce}`
* `pod_security_errors_total`

* **Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested?**

@@ -831,15 +867,21 @@ _This section must be completed when targeting alpha to a release._

* **What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service?**
- [x] Metrics
- Metric name: `pod_security_evaluations_total`
- Metric name: `pod_security_evaluations_total`, `pod_security_errors_total`
- Components exposing the metric: `kube-apiserver`

* **What are the reasonable SLOs (Service Level Objectives) for the above SLIs?**
- `pod_security_evaluations_total{decision=error}`
- `pod_security_errors_total`
- any rising count of these metrics indicates an unexpected problem evaluating the policy
- `pod_security_evaluations_total{decision=error,mode=enforce}`
- `pod_security_errors_total{fatal=true}`
- any rising count of these metrics indicates an unexpected problem evaluating the policy that
is preventing pod write requests
- `pod_security_errors_total{fatal=false}`,
`pod_security_evaluations_total{decision=deny,mode=enforce,policy_level=restricted,policy_version=latest}`
- a rising count of non-fatal errors indicates an error resolving namespace policies, which
causes PodSecurity to default to enforcing `restricted:latest`
- a corresponding rise in `restricted:latest` denials may indicate that these errors are
preventing pod write requests
- `pod_security_evaluations_total{decision=deny,mode=enforce}`
- a rising count indicates that the policy is rejecting pods as designed, but that a user or
controller is being prevented from successfully writing pods
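
A "rising count" check of the kind these SLOs describe can be sketched by comparing two scrapes of the counters. The sample layout and the helper names `matching_total` and `increase` are hypothetical, the numbers are made up for illustration, and counter resets are not handled; a monitoring system would normally compute this with its own rate or increase functions.

```python
def matching_total(samples, name, **labels):
    """Sum values of a metric whose labels include every given pair."""
    return sum(
        value
        for (metric, sample_labels), value in samples.items()
        if metric == name and labels.items() <= dict(sample_labels).items()
    )


def increase(before, after, name, **labels):
    """Growth of the selected series between two scrapes."""
    return matching_total(after, name, **labels) - matching_total(before, name, **labels)


# Illustrative scrapes keyed by (metric name, label pairs).
FATAL = ("pod_security_errors_total", (("fatal", "true"), ("resource", "pod")))
NONFATAL = ("pod_security_errors_total", (("fatal", "false"), ("resource", "pod")))
DENY = (
    "pod_security_evaluations_total",
    (("decision", "deny"), ("mode", "enforce"), ("resource", "pod")),
)

before = {FATAL: 0, NONFATAL: 2, DENY: 5}
after = {FATAL: 0, NONFATAL: 6, DENY: 9}

nonfatal_rise = increase(before, after, "pod_security_errors_total", fatal="false")
fatal_rise = increase(before, after, "pod_security_errors_total", fatal="true")
deny_rise = increase(
    before, after, "pod_security_evaluations_total", decision="deny", mode="enforce"
)
```

Here the non-fatal error series and the enforce denials rise together, which is the pattern the SLO above associates with namespace policy resolution errors blocking pod writes.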
@@ -922,8 +964,8 @@ details). For now, we leave it here.
- Testing: unit testing on configuration validation

- Enforce mode rejects pods because invalid level/version defaulted to `restricted` level
- Detection: rising `pod_security_evaluations_total{decision=error,mode=enforce}` metric counts
- Mitigations:
- Detection: rising `pod_security_errors_total{fatal=false}` metric counts
- Mitigations: fix the malformed labels
- Diagnostics:
- Locate audit logs containing `pod-security.kubernetes.io/error` annotations on affected requests
- Locate namespaces with malformed level labels:
2 changes: 2 additions & 0 deletions keps/sig-auth/2579-psp-replacement/kep.yaml
@@ -53,3 +53,5 @@ disable-supported: true
# The following PRR answers are required at beta release
metrics:
- pod_security_evaluations_total
- pod_security_exemptions_total
- pod_security_errors_total