Commit 8e40772

Move "Understand PSI Metrics" into a new reference doc
The new reference doc talks about how to generate CPU / memory / I/O pressures with test workloads, and how to interpret PSI metrics through both the Summary API and the Prometheus metrics.
1 parent 1a1d3f0 commit 8e40772

4 files changed: +207 -235 lines changed

content/en/docs/concepts/cluster-administration/system-metrics.md

Lines changed: 1 addition & 128 deletions
@@ -199,134 +199,7 @@ container_pressure_io_waiting_seconds_total
This feature is enabled by default, by setting the `KubeletPSI` [feature gate](/docs/reference/command-line-tools-reference/feature-gates/). The information is also exposed in the
[Summary API](/docs/reference/instrumentation/node-metrics#psi).

**Removed:**

#### Understanding PSI Metrics

Pressure Stall Information (PSI) metrics are provided for three resources: CPU, memory, and I/O. They are categorized into two main types of pressure: `some` and `full`.

* **`some`**: This value indicates that some tasks (one or more) are stalled on a resource. For example, if some tasks are waiting for I/O, this metric will increase. This can be an early indicator of resource contention.
* **`full`**: This value indicates that *all* non-idle tasks are stalled on a resource simultaneously. This signifies a more severe resource shortage, where the entire system is unable to make progress.

Each pressure type provides four metrics: `avg10`, `avg60`, `avg300`, and `total`. The `avg` values represent the percentage of wall-clock time that tasks were stalled over 10-second, 60-second, and 300-second (5-minute) moving averages. The `total` value is a cumulative counter, in microseconds, showing the total time tasks have been stalled.
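
The kubelet surfaces these values from the Linux kernel's PSI interface, and the raw pressure files on a node use the same `some`/`full` and `avg10`/`avg60`/`avg300`/`total` layout. As a point of reference, you can inspect the node-level data directly (the values shown here are purely illustrative):

```shell
# On a Linux node with PSI support enabled, show the node-level I/O pressure
cat /proc/pressure/io
```

The output is similar to:

```
some avg10=0.21 avg60=0.45 avg300=0.30 total=312544
full avg10=0.10 avg60=0.24 avg300=0.16 total=205778
```

The container-level metrics described on this page are derived from the per-cgroup equivalents of these files (`cpu.pressure`, `memory.pressure`, and `io.pressure` on cgroup v2).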

#### Example Scenarios

You can use a simple Pod with a stress-testing tool to generate resource pressure and observe the PSI metrics. The following examples use the `agnhost` container image, which includes the `stress` tool.

The examples show how to query the kubelet's `/metrics/cadvisor` endpoint to observe the Prometheus metrics.

**Example 1: Generating CPU Pressure**

Create a Pod that generates CPU pressure using the `stress` utility. This workload will put a heavy load on one CPU core.

Create a file named `cpu-pressure-pod.yaml`:
```yaml
apiVersion: v1
kind: Pod
metadata:
  name: cpu-pressure-pod
spec:
  restartPolicy: Never
  containers:
  - name: cpu-stress
    image: registry.k8s.io/e2e-test-images/agnhost:2.47
    args:
    - "stress"
    - "--cpus"
    - "1"
```

Apply it to your cluster: `kubectl apply -f cpu-pressure-pod.yaml`

After the Pod is running, query the `/metrics/cadvisor` endpoint to see the `container_pressure_cpu_waiting_seconds_total` metric.
```shell
# Replace <node-name> with the name of the node where the pod is running
kubectl get --raw "/api/v1/nodes/<node-name>/proxy/metrics/cadvisor" | \
  grep 'container_pressure_cpu_waiting_seconds_total' | grep 'pod="cpu-pressure-pod"'
```
The output should show an increasing value, indicating that the container is spending time stalled waiting for CPU resources.
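
A matching sample line, with made-up values and most labels elided, looks roughly like this (the real exposition carries additional labels such as `namespace` and `id`):

```
container_pressure_cpu_waiting_seconds_total{container="cpu-stress", ... ,pod="cpu-pressure-pod"} 42.7
```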

Clean up the Pod when you are finished:
```shell
kubectl delete pod cpu-pressure-pod
```

**Example 2: Generating Memory Pressure**

This example creates a Pod that continuously writes to files in the container's writable layer, causing the kernel's page cache to grow and forcing memory reclamation, which generates pressure.

Create a file named `memory-pressure-pod.yaml`:
```yaml
apiVersion: v1
kind: Pod
metadata:
  name: memory-pressure-pod
spec:
  restartPolicy: Never
  containers:
  - name: memory-stress
    image: registry.k8s.io/e2e-test-images/agnhost:2.47
    command: ["/bin/sh", "-c"]
    args:
    - "i=0; while true; do dd if=/dev/zero of=testfile.$i bs=1M count=50 >/dev/null 2>&1; i=$(((i+1)%5)); sleep 0.1; done"
    resources:
      limits:
        memory: "200M"
      requests:
        memory: "200M"
```

Apply it to your cluster: `kubectl apply -f memory-pressure-pod.yaml`

After the Pod is running, query the `/metrics/cadvisor` endpoint to see the `container_pressure_memory_waiting_seconds_total` metric.
```shell
# Replace <node-name> with the name of the node where the pod is running
kubectl get --raw "/api/v1/nodes/<node-name>/proxy/metrics/cadvisor" | \
  grep 'container_pressure_memory_waiting_seconds_total' | grep 'pod="memory-pressure-pod"'
```
In the output, you will observe an increasing value for the metric, indicating that the system is under significant memory pressure.

Clean up the Pod when you are finished:
```shell
kubectl delete pod memory-pressure-pod
```

**Example 3: Generating I/O Pressure**

This Pod generates I/O pressure by repeatedly writing a file to disk and using `sync` to flush the data from memory, which creates I/O stalls.

Create a file named `io-pressure-pod.yaml`:
```yaml
apiVersion: v1
kind: Pod
metadata:
  name: io-pressure-pod
spec:
  restartPolicy: Never
  containers:
  - name: io-stress
    image: registry.k8s.io/e2e-test-images/agnhost:2.47
    command: ["/bin/sh", "-c"]
    args:
    - "while true; do dd if=/dev/zero of=testfile bs=1M count=128 >/dev/null 2>&1; sync; rm testfile >/dev/null 2>&1; done"
```

Apply this to your cluster: `kubectl apply -f io-pressure-pod.yaml`

After the Pod is running, query the `/metrics/cadvisor` endpoint to see the `container_pressure_io_waiting_seconds_total` metric.
```shell
# Replace <node-name> with the name of the node where the pod is running
kubectl get --raw "/api/v1/nodes/<node-name>/proxy/metrics/cadvisor" | \
  grep 'container_pressure_io_waiting_seconds_total' | grep 'pod="io-pressure-pod"'
```
You will see the metric's value increase as the Pod continuously writes to disk.

Clean up the Pod when you are finished:
```shell
kubectl delete pod io-pressure-pod
```

**Added:**

You can learn how to interpret the PSI metrics in [Understand PSI Metrics](/docs/reference/instrumentation/understand-psi-metrics/).

#### Requirements

content/en/docs/reference/instrumentation/node-metrics.md

Lines changed: 1 addition & 107 deletions
@@ -54,113 +54,7 @@ See [Summary API](/docs/reference/config-api/kubelet-stats.v1alpha1/) for detail
This feature is enabled by default, by setting the `KubeletPSI` [feature gate](/docs/reference/command-line-tools-reference/feature-gates/). The information is also exposed in
[Prometheus metrics](/docs/concepts/cluster-administration/system-metrics#psi-metrics).

**Removed:**

#### Understanding PSI Metrics

Pressure Stall Information (PSI) metrics are provided for three resources: CPU, memory, and I/O. They are categorized into two main types of pressure: `some` and `full`.

* **`some`**: This value indicates that some tasks (one or more) are stalled on a resource. For example, if some tasks are waiting for I/O, this metric will increase. This can be an early indicator of resource contention.
* **`full`**: This value indicates that *all* non-idle tasks are stalled on a resource simultaneously. This signifies a more severe resource shortage, where the entire system is unable to make progress.

Each pressure type provides four metrics: `avg10`, `avg60`, `avg300`, and `total`. The `avg` values represent the percentage of wall-clock time that tasks were stalled over 10-second, 60-second, and 300-second (5-minute) moving averages. The `total` value is a cumulative counter, in microseconds, showing the total time tasks have been stalled.

#### Example Scenarios

You can use a simple Pod with a stress-testing tool to generate resource pressure and observe the PSI metrics. The following examples use the `agnhost` container image, which includes the `stress` tool.

First, watch the summary stats for your node. In a separate terminal, run:
```shell
# Replace <node-name> with the name of a node in your cluster
kubectl get --raw "/api/v1/nodes/<node-name>/proxy/stats/summary" | jq '.pods[] | select(.podRef.name | contains("pressure-pod"))'
```

**Example 1: Generating CPU Pressure**

Create a Pod that generates CPU pressure using the `stress` utility. This workload will put a heavy load on one CPU core.

Create a file named `cpu-pressure-pod.yaml`:
```yaml
apiVersion: v1
kind: Pod
metadata:
  name: cpu-pressure-pod
spec:
  restartPolicy: Never
  containers:
  - name: cpu-stress
    image: registry.k8s.io/e2e-test-images/agnhost:2.47
    args:
    - "stress"
    - "--cpus"
    - "1"
```

Apply it to your cluster: `kubectl apply -f cpu-pressure-pod.yaml`

After the Pod is running, you will see the `some` PSI metrics for CPU increase in the summary API output. The `avg10` value for `some` pressure should rise above zero, indicating that tasks are spending time stalled on the CPU.
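
If you want to drill down to just that signal, you can extend the `jq` filter. The `.cpu.psi` path below is an assumption based on the [Summary API](/docs/reference/config-api/kubelet-stats.v1alpha1/) schema, and the sample output uses invented values:

```shell
# Replace <node-name> with the name of the node where the pod is running
kubectl get --raw "/api/v1/nodes/<node-name>/proxy/stats/summary" | \
  jq '.pods[] | select(.podRef.name == "cpu-pressure-pod") | .containers[].cpu.psi'
```

The output is similar to:

```
{
  "some": { "avg10": 1.25, "avg60": 0.8, "avg300": 0.45, "total": 13520000 },
  "full": { "avg10": 0, "avg60": 0, "avg300": 0, "total": 0 }
}
```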

Clean up the Pod when you are finished:
```shell
kubectl delete pod cpu-pressure-pod
```

**Example 2: Generating Memory Pressure**

This example creates a Pod that continuously writes to files in the container's writable layer, causing the kernel's page cache to grow and forcing memory reclamation, which generates pressure.

Create a file named `memory-pressure-pod.yaml`:
```yaml
apiVersion: v1
kind: Pod
metadata:
  name: memory-pressure-pod
spec:
  restartPolicy: Never
  containers:
  - name: memory-stress
    image: registry.k8s.io/e2e-test-images/agnhost:2.47
    command: ["/bin/sh", "-c"]
    args:
    - "i=0; while true; do dd if=/dev/zero of=testfile.$i bs=1M count=50 >/dev/null 2>&1; i=$(((i+1)%5)); sleep 0.1; done"
    resources:
      limits:
        memory: "200M"
      requests:
        memory: "200M"
```

Apply it to your cluster. In the summary output, you will observe an increase in the `full` PSI metrics for memory, indicating that the system is under significant memory pressure.
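
To watch that increase over time, you can poll the Summary API for just this Pod's memory PSI data; the `.memory.psi` path is again an assumption based on the Summary API schema and may need adjusting:

```shell
# Replace <node-name> with the name of the node where the pod is running
while true; do
  kubectl get --raw "/api/v1/nodes/<node-name>/proxy/stats/summary" | \
    jq '.pods[] | select(.podRef.name == "memory-pressure-pod") | .containers[].memory.psi.full.avg10'
  sleep 5
done
```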

Clean up the Pod when you are finished:
```shell
kubectl delete pod memory-pressure-pod
```

**Example 3: Generating I/O Pressure**

This Pod generates I/O pressure by repeatedly writing a file to disk and using `sync` to flush the data from memory, which creates I/O stalls.

Create a file named `io-pressure-pod.yaml`:
```yaml
apiVersion: v1
kind: Pod
metadata:
  name: io-pressure-pod
spec:
  restartPolicy: Never
  containers:
  - name: io-stress
    image: registry.k8s.io/e2e-test-images/agnhost:2.47
    command: ["/bin/sh", "-c"]
    args:
    - "while true; do dd if=/dev/zero of=testfile bs=1M count=128 >/dev/null 2>&1; sync; rm testfile >/dev/null 2>&1; done"
```

Apply this to your cluster. You will see the `some` PSI metrics for I/O increase as the Pod continuously writes to disk.

Clean up the Pod when you are finished:
```shell
kubectl delete pod io-pressure-pod
```

**Added:**

You can learn how to interpret the PSI metrics in [Understand PSI Metrics](/docs/reference/instrumentation/understand-psi-metrics/).

### Requirements