From 024299715ee010a2cfad8092f4b42ba479a9a2bb Mon Sep 17 00:00:00 2001
From: "Shih-Ting, Yuan"
Date: Wed, 28 Aug 2024 01:22:24 +0000
Subject: [PATCH] [post]: publish kubernetes ephemeral storage

---
 ...-08-27-Kubernetes-OOM-Ephemeral-Storage.md | 160 ++++++++++++++++++
 1 file changed, 160 insertions(+)
 create mode 100644 _posts/2024-08-27-Kubernetes-OOM-Ephemeral-Storage.md

diff --git a/_posts/2024-08-27-Kubernetes-OOM-Ephemeral-Storage.md b/_posts/2024-08-27-Kubernetes-OOM-Ephemeral-Storage.md
new file mode 100644
index 00000000000..92de7cad3af
--- /dev/null
+++ b/_posts/2024-08-27-Kubernetes-OOM-Ephemeral-Storage.md
@@ -0,0 +1,160 @@
---
title: "Kubernetes Eviction of Ephemeral Storage: Troubleshoot OOMKilled Pods That Won't Restart"
date: 2024-08-27 23:31:21 +0000
categories:
  - CSIE
  - Kubernetes
tags:
  - kubernetes
  - oom
  - ephemeral-storage
  - eviction
---

## Introduction

According to the Kubernetes documentation, when a container in a pod consumes more memory than its limit, the container is killed. A pod killed by OOM (an `OOMKilled` pod) will be restarted if the restart policy allows it:[^1]

> If the Container continues to consume memory beyond its limit, the Container is terminated. If a terminated Container can be restarted, the kubelet restarts it, as with any other type of runtime failure.

However, you may sometimes see an OOMKilled pod get stuck while the Kubernetes scheduler places a new pod instead of restarting the old one. This article records a scenario in which a pod's `ephemeral-storage` usage can prevent a killed pod from being restarted.

## The Scenario

In most cases, when a pod is killed by OOM, the pod is restarted. We can see this behaviour in the following `kubectl get pods` output after running the example in the Kubernetes documentation:[^1]

```bash
$ kubectl get pods -w
NAME                        READY   STATUS             RESTARTS         AGE
oom-demo-766c8d556d-2vm85   0/1     OOMKilled          11 (5m10s ago)   31m
oom-demo-766c8d556d-2vm85   0/1     CrashLoopBackOff   11 (15s ago)     31m
oom-demo-766c8d556d-2vm85   0/1     OOMKilled          12 (5m9s ago)    36m
oom-demo-766c8d556d-2vm85   0/1     CrashLoopBackOff   12 (15s ago)     36m
oom-demo-766c8d556d-2vm85   0/1     OOMKilled          13 (5m7s ago)    41m
oom-demo-766c8d556d-2vm85   0/1     CrashLoopBackOff   13 (14s ago)     41m
oom-demo-766c8d556d-2vm85   0/1     OOMKilled          14 (5m5s ago)    46m
```

However, there are situations where an OOMKilled pod is not restarted and stays stuck in the `OOMKilled` or `Error` status; instead, the Kubernetes scheduler starts a new pod:

```bash
$ kubectl get pods -o wide
NAME                        READY   STATUS    RESTARTS   AGE   IP               NODE
oom-demo-6f7747887b-7vx6v   0/1     Error     1          68s   192.167.35.208   ip-192-167-35-2.ec2.internal
oom-demo-6f7747887b-m8bh5   1/1     Running   0          1s    192.167.28.227   ip-192-167-19-202.ec2.internal
```

But why is the killed pod not restarted, when the `restartPolicy` for a Deployment's pods is `Always`[^2] and the kubelet should automatically restart the container after any termination?[^3]

## Explanation

This can happen when the pod is evicted because the node runs low on ephemeral storage. Ephemeral storage is the node-local, temporary storage that pods use for logs, scratch data, and other temporary files. Each node has a limited amount of ephemeral storage available to the pods running on it, and when the available ephemeral storage falls below the kubelet's eviction threshold, the kubelet evicts pods to reclaim space.

When a pod is evicted due to ephemeral-storage pressure, the kubelet marks the pod as `Failed` with the reason `Evicted`. An evicted pod is in a terminal state, so its containers are not restarted in place; the workload controller (here, the Deployment's ReplicaSet) creates a replacement pod, which the scheduler then places, possibly on another node.
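Evicted pods like this stay behind in the `Failed` phase instead of being restarted, so at some point they have to be cleaned up by hand. A minimal sketch for finding (and, after inspection, deleting) the leftovers, assuming the workload runs in the current namespace:

```bash
# List pods that ended up in the Failed phase (evicted pods show up here)
kubectl get pods --field-selector=status.phase=Failed -o wide

# After collecting the eviction details, the leftover pod objects can be removed in one shot
kubectl delete pods --field-selector=status.phase=Failed
```

Hold off on the delete until you have captured the eviction message, since it disappears together with the pod object.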
While the evicted pod still exists in the cluster, we can inspect it with `kubectl describe pod`:

```bash
$ kubectl describe pod oom-demo-6f7747887b-7vx6v
...
Status:           Failed
Reason:           Evicted
Message:          The node was low on resource: ephemeral-storage. Threshold quantity: 8583014118, available: 554332Ki.
...
  Warning  Evicted              8m18s  kubelet  The node was low on resource: ephemeral-storage. Threshold quantity: 8583014118, available: 554332Ki.
  Normal   Killing              8m18s  kubelet  Stopping container memory-demo-2-ctr
  Warning  ExceededGracePeriod  8m8s   kubelet  Container runtime did not kill the pod within specified grace period.
```

Here we can see that the pod was actually evicted because the node's available ephemeral storage fell below the eviction threshold. The threshold quantity is `8583014118` bytes (about 8 GB) on a node with an 80 GB disk, which matches the kubelet's default hard eviction threshold of `nodefs.available < 10%`.

If the pod is running on EKS with control plane logging enabled,[^4] we can also query the control plane logs with CloudWatch Logs Insights to see the same message:

```
fields @timestamp, verb, requestURI, objectRef.namespace, objectRef.resource, objectRef.name, userAgent, requestObject.status.message
  | filter verb not in ["get", "list", "watch"]
  | filter objectRef.name like "oom-demo-6f7747887b-7vx6v" or objectRef.name like "oom-demo-6f7747887b-m8bh5"
```

| @timestamp | verb | requestURI | objectRef.resource | objectRef.name | userAgent | requestObject.status.message |
| ----------------------- | ------ | ----------------------------------------------------------------- | ------------------ | ------------------------- | ----------------------------------------------------------------- | ----------------------------------------------------------------------------------------------------- |
| 2024-08-27 14:51:04.952 | patch | /api/v1/namespaces/default/pods/oom-demo-6f7747887b-m8bh5/status | pods | oom-demo-6f7747887b-m8bh5 | kubelet/v1.29.5 (linux/amd64) kubernetes/1109419 | |
| 2024-08-27 14:51:04.951 | create | /api/v1/namespaces/default/pods/oom-demo-6f7747887b-m8bh5/binding | pods | oom-demo-6f7747887b-m8bh5 | kube-scheduler/v1.29.6 (linux/arm64) kubernetes/c978c80/scheduler | |
| 2024-08-27 14:51:04.950 | patch | /api/v1/namespaces/default/pods/oom-demo-6f7747887b-7vx6v/status | pods | oom-demo-6f7747887b-7vx6v | kubelet/v1.29.5 (linux/amd64) kubernetes/1109419 | The node was low on resource: ephemeral-storage. Threshold quantity: 8583014118, available: 554332Ki. |

## Lab

To reproduce this scenario, we can create a Deployment with a container that consumes a large amount of ephemeral storage and memory.

1. Create a Deployment with a stress container:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: oom-demo
spec:
  replicas: 1
  selector:
    matchLabels:
      app: oom-demo
  template:
    metadata:
      labels:
        app: oom-demo
    spec:
      containers:
      - name: memory-demo
        image: polinux/stress
        resources:
          requests:
            memory: "50Mi"
          limits:
            memory: "100Mi"
        command: ["sleep"]
        args: ["3600"]
```
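Apply the manifest and check where the pod lands before moving on. A quick sketch, assuming the manifest above was saved as `oom-demo-deployment.yaml` (the file name is just a placeholder) and using the node name from the earlier output:

```bash
# Create the Deployment from the manifest saved in step 1
kubectl apply -f oom-demo-deployment.yaml

# Confirm the pod is running and note which node it was scheduled on
kubectl get pods -l app=oom-demo -o wide

# Inspect that node's ephemeral-storage capacity and allocatable values;
# by default the kubelet evicts pods once nodefs.available drops below 10%
kubectl describe node ip-192-167-35-2.ec2.internal | grep -E -A 6 "Capacity:|Allocatable:"
```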
2. Inside the container, create a large file with `fallocate` to fill up the ephemeral storage, then trigger the OOM kill with `stress`:

```bash
$ kubectl exec -it oom-demo-6f7747887b-7vx6v -- /bin/bash

bash-5.0# df -h; fallocate -l 75G storage.log; df -h; stress --vm 1 --vm-bytes 250M --vm-hang 1
Filesystem                Size      Used Available Use% Mounted on
overlay                  79.9G      4.4G     75.5G   6% /

Filesystem                Size      Used Available Use% Mounted on
overlay                  79.9G     79.4G    549.2M  99% /

stress: info: [16] dispatching hogs: 0 cpu, 0 io, 1 vm, 0 hdd
command terminated with exit code 137
```

At the same time, we can see that the pod `oom-demo-6f7747887b-7vx6v` was OOM-killed; in this case it was restarted once, then entered the `Error` status and was not restarted again, and a new pod was created instead:

```
$ kubectl get pods -o wide -w
NAME                        READY   STATUS              RESTARTS      AGE   IP               NODE
oom-demo-6f7747887b-7vx6v   1/1     Running             0             6s    192.167.35.208   ip-192-167-35-2.ec2.internal
oom-demo-6f7747887b-7vx6v   0/1     OOMKilled           0             29s   192.167.35.208   ip-192-167-35-2.ec2.internal
oom-demo-6f7747887b-7vx6v   1/1     Running             1 (1s ago)    30s   192.167.35.208   ip-192-167-35-2.ec2.internal
oom-demo-6f7747887b-7vx6v   0/1     Error               1 (39s ago)   68s                    ip-192-167-35-2.ec2.internal
oom-demo-6f7747887b-m8bh5   0/1     Pending             0             0s
oom-demo-6f7747887b-m8bh5   0/1     ContainerCreating   0             0s                     ip-192-167-19-202.ec2.internal
oom-demo-6f7747887b-7vx6v   0/1     Error               1             68s   192.167.35.208   ip-192-167-35-2.ec2.internal
oom-demo-6f7747887b-m8bh5   1/1     Running             0             1s    192.167.28.227   ip-192-167-19-202.ec2.internal
```

## Summary

In summary, when a Kubernetes pod is evicted because its node runs low on ephemeral storage, the kubelet marks the pod as `Failed` with the reason `Evicted`. The evicted pod is in a terminal state and is not restarted in place; instead, the workload controller creates a new pod, which the scheduler places on a node with enough resources. To prevent this issue and keep nodes with sufficient ephemeral storage capacity, we can monitor pod disk usage, rotate pod logs and data, or move large data onto dedicated storage such as a PV/PVC.

In the past, I thought a pod killed by OOM would always be restarted by the kubelet. From this scenario I learned that a pod that triggers an eviction keeps existing in the cluster even though the scheduler has already placed a new pod, and the evicted pod needs to be deleted manually.

## References

[^1]: [Exceed a Container's memory limit](https://kubernetes.io/docs/tasks/configure-pod-container/assign-memory-resource/#exceed-a-container-s-memory-limit)
[^2]: [Pod Template](https://kubernetes.io/docs/concepts/workloads/controllers/deployment/#pod-template)
[^3]: [Container restart policy](https://kubernetes.io/docs/concepts/workloads/pods/pod-lifecycle/#restart-policy)
[^4]: [Send control plane logs to CloudWatch Logs](https://docs.aws.amazon.com/eks/latest/userguide/control-plane-logs.html)