add a blog post on separate image filesystem
kannon92 committed Nov 16, 2023
1 parent 470b612 commit 4cd2987
Showing 1 changed file with 24 additions and 17 deletions.
41 changes: 24 additions & 17 deletions content/en/blog/_posts/2023-12-05-image-filesystem.md
@@ -5,11 +5,11 @@ date: 2023-12-05
slug: kubernetes-separate-imagefilesystem
---

**Authors:** Kevin Hannon (Red Hat)

A common issue when operating Kubernetes clusters is running out of disk space.
When the node is provisioned, you should aim to have a good amount of storage space for the container runtime.
The container runtime usually writes to the `/var` partition.
CRI-O, by default, writes its containers and images to `/var/lib/containers`, while containerd writes its containers and images to `/var/lib/containerd`.

In this blog post, we want to bring attention to ways that you can configure your container runtime to store its content separately from the default partition.
@@ -35,13 +35,13 @@ The container runtime has two different areas of storage for containers and imag

### CRI-O

CRI-O uses a storage configuration file to control how the container runtime stores persistent data and temporary data.
CRI-O utilizes the [storage library](https://github.com/containers/storage).
Some Linux distributions have a manual entry for storage (`man 5 containers-storage.conf`).
The main configuration for storage is located in `/etc/containers/storage.conf`, where you can control the location for temporary data and the root directory.
The root directory is where CRI-O stores the persistent data.

```toml
[storage]

# Default Storage Driver
@@ -52,16 +52,16 @@ runroot = "/var/run/containers/storage"
graphroot = "/var/lib/containers/storage"
```

- Graphroot
  - Persistent data stored from the container runtime.
  - If SELinux is enabled, the SELinux labels of this directory must match those of `/var/lib/containers/storage`.
- Runroot
  - Temporary read/write access for the container.
  - Recommended to have this on a temporary filesystem.

A quick way to relabel your graphroot directory to match `/var/lib/containers/storage` is:

```bash
semanage fcontext -a -e /var/lib/containers/storage /YOUR_STORAGE_PATH
restorecon -R -v /YOUR_STORAGE_PATH
```
@@ -83,14 +83,16 @@ The relevant fields for containerd storage are `root` and `state`.

## Kubernetes Node Pressure Eviction

Kubernetes will automatically detect if the container filesystem is split from the node filesystem. When you separate the filesystems, Kubernetes is responsible for monitoring both the node filesystem and the container runtime filesystem.
Kubernetes documentation refers to the node filesystem and the container runtime filesystem as nodefs and imagefs.
If either nodefs or imagefs is running out of disk space, then the overall node is considered to have disk pressure.
Kubernetes will first reclaim space by deleting unused containers and images, and then it will resort to evicting pods.
If a node has an imagefs, then Kubernetes will garbage collect unused images on imagefs and will remove dead containers and pods from the nodefs.
If there is only a nodefs, then Kubernetes garbage collects dead containers, dead pods and unused images.
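
To check by hand whether a node has a split filesystem, one option is to compare the device IDs of the two directories. The following is a minimal sketch, not a description of how the kubelet performs its own detection. The function name `on_separate_filesystems` is hypothetical, and the `/var/lib/kubelet` path is an assumption about where the node-side data lives; the graphroot path is the CRI-O default mentioned above.

```python
import os


def on_separate_filesystems(path_a: str, path_b: str) -> bool:
    """Return True if the two paths report different device IDs.

    os.stat(...).st_dev identifies the device that holds each path, so
    differing values suggest the paths sit on different filesystems.
    """
    return os.stat(path_a).st_dev != os.stat(path_b).st_dev


if __name__ == "__main__":
    # Assumed locations: adjust to your node layout and container runtime.
    node_dir = "/var/lib/kubelet"
    image_dir = "/var/lib/containers/storage"
    if os.path.exists(node_dir) and os.path.exists(image_dir):
        print(on_separate_filesystems(node_dir, image_dir))
```

If this prints `True`, the node likely has a separate imagefs, and the imagefs signals discussed in this post become relevant.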

Kubernetes allows more configurations for determining if your disk is full.
The eviction manager in Kubelet defines API fields to signal pressure.
For filesystems, the relevant signals are `nodefs.available`, `nodefs.inodesFree`, `imagefs.available`, and `imagefs.inodesFree`.
If there is no container runtime filesystem then imagefs is ignored.

Users can use the existing defaults:
@@ -100,16 +102,16 @@ Users can use the existing defaults:
- imagefs.available<15%
- nodefs.inodesFree<5% (Linux nodes)
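
To make the percentage thresholds concrete, here is a rough sketch of how a threshold such as `imagefs.available<15%` could be evaluated against a mounted filesystem. It is an illustration only: the helper names are hypothetical, and the numbers read through `os.statvfs` are an assumption that may differ from the kubelet's own accounting.

```python
import os


def available_percent(available: int, capacity: int) -> float:
    """Percentage of capacity still available (0.0 when capacity is 0)."""
    if capacity <= 0:
        return 0.0
    return 100.0 * available / capacity


def breaches(available: int, capacity: int, threshold_percent: float) -> bool:
    """True when the available share has dropped below the threshold."""
    return available_percent(available, capacity) < threshold_percent


def filesystem_stats(path: str) -> dict:
    """Return (available, capacity) pairs for bytes and inodes at path."""
    st = os.statvfs(path)
    return {
        "available": (st.f_bavail * st.f_frsize, st.f_blocks * st.f_frsize),
        "inodesFree": (st.f_favail, st.f_files),
    }


if __name__ == "__main__":
    # Assumed mount point; point this at your container storage directory.
    stats = filesystem_stats("/")
    print("below 15% available:", breaches(*stats["available"], 15.0))
    print("below 5% inodes free:", breaches(*stats["inodesFree"], 5.0))
```

The comparison is strict, so a filesystem sitting exactly at the threshold is not reported as breaching it in this sketch.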

Kubernetes allows you to set user-defined values in `EvictionHard` and `EvictionSoft` in the Kubelet Configuration.

`EvictionHard` defines limits; once these limits are exceeded, pods will be evicted without any grace period.
`EvictionSoft` defines limits; once these limits are exceeded, pods will be evicted with a grace period that can be set per signal.

If you specify `EvictionHard`, it will replace the defaults. This means it is important to set all signals in your configuration.
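
Because a custom `EvictionHard` replaces the defaults, a quick check for omitted signals can catch mistakes before a configuration is rolled out. The sketch below covers only the four filesystem signals named in this post; the function name and the sample threshold values are hypothetical, and a real configuration will usually contain other signals as well.

```python
FILESYSTEM_SIGNALS = (
    "nodefs.available",
    "nodefs.inodesFree",
    "imagefs.available",
    "imagefs.inodesFree",
)


def missing_filesystem_signals(eviction_hard: dict) -> list:
    """Return the filesystem signals that the given mapping omits."""
    return [s for s in FILESYSTEM_SIGNALS if s not in eviction_hard]


if __name__ == "__main__":
    # Hypothetical partial configuration with the inode signals left out.
    partial = {"nodefs.available": "10%", "imagefs.available": "15%"}
    print(missing_filesystem_signals(partial))
    # → ['nodefs.inodesFree', 'imagefs.inodesFree']
```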

For example, the following Kubelet Configuration could be used to configure eviction signals and grace period options.

```yaml
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
address: "192.168.0.8"
@@ -136,5 +138,10 @@ evictionSoftGracePeriod:
evictionMaxPodGracePeriod: 60s
```
### Problems
You can use the default settings or specify your own `EvictionHard` settings. If you miss a signal, then Kubernetes will not monitor that resource.
One common misconfiguration administrators or users can hit is mounting a new filesystem to `/var/lib/containers/storage` or `/var/lib/containerd`. Kubernetes will detect a separate filesystem, so you want to make sure that `imagefs.inodesFree` and `imagefs.available` are set appropriately in this case.

Another area of confusion is that ephemeral storage reporting does not change if you attach an imagefs. Kubernetes will always report ephemeral storage capacity and allocations based on the nodefs. Pods that use the writable layer on a node with an imagefs will write to the imagefs, and they can also write to the nodefs if the pod is writing to an emptyDir volume.
