add bootstrap option to create a local NVMe RAID-0 for kubelet and containerd #1171

Merged 1 commit on Apr 25, 2023
19 changes: 19 additions & 0 deletions doc/USER_GUIDE.md
@@ -12,6 +12,7 @@ This document includes details about using the AMI template and the resulting AMIs
1. [Updating known instance types](#updating-known-instance-types)
1. [Version-locked packages](#version-locked-packages)
1. [Image credential provider plugins](#image-credential-provider-plugins)
1. [Ephemeral Storage](#ephemeral-storage)

---

@@ -321,3 +322,21 @@ ECR credentials should now be obtained via a plugin, the `ecr-credential-provider`
Additional image credential provider plugins may be appended to `/etc/eks/image-credential-provider/config.json`. In Kubernetes versions 1.26 and below, all plugins in this file must support `credentialprovider.kubelet.k8s.io/v1alpha1`. In Kubernetes versions 1.27 and above, they must support `credentialprovider.kubelet.k8s.io/v1`.

For more information about image credential provider plugins, refer to the [Kubernetes documentation](https://kubernetes.io/docs/tasks/administer-cluster/kubelet-credential-provider/).

---

## Ephemeral Storage

Some instance types launch with ephemeral NVMe instance storage (e.g. i3, i4i, c5d, c6id). There are two main ways of utilizing this storage within Kubernetes: creating a single RAID-0 array for use by kubelet and containerd, or mounting the individual disks for pod usage.

The EKS Optimized AMI includes a utility script to configure ephemeral storage. The script can be invoked by passing the `--local-disks <raid0 | mount>` flag to `/etc/eks/bootstrap.sh`, or by running `/bin/setup-local-disks` directly. All disks are formatted with an XFS file system.
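
For example, a minimal invocation sketch (the cluster name `my-cluster` is a placeholder):

```bash
# At instance launch, via the bootstrap script:
/etc/eks/bootstrap.sh my-cluster --local-disks raid0

# Or by running the utility directly:
/bin/setup-local-disks raid0
```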

Below are details on the two disk setup options:

### RAID-0 for Kubelet and Containerd (raid0)

A RAID-0 array is set up that includes all ephemeral NVMe instance storage disks. The containerd and kubelet state directories (`/var/lib/containerd` and `/var/lib/kubelet`) are then bind-mounted onto the array, providing larger and faster node ephemeral storage. The node's ephemeral storage is shared among pods that request ephemeral storage and the container images downloaded to the node.
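
To verify the result on a node, one might check the array and the bind mounts; the device name below is this script's default (a homehost suffix may be appended after a reboot):

```bash
# Inspect the RAID-0 array created by the script
mdadm --detail /dev/md/kubernetes

# Confirm each state directory is bind-mounted onto the array
for dir in /var/lib/kubelet /var/lib/containerd /var/log/pods; do
  findmnt "${dir}"
done
```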

### Mount for Persistent Volumes (mount)

Another way of utilizing the ephemeral disks is to format and mount the individual disks. Mounting individual disks allows the [local-static-provisioner](https://github.com/kubernetes-sigs/sig-storage-local-static-provisioner) DaemonSet to create Persistent Volumes that pods can claim.
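
As a sketch, the resulting mounts can be inspected as follows; the paths assume the script's default mount directory:

```bash
# Each instance-store disk is formatted and mounted at /mnt/k8s-disks/<N>
lsblk -o NAME,FSTYPE,MOUNTPOINT
findmnt /mnt/k8s-disks/1
```
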
220 changes: 220 additions & 0 deletions files/bin/setup-local-disks
@@ -0,0 +1,220 @@
#!/usr/bin/env bash

set -o errexit
set -o pipefail
set -o nounset

err_report() {
echo "Exited with error on line $1"
}
trap 'err_report $LINENO' ERR

print_help() {
echo "usage: $0 <raid0 | mount>"
echo "Sets up Amazon EC2 Instance Store NVMe disks"
echo ""
echo "-d, --dir directory to mount the filesystem(s) (default: /mnt/k8s-disks/)"
echo "-h, --help print this help"
}

# Sets up a RAID-0 of NVMe instance storage disks, moves
# the contents of /var/lib/kubelet and /var/lib/containerd
# to the new mounted RAID, and bind mounts the kubelet and
# containerd state directories.
maybe_raid0() {
local md_name="kubernetes"
local md_device="/dev/md/${md_name}"
local md_config="/.aws/mdadm.conf"
local array_mount_point="${MNT_DIR}/0"
mkdir -p "$(dirname "${md_config}")"

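## Create the array only if it hasn't been recorded yet; mdadm.conf
## persists the array definition so reboots reuse the existing array.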
if [[ ! -s "${md_config}" ]]; then
mdadm --create --force --verbose \
"${md_device}" \
--level=0 \
--name="${md_name}" \
--raid-devices="${#EPHEMERAL_DISKS[@]}" \
"${EPHEMERAL_DISKS[@]}"
while [ -n "$(mdadm --detail "${md_device}" | grep -ioE 'State :.*resyncing')" ]; do
echo "Raid is resyncing..."
sleep 1
done
mdadm --detail --scan > "${md_config}"
fi

## Check if the device symlink has changed on reboot to include a homehost identifier
local current_md_device=$(find /dev/md/ -type l -regex ".*/${md_name}_?[0-9a-z]*$" | tail -n1)
if [[ ! -z ${current_md_device} ]]; then
md_device="${current_md_device}"
fi

# Format the array if not already formatted.
if [[ -z "$(lsblk "${md_device}" -o fstype --noheadings)" ]]; then
## By default, mkfs tries to use the stripe unit of the array (512k),
## for the log stripe unit, but the max log stripe unit is 256k.
## So instead, we use 32k (8 blocks) to avoid a warning of breaching the max.
## mkfs.xfs defaults to 32k after logging the warning since the default log buffer size is 32k.
mkfs.xfs -l su=8b "${md_device}"
fi

## Create the mount directory
mkdir -p "${array_mount_point}"

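## Mount the array via a generated systemd mount unit; systemd-escape
## converts the mount path into the unit name systemd requires.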
local dev_uuid=$(blkid -s UUID -o value "${md_device}")
local mount_unit_name="$(systemd-escape --path --suffix=mount "${array_mount_point}")"
cat > "/etc/systemd/system/${mount_unit_name}" << EOF
[Unit]
Description=Mount EC2 Instance Store NVMe disk RAID0
[Mount]
What=UUID=${dev_uuid}
Where=${array_mount_point}
Type=xfs
Options=defaults,noatime
[Install]
WantedBy=multi-user.target
EOF
systemd-analyze verify "${mount_unit_name}"
systemctl enable "${mount_unit_name}" --now

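## Track units that must be stopped while their state directories are
## copied onto the array, and the directories that still need bind mounts.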
prev_running=""
needs_linked=""
for unit in "kubelet" "containerd"; do
## Check if the bind mount from the RAID already exists
if [[ "$(systemctl is-active var-lib-${unit}.mount)" != "active" ]]; then
# Check if components that depend on the RAID are running and, if so, note them to be stopped
if systemctl is-active "${unit}" > /dev/null 2>&1; then
prev_running+=" ${unit}"
fi
needs_linked+=" /var/lib/${unit}"
fi
done

## Check if /var/log/pods has been bind mounted and make sure kubelet is stopped
if [[ "$(systemctl is-active var-log-pods.mount)" != "active" ]]; then
if systemctl is-active "kubelet" > /dev/null 2>&1; then
prev_running+=" kubelet"
fi
needs_linked+=" /var/log/pods"
fi

if [[ ! -z "${prev_running}" ]]; then
systemctl stop ${prev_running}
fi

# Transfer state directories to the array, if they exist.
for mount_point in ${needs_linked}; do
local unit="$(basename "${mount_point}")"
local array_mount_point_unit="${array_mount_point}/${unit}"
mkdir -p "${mount_point}"
echo "Copying ${mount_point}/ to ${array_mount_point_unit}/"
cp -a "${mount_point}/" "${array_mount_point_unit}/"
local mount_unit_name="$(systemd-escape --path --suffix=mount "${mount_point}")"
cat > "/etc/systemd/system/${mount_unit_name}" << EOF
[Unit]
Description=Mount ${unit} on EC2 Instance Store NVMe RAID0
[Mount]
What=${array_mount_point_unit}
Where=${mount_point}
Type=none
Options=bind
[Install]
WantedBy=multi-user.target
EOF
systemd-analyze verify "${mount_unit_name}"
systemctl enable "${mount_unit_name}" --now
done

if [[ ! -z "${prev_running}" ]]; then
systemctl start ${prev_running}
fi
}

# Mounts and creates xfs file systems on all EC2 instance store NVMe disks
# without existing file systems. Mounts in /mnt/k8s-disks/{1..} by default
maybe_mount() {
idx=1
for dev in "${EPHEMERAL_DISKS[@]}"; do
if [[ -z "$(lsblk "${dev}" -o fstype --noheadings)" ]]; then
mkfs.xfs -l su=8b "${dev}"
fi
if [[ ! -z "$(lsblk "${dev}" -o MOUNTPOINT --noheadings)" ]]; then
echo "${dev} is already mounted."
continue
fi
local mount_point="${MNT_DIR}/${idx}"
local mount_unit_name="$(systemd-escape --path --suffix=mount "${mount_point}")"
mkdir -p "${mount_point}"
cat > "/etc/systemd/system/${mount_unit_name}" << EOF
[Unit]
Description=Mount EC2 Instance Store NVMe disk ${idx}
[Mount]
What=${dev}
Where=${mount_point}
Type=xfs
Options=defaults,noatime
[Install]
WantedBy=multi-user.target
EOF
systemd-analyze verify "${mount_unit_name}"
systemctl enable "${mount_unit_name}" --now
idx=$((idx + 1))
done
}

## Main logic
MNT_DIR="/mnt/k8s-disks"

while [[ $# -gt 0 ]]; do
key="$1"
case $key in
-h | --help)
print_help
exit 0
;;
-d | --dir)
MNT_DIR="$2"
shift
shift
;;
*) # unknown option
POSITIONAL+=("$1") # save it in an array for later
shift # past argument
;;
esac
done

set +u
set -- "${POSITIONAL[@]}" # restore positional parameters
DISK_SETUP="$1"
set -u

if [[ "${DISK_SETUP}" != "raid0" && "${DISK_SETUP}" != "mount" ]]; then
echo "Valid disk setup options are: raid0 or mount"
exit 1
fi

disks=($(find -L /dev/disk/by-id/ -xtype l -name '*NVMe_Instance_Storage_*'))
## Bail early if there are no ephemeral disks to setup
if [[ "${#disks[@]}" -eq 0 ]]; then
echo "no ephemeral disks found, skipping disk setup"
exit 0
fi

if [ "$(id --user)" -ne 0 ]; then
echo "Must be run as root"
exit 1
fi

## Get devices of NVMe instance storage ephemeral disks
EPHEMERAL_DISKS=($(realpath "${disks[@]}" | sort -u))

case "${DISK_SETUP}" in
"raid0")
maybe_raid0
echo "Successfully setup RAID-0 consisting of ${EPHEMERAL_DISKS[@]}"
;;
"mount")
maybe_mount
echo "Successfully setup disk mounts consisting of ${EPHEMERAL_DISKS[@]}"
;;
esac
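
For reference, a usage sketch of the script above with a custom mount directory; the path `/mnt/fast-disks` is illustrative (a common discovery directory for local-static-provisioner):

```bash
# Format and mount each instance-store disk under /mnt/fast-disks
/bin/setup-local-disks --dir /mnt/fast-disks mount
```
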
11 changes: 11 additions & 0 deletions files/bootstrap.sh
@@ -32,6 +32,7 @@ function print_help {
echo "--enable-local-outpost Enable support for worker nodes to communicate with the local control plane when running on a disconnected Outpost. (true or false)"
echo "--ip-family Specify ip family of the cluster"
echo "--kubelet-extra-args Extra arguments to add to the kubelet. Useful for adding labels or taints."
echo "--local-disks Setup instance storage NVMe disks in raid0 or mount the individual disks for use by pods [mount | raid0]"
echo "--mount-bfs-fs Mount a bpffs at /sys/fs/bpf (default: true, for Kubernetes 1.27+; false otherwise)"
echo "--pause-container-account The AWS account (number) to pull the pause container from"
echo "--pause-container-version The tag of the pause container"
@@ -156,6 +157,11 @@ while [[ $# -gt 0 ]]; do
shift
shift
;;
--local-disks)
LOCAL_DISKS=$2
shift
shift
;;
*) # unknown option
POSITIONAL+=("$1") # save it in an array for later
shift # past argument
@@ -211,6 +217,11 @@ IP_FAMILY="${IP_FAMILY:-}"
SERVICE_IPV6_CIDR="${SERVICE_IPV6_CIDR:-}"
ENABLE_LOCAL_OUTPOST="${ENABLE_LOCAL_OUTPOST:-}"
CLUSTER_ID="${CLUSTER_ID:-}"
LOCAL_DISKS="${LOCAL_DISKS:-}"

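# Configure instance-store disks early, before kubelet and containerd are configured and started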
if [[ ! -z ${LOCAL_DISKS} ]]; then
setup-local-disks "${LOCAL_DISKS}"
fi

DEFAULT_MOUNT_BPF_FS="true"
if vercmp "$KUBELET_VERSION" lt "1.27.0"; then
3 changes: 2 additions & 1 deletion scripts/install-worker.sh
@@ -70,7 +70,8 @@ sudo yum install -y \
unzip \
wget \
yum-utils \
yum-plugin-versionlock
yum-plugin-versionlock \
mdadm
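# mdadm (above) is used by /bin/setup-local-disks to create the RAID-0 array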

# Remove any old kernel versions. `--count=1` here means "only leave 1 kernel version installed"
sudo package-cleanup --oldkernels --count=1 -y