add bootstrap option to create a local NVMe RAID-0 for kubelet and containerd #1171

Merged 1 commit on Apr 25, 2023

Conversation

@bwagner5 (Contributor) commented Feb 6, 2023

Issue #, if available:

Description of changes:

  • Adds an option to the bootstrap script to set up a RAID-0 array with an XFS filesystem across any NVMe instance storage disks, move the contents of /var/lib/kubelet and /var/lib/containerd onto the new RAID, and then symlink those state dirs back into the root filesystem (a minimal sketch of this flow follows this list).
  • Installs mdadm in the AMI build process so that the RAID can be created
  • Safe to run on instances without NVMe instance storage disks, since steps will just be skipped. The script is also idempotent.
  • The script is decoupled from the bootstrap script and doesn't require any configuration, so the RAID script could live in the repo even without the bootstrap option.
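
A minimal sketch of the flow described above (illustrative only; this is not the exact script added in this PR, and the device detection, array name, and error handling are simplified assumptions):

#!/usr/bin/env bash
# Sketch of the local-disk RAID-0 setup described in this PR (assumptions noted above).
set -euo pipefail
MOUNT_POINT="/mnt/kubernetes"

# Idempotency guard: if the state dirs are already symlinks, there is nothing to do.
if [ -L /var/lib/kubelet ] && [ -L /var/lib/containerd ]; then
  exit 0
fi

# Find NVMe instance storage devices; EBS volumes report a different model string.
mapfile -t DISKS < <(lsblk -dpno NAME,MODEL | awk '/Instance Storage/ {print $1}')
if [ "${#DISKS[@]}" -eq 0 ]; then
  echo "no NVMe instance storage disks found; skipping"
  exit 0
fi

# Stripe all instance storage disks into a single RAID-0 array and format it with XFS.
# (--force allows a single-disk array, e.g. on a c5d.4xlarge)
mdadm --create /dev/md/kubernetes --level=0 --force --raid-devices="${#DISKS[@]}" "${DISKS[@]}"
mkfs.xfs /dev/md/kubernetes
mkdir -p "${MOUNT_POINT}"
mount /dev/md/kubernetes "${MOUNT_POINT}"

# Move kubelet and containerd state onto the array and symlink it back into /var/lib.
for dir in kubelet containerd; do
  mkdir -p "${MOUNT_POINT}/${dir}"
  if [ -d "/var/lib/${dir}" ]; then
    cp -a "/var/lib/${dir}/." "${MOUNT_POINT}/${dir}/"
    rm -rf "/var/lib/${dir}"
  fi
  ln -sfn "${MOUNT_POINT}/${dir}" "/var/lib/${dir}"
done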

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

Testing Done

c5d.4xlarge (1 NVMe instance storage disk)

> aws ec2 describe-instance-types --instance-type c5d.4xlarge
...
        "InstanceStorageInfo": {
            "Disks": [
                {
                    "Count": 1,
                    "SizeInGB": 400,
                    "Type": "ssd"
                }
            ],
            "EncryptionSupport": "required",
            "NvmeSupport": "required",
            "TotalSizeInGB": 400
        },
        "InstanceStorageSupported": true,
...

> /etc/eks/bootstrap.sh my-cluster --raid-local-disks true --apiserver-endpoint <...> --b64-cluster-ca <...> --container-runtime containerd --kubelet-extra-args --node-labels=karpenter.sh/capacity-type=on-demand,karpenter.sh/provisioner-name=default

> lsblk
NAME          MAJ:MIN RM   SIZE RO TYPE  MOUNTPOINT
nvme1n1       259:0    0 372.5G  0 disk
└─md127         9:127  0 372.4G  0 raid0 /mnt/kubernetes
nvme0n1       259:1    0    20G  0 disk
├─nvme0n1p1   259:2    0    20G  0 part  /
└─nvme0n1p128 259:3    0     1M  0 part

> find /var/lib/ -type l -ls
2222112    0 lrwxrwxrwx   1 root     root           44 Feb  6 19:34 /var/lib/cloud/instance -> /var/lib/cloud/instances/i-087f6f66328338083
114456    0 lrwxrwxrwx   1 root     root           26 Feb  6 19:37 /var/lib/containerd -> /mnt/kubernetes/containerd
114458    0 lrwxrwxrwx   1 root     root           23 Feb  6 19:37 /var/lib/kubelet -> /mnt/kubernetes/kubelet

Tested a reboot as well and observed the /mnt/kubernetes dir properly mounted and the node operates normally.
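
For the array and mount to survive a reboot like this, the md configuration and the mount need to be persisted somewhere; a minimal sketch of one common approach (an assumption here, the PR's script may use a different mechanism such as a systemd mount unit):

# Persist the array definition so it re-assembles under a stable name on boot,
# and remount /mnt/kubernetes automatically via fstab (Amazon Linux 2 reads /etc/mdadm.conf).
mdadm --detail --scan >> /etc/mdadm.conf
echo "/dev/md/kubernetes /mnt/kubernetes xfs defaults,noatime 0 2" >> /etc/fstab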

c5d.24xlarge (4 NVMe instance storage disks)

> aws ec2 describe-instance-types --instance-type c5d.24xlarge
...
 "InstanceStorageInfo": {
            "Disks": [
                {
                    "Count": 4,
                    "SizeInGB": 900,
                    "Type": "ssd"
                }
            ],
            "EncryptionSupport": "required",
            "NvmeSupport": "required",
            "TotalSizeInGB": 3600
        },
        "InstanceStorageSupported": true,
...

> /etc/eks/bootstrap.sh my-cluster --raid-local-disks true --apiserver-endpoint <...> --b64-cluster-ca <...> --container-runtime containerd --kubelet-extra-args --node-labels=karpenter.sh/capacity-type=on-demand,karpenter.sh/provisioner-name=default

> lsblk
NAME          MAJ:MIN RM   SIZE RO TYPE  MOUNTPOINT
nvme1n1       259:0    0 838.2G  0 disk
└─md127         9:127  0   3.3T  0 raid0 /mnt/kubernetes
nvme2n1       259:1    0 838.2G  0 disk
└─md127         9:127  0   3.3T  0 raid0 /mnt/kubernetes
nvme3n1       259:2    0 838.2G  0 disk
└─md127         9:127  0   3.3T  0 raid0 /mnt/kubernetes
nvme4n1       259:3    0 838.2G  0 disk
└─md127         9:127  0   3.3T  0 raid0 /mnt/kubernetes
nvme0n1       259:4    0    20G  0 disk
├─nvme0n1p1   259:5    0    20G  0 part  /
└─nvme0n1p128 259:6    0     1M  0 part

> find /var/lib/ -type l -ls
2222112    0 lrwxrwxrwx   1 root     root           44 Feb  6 19:46 /var/lib/cloud/instance -> /var/lib/cloud/instances/i-0886983fdb2339a75
114458    0 lrwxrwxrwx   1 root     root           26 Feb  6 19:46 /var/lib/containerd -> /mnt/kubernetes/containerd
114459    0 lrwxrwxrwx   1 root     root           23 Feb  6 19:46 /var/lib/kubelet -> /mnt/kubernetes/kubelet

> kubectl describe node
... 
Capacity:
  attachable-volumes-aws-ebs:  25
  cpu:                         96
  ephemeral-storage:           3513382752Ki     (<------- ~3.27 TiB)
  hugepages-1Gi:               0
  hugepages-2Mi:               0
  memory:                      193675380Ki
  pods:                        737
...
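
Sanity check on the reported capacity: kubelet reports ephemeral-storage in Ki, and 3513382752 Ki / 1024^2 ≈ 3350 GiB ≈ 3.27 TiB, which lines up with the 4 x 838.2 GiB devices shown by lsblk (≈ 3352.8 GiB raw) minus md/XFS metadata overhead.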

c5.4xlarge (0 NVMe instance storage disks):

> aws ec2 describe-instance-types --instance-type c5.4xlarge
...
"InstanceStorageSupported": false,
...

> lsblk
NAME          MAJ:MIN RM SIZE RO TYPE MOUNTPOINT
nvme0n1       259:0    0  20G  0 disk
├─nvme0n1p1   259:1    0  20G  0 part /
└─nvme0n1p128 259:2    0   1M  0 part

> kgn i-0bdd982382eb075d3.us-east-2.compute.internal
NAME                                             STATUS   ROLES    AGE     VERSION               ARCH    INSTANCE-TYPE   PROVISIONER-NAME   ZONE         CAPACITY-TYPE
i-0bdd982382eb075d3.us-east-2.compute.internal   Ready    <none>   3m51s   v1.24.9-eks-49d8fe8   amd64   c5.4xlarge      default            us-east-2a   on-demand

See this guide for recommended testing for PRs. Some tests may not apply. Completing tests and providing additional validation steps are not required, but it is recommended and may reduce review time and time to merge.

Review thread on files/bootstrap.sh (outdated, resolved)
@cartermckinnon (Member) left a comment:

Small comments; and I think we need a user guide section about this.

Review threads on files/bin/raid-local-disks (outdated, resolved)
@stevehipwell (Contributor) left a comment:

This looks like a great addition; would it also be possible to get this logic added to Bottlerocket (it currently mounts by default)?

I would suggest changing the flag to --local-disks and providing 3 options: raid (this logic), mount (mount the disks), or ignore. This would then support using the local disk provisioner CSI when set to mount.

@bwagner5 (Contributor, Author):

> This looks like a great addition; would it also be possible to get this logic added to Bottlerocket (it currently mounts by default)?

I would like to get this into Bottlerocket, still need to see exactly how we can do that :)

> I would suggest changing the flag to --local-disks and providing 3 options: raid (this logic), mount (mount the disks), or ignore. This would then support using the local disk provisioner CSI when set to mount.

I like the --local-disks suggestion for the flag name. It also gets us away from boolean flags, which are a little awkward to specify with the bootstrap script today since an actual boolean value is required.
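
For illustration, a non-boolean flag along these lines could be handled in bootstrap.sh roughly as follows (a hypothetical sketch: the flag name, accepted values, and the way the setup-local-disks helper is invoked are assumptions, not the merged implementation):

# Hypothetical parsing of a --local-disks flag (illustrative only).
LOCAL_DISKS=""
while [ $# -gt 0 ]; do
  case "$1" in
    --local-disks)
      LOCAL_DISKS="$2"   # e.g. raid0 | mount
      shift 2
      ;;
    *)
      shift
      ;;
  esac
done

if [ -n "${LOCAL_DISKS}" ]; then
  setup-local-disks "${LOCAL_DISKS}"   # assumed here to be installed on the PATH by the AMI build
fi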

@stevehipwell (Contributor):

The Bottlerocket code to do the mount version of this was added in bottlerocket-os/bottlerocket#1173.

@vara-bonthu commented Mar 23, 2023:

This is one of the most requested features from customers who run data workloads on EKS.

We currently handle the RAID-0 config using AWS Node templates in the Data on EKS blueprints.

https://github.com/awslabs/data-on-eks/blob/4ac2e2ae961ef60eab1ca37aa224d113d3245b55/analytics/terraform/emr-eks-karpenter/karpenter-provisioners/spark-compute-optimized-provisioner.yaml#L64

@bryantbiggs (Contributor):

Just checking in to see if there's any ETA on when this might land? It would really simplify a lot of user data scripts out there in the wild 😅

@cartermckinnon (Member) left a comment:

Few nits, but LGTM.

Review threads on files/bin/setup-local-disks (resolved)
@bwagner5 (Contributor, Author):

Ok, should be all ready to go. I did some disk tests with fio as well:

c6id.4xlarge - EBS volume GP3 (20 GiB)

$ fio --randrepeat=1 --ioengine=libaio --direct=1 --gtod_reduce=1 --name=fio-ebs-test --filename=testfio-ebs --bs=4k --iodepth=64 --size=8G --readwrite=randrw --rwmixread=75
fio-ebs-test: (g=0): rw=randrw, bs=4K-4K/4K-4K/4K-4K, ioengine=libaio, iodepth=64
fio-2.14
Starting 1 process
fio-ebs-test: Laying out IO file(s) (1 file(s) / 8192MB)
Jobs: 1 (f=1): [m(1)] [100.0% done] [9116KB/2872KB/0KB /s] [2279/718/0 iops] [eta 00m:00s]
fio-ebs-test: (groupid=0, jobs=1): err= 0: pid=7341: Tue Apr 25 03:58:32 2023
  read : io=6142.3MB, bw=9003.8KB/s, iops=2250, runt=698555msec
  write: io=2049.8MB, bw=3004.8KB/s, iops=751, runt=698555msec
  cpu          : usr=0.30%, sys=1.16%, ctx=400059, majf=0, minf=9
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.1%, >=64=0.0%
     issued    : total=r=1572409/w=524743/d=0, short=r=0/w=0/d=0, drop=r=0/w=0/d=0
     latency   : target=0, window=0, percentile=100.00%, depth=64

Run status group 0 (all jobs):
   READ: io=6142.3MB, aggrb=9003KB/s, minb=9003KB/s, maxb=9003KB/s, mint=698555msec, maxt=698555msec
  WRITE: io=2049.8MB, aggrb=3004KB/s, minb=3004KB/s, maxb=3004KB/s, mint=698555msec, maxt=698555msec

Disk stats (read/write):
  nvme0n1: ios=1572610/525117, merge=5/42, ticks=31453573/10594148, in_queue=42047721, util=99.99%

c6id.4xlarge - RAID-0 1 disk (900 GiB)

$ fio --randrepeat=1 --ioengine=libaio --direct=1 --gtod_reduce=1 --name=fio-raid0-test --filename=testfio-raid0 --bs=4k --iodepth=64 --size=8G --readwrite=randrw --rwmixread=75
fio-raid0-test: (g=0): rw=randrw, bs=4K-4K/4K-4K/4K-4K, ioengine=libaio, iodepth=64
fio-2.14
Starting 1 process
fio-raid0-test: Laying out IO file(s) (1 file(s) / 8192MB)
Jobs: 1 (f=1): [m(1)] [100.0% done] [1092MB/364.5MB/0KB /s] [279K/93.3K/0 iops] [eta 00m:00s]
fio-raid0-test: (groupid=0, jobs=1): err= 0: pid=12053: Tue Apr 25 04:03:30 2023
  read : io=6142.3MB, bw=1092.6MB/s, iops=279688, runt=  5622msec
  write: io=2049.8MB, bw=373350KB/s, iops=93337, runt=  5622msec
  cpu          : usr=17.06%, sys=65.74%, ctx=266256, majf=0, minf=9
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.1%, >=64=0.0%
     issued    : total=r=1572409/w=524743/d=0, short=r=0/w=0/d=0, drop=r=0/w=0/d=0
     latency   : target=0, window=0, percentile=100.00%, depth=64

Run status group 0 (all jobs):
   READ: io=6142.3MB, aggrb=1092.6MB/s, minb=1092.6MB/s, maxb=1092.6MB/s, mint=5622msec, maxt=5622msec
  WRITE: io=2049.8MB, aggrb=373349KB/s, minb=373349KB/s, maxb=373349KB/s, mint=5622msec, maxt=5622msec

Disk stats (read/write):
    md127: ios=1558747/520146, merge=0/0, ticks=0/0, in_queue=0, util=0.00%, aggrios=1572409/524743, aggrmerge=0/0, aggrticks=274266/35367, aggrin_queue=309634, aggrutil=97.47%
  nvme1n1: ios=1572409/524743, merge=0/0, ticks=274266/35367, in_queue=309634, util=97.47%

i4i.32xlarge - RAID-0 8 disks (28 TiB)

$ fio --randrepeat=1 --ioengine=libaio --direct=1 --gtod_reduce=1 --name=fio-raid0-test --filename=testfio-raid0 --bs=4k --iodepth=64 --size=8G --readwrite=randrw --rwmixread=75
fio-raid0-test: (g=0): rw=randrw, bs=4K-4K/4K-4K/4K-4K, ioengine=libaio, iodepth=64
fio-2.14
Starting 1 process
Jobs: 1 (f=1): [m(1)] [100.0% done] [745.6MB/249.8MB/0KB /s] [191K/63.1K/0 iops] [eta 00m:00s]
fio-raid0-test: (groupid=0, jobs=1): err= 0: pid=26368: Tue Apr 25 04:31:57 2023
  read : io=6142.3MB, bw=764604KB/s, iops=191151, runt=  8226msec
  write: io=2049.8MB, bw=255163KB/s, iops=63790, runt=  8226msec
  cpu          : usr=18.33%, sys=71.49%, ctx=208737, majf=0, minf=9
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.1%, >=64=0.0%
     issued    : total=r=1572409/w=524743/d=0, short=r=0/w=0/d=0, drop=r=0/w=0/d=0
     latency   : target=0, window=0, percentile=100.00%, depth=64

Run status group 0 (all jobs):
   READ: io=6142.3MB, aggrb=764604KB/s, minb=764604KB/s, maxb=764604KB/s, mint=8226msec, maxt=8226msec
  WRITE: io=2049.8MB, aggrb=255163KB/s, minb=255163KB/s, maxb=255163KB/s, mint=8226msec, maxt=8226msec

Disk stats (read/write):
    md127: ios=1557658/519781, merge=0/0, ticks=0/0, in_queue=0, util=0.00%, aggrios=196551/65592, aggrmerge=0/0, aggrticks=21772/1514, aggrin_queue=23287, aggrutil=98.25%
  nvme3n1: ios=196420/65724, merge=0/0, ticks=21720/1514, in_queue=23234, util=98.15%
  nvme6n1: ios=197084/65060, merge=0/0, ticks=21866/1494, in_queue=23360, util=98.25%
  nvme2n1: ios=196466/65678, merge=0/0, ticks=21986/1601, in_queue=23588, util=98.15%
  nvme5n1: ios=196586/65558, merge=0/0, ticks=21647/1463, in_queue=23110, util=98.15%
  nvme8n1: ios=196510/65634, merge=0/0, ticks=21664/1489, in_queue=23152, util=98.15%
  nvme1n1: ios=196548/65596, merge=0/0, ticks=21788/1531, in_queue=23319, util=98.15%
  nvme4n1: ios=196450/65694, merge=0/0, ticks=21920/1571, in_queue=23492, util=98.15%
  nvme7n1: ios=196345/65799, merge=0/0, ticks=21588/1456, in_queue=23044, util=98.15%
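
Summarizing the three runs above: the 20 GiB gp3 EBS volume sustained roughly 2,250 read / 750 write IOPS, the single-disk instance storage RAID-0 roughly 280K / 93K IOPS, and the 8-disk i4i RAID-0 roughly 191K / 64K IOPS on this single-job 4k random read/write mix.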

@bwagner5 merged commit fcfca67 into awslabs:master on Apr 25, 2023
@FernandoMiguel:

Are there plans to include this in Bottlerocket too?

@mmerkes (Member) commented Apr 27, 2023:

@FernandoMiguel I'm not sure about Bottlerocket. Those AMIs are owned by a separate team, but you can open a feature request if you'd like to see it there.

@afirth commented Nov 28, 2023:

It would be nice to get this into eksctl. Apologies if it's already there, but I can't find it. This has been in GKE since 1.25.

@afirth commented Nov 28, 2023:

opened eksctl-io/eksctl#7341 for it

@iancward commented Jun 7, 2024:

It would be great if the Windows bootstrap supported this as well, considering how slow images are to pull and extract. I've opened aws/containers-roadmap#2360 for this.
