Skip to content

Conversation

@Callisto13
Copy link
Contributor

@Callisto13 Callisto13 commented Dec 31, 2020

Description

Fixes #3005. Maybe.

#2962 set the cgroupdriver of both the docker daemon and the kubelet to systemd. Unfortunately, the GPU AMIs remove the /etc/docker/daemon.json file* and set things via flags. This meant that 0.35.0 broke things for people who want to create new node groups with GPU instances: the kubelet was set to use systemd, while the docker daemon (that file removed) defaulted to cgroupfs.

* annoying.

There are 2 workarounds (in the form of preBootstrapCommands) for users. This PR is a quick hack to get things working again, while we hopefully figure out something better OR wait for AWS to default to systemd in those accelerated AMIs (the standard ones are about to do this).

Alternatives to simply sedding the flag into that OPTIONS string:

  • Provide our own /etc/sysconfig/docker file
  • Provide our own /etc/systemd/system/docker.service.d/override.conf
  • Write to that file a bit nicer than using sed 🤷‍♀️

The first 2 run the risk of getting out of step with the originals. The 3rd may be more trouble than it is worth.

I have asked AWS to ensure those accelerated AMIs set the cgroup driver to systemd by default, so we could just leave the hack in while we wait. Don't know how long that will be tho... (The standard AMIs are about to do systemd by default, but a different builder is used for GPU ones.)

I am open to any and all suggestions.

More information can be found here #3005 (comment)

I am gonna add an integration test as part of regression testing.

(I also simplified some other stuff around writing config files, but can take or leave that.)

Checklist

  • Added tests that cover your change (if possible)
  • Added/modified documentation as required (such as the README.md, or the userdocs directory)
  • Manually tested
  • Added labels for change area (e.g. area/nodegroup), target version (e.g. version/0.12.0) and kind (e.g. kind/improvement)
  • Make sure the title of the PR is a good description that can go into the release notes

@Callisto13 Callisto13 requested a review from a team January 4, 2021 08:40
Copy link
Contributor

@aclevername aclevername left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM (I haven't tested the sed command myself)

An integration test would be a great way to ensure we don't break this again in the future! 😄

@Callisto13
Copy link
Contributor Author

👍 I will do the integration test separately bc I want to get a release out today

Less neat to read maybe, but removes an unnecessary outer loop when it
comes to processing the list.

Also, only write docker daemon.json when not using GPU instance type.
The bootstrap script in those AMIs will remove the daemon file, so there
is no reason to write it.
@Callisto13 Callisto13 force-pushed the cb-gpu-cgroupdriver-fix branch from 0c7ffbe to 31b371a Compare January 4, 2021 10:09
@Callisto13 Callisto13 merged commit b7cda37 into eksctl-io:master Jan 4, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

kubelet.yaml set cgroupDriver: systemd instead of cgroupDriver: cgroupfs in AL2-GPU instances

4 participants