Set cgroupdriver to systemd on GPU nodes #3007
Merged
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
Description
Fixes #3005. Maybe.
#2962 set the cgroupdriver of both the docker daemon and the kubelet to systemd. Unfortunately, the GPU AMIs remove the
/etc/docker/daemon.jsonfile* and set things via flags. This meant that0.35.0broke things for people who want to create new node groups with GPU instances: the kubelet was set to usesystemd, while the docker daemon (that file removed) defaulted tocgroupfs.* annoying.
There are 2 workarounds (in the form of
preBootstrapCommands) for users. This PR is a quick hack to get things working again, while we hopefully figure out something better OR wait for AWS to default tosystemdin those accelerated AMIs (the standard ones are about to do this).Alternatives to simply sedding the flag into that
OPTIONSstring:/etc/sysconfig/dockerfile/etc/systemd/system/docker.service.d/override.confsed🤷♀️The first 2 run the risk of getting out of step with the originals. The 3rd may be more trouble than it is worth.
I have asked AWS to ensure those accelerated AMIs set the cgroup driver to
systemdby default, so we could just leave the hack in while we wait. Don't know how long that will be tho... (The standard AMIs are about to dosystemdby default, but a different builder is used for GPU ones.)I am open to any and all suggestions.
More information can be found here #3005 (comment)
I am gonna add an integration test as part of regression testing.
(I also simplified some other stuff around writing config files, but can take or leave that.)
Checklist
README.md, or theuserdocsdirectory)area/nodegroup), target version (e.g.version/0.12.0) and kind (e.g.kind/improvement)