Cannot create container when systemd-resolved is not running #11810
Description
What happened?
The cluster.yml playbook crashes on "Kubeadm | Create kubeadm config"
when systemd-resolved is not running, because the /run/systemd/resolve/resolv.conf
file is missing.
What did you expect to happen?
Kubespray should configure kubelet to use /etc/resolv.conf
instead of the missing /run/systemd/resolve/resolv.conf.
How can we reproduce it (as minimally and precisely as possible)?
Run cluster.yml on kube nodes running Ubuntu 24.04 with systemd-resolved masked.
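For reference, a minimal sketch of a play that puts hosts into this state (this is only an illustration, not part of Kubespray; the group names are the Kubespray defaults, and the play both masks the service and removes any stale runtime resolv.conf so the hosts look like ones where systemd-resolved never ran):

```yaml
# Hypothetical reproduction play: mask and stop systemd-resolved, then
# remove its runtime resolv.conf so /run/systemd/resolve/resolv.conf is absent.
- hosts: kube_control_plane:kube_node
  become: true
  tasks:
    - name: Mask and stop systemd-resolved
      ansible.builtin.systemd:
        name: systemd-resolved
        state: stopped
        masked: true

    - name: Remove the runtime resolv.conf left behind by systemd-resolved
      ansible.builtin.file:
        path: /run/systemd/resolve/resolv.conf
        state: absent
```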
OS
Linux 6.1.113-zfs226 x86_64
PRETTY_NAME="Ubuntu 24.04.1 LTS"
NAME="Ubuntu"
VERSION_ID="24.04"
VERSION="24.04.1 LTS (Noble Numbat)"
VERSION_CODENAME=noble
ID=ubuntu
ID_LIKE=debian
HOME_URL="https://www.ubuntu.com/"
SUPPORT_URL="https://help.ubuntu.com/"
BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"
PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy"
UBUNTU_CODENAME=noble
LOGO=ubuntu-logo
Version of Ansible
ansible [core 2.16.14]
config file = /var/home/kubespray-2.26.0/ansible.cfg
configured module search path = ['/var/home/kubespray-2.26.0/library']
ansible python module location = /var/home/kubespray-2.26.0/venv/lib/python3.12/site-packages/ansible
ansible collection location = /var/home/ansible/collections:/usr/share/ansible/collections
executable location = /var/home/kubespray-2.26.0/venv/bin/ansible
python version = 3.12.3 (main, Nov 6 2024, 18:32:19) [GCC 13.2.0] (/var/home/kubespray-2.26.0/venv/bin/python)
jinja version = 3.1.4
libyaml = True
Version of Python
Python 3.12.3
Version of Kubespray (commit)
kubespray-2.26.0
Network plugin used
calico
Full inventory with variables
Default kubespray-2.26.0 variables
Command used to invoke ansible
ansible-playbook -i inventory/cluster/inventory.ini cluster.yml
Output of ansible run
TASK [kubernetes/control-plane : Kubeadm | Create kubeadm config] **************************************************************************************************************************************************************************
changed: [XXX-prod-master1]
changed: [XXX-prod-master2]
changed: [XXX-prod-master3]
Tuesday 17 December 2024 12:38:03 +0100 (0:00:00.492) 0:07:22.794 ******
Tuesday 17 December 2024 12:38:03 +0100 (0:00:00.043) 0:07:22.837 ******
Tuesday 17 December 2024 12:38:03 +0100 (0:00:00.047) 0:07:22.885 ******
Tuesday 17 December 2024 12:38:03 +0100 (0:00:00.041) 0:07:22.927 ******
Tuesday 17 December 2024 12:38:03 +0100 (0:00:00.048) 0:07:22.976 ******
Tuesday 17 December 2024 12:38:04 +0100 (0:00:00.090) 0:07:23.067 ******
Tuesday 17 December 2024 12:38:04 +0100 (0:00:00.100) 0:07:23.167 ******
Tuesday 17 December 2024 12:38:04 +0100 (0:00:00.054) 0:07:23.221 ******
Tuesday 17 December 2024 12:38:04 +0100 (0:00:00.044) 0:07:23.266 ******
Tuesday 17 December 2024 12:38:04 +0100 (0:00:00.048) 0:07:23.315 ******
Tuesday 17 December 2024 12:38:04 +0100 (0:00:00.055) 0:07:23.370 ******
FAILED - RETRYING: [XXX-prod-master1]: Kubeadm | Initialize first master (3 retries left).
FAILED - RETRYING: [XXX-prod-master1]: Kubeadm | Initialize first master (2 retries left).
FAILED - RETRYING: [XXX-prod-master1]: Kubeadm | Initialize first master (1 retries left).
Anything else we need to know
The problem is that roles/kubernetes/preinstall/tasks/main.yml
does detect whether systemd-resolved is running, but the result is only used to decide whether to include 0060-resolvconf.yml
or 0061-systemd-resolved.yml.
However, roles/kubernetes/node/tasks/facts.yml
includes an OS-specific var file from roles/kubernetes/node/vars,
and in those files the resolvconf path is hardcoded for most distributions to /run/systemd/resolve/resolv.conf.
As a result, kubelet fails to create any container.
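A minimal sketch of the kind of guard that could address this in the node role, assuming the resolvconf path is exposed as a fact (the names kube_resolv_conf and systemd_resolved_active below are illustrative, not verified Kubespray identifiers):

```yaml
# Sketch only: re-check systemd-resolved and fall back to /etc/resolv.conf
# instead of keeping the hardcoded /run/systemd/resolve/resolv.conf path.
- name: Check whether systemd-resolved is active
  ansible.builtin.command: systemctl is-active systemd-resolved
  register: systemd_resolved_active
  changed_when: false
  failed_when: false

- name: Fall back to /etc/resolv.conf when systemd-resolved is not running
  ansible.builtin.set_fact:
    kube_resolv_conf: /etc/resolv.conf
  when: systemd_resolved_active.rc != 0
```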
On control-plane servers this causes kubelet to fail to create any container, with errors such as:
Dec 17 14:01:15 XXX-prod-master1 kubelet[25126]: E1217 14:01:15.267321 25126 dns.go:284] "Could not open resolv conf file." err="open /run/systemd/resolve/resolv.conf: no such file or directory"
Dec 17 14:01:15 XXX-prod-master1 kubelet[25126]: E1217 14:01:15.267332 25126 kuberuntime_sandbox.go:45] "Failed to generate sandbox config for pod" err="open /run/systemd/resolve/resolv.conf: no such file or directory" pod="kube-system/kube-controller-manager-XXX-prod-master1"
Dec 17 14:01:15 XXX-prod-master1 kubelet[25126]: E1217 14:01:15.267342 25126 kuberuntime_manager.go:1166] "CreatePodSandbox for pod failed" err="open /run/systemd/resolve/resolv.conf: no such file or directory" pod="kube-system/kube-controller-manager-XXX-prod-master1"
Dec 17 14:01:15 XXX-prod-master1 kubelet[25126]: E1217 14:01:15.267361 25126 pod_workers.go:1298] "Error syncing pod, skipping" err="failed to \"CreatePodSandbox\" for \"kube-controller-manager-XXX-prod-master1_kube-system(bce3ce42e0aef110c5773ef4027de42c)\" with CreatePodSandboxError: \"Failed to generate sandbox config for pod \\\"kube-controller-manager-XXX-prod-master1_kube-system(bce3ce42e0aef110c5773ef4027de42c)\\\": open /run/systemd/resolve/resolv.conf: no such file or directory\"" pod="kube-system/kube-controller-manager-XXX-prod-master1" podUID="bce3ce42e0aef110c5773ef4027de42c"
When systemd-resolved is not running on worker nodes, every container is stuck in the ContainerCreating state with this error:
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 5m52s default-scheduler Successfully assigned kube-system/kube-proxy-6hnnc to XXX-prod-worker2
Warning FailedCreatePodSandBox 44s (x26 over 5m52s) kubelet Failed to create pod sandbox: open /run/systemd/resolve/resolv.conf: no such file or directory
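As a manual workaround on affected nodes (assuming Kubespray renders the kubelet configuration at /etc/kubernetes/kubelet-config.yaml, which may differ per setup), kubelet can be pointed back at the glibc resolver configuration via the standard KubeletConfiguration resolvConf field:

```yaml
# /etc/kubernetes/kubelet-config.yaml (path assumed; adjust to your deployment)
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
# ... keep all other fields rendered by Kubespray unchanged ...
resolvConf: /etc/resolv.conf
```

After editing the file, restart kubelet (systemctl restart kubelet); this only changes which resolv.conf kubelet copies into pod sandboxes, which is exactly the file it currently fails to open.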