
Add system-upgrade to upgrade-cluster playbook #10184

Merged: 1 commit merged into kubernetes-sigs:master on Jun 27, 2023

Conversation

@sathieu (Contributor) commented Jun 2, 2023

What type of PR is this?
/kind feature

What this PR does / why we need it:

We want to upgrade the system packages of the nodes (including the Linux kernel).

Currently we do this with a separate playbook, but that leads to cordoning each node twice.

Which issue(s) this PR fixes:

Fixes #

Special notes for your reviewer:

Does this PR introduce a user-facing change?:

System upgrade for Debian-family nodes is available with system_upgrade=true
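For illustration, a hedged example of passing the new flag when running the upgrade playbook (the inventory path is a placeholder, not from this PR):

ansible-playbook -i inventory/mycluster/hosts.yaml upgrade-cluster.yml -e system_upgrade=true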

@k8s-ci-robot k8s-ci-robot added kind/feature Categorizes issue or PR as related to a new feature. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Jun 2, 2023
@k8s-ci-robot (Contributor) commented:

Hi @sathieu. Thanks for your PR.

I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Jun 2, 2023
@sathieu sathieu marked this pull request as draft June 2, 2023 16:44
@k8s-ci-robot k8s-ci-robot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. labels Jun 2, 2023
@k8s-ci-robot k8s-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jun 12, 2023
@sathieu sathieu marked this pull request as ready for review June 12, 2023 15:45
@k8s-ci-robot k8s-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jun 12, 2023
@sathieu (Contributor, Author) commented Jun 12, 2023

I've successfully tested this PR; marking it ready.

@MrFreezeex (Member) commented Jun 12, 2023

Hi @sathieu, thanks for your contribution! I am not entirely convinced that this should be included in kubespray, though; maybe it could live in some kind of pre-role/playbook in your own tooling, for instance? I am not strongly against it, however: if other reviewers think it's useful, I will not block it.

That being said, if it's something we want to include, it would be nice to at least support Debian AND Debian derivatives like Ubuntu, plus RedHat/CentOS distros. That should cover a wide share of users...

@sathieu (Contributor, Author) commented Jun 13, 2023

@MrFreezeex Thanks for your review.

> Hi @sathieu, thanks for your contribution! I am not entirely convinced that this should be included in kubespray, though; maybe it could live in some kind of pre-role/playbook in your own tooling, for instance? I am not strongly against it, however: if other reviewers think it's useful, I will not block it.

We are currently using a separate playbook for this, but that leads to draining+uncordoning every node twice (once for the system upgrade, then later for the kubespray upgrade).

I have since done a full test of this PR and it's not working: as rebooting cleans up /tmp, kubespray later fails when trying to do CNI | Copy cni plugins. Any idea how to fix this?

> That being said, if it's something we want to include, it would be nice to at least support Debian AND Debian derivatives like Ubuntu, plus RedHat/CentOS distros. That should cover a wide share of users...

The current patch is for Debian and derivatives. I would happily add support for other distros but I need a code snippet.

@MrFreezeex (Member) commented:

> We are currently using a separate playbook for this, but that leads to draining+uncordoning every node twice (once for the system upgrade, then later for the kubespray upgrade).

Ah yeah, I see, that makes sense 👍

> The current patch is for Debian and derivatives. I would happily add support for other distros but I need a code snippet.

For Ubuntu support and so on you would need to adjust the when clause so the tasks can run there, but yes. For CentOS/Fedora/Rocky/... you can probably use the yum module https://docs.ansible.com/ansible/latest/collections/ansible/builtin/yum_module.html#ansible-collections-ansible-builtin-yum-module; there is an example in the docs:

- name: Upgrade all packages
  ansible.builtin.yum:
    name: '*'
    state: latest

And you probably need another task to update the cache as well (with update_cache set to true, basically the equivalent of an apt update).
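A minimal sketch of such a cache-refresh task, assuming the yum module's update_cache parameter (the task name is illustrative, not from this PR):

- name: YUM | Update package cache
  ansible.builtin.yum:
    update_cache: true  # refresh repo metadata, analogous to `apt update`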

@sathieu (Contributor, Author) commented Jun 13, 2023

@MrFreezeex I've added the yum upgrade. I don't think a yum update_cache task is needed; it's done automatically when the cache is stale.

For apt, this is done by:

- name: Update package management cache (APT)
  apt:
    update_cache: yes
    cache_valid_time: 3600
  when: ansible_os_family == "Debian"
  tags:
    - bootstrap-os

But the main problem remains: /tmp is empty after the reboot, and kubespray fails because of missing downloaded files.
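One possible workaround might be to point kubespray's download destination at a reboot-safe path; if I'm not mistaken this is controlled by the local_release_dir variable (default /tmp/releases), so something like this in group_vars could avoid the problem (untested sketch, the path is arbitrary):

local_release_dir: /var/cache/kubespray/releases  # any persistent path avoids the post-reboot /tmp cleanup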

@MrFreezeex (Member) commented Jun 13, 2023

Hmmm, I think you probably need to trigger the download role after the reboot, unfortunately 🤔. To me you would have to move the system upgrade right after the cordon, then trigger the download role if the system upgrade is invoked, and make the first invocation of the download role skipped if a system upgrade is run.
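Roughly, that ordering could look like this in the upgrade playbook (a sketch only; the when conditions are assumptions, not the actual diff):

  roles:
    - { role: kubespray-defaults }
    - { role: upgrade/pre-upgrade, tags: pre-upgrade }  # cordon/drain the node
    - { role: system-upgrade, tags: system-upgrade, when: system_upgrade | default(false) }
    - { role: download, tags: download, when: system_upgrade | default(false) }  # re-fetch files wiped from /tmp by the reboot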

Review comment on the diff (new tasks file):

@@ -0,0 +1,37 @@
---
# Debian

@MrFreezeex (Member) commented:
It's a bit of a nitpick, but I think it would be a bit cleaner to create two other files, debian.yml and redhat.yml, and include them from main.yml.

@sathieu (Contributor, Author) replied:

Done.

@yankay (Member) commented Jun 15, 2023

/ok-to-test

@k8s-ci-robot k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Jun 15, 2023
@sathieu sathieu force-pushed the system-upgrade branch 2 times, most recently from f2f9848 to f4c9ffd on June 15, 2023 11:01
@sathieu (Contributor, Author) commented Jun 15, 2023

@MrFreezeex I've pushed a newer version of it.

> make the first invocation of the download role skipped if a system upgrade is run.

This is not possible, because etcd and other components need the download role. So the behavior is not optimal, but for us it is still better than cordoning nodes twice.

@MrFreezeex (Member) left a review:

Thanks for your contribution and all the changes you made!
/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jun 15, 2023
@k8s-ci-robot k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jun 19, 2023
@sathieu (Contributor, Author) commented Jun 19, 2023

@MrFreezeex Please re-approve, I've fixed a linting problem.

@MrFreezeex (Member) commented:

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jun 19, 2023
@sathieu (Contributor, Author) commented Jun 26, 2023

@MrFreezeex Thanks for approving this PR. What is the next step to have it merged?

@MrFreezeex (Member) commented:

> @MrFreezeex Thanks for approving this PR. What is the next step to have it merged?

You need another review/approval from a kubespray team member.

@oomichi (Contributor) commented Jun 27, 2023

Looks good to me, thanks @sathieu

/approve

@k8s-ci-robot (Contributor) commented:

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: MrFreezeex, oomichi, sathieu

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jun 27, 2023
@k8s-ci-robot k8s-ci-robot merged commit 7706935 into kubernetes-sigs:master Jun 27, 2023
@sathieu sathieu deleted the system-upgrade branch June 27, 2023 07:00
@sathieu (Contributor, Author) commented Jun 27, 2023

Thanks @MrFreezeex and @oomichi !

@nicolas-goudry (Contributor) commented Aug 9, 2023

@sathieu I tried to use the system-upgrade role that you added with this PR on a Rocky Linux 8 based cluster installed with Kubespray 2.22.1, but the “YUM upgrade all packages” task hangs “forever” for some reason…

I’ll try to provide as much detail as I can.


Here is my directory listing (redacted to omit irrelevant output):

.
├── exec
│   ├── kubespray-values.yaml
│   └── inventory
├── setup
│   ├── kubespray
│   ├── system-upgrade
│   └── upgrade-os.yaml
└── ansible.cfg

I’m running ansible from the root (.) directory, so I had to create an ansible.cfg file:

[defaults]
roles_path = setup/kubespray/roles

As seen in the directory listing, I imported the system-upgrade role under the setup directory, alongside kubespray. For convenience, I added a defaults/main.yml file to the system-upgrade role:

---
system_upgrade: true
system_upgrade_reboot: on-upgrade # never, always

And I created an upgrade-os.yaml playbook under the setup directory (thus the import_playbook path differing from upgrade-cluster.yml):

# This playbook borrows parts of the upgrade-cluster.yml play from Kubespray.
# It also makes use of the system-upgrade Kubespray role which has not yet been released.
# See https://github.com/kubernetes-sigs/kubespray/blob/36e5d742dc2b3f7984398c38009f236be7c3c065/playbooks/upgrade_cluster.yml
# See https://github.com/kubernetes-sigs/kubespray/blob/36e5d742dc2b3f7984398c38009f236be7c3c065/roles/upgrade/system-upgrade/tasks/main.yml
---
- name: Check ansible version
  import_playbook: kubespray/playbooks/ansible_version.yml

- name: Ensure compatibility with old groups
  import_playbook: kubespray/playbooks/legacy_groups.yml

- name: Gather facts
  tags: always
  import_playbook: kubespray/playbooks/facts.yml

- name: Handle upgrades to control plane hosts
  gather_facts: False
  hosts: kube_control_plane
  any_errors_fatal: "{{ any_errors_fatal | default(true) }}"
  environment: "{{ proxy_disable_env }}"
  serial: 1
  roles:
    - { role: kubespray-defaults }
    - { role: upgrade/pre-upgrade, tags: pre-upgrade }
    - { role: system-upgrade, tags: system-upgrade }
    - { role: upgrade/post-upgrade, tags: post-upgrade }

- name: Handle upgrades to worker hosts
  hosts: kube_node:calico_rr:!kube_control_plane
  gather_facts: False
  any_errors_fatal: "{{ any_errors_fatal | default(true) }}"
  environment: "{{ proxy_disable_env }}"
  serial: "{{ serial | default('20%') }}"
  roles:
    - { role: kubespray-defaults }
    - { role: upgrade/pre-upgrade, tags: pre-upgrade }
    - { role: system-upgrade, tags: system-upgrade }
    - { role: upgrade/post-upgrade, tags: post-upgrade }

I’m taking advantage of the existing roles used by upgrade-cluster.yml to cordon/drain/uncordon, in addition to your system-upgrade role, so as to only perform a system upgrade on my nodes.


Here is how I run it:

# Upgrade control plane hosts system
ansible-playbook -u node-user -b --become-user=root -i exec/inventory -e "@exec/kubespray-values.yaml" setup/upgrade-os.yaml --limit=kube_control_plane,etcd

# Upgrade worker hosts system
ansible-playbook -u node-user -b --become-user=root -i exec/inventory -e "@exec/kubespray-values.yaml" setup/upgrade-os.yaml --limit=kube_node

But the play stays stuck on the “YUM upgrade all packages” task. I ssh'ed into the node after waiting more than 45 minutes for the task to end, only to find that the following command reported nothing but itself:

ps -aux | grep ansible

Also, running:

sudo yum update

Gave me the following result:

Last metadata expiration check: 1:38:59 ago on Wed 09 Aug 2023 11:59:58 AM UTC.
Dependencies resolved.
Nothing to do.
Complete!

So, the yum update did complete, but the results were never reported to the ansible play run.


After killing the play, I went ahead and tried to run the yum module through ansible directly on one host only with the following command:

ansible all -u node-user -b --become-user=root -i exec/inventory -m yum -a 'name=* state=latest' -vvvv --limit=worker1

Here is the output (redacted):

ansible [core 2.12.5]
  config file = /home/nicolas/test-upgrade-os/ansible.cfg
  configured module search path = ['/home/nicolas/.ansible/plugins/modules', '/usr/share/ansible/plugins/modules']
  ansible python module location = /home/nicolas/test-upgrade-os/config/venv/lib64/python3.8/site-packages/ansible
  ansible collection location = /home/nicolas/.ansible/collections:/usr/share/ansible/collections
  executable location = ./config/venv/bin/ansible
  python version = 3.8.16 (default, Jun 25 2023, 05:53:51) [GCC 8.5.0 20210514 (Red Hat 8.5.0-18)]
  jinja version = 3.1.2
  libyaml = True
Using /home/nicolas/test-upgrade-os/ansible.cfg as config file
setting up inventory plugins
host_list declined parsing /home/nicolas/test-upgrade-os/exec/inventory as it did not pass its verify_file() method
script declined parsing /home/nicolas/test-upgrade-os/exec/inventory as it did not pass its verify_file() method
auto declined parsing /home/nicolas/test-upgrade-os/exec/inventory as it did not pass its verify_file() method
Parsed /home/nicolas/test-upgrade-os/exec/inventory inventory source with ini plugin
Loading callback plugin minimal of type stdout, v2.0 from /home/nicolas/test-upgrade-os/config/venv/lib64/python3.8/site-packages/ansible/plugins/callback/minimal.py
Skipping callback 'default', as we already have a stdout callback.
Skipping callback 'minimal', as we already have a stdout callback.
Skipping callback 'oneline', as we already have a stdout callback.
META: ran handlers
<10.10.0.101> ESTABLISH SSH CONNECTION FOR USER: node-user
<10.10.0.101> SSH: EXEC ssh -vvv -C -o ControlMaster=auto -o ControlPersist=60s -o KbdInteractiveAuthentication=no -o PreferredAuthentications=gssapi-with-mic,gssapi-keyex,hostbased,publickey -o PasswordAuthentication=no -o 'User="node-user"' -o ConnectTimeout=10 -q -o UserKnownHostsFile=ssh/known_hosts -i ssh/node-user -o 'ProxyCommand=ssh -q -o UserKnownHostsFile=ssh/known_hosts -i ssh/bastion-user -W %h:%p -p22 bastion-user@W.X.Y.Z' -o 'ControlPath="/home/nicolas/test-upgrade-os/config/ansible/cp/09896940d7"' 10.10.0.101 '/bin/sh -c '"'"'echo ~node-user && sleep 0'"'"''
<10.10.0.101> (0, b'/home/node-user\n', b'')
<10.10.0.101> ESTABLISH SSH CONNECTION FOR USER: node-user
<10.10.0.101> SSH: EXEC ssh -vvv -C -o ControlMaster=auto -o ControlPersist=60s -o KbdInteractiveAuthentication=no -o PreferredAuthentications=gssapi-with-mic,gssapi-keyex,hostbased,publickey -o PasswordAuthentication=no -o 'User="node-user"' -o ConnectTimeout=10 -q -o UserKnownHostsFile=ssh/known_hosts -i ssh/node-user -o 'ProxyCommand=ssh -q -o UserKnownHostsFile=ssh/known_hosts -i ssh/bastion-user -W %h:%p -p22 bastion-user@W.X.Y.Z' -o 'ControlPath="/home/nicolas/test-upgrade-os/config/ansible/cp/09896940d7"' 10.10.0.101 '/bin/sh -c '"'"'( umask 77 && mkdir -p "` echo /home/node-user/.ansible/tmp `"&& mkdir "` echo /home/node-user/.ansible/tmp/ansible-tmp-1691583637.8116903-3768362-148267575047576 `" && echo ansible-tmp-1691583637.8116903-3768362-148267575047576="` echo /home/node-user/.ansible/tmp/ansible-tmp-1691583637.8116903-3768362-148267575047576 `" ) && sleep 0'"'"''
<10.10.0.101> (0, b'ansible-tmp-1691583637.8116903-3768362-148267575047576=/home/node-user/.ansible/tmp/ansible-tmp-1691583637.8116903-3768362-148267575047576\n', b'')
<worker1> Attempting python interpreter discovery
<10.10.0.101> ESTABLISH SSH CONNECTION FOR USER: node-user
<10.10.0.101> SSH: EXEC ssh -vvv -C -o ControlMaster=auto -o ControlPersist=60s -o KbdInteractiveAuthentication=no -o PreferredAuthentications=gssapi-with-mic,gssapi-keyex,hostbased,publickey -o PasswordAuthentication=no -o 'User="node-user"' -o ConnectTimeout=10 -q -o UserKnownHostsFile=ssh/known_hosts -i ssh/node-user -o 'ProxyCommand=ssh -q -o UserKnownHostsFile=ssh/known_hosts -i ssh/bastion-user -W %h:%p -p22 bastion-user@W.X.Y.Z' -o 'ControlPath="/home/nicolas/test-upgrade-os/config/ansible/cp/09896940d7"' 10.10.0.101 '/bin/sh -c '"'"'echo PLATFORM; uname; echo FOUND; command -v '"'"'"'"'"'"'"'"'python3.10'"'"'"'"'"'"'"'"'; command -v '"'"'"'"'"'"'"'"'python3.9'"'"'"'"'"'"'"'"'; command -v '"'"'"'"'"'"'"'"'python3.8'"'"'"'"'"'"'"'"'; command -v '"'"'"'"'"'"'"'"'python3.7'"'"'"'"'"'"'"'"'; command -v '"'"'"'"'"'"'"'"'python3.6'"'"'"'"'"'"'"'"'; command -v '"'"'"'"'"'"'"'"'python3.5'"'"'"'"'"'"'"'"'; command -v '"'"'"'"'"'"'"'"'/usr/bin/python3'"'"'"'"'"'"'"'"'; command -v '"'"'"'"'"'"'"'"'/usr/libexec/platform-python'"'"'"'"'"'"'"'"'; command -v '"'"'"'"'"'"'"'"'python2.7'"'"'"'"'"'"'"'"'; command -v '"'"'"'"'"'"'"'"'python2.6'"'"'"'"'"'"'"'"'; command -v '"'"'"'"'"'"'"'"'/usr/bin/python'"'"'"'"'"'"'"'"'; command -v '"'"'"'"'"'"'"'"'python'"'"'"'"'"'"'"'"'; echo ENDFOUND && sleep 0'"'"''
<10.10.0.101> (0, b'PLATFORM\nLinux\nFOUND\n/usr/libexec/platform-python\nENDFOUND\n', b'')
<10.10.0.101> ESTABLISH SSH CONNECTION FOR USER: node-user
<10.10.0.101> SSH: EXEC ssh -vvv -C -o ControlMaster=auto -o ControlPersist=60s -o KbdInteractiveAuthentication=no -o PreferredAuthentications=gssapi-with-mic,gssapi-keyex,hostbased,publickey -o PasswordAuthentication=no -o 'User="node-user"' -o ConnectTimeout=10 -q -o UserKnownHostsFile=ssh/known_hosts -i ssh/node-user -o 'ProxyCommand=ssh -q -o UserKnownHostsFile=ssh/known_hosts -i ssh/bastion-user -W %h:%p -p22 bastion-user@W.X.Y.Z' -o 'ControlPath="/home/nicolas/test-upgrade-os/config/ansible/cp/09896940d7"' 10.10.0.101 '/bin/sh -c '"'"'/usr/libexec/platform-python && sleep 0'"'"''
<10.10.0.101> (0, b'{"platform_dist_result": ["centos", "8.5", "Green Obsidian"], "osrelease_content": "NAME=\\"Rocky Linux\\"\\nVERSION=\\"8.5 (Green Obsidian)\\"\\nID=\\"rocky\\"\\nID_LIKE=\\"rhel centos fedora\\"\\nVERSION_ID=\\"8.5\\"\\nPLATFORM_ID=\\"platform:el8\\"\\nPRETTY_NAME=\\"Rocky Linux 8.5 (Green Obsidian)\\"\\nANSI_COLOR=\\"0;32\\"\\nCPE_NAME=\\"cpe:/o:rocky:rocky:8:GA\\"\\nHOME_URL=\\"https://rockylinux.org/\\"\\nBUG_REPORT_URL=\\"https://bugs.rockylinux.org/\\"\\nROCKY_SUPPORT_PRODUCT=\\"Rocky Linux\\"\\nROCKY_SUPPORT_PRODUCT_VERSION=\\"8\\"\\n"}\n', b'')
Using module file /home/nicolas/test-upgrade-os/config/venv/lib64/python3.8/site-packages/ansible/modules/setup.py
<10.10.0.101> PUT /home/nicolas/test-upgrade-os/config/ansible/tmp/ansible-local-3768356wtqis0tq/tmpy4qpsqz0 TO /home/node-user/.ansible/tmp/ansible-tmp-1691583637.8116903-3768362-148267575047576/AnsiballZ_setup.py
<10.10.0.101> SSH: EXEC sftp -b - -vvv -C -o ControlMaster=auto -o ControlPersist=60s -o KbdInteractiveAuthentication=no -o PreferredAuthentications=gssapi-with-mic,gssapi-keyex,hostbased,publickey -o PasswordAuthentication=no -o 'User="node-user"' -o ConnectTimeout=10 -q -o UserKnownHostsFile=ssh/known_hosts -i ssh/node-user -o 'ProxyCommand=ssh -q -o UserKnownHostsFile=ssh/known_hosts -i ssh/bastion-user -W %h:%p -p22 bastion-user@W.X.Y.Z' -o 'ControlPath="/home/nicolas/test-upgrade-os/config/ansible/cp/09896940d7"' '[10.10.0.101]'
<10.10.0.101> (0, b'sftp> put /home/nicolas/test-upgrade-os/config/ansible/tmp/ansible-local-3768356wtqis0tq/tmpy4qpsqz0 /home/node-user/.ansible/tmp/ansible-tmp-1691583637.8116903-3768362-148267575047576/AnsiballZ_setup.py\n', b'')
<10.10.0.101> ESTABLISH SSH CONNECTION FOR USER: node-user
<10.10.0.101> SSH: EXEC ssh -vvv -C -o ControlMaster=auto -o ControlPersist=60s -o KbdInteractiveAuthentication=no -o PreferredAuthentications=gssapi-with-mic,gssapi-keyex,hostbased,publickey -o PasswordAuthentication=no -o 'User="node-user"' -o ConnectTimeout=10 -q -o UserKnownHostsFile=ssh/known_hosts -i ssh/node-user -o 'ProxyCommand=ssh -q -o UserKnownHostsFile=ssh/known_hosts -i ssh/bastion-user -W %h:%p -p22 bastion-user@W.X.Y.Z' -o 'ControlPath="/home/nicolas/test-upgrade-os/config/ansible/cp/09896940d7"' 10.10.0.101 '/bin/sh -c '"'"'chmod u+x /home/node-user/.ansible/tmp/ansible-tmp-1691583637.8116903-3768362-148267575047576/ /home/node-user/.ansible/tmp/ansible-tmp-1691583637.8116903-3768362-148267575047576/AnsiballZ_setup.py && sleep 0'"'"''
<10.10.0.101> (0, b'', b'')
<10.10.0.101> ESTABLISH SSH CONNECTION FOR USER: node-user
<10.10.0.101> SSH: EXEC ssh -vvv -C -o ControlMaster=auto -o ControlPersist=60s -o KbdInteractiveAuthentication=no -o PreferredAuthentications=gssapi-with-mic,gssapi-keyex,hostbased,publickey -o PasswordAuthentication=no -o 'User="node-user"' -o ConnectTimeout=10 -q -o UserKnownHostsFile=ssh/known_hosts -i ssh/node-user -o 'ProxyCommand=ssh -q -o UserKnownHostsFile=ssh/known_hosts -i ssh/bastion-user -W %h:%p -p22 bastion-user@W.X.Y.Z' -o 'ControlPath="/home/nicolas/test-upgrade-os/config/ansible/cp/09896940d7"' -tt 10.10.0.101 '/bin/sh -c '"'"'sudo -H -S -n  -u root /bin/sh -c '"'"'"'"'"'"'"'"'echo BECOME-SUCCESS-ztvxikfxzuzwogfymzcnlpfaroxhooqg ; /usr/libexec/platform-python /home/node-user/.ansible/tmp/ansible-tmp-1691583637.8116903-3768362-148267575047576/AnsiballZ_setup.py'"'"'"'"'"'"'"'"' && sleep 0'"'"''
Escalation succeeded
<10.10.0.101> (0, b'\r\n{"ansible_facts": {"ansible_pkg_mgr": "dnf"}, "invocation": {"module_args": {"filter": ["ansible_pkg_mgr"], "gather_subset": ["!all"], "gather_timeout": 10, "fact_path": "/etc/ansible/facts.d"}}}\r\n', b'')
Running ansible.legacy.dnf as the backend for the yum action plugin
Using module file /home/nicolas/test-upgrade-os/config/venv/lib64/python3.8/site-packages/ansible/modules/dnf.py
<10.10.0.101> PUT /home/nicolas/test-upgrade-os/config/ansible/tmp/ansible-local-3768356wtqis0tq/tmpomw666d5 TO /home/node-user/.ansible/tmp/ansible-tmp-1691583637.8116903-3768362-148267575047576/AnsiballZ_dnf.py
<10.10.0.101> SSH: EXEC sftp -b - -vvv -C -o ControlMaster=auto -o ControlPersist=60s -o KbdInteractiveAuthentication=no -o PreferredAuthentications=gssapi-with-mic,gssapi-keyex,hostbased,publickey -o PasswordAuthentication=no -o 'User="node-user"' -o ConnectTimeout=10 -q -o UserKnownHostsFile=ssh/known_hosts -i ssh/node-user -o 'ProxyCommand=ssh -q -o UserKnownHostsFile=ssh/known_hosts -i ssh/bastion-user -W %h:%p -p22 bastion-user@W.X.Y.Z' -o 'ControlPath="/home/nicolas/test-upgrade-os/config/ansible/cp/09896940d7"' '[10.10.0.101]'
<10.10.0.101> (0, b'sftp> put /home/nicolas/test-upgrade-os/config/ansible/tmp/ansible-local-3768356wtqis0tq/tmpomw666d5 /home/node-user/.ansible/tmp/ansible-tmp-1691583637.8116903-3768362-148267575047576/AnsiballZ_dnf.py\n', b'')
<10.10.0.101> ESTABLISH SSH CONNECTION FOR USER: node-user
<10.10.0.101> SSH: EXEC ssh -vvv -C -o ControlMaster=auto -o ControlPersist=60s -o KbdInteractiveAuthentication=no -o PreferredAuthentications=gssapi-with-mic,gssapi-keyex,hostbased,publickey -o PasswordAuthentication=no -o 'User="node-user"' -o ConnectTimeout=10 -q -o UserKnownHostsFile=ssh/known_hosts -i ssh/node-user -o 'ProxyCommand=ssh -q -o UserKnownHostsFile=ssh/known_hosts -i ssh/bastion-user -W %h:%p -p22 bastion-user@W.X.Y.Z' -o 'ControlPath="/home/nicolas/test-upgrade-os/config/ansible/cp/09896940d7"' 10.10.0.101 '/bin/sh -c '"'"'chmod u+x /home/node-user/.ansible/tmp/ansible-tmp-1691583637.8116903-3768362-148267575047576/ /home/node-user/.ansible/tmp/ansible-tmp-1691583637.8116903-3768362-148267575047576/AnsiballZ_dnf.py && sleep 0'"'"''
<10.10.0.101> (0, b'', b'')
<10.10.0.101> ESTABLISH SSH CONNECTION FOR USER: node-user
<10.10.0.101> SSH: EXEC ssh -vvv -C -o ControlMaster=auto -o ControlPersist=60s -o KbdInteractiveAuthentication=no -o PreferredAuthentications=gssapi-with-mic,gssapi-keyex,hostbased,publickey -o PasswordAuthentication=no -o 'User="node-user"' -o ConnectTimeout=10 -q -o UserKnownHostsFile=ssh/known_hosts -i ssh/node-user -o 'ProxyCommand=ssh -q -o UserKnownHostsFile=ssh/known_hosts -i ssh/bastion-user -W %h:%p -p22 bastion-user@W.X.Y.Z' -o 'ControlPath="/home/nicolas/test-upgrade-os/config/ansible/cp/09896940d7"' -tt 10.10.0.101 '/bin/sh -c '"'"'sudo -H -S -n  -u root /bin/sh -c '"'"'"'"'"'"'"'"'echo BECOME-SUCCESS-gjdfwphkqonajiudmalgairdspobkjad ; /usr/libexec/platform-python /home/node-user/.ansible/tmp/ansible-tmp-1691583637.8116903-3768362-148267575047576/AnsiballZ_dnf.py'"'"'"'"'"'"'"'"' && sleep 0'"'"''
Escalation succeeded

Before running ansible, I ssh'ed in the node and ran:

watch "ps -aux | grep ansible"

While ansible was performing the yum update, I saw that the process /usr/libexec/platform-python /home/node-user/.ansible/tmp/ansible-tmp-1691583637.8116903-3768362-148267575047576/AnsiballZ_dnf.py ran for about 10-15 minutes, and after it had disappeared, ansible kept running for more than an hour before failing with the following error:

worker1 | UNREACHABLE! => {
    "changed": false,
    "msg": "Failed to connect to the host via ssh: ",
    "unreachable": true
}

I was afraid that something related to the SSH server had been updated and broke ansible, but my watch is still running as I write this issue. And anyway, such changes would only have taken effect after a reboot, so existing SSH connections wouldn’t have been dropped.

I think this is an ansible issue, but I’m not really sure. Would you have any insight into why this is happening and how it could be fixed?
I would be happy to file a PR, but can’t get my head around this atm.
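For what it's worth, one mitigation for long-running upgrades over fragile SSH sessions is Ansible's async/poll mechanism, which detaches the task from the connection; a hedged sketch, not something this PR uses:

- name: YUM | Upgrade all packages in the background  # noqa package-latest
  ansible.builtin.yum:
    name: '*'
    state: latest
  async: 3600  # allow the detached task up to an hour
  poll: 30     # reconnect and poll its status every 30 seconds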


PS: I tried using the dnf and package modules, which gave the exact same results.
PS2: Out of curiosity, I tried updating a single package (tar), and it worked with the yum, dnf and package modules.
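For reference, the single-package variant that did work was equivalent to this (illustrative sketch):

- name: YUM | Update a single package
  ansible.builtin.yum:
    name: tar
    state: latest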

@nicolas-goudry (Contributor) commented Aug 9, 2023

I need to do some further testing, but I think I found an acceptable workaround. See master...nicolas-goudry:kubespray:fix/yum-system-upgrade.

@sathieu @MrFreezeex what do you think about that?


Edit: after some tests, this workaround needs further tweaking and testing. I’ll let you know how it goes.

Edit 2:

Something really weird is happening with this workaround.

When I run the upgrade-os.yaml play, the yum module hangs and results in a fatal: [master2]: UNREACHABLE! => {"changed": false, "msg": "Failed to connect to the host via ssh: ", "unreachable": true} error.

But, when I update the yum.yml file to transform it into a playbook like so:

- hosts: all
  gather_facts: no
  tasks:
    # Workaround to whole system update not working with yum: name=*
    - name: YUM | Get available package updates
      yum:
        list: updates
      register: yum_available_package_updates

    - name: YUM | Debug packages to update
      debug:
        msg: "{{ yum_available_package_updates.results | map(attribute='name') | list }}"

    - name: YUM | Update packages  # noqa package-latest
      yum:
        name: "{{ yum_available_package_updates.results | map(attribute='name') | list }}"
        state: 'latest'
      register: yum_upgrade
#
#    - name: YUM | Reboot after packages updates  # noqa no-handler
#      when:
#      - yum_upgrade.changed or system_upgrade_reboot == 'always'
#      - system_upgrade_reboot != 'never'
#      reboot:

And run it with the following command:

ansible-playbook -u node-user -b --become-user=root -i exec/inventory setup/system-upgrade/tasks/yum.yml --limit=master3

It works…

@yankay yankay mentioned this pull request Aug 24, 2023
@sathieu (Contributor, Author) commented Sep 15, 2023

@nicolas-goudry Sorry for the late response. I don't use Rocky Linux or any RPM-based distro; I took the code from the yum module docs.

Could you propose a PR with your improvement?

@nicolas-goudry (Contributor) commented:

@sathieu I still have some weird issues with this workaround so I’m not sure it should land in Kubespray just yet.

I think it’s something related to Python version discrepancies between the control node and the managed nodes. I'm not quite sure, though. I’ll have to do more tests, but I’m struggling to find the time to do so. I’ll keep you posted.

pedro-peter pushed a commit to pedro-peter/kubespray that referenced this pull request May 8, 2024