
Add system-upgrade to upgrade-cluster playbook #10184

Merged: 1 commit merged into kubernetes-sigs:master on Jun 27, 2023

Conversation

@sathieu (Contributor) commented Jun 2, 2023

What type of PR is this?
/kind feature

What this PR does / why we need it:

We want to upgrade the system packages of the nodes (including the Linux kernel).

Currently we do this with a separate playbook, but that leads to cordoning each node twice.

Which issue(s) this PR fixes:

Fixes #

Special notes for your reviewer:

Does this PR introduce a user-facing change?:

System upgrade for Debian-family nodes is available with system_upgrade=true
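For illustration, a hedged example of passing the new flag when running the upgrade playbook (the inventory path is a placeholder, not from this PR):

ansible-playbook -i inventory/mycluster/hosts.yaml upgrade-cluster.yml -e system_upgrade=true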

@k8s-ci-robot k8s-ci-robot added kind/feature Categorizes issue or PR as related to a new feature. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Jun 2, 2023
@k8s-ci-robot (Contributor) commented:

Hi @sathieu. Thanks for your PR.

I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Jun 2, 2023
@sathieu sathieu marked this pull request as draft June 2, 2023 16:44
@k8s-ci-robot k8s-ci-robot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. labels Jun 2, 2023
@k8s-ci-robot k8s-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jun 12, 2023
@sathieu sathieu marked this pull request as ready for review June 12, 2023 15:45
@k8s-ci-robot k8s-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jun 12, 2023
@sathieu (Contributor, Author) commented Jun 12, 2023

I've successfully tested this PR; marking it ready.

@MrFreezeex (Member) commented Jun 12, 2023

Hi @sathieu, thanks for your contribution! I am not entirely convinced that this should be included in kubespray, though; maybe it could live in some kind of pre-role/playbook in your own tooling, for instance? I am not strongly against it, however: if other reviewers think it's useful, I will not block it.

That being said, if it's something we want to include, it would be nice to at least support Debian AND Debian derivatives like Ubuntu, plus RedHat/CentOS distros. That should cover a wide share of users...

@sathieu (Contributor, Author) commented Jun 13, 2023

@MrFreezeex Thanks for your review.

> Hi @sathieu, thanks for your contribution! I am not entirely convinced that this should be included in kubespray, though; maybe it could live in some kind of pre-role/playbook in your own tooling, for instance? I am not strongly against it, however: if other reviewers think it's useful, I will not block it.

We are currently using a separate playbook for this, but that leads to draining+uncordoning every node twice (once for the system upgrade, then later for the kubespray upgrade).

I have since done a full test of this PR and it's not working: as rebooting cleans up /tmp, kubespray later fails when trying to do CNI | Copy cni plugins. Any idea how to fix this?

> That being said, if it's something we want to include, it would be nice to at least support Debian AND Debian derivatives like Ubuntu, plus RedHat/CentOS distros. That should cover a wide share of users...

The current patch is for Debian and derivatives. I would happily add support for other distros but I need a code snippet.

@MrFreezeex (Member) commented:

> We are currently using a separate playbook for this, but that leads to draining+uncordoning every node twice (once for the system upgrade, then later for the kubespray upgrade).

Ah yeah, I see, that makes sense 👍

> The current patch is for Debian and derivatives. I would happily add support for other distros but I need a code snippet.

For Ubuntu support and so on you would need to adjust the when clause so the tasks can run there, but yes. For CentOS/Fedora/Rocky/... you can probably use the yum module https://docs.ansible.com/ansible/latest/collections/ansible/builtin/yum_module.html#ansible-collections-ansible-builtin-yum-module; there is an example in the docs:

- name: Upgrade all packages
  ansible.builtin.yum:
    name: '*'
    state: latest

And you probably need another task to update the cache as well (with update_cache set to true, basically the equivalent of an apt update).
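A minimal sketch of such a cache-refresh task, assuming the yum module's update_cache parameter (the task name is illustrative, not from this PR):

- name: YUM | Update package cache
  ansible.builtin.yum:
    update_cache: true  # refresh repo metadata, analogous to `apt update`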

@sathieu (Contributor, Author) commented Jun 13, 2023

@MrFreezeex I've added the yum upgrade. I don't think a yum update_cache task is needed; it's done automatically when the cache is stale.

For apt, this is done by:

- name: Update package management cache (APT)
  apt:
    update_cache: yes
    cache_valid_time: 3600
  when: ansible_os_family == "Debian"
  tags:
    - bootstrap-os

But the main problem remains: /tmp is empty after the reboot, and kubespray fails because of missing downloaded files.
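One possible workaround might be to point kubespray's download destination at a reboot-safe path; if I'm not mistaken this is controlled by the local_release_dir variable (default /tmp/releases), so something like this in group_vars could avoid the problem (untested sketch, the path is arbitrary):

local_release_dir: /var/cache/kubespray/releases  # any persistent path avoids the post-reboot /tmp cleanup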

@MrFreezeex (Member) commented Jun 13, 2023

Hmmm, I think you probably need to trigger the download role after the reboot, unfortunately 🤔. To me you would have to move the system upgrade right after the cordon, then trigger the download role if the system upgrade is invoked, and make the first invocation of the download role skipped if a system upgrade is run.
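Roughly, that ordering could look like this in the upgrade playbook (a sketch only; the when conditions are assumptions, not the actual diff):

  roles:
    - { role: kubespray-defaults }
    - { role: upgrade/pre-upgrade, tags: pre-upgrade }  # cordon/drain the node
    - { role: system-upgrade, tags: system-upgrade, when: system_upgrade | default(false) }
    - { role: download, tags: download, when: system_upgrade | default(false) }  # re-fetch files wiped from /tmp by the reboot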

Review comment on the diff (new tasks file):

@@ -0,0 +1,37 @@
---
# Debian

@MrFreezeex (Member) commented:
It's a bit of a nitpick, but I think it would be a bit cleaner to create two other files, debian.yml and redhat.yml, and include them from main.yml.

@sathieu (Contributor, Author) replied:

Done.

@yankay (Member) commented Jun 15, 2023

/ok-to-test

@k8s-ci-robot k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Jun 15, 2023
@sathieu sathieu force-pushed the system-upgrade branch 2 times, most recently from f2f9848 to f4c9ffd on June 15, 2023 11:01
@sathieu (Contributor, Author) commented Jun 15, 2023

@MrFreezeex I've pushed a newer version of it.

> make the first invocation of the download role skipped if a system upgrade is run.

This is not possible, because etcd and other components need the download role. So the behavior is not optimal, but for us it is still better than cordoning nodes twice.

@MrFreezeex (Member) left a review:

Thanks for your contribution and all the changes you made!
/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jun 15, 2023
@k8s-ci-robot k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jun 19, 2023
@sathieu (Contributor, Author) commented Jun 19, 2023

@MrFreezeex Please re-approve, I've fixed a linting problem.

@MrFreezeex (Member) commented:

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jun 19, 2023
@sathieu (Contributor, Author) commented Jun 26, 2023

@MrFreezeex Thanks for approving this PR. What is the next step to have it merged?

@MrFreezeex (Member) commented:

> @MrFreezeex Thanks for approving this PR. What is the next step to have it merged?

You need another review/approval from a kubespray team member.

@oomichi (Contributor) commented Jun 27, 2023

Looks good to me, thanks @sathieu

/approve

@k8s-ci-robot (Contributor) commented:

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: MrFreezeex, oomichi, sathieu

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jun 27, 2023
@k8s-ci-robot k8s-ci-robot merged commit 7706935 into kubernetes-sigs:master Jun 27, 2023
@sathieu sathieu deleted the system-upgrade branch June 27, 2023 07:00
@sathieu (Contributor, Author) commented Jun 27, 2023

Thanks @MrFreezeex and @oomichi !

@nicolas-goudry (Contributor) commented Aug 9, 2023

@sathieu I tried to use the system-upgrade role that you added with this PR on a Rocky Linux 8 based cluster installed with Kubespray 2.22.1, but the “YUM upgrade all packages” task hangs “forever” for some reason…

I’ll try to provide as much detail as I can.


Here is my directory listing (redacted to omit irrelevant output):

.
├── exec
│   ├── kubespray-values.yaml
│   └── inventory
├── setup
│   ├── kubespray
│   ├── system-upgrade
│   └── upgrade-os.yaml
└── ansible.cfg

I’m running ansible from the root (.) directory, so I had to create an ansible.cfg file:

[defaults]
roles_path = setup/kubespray/roles

As seen in the directory listing, I imported the system-upgrade role under the setup directory, alongside kubespray. For convenience, I added a defaults/main.yml file to the system-upgrade role:

---
system_upgrade: true
system_upgrade_reboot: on-upgrade # never, always

And I created an upgrade-os.yaml playbook under the setup directory (thus the import_playbook path differing from upgrade-cluster.yml):

# This playbook borrows parts of the upgrade-cluster.yml play from Kubespray.
# It also makes use of the system-upgrade Kubespray role which has not yet been released.
# See https://github.com/kubernetes-sigs/kubespray/blob/36e5d742dc2b3f7984398c38009f236be7c3c065/playbooks/upgrade_cluster.yml
# See https://github.com/kubernetes-sigs/kubespray/blob/36e5d742dc2b3f7984398c38009f236be7c3c065/roles/upgrade/system-upgrade/tasks/main.yml
---
- name: Check ansible version
  import_playbook: kubespray/playbooks/ansible_version.yml

- name: Ensure compatibility with old groups
  import_playbook: kubespray/playbooks/legacy_groups.yml

- name: Gather facts
  tags: always
  import_playbook: kubespray/playbooks/facts.yml

- name: Handle upgrades to control plane hosts
  gather_facts: False
  hosts: kube_control_plane
  any_errors_fatal: "{{ any_errors_fatal | default(true) }}"
  environment: "{{ proxy_disable_env }}"
  serial: 1
  roles:
    - { role: kubespray-defaults }
    - { role: upgrade/pre-upgrade, tags: pre-upgrade }
    - { role: system-upgrade, tags: system-upgrade }
    - { role: upgrade/post-upgrade, tags: post-upgrade }

- name: Handle upgrades to worker hosts
  hosts: kube_node:calico_rr:!kube_control_plane
  gather_facts: False
  any_errors_fatal: "{{ any_errors_fatal | default(true) }}"
  environment: "{{ proxy_disable_env }}"
  serial: "{{ serial | default('20%') }}"
  roles:
    - { role: kubespray-defaults }
    - { role: upgrade/pre-upgrade, tags: pre-upgrade }
    - { role: system-upgrade, tags: system-upgrade }
    - { role: upgrade/post-upgrade, tags: post-upgrade }

I’m taking advantage of the existing roles used by upgrade-cluster.yml to cordon/drain/uncordon, in addition to your system-upgrade role, so as to only perform a system upgrade on my nodes.


Here is how I run it:

# Upgrade control plane hosts system
ansible-playbook -u node-user -b --become-user=root -i exec/inventory -e "@exec/kubespray-values.yaml" setup/upgrade-os.yaml --limit=kube_control_plane,etcd

# Upgrade worker hosts system
ansible-playbook -u node-user -b --become-user=root -i exec/inventory -e "@exec/kubespray-values.yaml" setup/upgrade-os.yaml --limit=kube_node

But the play stays stuck on the “YUM upgrade all packages” task. I ssh'ed into the node after waiting more than 45 minutes for the task to end, only to find that the following command reported nothing but itself:

ps -aux | grep ansible

Also, running:

sudo yum update

Gave me the following result:

Last metadata expiration check: 1:38:59 ago on Wed 09 Aug 2023 11:59:58 AM UTC.
Dependencies resolved.
Nothing to do.
Complete!

So, the yum update did complete, but the results were never reported to the ansible play run.


After killing the play, I went ahead and tried to run the yum module through ansible directly on one host only with the following command:

ansible all -u node-user -b --become-user=root -i exec/inventory -m yum -a 'name=* state=latest' -vvvv --limit=worker1

Here is the output (redacted):

ansible [core 2.12.5]
  config file = /home/nicolas/test-upgrade-os/ansible.cfg
  configured module search path = ['/home/nicolas/.ansible/plugins/modules', '/usr/share/ansible/plugins/modules']
  ansible python module location = /home/nicolas/test-upgrade-os/config/venv/lib64/python3.8/site-packages/ansible
  ansible collection location = /home/nicolas/.ansible/collections:/usr/share/ansible/collections
  executable location = ./config/venv/bin/ansible
  python version = 3.8.16 (default, Jun 25 2023, 05:53:51) [GCC 8.5.0 20210514 (Red Hat 8.5.0-18)]
  jinja version = 3.1.2
  libyaml = True
Using /home/nicolas/test-upgrade-os/ansible.cfg as config file
setting up inventory plugins
host_list declined parsing /home/nicolas/test-upgrade-os/exec/inventory as it did not pass its verify_file() method
script declined parsing /home/nicolas/test-upgrade-os/exec/inventory as it did not pass its verify_file() method
auto declined parsing /home/nicolas/test-upgrade-os/exec/inventory as it did not pass its verify_file() method
Parsed /home/nicolas/test-upgrade-os/exec/inventory inventory source with ini plugin
Loading callback plugin minimal of type stdout, v2.0 from /home/nicolas/test-upgrade-os/config/venv/lib64/python3.8/site-packages/ansible/plugins/callback/minimal.py
Skipping callback 'default', as we already have a stdout callback.
Skipping callback 'minimal', as we already have a stdout callback.
Skipping callback 'oneline', as we already have a stdout callback.
META: ran handlers
<10.10.0.101> ESTABLISH SSH CONNECTION FOR USER: node-user
<10.10.0.101> SSH: EXEC ssh -vvv -C -o ControlMaster=auto -o ControlPersist=60s -o KbdInteractiveAuthentication=no -o PreferredAuthentications=gssapi-with-mic,gssapi-keyex,hostbased,publickey -o PasswordAuthentication=no -o 'User="node-user"' -o ConnectTimeout=10 -q -o UserKnownHostsFile=ssh/known_hosts -i ssh/node-user -o 'ProxyCommand=ssh -q -o UserKnownHostsFile=ssh/known_hosts -i ssh/bastion-user -W %h:%p -p22 bastion-user@W.X.Y.Z' -o 'ControlPath="/home/nicolas/test-upgrade-os/config/ansible/cp/09896940d7"' 10.10.0.101 '/bin/sh -c '"'"'echo ~node-user && sleep 0'"'"''
<10.10.0.101> (0, b'/home/node-user\n', b'')
<10.10.0.101> ESTABLISH SSH CONNECTION FOR USER: node-user
<10.10.0.101> SSH: EXEC ssh -vvv -C -o ControlMaster=auto -o ControlPersist=60s -o KbdInteractiveAuthentication=no -o PreferredAuthentications=gssapi-with-mic,gssapi-keyex,hostbased,publickey -o PasswordAuthentication=no -o 'User="node-user"' -o ConnectTimeout=10 -q -o UserKnownHostsFile=ssh/known_hosts -i ssh/node-user -o 'ProxyCommand=ssh -q -o UserKnownHostsFile=ssh/known_hosts -i ssh/bastion-user -W %h:%p -p22 bastion-user@W.X.Y.Z' -o 'ControlPath="/home/nicolas/test-upgrade-os/config/ansible/cp/09896940d7"' 10.10.0.101 '/bin/sh -c '"'"'( umask 77 && mkdir -p "` echo /home/node-user/.ansible/tmp `"&& mkdir "` echo /home/node-user/.ansible/tmp/ansible-tmp-1691583637.8116903-3768362-148267575047576 `" && echo ansible-tmp-1691583637.8116903-3768362-148267575047576="` echo /home/node-user/.ansible/tmp/ansible-tmp-1691583637.8116903-3768362-148267575047576 `" ) && sleep 0'"'"''
<10.10.0.101> (0, b'ansible-tmp-1691583637.8116903-3768362-148267575047576=/home/node-user/.ansible/tmp/ansible-tmp-1691583637.8116903-3768362-148267575047576\n', b'')
<worker1> Attempting python interpreter discovery
<10.10.0.101> ESTABLISH SSH CONNECTION FOR USER: node-user
<10.10.0.101> SSH: EXEC ssh -vvv -C -o ControlMaster=auto -o ControlPersist=60s -o KbdInteractiveAuthentication=no -o PreferredAuthentications=gssapi-with-mic,gssapi-keyex,hostbased,publickey -o PasswordAuthentication=no -o 'User="node-user"' -o ConnectTimeout=10 -q -o UserKnownHostsFile=ssh/known_hosts -i ssh/node-user -o 'ProxyCommand=ssh -q -o UserKnownHostsFile=ssh/known_hosts -i ssh/bastion-user -W %h:%p -p22 bastion-user@W.X.Y.Z' -o 'ControlPath="/home/nicolas/test-upgrade-os/config/ansible/cp/09896940d7"' 10.10.0.101 '/bin/sh -c '"'"'echo PLATFORM; uname; echo FOUND; command -v '"'"'"'"'"'"'"'"'python3.10'"'"'"'"'"'"'"'"'; command -v '"'"'"'"'"'"'"'"'python3.9'"'"'"'"'"'"'"'"'; command -v '"'"'"'"'"'"'"'"'python3.8'"'"'"'"'"'"'"'"'; command -v '"'"'"'"'"'"'"'"'python3.7'"'"'"'"'"'"'"'"'; command -v '"'"'"'"'"'"'"'"'python3.6'"'"'"'"'"'"'"'"'; command -v '"'"'"'"'"'"'"'"'python3.5'"'"'"'"'"'"'"'"'; command -v '"'"'"'"'"'"'"'"'/usr/bin/python3'"'"'"'"'"'"'"'"'; command -v '"'"'"'"'"'"'"'"'/usr/libexec/platform-python'"'"'"'"'"'"'"'"'; command -v '"'"'"'"'"'"'"'"'python2.7'"'"'"'"'"'"'"'"'; command -v '"'"'"'"'"'"'"'"'python2.6'"'"'"'"'"'"'"'"'; command -v '"'"'"'"'"'"'"'"'/usr/bin/python'"'"'"'"'"'"'"'"'; command -v '"'"'"'"'"'"'"'"'python'"'"'"'"'"'"'"'"'; echo ENDFOUND && sleep 0'"'"''
<10.10.0.101> (0, b'PLATFORM\nLinux\nFOUND\n/usr/libexec/platform-python\nENDFOUND\n', b'')
<10.10.0.101> ESTABLISH SSH CONNECTION FOR USER: node-user
<10.10.0.101> SSH: EXEC ssh -vvv -C -o ControlMaster=auto -o ControlPersist=60s -o KbdInteractiveAuthentication=no -o PreferredAuthentications=gssapi-with-mic,gssapi-keyex,hostbased,publickey -o PasswordAuthentication=no -o 'User="node-user"' -o ConnectTimeout=10 -q -o UserKnownHostsFile=ssh/known_hosts -i ssh/node-user -o 'ProxyCommand=ssh -q -o UserKnownHostsFile=ssh/known_hosts -i ssh/bastion-user -W %h:%p -p22 bastion-user@W.X.Y.Z' -o 'ControlPath="/home/nicolas/test-upgrade-os/config/ansible/cp/09896940d7"' 10.10.0.101 '/bin/sh -c '"'"'/usr/libexec/platform-python && sleep 0'"'"''
<10.10.0.101> (0, b'{"platform_dist_result": ["centos", "8.5", "Green Obsidian"], "osrelease_content": "NAME=\\"Rocky Linux\\"\\nVERSION=\\"8.5 (Green Obsidian)\\"\\nID=\\"rocky\\"\\nID_LIKE=\\"rhel centos fedora\\"\\nVERSION_ID=\\"8.5\\"\\nPLATFORM_ID=\\"platform:el8\\"\\nPRETTY_NAME=\\"Rocky Linux 8.5 (Green Obsidian)\\"\\nANSI_COLOR=\\"0;32\\"\\nCPE_NAME=\\"cpe:/o:rocky:rocky:8:GA\\"\\nHOME_URL=\\"https://rockylinux.org/\\"\\nBUG_REPORT_URL=\\"https://bugs.rockylinux.org/\\"\\nROCKY_SUPPORT_PRODUCT=\\"Rocky Linux\\"\\nROCKY_SUPPORT_PRODUCT_VERSION=\\"8\\"\\n"}\n', b'')
Using module file /home/nicolas/test-upgrade-os/config/venv/lib64/python3.8/site-packages/ansible/modules/setup.py
<10.10.0.101> PUT /home/nicolas/test-upgrade-os/config/ansible/tmp/ansible-local-3768356wtqis0tq/tmpy4qpsqz0 TO /home/node-user/.ansible/tmp/ansible-tmp-1691583637.8116903-3768362-148267575047576/AnsiballZ_setup.py
<10.10.0.101> SSH: EXEC sftp -b - -vvv -C -o ControlMaster=auto -o ControlPersist=60s -o KbdInteractiveAuthentication=no -o PreferredAuthentications=gssapi-with-mic,gssapi-keyex,hostbased,publickey -o PasswordAuthentication=no -o 'User="node-user"' -o ConnectTimeout=10 -q -o UserKnownHostsFile=ssh/known_hosts -i ssh/node-user -o 'ProxyCommand=ssh -q -o UserKnownHostsFile=ssh/known_hosts -i ssh/bastion-user -W %h:%p -p22 bastion-user@W.X.Y.Z' -o 'ControlPath="/home/nicolas/test-upgrade-os/config/ansible/cp/09896940d7"' '[10.10.0.101]'
<10.10.0.101> (0, b'sftp> put /home/nicolas/test-upgrade-os/config/ansible/tmp/ansible-local-3768356wtqis0tq/tmpy4qpsqz0 /home/node-user/.ansible/tmp/ansible-tmp-1691583637.8116903-3768362-148267575047576/AnsiballZ_setup.py\n', b'')
<10.10.0.101> ESTABLISH SSH CONNECTION FOR USER: node-user
<10.10.0.101> SSH: EXEC ssh -vvv -C -o ControlMaster=auto -o ControlPersist=60s -o KbdInteractiveAuthentication=no -o PreferredAuthentications=gssapi-with-mic,gssapi-keyex,hostbased,publickey -o PasswordAuthentication=no -o 'User="node-user"' -o ConnectTimeout=10 -q -o UserKnownHostsFile=ssh/known_hosts -i ssh/node-user -o 'ProxyCommand=ssh -q -o UserKnownHostsFile=ssh/known_hosts -i ssh/bastion-user -W %h:%p -p22 bastion-user@W.X.Y.Z' -o 'ControlPath="/home/nicolas/test-upgrade-os/config/ansible/cp/09896940d7"' 10.10.0.101 '/bin/sh -c '"'"'chmod u+x /home/node-user/.ansible/tmp/ansible-tmp-1691583637.8116903-3768362-148267575047576/ /home/node-user/.ansible/tmp/ansible-tmp-1691583637.8116903-3768362-148267575047576/AnsiballZ_setup.py && sleep 0'"'"''
<10.10.0.101> (0, b'', b'')
<10.10.0.101> ESTABLISH SSH CONNECTION FOR USER: node-user
<10.10.0.101> SSH: EXEC ssh -vvv -C -o ControlMaster=auto -o ControlPersist=60s -o KbdInteractiveAuthentication=no -o PreferredAuthentications=gssapi-with-mic,gssapi-keyex,hostbased,publickey -o PasswordAuthentication=no -o 'User="node-user"' -o ConnectTimeout=10 -q -o UserKnownHostsFile=ssh/known_hosts -i ssh/node-user -o 'ProxyCommand=ssh -q -o UserKnownHostsFile=ssh/known_hosts -i ssh/bastion-user -W %h:%p -p22 bastion-user@W.X.Y.Z' -o 'ControlPath="/home/nicolas/test-upgrade-os/config/ansible/cp/09896940d7"' -tt 10.10.0.101 '/bin/sh -c '"'"'sudo -H -S -n  -u root /bin/sh -c '"'"'"'"'"'"'"'"'echo BECOME-SUCCESS-ztvxikfxzuzwogfymzcnlpfaroxhooqg ; /usr/libexec/platform-python /home/node-user/.ansible/tmp/ansible-tmp-1691583637.8116903-3768362-148267575047576/AnsiballZ_setup.py'"'"'"'"'"'"'"'"' && sleep 0'"'"''
Escalation succeeded
<10.10.0.101> (0, b'\r\n{"ansible_facts": {"ansible_pkg_mgr": "dnf"}, "invocation": {"module_args": {"filter": ["ansible_pkg_mgr"], "gather_subset": ["!all"], "gather_timeout": 10, "fact_path": "/etc/ansible/facts.d"}}}\r\n', b'')
Running ansible.legacy.dnf as the backend for the yum action plugin
Using module file /home/nicolas/test-upgrade-os/config/venv/lib64/python3.8/site-packages/ansible/modules/dnf.py
<10.10.0.101> PUT /home/nicolas/test-upgrade-os/config/ansible/tmp/ansible-local-3768356wtqis0tq/tmpomw666d5 TO /home/node-user/.ansible/tmp/ansible-tmp-1691583637.8116903-3768362-148267575047576/AnsiballZ_dnf.py
<10.10.0.101> SSH: EXEC sftp -b - -vvv -C -o ControlMaster=auto -o ControlPersist=60s -o KbdInteractiveAuthentication=no -o PreferredAuthentications=gssapi-with-mic,gssapi-keyex,hostbased,publickey -o PasswordAuthentication=no -o 'User="node-user"' -o ConnectTimeout=10 -q -o UserKnownHostsFile=ssh/known_hosts -i ssh/node-user -o 'ProxyCommand=ssh -q -o UserKnownHostsFile=ssh/known_hosts -i ssh/bastion-user -W %h:%p -p22 bastion-user@W.X.Y.Z' -o 'ControlPath="/home/nicolas/test-upgrade-os/config/ansible/cp/09896940d7"' '[10.10.0.101]'
<10.10.0.101> (0, b'sftp> put /home/nicolas/test-upgrade-os/config/ansible/tmp/ansible-local-3768356wtqis0tq/tmpomw666d5 /home/node-user/.ansible/tmp/ansible-tmp-1691583637.8116903-3768362-148267575047576/AnsiballZ_dnf.py\n', b'')
<10.10.0.101> ESTABLISH SSH CONNECTION FOR USER: node-user
<10.10.0.101> SSH: EXEC ssh -vvv -C -o ControlMaster=auto -o ControlPersist=60s -o KbdInteractiveAuthentication=no -o PreferredAuthentications=gssapi-with-mic,gssapi-keyex,hostbased,publickey -o PasswordAuthentication=no -o 'User="node-user"' -o ConnectTimeout=10 -q -o UserKnownHostsFile=ssh/known_hosts -i ssh/node-user -o 'ProxyCommand=ssh -q -o UserKnownHostsFile=ssh/known_hosts -i ssh/bastion-user -W %h:%p -p22 bastion-user@W.X.Y.Z' -o 'ControlPath="/home/nicolas/test-upgrade-os/config/ansible/cp/09896940d7"' 10.10.0.101 '/bin/sh -c '"'"'chmod u+x /home/node-user/.ansible/tmp/ansible-tmp-1691583637.8116903-3768362-148267575047576/ /home/node-user/.ansible/tmp/ansible-tmp-1691583637.8116903-3768362-148267575047576/AnsiballZ_dnf.py && sleep 0'"'"''
<10.10.0.101> (0, b'', b'')
<10.10.0.101> ESTABLISH SSH CONNECTION FOR USER: node-user
<10.10.0.101> SSH: EXEC ssh -vvv -C -o ControlMaster=auto -o ControlPersist=60s -o KbdInteractiveAuthentication=no -o PreferredAuthentications=gssapi-with-mic,gssapi-keyex,hostbased,publickey -o PasswordAuthentication=no -o 'User="node-user"' -o ConnectTimeout=10 -q -o UserKnownHostsFile=ssh/known_hosts -i ssh/node-user -o 'ProxyCommand=ssh -q -o UserKnownHostsFile=ssh/known_hosts -i ssh/bastion-user -W %h:%p -p22 bastion-user@W.X.Y.Z' -o 'ControlPath="/home/nicolas/test-upgrade-os/config/ansible/cp/09896940d7"' -tt 10.10.0.101 '/bin/sh -c '"'"'sudo -H -S -n  -u root /bin/sh -c '"'"'"'"'"'"'"'"'echo BECOME-SUCCESS-gjdfwphkqonajiudmalgairdspobkjad ; /usr/libexec/platform-python /home/node-user/.ansible/tmp/ansible-tmp-1691583637.8116903-3768362-148267575047576/AnsiballZ_dnf.py'"'"'"'"'"'"'"'"' && sleep 0'"'"''
Escalation succeeded

Before running ansible, I ssh'ed in the node and ran:

watch "ps -aux | grep ansible"

While ansible was performing the yum update, I saw that the process /usr/libexec/platform-python /home/node-user/.ansible/tmp/ansible-tmp-1691583637.8116903-3768362-148267575047576/AnsiballZ_dnf.py ran for about 10-15 minutes, and after it had disappeared, ansible kept running for more than an hour before failing with the following error:

worker1 | UNREACHABLE! => {
    "changed": false,
    "msg": "Failed to connect to the host via ssh: ",
    "unreachable": true
}

I was afraid that something related to the SSH server had been updated and broke ansible, but my watch is still running as I write this issue. And anyway, such changes would only have taken effect after a reboot, so existing SSH connections wouldn’t have been dropped.

I think this is an ansible issue, but I’m not really sure. Would you have any insight into why this is happening and how it could be fixed?
I would be happy to file a PR, but can’t get my head around this atm.
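For what it's worth, one mitigation for long-running upgrades over fragile SSH sessions is Ansible's async/poll mechanism, which detaches the task from the connection; a hedged sketch, not something this PR uses:

- name: YUM | Upgrade all packages in the background  # noqa package-latest
  ansible.builtin.yum:
    name: '*'
    state: latest
  async: 3600  # allow the detached task up to an hour
  poll: 30     # reconnect and poll its status every 30 seconds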


PS: I tried using the dnf and package modules, which gave the exact same results.
PS2: Out of curiosity, I tried updating a single package (tar), and it worked with the yum, dnf and package modules.
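For reference, the single-package variant that did work was equivalent to this (illustrative sketch):

- name: YUM | Update a single package
  ansible.builtin.yum:
    name: tar
    state: latest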

@nicolas-goudry (Contributor) commented Aug 9, 2023

I need to do some further testing, but I think I found an acceptable workaround. See master...nicolas-goudry:kubespray:fix/yum-system-upgrade.

@sathieu @MrFreezeex what do you think about that?


Edit: after some tests, this workaround needs further tweaking and testing. I’ll let you know how it goes.

Edit 2:

Something really weird is happening with this workaround.

When I run the upgrade-os.yaml play, the yum module hangs and results in a fatal: [master2]: UNREACHABLE! => {"changed": false, "msg": "Failed to connect to the host via ssh: ", "unreachable": true} error.

But, when I update the yum.yml file to transform it into a playbook like so:

- hosts: all
  gather_facts: no
  tasks:
    # Workaround to whole system update not working with yum: name=*
    - name: YUM | Get available package updates
      yum:
        list: updates
      register: yum_available_package_updates

    - name: YUM | Debug packages to update
      debug:
        msg: "{{ yum_available_package_updates.results | map(attribute='name') | list }}"

    - name: YUM | Update packages  # noqa package-latest
      yum:
        name: "{{ yum_available_package_updates.results | map(attribute='name') | list }}"
        state: 'latest'
      register: yum_upgrade
#
#    - name: YUM | Reboot after packages updates  # noqa no-handler
#      when:
#      - yum_upgrade.changed or system_upgrade_reboot == 'always'
#      - system_upgrade_reboot != 'never'
#      reboot:

And run it with the following command:

ansible-playbook -u node-user -b --become-user=root -i exec/inventory setup/system-upgrade/tasks/yum.yml --limit=master3

It works…

@yankay yankay mentioned this pull request Aug 24, 2023
@sathieu (Contributor, Author) commented Sep 15, 2023

@nicolas-goudry Sorry for the late response. I don't use Rocky Linux or any RPM-based distro; I took the code from the yum module docs.

Could you propose a PR with your improvement?

@nicolas-goudry (Contributor) commented:

@sathieu I still have some weird issues with this workaround so I’m not sure it should land in Kubespray just yet.

I think it’s something related to Python version discrepancies between the control node and the managed nodes. I'm not quite sure, though. I’ll have to do more tests, but I’m struggling to find the time to do so. I’ll keep you posted.

pedro-peter pushed a commit to pedro-peter/kubespray that referenced this pull request May 8, 2024