Reminder: If any problems are encountered and the procedure or command output does not provide relevant guidance, see Relevant troubleshooting links for upgrade-related issues.
- Start typescript on `ncn-m001`
- Stage 2.1 - Master node image upgrade
- Argo workflows
- Stage 2.2 - Worker node image upgrade
- Stage 2.3 - `ncn-m001` upgrade
- Stage 2.4 - Upgrade `weave` and `multus`
- Stage 2.5 - `coredns` anti-affinity
- Stage 2.6 - Complete Kubernetes upgrade
- Stop typescript on `ncn-m002`
- Stage completed
1. (`ncn-m001#`) If a typescript session is already running in the shell, then first stop it with the `exit` command.

1. (`ncn-m001#`) Start a typescript.

    ```bash
    script -af /root/csm_upgrade.$(date +%Y%m%d_%H%M%S).stage_2_ncn-m001.txt
    export PS1='\u@\H \D{%Y-%m-%d} \t \w # '
    ```

    If additional shells are opened during this procedure, then record those with typescripts as well. When resuming a procedure after a break, always be sure that a typescript is running before proceeding.
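    If it is unclear whether the current shell is already being recorded, one rough check (an assumption of this guide, not part of the upgrade procedure) is to look for a running `script` process:

    ```bash
    # If this prints nothing, no typescript is recording; start one before proceeding.
    pgrep -a script
    ```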
1. (`ncn-m001#`) Run `ncn-upgrade-master-nodes.sh` for `ncn-m002`. Follow the output of the script carefully. The script will pause for manual interaction.

    ```bash
    /usr/share/doc/csm/upgrade/scripts/upgrade/ncn-upgrade-master-nodes.sh ncn-m002
    ```

    **NOTE** The `root` user password for the node may need to be reset after it is rebooted.

1. Repeat the previous step for each other master node, excluding `ncn-m001`, one at a time.
Before starting Stage 2.2 - Worker node image upgrade, access the Argo UI to view the progress of this stage. Note that progress for the current stage will not appear in Argo until the worker node image upgrade script has been started.

For more information, see Using the Argo UI and Using Argo Workflows.
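If the Argo UI is not convenient, a minimal command-line alternative (assuming Argo runs in the `argo` namespace, as is typical for CSM) is to watch the workflow objects directly:

```bash
# Watch Argo workflow progress from the CLI instead of the UI.
kubectl -n argo get workflows --watch
```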
**NOTE** One of the Argo steps (`wait-for-cfs`) will prevent the upgrade of a worker node from proceeding if the CFS component status for that worker is in an `Error` state, and this must be fixed in order for the upgrade to continue. The following commands can be used to reset the component state in CFS (replace `<XNAME>` with the xname of the worker node):

```bash
cray cfs components update --error-count 0 <XNAME>
cray cfs components update --state '[]' <XNAME>
```
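Before resetting the state, it may help to inspect the component first. A minimal sketch, assuming the standard Cray CLI for CFS:

```bash
# Inspect the CFS component status for the worker; replace <XNAME> as above.
cray cfs components describe <XNAME> --format json
```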
There are two options available for upgrading worker nodes: upgrading them one at a time, or upgrading multiple workers simultaneously.
1. (`ncn-m001#`) Run `ncn-upgrade-worker-storage-nodes.sh` for `ncn-w001`. Follow the output of the script carefully. The script will pause for manual interaction.

    ```bash
    /usr/share/doc/csm/upgrade/scripts/upgrade/ncn-upgrade-worker-storage-nodes.sh ncn-w001
    ```

    **NOTE** The `root` user password for the node may need to be reset after it is rebooted.

1. Repeat the previous steps for each other worker node, one at a time.
Multiple workers can be upgraded simultaneously by passing them as a comma-separated list to the upgrade script.

In some cases, it is not possible to upgrade all workers in one request. It is the system administrator's responsibility to make sure that the following conditions are met:

- If the system has more than five workers, then they cannot all be upgraded with a single request. In this case, the upgrade should be split into multiple requests, with each request specifying no more than five workers; see the batching sketch below.
- No single upgrade request should include all of the worker nodes that have DVS running on them.

(`ncn-m001#`) An example of a single request to upgrade multiple worker nodes simultaneously:

```bash
/usr/share/doc/csm/upgrade/scripts/upgrade/ncn-upgrade-worker-storage-nodes.sh ncn-w002,ncn-w003,ncn-w004
```
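The following is a hypothetical batching sketch, not part of the shipped scripts. It splits a worker list into groups of at most five and upgrades each group in turn; the `WORKERS` list is an example and must be adjusted to the system, taking care that no single batch contains every DVS worker:

```bash
# Hypothetical helper: upgrade workers in batches of at most five nodes.
WORKERS=(ncn-w002 ncn-w003 ncn-w004 ncn-w005 ncn-w006 ncn-w007)
BATCH=5
for ((i = 0; i < ${#WORKERS[@]}; i += BATCH)); do
    # Join the next slice of the array into a comma-separated list.
    group=$(IFS=,; echo "${WORKERS[*]:i:BATCH}")
    /usr/share/doc/csm/upgrade/scripts/upgrade/ncn-upgrade-worker-storage-nodes.sh "${group}"
done
```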
By this point, all NCNs have been upgraded, except for `ncn-m001`. In the upgrade process so far, `ncn-m001` has been the "stable node" -- that is, the node from which the other nodes were upgraded. At this point, the upgrade procedure pivots to use `ncn-m002` as the new "stable node", in order to allow the upgrade of `ncn-m001`.

For any typescripts that were started earlier on `ncn-m001`, stop them with the `exit` command.
1. (`ncn-m001#`) Create an archive of the artifacts.

    ```bash
    BACKUP_TARFILE="csm_upgrade.pre_m001_reboot_artifacts.$(date +%Y%m%d_%H%M%S).tgz"
    ls -d \
        /root/apply_csm_configuration.* \
        /root/csm_upgrade.* \
        /root/output.log 2>/dev/null |
    sed 's_^/__' |
    xargs tar -C / -czvf "/root/${BACKUP_TARFILE}"
    ```

1. (`ncn-m001#`) Upload the archive to S3 in the cluster.

    ```bash
    cray artifacts create config-data "${BACKUP_TARFILE}" "/root/${BACKUP_TARFILE}"
    ```
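    Optionally, confirm the upload succeeded. This check is an addition of this guide (not required by the procedure) and uses the standard `cray artifacts` CLI:

    ```bash
    # List the config-data bucket and confirm the backup tarfile is present.
    cray artifacts list config-data --format json | grep "${BACKUP_TARFILE}"
    ```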
1. Log out of `ncn-m001`.

1. Log in to `ncn-m002` from outside the cluster.

    **NOTE** Very rarely, a password hash for the `root` user that works properly on a SLES SP2 NCN is not recognized on a SLES SP3 NCN. If password login fails, then log in to `ncn-m002` from `ncn-m001` and use the `passwd` command to reset the password. Then log in using the CMN IP address as directed below. Once `ncn-m001` has been upgraded, log in from `ncn-m002` and use the `passwd` command to reset the password. The other NCNs will have their passwords updated when NCN personalization is run in a subsequent step.

    `ssh` to the `bond0.cmn0`/CMN IP address of `ncn-m002`.
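    If the CMN IP address of `ncn-m002` is not already known, one way to look it up before logging out of `ncn-m001` (an assumption of this guide; it requires `ssh` access over the internal network and the `iproute2` tools) is:

    ```bash
    # From ncn-m001: print the IPv4 address on the CMN VLAN interface of ncn-m002.
    ssh ncn-m002 ip -4 -brief addr show bond0.cmn0
    ```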
1. (`ncn-m002#`) Start a typescript.

    ```bash
    script -af /root/csm_upgrade.$(date +%Y%m%d_%H%M%S).stage_2_ncn-m002.txt
    export PS1='\u@\H \D{%Y-%m-%d} \t \w # '
    ```

1. Authenticate with the Cray CLI on `ncn-m002`.

    See Configure the Cray Command Line Interface for details on how to do this.

1. (`ncn-m002#`) Set upgrade variables.

    ```bash
    source /etc/cray/upgrade/csm/myenv
    echo "${CSM_REL_NAME}"
    ```
1. (`ncn-m002#`) Copy artifacts from `ncn-m001`.

    A later stage of the upgrade expects the `docs-csm` RPM to be located at `/root/docs-csm-latest.noarch.rpm` on `ncn-m002`; that is why this command copies it there.

    ```bash
    scp ncn-m001:/root/csm_upgrade.pre_m001_reboot_artifacts.*.tgz /root
    csi_rpm=$(find "/etc/cray/upgrade/csm/${CSM_REL_NAME}/tarball/${CSM_REL_NAME}/rpm/cray/csm/" -name 'cray-site-init*.rpm') &&
        scp ncn-m001:/root/docs-csm-*.noarch.rpm /root/docs-csm-latest.noarch.rpm &&
        rpm -Uvh --force "${csi_rpm}" /root/docs-csm-latest.noarch.rpm
    ```
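    As an optional sanity check (an addition of this guide), verify that both RPMs are now installed:

    ```bash
    # Confirm the cray-site-init and docs-csm packages installed cleanly on ncn-m002.
    rpm -q cray-site-init docs-csm
    ```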
1. Upgrade `ncn-m001`.

    ```bash
    /usr/share/doc/csm/upgrade/scripts/upgrade/ncn-upgrade-master-nodes.sh ncn-m001
    ```
Run the following command to complete the upgrade of the `weave` and `multus` manifest versions:

```bash
/srv/cray/scripts/common/apply-networking-manifests.sh
```
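To confirm the manifests applied, one optional check (assuming the standard `weave` and `multus` DaemonSet names in the `kube-system` namespace) is to verify that the DaemonSets have fully rolled out:

```bash
# DESIRED, CURRENT, READY, and UP-TO-DATE should match for each DaemonSet.
kubectl -n kube-system get daemonsets | grep -E 'weave|multus'
```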
Run the following script to apply anti-affinity to `coredns` pods:

```bash
/usr/share/doc/csm/upgrade/scripts/k8s/apply-coredns-pod-affinity.sh
```
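To verify the anti-affinity took effect, check that the `coredns` pods are scheduled on different nodes (this assumes the standard `k8s-app=kube-dns` label used by upstream CoreDNS deployments):

```bash
# The NODE column should show the coredns pods spread across distinct nodes.
kubectl -n kube-system get pods -l k8s-app=kube-dns -o wide
```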
Complete the Kubernetes upgrade. This script will restart several pods on each master node so that they run in their new Docker containers.

```bash
/usr/share/doc/csm/upgrade/scripts/k8s/upgrade_control_plane.sh
```

**NOTE**: `kubelet` has already been upgraded; ignore the warning to upgrade it.
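After the script completes, a standard `kubectl` check (an addition of this guide, not output of the script) confirms that every node is healthy and reports the expected versions:

```bash
# All nodes should show STATUS Ready and the upgraded kubelet VERSION.
kubectl get nodes -o wide
```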
For any typescripts that were started during this stage on `ncn-m002`, stop them with the `exit` command.
All Kubernetes nodes have been rebooted into the new image.
**REMINDER:** If the password for `ncn-m002` was reset during Stage 2.3, then also reset the password on `ncn-m001` at this time.
This stage is completed. Continue to Stage 3.