Description
Opened on Aug 9, 2024
The issue
Initiated an upgrade of a standalone (on VMs) k0s cluster on AWS infrastructure from v1.29.6+k0s.0 to v1.30.3+k0s.0. The upgrade was initiated by changing `.spec.version` in the `K0sControlPlane` resource as well as `.spec.template.spec.version` in the `K0sWorkerConfigTemplate`.
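For reference, the change amounts to bumping the version field in both resources. A minimal sketch of the two edits (the worker config template name here is hypothetical; all other fields are elided):

```yaml
apiVersion: controlplane.cluster.x-k8s.io/v1beta1
kind: K0sControlPlane
metadata:
  name: aws-cl-1-cp
spec:
  version: v1.30.3+k0s.0   # was v1.29.6+k0s.0
---
apiVersion: bootstrap.cluster.x-k8s.io/v1beta1
kind: K0sWorkerConfigTemplate
metadata:
  name: aws-cl-1-workers    # hypothetical name
spec:
  template:
    spec:
      version: v1.30.3+k0s.0   # was v1.29.6+k0s.0
```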
The CR statuses are updated with the new version (which is counterintuitive, since nothing has actually happened yet), for example:

```console
$ kubectl get k0scontrolplane aws-cl-1-cp -o jsonpath="{.status.version}"
v1.30.3+k0s.0
```
But the actual cluster remains on the previous version and is stuck in this state indefinitely.
Root cause
After some analysis I noticed that autopilot's `Plan` resource has the wrong node names in it:
```yaml
apiVersion: autopilot.k0sproject.io/v1beta2
kind: Plan
metadata:
  creationTimestamp: "2024-08-09T16:38:26Z"
  generation: 1
  name: autopilot
  resourceVersion: "4594"
  uid: 0563012c-9f57-48f6-99e3-d624cdc7fb9c
spec:
  commands:
  - k0supdate:
      platforms:
        linux-amd64:
          url: https://get.k0sproject.io/v1.30.3+k0s.0/k0s-v1.30.3+k0s.0-amd64
        linux-arm:
          url: https://get.k0sproject.io/v1.30.3+k0s.0/k0s-v1.30.3+k0s.0-arm
        linux-arm64:
          url: https://get.k0sproject.io/v1.30.3+k0s.0/k0s-v1.30.3+k0s.0-arm64
      targets:
        controllers:
          discovery:
            static:
              nodes:
              - aws-cl-1-cp-1
              - aws-cl-1-cp-2
              - aws-cl-1-cp-0
          limits:
            concurrent: 1
      version: v1.30.3+k0s.0
  id: id-aws-cl-1-cp-1723221506
  timestamp: "1723221506"
```
So it uses the node names that CAPI gave to the `Machine` resources:

```console
NAME                      CLUSTER    NODENAME                                    PROVIDERID                              PHASE     AGE   VERSION
aws-cl-1-cp-0             aws-cl-1   ip-10-0-90-234.us-west-1.compute.internal   aws:///us-west-1b/i-00811729183d98794   Running   39m   v1.30.3
aws-cl-1-cp-1             aws-cl-1   ip-10-0-81-251.us-west-1.compute.internal   aws:///us-west-1b/i-0a8274ae033847b4a   Running   39m   v1.30.3
aws-cl-1-cp-2             aws-cl-1   ip-10-0-75-41.us-west-1.compute.internal    aws:///us-west-1b/i-0116f1582637d2b7f   Running   39m   v1.30.3
aws-cl-1-md-jgtpl-jlm64   aws-cl-1   ip-10-0-91-155.us-west-1.compute.internal   aws:///us-west-1b/i-0f59b5d169e8a88b6   Running   39m
```
And not the names that are actually present in the cluster:

```console
NAME                                        STATUS   ROLES           AGE   VERSION
ip-10-0-75-41.us-west-1.compute.internal    Ready    control-plane   29m   v1.29.6+k0s
ip-10-0-81-251.us-west-1.compute.internal   Ready    control-plane   18m   v1.29.6+k0s
ip-10-0-90-234.us-west-1.compute.internal   Ready    control-plane   30m   v1.29.6+k0s
ip-10-0-91-155.us-west-1.compute.internal   Ready    <none>          25m   v1.29.6+k0s
```
This is confirmed by the following error in the k0s log:

```text
Aug 09 16:41:14 ip-10-0-81-251.us-west-1.compute.internal k0s[4170]: time="2024-08-09 16:41:14" level=info msg="starting to cordon node aws-cl-1-cp-1" component=autopilot controller=ControlNode leadermode=false object=ControlNode reconciler=cordoning signalnode=aws-cl-1-cp-1
Aug 09 16:41:14 ip-10-0-81-251.us-west-1.compute.internal k0s[4170]: time="2024-08-09 16:41:14" level=error msg="Reconciler error" ControlNode="{aws-cl-1-cp-1 }" component=controller-runtime controller=controlnode controllerGroup=autopilot.k0sproject.io controllerKind=ControlNode error="failed to get node: Node \"aws-cl-1-cp-1\" not found" name=aws-cl-1-cp-1 namespace= reconcileID="\"b20149ec-44b0-4b69-b45a-f71ec1f056f5\""
```
Conclusion
k0smotron should use the `.status.nodeRef.name` of the `Machine` resource as the node name in the `Plan`, not its `.metadata.name` (or whatever is currently used).
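The proposed lookup could be sketched roughly as follows. This is illustrative only: the type definitions are minimal stand-ins for the real Cluster API types (`sigs.k8s.io/cluster-api/api/v1beta1` and `k8s.io/api/core/v1`), and `nodeNameForPlan` is a hypothetical helper name, not an existing k0smotron function.

```go
package main

import "fmt"

// Minimal stand-ins for the Cluster API types involved (illustrative only).
type ObjectReference struct{ Name string }
type MachineStatus struct{ NodeRef *ObjectReference }
type Machine struct {
	Name   string // .metadata.name, e.g. "aws-cl-1-cp-1"
	Status MachineStatus
}

// nodeNameForPlan returns the name autopilot should target: the Kubernetes
// node name recorded by CAPI in .status.nodeRef, erroring out when CAPI has
// not populated it yet (e.g. the node has not registered).
func nodeNameForPlan(m Machine) (string, error) {
	if m.Status.NodeRef == nil {
		return "", fmt.Errorf("machine %s has no nodeRef yet", m.Name)
	}
	return m.Status.NodeRef.Name, nil
}

func main() {
	m := Machine{
		Name: "aws-cl-1-cp-1",
		Status: MachineStatus{
			NodeRef: &ObjectReference{Name: "ip-10-0-81-251.us-west-1.compute.internal"},
		},
	}
	name, err := nodeNameForPlan(m)
	if err != nil {
		panic(err)
	}
	fmt.Println(name) // prints ip-10-0-81-251.us-west-1.compute.internal
}
```

With this, the `Plan` above would list `ip-10-0-81-251.us-west-1.compute.internal` instead of `aws-cl-1-cp-1`, matching what the cordoning reconciler actually finds in the cluster.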