Description
Opened on Aug 9, 2024
The issue
Initiated an upgrade of a standalone (on VMs) k0s cluster on AWS infrastructure from v1.29.6+k0s.0 to v1.30.3+k0s.0. The upgrade was initiated by changing `.spec.version` in the `K0sControlPlane` resource as well as `.spec.template.spec.version` in the `K0sWorkerConfigTemplate`.
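For reference, the change amounts to bumping the version field in both resources. A minimal sketch of the two edits (the worker config template name here is hypothetical; all other fields are elided):

```yaml
apiVersion: controlplane.cluster.x-k8s.io/v1beta1
kind: K0sControlPlane
metadata:
  name: aws-cl-1-cp
spec:
  version: v1.30.3+k0s.0   # was v1.29.6+k0s.0
---
apiVersion: bootstrap.cluster.x-k8s.io/v1beta1
kind: K0sWorkerConfigTemplate
metadata:
  name: aws-cl-1-workers    # hypothetical name
spec:
  template:
    spec:
      version: v1.30.3+k0s.0   # was v1.29.6+k0s.0
```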
The CR statuses are updated with the new version (which is counterintuitive, since nothing has actually happened yet), for example:

```console
$ kubectl get k0scontrolplane aws-cl-1-cp -o jsonpath="{.status.version}"
v1.30.3+k0s.0
```
But the actual cluster remains on the previous version and is stuck in this state indefinitely.
Root cause
After some analysis I noticed that autopilot's `Plan` resource has the wrong node names in it:
```yaml
apiVersion: autopilot.k0sproject.io/v1beta2
kind: Plan
metadata:
  creationTimestamp: "2024-08-09T16:38:26Z"
  generation: 1
  name: autopilot
  resourceVersion: "4594"
  uid: 0563012c-9f57-48f6-99e3-d624cdc7fb9c
spec:
  commands:
  - k0supdate:
      platforms:
        linux-amd64:
          url: https://get.k0sproject.io/v1.30.3+k0s.0/k0s-v1.30.3+k0s.0-amd64
        linux-arm:
          url: https://get.k0sproject.io/v1.30.3+k0s.0/k0s-v1.30.3+k0s.0-arm
        linux-arm64:
          url: https://get.k0sproject.io/v1.30.3+k0s.0/k0s-v1.30.3+k0s.0-arm64
      targets:
        controllers:
          discovery:
            static:
              nodes:
              - aws-cl-1-cp-1
              - aws-cl-1-cp-2
              - aws-cl-1-cp-0
          limits:
            concurrent: 1
      version: v1.30.3+k0s.0
  id: id-aws-cl-1-cp-1723221506
  timestamp: "1723221506"
```
So it uses the node names that CAPI gave to the `Machine` resources:

```console
NAME                      CLUSTER    NODENAME                                    PROVIDERID                              PHASE     AGE   VERSION
aws-cl-1-cp-0             aws-cl-1   ip-10-0-90-234.us-west-1.compute.internal   aws:///us-west-1b/i-00811729183d98794   Running   39m   v1.30.3
aws-cl-1-cp-1             aws-cl-1   ip-10-0-81-251.us-west-1.compute.internal   aws:///us-west-1b/i-0a8274ae033847b4a   Running   39m   v1.30.3
aws-cl-1-cp-2             aws-cl-1   ip-10-0-75-41.us-west-1.compute.internal    aws:///us-west-1b/i-0116f1582637d2b7f   Running   39m   v1.30.3
aws-cl-1-md-jgtpl-jlm64   aws-cl-1   ip-10-0-91-155.us-west-1.compute.internal   aws:///us-west-1b/i-0f59b5d169e8a88b6   Running   39m
```
And not the names that are actually present in the cluster:

```console
NAME                                        STATUS   ROLES           AGE   VERSION
ip-10-0-75-41.us-west-1.compute.internal    Ready    control-plane   29m   v1.29.6+k0s
ip-10-0-81-251.us-west-1.compute.internal   Ready    control-plane   18m   v1.29.6+k0s
ip-10-0-90-234.us-west-1.compute.internal   Ready    control-plane   30m   v1.29.6+k0s
ip-10-0-91-155.us-west-1.compute.internal   Ready    <none>          25m   v1.29.6+k0s
```
This is confirmed by the following error in the k0s log:

```text
Aug 09 16:41:14 ip-10-0-81-251.us-west-1.compute.internal k0s[4170]: time="2024-08-09 16:41:14" level=info msg="starting to cordon node aws-cl-1-cp-1" component=autopilot controller=ControlNode leadermode=false object=ControlNode reconciler=cordoning signalnode=aws-cl-1-cp-1
Aug 09 16:41:14 ip-10-0-81-251.us-west-1.compute.internal k0s[4170]: time="2024-08-09 16:41:14" level=error msg="Reconciler error" ControlNode="{aws-cl-1-cp-1 }" component=controller-runtime controller=controlnode controllerGroup=autopilot.k0sproject.io controllerKind=ControlNode error="failed to get node: Node \"aws-cl-1-cp-1\" not found" name=aws-cl-1-cp-1 namespace= reconcileID="\"b20149ec-44b0-4b69-b45a-f71ec1f056f5\""
```
Conclusion
k0smotron should use the `.status.nodeRef.name` of the `Machine` resource as the node name in the `Plan`, not its `.metadata.name` (or whatever is currently used).
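The proposed lookup could be sketched roughly as follows. This is illustrative only: the type definitions are minimal stand-ins for the real Cluster API types (`sigs.k8s.io/cluster-api/api/v1beta1` and `k8s.io/api/core/v1`), and `nodeNameForPlan` is a hypothetical helper name, not an existing k0smotron function.

```go
package main

import "fmt"

// Minimal stand-ins for the Cluster API types involved (illustrative only).
type ObjectReference struct{ Name string }
type MachineStatus struct{ NodeRef *ObjectReference }
type Machine struct {
	Name   string // .metadata.name, e.g. "aws-cl-1-cp-1"
	Status MachineStatus
}

// nodeNameForPlan returns the name autopilot should target: the Kubernetes
// node name recorded by CAPI in .status.nodeRef, erroring out when CAPI has
// not populated it yet (e.g. the node has not registered).
func nodeNameForPlan(m Machine) (string, error) {
	if m.Status.NodeRef == nil {
		return "", fmt.Errorf("machine %s has no nodeRef yet", m.Name)
	}
	return m.Status.NodeRef.Name, nil
}

func main() {
	m := Machine{
		Name: "aws-cl-1-cp-1",
		Status: MachineStatus{
			NodeRef: &ObjectReference{Name: "ip-10-0-81-251.us-west-1.compute.internal"},
		},
	}
	name, err := nodeNameForPlan(m)
	if err != nil {
		panic(err)
	}
	fmt.Println(name) // prints ip-10-0-81-251.us-west-1.compute.internal
}
```

With this, the `Plan` above would list `ip-10-0-81-251.us-west-1.compute.internal` instead of `aws-cl-1-cp-1`, matching what the cordoning reconciler actually finds in the cluster.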