Unable to upgrade standalone k0s cluster on AWS #670

Closed

Description

The issue

I initiated an upgrade of a standalone (VM-based) k0s cluster on AWS infrastructure from v1.29.6+k0s.0 to v1.30.3+k0s.0.

The upgrade was initiated by changing .spec.version in the K0sControlPlane resource and .spec.template.spec.version in the K0sWorkerConfigTemplate.

The CR statuses are already updated with the new version (which is counterintuitive, since nothing has happened yet), for example:

kubectl get k0scontrolplane aws-cl-1-cp -o jsonpath="{.status.version}"
v1.30.3+k0s.0

But the actual cluster remains on the previous version and is stuck in this state indefinitely.

Root cause

After some analysis I noticed that the autopilot Plan resource has the wrong node names in it:

  apiVersion: autopilot.k0sproject.io/v1beta2
  kind: Plan
  metadata:
    creationTimestamp: "2024-08-09T16:38:26Z"
    generation: 1
    name: autopilot
    resourceVersion: "4594"
    uid: 0563012c-9f57-48f6-99e3-d624cdc7fb9c
  spec:
    commands:
    - k0supdate:
        platforms:
          linux-amd64:
            url: https://get.k0sproject.io/v1.30.3+k0s.0/k0s-v1.30.3+k0s.0-amd64
          linux-arm:
            url: https://get.k0sproject.io/v1.30.3+k0s.0/k0s-v1.30.3+k0s.0-arm
          linux-arm64:
            url: https://get.k0sproject.io/v1.30.3+k0s.0/k0s-v1.30.3+k0s.0-arm64
        targets:
          controllers:
            discovery:
              static:
                nodes:
                - aws-cl-1-cp-1
                - aws-cl-1-cp-2
                - aws-cl-1-cp-0
            limits:
              concurrent: 1
        version: v1.30.3+k0s.0
    id: id-aws-cl-1-cp-1723221506
    timestamp: "1723221506"

So it uses the names assigned by CAPI (the NAME column below):

  NAME                      CLUSTER    NODENAME                                    PROVIDERID                              PHASE     AGE   VERSION
  aws-cl-1-cp-0             aws-cl-1   ip-10-0-90-234.us-west-1.compute.internal   aws:///us-west-1b/i-00811729183d98794   Running   39m   v1.30.3
  aws-cl-1-cp-1             aws-cl-1   ip-10-0-81-251.us-west-1.compute.internal   aws:///us-west-1b/i-0a8274ae033847b4a   Running   39m   v1.30.3
  aws-cl-1-cp-2             aws-cl-1   ip-10-0-75-41.us-west-1.compute.internal    aws:///us-west-1b/i-0116f1582637d2b7f   Running   39m   v1.30.3
  aws-cl-1-md-jgtpl-jlm64   aws-cl-1   ip-10-0-91-155.us-west-1.compute.internal   aws:///us-west-1b/i-0f59b5d169e8a88b6   Running   39m

And not the names that are actually present in the cluster:

  NAME                                        STATUS   ROLES           AGE   VERSION
  ip-10-0-75-41.us-west-1.compute.internal    Ready    control-plane   29m   v1.29.6+k0s
  ip-10-0-81-251.us-west-1.compute.internal   Ready    control-plane   18m   v1.29.6+k0s
  ip-10-0-90-234.us-west-1.compute.internal   Ready    control-plane   30m   v1.29.6+k0s
  ip-10-0-91-155.us-west-1.compute.internal   Ready    <none>          25m   v1.29.6+k0s

This is confirmed by the following error in the k0s log:

Aug 09 16:41:14 ip-10-0-81-251.us-west-1.compute.internal k0s[4170]: time="2024-08-09 16:41:14" level=info msg="starting to cordon node aws-cl-1-cp-1" component=autopilot controller=ControlNode leadermode=false object=ControlNode reconciler=cordoning signalnode=aws-cl-1-cp-1
Aug 09 16:41:14 ip-10-0-81-251.us-west-1.compute.internal k0s[4170]: time="2024-08-09 16:41:14" level=error msg="Reconciler error" ControlNode="{aws-cl-1-cp-1 }" component=controller-runtime controller=controlnode controllerGroup=autopilot.k0sproject.io controllerKind=ControlNode error="failed to get node: Node \"aws-cl-1-cp-1\" not found" name=aws-cl-1-cp-1 namespace= reconcileID="\"b20149ec-44b0-4b69-b45a-f71ec1f056f5\""
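
The mismatch can also be checked mechanically. The following is a minimal sketch, not part of k0smotron or autopilot: it compares the static node names from the Plan with the node names reported by the cluster, both copied from the output above. The function name `missing_plan_targets` is hypothetical.

```python
def missing_plan_targets(plan_nodes, cluster_nodes):
    """Return the Plan target names that match no node in the cluster."""
    known = set(cluster_nodes)
    return [name for name in plan_nodes if name not in known]


# Names from .spec.commands[0].k0supdate.targets.controllers.discovery.static.nodes
plan_nodes = ["aws-cl-1-cp-1", "aws-cl-1-cp-2", "aws-cl-1-cp-0"]

# Names from the cluster's node list
cluster_nodes = [
    "ip-10-0-75-41.us-west-1.compute.internal",
    "ip-10-0-81-251.us-west-1.compute.internal",
    "ip-10-0-90-234.us-west-1.compute.internal",
    "ip-10-0-91-155.us-west-1.compute.internal",
]

# Every Plan target is reported as missing, so autopilot cannot cordon any of them.
print(missing_plan_targets(plan_nodes, cluster_nodes))
```

With the values from this cluster, all three Plan targets are reported as missing, which matches the "Node not found" error in the log.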

Conclusion

k0smotron should use .status.nodeRef.name of the Machine resource as the node name in the Plan, not its .metadata.name (or whatever is currently used).
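
To illustrate the proposed behaviour, here is a rough sketch in Python (k0smotron itself is written in Go, so this is not its actual code). It assumes the Machine objects are available as plain dictionaries shaped like the CAPI resource; `plan_node_names` is a hypothetical helper, and raising an error for a Machine without a nodeRef is an assumption, not a statement about how k0smotron should handle that case.

```python
def plan_node_names(machines):
    """Collect the node names to put into the Plan from .status.nodeRef.name."""
    names = []
    for machine in machines:
        node_ref = machine.get("status", {}).get("nodeRef") or {}
        node_name = node_ref.get("name")
        if not node_name:
            # Assumption: a Machine without a nodeRef is not ready to be upgraded.
            raise ValueError(
                f"Machine {machine['metadata']['name']} has no nodeRef yet"
            )
        names.append(node_name)
    return names


# Control plane Machines from this cluster, reduced to the relevant fields.
machines = [
    {"metadata": {"name": "aws-cl-1-cp-0"},
     "status": {"nodeRef": {"name": "ip-10-0-90-234.us-west-1.compute.internal"}}},
    {"metadata": {"name": "aws-cl-1-cp-1"},
     "status": {"nodeRef": {"name": "ip-10-0-81-251.us-west-1.compute.internal"}}},
    {"metadata": {"name": "aws-cl-1-cp-2"},
     "status": {"nodeRef": {"name": "ip-10-0-75-41.us-west-1.compute.internal"}}},
]

print(plan_node_names(machines))
```

The resulting names are the ones the cluster actually reports, so the cordoning step would find its nodes.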
