
Check rke2-serving secret to determine controlPlane.Status.Initialized #302

Merged
merged 2 commits into rancher:main from set_control_plane_initialized
May 6, 2024

Conversation

anmazzotti
Contributor

@anmazzotti anmazzotti commented Apr 19, 2024

kind/bug

What this PR does / why we need it:

This should fix the behavior around setting the controlPlane.Status.Initialized flag.
As per documentation, this flag is:

a boolean field that is true when the target cluster has completed initialization such that at least once, the target's control plane has been contactable.

The implementation is in line with the Kubeadm provider: instead of checking whether the kubeadm-config ConfigMap was uploaded to the cluster, we check whether the rke2-serving secret was.
This should be equivalent, and is a good enough marker that the control plane was indeed initialized.
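For reference, here is a minimal sketch of the kind of check described above, assuming the rke2-serving secret lives in kube-system on the workload cluster; the function and variable names are illustrative, not the PR's actual code:

// Sketch only: decide controlPlane.Status.Initialized by checking whether
// the rke2-serving secret exists on the workload cluster, mirroring how the
// kubeadm provider checks for the kubeadm-config ConfigMap.
package controlplane

import (
    "context"

    corev1 "k8s.io/api/core/v1"
    apierrors "k8s.io/apimachinery/pkg/api/errors"
    "sigs.k8s.io/controller-runtime/pkg/client"
)

// hasControlPlaneInitialized is hypothetical; the real check lives in the
// RKE2 control plane controller and uses its workload cluster client.
func hasControlPlaneInitialized(ctx context.Context, workloadClient client.Client) (bool, error) {
    secret := &corev1.Secret{}
    key := client.ObjectKey{Namespace: "kube-system", Name: "rke2-serving"}
    if err := workloadClient.Get(ctx, key, secret); err != nil {
        if apierrors.IsNotFound(err) {
            // Secret not uploaded yet: the control plane has not initialized.
            return false, nil
        }
        return false, err
    }
    return true, nil
}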

Which issue(s) this PR fixes:
This should fix the RKE2 control plane provider deadlocking when the cloudController Kubernetes component is disabled:

  serverConfig:
    disableComponents:
      kubernetesComponents:
        - cloudController

Disabling this component prevents RKE2 from applying node.Spec.ProviderID automatically. CAPI infrastructure providers should manage it instead, but by contract they need the ControlPlaneInitialized condition to be true before attempting to do so. This is the cause of the deadlock; a sketch of that contract follows.
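As a minimal illustration of that contract (not code from this repository; types come from sigs.k8s.io/cluster-api), an infrastructure provider would typically gate ProviderID reconciliation like this:

// Sketch only: an infra provider waits for ControlPlaneInitialized before
// patching node.Spec.ProviderID on the workload cluster's nodes.
package infraprovider

import (
    "context"

    clusterv1 "sigs.k8s.io/cluster-api/api/v1beta1"
    "sigs.k8s.io/cluster-api/util/conditions"
    ctrl "sigs.k8s.io/controller-runtime"
)

func reconcileProviderID(ctx context.Context, cluster *clusterv1.Cluster) (ctrl.Result, error) {
    if !conditions.IsTrue(cluster, clusterv1.ControlPlaneInitializedCondition) {
        // Requeue until the condition is true. Before this PR, the RKE2 provider
        // never set it when cloudController was disabled, hence the deadlock.
        return ctrl.Result{Requeue: true}, nil
    }

    // ...look up the node and set node.Spec.ProviderID here...
    return ctrl.Result{}, nil
}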

Checklist:

  • squashed commits into logical changes
  • includes documentation
  • adds unit tests
  • adds or updates e2e tests

@anmazzotti anmazzotti force-pushed the set_control_plane_initialized branch from ee213ba to 785e80a on April 19, 2024 12:46
@anmazzotti
Contributor Author

Built and tested locally; I can see it working with cloudController disabled and the ProviderID set as expected:

NAME                       CLUSTER   NODENAME                                 PROVIDERID                                                   PHASE     AGE     VERSION
rke2-control-plane-l9fk5   rke2      m-651ddd3f-d4f4-48dc-919c-118d3e16c2c9   elemental://default/m-651ddd3f-d4f4-48dc-919c-118d3e16c2c9   Running   5m51s   v1.28.7
rke2-md-0-hkx5q-xxqql      rke2      m-8cc9f136-6849-4d44-89c3-8c26fa29b337   elemental://default/m-8cc9f136-6849-4d44-89c3-8c26fa29b337   Running   5m52s   v1.28.7+rke2r1

@anmazzotti anmazzotti marked this pull request as ready for review April 19, 2024 14:44
Signed-off-by: Andrea Mazzotti <andrea.mazzotti@suse.com>
@anmazzotti anmazzotti force-pushed the set_control_plane_initialized branch from 785e80a to 8b643b3 on April 19, 2024 14:52
@alexander-demicev alexander-demicev added the kind/bug Something isn't working label Apr 22, 2024
@anmazzotti
Contributor Author

anmazzotti commented Apr 22, 2024

Still looking at the failing test.
It seems to be a rollout issue: a new Machine was created and is in the Provisioning phase, but the underlying DockerMachine failed to apply the bootstrap.

status:
  conditions:
  - lastTransitionTime: 2024-04-22T08:41:07Z
    message: 1 of 2 completed
    reason: BootstrapFailed
    severity: Warning
    status: 'False'
    type: Ready
  - lastTransitionTime: 2024-04-22T08:41:07Z
    message: Repeating bootstrap
    reason: BootstrapFailed
    severity: Warning
    status: 'False'
    type: BootstrapExecSucceeded
  - lastTransitionTime: 2024-04-22T08:26:07Z
    status: 'True'
    type: ContainerProvisioned

On the failing container itself I can see a lot of:

Apr 22 09:04:58.039538 caprke2-e2e-5ioxg7-md-0-rvxmc-2nj9n rke2[264]: time="2024-04-22T09:04:58Z" level=info msg="Waiting to retrieve agent configuration; server is not ready: failed to get CA certs: Get \"https://127.0.0.1:6444/cacerts\": read

The test runs on a cluster with 1 control-plane node and 3 worker nodes.
The Kubernetes version is bumped on the control plane and the machine deployment at the same time. The control plane rollout succeeds and the new control plane Docker machine is able to bootstrap, but this new pod now has a different address (172.18.0.8) than the one used as the controlPlaneEndpoint (172.18.0.3), so I guess that once the rolled-out control plane pod disappears, all subsequent nodes will fail to contact the endpoint.

Signed-off-by: Andrea Mazzotti <andrea.mazzotti@suse.com>
@mbologna mbologna merged commit 7f2096e into rancher:main May 6, 2024
7 checks passed