Controller uncordons node that it did not cordon #341

Closed
disconn3ct opened this issue Dec 9, 2024 · 13 comments

@disconn3ct

Version

rancher/system-upgrade-controller:v0.14.2@sha256:3cdbfdd90f814702cefb832fc4bdb09ea93865a4d06c6bafd019d1dc6a9f34c9
Platform/Architecture

Linux Arm64
Describe the bug

When a node is externally cordoned, the upgrade controller will uncordon it after completing the upgrade. This happens even if there is no upgrade, such as on a brand new node.

To Reproduce

Create a new node and cordon it manually. Once the update job runs, the node will be uncordoned even if nothing is changed.
Expected behavior

The controller should only remove cordons that it created.
Actual behavior

The controller uncordons anything that passes the upgrade job.
Additional context

@brandond
Member

brandond commented Dec 9, 2024

If you drain/cordon a node using the SUC, it will uncordon it at the end of the upgrade. It does not have a way to determine if it was the one that cordoned it or not.

If you don't want this, don't use the drain/cordon options on your plan, and cordon/uncordon your nodes some other way - either manually, or using a custom plan running a manual command.

This happens even if there is no upgrade, such as on a brand new node.

The plan is executed on all nodes regardless of version. The SUC doesn't know whether a node actually needs upgrading; it just knows whether the node has the plan hash annotation or not. If the node doesn't need upgrading, the job will be a no-op and will simply run successfully without replacing the binary.
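
For illustration, a minimal sketch of that check, assuming the hash is recorded in a node label of the form plan.upgrade.cattle.io/<plan-name> (the key format here is an assumption; the controller's real bookkeeping is more involved):

    package example

    import (
    	corev1 "k8s.io/api/core/v1"
    )

    // nodeNeedsJob reports whether a plan job should run on the node: true
    // whenever the recorded plan hash is missing or stale. Nothing here looks
    // at whether the node is cordoned or what version it is actually running.
    func nodeNeedsJob(node *corev1.Node, planName, planHash string) bool {
    	return node.Labels["plan.upgrade.cattle.io/"+planName] != planHash
    }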

@brandond brandond closed this as completed Dec 9, 2024
@disconn3ct
Author

disconn3ct commented Dec 9, 2024

Am I understanding correctly that this controller will uncordon any node in the cluster, at random, and that this is by design?
There are no guardrails to track which nodes it is taking action on?

Edit:

If you drain/cordon a node using the SUC

This is in reference to an external cordon, in place prior to the SUC executing at all.

@brandond
Member

brandond commented Dec 9, 2024

Am I understanding that this controller will uncordon any node in the cluster, at random

No, not at random. It will cordon/uncordon it when the plan job executes on the node. The plan will be executed on the node if it lacks a plan hash label that matches the current plan. This is covered in the README.

This is in reference to an external cordon, in place prior to the SUC executing at all.

Maybe I can say this differently. If you use the drain/cordon option in your plan, the node will be drained/cordoned when the upgrade job runs against that node (even if it is already cordoned), and it will be uncordoned at the end of the upgrade job. The ONLY thing the SUC knows about the node is whether or not it has an up-to-date plan hash label; it does NOT know whether the node actually needs to be upgraded.

@disconn3ct
Author

disconn3ct commented Dec 9, 2024

even if it is already cordoned

That is where I think the bug is. It should not operate on nodes that are cordoned unless the administrator deliberately configures that. (Or, inverted for backwards compatibility: there should be a flag to obey an existing cordon.)

For example, when Kured reboots a node it annotates and cordons it. Unless I am missing something, that is the best-practices way of preventing other tasks from interfering. It is unexpected at best to ignore the cordon and do something that also requires the safety of a cordon. (And I maintain that an application that meddles with external cordons is not behaving appropriately.)

@brandond
Member

brandond commented Dec 9, 2024

There is no concept of an "external" or "internal" cordon, and no record of which controller "owns" the cordoning. The job just knows whether the plan has cordon or drain set, and if so it takes the appropriate actions to drain/cordon before and uncordon after. You don't HAVE to use these plan options; if you don't like how they work, don't use them.
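
For what it's worth, a rough Go paraphrase of that decision, assuming the upgrade.cattle.io/v1 Plan types (treat the field names as a sketch, not the controller's actual code path):

    package example

    import (
    	upgradeapiv1 "github.com/rancher/system-upgrade-controller/pkg/apis/upgrade.cattle.io/v1"
    )

    // wantsCordon: either plan option causes the job to get a cordon/drain
    // initContainer before the upgrade and an uncordon after it, with no
    // notion of who cordoned the node first.
    func wantsCordon(plan *upgradeapiv1.Plan) bool {
    	return plan.Spec.Cordon || plan.Spec.Drain != nil
    }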

@disconn3ct
Author

This isn't about "like". When an administrator enables cordoning, that is explicitly telling the SUC that cordoning is important in the environment. There is a step where the SUC cordons a node. If that node is already cordoned prior to that, the SUC should not proceed. (With the usual "unless overridden blah blah".)

If the node is already cordoned when the SUC goes to apply a cordon, then either an external actor (anything that is not the SUC) cordoned it, or something went wrong during the SUC's own activity. In neither case is blindly proceeding a good answer.

@brandond
Member

brandond commented Dec 9, 2024

There is a step where the SUC cordons a node. If that node is already cordoned prior to that, the SUC should not proceed.

I don't agree. If you've ever looked at the k3s or rke2 upgrade job or followed its execution, you'll notice that it frequently runs twice:

  1. The first time, the initContainer runs to drain/cordon the node, and the primary container replaces the binary and kills the process to trigger a restart. At this point the job pod MAY fail, due to the local node being restarted while it was running.
  2. After the service is restarted into the new version, the job runs again. At this point the binary has already been replaced, and the primary container exits successfully without making any changes.
  3. Once the job completes successfully, the node is uncordoned by the controller.

If you want to consider an already-cordoned node an error during the cordon step executed by the initContainer, you can set the controller pod's SYSTEM_UPGRADE_JOB_KUBECTL_IMAGE env var to an image that implements this logic. By default, it is perfectly valid to run kubectl cordon on a node that is already cordoned, and the kubectl command exits with a 0 return code.

    KubectlImage = func(defaultValue string) string {
        if str := os.Getenv("SYSTEM_UPGRADE_JOB_KUBECTL_IMAGE"); str != "" {
            return str
        }
        return defaultValue
    }(defaultKubectlImage)

    podTemplate.Spec.InitContainers = append(podTemplate.Spec.InitContainers,
        upgradectr.New("cordon", upgradeapiv1.ContainerSpec{
            Image: KubectlImage,
            Args:  []string{"cordon", node.Name},
        },
            upgradectr.WithSecrets(plan.Spec.Secrets),
            upgradectr.WithPlanEnvironment(plan.Name, plan.Status),
            upgradectr.WithImagePullPolicy(ImagePullPolicy),
            upgradectr.WithVolumes(plan.Spec.Upgrade.Volumes),
        ),
    )
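
For example, a hedged sketch of such an image's entrypoint in Go (the strict-cordon tool is hypothetical, not part of the SUC; it receives the same "cordon <node>" args as the kubectl initContainer above, but refuses to proceed if the node is already cordoned):

    package main

    import (
    	"context"
    	"fmt"
    	"os"

    	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    	"k8s.io/client-go/kubernetes"
    	"k8s.io/client-go/rest"
    )

    func main() {
    	if len(os.Args) != 3 || os.Args[1] != "cordon" {
    		fmt.Fprintln(os.Stderr, "usage: strict-kubectl cordon <node>")
    		os.Exit(2)
    	}
    	nodeName := os.Args[2]

    	cfg, err := rest.InClusterConfig()
    	if err != nil {
    		panic(err)
    	}
    	client := kubernetes.NewForConfigOrDie(cfg)

    	node, err := client.CoreV1().Nodes().Get(context.Background(), nodeName, metav1.GetOptions{})
    	if err != nil {
    		panic(err)
    	}
    	// Unlike kubectl cordon, treat an existing cordon as a hard failure so
    	// the initContainer (and therefore the upgrade job) stops here.
    	if node.Spec.Unschedulable {
    		fmt.Fprintf(os.Stderr, "node %s is already cordoned; refusing to proceed\n", nodeName)
    		os.Exit(1)
    	}
    	node.Spec.Unschedulable = true
    	if _, err := client.CoreV1().Nodes().Update(context.Background(), node, metav1.UpdateOptions{}); err != nil {
    		panic(err)
    	}
    }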

@disconn3ct
Author

followed its execution

A bug in the design is still a bug.

perfectly valid to run kubectl cordon on a node that is already cordoned

It is perfectly valid to do lots of harmful, strange, or just ineffective things. In many cases, "making sure the door is locked" does not cause harm. Altering the cluster's scheduling state is not one of those cases. (I am curious about "by default", though. Is there a case where that API call fails?)

kubectl command exits with a 0

If the API threw an error, this bug wouldn't exist to begin with, so I'm not sure what bearing kubectl has on it.

Since you brought up other tools: Kured comes straight from the CNCF, uses proper locking around cordons, and behaves cautiously in the face of the unexpected. And it is arguably less impactful than the SUC, since it performs only a subset of the impactful activities.

In the case that multiple operators are attempting to cordon a node and make dangerous changes, one of which is SUC, how should they behave? How are other operators expected to code around this tool?

@kfox1111

kfox1111 commented Dec 9, 2024

@brandond
Member

brandond commented Dec 9, 2024

If the API threw an error, this bug wouldn't exist to begin with, so I'm not sure what bearing kubectl has on it.

What are you talking about? What API would throw an error? I showed you that we use kubectl to do the drain/cordon; that's why it has bearing on this discussion.

uses proper locking around cordons

What exactly is "locking" around cordons? There is a single field for cordoning (Node.Spec.Unschedulable) and no standard way to indicate why it is unschedulable in a way that other controllers or CLI tools will respect. How do you expect us to "lock" this field exactly?

@brandond
Member

brandond commented Dec 10, 2024

I am going to leave this closed pending resolution of the discussion at kubernetes/enhancements#4212 - ref: https://groups.google.com/g/kubernetes-sig-architecture/c/Tb_3oDMAHrg/m/tNHNooWDAAAJ

I really like the idea of node maintenance being declarative; there's no solid convention for different tools to mark when they are draining a node, let alone give people an idea about why. An API would solve that.

If that ever sees any progress we can reevaluate this.

@disconn3ct in advance of there being an actual standard or API for this, if you can point me at some non-project-specific method that kured is using to "lock" the cordoned status, we can see if it is usable for this controller.

@disconn3ct
Author

To start, I'm not going to continue responding to "we did it this way already". The bug is in the design.

When an administrator sets the SUC cordon flag, they are explicitly stating that cordoning is important in the environment. Most controllers will behave politely and obey the cordon. If cordons were unimportant, the administrator would not have turned on the flag to begin with.

If something has already cordoned the node, it is reserved. There doesn't need to be anything else. "I went to put the lockout tag up but it was already locked. I know nothing about who locked it so I am not continuing." Throw a failed job or something if you need to brute-force your way to the administrator's attention.

Kured sets an annotation that it then reads to determine whether the cordon is owned by Kured (along with other, unrelated checks such as pod presence and "lock" annotation expiration).

Please explain how you expect other "dangerous" controllers to work when the SUC ignores existing cordons. How do those controllers prevent the SUC from yanking the rug out during their operations? Kured avoids the SUC by obeying the cordon. (It can also look for specific pods, alerts, etc., but that is beyond this scope.) After a reboot (step 2 above), Kured checks its annotations before undoing a cordon.

  1. Node is selected for activity
  2. Cordon is verified off (plus any other configured checks)
  3. Cordon node
  4. Annotate node with Kured-specific information about the cordon
  5. Do stuff (in this case, reboot)
  6. "Wake up" after reboot
  7. Check annotation to ensure the cordon belongs to Kured (see the sketch after this list)
  8. Uncordon node
  9. Update annotation (to prevent a repeat of 7)
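
A hedged sketch of the ownership check in steps 7–9, using client-go and a hypothetical annotation key (Kured's real keys and logic differ):

    package example

    import (
    	"context"

    	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    	"k8s.io/client-go/kubernetes"
    )

    // ownerAnnotation is a hypothetical key; Kured's actual annotations differ.
    const ownerAnnotation = "example.com/cordon-owner"

    // uncordonIfOwned only lifts the cordon when the annotation says we set it
    // (step 7), then uncordons (step 8) and clears the annotation (step 9).
    func uncordonIfOwned(ctx context.Context, client kubernetes.Interface, nodeName, owner string) error {
    	node, err := client.CoreV1().Nodes().Get(ctx, nodeName, metav1.GetOptions{})
    	if err != nil {
    		return err
    	}
    	if node.Annotations[ownerAnnotation] != owner {
    		return nil // someone else's cordon; leave it alone
    	}
    	node.Spec.Unschedulable = false
    	delete(node.Annotations, ownerAnnotation)
    	_, err = client.CoreV1().Nodes().Update(ctx, node, metav1.UpdateOptions{})
    	return err
    }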

The only race between well-behaved apps is between steps 2 and 3: multiple apps could each believe they own the cordon if their timing aligned. Even so, that is an improvement on the existing design, where the entire operation is at risk of the SUC activating in the middle and downing the node.

Have you glanced at the kured docs? The intro describes the basic flow:

  • Watches for the presence of a reboot sentinel file, e.g. /var/run/reboot-required, or the successful run of a sentinel command
  • Utilises a lock in the API server to ensure only one node reboots at a time
  • Optionally defers reboots in the presence of active Prometheus alerts or selected pods
  • Cordons & drains worker nodes before reboot, uncordoning them after

(FYI, if you want to continue being aggressive about the word "lock", please take it up with the CNCF. I used their usage: they refer to both a Kured-controller concurrency lock and the Kured-specific annotations as locks.)

@kfox1111

The 2/3 race can be dealt with via an atomic update: grab the resource version at step 2, then include it when patching at step 3. If step 3 succeeds, there was no intermediate change; if it fails, someone else got there first and you should restart.

Also, steps 3 and 4 should probably be done in the same commit.

Step 5 should also include the drain.
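
A minimal sketch of that approach with client-go, assuming the same hypothetical ownership annotation as above: the Update carries the resourceVersion observed at step 2, so a concurrent writer produces a Conflict and we retry from the top, and the cordon (step 3) and annotation (step 4) land in a single commit.

    package example

    import (
    	"context"
    	"fmt"

    	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    	"k8s.io/client-go/kubernetes"
    	"k8s.io/client-go/util/retry"
    )

    func cordonAtomically(ctx context.Context, client kubernetes.Interface, nodeName, owner string) error {
    	return retry.RetryOnConflict(retry.DefaultRetry, func() error {
    		// Step 2: read the node; this captures its resourceVersion.
    		node, err := client.CoreV1().Nodes().Get(ctx, nodeName, metav1.GetOptions{})
    		if err != nil {
    			return err
    		}
    		if node.Spec.Unschedulable {
    			return fmt.Errorf("node %s is already cordoned", nodeName)
    		}
    		// Steps 3 and 4 in one commit: cordon and record ownership.
    		node.Spec.Unschedulable = true
    		if node.Annotations == nil {
    			node.Annotations = map[string]string{}
    		}
    		node.Annotations["example.com/cordon-owner"] = owner // hypothetical key
    		// The Update carries the resourceVersion from the Get above; if
    		// anything modified the node in between, the API server returns a
    		// Conflict and RetryOnConflict restarts from the Get.
    		_, err = client.CoreV1().Nodes().Update(ctx, node, metav1.UpdateOptions{})
    		return err
    	})
    }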
