📖 Add In-place updates proposal #11029
Conversation
thanks for the write-up.
i left some comments, but i did not go into a detailed review of the controller interaction (diagrams) part.
An External Update Extension implementing custom update strategies will report the subset of changes it knows how to perform. Cluster API will orchestrate the different extensions, polling the update progress from them.
If the totality of the required changes cannot be covered by the defined extensions, Cluster API will allow falling back to the current behavior (rolling update).
i think this might be only what some users want. IMO, if an in-place update fails, it should fail and give the signal for it. there could be a "fallback" option with default value "false", but it also opens up some questions - what if the external update tampered with objects in a way that the fallback is no longer possible? i think that in-place upgrades should be a "hard toggle", i.e. it's either replace or in-place. no fallbacks from CAPI's perspective.
The logic could also use the fallback scenario in case of a timeout or some general condition. It might not scale well with multiple upgraders, but having options here would seem beneficial.
Since the changes are constrained to a single machine, machine replacement should still work?
what if the external update tampered with objects in a way that the fallback is no longer possible
You mean there is (or there will be) a case that an external update can handle but a rollout update can't? If that happens, we can introduce some verification logic to determine if it can fall back.
I'd be in favor of having the possibility to disable the fallback to rollout updates. In some cases, users would want only certain fields to be handled in-place, for example instance tags; if any other fields were changed, it should be OK to do a rollout update.
@neolit123 A couple of clarifications:
- The fallback strategy is not meant for the scenario where the in-place update starts and fails. In this case, the update will remain in a "failed" state until either the user manually intervenes or remediation (if configured) kicks in and deletes the failed machine. The fallback strategy is meant for when the external updaters cannot handle the desired update. In other words, when capi detects the need for an update, it queries the external updaters and decides to either start an in-place update or a rolling update (fallback strategy). But once it makes that decision and the update starts, it doesn't switch strategies.
- We were thinking that the fallback strategy would be optional. TBD if opt-in or opt-out, pending the discussion on the API changes.
As this proposal is an output of the In-place updates Feature Group, ensuring that the rollout extension allows the implementation of in-place rollout strategies is considered a non-negotiable goal of this effort.
Please note that the practical consequence of focusing on in-place rollout strategies is that the possibility of implementing different types of custom rollout strategies, even if technically possible, won’t be validated in this first iteration (future goal).
by 'validated', do you mean something CAPI will maintain e2e tests for?
i would think there could be some community owned e2e tests for this.
@neolit123 can you take a look at the "Test Plan" section at the end of the proposal? The initial plan was to have it in CAPI CI.
What this paragraph tries to say is that although the concept of "external updater" theoretically allows implementing different types of update strategies (other than in-place), our focus here is to ensure that it can be used to implement in-place updates, and that's what we will validate.
### Non-Goals
- To provide rollbacks in case of an in-place update failure. Failed updates need to be fixed manually by the user on the machine or by replacing the machine.
to the earlier points, if in-place fails, how would the controllers know to leave it to the user for a manual fix vs rolling out the machine?
Controllers will never roll out the machine in case of an in-place update failure. At most, MHC might mark the machine for remediation. But that's a separate process.
To @neolit123's point, this should be configurable: not everyone will want to fall back.
Yes, the idea is for the update fallback strategy to be optional, as MHC remediation already is.
(quoted diff context: tail of the mermaid sequence diagram, ending with `mach->>apiserver: Mark Machine as updated`)
i think the diagram is missing the feedback signal from the external updater to the CAPI controllers - whether the update has passed and what the follow-up is for them?
Yeah, that's correct. This is a high level flow that simplifies certain things. The idea is to help get a high level understanding of the flow with subsequent sections digging into the details of each part of the flow.
If this set is reduced to zero, then CAPI will determine that the update can be performed using the external strategy. CAPI will define the update plan as a list of sequential external updaters in a particular order and proceed to execute it. The update plan will be stored in the Machine object as an array of strings (the names of the selected external updaters).
If after iterating over all external updaters the remaining set still contains uncovered changes, CAPI will determine the desired state cannot be reached through external updaters. If a fallback rolling update strategy has been configured (this is optional), CAPI will replace the machines. If no fallback strategy is configured, we will surface the issue in the resource status. Machines will remain unchanged and the desired state won't be reached unless remediated by the user. Depending on the scenario, users can: amend the desired state to something that the registered updaters can cover, register additional updaters capable of handling the desired changes, or simply enable the fallback strategy.
How can the order of external upgraders be defined? There will be implicit requirements which will make them dependent on each other.
Since the idea is to iterate over an array of upgraders, this should have support for multiple iterations, and a more clever mechanism than subtraction. One iteration will not be enough to mark the desired state unreachable.
With the current proposal, updaters need to be independent in order to be scheduled in the same upgrade plan. Updaters just look at the set of required changes and tell capi which subset of changes they can take care of. And they need to be capable of updating those fields regardless of how many other updaters are scheduled and whether they run before or after them.
If for some reason an updater needs certain fields to be updated first before being able to execute its update, then two update plans will be needed, hence the change would need to be performed by the user in two phases.
We could (probably in future iterations) add a "priority" property to the updaters that would help order updaters when they have overlapping functions. However this would be a global priority and not relative between updaters.
Now all that said, this is what we are proposing, which might not cover all use cases. Do you have a particular use case where order matters and updaters must be dependent on each other?
Both `KCP` and `MachineDeployment` controllers follow a similar pattern around updates: they first detect if an update is required and then, based on the configured strategy, follow the appropriate update logic (note that today there is only one valid strategy, `RollingUpdate`).
With the `ExternalUpdate` strategy, CAPI controllers will compute the set of desired changes and iterate over the registered external updaters, requesting through the Runtime Hook the set of changes each updater can handle. The changes supported by an updater can be the complete set of desired changes, a subset of them, or an empty set, signaling it cannot handle any of the desired changes.
If we're falling back to rolling update, to @neolit123's point, it doesn't make sense to me that ExternalUpdate is a rollout strategy on its own; rather, it should be a field, or set of fields, within rolling update that controls its behavior.
Note that technically a rolling update doesn't have to be a replace operation; it can be done in place, so imo it can be expanded.
That's an interesting point. I'm not against representing external updates as a subtype of the rolling update strategy. You are right that with what we are proposing here, CAPI follows a rolling update process, except that it delegates the machine update instead of replacing the machine by itself. But capi still orchestrates the rolling process.
As long as we can represent the fallback as optional, I'm ok with this if folks think it makes more sense.
CAPI expects the `/UpdateMachine` endpoint of an updater to be idempotent: for the same Machine with the same spec, the endpoint can be called any number of times (before and after it completes), and the end result should be the same. CAPI guarantees that once an `/UpdateMachine` endpoint has been called once, it won't change the Machine spec until the update reaches a terminal state.
Once the update completes, the Machine controller will remove the name of the updater that has finished from the list of updaters and will start the next one. If the update fails, this will be reflected in the Machine status.
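To make those two paragraphs more concrete, here is a minimal sketch of the orchestration loop they describe. All names below are assumptions (the proposal leaves the concrete API as a TODO): on every reconcile the controller calls the current updater's idempotent `/UpdateMachine` endpoint, removes it from the plan once it reports success, and surfaces failure without any automatic rollback or fallback.

```go
package update

import "context"

// UpdateStatus is an assumed enum for the state reported by an external updater.
type UpdateStatus string

const (
	StatusInProgress UpdateStatus = "InProgress"
	StatusDone       UpdateStatus = "Done"
	StatusFailed     UpdateStatus = "Failed"
)

// UpdaterClient abstracts the Runtime Hook call to a named external updater.
type UpdaterClient interface {
	UpdateMachine(ctx context.Context, updaterName, machineName string) (UpdateStatus, error)
}

// reconcileUpdatePlan returns the remaining plan and whether the update has terminally failed.
func reconcileUpdatePlan(ctx context.Context, machineName string, plan []string, c UpdaterClient) (remaining []string, failed bool, err error) {
	if len(plan) == 0 {
		return nil, false, nil // all updaters finished; the Machine can be marked as updated
	}
	// Safe to call repeatedly: the endpoint is required to be idempotent, and CAPI
	// guarantees the Machine spec stays unchanged until a terminal state is reached.
	status, err := c.UpdateMachine(ctx, plan[0], machineName)
	if err != nil {
		return plan, false, err // transient error, retry on the next reconcile
	}
	switch status {
	case StatusDone:
		return plan[1:], false, nil // this updater is done; the next one starts on the following reconcile
	case StatusFailed:
		return plan, true, nil // terminal failure: reflect it in the Machine status
	default:
		return plan, false, nil // still in progress, keep polling
	}
}
```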
It sounds like we're tracking state and keeping this state in the Machine controller itself. This is usually a common source of issues given that the state can drift from reality. Have we considered having the set of hooks only ever present on the MachineDeployment object, with the Machine object only containing its status, so that every updater has to be 1) re-entrant and 2) track where it "left off"?
This way, the status can be calculated from scratch at every iteration, rather than relying on sync calls and other means of strict operations.
I don't think I follow. What state are you referring to? The list of updaters to be run?
Answering your other question, yeah, we opted to have the set of hooks at the Machine level because that allows us to reuse the same mechanism for both KCP and MD machines.
Regarding re-entrance for updaters: yeah, that is the idea here (it might need more clarification in the doc). CAPI will keep calling the /UpdateMachine endpoint of an updater until it returns either success or failure. It's up to the updater to track the "update progress". Or maybe I didn't understand your comment correctly?
From my understanding reading the proposal, it sounds like we're building a plan and tracking it in the Machine spec itself, which can be error-prone; I'd suggest instead finding an approach that's ultimately declarative: declare the plan somewhere else and reflect the status of that plan in the Machine status.
Ah I understand now, thanks for the clarification. Yeah this sounds very reasonable, let me give it a thought and I'll come back to it.
* A way to define different rules for Machines undergoing an update. This might involve new fields in the MHC object. We will decouple these API changes from this proposal. For the first implementation of in-place updates, we might decide to just disable remediation for Machines that are undergoing an update.
### API Changes
It'd be great to start the proposal with an example of how we envision the end state to look, from defining the desired state onwards, and to provide an in-depth example with KCP, the kubeadm bootstrap provider, and an example infra provider (like AWS or similar).
Are you suggesting we do this before we define the API changes? or as part of that work?
We purposefully left the API design for later so we can focus the conversation on the core ideas and high level flow and make sure we are aligned there first.
Without seeing the API changes we're proposing, it's generally hard to grasp the high-level concept. I would like to see it from a user/operator perspective:
- How will we set up this feature in YAML?
- What are the required pieces that we need to install?
- Are there any assumptions we're making?
We propose a pluggable update strategy architecture that allows an External Update Extension to handle the update process. The design decouples core CAPI controllers from the specific extension implementation responsible for updating a machine. The External Update Strategy will be configured by reusing the existing strategy field in KCP and MD resources and introducing a new type of strategy called `ExternalUpdate`. This allows us to provide a consistent user experience: the interaction with the CAPI resources is the same as in rolling updates.
This proposal introduces a Lifecycle Hook named `ExternalUpdate` for communication between CAPI and external update implementers. Multiple external updaters can be registered, each of them only covering a subset of machine changes. The CAPI controllers will ask the external updaters what kind of changes they can handle and, based on the response, compose and orchestrate them to achieve the desired state.
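Since the proposal intentionally leaves the API changes as a TODO, the following is only a hypothetical sketch of how the new strategy type could reuse the existing rollout strategy field on KCP/MachineDeployment; every name below is an assumption, not part of the proposal.

```go
package v1beta1

// RolloutStrategyType selects how machines are brought to the desired state.
type RolloutStrategyType string

const (
	// RollingUpdateStrategyType replaces machines (today's only strategy).
	RollingUpdateStrategyType RolloutStrategyType = "RollingUpdate"
	// ExternalUpdateStrategyType delegates machine changes to registered external updaters.
	ExternalUpdateStrategyType RolloutStrategyType = "ExternalUpdate"
)

// RolloutStrategy is a simplified stand-in for the existing strategy field.
type RolloutStrategy struct {
	// Type selects the update strategy.
	Type RolloutStrategyType `json:"type,omitempty"`
	// FallbackRollingUpdate, if true, lets CAPI replace machines when the registered
	// external updaters cannot cover all of the desired changes (optional fallback).
	FallbackRollingUpdate bool `json:"fallbackRollingUpdate,omitempty"`
}
```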
The proposal is missing details on how the external updater logic would work, and on how the "kind of changes they can handle" part is determined. How is that going to work?
I think it'd be good for the proposal to include a reference external updater implementation and to shape it around one common/trivial driving use case, e.g. performing an in-place rolling update of the Kubernetes version for a pool of Nodes. Then we can grasp and discuss design implications for RBAC, drain...
@enxebre In the 'test plan' section we mention a "CAPD Kubeadm Updater", which will be a reference implementation and also used for testing.
What do you mean by "how is that going to work?"? Are you referring to how the external updater knows what the desired changes are? Or how the external updater computes what changes it can perform and what changes it can't?
Trying to give a generic answer here, the external updater will receive something like "current state" and "desired state" for a particular machine (including machine, infra machine and bootstrap) in the `CanUpdateRequest`. Then it will respond with something like an array of fields for those objects (`kubeadmconfig -> ["spec.files", "spec.mounts", "spec.files"]`), which would signal the subset of fields that it can update.
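For illustration only, the request/response shapes described in that answer could look roughly like the sketch below; apart from `CanUpdateRequest`, which is mentioned above, every name is an assumption rather than the proposal's API.

```go
package hooks

import "k8s.io/apimachinery/pkg/runtime"

// ObjectSet bundles the Machine with its linked bootstrap config and infra machine.
type ObjectSet struct {
	Machine         runtime.RawExtension `json:"machine"`
	BootstrapConfig runtime.RawExtension `json:"bootstrapConfig"`
	InfraMachine    runtime.RawExtension `json:"infraMachine"`
}

// CanUpdateRequest carries the current and desired state for a particular machine.
type CanUpdateRequest struct {
	Current ObjectSet `json:"current"`
	Desired ObjectSet `json:"desired"`
}

// CanUpdateResponse lists, per object, the spec paths this updater can change in place,
// e.g. {"kubeadmconfig": ["spec.files", "spec.mounts"]}.
type CanUpdateResponse struct {
	HandledFields map[string][]string `json:"handledFields"`
}
```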
@enxebre
The idea of opening the draft at this stage for review is to get feedback on the core ideas and high-level flow before we invest more time in this direction. Unless you think that a reference implementation is necessary to have these discussions, I would prefer to avoid that work.
That said, I totally get that the lack of detail in certain areas might be making it difficult to have the high-level discussion. If that's the case, we are happy to add that detail wherever needed.
Trying to give a generic answer here, the external updater will receive something like "current state" and "desired state" for a particular machine (including machine, infra machine and bootstrap) in the CanUpdateRequest. Then it will respond with something like an array of fields for those objects (kubeadmconfig -> ["spec.files", "spec.mounts", "spec.files"]), which would signal the subset of fields that it can update.
These details must be part of the proposal. The details of the entire flow, from the MachineDeployment, to the external request, back to the Machine, and reflecting status, are not present, which makes it hard to understand how the technical flow will go and/or to propose alternative solutions.
* More efficient updates (multiple instances) that don't require re-bootstrap. Re-bootstrapping a bare metal machine takes ~10-15 mins on average. Speed matters when you have 100s - 1000s of nodes to upgrade. For a common telco RAN use case, users can have 30000-ish nodes. Depending on the parallelism, that could take days / weeks to upgrade because of the re-bootstrap time.
* Single node cluster without extra hardware available.
* `TODO: looking for more real life usecases here`
can we include certificate rotation in the use case?
That's a great use case. However, I'm not sure if we should add it because what we have in this doc doesn't really solve that problem.
The abstractions/ideas we present can totally be used for cert rotation. However, what we have only covers changes triggered by updates to the KCP/MD specs. If I'm not mistaken, in-place cert rotation would be a separate process, similar to what capi does today, where the expiration date of certs is tracked in the background and handled separately from machine rollouts.
Opinions?
Hey folks 👋 @g-gaston Dropping by from the Flatcar Container Linux project - we're a container-optimised Linux distro; we joined the CNCF a few weeks ago (incubating). We've been driving implementation spikes of in-place OS and Kubernetes updates in ClusterAPI for some time - at the OS level. Your proposal looks great from our point of view.

While progress has been slower in recent months due to project resource constraints, Flatcar has working proof-of-concept implementations for both in-place updating the OS and Kubernetes - independently. Our implementation is near production-ready on the OS level, update activation can be coordinated via kured, and the worker cluster control plane picks up the correct versions. We do lack any signalling to the management cluster, as well as more advanced features like coordinated roll-backs (though this would be easy to implement on the OS level). In theory, our approach to in-place Kubernetes updates is distro-agnostic (given the "mutable sysext" changes in recent versions of systemd, starting with release 256).

We presented our work in a CAPZ office hours call earlier this year: https://youtu.be/Fpn-E9832UQ?feature=shared&t=164 (slide deck: https://drive.google.com/file/d/1MfBQcRvGHsb-xNU3g_MqvY4haNJl-WY2/view). We hope our work can provide some insights that help to further flesh out this proposal. Happy to chat if folks are interested. (CC: @tormath1 for visibility)

EDIT after initial feedback from @neolit123: in-place updates of Kubernetes in CAPI are in "proof of concept" stage. Just using sysexts to ship Kubernetes (with and without CAPI) has been in production on (at least) Flatcar for quite some time. Several CAPI providers (OpenStack, Linode) use sysexts as the preferred mechanism for Flatcar worker nodes.
i don't think i've seen usage of sysext with k8s. its provisioning of image extensions seems like something users can do, but they might as well stick to the vanilla way of using the k8s package registries and employing update scripts for e.g. containerd. the kubeadm upgrade docs just leverage the package manager upgrade way. one concern that i think i have with systemd-sysext is that you still have an intermediate build process for the extension, while the k8s package build process is already done by the k8s release folks.
On Flatcar, sysexts are the preferred way to run Kubernetes. "Packaging" is straightforward - create a filesystem from a subdirectory - and does not require any distro-specific information. The resulting sysext can be used across many distros. I'd argue that the overhead is negligible: download release binaries into a sub-directory and run a single command. Drawbacks of the packaging process are:
Sysexts are already used by the ClusterAPI OpenStack and the Linode providers with Flatcar (though without in-place updates).
If this set is reduced to zero, then CAPI will determine that the update can be performed using the external strategy. CAPI will define the update plan as a list of sequential external updaters in a particular order and proceed to execute it. The update plan will be stored in the Machine object as an array of strings (the names of the selected external updaters).
If after iterating over all external updaters the remaining set still contains uncovered changes, CAPI will determine the desired state cannot be reached through external updaters. If a fallback rolling update strategy has been configured (this is optional), CAPI will replace the machines. If no fallback strategy is configured, we will surface the issue in the resource status. Machines will remain unchanged and the desired state won't be reached unless remediated by the user. Depending on the scenario, users can: amend the desired state to something that the registered updaters can cover, register additional updaters capable of handling the desired changes, or simply enable the fallback strategy.
If after iterating over all external updaters the remaining set still contains uncovered changes
How do we envision this to take place? Diffing each field?
I envision something like this: capi would generate a set of all the fields that are changing for an object by diffing the current state with the desired state. Then, as it iterates over the updaters, it would remove fields from the set. If it finishes iterating over the updaters and there are still fields left in the set, then the update can't be performed in-place.
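A minimal sketch of that subtraction-based planning, with illustrative names only (the proposal does not define this API):

```go
package update

// externalUpdater reports the subset of changed paths a single updater can handle.
type externalUpdater interface {
	Name() string
	CanUpdate(changedPaths map[string]struct{}) []string
}

// planInPlaceUpdate returns the ordered list of updater names to run, or ok=false if
// some changes remain uncovered after consulting every registered updater.
func planInPlaceUpdate(changedPaths map[string]struct{}, updaters []externalUpdater) (plan []string, ok bool) {
	remaining := make(map[string]struct{}, len(changedPaths))
	for p := range changedPaths {
		remaining[p] = struct{}{}
	}
	for _, u := range updaters {
		covered := u.CanUpdate(remaining)
		if len(covered) == 0 {
			continue // this updater cannot help with any of the remaining changes
		}
		plan = append(plan, u.Name())
		for _, p := range covered {
			delete(remaining, p)
		}
		if len(remaining) == 0 {
			return plan, true // every change is covered; execute the in-place plan
		}
	}
	// Uncovered changes remain: fall back to a rolling update if configured,
	// otherwise surface the issue in the resource status and leave machines unchanged.
	return nil, false
}
```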
the kubeadm and kubelet systemd drop-in files (in the official k8s packages) have some distro-specific nuances like Debian vs RedHat paths. are sysexts capable of managing different drop-in files if the target distro is different, perhaps even detecting that automatically?
Sysexts focus on shipping application bits (Kubernetes in the case at hand); configuration is usually supplied by separate means. That said, a complementary image-based configuration mechanism ("confext") exists for `/etc`. Both approaches have their pros and cons; I'd say it depends on the specifics (I'm not very familiar with kubeadm on Debian vs. Red Hat, I'm more of an OS person :) ). But this should by no means be a blocker. (Sorry for the sysext nerd sniping. I think we should stick to the topic of this PR - I merely wanted to raise that we have a working PoC of in-place Kubernetes updates. Happy to discuss Kubernetes sysexts elsewhere)
while the nuances between distros are subtle in the k8s packages, the drop-in files are critical. i won't argue whether they are config or not, but if kubeadm and systemd are used, e.g. without
i think it's a useful POV. perhaps @g-gaston has comments on the sysext topic. although, this proposal is more about the CAPI integration of the in-place upgrade concept.
Shipping this file in a sysext is straightforward. In fact, the kubernetes sysexts we publish in our "sysext bakery" include it.
That's what originally motivated me to speak up: the proposal appears to discuss the control plane "upper half" that our proof-of-concept implementation lacks. As stated, we're OS folks :) And we're very happy to see this getting some traction.
@t-lo thanks for reaching out! Really appreciated. +1 from me to keep the discussion on this PR focused on the first layer. But great to see things are moving for the Flatcar Container Linux project; let's make sure the design work that is happening here does not prevent using Flatcar's in-place upgrade capabilities (but at the same time, we should make sure it could work with other OSes as well, even the ones less "cloud native").
It would also be nice to ensure the process is compatible with, or at least gears well with, talos.dev, which is managed completely by a set of controllers that expose just an API. Useful for single-node, long-lived clusters. As far as I've read, I see no complications for it yet.
Hello folks, We've briefly discussed systemd-sysext and its potential uses for ClusterAPI in the September 25, 2024 ClusterAPI meeting (https://docs.google.com/document/d/1GgFbaYs-H6J5HSQ6a7n4aKpk0nDLE2hgG2NSOM9YIRw/edit#heading=h.s6d5g3hqxxzt). Summarising the points made here so you don't need to watch the recording 😉 . Let's wrap up the sysext discussion in this PR so we can get the focus back to in-place updates. If there's more interest in this technology from ClusterAPI folks I'm happy to have a separate discussion (here: #11227).
What this PR does / why we need it:
Proposal doc for In-place updates written by the In-place updates feature group.
Starting this as a draft to collect early feedback on the main ideas and high level flow. APIs and some other lower level details are left purposefully as TODOs to focus the conversation on the rest of the doc, speed up consensus and avoid rework.
Fixes #9489
/area documentation