OTA-253: Stubbing out preflight from target release#1930
fao89 wants to merge 1 commit into openshift:master
Conversation
[APPROVALNOTIFIER] This PR is NOT APPROVED
/cc JoelSpeed hongkailiu wking
@fao89: This pull request references OTA-253 which is a valid jira issue.
fao89 force-pushed from 5dac841 to f201cd1
/cc
Success criteria:
- Administrators can run `oc adm upgrade --preflight --to=<version>` to check compatibility.
- Preflight results appear in ClusterVersion `status` alongside other conditional update risks.
Is this expressive enough? Have we considered a pre-flight upgrade object that would run and report results as a single object that could be created separately whenever you wanted?
Probably worth a new Alternatives subsection on this, @fao89. We talked about it while discussing #363, although none of the discussions seem to have made it as far as the Markdown floated there for easy reference now. I'm not hugely opposed to having a separate, preflight-specific CRD, but my concerns are:
- It would be one more CRD, in a world with many existing CRDs. And it would only come up very briefly pre-update, so the cost of a long-lived CRD doesn't have much actually-being-used time to amortize over.
- It would be one more place for all our users to have to figure out how to check, and then they'd have to work out how to integrate it with the other update advice we're storing in ClusterVersion status. It's possible. But if we perform the integration in the CVO and report the integrated advice in a single ClusterVersion status, that seems like less work in all our consumers. And I can't think of preflight results that the accept-risks API we're building on in OTA-1544 (#1807) couldn't cover. Maybe that's a failure of my imagination? But if we have preflight situations it can't cover, I'd expect those same situations could show up in conditional update risks, and I'd expect us to want to expand the accept-risks API to cover them. So even if my imagination is failing me, there's still the "make the accept-risks API richer" safety valve. And if we end up deciding that the only way to make the accept-risks API rich enough is to spin it out into a new CRD, 🤷, then that's where we'd be. But I don't see an update risk that would push us into that space yet. Imaginative suggestions welcome :)
desiredUpdate:
  mode: Preflight
  version: 5.2.0
Is there a way I could request multiple pre-flights? Perhaps I want to compare results?
There isn't. Would you? Trying to think about when you might want to...
- If you're trying to pick between X.(Y+1).Z and X.(Y+1).Z', you might think "hmm, I wonder if their preflights are different?". But they're two patch releases in the same stream, so their preflights are identical outside of bugfixes, because we nominally don't backport features into existing X.(Y+1). Maybe the later Z' grew a new preflight check or fixed a bug in an existing check, but in that case, I'd expect the preflight improvement to be either very minor, or for us to have declared a conditional update risk to X.(Y+1).Z to cover for its preflight gap. So I'm not seeing a need to compare between X.(Y+1).Z and X.(Y+1).Z' results.
- If you're trying to pick between X.(Y+1).Z and X.(Y+2).Z' (in a future world where skip-level updates have made that possible), you might have more significant divergence in preflight logic. But X.(Y+1) and X.(Y+2) are going to be fairly different for many reasons (different release note pages, different features, maybe completely new components, etc.), and I personally don't see a problem with asking admins to run checks for one, think them over, and then clear those out to run checks for the other. Having the ability to simultaneously run two sets of prechecks might help the UX here for some customers, but I think the API complexity and the potential to confuse folks into running probably-useless parallel checks for X.(Y+1).Z and X.(Y+1).Z' tips me away from wanting to build an API that would allow that.
Maybe I'm missing a use case though? Or not understanding some wrinkles of the two situations I have tried to think through?
However this conversation works out, it's probably worth capturing in a new Alternatives subsection, @fao89, to give future folks easier access to the trade-offs we considered and why we decided to move whatever way we decide to move.
So I was thinking about the skip level and a future where we have multiple paths for a user to take. For example, if a user were getting towards the end of 5, and had a choice of 5.9 or 6.0, and wanted to be able to see which was right for them, I could imagine in this case that it would be helpful to be able to have the two results side by side, and therefore probably in parallel?
But I guess they can also just go ahead and capture the results of one, then run the next, and save the results at each stage.
> But I guess they can also just go ahead and capture the results of one, then run the next, and save the results at each stage.
Right, this is the approach I'd expect. For example, in a hypothetical world where skip-level exists, and you're running 5.8.z, and you set a *-6.0 channel, I'd expect you to see 6.0.z targets as direct-hop skip-level options. And also any 5.9.z that were old enough to be part of classic update-chains that get you from your current 5.8.z through a 5.9.z' and up to a 6.0.z". But you might not see a new 5.9.z, unless there was a sufficiently new 6.0.z to update to (more on the OpenShift backport policy vs. minor-version update connectivity in OTA-1352). To see that newer 5.9.z, you'd need to set a *-5.9 channel, so the cluster knew you'd rather get as many 5.9 bugfixes picked up as you could, even if that meant temporarily losing access to 6.0 (until newer 6.0.z patches come out). So folks may be manipulating the channel already to compare between the two targets, and launching separate, serial preflights fits that approach conveniently, even if it does mean the customer has a bunch to keep track of as they try to make their comparison analysis.
#### Hypershift / Hosted Control Planes

HyperShift is out of scope for now, as we rush to get something tech-preview for standalone.
What's the rush? When are we trying to get this in for?
We're thinking about skip-level updates, starting with standalone clusters. Possibly for 5.0 -> 5.2. One blocker for that would be some kind of guard for 5.0 -> 5.2 updates, to warn 5.0 cluster admins pre-update about anything they needed to sort out before launching an update. We have Upgradeable, but it does not allow a 5.0 controller to distinguish between "sort before updating to 5.1" and "sort before updating to 5.2". We have conditional updates, but some consumers like the labs update tool don't have access to the cluster to know if a particular update risk applies, so those risks get confusing if we drag them along for too long without some way to either resolve them or get a guard baked into shipped product code. This preflight enhancement is trying to fill that gap, and allow the 5.0 cluster to run 5.2 preflights, to warn about the issues that 5.0 Upgradeable and conditional risks don't cover.
In order to GA 5.0 as a launch point for preflights for standalone clusters, we want to be tech-preview in 4.22. Not all that much dev-time left to get that done, which is why we're in a hurry here. If we miss, maybe we can go from zero to GA in 5.0. Or maybe we miss GAing in 5.0, and 5.0 to 5.2 skip-levels don't happen. Or maybe the required KEP is still alpha, or maybe folks can't convince enough component operators to commit to the amount of API skew or slower API development skip-level would require, or maybe other things. This enhancement (or some alternative) is necessary for skip-level, but certainly not sufficient.
We could do other things for skip-level, like an UpgradeableNextEUS condition. But I don't like the lack of granularity in the current Upgradeable condition. Operators might have several, orthogonal concerns, and a single condition doesn't give them the ability to distinguish. Maybe an UpgradeableNextEUS.* regexp condition? Possible, but then folks are still stuck doing the manual-cred check themselves. So I am trying to get some kind of from-the-target-version preflight in place, because it seems like it neatly addresses those other concerns. And I'm trying to get it in place quickly, because folks are currently talking about 5.0 as the time it needs to be GA on standalone, and I realize that GA features take time. But 🤷, I also realize that tech-preview features take time, and if we're already too late in 4.22 to get this in, the earlier folks tell me, the earlier I can start trying to explain the side-effects to the folks who had been pushing that skip-level timing.
Ok, the context that this needs to be ready in 5.0 to support skip-level makes sense; you might want to clarify that in this statement. I'd also outline what risks we have of ignoring hypershift for now: are we likely to design something here that doesn't work for hypershift later?
> ... are we likely to design something here that doesn't work for hypershift later?
Hopefully not 😅 HyperShift already runs the CVO as a child Deployment, so a similar pattern there seems like it should be possible, although obviously it will depend on HyperShift updating their HostedCluster and HostedControlPlane CRDs to make more room for the new ClusterVersion properties behind this effort. Or for folks to plan out some alternative API for carrying that information, if for some reason folks think lifting up the ClusterVersion properties wouldn't work in HyperShift-land. But if anyone does have concerns about boxing ourselves in vs. HyperShift that we can try to address in this enhancement, I'm happy to chip in trying to respond to those concerns and hopefully we can find a place where folks feel like we're sufficiently safe.
When the target release CVO is invoked with the `preflight --format=preflight-v1-json` argument, it runs preflight checks, and reports the results to the cluster's running CVO via a host-mounted volume (just like the `version-*` Pod FIXME https://github.com/openshift/cluster-version-operator/blob/83243780aed4e0d9c4ebff528e54b918d4170fd3/pkg/cvo/updatepayload.go#L271-L275).
The results will be JSON.
Because they will be [propagated into `conditionalUpdateRisks`](#retrieving-preflight-check-results), we'll use that structure:
Why doesn't it update the status directly?
Two controllers (current CVO and preflight CVO) trying to share a single property-space can cause write conflicts. Sharing is hard. By maintaining a clear preflight-CVO -> current-CVO -> ClusterVersion status pipeline, there's no need to sort out a sharing plan.
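For illustration, here's a minimal sketch of the current-CVO read side of that pipeline. The payload path and the result type are stand-ins I'm inventing here; the enhancement proposes reusing the conditionalUpdateRisks structure rather than defining a new one:

```go
package preflight

import (
	"encoding/json"
	"fmt"
	"os"
)

// PreflightResult is a hypothetical stand-in for an entry in the
// preflight-v1-json payload; the real shape would follow the
// conditionalUpdateRisks structure from openshift/api.
type PreflightResult struct {
	Name    string `json:"name"`
	Message string `json:"message"`
	URL     string `json:"url"`
}

// loadPreflightResults shows how the current CVO might consume what the
// target-release CVO wrote to the shared host-mounted volume, before
// merging it into ClusterVersion status. The path layout is an assumption.
func loadPreflightResults(path string) ([]PreflightResult, error) {
	data, err := os.ReadFile(path)
	if err != nil {
		return nil, fmt.Errorf("reading preflight results: %w", err)
	}
	var results []PreflightResult
	if err := json.Unmarshal(data, &results); err != nil {
		return nil, fmt.Errorf("parsing preflight-v1-json: %w", err)
	}
	return results, nil
}
```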
We have API tooling specifically to deal with this (see applyconfiguration), but perhaps the existing CVO code is too set in its ways to be compatible with this newer pattern?
https://pkg.go.dev/k8s.io/client-go/applyconfigurations says the overall approach is based on server-side apply, and https://pkg.go.dev/github.com/openshift/client-go/config/applyconfigurations/config/v1#ClusterVersionStatusApplyConfiguration.WithConditionalUpdates gives a way to append conditional updates. But I don't understand how either server-side apply or the current OpenShift client-go code could be used to get two controllers to conveniently share a single map/slice. For example, even as a single writer, how would you use the apply-configuration API to distinguish between "append this new conditional update" and "append this new one, but also clear away any old, stale entries that I used to think were relevant"?
Generally the complex scenario is sharing a list. For a list to be shareable between two entities, it must be marked up as +listType=map, and have some sort of unique identifier (or combination of identifiers) that is set as the +listMapKey.
Once you've done this, each writer has the opportunity to write to the list with their own opinion of the content of the list. So writer A (which will be called manager A in managedFields) writes the list with entries
- name: Foo
- name: Bar
Each time manager A writes to this list, they must send their entire opinion. If they omit the Bar entry on the next write, it gets removed; if they write with that list empty, all of their entries get removed. If they want to append, they must write Foo, Bar, and their new entry.
Writer B can come along and write the list as
- name: Baz
- name: Qux
A reader reading the list would then see
- name: Foo
- name: Bar
- name: Baz
- name: Qux
If at any point two managers (A/B) tried to write an object with the same key (name in this example), then a conflict would arise and the write would be rejected. Though writers can force their way through this, saying "no, I absolutely own this opinion."
In the example where we had two controllers writing to that same status but not sharing a field, i.e. one writes to status.foo and the other writes to status.bar: as long as both are well behaved (using apply-type patches) and not the traditional Update approach, you'd see managedFields entries showing each owning the bits they care about. Anyone writing with an Update would be equivalent to forcing ownership of the entire object and would be expected to do client-side merging, as many controllers do today.
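To make that concrete, here is a sketch with the generated apply configurations. The field-manager name and the single entry are illustrative, and this assumes conditionalUpdates were marked up as an SSA-mergeable list, which is exactly the open question in this thread:

```go
package main

import (
	"context"

	applyconfigv1 "github.com/openshift/client-go/config/applyconfigurations/config/v1"
	configclient "github.com/openshift/client-go/config/clientset/versioned"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// applyPreflightRisks sends this manager's entire opinion of its entries
// in the shared list. Omitting a previously-applied entry on the next
// apply removes it, which is how "append new" and "clear stale" both
// fall out of the same call.
func applyPreflightRisks(ctx context.Context, client configclient.Interface) error {
	cv := applyconfigv1.ClusterVersion("version").
		WithStatus(applyconfigv1.ClusterVersionStatus().
			WithConditionalUpdates(
				applyconfigv1.ConditionalUpdate().
					WithRelease(applyconfigv1.Release().WithVersion("5.2.0")),
			))
	// Force claims ownership if another manager holds a conflicting key.
	_, err := client.ConfigV1().ClusterVersions().ApplyStatus(ctx, cv,
		metav1.ApplyOptions{FieldManager: "manager-a", Force: true})
	return err
}
```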
- **Existing alternatives**: Administrators can already create custom ClusterOperators with `Upgradeable=False` conditions for custom blocking logic

## Open Questions [optional]
As a developer on a team that is not CVO, how am I expected to write a preflight check and have CVO execute it?
That is an important question, but one I am hoping to defer until there's more time. This enhancement is focused on the admin<->ClusterVersion<->current-CVO<->target-CVO chain. I agree that we will almost certainly want to extend that chain with target-CVO<->target-component-operator. But that contract only needs to be sorted out in the target release, while the other portions of the chain need to be sorted out and GA in the launching/current release. I'm trying to defer the target-release questions, to give us the smallest portion of the chain we need to sort out and agree on, to try and make this easier to merge, and get the launching-release part shipped more quickly. Then we can come back with more lead time and fill in the details for intra-target-release communication.
In the worst possible case (for me as a CVO maintainer), we fail to come to any consensus on how intra-target-release communication should happen. Then the CVO is on the hook for locally coding all the checks that need to happen. That seems unlikely to me, but either it would be survivable pain, or we would be motivated to find some kind of agreement to reduce the pain. Having the contract be entirely intra-target-release by limiting the inter-release interaction to the CVO gives us maximum flexibility for those future negotiations.
Ok got it, so we are saying that by having a contract CVO to CVO, we can keep that contract fixed over many releases, but the actual CVO to other component contract could flex over time, provided the CVO still knows how to return the data in the format that other CVOs understand?
That makes sense to me, though I am super curious to see how CVO to other components would interact
https://issues.redhat.com/browse/OCPSTRAT-2843

Signed-off-by: Fabricio Aguiar <fabricio.aguiar@gmail.com>
fao89 force-pushed from f201cd1 to 496ddef
Implements preflight update mode that allows checking update compatibility without committing to performing the actual cluster update.

Changes:
- Add optional 'mode' field to Update struct with value "Preflight"
- Gate new field behind ClusterUpdatePreflight feature (TechPreviewNoUpgrade)
- When mode="Preflight", cluster runs compatibility checks only
- When mode omitted, normal update behavior is preserved

Related: openshift/enhancements#1930
Signed-off-by: Fabricio Aguiar <fabricio.aguiar@gmail.com>
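Roughly, per the commit description, the openshift/api change would look something like the following sketch; the type name, constant, and gate-marker spelling are my guesses, not the merged API:

```go
// UpdateMode is a hypothetical name for the new mode field's type.
type UpdateMode string

const (
	// PreflightMode requests compatibility checks for the desired
	// update without performing the update itself.
	PreflightMode UpdateMode = "Preflight"
)

type Update struct {
	// ...existing fields (architecture, version, image, force)...

	// mode, when set to Preflight, makes the cluster run compatibility
	// checks only; when omitted, normal update behavior is preserved.
	// Gated behind the TechPreviewNoUpgrade ClusterUpdatePreflight feature.
	// +openshift:enable:FeatureGate=ClusterUpdatePreflight
	// +optional
	Mode UpdateMode `json:"mode,omitempty"`
}
```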
@fao89: all tests passed!
1. **Micro version upgrades**: Preflight functionality operates at the minor version boundary (4.21 to 4.22), so micro version updates (4.22.1 to 4.22.3) do not affect preflight compatibility.

1. **Minor version upgrades**: Preflight checks can target next-minor versions (5.0 checking 5.1) but may have limited functionality when checking much newer versions that introduce preflight protocol changes.
I'm not all that clear about what this line is saying. The previous line about `Micro version upgrades` is about X.Y.Z -> X.Y.Z' patch updates, and sets them out of scope, while calling out X.Y -> X.(Y+1) updates as in-scope. Maybe we can drop this line?
### Version Skew Considerations

1. **Micro version upgrades**: Preflight functionality operates at the minor version boundary (4.21 to 4.22), so micro version updates (4.22.1 to 4.22.3) do not affect preflight compatibility.
nit: SemVer uses MAJOR.MINOR.PATCH; I'd have expected "patch version updates" here instead of "micro version updates".
* **Flexible execution model**: Support both one-time preflight validation and continuous preflight monitoring for target releases.

Success criteria:
- Administrators can run `oc adm upgrade --preflight --to=<version>` to check compatibility.
Is the <version> parameter restricted to one of the available versions? Similar for <target-version> in other lines (such as https://github.com/openshift/enhancements/pull/1930/changes#diff-8829175143337c6675c8d93640509187ac5ecdd054dad4b6f8e600998eef17b6R912). Is it expected that it can be a version from the not-recommended list?
The preflight Deployment continues running until the administrator either:
- Clears the `desiredUpdate.mode: Preflight` setting
- Initiates an actual update by removing the `mode` field
- Sets a different target version for preflight evaluation
When setting to a different target version to trigger a new preflight check, will the prior preflight check outcome be preserved in status.conditionalUpdateRisks? If so, what is the maximum number of past results that can be retained?
**Health monitoring for preflight operations:**
- **CVO condition monitoring**: `ClusterVersion.status.conditions` includes preflight-specific conditions:
  - `type: PreflightRunning` - indicates active preflight deployment
If a preflight check fails or is still running (i.e., PreflightFailed or PreflightRunning is true), are the check results stored in conditionalUpdateRisks? If they are, do we have a designated field or part of the result that signals whether the output is partial or incomplete?
1. **Clear preflight request:**
   ```bash
   oc patch clusterversion version --type='merge' -p '{"spec":{"desiredUpdate":{"mode":null}}}'
   ```
There is a `--clear` option to cancel a not-yet-started upgrade, so is there a plan to support canceling a preflight check through a subcommand?
- type: Applies
  status: True
  reason: "PreflightValidation"
  message: "Risk identified during preflight check for 5.2.0."
Should we include the source version of the preflight check in the result? This is important because the result is retained after a cluster upgrade (as seen in https://github.com/openshift/enhancements/pull/1930/changes#diff-8829175143337c6675c8d93640509187ac5ecdd054dad4b6f8e600998eef17b6R590).
Considering the following scenario:
- An administrator runs a preflight check for version 5.2.0 on a cluster currently running 5.0.0. The result is integrated into conditionalUpdateRisks.
- The administrator then upgrades the cluster to 5.1.0.
- The historical preflight result (5.0.0 -> 5.2.0) will still be displayed.
If the source version is missing, users reading the 5.2.0 result might incorrectly assume the check was based on the current 5.1.0 version.
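For concreteness, here is one hypothetical shape for such a retained result; the field names are invented for this sketch and are not part of the proposal:

```go
// PreflightRiskRecord sketches how a retained result might carry the
// version the check ran from, so a 5.2.0 result gathered on 5.0.0 is
// not mistaken for one gathered on 5.1.0 after an intervening update.
type PreflightRiskRecord struct {
	// FromVersion is the cluster version at the time the preflight ran.
	FromVersion string `json:"fromVersion"`
	// TargetVersion is the release the preflight evaluated.
	TargetVersion string `json:"targetVersion"`
	// Message describes the identified risk.
	Message string `json:"message"`
}
```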