OTA-253: Stubbing out preflight from target release#1930

Open
fao89 wants to merge 1 commit into openshift:master from fao89:preflight-from-target-release

Conversation

Member

@fao89 fao89 commented Jan 27, 2026

https://issues.redhat.com/browse/OCPSTRAT-2843

rh-pre-commit.version: 2.3.2
rh-pre-commit.check-secrets: ENABLED

Contributor

openshift-ci bot commented Jan 27, 2026

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign pratikmahajan for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

Member Author

fao89 commented Jan 27, 2026

/cc JoelSpeed hongkailiu wking

@fao89 fao89 changed the title Stubbing out preflight from target release OTA-253: Stubbing out preflight from target release Jan 27, 2026
@openshift-ci-robot openshift-ci-robot added the jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. label Jan 27, 2026

openshift-ci-robot commented Jan 27, 2026

@fao89: This pull request references OTA-253 which is a valid jira issue.


In response to this:

https://issues.redhat.com/browse/OCPSTRAT-2843

rh-pre-commit.version: 2.3.2
rh-pre-commit.check-secrets: ENABLED

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@fao89 fao89 force-pushed the preflight-from-target-release branch from 5dac841 to f201cd1 on January 27, 2026 12:00
@DavidHurta
Contributor

/cc

@openshift-ci openshift-ci bot requested a review from DavidHurta January 28, 2026 18:20

Success criteria:
- Administrators can run `oc adm upgrade --preflight --to=<version>` to check compatibility.
- Preflight results appear in ClusterVersion `status` alongside other conditional update risks.
Contributor

Is this expressive enough? Have we considered a pre-flight upgrade object that would run and report results as a single object that could be created separately whenever you wanted?

Member

Probably worth a new Alternatives subsection on this, @fao89 . We talked about it while discussing #363, although none of those discussions seem to have made it into the Markdown floated there, where they'd be easy to reference now. I'm not hugely opposed to having a separate, preflight-specific CRD, but my concerns are:

  • It would be one more CRD, in a world with many existing CRDs. And it would only come up very briefly pre-update, so the cost of a long-lived CRD doesn't have much actually-being-used time to amortize over.
  • It would be one more place for all our users to have to figure out how to check, and then they'd have to work out how to integrate it with the other update advice we're storing in ClusterVersion status. It's possible. But if we perform the integration in the CVO and report the integrated advice in a single ClusterVersion status, that seems like less work in all our consumers. And I can't think of preflight results that the OTA-1544: update: support accepted risks #1807 accept-risks API we're building on couldn't cover. Maybe that's a failure of my imagination? But if we have preflight situations it can't cover, I'd expect those same situations could show up in conditional update risks, and I'd expect us to want to expand the accept-risks API to cover them. So even if my imagination is failing me, there's still the "make the accept-risks API richer" safety valve. And if we end up deciding that the only way to make the accept-risks API rich enough is to spin it out into a new CRD, 🤷, then that's where we'd be. But I don't see an update risk that would push us into that space yet. Imaginative suggestions welcome :)
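For concreteness, today's conditional-update risks already surface through ClusterVersion status and `oc`; a preflight result integrated there would ride the same path. A quick sketch using existing commands (the jsonpath reflects the current `conditionalUpdates` schema):

```bash
# List update targets, including ones whose declared risks apply to this
# cluster (conditional updates are served from ClusterVersion status):
oc adm upgrade --include-not-recommended

# Inspect the raw risk entries directly:
oc get clusterversion version \
  -o jsonpath='{range .status.conditionalUpdates[*]}{.release.version}: {.risks[*].name}{"\n"}{end}'
```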

Comment on lines +162 to +163
desiredUpdate:
mode: Preflight
version: 5.2.0
Contributor

Is there a way I could request multiple pre-flights? Perhaps I want to compare results?

Member

There isn't. Would you? Trying to think about when you might want to...

  • If you're trying to pick between X.(Y+1).Z and X.(Y+1).Z', you might think "hmm, I wonder if their preflights are different?". But since they're two patch releases in the same stream, their preflights should be identical outside of bugfixes, because we nominally don't backport features into an existing X.(Y+1). Maybe the later Z' grew a new preflight check or fixed a bug in an existing check, but in that case, I'd expect the preflight improvement to be either very minor, or for us to have declared a conditional update risk to X.(Y+1).Z to cover for its preflight gap. So I'm not seeing a need to compare between X.(Y+1).Z and X.(Y+1).Z' results.
  • If you're trying to pick between X.(Y+1).Z and X.(Y+2).Z' (in a future world where skip-level updates have made that possible), you might have more significant divergence in preflight logic. But X.(Y+1) and X.(Y+2) are going to be fairly different for many reasons (different release note pages, different features, maybe completely new components, etc.), and I personally don't see a problem with asking admins to run checks for one, think them over, and then clear those out to run checks for the other. Having the ability to simultaneously run two sets of prechecks might help the UX here for some customers, but I think the API complexity and the potential to confuse folks into running probably-useless parallel checks for X.(Y+1).Z and X.(Y+1).Z' tips me away from wanting to build an API that would allow that.

Maybe I'm missing a use case though? Or not understanding some wrinkles of the two situations I have tried to think through?

However this conversation works out, it's probably worth capturing in a new Alternatives subsection, @fao89, to give future folks easier access to the trade-offs we considered and why we decided to move whatever way we decide to move.

Contributor

So I was thinking about the skip level and a future where we have multiple paths for a user to take. For example, if a user were getting towards the end of 5, and had a choice of 5.9 or 6.0, and wanted to be able to see which was right for them, I could imagine in this case that it would be helpful to have the two results side by side, and therefore probably in parallel?

But I guess they can also just go ahead and capture the results of one, then run the next, and save the results at each stage.

Member

@wking wking Feb 2, 2026

But I guess they can also just go ahead and capture the results of one, then run the next, and save the results at each stage.

Right, this is the approach I'd expect. For example, in a hypothetical world where skip-level exists, and you're running 5.8.z, and you set a *-6.0 channel, I'd expect you to see 6.0.z targets as direct-hop skip-level options. And also any 5.9.z that were old enough to be part of classic update-chains that get you from your current 5.8.z through a 5.9.z' and up to a 6.0.z". But you might not see a new 5.9.z, unless there was a sufficiently new 6.0.z to update to (more on the OpenShift backport policy vs. minor-version update connectivity in OTA-1352). To see that newer 5.9.z, you'd need to set a *-5.9 channel, so the cluster knew you'd rather get as many 5.9 bugfixes picked up as you could, even if that meant temporarily losing access to 6.0 (until newer 6.0.z patches come out). So folks may be manipulating the channel already to compare between the two targets, and launching separate, serial preflights fits that approach conveniently, even if it does mean the customer has a bunch to keep track of as they try to make their comparison analysis.
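To sketch that serial workflow concretely (the versions and channels are hypothetical, and `mode: Preflight` plus `status.conditionalUpdateRisks` are the API proposed by this enhancement, not shipped behavior):

```bash
# Point the cluster at the first stream and request a preflight
# (oc adm upgrade channel is an existing subcommand; the mode field is proposed):
oc adm upgrade channel stable-5.9
oc patch clusterversion version --type=merge \
  -p '{"spec":{"desiredUpdate":{"mode":"Preflight","version":"5.9.7"}}}'

# Once the checks settle, save the results and clear the request:
oc get clusterversion version -o jsonpath='{.status.conditionalUpdateRisks}' \
  > preflight-5.9.json
oc patch clusterversion version --type=merge -p '{"spec":{"desiredUpdate":null}}'

# Repeat for the other stream, then compare the saved results:
oc adm upgrade channel stable-6.0
oc patch clusterversion version --type=merge \
  -p '{"spec":{"desiredUpdate":{"mode":"Preflight","version":"6.0.2"}}}'
oc get clusterversion version -o jsonpath='{.status.conditionalUpdateRisks}' \
  > preflight-6.0.json
diff preflight-5.9.json preflight-6.0.json
```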


#### Hypershift / Hosted Control Planes

HyperShift is out of scope for now, as we rush to get something tech-preview for standalone.
Contributor

What's the rush? When are we trying to get this in for?

Member

We're thinking about skip-level updates, starting with standalone clusters. Possibly for 5.0 -> 5.2. One blocker for that would be some kind of guard for 5.0 -> 5.2 updates, to warn 5.0 cluster admins pre-update about anything they needed to sort out before launching an update. We have Upgradeable, but it does not allow a 5.0 controller to distinguish between "sort before updating to 5.1" and "sort before updating to 5.2". We have conditional updates, but some consumers like the labs update tool don't have access to the cluster to know if a particular update risk applies, so those risks get confusing if we drag one along for too long without some way to either resolve it or get a guard baked into shipped product code. This preflight enhancement is trying to fill that gap, and allow the 5.0 cluster to run 5.2 preflights, to warn about the issues that 5.0 Upgradeable and conditional risks don't cover.

In order to GA in 5.0 as a launch point for preflights for standalone clusters, we want to be tech-preview in 4.22. Not all that much dev-time left to get that done, which is why we're in a hurry here. If we miss, maybe we can go from zero to GA in 5.0. Or maybe we miss GAing in 5.0, and 5.0 to 5.2 skip-levels don't happen. Or maybe the required KEP is still alpha, or maybe folks can't convince enough component operators to commit to the amount of API skew or slower API development skip-level would require, or maybe other things. This enhancement (or some alternative) is necessary for skip-level, but certainly not sufficient.

We could do other things for skip-level, like an UpgradeableNextEUS condition. But I don't like the lack-of-granularity in the current Upgradeable condition. Operators might have several, orthogonal concerns, and a single condition doesn't give them the ability to distinguish. Maybe an UpgradeableNextEUS.* regexp condition? Possible, but then folks are still stuck doing the manual-cred check themselves. So I am trying to get some kind of from-the-target-version preflight in place, because it seems like it neatly addresses those other concerns. And I'm trying to get it in place quickly, because folks are currently talking about 5.0 as the time it needs to be GA on standalone, and I realize that GA features take time. But 🤷, I also realize that tech-preview features take time, and if we're already too late in 4.22 to get this in, the earlier folks tell me, the earlier I can start trying to explain the side-effects to the folks who had been pushing that skip-level timing.
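To make the granularity complaint concrete: the only signal an operator can raise today is a single `Upgradeable` condition on its ClusterOperator, which the CVO aggregates into ClusterVersion. One way an admin can inspect those today:

```bash
# Show each ClusterOperator's Upgradeable status (empty means unset);
# a single False blocks minor-version updates with no per-concern granularity.
oc get clusteroperator -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.conditions[?(@.type=="Upgradeable")].status}{"\n"}{end}'
```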

Contributor

Ok, the context that this needs to be ready in 5.0 to support skip-level makes sense; you might want to clarify that in this statement. I'd also outline what risks we have of ignoring HyperShift for now: are we likely to design something here that doesn't work for HyperShift later?

Member

... are we likely to design something here that doesn't work for hypershift later?

Hopefully not 😅 HyperShift already runs the CVO as a child Deployment, so a similar pattern there seems like it should be possible, although obviously it will depend on HyperShift updating their HostedCluster and HostedControlPlane CRDs to make more room for the new ClusterVersion properties behind this effort. Or for folks to plan out some alternative API for carrying that information, if for some reason folks think lifting up the ClusterVersion properties wouldn't work in HyperShift-land. But if anyone does have concerns about boxing ourselves in vs. HyperShift that we can try to address in this enhancement, I'm happy to chip in trying to respond to those concerns and hopefully we can find a place where folks feel like we're sufficiently safe.

Contributor

@enxebre @csrwng Would either of you be able to skim this to provide an initial temperature on concern for the fact we are ignoring HyperShift in the initial iteration?

Comment on lines +211 to +212
When the target release CVO is invoked with the `preflight --format=preflight-v1-json` argument, it runs preflight checks, and reports the results to the cluster's running CVO via a host-mounted volume (just like the `version-*` Pod FIXME https://github.com/openshift/cluster-version-operator/blob/83243780aed4e0d9c4ebff528e54b918d4170fd3/pkg/cvo/updatepayload.go#L271-L275).
The results will be JSON.
Because they will be [propagated into `conditionalUpdateRisks`](#retrieving-preflight-check-results), we'll use that structure:
Contributor

Why doesn't it update the status directly?

Member

Two controllers (current CVO and preflight CVO) trying to share a single property-space can cause write conflicts. Sharing is hard. By maintaining a clear preflight-CVO -> current-CVO -> ClusterVersion status pipeline, there's no need to sort out a sharing plan.

Contributor

We have API tooling specifically to deal with this (see applyconfiguration), but perhaps the existing CVO code is too set as it is to update to be compatible with this newer pattern?

Member

https://pkg.go.dev/k8s.io/client-go/applyconfigurations says the overall approach is based on server-side apply, and https://pkg.go.dev/github.com/openshift/client-go/config/applyconfigurations/config/v1#ClusterVersionStatusApplyConfiguration.WithConditionalUpdates gives a way to append conditional updates. But I don't understand how either server-side apply or the current OpenShift client-go code could be used to get two controllers to conveniently share a single map/slice. For example, even as a single writer, how would you use the apply-configuration API to distinguish between "append this new conditional update" and "append this new one, but also clear away any old, stale entries that I used to think were relevant"?

Contributor

Generally the complex scenario is sharing a list; for a list to be shareable between two entities, it must be marked up as +listType=map, and have some sort of unique identifier (or combination of identifiers) that is set as the +listMapKey.

Once you've done this, each writer has the opportunity to write to the list with their own opinion of the content of the list. So writer A (which will be called manager A in managedFields) writes the list with entries

- name: Foo
- name: Bar

Each time manager A writes to this list, they must send their entire opinion. If they omit the Bar entry on their next write, it gets removed; if they write with that list empty, all of their entries get removed. If they want to append, they must write Foo, Bar, and their new entry.

Writer B can come along and write the list as

- name: Baz
- name: Qux

A reader reading the list would then see

- name: Foo
- name: Bar
- name: Baz
- name: Qux

If at any point two managers (A/B) tried to write an object with the same key (name in this example), then a conflict would arise and the write would be rejected. Though writers can force their way through this, saying "no, I absolutely own this opinion."

In the example where we had two controllers writing to the same status but not sharing a field, i.e. one writes to status.foo and the other writes to status.bar: as long as both are well behaved (using apply-type patches) and not the traditional Update approach, you'd see managedFields entries showing each manager owning the bits they care about. Anyone writing with an Update would effectively be forcing ownership of the entire object, and would be expected to do client-side merging, as many controllers do today.
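A minimal way to see this merging behavior outside the CVO context, using a built-in `listType=map` field (`Service.spec.ports`, keyed by port and protocol) and two field managers:

```bash
# Manager A applies its full opinion of the ports list:
kubectl apply --server-side --field-manager=manager-a -f - <<'EOF'
apiVersion: v1
kind: Service
metadata:
  name: ssa-demo
spec:
  selector:
    app: demo
  ports:
  - name: foo
    port: 8080
    protocol: TCP
EOF

# Manager B applies a different entry; the server merges the two opinions:
kubectl apply --server-side --field-manager=manager-b -f - <<'EOF'
apiVersion: v1
kind: Service
metadata:
  name: ssa-demo
spec:
  ports:
  - name: bar
    port: 9090
    protocol: TCP
EOF

# A reader now sees both entries; managedFields records per-manager ownership.
# If manager B re-applies without its entry, only that entry is removed; if
# both managers claim the same key, the apply is rejected unless forced with
# --force-conflicts.
kubectl get service ssa-demo -o yaml
```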

- **Existing alternatives**: Administrators can already create custom ClusterOperators with `Upgradeable=False` conditions for custom blocking logic


## Open Questions [optional]
Contributor

As a developer on a team that is not CVO, how am I expected to write a preflight check and have CVO execute it?

Member

That is an important question, but one I am hoping to defer until there's more time. This enhancement is focused on the admin<->ClusterVersion<->current-CVO<->target-CVO chain. I agree that we will almost certainly want to extend that chain with target-CVO<->target-component-operator. But that contract only needs to be sorted out in the target release, while the other portions of the chain need to be sorted out and GA in the launching/current release. I'm trying to defer the target-release questions, to give us the smallest portion of the chain we need to sort out and agree on, to try and make this easier to merge, and get the launching-release part shipped more quickly. Then we can come back with more lead time and fill in the details for intra-target-release communication.

In the worst possible case (for me as a CVO maintainer), we fail to come to any consensus on how intra-target-release communication should happen. Then the CVO is on the hook for locally coding all the checks that need to happen. That seems unlikely to me, but either it would be survivable pain, or we would be motivated to find some kind of agreement to reduce the pain. Having the contract be entirely intra-target-release by limiting the inter-release interaction to the CVO gives us maximum flexibility for those future negotiations.

Contributor

Ok got it, so we are saying that by having a contract CVO to CVO, we can keep that contract fixed over many releases, but the actual CVO to other component contract could flex over time, provided the CVO still knows how to return the data in the format that other CVOs understand?

That makes sense to me, though I am super curious to see how CVO to other components would interact

https://issues.redhat.com/browse/OCPSTRAT-2843
Signed-off-by: Fabricio Aguiar <fabricio.aguiar@gmail.com>

rh-pre-commit.version: 2.3.2
rh-pre-commit.check-secrets: ENABLED
@fao89 fao89 force-pushed the preflight-from-target-release branch from f201cd1 to 496ddef on January 30, 2026 13:06
fao89 added a commit to fao89/api that referenced this pull request Jan 30, 2026
Implements preflight update mode that allows checking update compatibility
without committing to performing the actual cluster update.

Changes:
- Add optional 'mode' field to Update struct with value "Preflight"
- Gate new field behind ClusterUpdatePreflight feature (TechPreviewNoUpgrade)
- When mode="Preflight", cluster runs compatibility checks only
- When mode omitted, normal update behavior is preserved

Related: openshift/enhancements#1930
Signed-off-by: Fabricio Aguiar <fabricio.aguiar@gmail.com>

rh-pre-commit.version: 2.3.2
rh-pre-commit.check-secrets: ENABLED
Contributor

openshift-ci bot commented Jan 30, 2026

@fao89: all tests passed!

Full PR test history. Your PR dashboard.


Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.


1. **Micro version upgrades**: Preflight functionality operates at the minor version boundary (4.21 to 4.22), so micro version updates (4.22.1 to 4.22.3) do not affect preflight compatibility.

1. **Minor version upgrades**: Preflight checks can target next-minor versions (5.0 checking 5.1) but may have limited functionality when checking much newer versions that introduce preflight protocol changes.
Member

I'm not all that clear about what this line is saying. The previous line about `Micro version upgrades` is about X.Y.Z -> X.Y.Z' patch updates, and sets them out of scope, while calling out X.Y -> X.(Y+1) updates as in-scope. Maybe we can drop this line?


### Version Skew Considerations

1. **Micro version upgrades**: Preflight functionality operates at the minor version boundary (4.21 to 4.22), so micro version updates (4.22.1 to 4.22.3) do not affect preflight compatibility.
Member

nit: SemVer uses MAJOR.MINOR.PATCH; I'd have expected "patch version updates" here instead of "micro version updates".

* **Flexible execution model**: Support both one-time preflight validation and continuous preflight monitoring for target releases.

Success criteria:
- Administrators can run `oc adm upgrade --preflight --to=<version>` to check compatibility.
@jiajliu jiajliu Feb 5, 2026

Is the `<version>` parameter restricted to one of the available versions, similar to `<target-version>` in other lines (such as https://github.com/openshift/enhancements/pull/1930/changes#diff-8829175143337c6675c8d93640509187ac5ecdd054dad4b6f8e600998eef17b6R912)? Is it expected to allow a version from the not-recommended list?

The preflight Deployment continues running until the administrator either:
- Clears the `desiredUpdate.mode: Preflight` setting
- Initiates an actual update by removing the `mode` field
- Sets a different target version for preflight evaluation

When setting a different target version to trigger a new preflight check, will the prior preflight check outcome be preserved in status.conditionalUpdateRisks? If so, what is the maximum number of past results that can be retained?


**Health monitoring for preflight operations:**
- **CVO condition monitoring**: `ClusterVersion.status.conditions` includes preflight-specific conditions:
- `type: PreflightRunning` - indicates active preflight deployment

If a preflight check fails or is still running (i.e., PreflightFailed or PreflightRunning is true), are the check results stored in conditionalUpdateRisks? If they are, do we have a designated field or part of the result that signals whether the output is partial or incomplete?


1. **Clear preflight request:**
```bash
oc patch clusterversion version --type='merge' -p '{"spec":{"desiredUpdate":{"mode":null}}}'
```

There is a `--clear` subcommand to cancel a not-started upgrade, so is there a plan to support cancelling a preflight check through a subcommand?
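For reference, the existing cancel path this comment mentions looks like the following; whether preflight grows an analogous subcommand is exactly the open question here:

```bash
# Existing behavior: --clear cancels a requested-but-not-started update
# by emptying spec.desiredUpdate.
oc adm upgrade --clear

# The enhancement's proposed preflight cancellation clears only the mode
# field, as quoted above:
oc patch clusterversion version --type='merge' \
  -p '{"spec":{"desiredUpdate":{"mode":null}}}'
```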

- type: Applies
status: True
reason: "PreflightValidation"
message: "Risk identified during preflight check for 5.2.0."

Should we include the source version of the preflight check in the result? This is important because the result is retained after a cluster upgrade (as seen in https://github.com/openshift/enhancements/pull/1930/changes#diff-8829175143337c6675c8d93640509187ac5ecdd054dad4b6f8e600998eef17b6R590).

Considering the following scenario:

  • An administrator runs a preflight check for version 5.2.0 on a cluster currently running 5.0.0. The result is integrated into conditionalUpdateRisks.
  • The administrator then upgrades the cluster to 5.1.0.
  • The historical preflight result (5.0.0 -> 5.2.0) will still be displayed.

If the source version is missing, users reading the 5.2.0 result might incorrectly assume the check was based on the current 5.1.0 version.
