NO-ISSUE: pkg/operator/status: Block ClusterVersion updates until multi-arch transitions complete #4637
7c3fb6a
to
1db0be3
Compare
/cc |
Looks good for the CVO waiting on the MCO to bump
The CVO's logging could be more specific about what it's waiting for:
$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/test-platform-results/pr-logs/pull/openshift_machine-config-operator/4637/pull-ci-openshift-machine-config-operator-master-e2e-aws-ovn-upgrade/1844510399173496832/artifacts/e2e-aws-ovn-upgrade/gather-extra/artifacts/pods/openshift-cluster-version_cluster-version-operator-58fbdfccc7-nw4fq_cluster-version-operator.log | grep 'error running apply for clusteroperator "machine-config"' | tail -n2
E1011 02:22:34.064164 1 task.go:122] error running apply for clusteroperator "machine-config" (762 of 890): Cluster operator machine-config is updating versions
E1011 02:25:48.076463 1 task.go:122] error running apply for clusteroperator "machine-config" (762 of 890): Cluster operator machine-config is updating versions
…ansitions complete

Set a new 'operator-image' entry in the ClusterOperator status.versions manifest, so the cluster-version operator will wait for the in-cluster status to have both [1]:

* The 'operator' value we already declare, so the CVO waits for us to hit the target version.
* The 'operator-image' value I'm adding in this commit, to help distinguish between single-arch and multi-arch targets which share the same version string (but have unique release and MCO pullspecs).

Without this change, the CVO immediately thinks the MCO has completed its update ("'operator' matches 4.13.0, and that's all I've been asked to match!") while the MCO is still working to pivot to the new, multi-arch component. Folks who try to take advantage of the new multi-arch functionality at this point might be surprised to have ClusterVersion claiming the update to multi-arch is complete, even though the MCO is still only part way through that transition, and has not yet positioned itself to point at the multi-arch RHCOS image when new Machines try to come up [2].

[1]: https://github.com/openshift/cluster-version-operator/blob/e546515213c8681ca44c52f178401cd47ad07d11/pkg/cvo/internal/operatorstatus.go#L174-L176
[2]: https://issues.redhat.com/browse/OCPBUGS-10300
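The waiting behavior described in this commit message can be illustrated with a small sketch. This is not the CVO's Go implementation; `pending_versions` is a hypothetical helper, and the `operator` value reported in-cluster is assumed to be `4.18.0-ec.3`. The two pullspecs are the single-arch and multi-arch `operator-image` values from the verification output on this PR. The idea is that every name/version pair in the manifest's status.versions must also appear in the in-cluster status before the ClusterOperator counts as updated.

```python
def pending_versions(manifest_versions, status_versions):
    """Return names of manifest entries the in-cluster status has not reached."""
    reported = {entry["name"]: entry["version"] for entry in status_versions}
    return [
        entry["name"]
        for entry in manifest_versions
        if reported.get(entry["name"]) != entry["version"]
    ]


# Pullspecs copied from the verification output on this PR.
X86_64_IMAGE = (
    "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:"
    "9a2ca9b7808dd922f00fc37100229c08f005357386e4b24628faeef49222b2ea"
)
MULTI_IMAGE = (
    "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:"
    "787a505ca594b0a727549353c503dec9233a9d3c2dcd6b64e3de5f998892a1d5"
)

# Assumed in-cluster status right after requesting the multi-arch transition:
# the MCO still reports the single-arch image.
in_cluster = [
    {"name": "operator", "version": "4.18.0-ec.3"},
    {"name": "operator-image", "version": X86_64_IMAGE},
]

# Multi-arch manifest without the new entry: only 'operator' is compared.
manifest_without_image = [{"name": "operator", "version": "4.18.0-ec.3"}]

# Multi-arch manifest with the new entry added by this change.
manifest_with_image = manifest_without_image + [
    {"name": "operator-image", "version": MULTI_IMAGE},
]

done_too_early = pending_versions(manifest_without_image, in_cluster)
still_waiting = pending_versions(manifest_with_image, in_cluster)
```

With only the `operator` entry nothing is pending, so the update would look complete at once. With the `operator-image` entry, the mismatch keeps the operator in the "updating versions" state until the MCO reports the multi-arch pullspec.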
1db0be3
to
6c309bf
Compare
1db0be3 -> 6c309bf adds the missing wiring to bump |
Pod sandboxing, HyperShift infrastructure, PodImageBuilders, and ValidatingAdmissionPolicy all seem unrelated to my change, so at the moment CI is looking good. But I'm not going to launch a retest while we build consensus around this approach.
/lgtm
Did you want to backport this at all? Or do you want to go with no-jira for this?
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: wking, yuqi-zhang
@wking: This pull request explicitly references no jira issue. In response to this:
@wking: The following tests failed, say
/retest-required |
6708003
into
openshift:master
[ART PR BUILD NOTIFIER] Distgit: ose-machine-config-operator |
Verify https://issues.redhat.com/browse/OTA-960:
$ oc image extract quay.io/openshift-release-dev/ocp-release:4.18.0-ec.3-x86_64 --path /manifests/:. --path /release-manifests/:.
$ cat 0000_80_machine-config_06_clusteroperator.yaml | yq -y '.status.versions[]|select(.name=="operator-image")'
name: operator-image
version: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:9a2ca9b7808dd922f00fc37100229c08f005357386e4b24628faeef49222b2ea
launch 4.18.0-ec.3 aws,amd64
$ oc get co machine-config -o yaml | yq -y '.status.versions[]|select(.name=="operator-image")'
name: operator-image
version: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:9a2ca9b7808dd922f00fc37100229c08f005357386e4b24628faeef49222b2ea
$ oc adm release info -o json | jq -r '.metadata'
{
"kind": "cincinnati-metadata-v0",
"version": "4.18.0-ec.3",
"previous": [
"4.17.0",
"4.17.1",
"4.17.2",
"4.17.3",
"4.18.0-ec.0",
"4.18.0-ec.1",
"4.18.0-ec.2"
]
}
$ oc get clusterversion
NAME VERSION AVAILABLE PROGRESSING SINCE STATUS
version 4.18.0-ec.3 True False 19m Cluster version is 4.18.0-ec.3
$ oc adm upgrade --to-multi-arch
$ oc get clusterversion
NAME VERSION AVAILABLE PROGRESSING SINCE STATUS
version 4.18.0-ec.3 True True 15m Working towards 4.18.0-ec.3: 761 of 890 done (85% complete), waiting on machine-config
$ oc get clusterversion version -o yaml | yq '.status.history'
[
{
"completionTime": "2024-11-06T01:36:57Z",
"image": "quay.io/openshift-release-dev/ocp-release@sha256:79347bc3313a05e374a40fe47de804e23cc14f795ee51699721e459706cfe2c0",
"startedTime": "2024-11-06T01:15:02Z",
"state": "Completed",
"verified": true,
"version": "4.18.0-ec.3"
},
{
"completionTime": "2024-11-06T00:54:58Z",
"image": "registry.build09.ci.openshift.org/ci-ln-1wyzzqb/release@sha256:d2d34aafe0adda79953dd928b946ecbda34673180ee9a80d2ee37c123a0f510c",
"startedTime": "2024-11-06T00:34:12Z",
"state": "Completed",
"verified": false,
"version": "4.18.0-ec.3"
}
]
The upgrade took about 20 minutes, not the "less than 2 minutes" reported in the description of OTA-960, so the bug is fixed. I will let https://issues.redhat.com/browse/OTA-1386 do further verification by creating a cross-arch node.
$ oc adm release info -o json | jq -r '.metadata'
{
"kind": "cincinnati-metadata-v0",
"version": "4.18.0-ec.3",
"previous": [
"4.17.0",
"4.17.1",
"4.17.2",
"4.17.3",
"4.18.0-ec.0",
"4.18.0-ec.1",
"4.18.0-ec.2"
],
"metadata": {
"release.openshift.io/architecture": "multi"
}
}
$ oc get co machine-config -o yaml | yq -y '.status.versions[]|select(.name=="operator-image")'
name: operator-image
version: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:787a505ca594b0a727549353c503dec9233a9d3c2dcd6b64e3de5f998892a1d5
$ oc image extract quay.io/openshift-release-dev/ocp-release:4.18.0-ec.3-multi --path /manifests/:. --path /release-manifests/:.
$ cat 0000_80_machine-config_06_clusteroperator.yaml | yq -y '.status.versions[]|select(.name=="operator-image")'
name: operator-image
version: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:787a505ca594b0a727549353c503dec9233a9d3c2dcd6b64e3de5f998892a1d5
Yeah, that's about consistent with how long it typically takes to update MCO alone in a cluster bot cluster. That's exactly the behavior I'd expect 👍
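The "about 20m" figure can be cross-checked against the startedTime and completionTime values in the `.status.history` output above. The following is a minimal sketch; `update_duration` is a hypothetical helper, and the two entries are trimmed copies of the history shown in the verification comment (the first being the multi-arch transition, the second the original install).

```python
from datetime import datetime

TIME_FORMAT = "%Y-%m-%dT%H:%M:%SZ"


def update_duration(entry):
    """Return completionTime minus startedTime for one history entry."""
    started = datetime.strptime(entry["startedTime"], TIME_FORMAT)
    completed = datetime.strptime(entry["completionTime"], TIME_FORMAT)
    return completed - started


# Timestamps copied from the ClusterVersion history in the verification comment.
history = [
    {
        "startedTime": "2024-11-06T01:15:02Z",
        "completionTime": "2024-11-06T01:36:57Z",
    },
    {
        "startedTime": "2024-11-06T00:34:12Z",
        "completionTime": "2024-11-06T00:54:58Z",
    },
]

for entry in history:
    print(update_duration(entry))
# prints 0:21:55, then 0:20:46
```

The first entry works out to just under 22 minutes, in line with the verification comment and well above the "less than 2 minutes" described in OTA-960.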
Set a new operator-image entry in the ClusterOperator status.versions manifest, so the cluster-version operator will wait for the in-cluster status to have both:

* The operator value we already declare, so the CVO waits for us to hit the target version.
* The operator-image value I'm adding in this commit, to help distinguish between single-arch and multi-arch targets which share the same version string (but have unique release and MCO pullspecs).

Without this change, the CVO immediately thinks the MCO has completed its update ("operator matches 4.13.0, and that's all I've been asked to match!") while the MCO is still working to pivot to the new, multi-arch component. Folks who try to take advantage of the new multi-arch functionality at this point might be surprised to have ClusterVersion claiming the update to multi-arch is complete, even though the MCO is still only part way through that transition, and has not yet positioned itself to point at the multi-arch RHCOS image when new Machines try to come up (OCPBUGS-10300).