Skip to content

NO-ISSUE: pkg/operator/status: Block ClusterVersion updates until multi-arch transitions complete #4637

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Conversation

wking
Copy link
Member

@wking wking commented Oct 10, 2024

By setting a new operator-image entry in the ClusterOperator status.versions manifest, so the cluster-version operator will wait for the in-cluster status to have both:

  • The operator value we already declare, so the CVO waits for us to hit the target version.
  • The operator-image value I'm adding in this commit, to help distinguish between single-arch and multi-arch targets which share the same version string (but have unique release and MCO pullspecs).

Without this change, the CVO immediately thinks the MCO has completed its update ("operator matches 4.13.0, and that's all I've been asked to match!") while the MCO is still working to pivot to the new, multi-arch component. Folks who try to take advantage of the new multi-arch functionality at this point might be surprised to have ClusterVersion claiming the update to multi-arch is complete, even though the MCO is still only part way through that transition, and has not yet positioned itself to point at the multi-arch RHCOS image when new Machines try to come up (OCPBUGS-10300).

@openshift-ci openshift-ci bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Oct 10, 2024
@openshift-ci openshift-ci bot requested review from djoshy and yuqi-zhang October 10, 2024 21:54
@wking wking force-pushed the cluster-operator-version-operator-image branch from 7c3fb6a to 1db0be3 Compare October 10, 2024 22:48
@petr-muller
Copy link
Member

/cc

@openshift-ci openshift-ci bot requested a review from petr-muller October 11, 2024 11:59
@wking
Copy link
Member Author

wking commented Oct 11, 2024

Looks good for the CVO waiting on the MCO to bump operator-image (which is not wired up yet):

upgrade: [sig-cluster-lifecycle] Cluster completes upgrade	2h30m0s
{Cluster did not complete upgrade: timed out waiting for the condition: Working towards 4.18.0-0.ci.test-2024-10-10-230252-ci-op-36qhs6g6-latest: 761 of 890 done (85% complete), waiting on machine-config  }

The CVO's logging could be more specific about what it's waiting for:

$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/test-platform-results/pr-logs/pull/openshift_machine-config-operator/4637/pull-ci-openshift-machine-config-operator-master-e2e-aws-ovn-upgrade/1844510399173496832/artifacts/e2e-aws-ovn-upgrade/gather-extra/artifacts/pods/openshift-cluster-version_cluster-version-operator-58fbdfccc7-nw4fq_cluster-version-operator.log | grep 'error running apply for clusteroperator "machine-config"' | tail -n2
E1011 02:22:34.064164       1 task.go:122] error running apply for clusteroperator "machine-config" (762 of 890): Cluster operator machine-config is updating versions
E1011 02:25:48.076463       1 task.go:122] error running apply for clusteroperator "machine-config" (762 of 890): Cluster operator machine-config is updating versions

…ansitions complete

By setting a new 'operator-image' entry in the ClusterOperator
status.versions manifest, so the cluster-version operator will wait
for the in-cluster status to have both [1]:

* The 'operator' value we already declare, so the CVO waits for us to
  hit the target version.
* The 'operator-image' value I'm adding in this commit, to help
  distinguish between single-arch and multi-arch targets which share
  the same version string (but have unique release and MCO pullspecs).

Without this change, the CVO immediately thinks the MCO has completed
its update ("'operator' matches 4.13.0, and that's all I've been asked
to match!") while the MCO is still working to pivot to the new,
multi-arch component.  Folks who try to take advantage of the new
multi-arch functionality at this point might be surprised to have
ClusterVersion claiming the update to multi-arch is complete, even
though the MCO is still only part way through that transition, and has
not yet positioned itself to point at the multi-arch RHCOS image when
new Machines try to come up [2].

[1]: https://github.com/openshift/cluster-version-operator/blob/e546515213c8681ca44c52f178401cd47ad07d11/pkg/cvo/internal/operatorstatus.go#L174-L176
[2]: https://issues.redhat.com/browse/OCPBUGS-10300
@wking wking force-pushed the cluster-operator-version-operator-image branch from 1db0be3 to 6c309bf Compare October 11, 2024 17:22
@wking wking changed the title WIP: pkg/operator/status: Block ClusterVersion updates until multi-arch transitions complete pkg/operator/status: Block ClusterVersion updates until multi-arch transitions complete Oct 11, 2024
@openshift-ci openshift-ci bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Oct 11, 2024
@wking
Copy link
Member Author

wking commented Oct 11, 2024

1db0be3 -> 6c309bf adds the missing wiring to bump operator-image in status when reconciliation completes (thanks @hongkailiu 🙇)

@wking
Copy link
Member Author

wking commented Oct 11, 2024

Pod sandboxing, HyperShift infrastructure, PodImageBuilders, and ValidatingAdmissionPolicy all seem unrelated to my change, so at the moment CI is looking good. But I'm not going to launch a retest while we build consensus around this approach.

Copy link
Contributor

@yuqi-zhang yuqi-zhang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/lgtm

Did you want to backport this at all? Or do you want to go with no-jira for this?

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Oct 22, 2024
Copy link
Contributor

openshift-ci bot commented Oct 22, 2024

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: wking, yuqi-zhang

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Oct 22, 2024
@wking wking changed the title pkg/operator/status: Block ClusterVersion updates until multi-arch transitions complete NO-ISSUE: pkg/operator/status: Block ClusterVersion updates until multi-arch transitions complete Oct 22, 2024
@openshift-ci-robot openshift-ci-robot added the jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. label Oct 22, 2024
@openshift-ci-robot
Copy link
Contributor

@wking: This pull request explicitly references no jira issue.

In response to this:

By setting a new operator-image entry in the ClusterOperator status.versions manifest, so the cluster-version operator will wait for the in-cluster status to have both:

  • The operator value we already declare, so the CVO waits for us to hit the target version.
  • The operator-image value I'm adding in this commit, to help distinguish between single-arch and multi-arch targets which share the same version string (but have unique release and MCO pullspecs).

Without this change, the CVO immediately thinks the MCO has completed its update ("operator matches 4.13.0, and that's all I've been asked to match!") while the MCO is still working to pivot to the new, multi-arch component. Folks who try to take advantage of the new multi-arch functionality at this point might be surprised to have ClusterVersion claiming the update to multi-arch is complete, even though the MCO is still only part way through that transition, and has not yet positioned itself to point at the multi-arch RHCOS image when new Machines try to come up (OCPBUGS-10300).

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-ci-robot
Copy link
Contributor

/retest-required

Remaining retests: 0 against base HEAD ade515e and 2 for PR HEAD 6c309bf in total

Copy link
Contributor

openshift-ci bot commented Oct 22, 2024

@wking: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
ci/prow/e2e-vsphere-ovn-upi 6c309bf link false /test e2e-vsphere-ovn-upi
ci/prow/e2e-gcp-op-techpreview 6c309bf link false /test e2e-gcp-op-techpreview
ci/prow/e2e-aws-ovn-upgrade-out-of-change 6c309bf link false /test e2e-aws-ovn-upgrade-out-of-change
ci/prow/e2e-azure-ovn-upgrade-out-of-change 6c309bf link false /test e2e-azure-ovn-upgrade-out-of-change

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@hongkailiu
Copy link
Member

/retest-required

@openshift-merge-bot openshift-merge-bot bot merged commit 6708003 into openshift:master Oct 22, 2024
13 of 17 checks passed
@wking wking deleted the cluster-operator-version-operator-image branch October 22, 2024 17:48
@openshift-bot
Copy link
Contributor

[ART PR BUILD NOTIFIER]

Distgit: ose-machine-config-operator
This PR has been included in build ose-machine-config-operator-container-v4.18.0-202410221941.p0.g6708003.assembly.stream.el9.
All builds following this will include this PR.

@hongkailiu
Copy link
Member

hongkailiu commented Nov 6, 2024

Verify https://issues.redhat.com/browse/OTA-960:

$ oc image extract quay.io/openshift-release-dev/ocp-release:4.18.0-ec.3-x86_64 --path /manifests/:. --path /release-manifests/:.
$ cat 0000_80_machine-config_06_clusteroperator.yaml | yq -y '.status.versions[]|select(.name=="operator-image")'
name: operator-image
version: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:9a2ca9b7808dd922f00fc37100229c08f005357386e4b24628faeef49222b2ea

launch 4.18.0-ec.3 aws,amd64

$ oc get co machine-config -o yaml | yq -y '.status.versions[]|select(.name=="operator-image")'
name: operator-image
version: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:9a2ca9b7808dd922f00fc37100229c08f005357386e4b24628faeef49222b2ea

$ oc adm release info -o json | jq -r '.metadata'
{
  "kind": "cincinnati-metadata-v0",
  "version": "4.18.0-ec.3",
  "previous": [
    "4.17.0",
    "4.17.1",
    "4.17.2",
    "4.17.3",
    "4.18.0-ec.0",
    "4.18.0-ec.1",
    "4.18.0-ec.2"
  ]
}

$ oc get clusterversion
NAME      VERSION       AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.18.0-ec.3   True        False         19m     Cluster version is 4.18.0-ec.3


$ oc adm upgrade --to-multi-arch


$ oc get clusterversion
NAME      VERSION       AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.18.0-ec.3   True        True          15m     Working towards 4.18.0-ec.3: 761 of 890 done (85% complete), waiting on machine-config

$ oc get clusterversion version -o yaml | yq '.status.history'
[
  {
    "completionTime": "2024-11-06T01:36:57Z",
    "image": "quay.io/openshift-release-dev/ocp-release@sha256:79347bc3313a05e374a40fe47de804e23cc14f795ee51699721e459706cfe2c0",
    "startedTime": "2024-11-06T01:15:02Z",
    "state": "Completed",
    "verified": true,
    "version": "4.18.0-ec.3"
  },
  {
    "completionTime": "2024-11-06T00:54:58Z",
    "image": "registry.build09.ci.openshift.org/ci-ln-1wyzzqb/release@sha256:d2d34aafe0adda79953dd928b946ecbda34673180ee9a80d2ee37c123a0f510c",
    "startedTime": "2024-11-06T00:34:12Z",
    "state": "Completed",
    "verified": false,
    "version": "4.18.0-ec.3"
  }
]

It took about 20m to upgrade. Not like "less than 2 minutes" in the description of OTA-960. So the bug is fixed.
cv.mig2multi.yaml.zip

I will let https://issues.redhat.com/browse/OTA-1386 do further verification by creating a cross-arch node.

$ oc adm release info -o json | jq -r '.metadata'
{
  "kind": "cincinnati-metadata-v0",
  "version": "4.18.0-ec.3",
  "previous": [
    "4.17.0",
    "4.17.1",
    "4.17.2",
    "4.17.3",
    "4.18.0-ec.0",
    "4.18.0-ec.1",
    "4.18.0-ec.2"
  ],
  "metadata": {
    "release.openshift.io/architecture": "multi"
  }
}

$ oc get co machine-config -o yaml | yq -y '.status.versions[]|select(.name=="operator-image")'
name: operator-image
version: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:787a505ca594b0a727549353c503dec9233a9d3c2dcd6b64e3de5f998892a1d5

$ oc image extract quay.io/openshift-release-dev/ocp-release:4.18.0-ec.3-multi --path /manifests/:. --path /release-manifests/:.
$ cat 0000_80_machine-config_06_clusteroperator.yaml | yq -y '.status.versions[]|select(.name=="operator-image")'
name: operator-image
version: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:787a505ca594b0a727549353c503dec9233a9d3c2dcd6b64e3de5f998892a1d5

@petr-muller
Copy link
Member

petr-muller commented Nov 6, 2024

It took about 20m to upgrade. Not like "less than 2 minutes" in the description of OTA-960. So the bug is fixed.

Yeah, that's about consistent with how long it typically takes to update MCO alone in a cluster bot cluster. That's exactly the behavior I'd expect 👍

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. lgtm Indicates that a PR is ready to be merged.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants