OCPBUGS-57032: upgrade.go: wait some time after node upgrade #29960

Closed
wants to merge 1 commit

Conversation

dgrisonnet
Member

For the "Cluster should remain functional during upgrade" test, TRT noticed flakes from the step that verifies that deamonsets are running on all expected nodes after an upgrade. This flake was caused by the verification of the deamonset happening too quickly after the upgrade. As soon as the last upgraded node becomes ready the check happens, but it doesn't always leave enough time for the deamonset to restart, thus causing the test to fail.

For the "Cluster should remain functional during upgrade" test, TRT
noticed flakes from the step that verifies that deamonsets are running
on all expected nodes afer an upgrade.  This flake was caused by the
verification of the deamonset happening too quickly after the upgrade.
As soon as the last upgraded node becomes ready the check happens, but it
doesn't always leave enough time for the deamonset to restart, thus
causing the test to fail.

Signed-off-by: Damien Grisonnet <dgrisonn@redhat.com>
@openshift-ci-robot openshift-ci-robot added jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. jira/valid-bug Indicates that a referenced Jira bug is valid for the branch this PR is targeting. labels Jul 7, 2025
@openshift-ci-robot

@dgrisonnet: This pull request references Jira Issue OCPBUGS-57032, which is valid. The bug has been moved to the POST state.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (4.20.0) matches configured target version for branch (4.20.0)
  • bug is in the state ASSIGNED, which is one of the valid states (NEW, ASSIGNED, POST)

Requesting review from QA contact:
/cc @jiajliu

The bug has been updated to refer to the pull request using the external bug tracker.

In response to this:

For the "Cluster should remain functional during upgrade" test, TRT noticed flakes from the step that verifies that deamonsets are running on all expected nodes after an upgrade. This flake was caused by the verification of the deamonset happening too quickly after the upgrade. As soon as the last upgraded node becomes ready the check happens, but it doesn't always leave enough time for the deamonset to restart, thus causing the test to fail.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-ci openshift-ci bot requested review from jiajliu, p0lyn0mial and sjenning July 7, 2025 13:23
@sdodson
Member

sdodson commented Jul 7, 2025

/approve

Contributor

openshift-ci bot commented Jul 7, 2025

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: dgrisonnet, sdodson
Once this PR has been reviewed and has the lgtm label, please assign smg247 for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@dgrisonnet
Member Author

/retest-required

@dgrisonnet
Member Author

/retest

@dgrisonnet
Member Author

/retest-required

Contributor

openshift-ci bot commented Jul 9, 2025

@dgrisonnet: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
ci/prow/e2e-gcp-fips-serial-2of2 b490b31 link false /test e2e-gcp-fips-serial-2of2
ci/prow/e2e-azure-ovn-etcd-scaling b490b31 link false /test e2e-azure-ovn-etcd-scaling
ci/prow/e2e-aws-ovn-serial-publicnet-1of2 b490b31 link false /test e2e-aws-ovn-serial-publicnet-1of2
ci/prow/e2e-gcp-fips-serial-1of2 b490b31 link false /test e2e-gcp-fips-serial-1of2
ci/prow/e2e-vsphere-ovn-etcd-scaling b490b31 link false /test e2e-vsphere-ovn-etcd-scaling
ci/prow/e2e-openstack-serial b490b31 link false /test e2e-openstack-serial
ci/prow/e2e-aws-ovn-etcd-scaling b490b31 link false /test e2e-aws-ovn-etcd-scaling
ci/prow/e2e-azure-ovn-upgrade b490b31 link false /test e2e-azure-ovn-upgrade
ci/prow/e2e-gcp-disruptive b490b31 link false /test e2e-gcp-disruptive
ci/prow/okd-e2e-gcp b490b31 link false /test okd-e2e-gcp
ci/prow/e2e-agnostic-ovn-cmd b490b31 link false /test e2e-agnostic-ovn-cmd
ci/prow/e2e-vsphere-ovn-dualstack-primaryv6 b490b31 link false /test e2e-vsphere-ovn-dualstack-primaryv6
ci/prow/e2e-aws-ovn-upgrade-rollback b490b31 link false /test e2e-aws-ovn-upgrade-rollback
ci/prow/e2e-aws-disruptive b490b31 link false /test e2e-aws-disruptive
ci/prow/e2e-gcp-ovn-etcd-scaling b490b31 link false /test e2e-gcp-ovn-etcd-scaling
ci/prow/e2e-metal-ipi-ovn-ipv6 b490b31 link true /test e2e-metal-ipi-ovn-ipv6

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.


openshift-trt bot commented Jul 9, 2025

Job Failure Risk Analysis for sha: b490b31

Job Name Failure Risk
pull-ci-openshift-origin-main-e2e-aws-disruptive Medium
[sig-node] static pods should start after being created
Potential external regression detected for High Risk Test analysis
---
[bz-Etcd] clusteroperator/etcd should not change condition/Available
Potential external regression detected for High Risk Test analysis
pull-ci-openshift-origin-main-e2e-azure-ovn-etcd-scaling Low
[bz-Cloud Compute] clusteroperator/control-plane-machine-set should not change condition/Degraded
This test has passed 0.00% of 1 runs on release 4.20 [Architecture:amd64 FeatureSet:default Installer:ipi JobTier:rare Network:ovn NetworkStack:ipv4 Owner:eng Platform:azure SecurityMode:default Topology:ha Upgrade:none] in the last week.
pull-ci-openshift-origin-main-e2e-azure-ovn-upgrade Medium
Job run should complete before timeout
This test has passed 95.40% of 4583 runs on release 4.20 [Overall] in the last week.
pull-ci-openshift-origin-main-e2e-gcp-disruptive Medium
[sig-architecture] platform pods in ns/openshift-etcd should not exit an excessive amount of times
Potential external regression detected for High Risk Test analysis

@@ -188,6 +188,9 @@ var _ = g.Describe("[sig-arch][Feature:ClusterUpgrade]", func() {
 		clusterUpgrade(f, client, dynamicClient, config, upgCtx.Versions[i]),
 		fmt.Sprintf("during upgrade to %s", upgCtx.Versions[i].NodeImage))
 	}
+	// Sleep to give some time to the workloads on the last upgraded
+	// node to restart.
+	time.Sleep(30 * time.Second)
Contributor


Could you take the poll approach so if we only need a few seconds, we don't use the full 30? With 5000+ tests we need to minimize the sleeps whenever possible. One of the wait.Poll functions.
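For illustration only, here is a rough sketch of what that polling approach could look like, assuming a client-go clientset named client and the k8s.io/apimachinery/pkg/util/wait helpers; the helper name waitForDaemonSetsReady and the package name are hypothetical, and this is not the code that eventually landed in #30062:

package upgrade_test // illustrative package name, not the actual file layout

import (
	"context"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/util/wait"
	"k8s.io/client-go/kubernetes"
)

// waitForDaemonSetsReady polls until every DaemonSet in the cluster reports as
// many ready pods as it wants scheduled, or the timeout expires. Unlike a fixed
// time.Sleep(30 * time.Second), it returns as soon as the condition holds.
func waitForDaemonSetsReady(ctx context.Context, client kubernetes.Interface, timeout time.Duration) error {
	return wait.PollUntilContextTimeout(ctx, 5*time.Second, timeout, true, func(ctx context.Context) (bool, error) {
		daemonSets, err := client.AppsV1().DaemonSets(metav1.NamespaceAll).List(ctx, metav1.ListOptions{})
		if err != nil {
			// Treat transient API errors as "not ready yet" and keep polling until the timeout.
			return false, nil
		}
		for _, ds := range daemonSets.Items {
			if ds.Status.NumberReady < ds.Status.DesiredNumberScheduled {
				return false, nil
			}
		}
		return true, nil
	})
}

The test would then call something like waitForDaemonSetsReady(ctx, client, 2*time.Minute) after the last node upgrade instead of sleeping unconditionally, so the common case only waits a few seconds.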

Member


Used Cursor to refactor this with a polling loop in #30062; I tried to give credit in that PR's description for where the real work happened.

@dgrisonnet
Member Author

superseded by #30062

@dgrisonnet dgrisonnet closed this Aug 6, 2025
@openshift-ci-robot

@dgrisonnet: This pull request references Jira Issue OCPBUGS-57032. The bug has been updated to no longer refer to the pull request using the external bug tracker.

In response to this:

For the "Cluster should remain functional during upgrade" test, TRT noticed flakes from the step that verifies that deamonsets are running on all expected nodes after an upgrade. This flake was caused by the verification of the deamonset happening too quickly after the upgrade. As soon as the last upgraded node becomes ready the check happens, but it doesn't always leave enough time for the deamonset to restart, thus causing the test to fail.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

Labels
jira/valid-bug Indicates that a referenced Jira bug is valid for the branch this PR is targeting. jira/valid-reference Indicates that this PR references a valid Jira ticket of any type.
4 participants