OCPBUGS-57032: upgrade.go: wait some time after node upgrade #29960

Closed
wants to merge 1 commit

Conversation

dgrisonnet
Member

For the "Cluster should remain functional during upgrade" test, TRT noticed flakes from the step that verifies that deamonsets are running on all expected nodes after an upgrade. This flake was caused by the verification of the deamonset happening too quickly after the upgrade. As soon as the last upgraded node becomes ready the check happens, but it doesn't always leave enough time for the deamonset to restart, thus causing the test to fail.

For the "Cluster should remain functional during upgrade" test, TRT
noticed flakes from the step that verifies that deamonsets are running
on all expected nodes afer an upgrade.  This flake was caused by the
verification of the deamonset happening too quickly after the upgrade.
As soon as the last upgraded node becomes ready the check happens, but it
doesn't always leave enough time for the deamonset to restart, thus
causing the test to fail.

Signed-off-by: Damien Grisonnet <dgrisonn@redhat.com>
@openshift-ci-robot openshift-ci-robot added jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. jira/valid-bug Indicates that a referenced Jira bug is valid for the branch this PR is targeting. labels Jul 7, 2025
@openshift-ci-robot

@dgrisonnet: This pull request references Jira Issue OCPBUGS-57032, which is valid. The bug has been moved to the POST state.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (4.20.0) matches configured target version for branch (4.20.0)
  • bug is in the state ASSIGNED, which is one of the valid states (NEW, ASSIGNED, POST)

Requesting review from QA contact:
/cc @jiajliu

The bug has been updated to refer to the pull request using the external bug tracker.

In response to this:

For the "Cluster should remain functional during upgrade" test, TRT noticed flakes from the step that verifies that deamonsets are running on all expected nodes after an upgrade. This flake was caused by the verification of the deamonset happening too quickly after the upgrade. As soon as the last upgraded node becomes ready the check happens, but it doesn't always leave enough time for the deamonset to restart, thus causing the test to fail.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-ci openshift-ci bot requested review from jiajliu, p0lyn0mial and sjenning July 7, 2025 13:23
@sdodson
Member

sdodson commented Jul 7, 2025

/approve

Contributor

openshift-ci bot commented Jul 7, 2025

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: dgrisonnet, sdodson
Once this PR has been reviewed and has the lgtm label, please assign smg247 for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@dgrisonnet
Member Author

/retest-required

@dgrisonnet
Member Author

/retest

@dgrisonnet
Member Author

/retest-required

Contributor

openshift-ci bot commented Jul 9, 2025

@dgrisonnet: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
ci/prow/e2e-gcp-fips-serial-2of2 b490b31 link false /test e2e-gcp-fips-serial-2of2
ci/prow/e2e-azure-ovn-etcd-scaling b490b31 link false /test e2e-azure-ovn-etcd-scaling
ci/prow/e2e-aws-ovn-serial-publicnet-1of2 b490b31 link false /test e2e-aws-ovn-serial-publicnet-1of2
ci/prow/e2e-gcp-fips-serial-1of2 b490b31 link false /test e2e-gcp-fips-serial-1of2
ci/prow/e2e-vsphere-ovn-etcd-scaling b490b31 link false /test e2e-vsphere-ovn-etcd-scaling
ci/prow/e2e-openstack-serial b490b31 link false /test e2e-openstack-serial
ci/prow/e2e-aws-ovn-etcd-scaling b490b31 link false /test e2e-aws-ovn-etcd-scaling
ci/prow/e2e-azure-ovn-upgrade b490b31 link false /test e2e-azure-ovn-upgrade
ci/prow/e2e-gcp-disruptive b490b31 link false /test e2e-gcp-disruptive
ci/prow/okd-e2e-gcp b490b31 link false /test okd-e2e-gcp
ci/prow/e2e-agnostic-ovn-cmd b490b31 link false /test e2e-agnostic-ovn-cmd
ci/prow/e2e-vsphere-ovn-dualstack-primaryv6 b490b31 link false /test e2e-vsphere-ovn-dualstack-primaryv6
ci/prow/e2e-aws-ovn-upgrade-rollback b490b31 link false /test e2e-aws-ovn-upgrade-rollback
ci/prow/e2e-aws-disruptive b490b31 link false /test e2e-aws-disruptive
ci/prow/e2e-gcp-ovn-etcd-scaling b490b31 link false /test e2e-gcp-ovn-etcd-scaling
ci/prow/e2e-metal-ipi-ovn-ipv6 b490b31 link true /test e2e-metal-ipi-ovn-ipv6

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.


openshift-trt bot commented Jul 9, 2025

Job Failure Risk Analysis for sha: b490b31

Job Name Failure Risk
pull-ci-openshift-origin-main-e2e-aws-disruptive Medium
[sig-node] static pods should start after being created
Potential external regression detected for High Risk Test analysis
---
[bz-Etcd] clusteroperator/etcd should not change condition/Available
Potential external regression detected for High Risk Test analysis
pull-ci-openshift-origin-main-e2e-azure-ovn-etcd-scaling Low
[bz-Cloud Compute] clusteroperator/control-plane-machine-set should not change condition/Degraded
This test has passed 0.00% of 1 runs on release 4.20 [Architecture:amd64 FeatureSet:default Installer:ipi JobTier:rare Network:ovn NetworkStack:ipv4 Owner:eng Platform:azure SecurityMode:default Topology:ha Upgrade:none] in the last week.
pull-ci-openshift-origin-main-e2e-azure-ovn-upgrade Medium
Job run should complete before timeout
This test has passed 95.40% of 4583 runs on release 4.20 [Overall] in the last week.
pull-ci-openshift-origin-main-e2e-gcp-disruptive Medium
[sig-architecture] platform pods in ns/openshift-etcd should not exit an excessive amount of times
Potential external regression detected for High Risk Test analysis

@@ -188,6 +188,9 @@ var _ = g.Describe("[sig-arch][Feature:ClusterUpgrade]", func() {
 		clusterUpgrade(f, client, dynamicClient, config, upgCtx.Versions[i]),
 		fmt.Sprintf("during upgrade to %s", upgCtx.Versions[i].NodeImage))
 	}
+	// Sleep to give some time to the workloads on the last upgraded
+	// node to restart.
+	time.Sleep(30 * time.Second)
Contributor


Could you take the poll approach so if we only need a few seconds, we don't use the full 30? With 5000+ tests we need to minimize the sleeps whenever possible. One of the wait.Poll functions.
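For illustration only, here is a rough sketch of what that polling approach could look like, assuming a client-go clientset named client and the k8s.io/apimachinery/pkg/util/wait helpers; the helper name waitForDaemonSetsReady and the package name are hypothetical, and this is not the code that eventually landed in #30062:

package upgrade_test // illustrative package name, not the actual file layout

import (
	"context"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/util/wait"
	"k8s.io/client-go/kubernetes"
)

// waitForDaemonSetsReady polls until every DaemonSet in the cluster reports as
// many ready pods as it wants scheduled, or the timeout expires. Unlike a fixed
// time.Sleep(30 * time.Second), it returns as soon as the condition holds.
func waitForDaemonSetsReady(ctx context.Context, client kubernetes.Interface, timeout time.Duration) error {
	return wait.PollUntilContextTimeout(ctx, 5*time.Second, timeout, true, func(ctx context.Context) (bool, error) {
		daemonSets, err := client.AppsV1().DaemonSets(metav1.NamespaceAll).List(ctx, metav1.ListOptions{})
		if err != nil {
			// Treat transient API errors as "not ready yet" and keep polling until the timeout.
			return false, nil
		}
		for _, ds := range daemonSets.Items {
			if ds.Status.NumberReady < ds.Status.DesiredNumberScheduled {
				return false, nil
			}
		}
		return true, nil
	})
}

The test would then call something like waitForDaemonSetsReady(ctx, client, 2*time.Minute) after the last node upgrade instead of sleeping unconditionally, so the common case only waits a few seconds.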

Member


Used Cursor to refactor this with a polling loop in #30062; I tried to give credit in that PR's description for where the real work happened.

@dgrisonnet
Member Author

superseded by #30062

@dgrisonnet dgrisonnet closed this Aug 6, 2025
@openshift-ci-robot

@dgrisonnet: This pull request references Jira Issue OCPBUGS-57032. The bug has been updated to no longer refer to the pull request using the external bug tracker.

In response to this:

For the "Cluster should remain functional during upgrade" test, TRT noticed flakes from the step that verifies that deamonsets are running on all expected nodes after an upgrade. This flake was caused by the verification of the deamonset happening too quickly after the upgrade. As soon as the last upgraded node becomes ready the check happens, but it doesn't always leave enough time for the deamonset to restart, thus causing the test to fail.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

Labels
jira/valid-bug Indicates that a referenced Jira bug is valid for the branch this PR is targeting. jira/valid-reference Indicates that this PR references a valid Jira ticket of any type.
4 participants