OCPBUGS-57032: upgrade.go: wait some time after node upgrade #29960
For the "Cluster should remain functional during upgrade" test, TRT noticed flakes from the step that verifies that daemonsets are running on all expected nodes after an upgrade. The flake was caused by the daemonset verification running too soon after the upgrade: the check happens as soon as the last upgraded node becomes ready, which doesn't always leave enough time for the daemonset to restart, so the test fails.
Signed-off-by: Damien Grisonnet <dgrisonn@redhat.com>
@dgrisonnet: This pull request references Jira Issue OCPBUGS-57032, which is valid. The bug has been moved to the POST state. 3 validation(s) were run on this bug
Requesting review from QA contact: The bug has been updated to refer to the pull request using the external bug tracker. In response to this:
/approve |
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: dgrisonnet, sdodson The full list of commands accepted by this bot can be found here.
/retest-required |
/retest |
/retest-required |
@dgrisonnet: The following tests failed, say
Job Failure Risk Analysis for sha: b490b31
@@ -188,6 +188,9 @@ var _ = g.Describe("[sig-arch][Feature:ClusterUpgrade]", func() {
 				clusterUpgrade(f, client, dynamicClient, config, upgCtx.Versions[i]),
 				fmt.Sprintf("during upgrade to %s", upgCtx.Versions[i].NodeImage))
 		}
+		// Sleep to give some time to the workloads on the last upgraded
+		// node to restart.
+		time.Sleep(30 * time.Second)
Could you take the poll approach so if we only need a few seconds, we don't use the full 30? With 5000+ tests we need to minimize the sleeps whenever possible. One of the wait.Poll functions.
Used Cursor to refactor this with a polling loop in #30062; tried to give credit in the PR description for where the real work happened.
Superseded by #30062
@dgrisonnet: This pull request references Jira Issue OCPBUGS-57032. The bug has been updated to no longer refer to the pull request using the external bug tracker. In response to this:
For the "Cluster should remain functional during upgrade" test, TRT noticed flakes from the step that verifies that daemonsets are running on all expected nodes after an upgrade. The flake was caused by the daemonset verification running too soon after the upgrade: the check happens as soon as the last upgraded node becomes ready, which doesn't always leave enough time for the daemonset to restart, so the test fails.