OCPBUGS-57032: Add wait.Poll retry logic to checkUpgradeability with 30s timeout #30062

Merged
1 commit merged on Aug 8, 2025

Conversation

sdodson
Member

@sdodson sdodson commented Aug 5, 2025

Based on Damien's analysis, the test is failing because there wasn't enough time between the upgrade completing and the DaemonSet pod becoming ready. In the cases he reviewed, only single-digit seconds passed between when the test checked and when the pods became ready.

In #29960 Damien added a 30s sleep, but Devan suggested a polling loop instead to limit downtime.

@openshift-ci openshift-ci bot requested review from deads2k and sjenning August 5, 2025 18:39
@sdodson
Member Author

sdodson commented Aug 5, 2025

/uncc @deads2k @sjenning
/cc @dgrisonnet @dgoodwin
/retitle OCPBUGS-57032: Add wait.Poll retry logic to checkUpgradeability with 30s timeout

@openshift-ci openshift-ci bot removed request for sjenning and deads2k August 5, 2025 18:53
@openshift-ci openshift-ci bot changed the title from "Refactor cluster upgrade loop to use wait.Poll with 30s backoff" to "OCPBUGS-57032: Add wait.Poll retry logic to checkUpgradeability with 30s timeout" Aug 5, 2025
@openshift-ci openshift-ci bot requested review from dgoodwin and dgrisonnet August 5, 2025 18:53
@openshift-ci-robot openshift-ci-robot added the jira/valid-reference (Indicates that this PR references a valid Jira ticket of any type.) and jira/valid-bug (Indicates that a referenced Jira bug is valid for the branch this PR is targeting.) labels Aug 5, 2025
@openshift-ci-robot

@sdodson: This pull request references Jira Issue OCPBUGS-57032, which is valid.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (4.20.0) matches configured target version for branch (4.20.0)
  • bug is in the state POST, which is one of the valid states (NEW, ASSIGNED, POST)

Requesting review from QA contact:
/cc @jiajliu

The bug has been updated to refer to the pull request using the external bug tracker.

@sdodson
Member Author

sdodson commented Aug 6, 2025

/retest-required

Member

@dgrisonnet dgrisonnet left a comment

Thank you for following up on that, Scott. Sadly, I didn't get time to look into my original PR again.

I believe this change will improve the situation.

/lgtm
/approve

@openshift-ci openshift-ci bot added the lgtm label (Indicates that a PR is ready to be merged.) Aug 6, 2025
Contributor

openshift-ci bot commented Aug 6, 2025

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: dgrisonnet, sdodson
Once this PR has been reviewed and has the lgtm label, please assign dennisperiquet for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@dgrisonnet
Member

/retest-required

@sdodson
Member Author

sdodson commented Aug 6, 2025

/label approved

@openshift-ci openshift-ci bot added the approved label (Indicates a PR has been approved by an approver from all required OWNERS files.) Aug 6, 2025
@sdodson
Member Author

sdodson commented Aug 6, 2025

/cherry-pick release-4.19

@openshift-cherrypick-robot

@sdodson: once the present PR merges, I will cherry-pick it on top of release-4.19 in a new PR and assign it to you.

@dinhxuanvu
Member

/retest-required

@openshift-ci-robot

/retest-required

Remaining retests: 0 against base HEAD 05eeb94 and 2 for PR HEAD 7af8448 in total

@dinhxuanvu
Member

/retest-required

@openshift-ci-robot

/retest-required

Remaining retests: 0 against base HEAD 022de33 and 1 for PR HEAD 7af8448 in total

@openshift-ci-robot

/retest-required

Remaining retests: 0 against base HEAD 7ccc307 and 0 for PR HEAD 7af8448 in total

@openshift-ci-robot

/retest-required

Remaining retests: 0 against base HEAD 7ccc307 and 2 for PR HEAD 7af8448 in total

Contributor

openshift-ci bot commented Aug 7, 2025

@sdodson: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
| --- | --- | --- | --- | --- |
| ci/prow/e2e-aws-ovn-single-node-upgrade | 7af8448 | link | false | /test e2e-aws-ovn-single-node-upgrade |
| ci/prow/e2e-gcp-ovn-techpreview-serial-2of2 | 7af8448 | link | false | /test e2e-gcp-ovn-techpreview-serial-2of2 |
| ci/prow/e2e-aws-ovn-upgrade-rollback | 7af8448 | link | false | /test e2e-aws-ovn-upgrade-rollback |
| ci/prow/e2e-metal-ipi-ovn-dualstack-local-gateway | 7af8448 | link | false | /test e2e-metal-ipi-ovn-dualstack-local-gateway |
| ci/prow/e2e-azure | 7af8448 | link | false | /test e2e-azure |
| ci/prow/e2e-gcp-ovn-techpreview | 7af8448 | link | false | /test e2e-gcp-ovn-techpreview |
| ci/prow/e2e-gcp-csi | 7af8448 | link | false | /test e2e-gcp-csi |
| ci/prow/e2e-aws-proxy | 7af8448 | link | false | /test e2e-aws-proxy |
| ci/prow/e2e-aws-ovn-single-node-serial | 7af8448 | link | false | /test e2e-aws-ovn-single-node-serial |
| ci/prow/e2e-aws-disruptive | 7af8448 | link | false | /test e2e-aws-disruptive |
| ci/prow/e2e-metal-ipi-ovn-kube-apiserver-rollout | 7af8448 | link | false | /test e2e-metal-ipi-ovn-kube-apiserver-rollout |


openshift-trt bot commented Aug 7, 2025

Job Failure Risk Analysis for sha: 7af8448

| Job Name | Failure Risk |
| --- | --- |
| pull-ci-openshift-origin-main-e2e-aws-disruptive | IncompleteTests. Tests for this run (106) are below the historical average (221): IncompleteTests (not enough tests ran to make a reasonable risk analysis; this could be due to infra, installation, or upgrade problems) |
| pull-ci-openshift-origin-main-e2e-gcp-csi | IncompleteTests. Tests for this run (21) are below the historical average (1655): IncompleteTests (not enough tests ran to make a reasonable risk analysis; this could be due to infra, installation, or upgrade problems) |
| pull-ci-openshift-origin-main-e2e-metal-ipi-ovn-kube-apiserver-rollout | IncompleteTests. Tests for this run (24) are below the historical average (1264): IncompleteTests (not enough tests ran to make a reasonable risk analysis; this could be due to infra, installation, or upgrade problems) |

@sdodson
Member Author

sdodson commented Aug 7, 2025

/override ci/prow/e2e-gcp-ovn-upgrade
This job passed previously but fails now and has been failing for most PRs today.

Contributor

openshift-ci bot commented Aug 7, 2025

@sdodson: Overrode contexts on behalf of sdodson: ci/prow/e2e-gcp-ovn-upgrade

@openshift-ci-robot

/retest-required

Remaining retests: 0 against base HEAD 83a2325 and 1 for PR HEAD 7af8448 in total

@openshift-merge-bot openshift-merge-bot bot merged commit a1328e4 into openshift:main Aug 8, 2025
37 of 48 checks passed
@openshift-ci-robot

@sdodson: Jira Issue OCPBUGS-57032: All pull requests linked via external trackers have merged:

Jira Issue OCPBUGS-57032 has been moved to the MODIFIED state.

@openshift-cherrypick-robot

@sdodson: new pull request created: #30079

@openshift-bot
Contributor

[ART PR BUILD NOTIFIER]

Distgit: openshift-enterprise-tests
This PR has been included in build openshift-enterprise-tests-container-v4.20.0-202508081114.p0.ga1328e4.assembly.stream.el9.
All builds following this will include this PR.

Labels
  • approved: Indicates a PR has been approved by an approver from all required OWNERS files.
  • jira/valid-bug: Indicates that a referenced Jira bug is valid for the branch this PR is targeting.
  • jira/valid-reference: Indicates that this PR references a valid Jira ticket of any type.
  • lgtm: Indicates that a PR is ready to be merged.
6 participants