OCPBUGS-60941: add individual context to the health check by tjungblu · Pull Request #1474 · openshift/cluster-etcd-operator

tjungblu · 2025-08-28T10:49:42Z

When one member timed out, the others were also declared unhealthy because the shared timeout context was cancelled.
This adds a new context with an individual timeout for each member health check.

With this fix, only the affected member is considered unhealthy:

[etcd-operator-566ff8dd4-dn5qx] E0828 11:20:04.415537       1 health.go:120] health check for member (tjungblu15-dq6nb-master-0) failed: err(context deadline exceeded)
[etcd-operator-566ff8dd4-dn5qx] W0828 11:20:04.415708       1 etcdcli.go:356] UnhealthyEtcdMember found: [tjungblu15-dq6nb-master-0]
...
[etcd-operator-566ff8dd4-dn5qx] E0828 11:21:01.300834       1 base_controller.go:279] "Unhandled Error" err="DefragController reconciliation failed: cluster is unhealthy: 2 of 3 members are available, tjungblu15-dq6nb-master-0 is unhealthy"

Signed-off-by: Thomas Jungblut <tjungblu@redhat.com>

openshift-ci-robot · 2025-08-28T10:49:47Z

@tjungblu: This pull request references Jira Issue OCPBUGS-60941, which is invalid:

expected the bug to target the "4.20.0" version, but no target version was set

Comment /jira refresh to re-evaluate validity if changes to the Jira bug are made, or edit the title of this pull request to link to a different bug.

The bug has been updated to refer to the pull request using the external bug tracker.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

lance5890 · 2025-08-28T12:19:33Z

pkg/etcdcli/health.go

+			memberCtx, cancel := context.WithTimeout(ctx, DefaultClientTimeout)
+			defer cancel()
+
+			memberHealth[i] = checkSingleMemberHealth(memberCtx, cli, member)


There may be a data race in the goroutine: it captures the loop variable member by reference, so all goroutines can end up checking the last member. We should pass member as an argument: go func(i int, m *etcdserverpb.Member) {...}(i, member)

tjungblu · 2025-08-28

Thanks for the quick review @lance5890. I think this should be fixed in later Go versions?
https://go.dev/blog/loopvar-preview
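
For illustration, the pattern suggested in the review, passing the loop variables to the goroutine as arguments, can be sketched as follows. Per the linked blog post, Go 1.22 gives each loop iteration its own copy of the variable (for modules that declare go 1.22 or later), so the explicit arguments are then redundant but harmless. The collect function and the member names are hypothetical and not taken from the operator:

```go
package main

import (
	"fmt"
	"sync"
)

// collect starts one goroutine per name and records which name each
// goroutine observed.
func collect(names []string) []string {
	seen := make([]string, len(names))
	var wg sync.WaitGroup
	for i, name := range names {
		wg.Add(1)
		// i and name are copied into the goroutine's parameters when the
		// goroutine is started, so later iterations cannot change them,
		// regardless of the Go version in use.
		go func(i int, name string) {
			defer wg.Done()
			seen[i] = name
		}(i, name)
	}
	wg.Wait()
	return seen
}

func main() {
	// prints [master-0 master-1 master-2]
	fmt.Println(collect([]string{"master-0", "master-1", "master-2"}))
}
```

Each goroutine writes to its own slice index, so the result keeps the input order without extra locking.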

dusk125 · 2025-08-28T13:51:55Z

/lgtm

openshift-ci · 2025-08-28T13:54:14Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: dusk125, tjungblu

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

~~OWNERS~~ [dusk125,tjungblu]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

tjungblu · 2025-08-28T15:12:13Z

/override ci/prow/e2e-aws-cpms
/override ci/prow/e2e-aws-ovn-etcd-scaling
/override ci/prow/e2e-metal-assisted

tjungblu · 2025-08-28T15:12:23Z

/label acknowledge-critical-fixes-only

openshift-ci · 2025-08-28T15:12:32Z

@tjungblu: Overrode contexts on behalf of tjungblu: ci/prow/e2e-aws-cpms, ci/prow/e2e-aws-ovn-etcd-scaling, ci/prow/e2e-metal-assisted

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

openshift-ci · 2025-08-28T15:12:37Z

@tjungblu: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name	Commit	Details	Required	Rerun command
ci/prow/e2e-aws-disruptive	`cd42ba7`	link	false	`/test e2e-aws-disruptive`
ci/prow/e2e-gcp-disruptive	`cd42ba7`	link	false	`/test e2e-gcp-disruptive`
ci/prow/e2e-metal-ovn-two-node-fencing	`cd42ba7`	link	false	`/test e2e-metal-ovn-two-node-fencing`
ci/prow/e2e-azure-ovn-etcd-scaling	`cd42ba7`	link	false	`/test e2e-azure-ovn-etcd-scaling`
ci/prow/e2e-gcp-ovn-etcd-scaling	`cd42ba7`	link	false	`/test e2e-gcp-ovn-etcd-scaling`
ci/prow/e2e-vsphere-ovn-etcd-scaling	`cd42ba7`	link	false	`/test e2e-vsphere-ovn-etcd-scaling`
ci/prow/e2e-gcp-disruptive-ovn	`cd42ba7`	link	false	`/test e2e-gcp-disruptive-ovn`
ci/prow/e2e-metal-ovn-sno-cert-rotation-shutdown	`cd42ba7`	link	false	`/test e2e-metal-ovn-sno-cert-rotation-shutdown`
ci/prow/e2e-metal-ovn-ha-cert-rotation-shutdown	`cd42ba7`	link	false	`/test e2e-metal-ovn-ha-cert-rotation-shutdown`
ci/prow/e2e-aws-disruptive-ovn	`cd42ba7`	link	false	`/test e2e-aws-disruptive-ovn`
ci/prow/e2e-aws-etcd-recovery	`cd42ba7`	link	false	`/test e2e-aws-etcd-recovery`
ci/prow/e2e-aws-etcd-certrotation	`cd42ba7`	link	false	`/test e2e-aws-etcd-certrotation`


tjungblu · 2025-08-28T15:15:41Z

/jira refresh

openshift-ci-robot · 2025-08-28T15:15:46Z

@tjungblu: This pull request references Jira Issue OCPBUGS-60941, which is invalid:

expected the bug to target the "4.20.0" version, but no target version was set

Comment /jira refresh to re-evaluate validity if changes to the Jira bug are made, or edit the title of this pull request to link to a different bug.


tjungblu · 2025-08-28T15:16:15Z

/jira refresh

openshift-ci-robot · 2025-08-28T15:16:25Z

@tjungblu: This pull request references Jira Issue OCPBUGS-60941, which is valid. The bug has been moved to the POST state.

3 validation(s) were run on this bug

bug is open, matching expected state (open)
bug target version (4.20.0) matches configured target version for branch (4.20.0)
bug is in the state New, which is one of the valid states (NEW, ASSIGNED, POST)

Requesting review from QA contact:
/cc @geliu2016


openshift-ci-robot · 2025-08-28T15:21:15Z

@tjungblu: Jira Issue OCPBUGS-60941: All pull requests linked via external trackers have merged:

openshift/cluster-etcd-operator#1474

Jira Issue OCPBUGS-60941 has been moved to the MODIFIED state.


tjungblu · 2025-08-28T15:22:31Z

/cherry-pick release-4.19

openshift-cherrypick-robot · 2025-08-28T15:23:20Z

@tjungblu: new pull request created: #1475


openshift-ci-robot added the jira/valid-reference and jira/invalid-bug labels Aug 28, 2025

openshift-ci bot requested review from dusk125 and jubittajohn August 28, 2025 10:52

openshift-ci bot added the approved label Aug 28, 2025

lance5890 reviewed Aug 28, 2025

openshift-ci bot assigned dusk125 Aug 28, 2025

openshift-ci bot added the lgtm label Aug 28, 2025

openshift-ci bot added the acknowledge-critical-fixes-only label Aug 28, 2025

openshift-ci-robot added the jira/valid-bug label and removed the jira/invalid-bug label Aug 28, 2025

openshift-ci bot requested a review from geliu2016 August 28, 2025 15:16

openshift-merge-bot bot merged commit 9091149 into openshift:main Aug 28, 2025
22 of 34 checks passed

tjungblu deleted the main branch August 28, 2025 15:22

openshift-cherrypick-robot mentioned this pull request Aug 28, 2025

[release-4.19] OCPBUGS-61019: add individual context to the health check #1475

Merged

5 participants