OCPBUGS-60941: add individual context to the health check #1474
openshift-merge-bot[bot] merged 1 commit into openshift:main
When one member timed out, the other members were also declared unhealthy because the shared timeout context was cancelled. This adds a new context with an individual timeout to each member health check.

Signed-off-by: Thomas Jungblut <tjungblu@redhat.com>
@tjungblu: This pull request references Jira Issue OCPBUGS-60941, which is invalid:
The bug has been updated to refer to the pull request using the external bug tracker.
    memberCtx, cancel := context.WithTimeout(ctx, DefaultClientTimeout)
    defer cancel()

    memberHealth[i] = checkSingleMemberHealth(memberCtx, cli, member)
There may be a data race in the goroutine: it captures the loop variable member by reference, so all goroutines could end up checking the last member. We should pass member as an argument: go func(i int, m *etcdserverpb.Member) {...}(i, member)
Thanks for the quick review, @lance5890. I think this should already be fixed in later Go versions?
https://go.dev/blog/loopvar-preview
/lgtm
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: dusk125, tjungblu
/override ci/prow/e2e-aws-cpms
/label acknowledge-critical-fixes-only
@tjungblu: Overrode contexts on behalf of tjungblu: ci/prow/e2e-aws-cpms, ci/prow/e2e-aws-ovn-etcd-scaling, ci/prow/e2e-metal-assisted
@tjungblu: The following tests failed, say
/jira refresh
@tjungblu: This pull request references Jira Issue OCPBUGS-60941, which is invalid:
/jira refresh
@tjungblu: This pull request references Jira Issue OCPBUGS-60941, which is valid. The bug has been moved to the POST state. 3 validation(s) were run on this bug
Requesting review from QA contact:
Merged commit 9091149 into openshift:main
@tjungblu: Jira Issue OCPBUGS-60941: All pull requests linked via external trackers have merged. Jira Issue OCPBUGS-60941 has been moved to the MODIFIED state.
/cherry-pick release-4.19
@tjungblu: new pull request created: #1475
When one member timed out, the other members were also declared unhealthy because the shared timeout context was cancelled.
This adds a new context with an individual timeout to each member health check.
With this fix, only the member that actually timed out is considered unhealthy: