NO-JIRA: [TNF] add Two Node Fencing exception to accept less than two etcd endpoints #30058

Open
wants to merge 1 commit into
base: main
from

Conversation

Contributor

@clobrano clobrano commented Aug 5, 2025

In a Two Node Fencing cluster, it is acceptable to have fewer than two etcd endpoints.

Contributor

openshift-ci bot commented Aug 5, 2025

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@openshift-ci openshift-ci bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Aug 5, 2025
Contributor

openshift-ci bot commented Aug 5, 2025

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: clobrano
Once this PR has been reviewed and has the lgtm label, please assign bertinatto for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

func isTwoNodeFencingCheck(clientConfig *rest.Config) (bool, error) {
	configClient, err := clientconfigv1.NewForConfig(clientConfig)
	if err != nil {
		logrus.WithError(err).Error("Error creating config client to check for Single Node configuration")
Contributor

this error message is misleading :D

@@ -52,6 +52,12 @@ func testStableSystemOperatorStateTransitions(events monitorapi.Intervals, clien
		isSingleNode = false
	}

	isTwoNodeFencing, err := isTwoNodeFencingCheck(clientConfig)
	if err != nil {
		logrus.Warnf("Error checking for TwoNodeFencing Node configuration on upgrade (unable to make exception): %v", err)
Contributor

@jaypoulz jaypoulz Aug 5, 2025

I would probably rephrase this:
"Error checking for DualReplica controlPlaneTopology (i.e. Two Node OpenShift with Fencing)"

@@ -156,6 +168,15 @@ func isSingleNodeCheck(clientConfig *rest.Config) (bool, error) {
	return exutil.IsSingleNode(context.Background(), configClient)
}

func isTwoNodeFencingCheck(clientConfig *rest.Config) (bool, error) {
Contributor

Since we're already printing the error here and not doing anything with it when it's bubbled up, I would just return a bool here and just print out the error, then simplify the call site.

Contributor Author

I agree. In this case, however, I intentionally mirrored the signature of isSingleNodeCheck. Would you like to change that function as well, for consistency?

func isSingleNodeCheck(clientConfig *rest.Config) (bool, error) {
	configClient, err := clientconfigv1.NewForConfig(clientConfig)
	if err != nil {
		logrus.WithError(err).Error("Error creating config client to check for Single Node configuration")
		return false, err
	}
	return exutil.IsSingleNode(context.Background(), configClient)
}

Contributor

Oh yeah, that would be great, thank you!
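
The bool-only signature discussed in this thread could look roughly like the sketch below. It is not the code from this PR: `topologyGetter`, `fakeGetter`, `errLookup`, and the local `TopologyMode` type are hypothetical stand-ins so the example runs without a cluster or the real `clientconfigv1` and `exutil` packages, and the standard `log` package stands in for `logrus`. The log message reuses the wording suggested earlier in the review.

```go
package main

import (
	"errors"
	"fmt"
	"log"
)

// TopologyMode is a local stand-in for the cluster's controlPlaneTopology value.
type TopologyMode string

// DualReplica is the topology the review refers to as Two Node OpenShift with Fencing.
const DualReplica TopologyMode = "DualReplica"

// topologyGetter is a hypothetical interface that hides how the topology is fetched.
type topologyGetter interface {
	ControlPlaneTopology() (TopologyMode, error)
}

// isTwoNodeFencing logs a lookup failure and reports false instead of
// returning the error, so callers only have to handle a bool.
func isTwoNodeFencing(g topologyGetter) bool {
	topology, err := g.ControlPlaneTopology()
	if err != nil {
		log.Printf("Error checking for DualReplica controlPlaneTopology (i.e. Two Node OpenShift with Fencing): %v", err)
		return false
	}
	return topology == DualReplica
}

// errLookup is a hypothetical error used to simulate a failed lookup.
var errLookup = errors.New("config client unavailable")

// fakeGetter is a test double that returns fixed values.
type fakeGetter struct {
	topology TopologyMode
	err      error
}

func (f fakeGetter) ControlPlaneTopology() (TopologyMode, error) {
	return f.topology, f.err
}

func main() {
	fmt.Println(isTwoNodeFencing(fakeGetter{topology: DualReplica}))
	fmt.Println(isTwoNodeFencing(fakeGetter{topology: "HighlyAvailable"}))
	fmt.Println(isTwoNodeFencing(fakeGetter{err: errLookup}))
}
```

With this shape the call site reduces to a single boolean check. The trade-off is that a failed lookup is indistinguishable from "not a Two Node Fencing cluster", which matches the "unable to make exception" behaviour of the existing warning.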

@clobrano clobrano marked this pull request as ready for review August 6, 2025 05:23
@openshift-ci openshift-ci bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Aug 6, 2025
@openshift-ci openshift-ci bot requested review from p0lyn0mial and sjenning August 6, 2025 05:24
@clobrano clobrano force-pushed the tnf-e2e-exceptions/allow-less-than-two-etcd-endpoints branch from b55b947 to 236a8ba Compare August 6, 2025 05:35
@clobrano
Contributor Author

clobrano commented Aug 6, 2025

I am not sure this change is actually working

/hold

@openshift-ci openshift-ci bot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Aug 6, 2025
@clobrano clobrano force-pushed the tnf-e2e-exceptions/allow-less-than-two-etcd-endpoints branch from 236a8ba to 058c1cf Compare August 7, 2025 06:48
@clobrano clobrano changed the title TNF: add Two Node Fencing exception to legacycvomonitortests TNF: add Two Node Fencing exception to accept less than two etcd endpoints Aug 7, 2025
…ints

In a Two Node Fencing cluster (DualReplicaTopology), it is acceptable to
have fewer than two etcd endpoints.
@clobrano clobrano force-pushed the tnf-e2e-exceptions/allow-less-than-two-etcd-endpoints branch from 058c1cf to 4894db4 Compare August 7, 2025 07:17
@clobrano
Contributor Author

clobrano commented Aug 7, 2025

The original change wasn't effective: I was matching against the condition Message, but the pattern never appears there. It only seemed to work because the test failure is not 100% reproducible.

I moved the change into the library that looks for duplicated pathological events, and it is activated only when the topology is DualReplica.

Sorry for the initial wasted reviews, but now the PR is ready again 🙇

/unhold

@openshift-ci openshift-ci bot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Aug 7, 2025
Contributor

openshift-ci bot commented Aug 7, 2025

@clobrano: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
| --- | --- | --- | --- | --- |
| ci/prow/e2e-metal-ipi-ovn-dualstack-local-gateway | 4894db4 | link | false | /test e2e-metal-ipi-ovn-dualstack-local-gateway |
| ci/prow/e2e-aws-proxy | 4894db4 | link | false | /test e2e-aws-proxy |
| ci/prow/e2e-aws-disruptive | 4894db4 | link | false | /test e2e-aws-disruptive |
| ci/prow/e2e-metal-ipi-serial-ovn-ipv6-2of2 | 4894db4 | link | false | /test e2e-metal-ipi-serial-ovn-ipv6-2of2 |
| ci/prow/e2e-aws-ovn-single-node-upgrade | 4894db4 | link | false | /test e2e-aws-ovn-single-node-upgrade |
| ci/prow/e2e-aws-ovn | 4894db4 | link | false | /test e2e-aws-ovn |
| ci/prow/e2e-azure | 4894db4 | link | false | /test e2e-azure |
| ci/prow/okd-scos-e2e-aws-ovn | 4894db4 | link | false | /test okd-scos-e2e-aws-ovn |
| ci/prow/e2e-aws-ovn-kube-apiserver-rollout | 4894db4 | link | false | /test e2e-aws-ovn-kube-apiserver-rollout |
| ci/prow/e2e-gcp-ovn-techpreview | 4894db4 | link | false | /test e2e-gcp-ovn-techpreview |
| ci/prow/e2e-gcp-ovn | 4894db4 | link | true | /test e2e-gcp-ovn |
| ci/prow/e2e-hypershift-conformance | 4894db4 | link | false | /test e2e-hypershift-conformance |

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

openshift-trt bot commented Aug 7, 2025

Job Failure Risk Analysis for sha: 4894db4

Job Name Failure Risk
pull-ci-openshift-origin-main-e2e-aws-disruptive High
[sig-node] static pods should start after being created
This test has passed 99.48% of 3246 runs on release 4.20 [Overall] in the last week.
---
[sig-arch] events should not repeat pathologically for ns/openshift-kube-apiserver-operator
This test has passed 99.82% of 3245 runs on release 4.20 [Overall] in the last week.
---
[bz-Etcd] clusteroperator/etcd should not change condition/Available
This test has passed 99.66% of 3246 runs on release 4.20 [Overall] in the last week.

	return &SimplePathologicalEventMatcher{
		name: "EtcdEndpointsConfigMissingDuringTwoNodeTests",
		locatorKeyRegexes: map[monitorapi.LocatorKey]*regexp.Regexp{
			monitorapi.LocatorNamespaceKey: regexp.MustCompile(`^openshift-kube-apiserver-operator$`),
Contributor

Does this apply to openshift-apiserver and oauth-apiserver too?

Contributor Author

Good question. I only observed events from openshift-kube-apiserver-operator. Do you think oauth-apiserver might run into the same error?

		},
		messageReasonRegex: regexp.MustCompile(`^ConfigMissing$`),
		messageHumanRegex:  regexp.MustCompile(`apiServerArguments\.etcd-servers has less than two live etcd endpoints`),
		topology:           &dualReplicaTopology,
Contributor

Is single node filtered out too?

Contributor Author

I think this comes from the API server expecting more than one endpoint during its normal lifecycle; having only one is a degraded state, but not a failure. So I'd say no, this applies only to DualReplica.
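
To make the scoping discussed in these two threads concrete, here is a minimal, self-contained sketch of a topology-gated event matcher. It only imitates the shape of the snippet under review: `event`, `allowedEvent`, `Allows`, `newEtcdEndpointsMatcher`, and `sampleEvent` are hypothetical names, not the API of `SimplePathologicalEventMatcher`, and the sample message suffix is made up. The three regular expressions are copied from the snippet above.

```go
package main

import (
	"fmt"
	"regexp"
)

// event is a hypothetical, flattened stand-in for a monitored cluster event.
type event struct {
	Namespace string
	Reason    string
	Message   string
}

// allowedEvent is a hypothetical matcher that tolerates an event only when
// every regex matches and the cluster topology is the expected one.
type allowedEvent struct {
	name      string
	namespace *regexp.Regexp
	reason    *regexp.Regexp
	message   *regexp.Regexp
	topology  *string // nil means the matcher applies to every topology
}

// Allows reports whether the event is tolerated on a cluster with the given topology.
func (m allowedEvent) Allows(e event, clusterTopology string) bool {
	if m.topology != nil && *m.topology != clusterTopology {
		return false
	}
	return m.namespace.MatchString(e.Namespace) &&
		m.reason.MatchString(e.Reason) &&
		m.message.MatchString(e.Message)
}

// newEtcdEndpointsMatcher builds a matcher restricted to the DualReplica topology.
func newEtcdEndpointsMatcher() allowedEvent {
	dualReplica := "DualReplica"
	return allowedEvent{
		name:      "EtcdEndpointsConfigMissingDuringTwoNodeTests",
		namespace: regexp.MustCompile(`^openshift-kube-apiserver-operator$`),
		reason:    regexp.MustCompile(`^ConfigMissing$`),
		message:   regexp.MustCompile(`apiServerArguments\.etcd-servers has less than two live etcd endpoints`),
		topology:  &dualReplica,
	}
}

// sampleEvent returns an illustrative event; the endpoint list is invented.
func sampleEvent() event {
	return event{
		Namespace: "openshift-kube-apiserver-operator",
		Reason:    "ConfigMissing",
		Message:   "apiServerArguments.etcd-servers has less than two live etcd endpoints: [https://192.0.2.10:2379]",
	}
}

func main() {
	m := newEtcdEndpointsMatcher()
	e := sampleEvent()
	for _, topology := range []string{"DualReplica", "SingleReplica", "HighlyAvailable"} {
		fmt.Printf("%s: %v\n", topology, m.Allows(e, topology))
	}
}
```

In this sketch the same event is tolerated only on a DualReplica cluster, and the anchored namespace regex means events from any other operator namespace are not covered unless the pattern is widened.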

@clobrano clobrano changed the title TNF: add Two Node Fencing exception to accept less than two etcd endpoints [NO-JIRA] TNF: add Two Node Fencing exception to accept less than two etcd endpoints Aug 11, 2025
@clobrano
Contributor Author

/jira refresh

@openshift-ci-robot

@clobrano: No Jira issue is referenced in the title of this pull request.
To reference a jira issue, add 'XYZ-NNN:' to the title of this pull request and request another refresh with /jira refresh.

In response to this:

/jira refresh

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@clobrano clobrano changed the title [NO-JIRA] TNF: add Two Node Fencing exception to accept less than two etcd endpoints NO-JIRA [TNF] add Two Node Fencing exception to accept less than two etcd endpoints Aug 11, 2025
@clobrano
Contributor Author

/jira refresh

@openshift-ci-robot

@clobrano: No Jira issue is referenced in the title of this pull request.
To reference a jira issue, add 'XYZ-NNN:' to the title of this pull request and request another refresh with /jira refresh.

In response to this:

/jira refresh

@clobrano clobrano changed the title NO-JIRA [TNF] add Two Node Fencing exception to accept less than two etcd endpoints [NO-JIRA] [TNF] add Two Node Fencing exception to accept less than two etcd endpoints Aug 11, 2025
@clobrano
Contributor Author

/jira refresh

@openshift-ci-robot

@clobrano: No Jira issue is referenced in the title of this pull request.
To reference a jira issue, add 'XYZ-NNN:' to the title of this pull request and request another refresh with /jira refresh.

In response to this:

/jira refresh

@clobrano clobrano changed the title [NO-JIRA] [TNF] add Two Node Fencing exception to accept less than two etcd endpoints NO-JIRA: [TNF] add Two Node Fencing exception to accept less than two etcd endpoints Aug 11, 2025
@openshift-ci-robot openshift-ci-robot added the jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. label Aug 11, 2025
@openshift-ci-robot

@clobrano: This pull request explicitly references no jira issue.

In response to this:

In a Two Node Fencing cluster is it acceptable to have less than two etcd endpoints.


@clobrano
Contributor Author

/jira refresh

@openshift-ci-robot

@clobrano: This pull request explicitly references no jira issue.

In response to this:

/jira refresh

Labels
jira/valid-reference Indicates that this PR references a valid Jira ticket of any type.
5 participants