NE-1815: Adapt router metrics tests for DCM support #29180

alebedev87 · 2024-10-10T12:14:15Z

This PR follows up on a Dynamic Configuration Manager (DCM) gap identified during the smoke test.

When DCM is enabled by default, server management behavior changes. Not all servers will be in the UP state, and this must be accounted for in the router metrics tests. This PR updates the tests to handle scenarios where some servers may be DOWN. The scenarios when DCM is not enabled should still be working.

openshift-ci-robot · 2024-10-10T12:14:18Z

@alebedev87: This pull request references NE-1815 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.18.0" version, but no target version was set.

Details

In response to this:

When the Dynamic Configuration Manager (DCM) is enabled by default, server management behavior changes. Not all servers will be in the UP state, and this must be accounted for in the router metrics tests. This commit updates the tests to handle scenarios where some servers may be DOWN.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

openshift-trt-bot · 2024-10-10T20:20:56Z

Job Failure Risk Analysis for sha: fa18d0b

Job Name	Failure Risk
pull-ci-openshift-origin-master-e2e-metal-ipi-ovn-kube-apiserver-rollout	IncompleteTests Tests for this run (101) are below the historical average (1163): IncompleteTests (not enough tests ran to make a reasonable risk analysis; this could be due to infra, installation, or upgrade problems)
pull-ci-openshift-origin-master-e2e-metal-ipi-ovn-ipv6	IncompleteTests Tests for this run (14) are below the historical average (2200): IncompleteTests (not enough tests ran to make a reasonable risk analysis; this could be due to infra, installation, or upgrade problems)

openshift-trt-bot · 2024-10-15T19:47:20Z

Job Failure Risk Analysis for sha: 7bb50cd

Job Name	Failure Risk
pull-ci-openshift-origin-master-e2e-aws-ovn-serial	High [sig-api-machinery] disruption/oauth-api connection/new should be available throughout the test This test has passed 100.00% of 52 runs on jobs ['periodic-ci-openshift-release-master-ci-4.18-e2e-aws-ovn-serial' 'periodic-ci-openshift-release-master-nightly-4.18-e2e-aws-ovn-serial'] in the last 14 days. --- [sig-api-machinery] disruption/openshift-api connection/new should be available throughout the test This test has passed 100.00% of 52 runs on jobs ['periodic-ci-openshift-release-master-ci-4.18-e2e-aws-ovn-serial' 'periodic-ci-openshift-release-master-nightly-4.18-e2e-aws-ovn-serial'] in the last 14 days. --- [sig-api-machinery] disruption/kube-api connection/new should be available throughout the test This test has passed 100.00% of 52 runs on jobs ['periodic-ci-openshift-release-master-ci-4.18-e2e-aws-ovn-serial' 'periodic-ci-openshift-release-master-nightly-4.18-e2e-aws-ovn-serial'] in the last 14 days. --- [sig-api-machinery] disruption/cache-openshift-api connection/new should be available throughout the test This test has passed 100.00% of 52 runs on jobs ['periodic-ci-openshift-release-master-ci-4.18-e2e-aws-ovn-serial' 'periodic-ci-openshift-release-master-nightly-4.18-e2e-aws-ovn-serial'] in the last 14 days. --- Showing 4 of 7 test results

openshift-ci-robot · 2024-10-16T12:37:16Z

@alebedev87: This pull request references NE-1815 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.18.0" version, but no target version was set.

Details

In response to this:

When the Dynamic Configuration Manager (DCM) is enabled by default, server management behavior changes. Not all servers will be in the UP state, and this must be accounted for in the router metrics tests. This PR updates the tests to handle scenarios where some servers may be DOWN. The scenarios when DCM is not enabled should still be working.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

candita · 2024-10-16T15:38:27Z

/assign @gcs278

openshift-trt-bot · 2024-10-16T16:19:06Z

Job Failure Risk Analysis for sha: c3e2a9f

Job Name	Failure Risk
pull-ci-openshift-origin-master-e2e-aws-ovn-single-node-serial	Low [bz-apiserver-auth] clusteroperator/authentication should not change condition/Available This test has passed 0.00% of 64 runs on jobs ['periodic-ci-openshift-release-master-nightly-4.18-e2e-aws-ovn-single-node-serial'] in the last 14 days.

openshift-ci-robot · 2024-10-16T19:01:03Z

@alebedev87: This pull request references NE-1815 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.18.0" version, but no target version was set.

Details

In response to this:

This PR follows up on the DCM gaps identified during the smoke test.

When the Dynamic Configuration Manager (DCM) is enabled by default, server management behavior changes. Not all servers will be in the UP state, and this must be accounted for in the router metrics tests. This PR updates the tests to handle scenarios where some servers may be DOWN. The scenarios when DCM is not enabled should still be working.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

openshift-ci-robot · 2024-10-16T19:01:44Z

@alebedev87: This pull request references NE-1815 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.18.0" version, but no target version was set.

Details

In response to this:

This PR follows up on a Dynamic Configuration Manager (DCM) gap identified during the smoke test.

When DCM is enabled by default, server management behavior changes. Not all servers will be in the UP state, and this must be accounted for in the router metrics tests. This PR updates the tests to handle scenarios where some servers may be DOWN. The scenarios when DCM is not enabled should still be working.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

gcs278

Nice PR, just have some concern that we might be losing test precision, and suggested an idea. Feel free to let me know if you disagree.

gcs278 · 2024-10-25T22:55:09Z

test/extended/router/metrics.go

-					if findGaugesWithLabels(metrics["haproxy_backend_connections_total"], routeLabels)[0] >= float64(times) {
-						return true, nil
+				if len(findNonZeroGaugesWithLabels(metrics["haproxy_server_up"], serverLabels)) == 2 {
+					for _, g := range findNonZeroGaugesWithLabels(metrics["haproxy_backend_connections_total"], routeLabels) {


Does it make sense to change this to findNonZeroGaugesWithLabels? Aren't these for backends, not servers? I don't think this metric will contain "dynamic servers", since it's for backends.

findNonZeroGaugesWithLabels is generic enough to filter out any zero value gauges. But I agree, let's keep it for the server metrics only as its description states.

gcs278 · 2024-10-25T23:03:46Z

test/extended/router/metrics.go

-						return true, nil
+				if len(findNonZeroGaugesWithLabels(metrics["haproxy_server_up"], serverLabels)) == 2 {
+					for _, g := range findNonZeroGaugesWithLabels(metrics["haproxy_backend_connections_total"], routeLabels) {
+						// stop retrying if any route got expected number of connections.


Aren't we looking for a specific route namespace/name label combination? Is it safe to assume that will only return 1 unique route?

Yes and yes. The word "any" is wrong here. The for loop's goal was to cover the case when there is no non-zero gauge so that the index operation which was there before ([0]) would not panic. Let me change this to something like the below so that it doesn't look confusing:

if g := findGaugesWithLabels(metrics["haproxy_backend_connections_total"]; len(g) > 0 { if g[0] >= float64(times) { return true, nil } }

gcs278 · 2024-10-25T23:14:36Z

test/extended/router/metrics.go

+// When DCM is enabled, not all server slots are always in the UP state,
+// unlike the behavior without DCM where all slots are UP by default.
+// As a result, we may want to filter out servers where the gauge is set to 0 (== DOWN).
+func findNonZeroGaugesWithLabels(f *dto.MetricFamily, promLabels map[string]string) []float64 {


Would it be more precise to filter out dynamic servers based on their name? I know we have functions to include labels, so you would need a function to exclude labels. The names of the servers are _dynamic-pod-# (2 right now). Could be a regex, or a contains type of thing.

Do you think that filtering out zero value gauges here is "weakening" the precision of the test, and opening possibilities for metrics bugs to slip through this E2E test? What happens if haproxy is writing a bunch of bogus metrics for servers, but they have zero value? Would that slip through?

Would it be more precise to filter out dynamic servers based on their name? I know we have functions to include labels, so you would need a function to exclude labels. The names of the servers are _dynamic-pod-# (2 right now). Could be a regex, or a contains type of thing.

That was my first idea but the thing is that DCM uses all servers as slots - static (added via the config parsing) and dynamic. That is, static servers may also be disabled resulting into a zero metric value.

Do you think that filtering out zero value gauges here is "weakening" the precision of the test, and opening possibilities for metrics bugs to slip through this E2E test?

Yes it is weakening the precision. With the introduction of DCM a zero value of a metric may mean: no value (disabled server) or zero value. The approach is sub optimal but straight forward. So far, most of the tests (maybe even all) expect non zero metric values so filtering zero ones help us to remove disabled servers only.

The optimal approach is to remove metrics for disabled servers dynamically in the router. We will have to go this way in the future. So far, I didn't have to modify the router for the DCM use case and I would like to stay this way. But if the DCM will spawn into the next release, I'll come back to this one.

Upd: added a TODO comment.

Sounds like a reasonable future approach. 👍

When the Dynamic Configuration Manager (DCM) is enabled by default, server management behavior changes. Not all servers will be in the UP state, and this must be accounted for in the router metrics tests. This commit updates the tests to handle scenarios where some servers may be DOWN.

alebedev87 · 2024-10-30T09:30:11Z

/retest

openshift-ci · 2024-10-30T13:15:25Z

@alebedev87: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name	Commit	Details	Required	Rerun command
ci/prow/e2e-aws-ovn-ipsec-serial	`fa18d0b`	link	false	`/test e2e-aws-ovn-ipsec-serial`
ci/prow/e2e-aws-ovn-kube-apiserver-rollout	`c90f8c1`	link	false	`/test e2e-aws-ovn-kube-apiserver-rollout`
ci/prow/e2e-aws-ovn-single-node-upgrade	`c90f8c1`	link	false	`/test e2e-aws-ovn-single-node-upgrade`

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

gcs278 · 2024-10-31T15:25:12Z

/lgtm

openshift-ci · 2024-10-31T15:25:44Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: alebedev87, gcs278

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details

Needs approval from an approver in each of these files:

~~test/extended/router/OWNERS~~ [alebedev87,gcs278]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

openshift-ci-robot · 2024-10-31T18:37:59Z

/retest-required

Remaining retests: 0 against base HEAD 6ce9801 and 2 for PR HEAD c90f8c1 in total

openshift-ci-robot · 2024-10-31T20:40:49Z

/retest-required

Remaining retests: 0 against base HEAD 5e81bc2 and 1 for PR HEAD c90f8c1 in total

openshift-trt-bot · 2024-11-01T03:52:08Z

Job Failure Risk Analysis for sha: c90f8c1

Job Name	Failure Risk
pull-ci-openshift-origin-master-e2e-aws-ovn-kube-apiserver-rollout	Low [Conformance][Suite:openshift/kube-apiserver/rollout][Jira:"kube-apiserver"][sig-kube-apiserver] kube-apiserver should roll out new revisions without disruption [apigroup:config.openshift.io][apigroup:operator.openshift.io] This test has passed 76.92% of 13 runs on jobs ['periodic-ci-openshift-release-master-nightly-4.18-e2e-aws-ovn-kube-apiserver-rollout'] in the last 14 days.

openshift-bot · 2024-11-01T12:32:55Z

[ART PR BUILD NOTIFIER]

Distgit: openshift-enterprise-tests
This PR has been included in build openshift-enterprise-tests-container-v4.18.0-202411010937.p0.g8102da8.assembly.stream.el9.
All builds following this will include this PR.

openshift-ci-robot added the jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. label Oct 10, 2024

openshift-ci bot requested review from frobware and knobunc October 10, 2024 12:17

openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Oct 10, 2024

alebedev87 force-pushed the fix-router-metrics-for-dcm branch from e278c37 to fa18d0b Compare October 10, 2024 14:53

alebedev87 force-pushed the fix-router-metrics-for-dcm branch from fa18d0b to 7bb50cd Compare October 15, 2024 16:00

alebedev87 force-pushed the fix-router-metrics-for-dcm branch 5 times, most recently from 9057e75 to c3e2a9f Compare October 16, 2024 12:33

openshift-ci bot assigned gcs278 Oct 16, 2024

gcs278 reviewed Oct 25, 2024

View reviewed changes

alebedev87 mentioned this pull request Oct 28, 2024

NE-1790: Enable Dynamic Configuration Manager feature gate openshift/cluster-ingress-operator#1159

Merged

alebedev87 force-pushed the fix-router-metrics-for-dcm branch from c3e2a9f to c90f8c1 Compare October 29, 2024 16:18

openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Oct 31, 2024

openshift-merge-bot bot merged commit 8102da8 into openshift:master Nov 1, 2024

NE-1815: Adapt router metrics tests for DCM support #29180

NE-1815: Adapt router metrics tests for DCM support #29180

Uh oh!

Conversation

alebedev87 commented Oct 10, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

openshift-ci-robot commented Oct 10, 2024 • edited by openshift-ci bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

openshift-trt-bot commented Oct 10, 2024

Uh oh!

openshift-trt-bot commented Oct 15, 2024

Uh oh!

openshift-ci-robot commented Oct 16, 2024 • edited by openshift-ci bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

candita commented Oct 16, 2024

Uh oh!

openshift-trt-bot commented Oct 16, 2024

Uh oh!

openshift-ci-robot commented Oct 16, 2024 • edited by openshift-ci bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

openshift-ci-robot commented Oct 16, 2024 • edited by openshift-ci bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

gcs278 left a comment

Choose a reason for hiding this comment

Uh oh!

gcs278 Oct 25, 2024

Choose a reason for hiding this comment

Uh oh!

alebedev87 Oct 29, 2024

Choose a reason for hiding this comment

Uh oh!

gcs278 Oct 25, 2024

Choose a reason for hiding this comment

Uh oh!

alebedev87 Oct 29, 2024

Choose a reason for hiding this comment

Uh oh!

gcs278 Oct 25, 2024

Choose a reason for hiding this comment

Uh oh!

alebedev87 Oct 29, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

alebedev87 Oct 29, 2024

Choose a reason for hiding this comment

Uh oh!

gcs278 Oct 31, 2024

Choose a reason for hiding this comment

Uh oh!

alebedev87 commented Oct 30, 2024

Uh oh!

openshift-ci bot commented Oct 30, 2024

Uh oh!

gcs278 commented Oct 31, 2024

Uh oh!

openshift-ci bot commented Oct 31, 2024

Uh oh!

openshift-ci-robot commented Oct 31, 2024

Uh oh!

openshift-ci-robot commented Oct 31, 2024

Uh oh!

openshift-trt-bot commented Nov 1, 2024

Uh oh!

openshift-bot commented Nov 1, 2024

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

6 participants

alebedev87 commented Oct 10, 2024 •

edited

Loading

openshift-ci-robot commented Oct 10, 2024 •

edited by openshift-ci bot

Loading

openshift-ci-robot commented Oct 16, 2024 •

edited by openshift-ci bot

Loading

openshift-ci-robot commented Oct 16, 2024 •

edited by openshift-ci bot

Loading

openshift-ci-robot commented Oct 16, 2024 •

edited by openshift-ci bot

Loading

alebedev87 Oct 29, 2024 •

edited

Loading