Skip to content

Conversation

@alebedev87
Copy link
Contributor

@alebedev87 alebedev87 commented Oct 10, 2024

This PR follows up on a Dynamic Configuration Manager (DCM) gap identified during the smoke test.

When DCM is enabled by default, server management behavior changes. Not all servers will be in the UP state, and this must be accounted for in the router metrics tests. This PR updates the tests to handle scenarios where some servers may be DOWN. The scenarios when DCM is not enabled should still be working.

@openshift-ci-robot
Copy link

openshift-ci-robot commented Oct 10, 2024

@alebedev87: This pull request references NE-1815 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.18.0" version, but no target version was set.

Details

In response to this:

When the Dynamic Configuration Manager (DCM) is enabled by default, server management behavior changes. Not all servers will be in the UP state, and this must be accounted for in the router metrics tests. This commit updates the tests to handle scenarios where some servers may be DOWN.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-ci-robot openshift-ci-robot added the jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. label Oct 10, 2024
@openshift-ci openshift-ci bot requested review from frobware and knobunc October 10, 2024 12:17
@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Oct 10, 2024
@alebedev87 alebedev87 force-pushed the fix-router-metrics-for-dcm branch from e278c37 to fa18d0b Compare October 10, 2024 14:53
@openshift-trt-bot
Copy link

Job Failure Risk Analysis for sha: fa18d0b

Job Name Failure Risk
pull-ci-openshift-origin-master-e2e-metal-ipi-ovn-kube-apiserver-rollout IncompleteTests
Tests for this run (101) are below the historical average (1163): IncompleteTests (not enough tests ran to make a reasonable risk analysis; this could be due to infra, installation, or upgrade problems)
pull-ci-openshift-origin-master-e2e-metal-ipi-ovn-ipv6 IncompleteTests
Tests for this run (14) are below the historical average (2200): IncompleteTests (not enough tests ran to make a reasonable risk analysis; this could be due to infra, installation, or upgrade problems)

@alebedev87 alebedev87 force-pushed the fix-router-metrics-for-dcm branch from fa18d0b to 7bb50cd Compare October 15, 2024 16:00
@openshift-trt-bot
Copy link

Job Failure Risk Analysis for sha: 7bb50cd

Job Name Failure Risk
pull-ci-openshift-origin-master-e2e-aws-ovn-serial High
[sig-api-machinery] disruption/oauth-api connection/new should be available throughout the test
This test has passed 100.00% of 52 runs on jobs ['periodic-ci-openshift-release-master-ci-4.18-e2e-aws-ovn-serial' 'periodic-ci-openshift-release-master-nightly-4.18-e2e-aws-ovn-serial'] in the last 14 days.
---
[sig-api-machinery] disruption/openshift-api connection/new should be available throughout the test
This test has passed 100.00% of 52 runs on jobs ['periodic-ci-openshift-release-master-ci-4.18-e2e-aws-ovn-serial' 'periodic-ci-openshift-release-master-nightly-4.18-e2e-aws-ovn-serial'] in the last 14 days.
---
[sig-api-machinery] disruption/kube-api connection/new should be available throughout the test
This test has passed 100.00% of 52 runs on jobs ['periodic-ci-openshift-release-master-ci-4.18-e2e-aws-ovn-serial' 'periodic-ci-openshift-release-master-nightly-4.18-e2e-aws-ovn-serial'] in the last 14 days.
---
[sig-api-machinery] disruption/cache-openshift-api connection/new should be available throughout the test
This test has passed 100.00% of 52 runs on jobs ['periodic-ci-openshift-release-master-ci-4.18-e2e-aws-ovn-serial' 'periodic-ci-openshift-release-master-nightly-4.18-e2e-aws-ovn-serial'] in the last 14 days.
---
Showing 4 of 7 test results

@alebedev87 alebedev87 force-pushed the fix-router-metrics-for-dcm branch 5 times, most recently from 9057e75 to c3e2a9f Compare October 16, 2024 12:33
@openshift-ci-robot
Copy link

openshift-ci-robot commented Oct 16, 2024

@alebedev87: This pull request references NE-1815 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.18.0" version, but no target version was set.

Details

In response to this:

When the Dynamic Configuration Manager (DCM) is enabled by default, server management behavior changes. Not all servers will be in the UP state, and this must be accounted for in the router metrics tests. This PR updates the tests to handle scenarios where some servers may be DOWN. The scenarios when DCM is not enabled should still be working.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@candita
Copy link
Contributor

candita commented Oct 16, 2024

/assign @gcs278

@openshift-trt-bot
Copy link

Job Failure Risk Analysis for sha: c3e2a9f

Job Name Failure Risk
pull-ci-openshift-origin-master-e2e-aws-ovn-single-node-serial Low
[bz-apiserver-auth] clusteroperator/authentication should not change condition/Available
This test has passed 0.00% of 64 runs on jobs ['periodic-ci-openshift-release-master-nightly-4.18-e2e-aws-ovn-single-node-serial'] in the last 14 days.

@openshift-ci-robot
Copy link

openshift-ci-robot commented Oct 16, 2024

@alebedev87: This pull request references NE-1815 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.18.0" version, but no target version was set.

Details

In response to this:

This PR follows up on the DCM gaps identified during the smoke test.

When the Dynamic Configuration Manager (DCM) is enabled by default, server management behavior changes. Not all servers will be in the UP state, and this must be accounted for in the router metrics tests. This PR updates the tests to handle scenarios where some servers may be DOWN. The scenarios when DCM is not enabled should still be working.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-ci-robot
Copy link

openshift-ci-robot commented Oct 16, 2024

@alebedev87: This pull request references NE-1815 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.18.0" version, but no target version was set.

Details

In response to this:

This PR follows up on a Dynamic Configuration Manager (DCM) gap identified during the smoke test.

When DCM is enabled by default, server management behavior changes. Not all servers will be in the UP state, and this must be accounted for in the router metrics tests. This PR updates the tests to handle scenarios where some servers may be DOWN. The scenarios when DCM is not enabled should still be working.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

Copy link
Contributor

@gcs278 gcs278 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice PR, just have some concern that we might be losing test precision, and suggested an idea. Feel free to let me know if you disagree.

if findGaugesWithLabels(metrics["haproxy_backend_connections_total"], routeLabels)[0] >= float64(times) {
return true, nil
if len(findNonZeroGaugesWithLabels(metrics["haproxy_server_up"], serverLabels)) == 2 {
for _, g := range findNonZeroGaugesWithLabels(metrics["haproxy_backend_connections_total"], routeLabels) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does it make sense to change this to findNonZeroGaugesWithLabels? Aren't these for backends, not servers? I don't think this metric will contain "dynamic servers", since it's for backends.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

findNonZeroGaugesWithLabels is generic enough to filter out any zero value gauges. But I agree, let's keep it for the server metrics only as its description states.

return true, nil
if len(findNonZeroGaugesWithLabels(metrics["haproxy_server_up"], serverLabels)) == 2 {
for _, g := range findNonZeroGaugesWithLabels(metrics["haproxy_backend_connections_total"], routeLabels) {
// stop retrying if any route got expected number of connections.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Aren't we looking for a specific route namespace/name label combination? Is it safe to assume that will only return 1 unique route?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes and yes. The word "any" is wrong here. The for loop's goal was to cover the case when there is no non-zero gauge so that the index operation which was there before ([0]) would not panic. Let me change this to something like the below so that it doesn't look confusing:

if g := findGaugesWithLabels(metrics["haproxy_backend_connections_total"]; len(g) > 0 {
    if g[0] >= float64(times) {
        return true, nil
    }
}

// When DCM is enabled, not all server slots are always in the UP state,
// unlike the behavior without DCM where all slots are UP by default.
// As a result, we may want to filter out servers where the gauge is set to 0 (== DOWN).
func findNonZeroGaugesWithLabels(f *dto.MetricFamily, promLabels map[string]string) []float64 {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would it be more precise to filter out dynamic servers based on their name? I know we have functions to include labels, so you would need a function to exclude labels. The names of the servers are _dynamic-pod-# (2 right now). Could be a regex, or a contains type of thing.

Do you think that filtering out zero value gauges here is "weakening" the precision of the test, and opening possibilities for metrics bugs to slip through this E2E test? What happens if haproxy is writing a bunch of bogus metrics for servers, but they have zero value? Would that slip through?

Copy link
Contributor Author

@alebedev87 alebedev87 Oct 29, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would it be more precise to filter out dynamic servers based on their name? I know we have functions to include labels, so you would need a function to exclude labels. The names of the servers are _dynamic-pod-# (2 right now). Could be a regex, or a contains type of thing.

That was my first idea but the thing is that DCM uses all servers as slots - static (added via the config parsing) and dynamic. That is, static servers may also be disabled resulting into a zero metric value.

Do you think that filtering out zero value gauges here is "weakening" the precision of the test, and opening possibilities for metrics bugs to slip through this E2E test?

Yes it is weakening the precision. With the introduction of DCM a zero value of a metric may mean: no value (disabled server) or zero value. The approach is sub optimal but straight forward. So far, most of the tests (maybe even all) expect non zero metric values so filtering zero ones help us to remove disabled servers only.

The optimal approach is to remove metrics for disabled servers dynamically in the router. We will have to go this way in the future. So far, I didn't have to modify the router for the DCM use case and I would like to stay this way. But if the DCM will spawn into the next release, I'll come back to this one.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Upd: added a TODO comment.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds like a reasonable future approach. 👍

When the Dynamic Configuration Manager (DCM) is enabled by default,
server management behavior changes. Not all servers will be in the UP state,
and this must be accounted for in the router metrics tests.
This commit updates the tests to handle scenarios where some servers may be DOWN.
@alebedev87 alebedev87 force-pushed the fix-router-metrics-for-dcm branch from c3e2a9f to c90f8c1 Compare October 29, 2024 16:18
@alebedev87
Copy link
Contributor Author

/retest

@openshift-ci
Copy link
Contributor

openshift-ci bot commented Oct 30, 2024

@alebedev87: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
ci/prow/e2e-aws-ovn-ipsec-serial fa18d0b link false /test e2e-aws-ovn-ipsec-serial
ci/prow/e2e-aws-ovn-kube-apiserver-rollout c90f8c1 link false /test e2e-aws-ovn-kube-apiserver-rollout
ci/prow/e2e-aws-ovn-single-node-upgrade c90f8c1 link false /test e2e-aws-ovn-single-node-upgrade

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@gcs278
Copy link
Contributor

gcs278 commented Oct 31, 2024

/lgtm

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Oct 31, 2024
@openshift-ci
Copy link
Contributor

openshift-ci bot commented Oct 31, 2024

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: alebedev87, gcs278

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci-robot
Copy link

/retest-required

Remaining retests: 0 against base HEAD 6ce9801 and 2 for PR HEAD c90f8c1 in total

@openshift-ci-robot
Copy link

/retest-required

Remaining retests: 0 against base HEAD 5e81bc2 and 1 for PR HEAD c90f8c1 in total

@openshift-trt-bot
Copy link

Job Failure Risk Analysis for sha: c90f8c1

Job Name Failure Risk
pull-ci-openshift-origin-master-e2e-aws-ovn-kube-apiserver-rollout Low
[Conformance][Suite:openshift/kube-apiserver/rollout][Jira:"kube-apiserver"][sig-kube-apiserver] kube-apiserver should roll out new revisions without disruption [apigroup:config.openshift.io][apigroup:operator.openshift.io]
This test has passed 76.92% of 13 runs on jobs ['periodic-ci-openshift-release-master-nightly-4.18-e2e-aws-ovn-kube-apiserver-rollout'] in the last 14 days.

@openshift-merge-bot openshift-merge-bot bot merged commit 8102da8 into openshift:master Nov 1, 2024
@openshift-bot
Copy link
Contributor

[ART PR BUILD NOTIFIER]

Distgit: openshift-enterprise-tests
This PR has been included in build openshift-enterprise-tests-container-v4.18.0-202411010937.p0.g8102da8.assembly.stream.el9.
All builds following this will include this PR.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files. jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. lgtm Indicates that a PR is ready to be merged.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants