-
Notifications
You must be signed in to change notification settings - Fork 4.8k
NE-1815: Adapt router metrics tests for DCM support #29180
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
NE-1815: Adapt router metrics tests for DCM support #29180
Conversation
|
@alebedev87: This pull request references NE-1815 which is a valid jira issue. Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.18.0" version, but no target version was set. DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
e278c37 to
fa18d0b
Compare
|
Job Failure Risk Analysis for sha: fa18d0b
|
fa18d0b to
7bb50cd
Compare
|
Job Failure Risk Analysis for sha: 7bb50cd
|
9057e75 to
c3e2a9f
Compare
|
@alebedev87: This pull request references NE-1815 which is a valid jira issue. Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.18.0" version, but no target version was set. DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
|
/assign @gcs278 |
|
Job Failure Risk Analysis for sha: c3e2a9f
|
|
@alebedev87: This pull request references NE-1815 which is a valid jira issue. Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.18.0" version, but no target version was set. DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
|
@alebedev87: This pull request references NE-1815 which is a valid jira issue. Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.18.0" version, but no target version was set. DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
gcs278
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice PR, just have some concern that we might be losing test precision, and suggested an idea. Feel free to let me know if you disagree.
test/extended/router/metrics.go
Outdated
| if findGaugesWithLabels(metrics["haproxy_backend_connections_total"], routeLabels)[0] >= float64(times) { | ||
| return true, nil | ||
| if len(findNonZeroGaugesWithLabels(metrics["haproxy_server_up"], serverLabels)) == 2 { | ||
| for _, g := range findNonZeroGaugesWithLabels(metrics["haproxy_backend_connections_total"], routeLabels) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does it make sense to change this to findNonZeroGaugesWithLabels? Aren't these for backends, not servers? I don't think this metric will contain "dynamic servers", since it's for backends.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
findNonZeroGaugesWithLabels is generic enough to filter out any zero value gauges. But I agree, let's keep it for the server metrics only as its description states.
test/extended/router/metrics.go
Outdated
| return true, nil | ||
| if len(findNonZeroGaugesWithLabels(metrics["haproxy_server_up"], serverLabels)) == 2 { | ||
| for _, g := range findNonZeroGaugesWithLabels(metrics["haproxy_backend_connections_total"], routeLabels) { | ||
| // stop retrying if any route got expected number of connections. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Aren't we looking for a specific route namespace/name label combination? Is it safe to assume that will only return 1 unique route?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes and yes. The word "any" is wrong here. The for loop's goal was to cover the case when there is no non-zero gauge so that the index operation which was there before ([0]) would not panic. Let me change this to something like the below so that it doesn't look confusing:
if g := findGaugesWithLabels(metrics["haproxy_backend_connections_total"]; len(g) > 0 {
if g[0] >= float64(times) {
return true, nil
}
}| // When DCM is enabled, not all server slots are always in the UP state, | ||
| // unlike the behavior without DCM where all slots are UP by default. | ||
| // As a result, we may want to filter out servers where the gauge is set to 0 (== DOWN). | ||
| func findNonZeroGaugesWithLabels(f *dto.MetricFamily, promLabels map[string]string) []float64 { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would it be more precise to filter out dynamic servers based on their name? I know we have functions to include labels, so you would need a function to exclude labels. The names of the servers are _dynamic-pod-# (2 right now). Could be a regex, or a contains type of thing.
Do you think that filtering out zero value gauges here is "weakening" the precision of the test, and opening possibilities for metrics bugs to slip through this E2E test? What happens if haproxy is writing a bunch of bogus metrics for servers, but they have zero value? Would that slip through?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would it be more precise to filter out dynamic servers based on their name? I know we have functions to include labels, so you would need a function to exclude labels. The names of the servers are _dynamic-pod-# (2 right now). Could be a regex, or a contains type of thing.
That was my first idea but the thing is that DCM uses all servers as slots - static (added via the config parsing) and dynamic. That is, static servers may also be disabled resulting into a zero metric value.
Do you think that filtering out zero value gauges here is "weakening" the precision of the test, and opening possibilities for metrics bugs to slip through this E2E test?
Yes it is weakening the precision. With the introduction of DCM a zero value of a metric may mean: no value (disabled server) or zero value. The approach is sub optimal but straight forward. So far, most of the tests (maybe even all) expect non zero metric values so filtering zero ones help us to remove disabled servers only.
The optimal approach is to remove metrics for disabled servers dynamically in the router. We will have to go this way in the future. So far, I didn't have to modify the router for the DCM use case and I would like to stay this way. But if the DCM will spawn into the next release, I'll come back to this one.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Upd: added a TODO comment.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sounds like a reasonable future approach. 👍
When the Dynamic Configuration Manager (DCM) is enabled by default, server management behavior changes. Not all servers will be in the UP state, and this must be accounted for in the router metrics tests. This commit updates the tests to handle scenarios where some servers may be DOWN.
c3e2a9f to
c90f8c1
Compare
|
/retest |
|
@alebedev87: The following tests failed, say
Full PR test history. Your PR dashboard. DetailsInstructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here. |
|
/lgtm |
|
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: alebedev87, gcs278 The full list of commands accepted by this bot can be found here. The pull request process is described here DetailsNeeds approval from an approver in each of these files:
Approvers can indicate their approval by writing |
|
Job Failure Risk Analysis for sha: c90f8c1
|
|
[ART PR BUILD NOTIFIER] Distgit: openshift-enterprise-tests |
This PR follows up on a Dynamic Configuration Manager (DCM) gap identified during the smoke test.
When DCM is enabled by default, server management behavior changes. Not all servers will be in the UP state, and this must be accounted for in the router metrics tests. This PR updates the tests to handle scenarios where some servers may be DOWN. The scenarios when DCM is not enabled should still be working.