OCPBUGS-64729: Update etcd alerts to match observed real world data #1511
dgoodwin wants to merge 3 commits into openshift:main
Walkthrough
etcd alert rules were changed: the commit-duration alert expr threshold was lowered (0.5 → 0.08) and a new critical commit alert (>0.10) was added; the fsync alerts were replaced by new warning and critical rules (0.05 and 0.07) and the old fsync alerts were removed; the jsonnet config and dependency lock were updated; main.jsonnet excludes an alert name.
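The commit-duration pair described in the walkthrough might look roughly like the Prometheus rule-file fragment below. This is a sketch under stated assumptions, not the manifest from the PR: the thresholds (0.08 and 0.10, in seconds) and the etcdHighCommitDurations alert name come from this conversation, while the group name, the job selector, the 5m rate window, and the for durations are assumptions made for illustration. The metric etcd_disk_backend_commit_duration_seconds_bucket is etcd's backend commit latency histogram.

```yaml
groups:
  - name: etcd-commit-sketch            # hypothetical group name
    rules:
      # Warning tier: 99th percentile backend commit latency above 80ms.
      - alert: etcdHighCommitDurations
        expr: |
          histogram_quantile(0.99, rate(etcd_disk_backend_commit_duration_seconds_bucket{job=~".*etcd.*"}[5m]))
          > 0.08
        for: 10m                        # assumed duration
        labels:
          severity: warning
        annotations:
          summary: "etcd cluster 99th percentile commit durations are too high."
      # Critical tier: same expression, slightly higher threshold (100ms).
      - alert: etcdHighCommitDurations
        expr: |
          histogram_quantile(0.99, rate(etcd_disk_backend_commit_duration_seconds_bucket{job=~".*etcd.*"}[5m]))
          > 0.10
        for: 10m                        # assumed duration
        labels:
          severity: critical
        annotations:
          summary: "etcd cluster 99th percentile commit durations are too high."
```

Reusing one alert name with two severity labels matches how the review below refers to "both warning/critical etcdHighCommitDurations".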
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~25 minutes
@dgoodwin: This pull request references Jira Issue OCPBUGS-64729, which is invalid. The bug has been updated to refer to the pull request using the external bug tracker.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.
/jira refresh
@dgoodwin: This pull request references Jira Issue OCPBUGS-64729, which is valid. The bug has been moved to the POST state. 3 validation(s) were run on this bug
      > 0.5
    for: 10m
    labels:
      severity: warning
IIUC we're removing the warning severity for etcdHighFsyncDurations. Is there another rule that can notify platform admins before the critical alert fires?
This one is a little tough because we don't actually know when a cluster falls over, only that the upstream recommendations are optimistic and that we have thousands of clusters running much higher. I'm estimating how much chaos we're willing to cause to bring these back down to sensible levels, using the 5% alerting rate. I could trim some off the recommendations here and call that a warning threshold, but then we might be over our 5% fleet rate.
While we don't provide stability guarantees for alerting rules, I presume that some cluster admins will be puzzled by the removal of the warning severity. As stated in https://github.com/openshift/enhancements/blob/master/enhancements/monitoring/alerting-consistency.md#warning-alerts, warning alerts don't require immediate action, but they help identify potential issues. We could use a higher "for" value to keep the alerting rule from triggering too often.
cc @typeid
OK, how about I use the limits currently in the PR for warning, and add a critical level a little higher?
Updated with an attempt at warning plus critical levels.
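The warning-plus-critical split for fsync latency agreed on in this thread could be sketched as the rule-file fragment below. The thresholds (0.05 and 0.07, in seconds) come from the walkthrough summary; everything else is an assumption for illustration, including the group name, the reuse of the etcdHighFsyncDurations alert name (the conversation does not show the names of the new rules), the job selector, the rate window, and the for durations. The metric etcd_disk_wal_fsync_duration_seconds_bucket is etcd's WAL fsync latency histogram.

```yaml
groups:
  - name: etcd-fsync-sketch             # hypothetical group name
    rules:
      # Warning tier: early notice, no immediate action required.
      - alert: etcdHighFsyncDurations   # assumed name for the new rule
        expr: |
          histogram_quantile(0.99, rate(etcd_disk_wal_fsync_duration_seconds_bucket{job=~".*etcd.*"}[5m]))
          > 0.05
        for: 10m                        # assumed; a higher value makes the warning fire less often
        labels:
          severity: warning
      # Critical tier: set a little above the warning threshold.
      - alert: etcdHighFsyncDurations   # assumed name for the new rule
        expr: |
          histogram_quantile(0.99, rate(etcd_disk_wal_fsync_duration_seconds_bucket{job=~".*etcd.*"}[5m]))
          > 0.07
        for: 10m                        # assumed duration
        labels:
          severity: critical
```

Keeping the warning tier gives platform admins a signal before the critical rule fires, which is the concern raised earlier in the thread.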
Actionable comments posted: 0
🧹 Nitpick comments (1)
jsonnet/custom.libsonnet (1)
63-85: Add the runbook URL in the Jsonnet source as well
The generated manifest now exposes runbook_url for both warning/critical etcdHighCommitDurations, but the Jsonnet definition still omits it. Any other consumers rendering from custom.libsonnet will miss the runbook link, leading to divergence between bundles. Please add the same runbook_url entry to both alert blocks so downstream renders stay in sync.
       annotations: {
         description: 'etcd cluster "{{ $labels.job }}": 99th percentile commit durations {{ $value }}s on etcd instance {{ $labels.instance }}.',
         summary: 'etcd cluster 99th percentile commit durations are too high.',
+        runbook_url: 'https://github.com/openshift/runbooks/blob/master/alerts/cluster-etcd-operator/etcdHighCommitDurations.md'
       },
     },
@@
       annotations: {
         description: 'etcd cluster "{{ $labels.job }}": 99th percentile commit durations {{ $value }}s on etcd instance {{ $labels.instance }}.',
         summary: 'etcd cluster 99th percentile commit durations are too high.',
+        runbook_url: 'https://github.com/openshift/runbooks/blob/master/alerts/cluster-etcd-operator/etcdHighCommitDurations.md'
       },
📜 Review details
📒 Files selected for processing (3)
- jsonnet/custom.libsonnet (2 hunks)
- jsonnet/jsonnetfile.lock.json (1 hunk)
- manifests/0000_90_etcd-operator_03_prometheusrule.yaml (1 hunk)
🚧 Files skipped from review as they are similar to previous changes (1)
- jsonnet/jsonnetfile.lock.json
@hasbro17 / @dgoodwin / @simonpasquier I've created an upstream PR for this and all the other stuff we have accumulated over the years in:
/lgtm
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: dgoodwin, hasbro17
/jira refresh
The requirements for Jira bugs have changed (Jira issues linked to PRs on main branch need to target different OCP), recalculating validity.
@openshift-bot: This pull request references Jira Issue OCPBUGS-64729, which is invalid:
/label acknowledge-critical-fixes-only
Actually saw a relevant failure on single node: https://prow.ci.openshift.org/view/gs/test-platform-results/pr-logs/pull/openshift_cluster-etcd-operator/1511/pull-ci-openshift-cluster-etcd-operator-main-e2e-aws-ovn-single-node/1996107709992144896
/retest
/jira refresh
/retest
@oarribas: This pull request references Jira Issue OCPBUGS-64729, which is invalid:
/jira refresh
@oarribas: This pull request references Jira Issue OCPBUGS-64729, which is valid. 3 validation(s) were run on this bug
The SNO test failed again. Not sure if it is fully related to this change.
/retest
It is. The new thresholds are too sensitive for single node CI jobs, and it is unclear whether they would also fire all the time in SNO clusters. I think this needs a change in origin to be more lenient with this alert, specifically for SNO, before this can merge. I just have not had time to get to it. If the etcd team could take over, that would be ideal.
@dgoodwin: The following test failed, say
Full PR test history. Your PR dashboard.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.
Builds on #1495. The justification is in this comment: https://issues.redhat.com/browse/OCPBUGS-64729?focusedId=28407511&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-28407511