OCPBUGS-66152: split grpc requests slow into read/write alerts #1532
openshift-merge-bot[bot] merged 1 commit into openshift:main
Signed-off-by: Thomas Jungblut <tjungblu@redhat.com>
@tjungblu: This pull request references Jira Issue OCPBUGS-66152, which is invalid:
The bug has been updated to refer to the pull request using the external bug tracker.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.
Walkthrough
The pull request refactors etcd monitoring by splitting a single complex gRPC request latency alert into two separate alerts targeting read and write operations with distinct thresholds and evaluation windows, and updates dependency versions in the lockfile.
Estimated code review effort: 🎯 2 (Simple) | ⏱️ ~10 minutes
Actionable comments posted: 2
🤖 Fix all issues with AI agents
In `@jsonnet/custom.libsonnet`:
- Around line 56-64: The etcdGRPCWriteRequestsSlow alert currently only filters
grpc_method="Txn" so Put and DeleteRange write operations are missed; update the
alert expression (the expr for alert: etcdGRPCWriteRequestsSlow) to include all
write methods by changing the label filter to match multiple methods (e.g.,
grpc_method=~"Txn|Put|DeleteRange") or an equivalent set-based match, and update
the annotations.description text to reflect that Put and DeleteRange are
included so the summary remains accurate.
In `@manifests/0000_90_etcd-operator_03_prometheusrule.yaml`:
- Around line 121-127: The alert etcdGRPCWriteRequestsSlow only filters
grpc_method="Txn" so slow Put/DeleteRange write paths are missed; update the
expr in the PrometheusRule for alert etcdGRPCWriteRequestsSlow to include all
write methods (Txn, Put, DeleteRange) — e.g. change the grpc_method filter to
match those three methods (or use a regex like =~ "Txn|Put|DeleteRange") in the
grpc_server_handling_seconds_bucket selector so the histogram_quantile evaluates
write latency across all write operations.
  alert: 'etcdGRPCWriteRequestsSlow',
  expr: 'histogram_quantile(0.99, sum(rate(grpc_server_handling_seconds_bucket{job="etcd", grpc_method="Txn", grpc_type="unary"}[10m])) without(grpc_type)) > 5',
  'for': '10m',
  labels: {
    severity: 'critical',
  },
  annotations: {
-   description: 'etcd cluster "{{ $labels.job }}": 99th percentile of gRPC requests is {{ $value }}s on etcd instance {{ $labels.instance }} for {{ $labels.grpc_method }} method.',
-   summary: 'etcd grpc requests are slow',
+   description: 'etcd cluster "{{ $labels.job }}": 99th percentile of gRPC write requests is {{ $value }}s on etcd instance {{ $labels.instance }}.',
+   summary: 'etcd grpc write requests are slow',
🧩 Analysis chain
🏁 Script executed:
# First, locate and read the file to understand context
find . -name "custom.libsonnet" -type f
Repository: openshift/cluster-etcd-operator
Length of output: 101
🏁 Script executed:
# Check file size and read the relevant section with context
fd -e libsonnet | head -20
Repository: openshift/cluster-etcd-operator
Length of output: 1790
🏁 Script executed:
# Find the custom.libsonnet file
git ls-files | grep -i custom.libsonnet
Repository: openshift/cluster-etcd-operator
Length of output: 99
🏁 Script executed:
# Read the custom.libsonnet file to understand context
wc -l ./jsonnet/custom.libsonnet
Repository: openshift/cluster-etcd-operator
Length of output: 105
🏁 Script executed:
# Read the file with context around lines 56-64
cat -n ./jsonnet/custom.libsonnet | sed -n '40,75p'
Repository: openshift/cluster-etcd-operator
Length of output: 2098
🏁 Script executed:
# Check if there are other alerts for Put or DeleteRange methods
grep -n "Put\|DeleteRange" ./jsonnet/custom.libsonnet
Repository: openshift/cluster-etcd-operator
Length of output: 57
🏁 Script executed:
# Check the upstream etcd mixin alerts to compare
cat -n ./jsonnet/vendor/github.com/etcd-io/etcd/contrib/mixin/alerts/alerts.libsonnet | grep -A 5 "etcdGRPCWriteRequestsSlow" | head -20
Repository: openshift/cluster-etcd-operator
Length of output: 57
🏁 Script executed:
# Search for write-related alerts in the upstream etcd mixin
cat ./jsonnet/vendor/github.com/etcd-io/etcd/contrib/mixin/alerts/alerts.libsonnet | grep -i "write" -A 3 -B 1
Repository: openshift/cluster-etcd-operator
Length of output: 865
🏁 Script executed:
# Get a broader view of what's in the upstream etcd mixin alerts
wc -l ./jsonnet/vendor/github.com/etcd-io/etcd/contrib/mixin/alerts/alerts.libsonnet
Repository: openshift/cluster-etcd-operator
Length of output: 157
🏁 Script executed:
# Read the full upstream etcd alerts file to find write-related alerts
cat -n ./jsonnet/vendor/github.com/etcd-io/etcd/contrib/mixin/alerts/alerts.libsonnet | grep -i "grpc.*write\|txn\|put\|deleterange" -B 2 -A 5
Repository: openshift/cluster-etcd-operator
Length of output: 57
🏁 Script executed:
# Get the entire alerts file to understand the pattern
cat ./jsonnet/vendor/github.com/etcd-io/etcd/contrib/mixin/alerts/alerts.libsonnet
Repository: openshift/cluster-etcd-operator
Length of output: 11491
🏁 Script executed:
# Read the entire custom.libsonnet to check if Put/DeleteRange are monitored elsewhere
cat -n ./jsonnet/custom.libsonnet
Repository: openshift/cluster-etcd-operator
Length of output: 10807
🌐 Web query:
etcd gRPC methods Put DeleteRange Txn write operations
💡 Result:
Summary — etcd KV gRPC write operations
- Put (KV.Put): writes or updates a single key. Increments store revision, generates a PUT event; can return previous KV if PutRequest.prev_kv is set. Request/response types: PutRequest → PutResponse. [1][2]
- DeleteRange (KV.DeleteRange): deletes a single key or a range (key + range_end). Increments revision, generates DELETE events for each deleted key; can return deleted key-values if prev_kv is set. Response includes deleted count. Request/response types: DeleteRangeRequest → DeleteRangeResponse. [1][2]
- Txn (KV.Txn): atomic If/Then/Else transaction. TxnRequest contains compares (conditions) and two lists of RequestOp (success / failure). RequestOp is a oneof: RangeRequest, PutRequest, DeleteRangeRequest (and nested TxnRequest). TxnResponse returns succeeded bool and a list of ResponseOp (matching RangeResponse / PutResponse / DeleteRangeResponse). A Txn increments the revision once for the applied requests and generates events with the same revision. It is forbidden to modify the same key more than once within a single Txn (Puts/Deletes must use unique keys). [1][2]
References:
- etcd v3 gRPC KV API (Put / DeleteRange / Txn) — etcd docs. [1]
- etcd v3 transaction and RequestOp/ResponseOp protobuf details — etcd docs. [2]
Sources:
[1] etcd API reference — service KV (Put, DeleteRange, Txn).
[2] etcd v3 learning / API (Txn, RequestOp, Request/Response message shapes).
Expand the write alert to include Put and DeleteRange methods.
The alert is named etcdGRPCWriteRequestsSlow and its description claims to monitor "gRPC write requests," but currently only filters for Txn. Put and DeleteRange are direct write operations in etcd that are not captured. If your workload uses direct Put or DeleteRange calls (common in many deployments), slow writes via these methods would go undetected.
Suggested fix
- expr: 'histogram_quantile(0.99, sum(rate(grpc_server_handling_seconds_bucket{job="etcd", grpc_method="Txn", grpc_type="unary"}[10m])) without(grpc_type)) > 5',
+ expr: 'histogram_quantile(0.99, sum(rate(grpc_server_handling_seconds_bucket{job="etcd", grpc_method=~"Txn|Put|DeleteRange", grpc_type="unary"}[10m])) without(grpc_type)) > 5',
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
- alert: 'etcdGRPCWriteRequestsSlow',
- expr: 'histogram_quantile(0.99, sum(rate(grpc_server_handling_seconds_bucket{job="etcd", grpc_method="Txn", grpc_type="unary"}[10m])) without(grpc_type)) > 5',
- 'for': '10m',
- labels: {
-   severity: 'critical',
- },
- annotations: {
-   description: 'etcd cluster "{{ $labels.job }}": 99th percentile of gRPC requests is {{ $value }}s on etcd instance {{ $labels.instance }} for {{ $labels.grpc_method }} method.',
-   summary: 'etcd grpc requests are slow',
-   description: 'etcd cluster "{{ $labels.job }}": 99th percentile of gRPC write requests is {{ $value }}s on etcd instance {{ $labels.instance }}.',
-   summary: 'etcd grpc write requests are slow',
+ alert: 'etcdGRPCWriteRequestsSlow',
+ expr: 'histogram_quantile(0.99, sum(rate(grpc_server_handling_seconds_bucket{job="etcd", grpc_method=~"Txn|Put|DeleteRange", grpc_type="unary"}[10m])) without(grpc_type)) > 5',
+ 'for': '10m',
+ labels: {
+   severity: 'critical',
+ },
+ annotations: {
+   description: 'etcd cluster "{{ $labels.job }}": 99th percentile of gRPC write requests is {{ $value }}s on etcd instance {{ $labels.instance }}.',
+   summary: 'etcd grpc write requests are slow',
🤖 Prompt for AI Agents
In `@jsonnet/custom.libsonnet` around lines 56 - 64, The etcdGRPCWriteRequestsSlow
alert currently only filters grpc_method="Txn" so Put and DeleteRange write
operations are missed; update the alert expression (the expr for alert:
etcdGRPCWriteRequestsSlow) to include all write methods by changing the label
filter to match multiple methods (e.g., grpc_method=~"Txn|Put|DeleteRange") or
an equivalent set-based match, and update the annotations.description text to
reflect that Put and DeleteRange are included so the summary remains accurate.
You'd be surprised, but deletes are also implemented as Txn calls:
https://github.com/etcd-io/etcd/blob/main/client/v3/kubernetes/client.go
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
# Fetch the client code to verify how DeleteRange is implemented
curl -s "https://raw.githubusercontent.com/etcd-io/etcd/main/client/v3/kubernetes/client.go" | head -200
Length of output: 3853
🏁 Script executed:
#!/bin/bash
# Search for Delete and Put method implementations in the kubernetes client
curl -s "https://raw.githubusercontent.com/etcd-io/etcd/main/client/v3/kubernetes/client.go" | grep -A 20 "func.*Delete"
Length of output: 611
🏁 Script executed:
#!/bin/bash
# Search for Put method implementations
curl -s "https://raw.githubusercontent.com/etcd-io/etcd/main/client/v3/kubernetes/client.go" | grep -A 20 "func.*Put"
Length of output: 711
🏁 Script executed:
#!/bin/bash
# Get more context on how these methods work - look for Txn usage
curl -s "https://raw.githubusercontent.com/etcd-io/etcd/main/client/v3/kubernetes/client.go" | grep -B 5 -A 15 "Txn()"
Length of output: 0
🏁 Script executed:
#!/bin/bash
# Check if the base etcd client v3 has direct Put/Delete methods
curl -s "https://raw.githubusercontent.com/etcd-io/etcd/main/client/v3/kv.go" | grep -A 15 "func.*Put\|func.*Delete"
Length of output: 1654
🏁 Script executed:
#!/bin/bash
# Look at the KV interface to see available methods
curl -s "https://raw.githubusercontent.com/etcd-io/etcd/main/client/v3/kv.go" | head -100
Length of output: 3404
🏁 Script executed:
#!/bin/bash
# Check the gRPC service definition to see what methods are exposed
curl -s "https://raw.githubusercontent.com/etcd-io/etcd/main/api/etcdserverpb/rpc.proto" | grep -A 5 "service KV"
Length of output: 210
🏁 Script executed:
#!/bin/bash
# Get the full KV service definition from proto
curl -s "https://raw.githubusercontent.com/etcd-io/etcd/main/api/etcdserverpb/rpc.proto" | grep -A 50 "service KV"
Length of output: 1689
🏁 Script executed:
#!/bin/bash
# Also check how the Do method routes operations
curl -s "https://raw.githubusercontent.com/etcd-io/etcd/main/client/v3/kv.go" | grep -A 30 "func.*Do\("
Length of output: 23
🏁 Script executed:
#!/bin/bash
# Search for etcd client usage in the OpenShift cluster-etcd-operator
rg -n "clientv3\." --type=go -C 2 | head -50
Length of output: 3779
🏁 Script executed:
#!/bin/bash
# Check if the kubernetes client wrapper is imported in this repo
rg -n "kubernetes.*client" --type=go -g "!vendor/**" | head -30
Length of output: 525
🏁 Script executed:
#!/bin/bash
# Look for direct Put/Delete calls vs Txn usage
rg -n "\.Put\(|\.Delete\(" --type=go -g "!vendor/**" -C 1 | head -40
Length of output: 2915
🏁 Script executed:
#!/bin/bash
# Search for documentation or comments about what client Kubernetes uses
curl -s "https://raw.githubusercontent.com/kubernetes/kubernetes/master/staging/src/k8s.io/apiserver/pkg/storage/etcd3/store.go" | head -100
Length of output: 3222
🏁 Script executed:
#!/bin/bash
# Check if kubernetes uses the kubernetes client wrapper
curl -s "https://raw.githubusercontent.com/kubernetes/kubernetes/master/staging/src/k8s.io/apiserver/pkg/storage/etcd3/store.go" | grep -i "import" -A 30 | head -50
Length of output: 796
🏁 Script executed:
#!/bin/bash
# Look for Put/Delete implementations in kube-apiserver storage
curl -s "https://raw.githubusercontent.com/kubernetes/kubernetes/master/staging/src/k8s.io/apiserver/pkg/storage/etcd3/store.go" | grep -A 20 "func.*Create\|func.*Delete\|func.*Update"
Length of output: 3966
🏁 Script executed:
#!/bin/bash
# Check how kube-apiserver uses the kubernetes client for writes
curl -s "https://raw.githubusercontent.com/kubernetes/kubernetes/master/staging/src/k8s.io/apiserver/pkg/storage/etcd3/store.go" | grep -A 30 "s.client.Kubernetes"
Length of output: 7601
🏁 Script executed:
#!/bin/bash
# Look for OptimisticPut usage in kube-apiserver
curl -s "https://raw.githubusercontent.com/kubernetes/kubernetes/master/staging/src/k8s.io/apiserver/pkg/storage/etcd3/store.go" | grep -i "OptimisticPut\|OptimisticDelete" -B 2 -A 10
Length of output: 1384
🏁 Script executed:
#!/bin/bash
# Search for the actual etcd operation calls
curl -s "https://raw.githubusercontent.com/kubernetes/kubernetes/master/staging/src/k8s.io/apiserver/pkg/storage/etcd3/store.go" | grep "\.client\." -A 2 | head -60
Length of output: 1537
@tjungblu You're absolutely right, thank you for the clarification! I verified that kube-apiserver uses the kubernetes.Client wrapper exclusively, and all write operations (Create/Update/Delete) go through OptimisticPut and OptimisticDelete, which indeed wrap the operations in Txn calls as you showed.
The current alert filtering for grpc_method="Txn" is correct and sufficient for monitoring write latency in the Kubernetes/OpenShift context. Adding Put and DeleteRange would not capture any additional slow writes since kube-apiserver doesn't use those methods directly.
My suggestion was incorrect. The alert as implemented properly captures all write operations that matter for the product.
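For readers less familiar with how the write alert's expression in this thread turns bucket counts into a latency, here is a minimal Python sketch of the linear interpolation that `histogram_quantile` is documented to perform on classic histograms. The bucket values and the names `txn_buckets` and `write_alert_pending` are made up for illustration; only the `0.99` quantile and the `> 5` threshold come from the quoted rule.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate the q-quantile from cumulative (upper_bound, count) buckets.

    Follows the linear interpolation Prometheus documents for classic
    histograms; the bucket list must include the +Inf bucket.
    """
    buckets = sorted(buckets)
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound) or count == prev_count:
                # Rank falls in the +Inf bucket (or an empty first bucket):
                # fall back to the previous upper bound.
                return prev_bound
            # Interpolate linearly inside the bucket that holds the rank.
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return math.nan


# Hypothetical cumulative per-second rates for grpc_method="Txn".
txn_buckets = [(0.1, 90.0), (1.0, 97.0), (5.0, 98.0), (10.0, 100.0), (math.inf, 100.0)]
p99 = histogram_quantile(0.99, txn_buckets)
# The rule additionally requires the condition to hold for 10m before firing.
write_alert_pending = p99 > 5
print(p99, write_alert_pending)
```

With these invented numbers the 99th percentile lands in the 5s to 10s bucket, so the condition would be true; a real evaluation depends on the bucket boundaries etcd actually exports.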

/payload-aggregate periodic-ci-openshift-release-master-ci-4.21-upgrade-from-stable-4.20-e2e-azure-ovn-upgrade 10

@tjungblu: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command
See details on https://pr-payload-tests.ci.openshift.org/runs/ci/ef7faad0-f857-11f0-800f-2fa28c59142c-0

/retest-required

/retest

/jira refresh

@tjungblu: This pull request references Jira Issue OCPBUGS-66152, which is valid. The bug has been moved to the POST state. 3 validation(s) were run on this bug

/payload-aggregate periodic-ci-openshift-release-master-ci-4.21-upgrade-from-stable-4.20-e2e-azure-ovn-upgrade 10

@tjungblu: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command
See details on https://pr-payload-tests.ci.openshift.org/runs/ci/c31a59e0-fb68-11f0-98de-b13fd625f17e-0

/retest-required
/cherry-pick release-4.21

@tjungblu: once the present PR merges, I will cherry-pick it on top of
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

/override ci/prow/unit
unrelated tnf job failure

@tjungblu: Overrode contexts on behalf of tjungblu: ci/prow/unit

> on () group_left (type)
bottomk(1,
  1.5 * group by (type) (cluster_infrastructure_provider{type="Azure"})
  or
  1 * group by (type) (cluster_infrastructure_provider))
|||,
So we're getting rid of the inflated threshold for Azure but bumping it overall, right?
Yep, but specifically for the read and write paths we care about.
I'm happier about getting rid of the Azure special case. Hopefully we don't have to add it back eventually.
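As a rough illustration of what the removed Azure special case did, the Python sketch below models the right-hand side quoted above step by step. It assumes the elided left-hand side was a latency value in seconds, so the result acts as a per-platform threshold; `old_threshold` is a hypothetical name and not part of the repository.

```python
def old_threshold(provider_types):
    """Model the removed right-hand side for the provider types a cluster reports."""
    # 1.5 * group by (type) (cluster_infrastructure_provider{type="Azure"})
    left = {t: 1.5 for t in provider_types if t == "Azure"}
    # 1 * group by (type) (cluster_infrastructure_provider)
    right = {t: 1.0 for t in provider_types}
    # PromQL `or`: keep every left-hand series, then add right-hand series
    # whose label set has no match on the left.
    merged = {**right, **left}
    if not merged:
        return None
    # bottomk(1, ...) keeps the single series with the smallest value.
    provider = min(merged, key=merged.get)
    return provider, merged[provider]


print(old_threshold(["Azure"]))
print(old_threshold(["AWS"]))
```

Under that assumption, Azure clusters were compared against 1.5 and every other platform against 1, whereas the write alert quoted earlier in the review uses a flat `> 5` with no platform lookup.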

/lgtm

[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: hasbro17, tjungblu

/verified by @tjungblu

@tjungblu: This PR has been marked as verified by

/retest-required

/retest-required

@tjungblu: all tests passed!

@tjungblu: Jira Issue Verification Checks: Jira Issue OCPBUGS-66152
Jira Issue OCPBUGS-66152 has been moved to the MODIFIED state and will move to the VERIFIED state when the change is available in an accepted nightly payload. 🕓

@tjungblu: new pull request created: #1537

Fix included in accepted release 4.22.0-0.nightly-2026-01-31-004202

This avoids alerting on endpoints that have no meaningful impact on the product (e.g. etcd leases) and allows us to tune the thresholds for read and write differently.