chore(charts/istio-alerts): update runbook for 5xx increase (#324)
Due to encoding issues, the direct link to the graph is not in the alerts...

So adding the query here for easier lookup
schahal authored Aug 5, 2024
1 parent f2d3973 commit a5f4608
Showing 3 changed files with 25 additions and 3 deletions.
2 changes: 1 addition & 1 deletion charts/istio-alerts/Chart.yaml
@@ -2,7 +2,7 @@ apiVersion: v2
name: istio-alerts
description: A Helm chart that provisions a series of alerts for istio VirtualServices
type: application
version: 0.5.2
version: 0.5.3
maintainers:
- name: diranged
email: matt@nextdoor.com
2 changes: 1 addition & 1 deletion charts/istio-alerts/README.md
@@ -1,6 +1,6 @@
# istio-alerts

![Version: 0.5.2](https://img.shields.io/badge/Version-0.5.2-informational?style=flat-square) ![Type: application](https://img.shields.io/badge/Type-application-informational?style=flat-square)
![Version: 0.5.3](https://img.shields.io/badge/Version-0.5.3-informational?style=flat-square) ![Type: application](https://img.shields.io/badge/Type-application-informational?style=flat-square)

A Helm chart that provisions a series of alerts for istio VirtualServices

24 changes: 23 additions & 1 deletion charts/istio-alerts/runbook.md
@@ -1,7 +1,7 @@
## 5xx-Rate-Too-High

This alert fires when the rate of 5xx responses from a service exceeds a
threshold (by default, 0.05%). A 5xx indicates that some sort of server-side
threshold (default, 0.05% for 5m). A 5xx indicates that some sort of server-side
error is occurring; when investigating this alarm, check which status codes
are being returned. A breakdown of responses by status code
can be found in Grafana on the "Istio Service Dashboard". Be sure to navigate
@@ -10,6 +10,28 @@ service. Many services have custom dashboards in DataDog as well which may help
investigate this alert further, and most services also produce logs of requests
which may provide more context into what errors are being returned and why.

You can check the trend/graph by:

1. Going to your Grafana instance and navigating to the `Explore` tab
2. Entering the following Prometheus query (replacing the `<x>` and `<y>` placeholders with your `cluster` and `destination_service_namespace`):

```
sum by (destination_service_name, reporter) (
  rate(istio_requests_total{cluster="<x>", response_code=~"5.*", destination_service_namespace="<y>"}[5m])
)
/
sum by (destination_service_name, reporter) (
  rate(istio_requests_total{cluster="<x>", destination_service_namespace="<y>"}[5m])
)
```
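
For illustration, here is the same query with the placeholders filled in for a hypothetical cluster named `prod` and a hypothetical `payments` namespace (both names are examples only):

```
# Fraction of requests returning 5xx, per service (0.0005 == the default 0.05% threshold)
sum by (destination_service_name, reporter) (
  rate(istio_requests_total{cluster="prod", response_code=~"5.*", destination_service_namespace="payments"}[5m])
)
/
sum by (destination_service_name, reporter) (
  rate(istio_requests_total{cluster="prod", destination_service_namespace="payments"}[5m])
)
```

The result is a per-service ratio between 0 and 1, so a value above 0.0005 corresponds to crossing the default 0.05% threshold.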

Action Items:

1. If trends are expected, tweak your thresholds (away from the [default 0.05% for 5 minutes](https://github.com/Nextdoor/k8s-charts/blob/f2d3973a1a9292e7c59e3feb4eb49df93dea926d/charts/istio-alerts/values.yaml#L28-L41)).
2. If the response codes are unexpected, debug your app to see why error responses are increasing (a breakdown query sketch follows below).

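If you need to narrow down which specific 5xx codes are driving the increase, a breakdown variant of the query above may help. This is a sketch using the same `istio_requests_total` metric and the same `<x>`/`<y>` placeholders as before; it shows the rate of each individual 5xx code per service:

```
# Per-service 5xx request rate, broken down by individual response code
sum by (destination_service_name, response_code) (
  rate(istio_requests_total{cluster="<x>", response_code=~"5.*", destination_service_namespace="<y>"}[5m])
)
```
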
## HighRequestLatency

TBD
