Add remote alertmanager for operators
wyb1 committed Nov 14, 2019
1 parent 76eb57d commit 467de5c
Showing 26 changed files with 451 additions and 88 deletions.
12 changes: 6 additions & 6 deletions .ci/generate_monitoring_docs
Original file line number Diff line number Diff line change
Expand Up @@ -38,25 +38,25 @@ for t in $tools; do
done

pushd $SOURCE_PATH/charts/seed-monitoring/charts/core/charts/prometheus > /dev/null
cat <<EOF > $SOURCE_PATH/docs/development/user_alerts.md
cat <<EOF > $SOURCE_PATH/docs/monitoring/user_alerts.md
# User Alerts
|Alertname|Severity|Type|Description|
|---|---|---|---|
EOF
cat <<EOF > $SOURCE_PATH/docs/development/operator_alerts.md
cat <<EOF > $SOURCE_PATH/docs/monitoring/operator_alerts.md
# Operator Alerts
|Alertname|Severity|Type|Description|
|---|---|---|---|
EOF
for file in rules/*.yaml; do
cat $file | yaml2json | jq -r '.groups | .[].rules | map(select(.labels.visibility == "owner" or .labels.visibility == "all")) | map(select(has("alert"))) | .[] | "|" + .alert + "|" + .labels.severity + "|" + .labels.type + "|" + "`" + .annotations.description + "`" + "|"' >> $SOURCE_PATH/docs/development/user_alerts.md
cat $file | yaml2json | jq -r '.groups | .[].rules | map(select(.labels.visibility == "operator" or .labels.visibility == "all")) | map(select(has("alert"))) | .[] | "|" + .alert + "|" + .labels.severity + "|" + .labels.type + "|" + "`" + .annotations.description + "`" + "|"' >> $SOURCE_PATH/docs/development/operator_alerts.md
cat $file | yaml2json | jq -r '.groups | .[].rules | map(select(.labels.visibility == "owner" or .labels.visibility == "all")) | map(select(has("alert"))) | .[] | "|" + .alert + "|" + .labels.severity + "|" + .labels.type + "|" + "`" + .annotations.description + "`" + "|"' >> $SOURCE_PATH/docs/monitoring/user_alerts.md
cat $file | yaml2json | jq -r '.groups | .[].rules | map(select(.labels.visibility == "operator" or .labels.visibility == "all")) | map(select(has("alert"))) | .[] | "|" + .alert + "|" + .labels.severity + "|" + .labels.type + "|" + "`" + .annotations.description + "`" + "|"' >> $SOURCE_PATH/docs/monitoring/operator_alerts.md
done
popd > /dev/null

if [ -n "$(git status --porcelain)" ]; then
git add $SOURCE_PATH/docs/development/user_alerts.md
git add $SOURCE_PATH/docs/development/operator_alerts.md
git add $SOURCE_PATH/docs/monitoring/user_alerts.md
git add $SOURCE_PATH/docs/monitoring/operator_alerts.md
git commit -m "Update alert documentation"
else
echo "no changes";
Expand Down
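The `jq` filter in the script above can be hard to read on one line. The following is a minimal Python sketch of the same selection and formatting logic, assuming the rule file has already been parsed from YAML into a dictionary. The function name `alert_rows` and the rule names in `EXAMPLE_RULE_FILE` are hypothetical and only serve as illustration; the script itself keeps using `yaml2json` and `jq`.

```python
def alert_rows(rule_file, visibilities):
    """Return Markdown table rows for alert rules visible to the given audiences.

    `rule_file` is the parsed content of one rules/*.yaml file (a dict with a
    "groups" list); `visibilities` is a set such as {"owner", "all"}.
    """
    rows = []
    for group in rule_file.get("groups", []):
        for rule in group.get("rules", []):
            labels = rule.get("labels", {})
            # Recording rules have no "alert" key and are skipped,
            # as are alerts that are not visible to this audience.
            if "alert" not in rule or labels.get("visibility") not in visibilities:
                continue
            description = rule.get("annotations", {}).get("description", "")
            rows.append(
                "|{}|{}|{}|`{}`|".format(
                    rule["alert"],
                    labels.get("severity", ""),
                    labels.get("type", ""),
                    description,
                )
            )
    return rows


# Hypothetical rule file content, used only to exercise the function.
EXAMPLE_RULE_FILE = {
    "groups": [
        {
            "rules": [
                {"record": "example:recording_rule", "expr": "vector(1)"},
                {
                    "alert": "ExampleUserAlert",
                    "labels": {"severity": "critical", "type": "shoot", "visibility": "owner"},
                    "annotations": {"description": "Example description."},
                },
                {
                    "alert": "ExampleOperatorAlert",
                    "labels": {"severity": "warning", "type": "seed", "visibility": "operator"},
                    "annotations": {"description": "Operator description."},
                },
                {
                    "alert": "ExampleSharedAlert",
                    "labels": {"severity": "warning", "type": "seed", "visibility": "all"},
                    "annotations": {"description": "Shared description."},
                },
            ]
        }
    ]
}

# Rows that would end up in user_alerts.md for this example input.
for row in alert_rows(EXAMPLE_RULE_FILE, {"owner", "all"}):
    print(row)
```

Alerts labeled `visibility: all` appear in both generated documents, which is why each file is passed through the filter twice.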

This file was deleted.

Original file line number Diff line number Diff line change
@@ -0,0 +1 @@
{{- include "gardener.secret-alerting" . }}
Original file line number Diff line number Diff line change
Expand Up @@ -27,7 +27,7 @@ spec:
checksum/secret-gardener-controller-manager-kubeconfig: {{ include (print $.Template.BasePath "/controller-manager/secret-kubeconfig.yaml") . | sha256sum }}
checksum/secret-default-domain: {{ include "gardener.secret-default-domain" . | sha256sum }}
checksum/secret-internal-domain: {{ include "gardener.secret-internal-domain" . | sha256sum }}
checksum/secret-alerting-smtp: {{ include "gardener.secret-alerting-smtp" . | sha256sum }}
checksum/secret-alerting: {{ include "gardener.secret-alerting" . | sha256sum }}
checksum/secret-openvpn-diffie-hellman: {{ include "gardener.secret-openvpn-diffie-hellman" . | sha256sum }}
labels:
app: gardener
Expand Down

This file was deleted.

Original file line number Diff line number Diff line change
@@ -0,0 +1,43 @@
{{- define "gardener.secret-alerting" -}}
{{- if .Values.global.controller.enabled }}
{{- range $key, $config := .Values.global.controller.alerting }}
---
apiVersion: v1
kind: Secret
metadata:
name: alerting-{{ $key }}
namespace: garden
labels:
app: gardener
chart: "{{ $.Chart.Name }}-{{ $.Chart.Version }}"
release: "{{ $.Release.Name }}"
heritage: "{{ $.Release.Service }}"
gardener.cloud/role: alerting
type: Opaque
data:
auth_type: {{ ( required ".controller.alerting[].auth_type is required" $config.auth_type ) | b64enc }}
{{- if eq $config.auth_type "smtp" }}
to: {{ ( required ".controller.alerting[].to is required" $config.to ) | b64enc }}
from: {{ ( required ".controller.alerting[].from is required" $config.from ) | b64enc }}
smarthost: {{ ( required ".controller.alerting[].smarthost is required" $config.smarthost ) | b64enc }}
auth_username: {{ ( required ".controller.alerting[].auth_username is required" $config.auth_username ) | b64enc }}
auth_identity: {{ ( required ".controller.alerting[].auth_identity is required" $config.auth_identity ) | b64enc }}
auth_password: {{ ( required ".controller.alerting[].auth_password is required" $config.auth_password ) | b64enc }}
{{- end }}
{{- if eq $config.auth_type "none" }}
url: {{ ( required ".controller.alerting[].url is required" $config.url ) | b64enc }}
{{- end }}
{{- if eq $config.auth_type "basic" }}
url: {{ ( required ".controller.alerting[].url is required" $config.url ) | b64enc }}
username: {{ ( required ".controller.alerting[].username is required" $config.username ) | b64enc }}
password: {{ ( required ".controller.alerting[].password is required" $config.password ) | b64enc }}
{{- end }}
{{- if eq $config.auth_type "certificate" }}
url: {{ ( required ".controller.alerting[].url is required" $config.url ) | b64enc }}
ca.crt: {{ ( required ".controller.alerting[].ca_crt is required" $config.ca_crt ) | b64enc }}
tls.crt: {{ ( required ".controller.alerting[].tls_cert is required" $config.tls_cert ) | b64enc }}
tls.key: {{ ( required ".controller.alerting[].tls_key is required" $config.tls_key ) | b64enc }}
{{- end }}
{{- end }}
{{- end }}
{{- end -}}
5 changes: 3 additions & 2 deletions charts/gardener/values.yaml
Original file line number Diff line number Diff line change
Expand Up @@ -148,8 +148,9 @@ global:
# provider: aws-route53 # depends on the DNS extension of your choice
# credentials: {}
# # actual keys here depend on the DNS extension of your choice
alertingSMTP: []
# - to: email-address-to-send-alerts-to
alerting: []
# - auth_type: smtp
# to: email-address-to-send-alerts-to
# from: email-address-to-send-alerts-from
# smarthost: smtp-host-used-for-sending
# auth_username: smtp-authentication-username
Expand Down
Original file line number Diff line number Diff line change
@@ -1,3 +1,4 @@
{{ if .Values.alertmanager.enabled }}
apiVersion: v1
kind: Service
metadata:
Expand Down Expand Up @@ -141,3 +142,4 @@ spec:
resources:
requests:
storage: {{ .Values.alertmanager.storage }}
{{- end }}
2 changes: 2 additions & 0 deletions charts/seed-bootstrap/templates/alertmanager/config.yaml
Original file line number Diff line number Diff line change
@@ -1,7 +1,9 @@
{{ if .Values.alertmanager.enabled }}
apiVersion: v1
kind: Secret
metadata:
name: alertmanager-config
namespace: {{ .Release.Namespace }}
data:
alertmanager.yaml: {{ include "config" .Values.alertmanager | b64enc }}
{{- end }}
1 change: 1 addition & 0 deletions charts/seed-bootstrap/values.yaml
Original file line number Diff line number Diff line change
Expand Up @@ -115,6 +115,7 @@ fluentd-es:

alertmanager:
emailConfigs: []
enabled: true
storage: 1Gi

hvpa:
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -29,6 +29,29 @@ data:
- /etc/prometheus/rules/*.yaml
alerting:
alertmanagers:
{{- if hasKey .Values.alerting.auth_type "none" }}
- static_configs:
- targets:
- {{ .Values.alerting.auth_type.none.url }}
{{- end }}
{{- if hasKey .Values.alerting.auth_type "basic" }}
- static_configs:
- targets:
- {{ .Values.alerting.auth_type.basic.url }}
basic_auth:
username: {{ .Values.alerting.auth_type.basic.username }}
password: {{ .Values.alerting.auth_type.basic.password }}
{{- end }}
{{- if hasKey .Values.alerting.auth_type "certificate" }}
- static_configs:
- targets:
- {{ .Values.alerting.auth_type.certificate.url }}
tls_config:
ca_file: /etc/prometheus/operator/ca.crt
cert_file: /etc/prometheus/operator/tls.crt
key_file: /etc/prometheus/operator/tls.key
insecure_skip_verify: {{ .Values.alerting.auth_type.certificate.insecure_skip_verify }}
{{- end }}
- kubernetes_sd_configs:
- role: endpoints
namespaces:
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -137,6 +137,10 @@ spec:
# we mount the Shoot cluster's CA and certs
- mountPath: /etc/prometheus/seed
name: prometheus-kubeconfig
{{- if hasKey .Values.alerting.auth_type "certificate" }}
- mountPath: /etc/prometheus/operator
name: prometheus-remote-am-tls
{{- end }}
- image: {{ index .Values.images "vpn-seed" }}
imagePullPolicy: IfNotPresent
name: vpn-seed
Expand Down Expand Up @@ -259,6 +263,11 @@ spec:
- name: blackbox-exporter-config-prometheus
configMap:
name: blackbox-exporter-config-prometheus
{{- if hasKey .Values.alerting.auth_type "certificate" }}
- name: prometheus-remote-am-tls
secret:
secretName: prometheus-remote-am-tls
{{- end }}
volumeClaimTemplates:
- metadata:
name: prometheus-db
Expand Down
15 changes: 15 additions & 0 deletions charts/seed-monitoring/charts/core/charts/prometheus/values.yaml
Original file line number Diff line number Diff line change
Expand Up @@ -324,6 +324,21 @@ rules:
enabled: false
rules: false

alerting:
auth_type: {}
# none:
# url: foo.bar
# basic:
# url: foo.bar
# username: admin
# password: password
# certificate:
# url: foo.bar
# ca.crt: ca
# tls.crt: certificate
# tls.key: key
# insecure_skip_verify: false

ignoreAlerts: false

# object can be any object you want to scale Prometheus on:
Expand Down
4 changes: 4 additions & 0 deletions docs/README.md
Original file line number Diff line number Diff line change
Expand Up @@ -74,3 +74,7 @@
* [Deploying the Gardener into a Kubernetes cluster](deployment/kubernetes.md)
* [Deploying the Gardener and a Seed into an AKS cluster](deployment/aks.md)
* [Overwrite image vector](deployment/image_vector.md)

## Monitoring

* [Alerting](monitoring/alerting.md)
137 changes: 137 additions & 0 deletions docs/monitoring/alerting.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,137 @@
# Alerting

Gardener uses [Prometheus](https://prometheus.io/) to gather metrics from each component. A Prometheus instance is deployed in each shoot control plane (on the seed) and is responsible for gathering control plane and cluster metrics. Prometheus can be configured to fire alerts based on these metrics and send them to an [alertmanager](https://prometheus.io/docs/alerting/alertmanager/). The alertmanager is responsible for sending the alerts to users and operators. This document describes how to set up alerting for:

- [end-users/stakeholders/customers](#Alerting-for-Users)
- [operators/administrators](#Alerting-for-Operators)

# Alerting for Users

To receive email alerts as a user, set the following values in the shoot spec:

```yaml
spec:
monitoring:
alerting:
emailReceivers:
- john.doe@example.com
```
`emailReceivers` is a list of email addresses that will receive alerts if something is wrong with the shoot cluster. A list of alerts for users can be found [here](user_alerts.md).

# Alerting for Operators

Currently, Gardener supports two options for alerting:

- [Email Alerting](#Email-Alerting)
- [Sending Alerts to an external alertmanager](#External-Alertmanager)

A list of operator alerts can be found [here](operator_alerts.md).

## Email Alerting

Gardener provides the option to deploy an alertmanager into each seed. This alertmanager is responsible for sending out alerts to operators for each shoot cluster in the seed. The alertmanager managed by Gardener supports only email alerts. To enable it, set the `alerting` values in the Gardener controller manager configuration. See the [configuration documentation](../usage/configuration.md) for how to configure Gardener's SMTP secret. If the values are set, a secret with the label `gardener.cloud/role: alerting` will be created in the garden namespace of the garden cluster. This secret will be used by each alertmanager in each seed.

## External Alertmanager

The alertmanager supports different kinds of [alerting configurations](https://prometheus.io/docs/alerting/configuration/). The alertmanager provided by Gardener only supports email alerts. If email is not sufficient, alerts can be sent to an external alertmanager instead. Prometheus sends the alerts to a URL, and the external alertmanager handles them from there. This external alertmanager is operated and configured by the operator (i.e. Gardener does not configure or deploy this alertmanager). To configure sending alerts to an external alertmanager, create a secret in the virtual garden cluster in the garden namespace with the label `gardener.cloud/role: alerting`. This secret needs to contain a URL to the external alertmanager and information regarding authentication. Supported authentication types are:

- No Authentication (none)
- Basic Authentication (basic)
- Mutual TLS (certificate)
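
To illustrate how each authentication type ends up in the Prometheus configuration, here is a minimal Python sketch that mirrors the `alertmanagers` entries of the Prometheus config template changed in this commit. The function name `alertmanager_entry` is hypothetical, and the sketch assumes that `basic_auth` and `tls_config` sit at the same level as `static_configs`; the actual rendering is done by Helm, not by this code.

```python
# Mount path for the mutual TLS material, taken from the Prometheus chart
# templates in this commit.
TLS_DIR = "/etc/prometheus/operator"


def alertmanager_entry(auth_type, cfg):
    """Build one `alerting.alertmanagers` entry for an external alertmanager.

    `cfg` holds the plain (not base64-encoded) values for the chosen
    authentication type, e.g. {"url": "external.alertmanager.foo"}.
    """
    entry = {"static_configs": [{"targets": [cfg["url"]]}]}
    if auth_type == "none":
        return entry
    if auth_type == "basic":
        entry["basic_auth"] = {
            "username": cfg["username"],
            "password": cfg["password"],
        }
        return entry
    if auth_type == "certificate":
        entry["tls_config"] = {
            "ca_file": TLS_DIR + "/ca.crt",
            "cert_file": TLS_DIR + "/tls.crt",
            "key_file": TLS_DIR + "/tls.key",
            # Defaulting to False is an assumption of this sketch.
            "insecure_skip_verify": cfg.get("insecure_skip_verify", False),
        }
        return entry
    raise ValueError("unsupported auth_type: " + auth_type)


# Example with placeholder credentials.
print(alertmanager_entry("basic", {
    "url": "external.alertmanager.foo",
    "username": "admin",
    "password": "password",
}))
```

With `certificate`, the certificate files themselves are not part of the Prometheus configuration; they are mounted from the `prometheus-remote-am-tls` secret.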

### Remote Alertmanager Examples

Note: the `url` value must not be prefixed with `http` or `https`.

```yaml
# Example secret: keep exactly one of the `data` variants below.
apiVersion: v1
kind: Secret
metadata:
labels:
gardener.cloud/role: alerting
name: alerting-auth
namespace: garden
data:
# No Authentication
auth_type: base64(none)
url: base64(external.alertmanager.foo)
# Basic Auth
auth_type: base64(basic)
url: base64(external.alertmanager.foo)
username: base64(admin)
password: base64(password)
# Mutual TLS
auth_type: base64(certificate)
url: base64(external.alertmanager.foo)
ca.crt: base64(ca)
tls.crt: base64(certificate)
tls.key: base64(key)
# Email Alerts (internal alertmanager)
auth_type: base64(smtp)
auth_identity: base64(internal.alertmanager.auth_identity)
auth_password: base64(internal.alertmanager.auth_password)
auth_username: base64(internal.alertmanager.auth_username)
from: base64(internal.alertmanager.from)
smarthost: base64(internal.alertmanager.smarthost)
to: base64(internal.alertmanager.to)
type: Opaque
```
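
The `base64(...)` placeholders above stand for base64-encoded values. As a minimal sketch, the `data` section for one variant could be produced as follows. The helper `build_secret_data` is hypothetical; the required keys per `auth_type` are taken from the secret template in this commit, and the scheme check reflects the note about the `url` value.

```python
import base64

# Keys expected in the secret's `data` section for each auth_type.
REQUIRED_KEYS = {
    "none": ["url"],
    "basic": ["url", "username", "password"],
    "certificate": ["url", "ca.crt", "tls.crt", "tls.key"],
    "smtp": ["to", "from", "smarthost", "auth_username", "auth_identity", "auth_password"],
}


def build_secret_data(auth_type, values):
    """Return the base64-encoded `data` map for an alerting secret."""
    if auth_type not in REQUIRED_KEYS:
        raise ValueError("unsupported auth_type: " + auth_type)
    missing = [key for key in REQUIRED_KEYS[auth_type] if key not in values]
    if missing:
        raise ValueError("missing keys: " + ", ".join(missing))
    # The url must be given without an http/https scheme.
    if values.get("url", "").startswith(("http://", "https://")):
        raise ValueError("url must not include a scheme")
    plain = {"auth_type": auth_type}
    for key in REQUIRED_KEYS[auth_type]:
        plain[key] = values[key]
    return {
        key: base64.b64encode(value.encode("utf-8")).decode("ascii")
        for key, value in plain.items()
    }


# Example with placeholder credentials.
data = build_secret_data("basic", {
    "url": "external.alertmanager.foo",
    "username": "admin",
    "password": "password",
})
for key, value in data.items():
    print(key + ": " + value)
```

The resulting key/value pairs can be pasted into the `data` section of the secret. Alternatively, Kubernetes accepts plain values under `stringData` and encodes them itself.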

### Configuring your External Alertmanager

Please refer to the [alertmanager](https://prometheus.io/docs/alerting/alertmanager/) documentation on how to configure an alertmanager.

We recommend you use at least the following inhibition rules in your alertmanager configuration to prevent excessive alerts:
```yaml
inhibit_rules:
# Apply inhibition if the alert name is the same.
- source_match:
severity: critical
target_match:
severity: warning
equal: ['alertname', 'service', 'cluster']
# Stop all alerts for type=shoot if there are VPN problems.
- source_match:
service: vpn
target_match_re:
type: shoot
equal: ['type', 'cluster']
# Stop warning and critical alerts if there is a blocker - no worker nodes, no etcd main etc.
- source_match:
severity: blocker
target_match_re:
severity: ^(critical|warning)$
equal: ['cluster']
# If the API server is down inhibit no worker nodes alert. No worker nodes depends on kube-state-metrics which depends on the API server.
- source_match:
service: kube-apiserver
target_match_re:
service: nodes
equal: ['cluster']
# If API server is down inhibit kube-state-metrics alerts.
- source_match:
service: kube-apiserver
target_match_re:
severity: info
equal: ['cluster']
# No Worker nodes depends on kube-state-metrics. Inhibit no worker nodes if kube-state-metrics is down.
- source_match:
service: kube-state-metrics-shoot
target_match_re:
service: nodes
equal: ['cluster']
```
Below is a graph visualizing the inhibition rules:

![inhibitionGraph](../development/content/alertInhibitionGraph.png)
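
To make the effect of these rules more concrete, here is a simplified Python sketch of how an inhibition rule is evaluated: a target alert is muted when some other firing alert matches the source matchers, the target matches the target matchers, and all labels listed in `equal` have the same value on both alerts. This is an illustration under simplifying assumptions, not the alertmanager implementation; the helper names and the alert names in the example are hypothetical.

```python
import re


def _matches(alert, exact, regex):
    """Check an alert's labels against exact and regex matchers."""
    for label, value in exact.items():
        if alert.get(label, "") != value:
            return False
    for label, pattern in regex.items():
        # Regex matchers are treated as fully anchored.
        if re.fullmatch(pattern, alert.get(label, "")) is None:
            return False
    return True


def is_inhibited(target, firing, rules):
    """Return True if `target` is muted by any other alert in `firing`."""
    for rule in rules:
        if not _matches(target, rule.get("target_match", {}), rule.get("target_match_re", {})):
            continue
        for source in firing:
            if source is target:
                continue
            if not _matches(source, rule.get("source_match", {}), rule.get("source_match_re", {})):
                continue
            # A missing label is treated like an empty value.
            if all(source.get(l, "") == target.get(l, "") for l in rule.get("equal", [])):
                return True
    return False


# Two of the recommended rules, expressed as Python data.
RULES = [
    {
        "source_match": {"severity": "critical"},
        "target_match": {"severity": "warning"},
        "equal": ["alertname", "service", "cluster"],
    },
    {
        "source_match": {"severity": "blocker"},
        "target_match_re": {"severity": "^(critical|warning)$"},
        "equal": ["cluster"],
    },
]

# Hypothetical alerts, described by their labels only.
blocker = {"alertname": "ExampleBlocker", "severity": "blocker", "cluster": "shoot-a"}
warning_a = {"alertname": "ExampleWarning", "severity": "warning", "cluster": "shoot-a"}
warning_b = {"alertname": "ExampleWarning", "severity": "warning", "cluster": "shoot-b"}
firing = [blocker, warning_a, warning_b]

print(is_inhibited(warning_a, firing, RULES))  # same cluster as the blocker
print(is_inhibited(warning_b, firing, RULES))  # different cluster
```

In this example only the warning on the same cluster as the blocker is muted, because the second rule requires the `cluster` label to be equal on both alerts.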


File renamed without changes.
File renamed without changes.
13 changes: 9 additions & 4 deletions docs/usage/configuration.md
Original file line number Diff line number Diff line change
Expand Up @@ -48,10 +48,15 @@ When the `gardener-controller-manager` starts it scans the `garden` namespace of
* Not every end-user/stakeholder/customer has its own domain, however, Gardener needs to create a DNS record for every shoot cluster.
* As landscape operator you might want to define a default domain owned and controlled by you that is used for all shoot clusters that don't specify their own domain.

* **Alerting SMTP secrets** (optional), contain the SMTP credentials which will be used by the [AlertmMnager](https://prometheus.io/docs/alerting/alertmanager/) to send emails for alerts, please see [this](../../example/10-secret-alerting-smtp.yaml) for an example.
* These secrets are used by the AlertManager which is deployed next to the Kubernetes control plane of a shoot cluster in seed clusters.
* In case there have been alerting SMTP secrets configured, the Gardener will inject the credentials in the configuration of the AlertManager.
* It will use them to send mails to the stated email address in case anything is wrong with the Shoot clusters.
* **Alerting secrets** (optional), contain the alerting configuration and credentials for the [Alertmanager](https://prometheus.io/docs/alerting/alertmanager/) to send email alerts. It is also possible to configure the monitoring stack to send alerts to an alertmanager not deployed by Gardener to handle alerting. Please see [this](../../example/10-secret-alerting.yaml) for an example.
* If email alerting is configured:
* An Alertmanager is deployed into each seed cluster that handles the alerting for all shoots on the seed cluster.
* Gardener will inject the SMTP credentials into the configuration of the Alertmanager.
* The Alertmanager will send emails to the configured email address in case any alerts are firing.
* If an external alertmanager is configured:
* Each shoot has a [Prometheus](https://prometheus.io/docs/introduction/overview/) responsible for monitoring components and sending out alerts. The alerts will be sent to a URL configured in the alerting secret.
* This external alertmanager is not managed by Gardener and can be configured however the operator sees fit.
* Supported authentication types are no authentication, basic, or mutual TLS.

* **OpenVPN Diffie-Hellman Key secret** (optional), contains the self-generated Diffie-Hellman key used by OpenVPN in your landscape, please see [this](../../example/10-secret-openvpn-diffie-hellman.yaml) for an example.
* If you don't specify a custom key then a default key is used, but for productive landscapes it's recommended to create a landscape-specific key and define it.
Expand Down