Commit
udpate prom alerts (#49)
Zedive authored Sep 8, 2021
1 parent 857882b commit 1f21bbe
Showing 4 changed files with 21 additions and 5 deletions.
2 changes: 1 addition & 1 deletion charts/flink-job/Chart.yaml
@@ -2,7 +2,7 @@ apiVersion: v2
appVersion: "1.0"
description: Flink job cluster on k8s
name: flink-job
-version: 0.0.2
+version: 0.0.3
maintainers:
- name: Zedive
email: albert@nextdoor.com
3 changes: 2 additions & 1 deletion charts/flink-job/README.md
@@ -2,7 +2,7 @@

Flink job cluster on k8s

-![Version: 0.0.2](https://img.shields.io/badge/Version-0.0.2-informational?style=flat-square) ![AppVersion: 1.0](https://img.shields.io/badge/AppVersion-1.0-informational?style=flat-square)
+![Version: 0.0.3](https://img.shields.io/badge/Version-0.0.3-informational?style=flat-square) ![AppVersion: 1.0](https://img.shields.io/badge/AppVersion-1.0-informational?style=flat-square)

This chart deploys a flink job cluster and runs a simple word counting flink app as an example.
This chart includes some production ready set-ups such as
@@ -20,6 +20,7 @@ See metrics reporter in the flink properties for more details.
| Key | Type | Default | Description |
|-----|------|---------|-------------|
| alerts.enabled | bool | `true` | (Boolean) whether to create the PrometheusRule for this flink cluster |
+| alerts.severity | string | `"info"` | |
| defaults.runbookUrl | string | `"https://github.com/Nextdoor/k8s-charts/blob/main/charts/flink-job/runbook.md"` | (String) Runbook URL for the Prometheus alerts |
| envVars | list | `[{"name":"HADOOP_CLASSPATH","value":"/opt/flink/opt/flink-metrics-prometheus-1.9.3.jar"}]` | Environment variables shared by all containers |
| flinkProperties | object | `{"execution.checkpointing.interval":"10min","execution.checkpointing.mode":"EXACTLY_ONCE","high-availability":"org.apache.flink.kubernetes.highavailability.KubernetesHaServicesFactory","high-availability.storageDir":"file:/savepoint/","kubernetes.cluster-id":"{{ .Values.fullnameOverride }}","kubernetes.namespace":"{{ .Release.Namespace }}","metrics.reporter.prom.class":"org.apache.flink.metrics.prometheus.PrometheusReporter","metrics.reporters":"prom","restart-strategy":"exponential-delay","restart-strategy.exponential-delay.backoff-multiplier":"2.0","state.checkpoints.dir":"file:/savepoint/","taskmanager.numberOfTaskSlots":"1"}` | (`Map`) Flink properties which are appened to flink-conf.yaml |
20 changes: 17 additions & 3 deletions charts/flink-job/templates/prometheusrule.yaml
@@ -22,7 +22,7 @@ spec:
flink_jobmanager_numRunningJobs{cluster="{{ $cluster }}", namespace="{{ $namespace }}"} < 1
for: 10m
labels:
-severity: info
+severity: {{ .Values.alerts.severity }}
- alert: FlinkJobOutage
annotations:
summary: Flink job is down
@@ -36,7 +36,7 @@ spec:
}[10m]) > 10000
for: 10m
labels:
-severity: info
+severity: {{ .Values.alerts.severity }}
- alert: FlinkJobTooManyRestarts
annotations:
summary: Flink job has too many restarts
@@ -50,5 +50,19 @@ spec:
}[30m]) > 2
for: 10m
labels:
-severity: info
+severity: {{ .Values.alerts.severity }}
+- alert: FlinkCheckpointFailing
+annotations:
+summary: Flink fails to capture the checkpoint.
+runbook_url: "{{ $values.defaults.runbookUrl }}#flinkcheckpointfailing"
+description: >-
+The job manager in {{ template "flink-job-cluster.fullname" . }} fails to capture checkpoint.
+expr: >-
+changes(flink_jobmanager_job_numberOfFailedCheckpoints{
+cluster="{{ $cluster }}",
+namespace="{{ $namespace }}"
+}[10m]) > 0
+for: 10m
+labels:
+severity: {{ .Values.alerts.severity }}
{{- end -}}
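
For orientation, here is a rough sketch of what the new `FlinkCheckpointFailing` rule might render to. It assumes `alerts.severity` is overridden to `warning` and the default `defaults.runbookUrl` is kept. The release name `my-flink-job`, the `cluster` and `namespace` label values, and the indentation are placeholders chosen for illustration; they are not taken from this commit.

```yaml
# Hypothetical rendered rule; names and label values are placeholders.
- alert: FlinkCheckpointFailing
  annotations:
    summary: Flink fails to capture the checkpoint.
    runbook_url: "https://github.com/Nextdoor/k8s-charts/blob/main/charts/flink-job/runbook.md#flinkcheckpointfailing"
    description: >-
      The job manager in my-flink-job fails to capture checkpoint.
  # Fires when the failed-checkpoint counter changed within the last 10 minutes
  # and the condition has held for 10 minutes.
  expr: >-
    changes(flink_jobmanager_job_numberOfFailedCheckpoints{
      cluster="my-cluster",
      namespace="my-namespace"
    }[10m]) > 0
  for: 10m
  labels:
    severity: warning
```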
1 change: 1 addition & 0 deletions charts/flink-job/values.yaml
@@ -198,6 +198,7 @@ savepoints:
alerts:
# -- (Boolean) whether to create the PrometheusRule for this flink cluster
enabled: true
+severity: info

defaults:
# -- (String) Runbook URL for the Prometheus alerts
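
With this change, the severity label on all four alerts comes from a single value, so a deployment could raise it without editing the template. A minimal sketch of an override file follows; the file name `my-values.yaml` and the `warning` level are assumptions for illustration, and valid severity names depend on how your Alertmanager routes alerts.

```yaml
# my-values.yaml (hypothetical override file)
alerts:
  # Keep the PrometheusRule enabled (chart default).
  enabled: true
  # Overrides the chart default of "info" for every alert in the rule.
  severity: warning
```

Such a file would typically be passed to Helm with `-f my-values.yaml` when installing or upgrading the chart.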