Commit 74c19d2

Enhance windows observ lib with detailed alerts and add promtool tests. (#1408)
* Add Check for unexpected changes in generated files
* add kafka and snmp generated files
* Fix kafka dashboards CI check
* Add to docker-mixin
* Add golang-observ-lib
* Add jvm-observ-lib
* Add windows
* Add process lib
* Add zookeeper
* Add to csp-mixin check
* Fix process-dashboards.json diff
* Regenerate golang
* Regenerate windows
* Update ci
* Update README
* Update snmp Makefile
* Add common Makefile
* Add activemq
* Add apache mixins
* add confluent
* Add docker-mixin
* Add csp-mixin
* Update goruntime
* Update f5
* Update grafana-agent
* Simplify Makefile to use common mixin Makefile
* Add grafana-builder dependency
* Add envoy
* Add harbor
* Add hass-mixin
* Add influx
* Add haproxy
* Add istio
* Add multiple mixins: IBM MQ, Istio, Jaeger, Jenkins, Jira, and JVM
* Update kafka-mixin
* Add multiple mixins: Couchbase, Discourse, Elasticsearch, GitLab, Kubescape, Memcached, Microsoft IIS, Minio, MongoDB (Atlas and standard), MSSQL, and NGINX
* Add Spark mixin
* Add multiple mixins: Node.js, Nomad, OpenLDAP, OpenSearch, OpenStack, OracleDB, PgBouncer, Presto, Python Runtime, RabbitMQ, Rclone, Redis (Enterprise and standard), Ruby, SAP HANA, and Spring Boot
* Add multiple mixins: Supabase, Tensorflow, Traefik, Vault, Velero, Windows, WSO2 Enterprise and Streaming Integrators
* add snmp-mixin
* Remove unrelated links from snmp-mixin
* set editable: false
* Fix jvm
* Update spring
* make fmt camel
* Fix redis mixin linter
* Fix lint for golang
* Update lint for kafka
* update oracle
* Update nomad
* Update haproxy
* update envoy
* Update jaeger
* update ceph linter
* Fix for asterisk
* Fix .lint
* Update oracle
* Fix cilium
* Update install-ci-deps
* Reset minio
* Add .gitattributes
* fix haproxy
* Update gitattributes
* Fix mixin format
* Fix nomad lint
* Fix .lint
* Update makefile
* fix typo in snmp
* Update lint error message
* Add newline
* Update mixtool to main
* Update snmp
* Fix
* Fix snmp
* Add conditional promtool tests
* Add windows alerts expanded descriptions
* Add promtool install in ci
* Update README
* Update alerts
1 parent 362e410 commit 74c19d2

7 files changed: +117 -8 lines changed

.github/workflows/lint-mixins.yml

Lines changed: 3 additions & 0 deletions
@@ -85,3 +85,6 @@ jobs:
         working-directory: ./${{ matrix.mixin }}
         run: "make && git diff --exit-code || ( echo 'Error: Generated files are not up to date. Run make and commit the local diff'; exit 1; )"

+      - name: Run promtool test
+        working-directory: ./${{ matrix.mixin }}
+        run: make test

Makefile

Lines changed: 28 additions & 1 deletion
@@ -1,13 +1,40 @@
 JSONNET_FMT := jsonnetfmt -n 2 --max-blank-lines 2 --string-style s --comment-style s
 SHELL := /bin/bash
+PROMTOOL_VERSION := 3.2.0

-install-ci-deps:
+# Detect OS and architecture
+UNAME_S := $(shell uname -s | tr '[:upper:]' '[:lower:]')
+UNAME_M := $(shell uname -m)
+
+# Map architecture names
+ifeq ($(UNAME_M),x86_64)
+    ARCH := amd64
+else ifeq ($(UNAME_M),aarch64)
+    ARCH := arm64
+else ifeq ($(UNAME_M),armv7l)
+    ARCH := armv7
+else
+    ARCH := $(UNAME_M)
+endif
+
+install-ci-deps: install-promtool
 	go install github.com/google/go-jsonnet/cmd/jsonnet@v0.20.0
 	go install github.com/google/go-jsonnet/cmd/jsonnetfmt@v0.20.0
 	go install github.com/google/go-jsonnet/cmd/jsonnet-lint@v0.20.0
 	go install github.com/monitoring-mixins/mixtool/cmd/mixtool@main
 	go install github.com/jsonnet-bundler/jsonnet-bundler/cmd/jb@v0.5.1

+.PHONY: install-promtool
+install-promtool:
+	@echo "Installing promtool $(PROMTOOL_VERSION) for $(UNAME_S)/$(ARCH)..."
+	@wget https://github.com/prometheus/prometheus/releases/download/v$(PROMTOOL_VERSION)/prometheus-$(PROMTOOL_VERSION).$(UNAME_S)-$(ARCH).tar.gz
+	@mkdir -p prometheus-$(PROMTOOL_VERSION).$(UNAME_S)-$(ARCH)
+	@tar xvf prometheus-$(PROMTOOL_VERSION).$(UNAME_S)-$(ARCH).tar.gz
+	@sudo mv prometheus-$(PROMTOOL_VERSION).$(UNAME_S)-$(ARCH)/promtool /usr/local/bin/
+	@rm -rf prometheus-$(PROMTOOL_VERSION).$(UNAME_S)-$(ARCH) prometheus-$(PROMTOOL_VERSION).$(UNAME_S)-$(ARCH).tar.gz
+	@echo "promtool $(PROMTOOL_VERSION) installed successfully"
+
+
 fmt:
 	@find . -name '*.libsonnet' -print -o -name '*.jsonnet' -print | \
 		xargs -n 1 -- $(JSONNET_FMT) -i
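
The OS and architecture detection in this Makefile can be tried outside of make. The following is a minimal shell sketch of the same mapping, not code from the commit: `map_arch` and `promtool_url` are hypothetical helper names, and the URL pattern is copied from the `install-promtool` recipe. On hosts where `uname -m` already reports `arm64` (assumed here to include Apple Silicon macOS), the fallback branch passes the value through unchanged.

```shell
# Hypothetical helpers mirroring the Makefile's ifeq chain.
map_arch() {
  case "$1" in
    x86_64)  echo amd64 ;;
    aarch64) echo arm64 ;;
    armv7l)  echo armv7 ;;
    *)       echo "$1" ;;   # fallback: use the raw `uname -m` value
  esac
}

# Build the release tarball URL from version, OS and architecture.
promtool_url() {
  version=$1 os=$2 arch=$3
  echo "https://github.com/prometheus/prometheus/releases/download/v${version}/prometheus-${version}.${os}-${arch}.tar.gz"
}

# Show the URL that would be downloaded on the current machine.
os=$(uname -s | tr '[:upper:]' '[:lower:]')
arch=$(map_arch "$(uname -m)")
promtool_url 3.2.0 "$os" "$arch"
```

Printing the URL before downloading is a cheap way to check that the computed OS/architecture pair corresponds to a published release asset.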

Makefile_mixin

Lines changed: 9 additions & 1 deletion
@@ -1,7 +1,7 @@
 JSONNET_FMT := jsonnetfmt -n 2 --max-blank-lines 1 --string-style s --comment-style s

 .PHONY: all
-all: build dashboards_out prometheus_alerts.yaml prometheus_rules.yaml
+all: build dashboards_out prometheus_alerts.yaml prometheus_rules.yaml test

 vendor: $(wildcard jsonnetfile.json)
 ifneq ("$(wildcard jsonnetfile.json)","")
@@ -42,6 +42,14 @@ prometheus_rules.yaml: $(wildcard vendor/**/rules.yaml) $(wildcard vendor/**/rul
 		touch prometheus_rules_out/prometheus_rules.yaml; \
 	fi

+.PHONY: test
+test:
+	@if [ -f tests/prometheus_*.yaml ]; then \
+		promtool test rules tests/prometheus_*.yaml; \
+	else \
+		echo "No tests/prometheus_*.yaml files found, skipping promtool test."; \
+	fi
+
 .PHONY: deploy deploy_rules deploy_dashboards
 deploy: deploy_rules deploy_dashboards
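
One caveat about the `test` target: `[ -f tests/prometheus_*.yaml ]` behaves as intended when the glob matches zero or one file. If a library ships several test files, the glob expands to multiple arguments, `[` typically fails with a usage error, and the `else` branch reports that no tests were found. The sketch below shows one way to handle any number of matches. It is not part of the commit; `run_rule_tests` and the overridable `PROMTOOL` variable are hypothetical names introduced for illustration.

```shell
# Hypothetical helper: run promtool over every matching test file,
# or skip cleanly when there are none.
PROMTOOL="${PROMTOOL:-promtool}"   # overridable, e.g. PROMTOOL=echo for a dry run

# The body runs in a subshell, so `dir` and the positional parameters
# do not leak into the caller.
run_rule_tests() (
  dir="${1:-tests}"
  # Expand the glob into the positional parameters. If nothing matches,
  # the pattern stays literal and the -e check below fails.
  set -- "$dir"/prometheus_*.yaml
  if [ -e "$1" ]; then
    "$PROMTOOL" test rules "$@"
  else
    echo "No $dir/prometheus_*.yaml files found, skipping promtool test."
  fi
)
```

Calling `run_rule_tests tests` from a library root would mirror `make test`; setting `PROMTOOL=echo` prints the command line instead of running it.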

README.md

Lines changed: 8 additions & 2 deletions
@@ -26,8 +26,6 @@ Based on format described [here](https://monitoring.mixins.dev/):
 * [`caddy-mixin`](caddy-mixin/): A set of reusable and extensible dashboards
   for Caddy.

-
-
 * [`jira-mixin`](jira-mixin/): A set of reusable and extensible dashboards and alerts for JIRA.

 You can find more in directories with `-mixin` suffix.
@@ -54,6 +52,14 @@ Examples:
 - [golang-observ-lib](golang-observ-lib/)
 - [csp-mixin](csp-mixin/)

+## Prometheus rules testing for monitoring mixins and observability libraries
+
+It is highly recommended to test Prometheus alerts with the [promtool test rules](https://prometheus.io/docs/prometheus/latest/configuration/unit_testing_rules) command when complex PromQL queries are used or when additional queries are used in alert annotations.
+
+promtool test files should be placed in the `tests` directory at the root of the library and named `prometheus_*.yaml`. This enables running the tests in GitHub Actions and with the `make test` command.
+
+A good example of promtool tests can be found in windows-observ-lib: [prometheus_alerts_test.yaml](windows-observ-lib/tests/prometheus_alerts_test.yaml)
+
 ## LICENSE

 [Apache-2.0](LICENSE)

windows-observ-lib/alerts.libsonnet

Lines changed: 10 additions & 2 deletions
@@ -109,7 +109,11 @@
             annotations: {
               summary: 'High memory usage on Windows host.',
               description: |||
-                Memory usage on host {{ $labels.instance }} is above %(alertMemoryUsageThresholdCritical)s%%. The current value is {{ $value | printf "%%.2f" }}%%.
+                Memory usage on host {{ $labels.instance }} is critically high, with {{ printf "%%.2f" $value }}%% of total memory used.
+                This exceeds the threshold of %(alertMemoryUsageThresholdCritical)s%%.
+                Current memory free: {{ with printf `windows_os_physical_memory_free_bytes{%(filteringSelector)s}` | query | first | value | humanize }}{{ . }}{{ end }}.
+                Total memory: {{ with printf `windows_cs_physical_memory_bytes{%(filteringSelector)s}` | query | first | value | humanize }}{{ . }}{{ end }}.
+                Consider investigating processes consuming high memory or increasing available memory.
               ||| % this.config,
             },
           },
@@ -126,7 +130,11 @@
             annotations: {
               summary: 'Disk is almost full on Windows host.',
               description: |||
-                Volume {{ $labels.volume }} is almost full on host {{ $labels.instance }}, more than %(alertDiskUsageThresholdCritical)s%% of space is used. The current volume utilization is {{ $value | printf "%%.2f" }}%%.
+                Disk space on volume {{ $labels.volume }} of host {{ $labels.instance }} is critically low, with {{ printf "%%.2f" $value }}%% of total space used.
+                This exceeds the threshold of %(alertDiskUsageThresholdCritical)s%%.
+                Current disk free: {{ with printf `windows_logical_disk_free_bytes{volume="%%s", %(filteringSelector)s}` $labels.volume | query | first | value | humanize }}{{ . }}{{ end }}.
+                Total disk size: {{ with printf `windows_logical_disk_size_bytes{volume="%%s", %(filteringSelector)s}` $labels.volume | query | first | value | humanize }}{{ . }}{{ end }}.
+                Consider cleaning up unnecessary files or increasing disk capacity.
               ||| % this.config,
             },
           },

windows-observ-lib/prometheus_rules_out/prometheus_alerts.yaml

Lines changed: 10 additions & 2 deletions
Generated file; diff not rendered.

windows-observ-lib/tests/prometheus_alerts_test.yaml

Lines changed: 49 additions & 0 deletions
@@ -0,0 +1,49 @@
1+
rule_files:
2+
- ../prometheus_rules_out/prometheus_alerts.yaml
3+
4+
evaluation_interval: 15m
5+
6+
tests:
7+
- interval: 1m
8+
input_series:
9+
- series: 'windows_os_physical_memory_free_bytes{instance="host1"}'
10+
values: '10000000x15'
11+
- series: 'windows_cs_physical_memory_bytes{instance="host1"}'
12+
values: '1000000000x15'
13+
- series: 'windows_logical_disk_free_bytes{volume="C:", instance="host1"}'
14+
values: '10000000x15'
15+
- series: 'windows_logical_disk_size_bytes{volume="C:", instance="host1"}'
16+
values: '1000000000x15'
17+
18+
19+
alert_rule_test:
20+
- eval_time: 15m
21+
alertname: WindowsMemoryHighUtilization
22+
exp_alerts:
23+
- exp_labels:
24+
severity: critical
25+
instance: host1
26+
exp_annotations:
27+
description: |
28+
Memory usage on host host1 is critically high, with 99.00% of total memory used.
29+
This exceeds the threshold of 90%.
30+
Current memory free: 10M.
31+
Total memory: 1G.
32+
Consider investigating processes consuming high memory or increasing available memory.
33+
summary: 'High memory usage on Windows host.'
34+
35+
- eval_time: 15m
36+
alertname: WindowsDiskAlmostOutOfSpace
37+
exp_alerts:
38+
- exp_labels:
39+
severity: critical
40+
instance: host1
41+
volume: "C:"
42+
exp_annotations:
43+
description: |
44+
Disk space on volume C: of host host1 is critically low, with 99.00% of total space used.
45+
This exceeds the threshold of 90%.
46+
Current disk free: 10M.
47+
Total disk size: 1G.
48+
Consider cleaning up unnecessary files or increasing disk capacity.
49+
summary: 'Disk is almost full on Windows host.'
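
The expected annotation values in this test follow from the input series by simple arithmetic. The shell sketch below reproduces them under two stated assumptions: the alert expressions compute usage as `100 - free / total * 100` (the rule expressions themselves are not shown in this diff), and the `humanize` template function scales by powers of 1000 and formats with four significant digits. `percent_used` and this `humanize` are hypothetical helpers that cover only values of 1 or more; they are not the Prometheus implementation.

```shell
# Hypothetical helper: percentage used, given free and total bytes.
percent_used() {
  awk -v free="$1" -v total="$2" 'BEGIN { printf "%.2f\n", 100 - free / total * 100 }'
}

# Hypothetical helper: SI-prefix formatting for values >= 1,
# approximating the `humanize` template function.
humanize() {
  awk -v v="$1" 'BEGIN {
    n = split("k M G T P E Z Y", p, " ")
    prefix = ""
    # Divide by 1000 until the value drops below 1000, tracking the prefix.
    for (i = 1; i <= n && v >= 1000; i++) { v /= 1000; prefix = p[i] }
    printf "%.4g%s\n", v, prefix
  }'
}

percent_used 10000000 1000000000   # prints 99.00
humanize 10000000                  # prints 10M
humanize 1000000000                # prints 1G
```

With 10,000,000 bytes free out of 1,000,000,000, usage is 99.00%, which is above the 90% threshold quoted in the expected descriptions, so both alerts are expected to fire at `eval_time: 15m`.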
