Commit aa42c72

add console metrics, gen config map from the rule files for k8s deployment (#1951)
1 parent 8a7089f commit aa42c72

3 files changed, +216 -2 lines changed
Lines changed: 196 additions & 0 deletions
@@ -0,0 +1,196 @@
+groups:
+  - name: pomerium-console-core-connectivity
+    rules:
+      - alert: ConsoleGRPCClientErrors
+        expr: rate(pomerium_rpc_client_requests_per_rpc_bucket{rpc_grpc_status_code!="0"}[5m]) > 0.1
+        for: 2m
+        labels:
+          severity: warning
+          component: pomerium-console
+          service: grpc
+        annotations:
+          summary: "High GRPC client error rate in Pomerium Console"
+          description: "Console GRPC client error rate is {{ $value }} errors/sec for service {{ $labels.rpc_service }}"
+          runbook: |
+            1. Check console logs for GRPC error details
+            2. Verify databroker connectivity and authentication
+            3. Check network connectivity to GRPC services
+            4. Monitor resource utilization
+
+      - alert: ConsoleDatabrokerReconcilerErrors
+        expr: increase(pomerium_console_databroker_reconciler_Reconcile_failures_total[5m]) > 0
+        for: 2m
+        labels:
+          severity: warning
+          component: pomerium-console
+          service: syncer
+        annotations:
+          summary: "Pomerium Console config reconciler failures"
+          description: "{{ $value }} databroker reconciler failures for cluster {{ $labels.cluster }} {{ $labels.component }} in the last 5 minutes"
+          runbook: |
+            1. Check the root cause of the failure in the logs or traces by filtering for `pomerium_console_databroker_reconciler` calls
+            2. If the failure is due to a specific entity, review configuration validation errors to understand which entity is causing the issue
+
+      - alert: ConsoleDatabrokerReconcilerHighLatency
+        expr: histogram_quantile(0.95, rate(pomerium_console_databroker_reconciler_Reconcile_duration_bucket{component="config-syncer"}[5m])) > 30
+        for: 3m
+        labels:
+          severity: warning
+          component: pomerium-console
+          service: syncer
+        annotations:
+          summary: "High databroker reconciler latency for config syncer"
+          description: "95th percentile databroker reconciler duration is {{ $value }}s"
+          runbook: |
+            1. Check database performance and query times using OTEL Tracing to identify the root cause
+            2. Review databroker performance and connectivity
+
+      - alert: ConsoleDatabrokerReconcilerMissing
+        expr: |
+          (
+            count by (cluster_id) (pomerium_console_databroker_reconciler_ReconcileLoop{component="config-syncer"}) +
+            count by (cluster_id) (pomerium_console_databroker_reconciler_ReconcileLoop{component="service-account-syncer"})
+          ) < 2
+        for: 2m
+        labels:
+          severity: critical
+          component: pomerium-console
+          service: syncer
+        annotations:
+          summary: "Some databroker reconciler components are not running"
+          description: "Only {{ $value }} out of 2 expected databroker reconciler components are running for cluster {{ $labels.cluster_id }}"
+          runbook: |
+            1. Check console logs for reconciler startup errors
+            2. Verify databroker connectivity for the affected cluster
+
+  - name: pomerium-console-external-data-sources
+    rules:
+      - alert: ExternalDataSourceTaskFailures
+        expr: increase(pomerium_console_datasource_task_calls_failures_total[5m]) > 0
+        for: 1m
+        labels:
+          severity: warning
+          component: pomerium-console
+          service: external-data-sources
+        annotations:
+          summary: "External data source task failures"
+          description: "{{ $value }} task failures for external data source {{ $labels.id }} {{ $labels.record_type }}"
+          runbook: |
+            1. Check the logs for exact error messages related to the task failures
+            2. Check the external data source URL accessibility
+            3. Verify authentication credentials and headers
+            4. Check network connectivity and DNS resolution
+
+      - alert: ExternalDataSourceHighLatency
+        expr: histogram_quantile(0.95, rate(pomerium_console_datasource_task_duration_bucket[5m])) > 30000
+        for: 3m
+        labels:
+          severity: warning
+          component: pomerium-console
+          service: external-data-sources
+        annotations:
+          summary: "High external data source task latency"
+          description: "95th percentile task duration is {{ $value }}ms for data source {{ $labels.id }} {{ $labels.record_type }}"
+          runbook: |
+            1. Some external data sources may have high latency due to their nature of calling remote APIs and you may need to adjust the polling intervals to let them process the data
+
+      - alert: ExternalDataSourceHTTPErrors
+        expr: rate(pomerium_console_ext_data_source_requests_completed{code!~"200|304"}[5m]) > 0.1
+        for: 2m
+        labels:
+          severity: warning
+          component: pomerium-console
+          service: external-data-sources
+        annotations:
+          summary: "High HTTP error rate for external data sources"
+          description: "{{ $value }} HTTP errors/sec for data source {{ $labels.data_source_id }}"
+          runbook: |
+            1. Check external data source endpoint health
+            2. Verify authentication and authorization
+            3. Check for rate limiting from external endpoints
+
+      - alert: ExternalDataSourceHighRequestLatency
+        expr: histogram_quantile(0.95, rate(pomerium_console_ext_data_source_request_duration_seconds_bucket[5m])) > 10
+        for: 3m
+        labels:
+          severity: warning
+          component: pomerium-console
+          service: external-data-sources
+        annotations:
+          summary: "High HTTP request latency for external data sources"
+          description: "95th percentile HTTP request duration is {{ $value }}s for data source {{ $labels.data_source_id }}"
+          runbook: |
+            1. Check external endpoint response times
+
+  - name: pomerium-console-resources
+    rules:
+      - alert: HighMemoryUsage
+        expr: go_memstats_alloc_bytes / go_memstats_sys_bytes * 100 > 85
+        for: 5m
+        labels:
+          severity: warning
+          component: pomerium-console
+          service: resources
+        annotations:
+          summary: "High memory usage for Pomerium Console"
+          description: "Memory usage is {{ $value }}%"
+          runbook: |
+            1. Consider increasing memory limits
+
+      - alert: HighGoroutineCount
+        expr: go_goroutines > 10000
+        for: 3m
+        labels:
+          severity: warning
+          component: pomerium-console
+          service: resources
+        annotations:
+          summary: "High goroutine count"
+          description: "{{ $value }} goroutines are running"
+          runbook: |
+            1. Contact Pomerium support if this is unexpected (i.e. no active requests to the console)
+
+      - alert: GCPressure
+        expr: rate(go_gc_duration_seconds[5m]) > 0.1
+        for: 3m
+        labels:
+          severity: warning
+          component: pomerium-console
+          service: resources
+        annotations:
+          summary: "High garbage collection pressure"
+          description: "GC duration rate is {{ $value }}s/s"
+          runbook: |
+            1. Contact Pomerium support if this is unexpected (i.e. no active requests to the console)
+
+  - name: pomerium-console-database
+    rules:
+      - alert: DatabaseConnectionErrors
+        expr: rate(pomerium_db_client_operation_duration_count{pgx_operation_type="connect"}[5m]) - rate(pomerium_db_client_operation_duration_bucket{pgx_operation_type="connect",le="+Inf"}[5m]) > 0.1
+        for: 2m
+        labels:
+          severity: critical
+          component: pomerium-console
+          service: database
+        annotations:
+          summary: "Database connection errors"
+          description: "{{ $value }} database connection errors/sec"
+          runbook: |
+            1. Check database connectivity and authentication
+            2. Verify database service availability
+
+      - alert: SlowDatabaseQueries
+        expr: histogram_quantile(0.95, rate(pomerium_db_client_operation_duration_bucket{pgx_operation_type="query"}[5m])) > 5000
+        for: 3m
+        labels:
+          severity: warning
+          component: pomerium-console
+          service: database
+        annotations:
+          summary: "Slow database queries"
+          description: "95th percentile query time is {{ $value }}ms"
+          runbook: |
+            1. Make sure the database is colocated with the Pomerium Console
+            2. Check database instance size and consider scaling if necessary
+            3. Review individual queries via OTEL Tracing to identify slow queries
+
Lines changed: 13 additions & 0 deletions
@@ -0,0 +1,13 @@
+apiVersion: kustomize.config.k8s.io/v1beta1
+kind: Kustomization
+
+generatorOptions:
+  disableNameSuffixHash: true
+
+configMapGenerator:
+  - name: prometheus-alerts-rule-files
+    files:
+      - pomerium-alerts.yml
+      - envoy-alerts.yml
+      - upstream-alerts.yml
+      - console-alerts.yml
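
Note (illustrative, not part of this commit): with `disableNameSuffixHash: true`, building this kustomization should produce a single ConfigMap whose name has no content-hash suffix and whose keys are the rule file names. A rough sketch of the expected shape follows; the file bodies are replaced by placeholder comments, and the exact field ordering of real output may differ.

```yaml
# Approximate result of running `kubectl kustomize <dir>` on the kustomization above.
# The rule file bodies are abridged to placeholders for illustration.
apiVersion: v1
kind: ConfigMap
metadata:
  name: prometheus-alerts-rule-files   # stable name, no hash suffix appended
data:
  console-alerts.yml: |
    # full contents of console-alerts.yml
  envoy-alerts.yml: |
    # full contents of envoy-alerts.yml
  pomerium-alerts.yml: |
    # full contents of pomerium-alerts.yml
  upstream-alerts.yml: |
    # full contents of upstream-alerts.yml
```

Keeping the name stable lets other manifests refer to `prometheus-alerts-rule-files` directly. The trade-off is that a change to a rule file no longer changes the ConfigMap name, so it will not by itself trigger a rollout of workloads that mount it.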

content/examples/prometheus/prometheus.yml

Lines changed: 7 additions & 2 deletions
@@ -4,8 +4,13 @@ global:
 rule_files:
   - "upstream-alerts.yml"
   - "pomerium-alerts.yml"
+  - "envoy-alerts.yml"
+  - "console-alerts.yml"
 
 scrape_configs:
-  - job_name: 'prometheus'
+  - job_name: 'pomerium-core'
     static_configs:
-      - targets: ['localhost:9098']
+      - targets: ['localhost:9090']
+  - job_name: 'pomerium-console'
+    static_configs:
+      - targets: ['localhost:9092']
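
Note (illustrative, not part of this commit): Prometheus resolves relative `rule_files` entries against the directory that contains the configuration file, so in a Kubernetes deployment the four rule files need to sit next to `prometheus.yml`. One possible way to arrange that is a projected volume that merges the generated `prometheus-alerts-rule-files` ConfigMap with a ConfigMap holding `prometheus.yml`. In the sketch below, the `prometheus-config` ConfigMap, the container name, the image, and the mount path are all assumptions, not taken from the commit.

```yaml
# Hypothetical Deployment fragment. Only the ConfigMap name
# prometheus-alerts-rule-files comes from the commit; everything else is assumed.
spec:
  template:
    spec:
      containers:
        - name: prometheus                      # assumed container name
          image: prom/prometheus                # assumed image
          args:
            - --config.file=/etc/prometheus/prometheus.yml
          volumeMounts:
            - name: prometheus-files
              mountPath: /etc/prometheus        # config and rule files share one directory
      volumes:
        - name: prometheus-files
          projected:
            sources:
              - configMap:
                  name: prometheus-config       # assumed to contain prometheus.yml
              - configMap:
                  name: prometheus-alerts-rule-files
```

A projected volume is used here because two separate ConfigMap volumes cannot share the same mount path. Mounting the rule files in a subdirectory and adjusting the `rule_files` paths would be an equally reasonable alternative.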
