|
| 1 | +groups: |
| 2 | + - name: pomerium-console-core-connectivity |
| 3 | + rules: |
| 4 | + - alert: ConsoleGRPCClientErrors |
| 5 | + expr: rate(pomerium_rpc_client_requests_per_rpc_count{rpc_grpc_status_code!="0"}[5m]) > 0.1 |
| 6 | + for: 2m |
| 7 | + labels: |
| 8 | + severity: warning |
| 9 | + component: pomerium-console |
| 10 | + service: grpc |
| 11 | + annotations: |
| 12 | + summary: "High gRPC client error rate in Pomerium Console" |
| 13 | + description: "Console gRPC client error rate is {{ $value }} errors/sec for service {{ $labels.rpc_service }}" |
| 14 | + runbook: | |
| 15 | + 1. Check console logs for gRPC error details |
| 16 | + 2. Verify databroker connectivity and authentication |
| 17 | + 3. Check network connectivity to gRPC services |
| 18 | + 4. Monitor resource utilization |
| 19 | + |
| 20 | + - alert: ConsoleDatabrokerReconcilerErrors |
| 21 | + expr: increase(pomerium_console_databroker_reconciler_Reconcile_failures_total[5m]) > 0 |
| 22 | + for: 2m |
| 23 | + labels: |
| 24 | + severity: warning |
| 25 | + component: pomerium-console |
| 26 | + service: syncer |
| 27 | + annotations: |
| 28 | + summary: "Pomerium Console config reconciler failures" |
| 29 | + description: "{{ $value }} databroker reconciler failures for cluster {{ $labels.cluster }} {{ $labels.component }} in the last 5 minutes" |
| 30 | + runbook: | |
| 31 | + 1. Find the root cause of the failure in the logs or traces by filtering for `pomerium_console_databroker_reconciler` calls |
| 32 | + 2. If the failure is tied to a specific entity, review the configuration validation errors to identify which entity is causing it |
| 33 | + |
| 34 | + - alert: ConsoleDatabrokerReconcilerHighLatency |
| 35 | + expr: histogram_quantile(0.95, rate(pomerium_console_databroker_reconciler_Reconcile_duration_bucket{component="config-syncer"}[5m])) > 30 |
| 36 | + for: 3m |
| 37 | + labels: |
| 38 | + severity: warning |
| 39 | + component: pomerium-console |
| 40 | + service: syncer |
| 41 | + annotations: |
| 42 | + summary: "High databroker reconciler latency for config syncer" |
| 43 | + description: "95th percentile databroker reconciler duration is {{ $value }}s" |
| 44 | + runbook: | |
| 45 | + 1. Check database performance and query times using OTEL Tracing to identify the root cause |
| 46 | + 2. Review databroker performance and connectivity |
| 47 | + |
| 48 | + - alert: ConsoleDatabrokerReconcilerMissing |
| 49 | + expr: | |
| 50 | + count by (cluster_id) ( |
| 51 | + count by (cluster_id, component) ( |
| 52 | + pomerium_console_databroker_reconciler_ReconcileLoop{component=~"config-syncer|service-account-syncer"} |
| 53 | + )) < 2 |
| 54 | + for: 2m |
| 55 | + labels: |
| 56 | + severity: critical |
| 57 | + component: pomerium-console |
| 58 | + service: syncer |
| 59 | + annotations: |
| 60 | + summary: "Some databroker reconciler components are not running" |
| 61 | + description: "Only {{ $value }} out of 2 expected databroker reconciler components are running for cluster {{ $labels.cluster_id }}" |
| 62 | + runbook: | |
| 63 | + 1. Check console logs for reconciler startup errors |
| 64 | + 2. Verify databroker connectivity for the affected cluster |
| 65 | + |
| 66 | + - name: pomerium-console-external-data-sources |
| 67 | + rules: |
| 68 | + - alert: ExternalDataSourceTaskFailures |
| 69 | + expr: increase(pomerium_console_datasource_task_calls_failures_total[5m]) > 0 |
| 70 | + for: 1m |
| 71 | + labels: |
| 72 | + severity: warning |
| 73 | + component: pomerium-console |
| 74 | + service: external-data-sources |
| 75 | + annotations: |
| 76 | + summary: "External data source task failures" |
| 77 | + description: "{{ $value }} task failures for external data source {{ $labels.id }} {{ $labels.record_type }}" |
| 78 | + runbook: | |
| 79 | + 1. Check the logs for exact error messages related to the task failures |
| 80 | + 2. Check the external data source URL accessibility |
| 81 | + 3. Verify authentication credentials and headers |
| 82 | + 4. Check network connectivity and DNS resolution |
| 83 | + |
| 84 | + - alert: ExternalDataSourceHighLatency |
| 85 | + expr: histogram_quantile(0.95, rate(pomerium_console_datasource_task_duration_bucket[5m])) > 30000 |
| 86 | + for: 3m |
| 87 | + labels: |
| 88 | + severity: warning |
| 89 | + component: pomerium-console |
| 90 | + service: external-data-sources |
| 91 | + annotations: |
| 92 | + summary: "High external data source task latency" |
| 93 | + description: "95th percentile task duration is {{ $value }}ms for data source {{ $labels.id }} {{ $labels.record_type }}" |
| 94 | + runbook: | |
| 95 | + 1. Some external data sources are inherently slow because they call remote APIs; consider adjusting the polling interval to give them time to process the data |
| 96 | + |
| 97 | + - alert: ExternalDataSourceHTTPErrors |
| 98 | + expr: rate(pomerium_console_ext_data_source_requests_completed{code!~"200|304"}[5m]) > 0.1 |
| 99 | + for: 2m |
| 100 | + labels: |
| 101 | + severity: warning |
| 102 | + component: pomerium-console |
| 103 | + service: external-data-sources |
| 104 | + annotations: |
| 105 | + summary: "High HTTP error rate for external data sources" |
| 106 | + description: "{{ $value }} HTTP errors/sec for data source {{ $labels.data_source_id }}" |
| 107 | + runbook: | |
| 108 | + 1. Check external data source endpoint health |
| 109 | + 2. Verify authentication and authorization |
| 110 | + 3. Check for rate limiting from external endpoints |
| 111 | + |
| 112 | + - alert: ExternalDataSourceHighRequestLatency |
| 113 | + expr: histogram_quantile(0.95, rate(pomerium_console_ext_data_source_request_duration_seconds_bucket[5m])) > 10 |
| 114 | + for: 3m |
| 115 | + labels: |
| 116 | + severity: warning |
| 117 | + component: pomerium-console |
| 118 | + service: external-data-sources |
| 119 | + annotations: |
| 120 | + summary: "High HTTP request latency for external data sources" |
| 121 | + description: "95th percentile HTTP request duration is {{ $value }}s for data source {{ $labels.data_source_id }}" |
| 122 | + runbook: | |
| 123 | + 1. Check external endpoint response times |
| 124 | + |
| 125 | + - name: pomerium-console-resources |
| 126 | + rules: |
| 127 | + - alert: HighMemoryUsage |
| 128 | + expr: go_memstats_alloc_bytes / go_memstats_sys_bytes * 100 > 85 |
| 129 | + for: 5m |
| 130 | + labels: |
| 131 | + severity: warning |
| 132 | + component: pomerium-console |
| 133 | + service: resources |
| 134 | + annotations: |
| 135 | + summary: "High memory usage for Pomerium Console" |
| 136 | + description: "Memory usage is {{ $value }}%" |
| 137 | + runbook: | |
| 138 | + 1. Consider increasing memory limits |
| 139 | + |
| 140 | + - alert: HighGoroutineCount |
| 141 | + expr: go_goroutines > 10000 |
| 142 | + for: 3m |
| 143 | + labels: |
| 144 | + severity: warning |
| 145 | + component: pomerium-console |
| 146 | + service: resources |
| 147 | + annotations: |
| 148 | + summary: "High goroutine count" |
| 149 | + description: "{{ $value }} goroutines are running" |
| 150 | + runbook: | |
| 151 | + 1. Contact Pomerium support if this is unexpected (i.e. no active requests to the console) |
| 152 | + |
| 153 | + - alert: GCPressure |
| 154 | + expr: rate(go_gc_duration_seconds_sum[5m]) > 0.1 |
| 155 | + for: 3m |
| 156 | + labels: |
| 157 | + severity: warning |
| 158 | + component: pomerium-console |
| 159 | + service: resources |
| 160 | + annotations: |
| 161 | + summary: "High garbage collection pressure" |
| 162 | + description: "GC duration rate is {{ $value }}s/s" |
| 163 | + runbook: | |
| 164 | + 1. Contact Pomerium support if this is unexpected (i.e. no active requests to the console) |
| 165 | + |
| 166 | + - name: pomerium-console-database |
| 167 | + rules: |
| 168 | + - alert: DatabaseConnectionErrors |
| 169 | + expr: rate(pomerium_db_client_operation_duration_count{pgx_operation_type="connect"}[5m]) - rate(pomerium_db_client_operation_duration_bucket{pgx_operation_type="connect",le="+Inf"}[5m]) > 0.1 |
| 170 | + for: 2m |
| 171 | + labels: |
| 172 | + severity: critical |
| 173 | + component: pomerium-console |
| 174 | + service: database |
| 175 | + annotations: |
| 176 | + summary: "Database connection errors" |
| 177 | + description: "{{ $value }} database connection errors/sec" |
| 178 | + runbook: | |
| 179 | + 1. Check database connectivity and authentication |
| 180 | + 2. Verify database service availability |
| 181 | + |
| 182 | + - alert: SlowDatabaseQueries |
| 183 | + expr: histogram_quantile(0.95, rate(pomerium_db_client_operation_duration_bucket{pgx_operation_type="query"}[5m])) > 5000 |
| 184 | + for: 3m |
| 185 | + labels: |
| 186 | + severity: warning |
| 187 | + component: pomerium-console |
| 188 | + service: database |
| 189 | + annotations: |
| 190 | + summary: "Slow database queries" |
| 191 | + description: "95th percentile query time is {{ $value }}ms" |
| 192 | + runbook: | |
| 193 | + 1. Make sure the database is colocated with the Pomerium Console |
| 194 | + 2. Check database instance size and consider scaling if necessary |
| 195 | + 3. Review individual queries via OTEL Tracing to identify slow queries |
| 196 | + |