Description
Currently, Alertmanager exposes metrics covering various aspects of receiver behavior (lines 256 to 281 at commit f30aef2).
These are broken down by an `integration` label, which indicates the type of receiver (`pagerduty`, `webhook`, etc.).
The docs recommend extending Alertmanager with the `webhook` receiver (https://prometheus.io/docs/operating/integrations/#alertmanager-webhook-receiver) rather than adding new integrations. I understand why, since it limits the maintenance burden on the main repo, but it means that many different receivers all end up reported as the `webhook` integration.
Because the receiver metrics are only broken down by integration type, it's impossible to tell different webhooks apart. For example, we have various webhooks that go to JIRA, Chat, and other internal systems. With a single `webhook` series for all of them, if one of those receivers falls over, the metrics alone cannot tell us which one it was.
I would like to propose a couple of solutions for this:
a) include the receiver name as a secondary label to the above metrics
b) include the hostname/port of the webhook either as another label, or replacing/extending the existing integration label
Either change would allow more accurate alerting on receiver issues.
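To illustrate option (a), here is a minimal, standard-library-only sketch of a counter keyed by both labels. It is not Alertmanager's code and does not use the Prometheus client library; the `failureCounter` type, the `receiver` field, and the receiver names `jira` and `chat` are all hypothetical, chosen only to show how a second label separates webhooks that currently share one series.

```go
package main

import "fmt"

// labels identifies one time series. The receiver field is the extra
// label proposed in option (a); all names here are illustrative only.
type labels struct {
	integration string // type of receiver, e.g. "webhook" or "pagerduty"
	receiver    string // hypothetical: receiver name from the configuration
}

// failureCounter is a stand-in for a labeled failed-notifications counter.
type failureCounter struct {
	counts map[labels]int
}

func newFailureCounter() *failureCounter {
	return &failureCounter{counts: make(map[labels]int)}
}

// inc records one failed notification for the given label pair.
func (c *failureCounter) inc(integration, receiver string) {
	c.counts[labels{integration: integration, receiver: receiver}]++
}

// get returns the count for one specific receiver.
func (c *failureCounter) get(integration, receiver string) int {
	return c.counts[labels{integration: integration, receiver: receiver}]
}

// byIntegration sums over the receiver label, which is the only view
// available when metrics carry the integration label alone.
func (c *failureCounter) byIntegration(integration string) int {
	total := 0
	for l, n := range c.counts {
		if l.integration == integration {
			total += n
		}
	}
	return total
}

func main() {
	c := newFailureCounter()
	c.inc("webhook", "jira")
	c.inc("webhook", "jira")
	c.inc("webhook", "chat")

	// Integration-only view: failures exist, but not which webhook failed.
	fmt.Println("webhook total:", c.byIntegration("webhook"))
	// Per-receiver view: the failing webhook can be singled out.
	fmt.Println("webhook/jira:", c.get("webhook", "jira"))
	fmt.Println("webhook/chat:", c.get("webhook", "chat"))
}
```

With the extra label, an alert could fire on the specific receiver that is failing, while summing over the label still recovers the per-integration totals that exist today.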