Include receiver name in notification metrics #3012

Open
@sinkingpoint

Description

Currently, Alertmanager exposes several metrics describing the behavior of receivers:

    numNotifications: prometheus.NewCounterVec(prometheus.CounterOpts{
        Namespace: "alertmanager",
        Name:      "notifications_total",
        Help:      "The total number of attempted notifications.",
    }, []string{"integration"}),
    numTotalFailedNotifications: prometheus.NewCounterVec(prometheus.CounterOpts{
        Namespace: "alertmanager",
        Name:      "notifications_failed_total",
        Help:      "The total number of failed notifications.",
    }, []string{"integration"}),
    numNotificationRequestsTotal: prometheus.NewCounterVec(prometheus.CounterOpts{
        Namespace: "alertmanager",
        Name:      "notification_requests_total",
        Help:      "The total number of attempted notification requests.",
    }, []string{"integration"}),
    numNotificationRequestsFailedTotal: prometheus.NewCounterVec(prometheus.CounterOpts{
        Namespace: "alertmanager",
        Name:      "notification_requests_failed_total",
        Help:      "The total number of failed notification requests.",
    }, []string{"integration"}),
    notificationLatencySeconds: prometheus.NewHistogramVec(prometheus.HistogramOpts{
        Namespace: "alertmanager",
        Name:      "notification_latency_seconds",
        Help:      "The latency of notifications in seconds.",
        Buckets:   []float64{1, 5, 10, 15, 20},
    }, []string{"integration"}),

These are broken down by an integration label, which indicates the type of receiver (pagerduty, webhook, etc.).

The docs recommend extending Alertmanager through the webhook receiver (https://prometheus.io/docs/operating/integrations/#alertmanager-webhook-receiver) rather than adding new integrations. I understand why, since it limits the maintenance burden on the main repo, but it means that a lot of receivers end up under the single webhook integration.

Because the receiver metrics only resolve down to the integration type, it's impossible to tell different webhooks apart. For example, we have various webhooks that go to JIRA, Chat, and other internal systems. With a single webhook series, if one of those receivers falls over, the metrics alone can't tell us which one it was.

I would like to propose a couple of solutions for this:

a) include the receiver name as a secondary label to the above metrics
b) include the hostname/port of the webhook either as another label, or replacing/extending the existing integration label

Either option would allow more accurate alerting on receiver issues.
