Skip to content

Conversation

@siavashs
Copy link
Contributor

@siavashs siavashs commented Nov 15, 2025

Add tracing support using otel:

  • api: extract trace and span IDs from request context and add to alerts
  • types: add trace and span IDs to alerts
  • dispatch: add distributed tracing support
  • notify: add distributed tracing support

This change borrows part of the implementation from #3673
Fixes #3670

TODO list:

  • make notify tracing generic so we don't have to add extra logic per integration
  • apply suggestions from Instrument with OTel tracing #3673
  • go through tracing package and make changes/improvements if necessary
  • implement tracing on Prometheus notifier on top of feat(notifier): independent alertmanager sendloops prometheus#16355
  • extract traces from Prometheus requests
    • store extracted traces on Alerts
    • handle tracing IDs merge for Alerts
    • test distributed tracing between Prometheus and Alertmanager
    • test distributed tracing between webhook notifier and a test webhook receiver which is supports distributed tracing
  • add/update tests
  • move common tracing setup to https://github.com/prometheus/common/

@siavashs
Copy link
Contributor Author

siavashs commented Nov 15, 2025

I'll implement the Prometheus notifier changes in a new draft PR based on prometheus/prometheus#16355
So if the above PR got merged then it would be much easier to rebase the tracing changes.

Copy link

@thompson-tomo thompson-tomo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some feedback on attribute naming based on Open telemetry exposure in semconv. Might be worthwhile to add an alerting.platform.name attribute to the spans which stores alert manager.

Is there interest in having these signals defined in the open telemetry signal registry (semantic conventions)?

trace.WithAttributes(attrs...),
// we'll use producer here since the alert is not processed
// synchronously
trace.WithSpanKind(trace.SpanKindProducer),

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Producer doesn't seem right to me based on

Whether a span represents an outgoing call to a remote service (CLIENT and PRODUCER spans) or a processing of an incoming request initiated externally (SERVER and CONSUMER spans).
Whether a Span represents a request/response operation (CLIENT and SERVER spans) or a deferred execution (PRODUCER and CONSUMER spans).

Perhaps it is internal?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is inherited code from the other PR, still need to check all the details.
But doesn't it fall under producer/consumer category? (the last sentence in the paragraph you quoted).

Copy link

@thompson-tomo thompson-tomo Nov 16, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes it satisfies the second But it needs also to satisfy the first. If it doesn't satisfy either then it is internal.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I changed the span kinds to some extent, I'll have to review all to be sure if everything is set correctly.
We also use propagation now.

return json.Marshal(result)
}

// TODO: probably move these into prometheus/common since they're copied from
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is probably a good idea, what do you think @ArthurSens?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can prepare a PR for common later this week.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would not be against it :)

@siavashs siavashs force-pushed the feat/tracing branch 3 times, most recently from 858e51b to 14b79bd Compare November 17, 2025 21:07
Add tracing support using otel:
- api: extract trace and span IDs from request context and add to alerts
- types: add trace and span IDs to alerts
- dispatch: add distributed tracing support
- notify: add distributed tracing support

This change borrows part of the implementation from prometheus#3673
Fixes prometheus#3670

Signed-off-by: Dave Henderson <dhenderson@gmail.com>
Signed-off-by: Siavash Safi <siavash@cloudflare.com>
Comment on lines +92 to +94
trace.WithAttributes(attribute.String("alerting.notify.integration.name", i.name)),
trace.WithAttributes(attribute.Int("alerting.alerts.count", len(alerts))),
trace.WithSpanKind(trace.SpanKindClient),

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there an easy way to add additional integration details potential some from https://opentelemetry.io/docs/specs/semconv/http/http-spans/#http-client-span in particular server.*, note I assume this span represents the outbound call to a notification system hence the client span kind.

Copy link
Contributor Author

@siavashs siavashs Nov 18, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

otelhttp generates a bunch of spans for each http request so I think it is redundant to add more info to the parent span, unless if we want to disable those and construct only one custom span:
image

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we are going to also have a http span then perhaps this should be internal.

attribute.String("alerting.alert.name", alert.Name()),
attribute.String("alerting.alert.fingerprint", alert.Fingerprint().String()),
),
trace.WithSpanKind(trace.SpanKindConsumer),

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

check this one

Suggested change
trace.WithSpanKind(trace.SpanKindConsumer),
trace.WithSpanKind(trace.SpanKindInternal),

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Missed that one, are these outgoing calls as per the test mentioned in #4745 (comment)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Feature: Instrument Alertmanager for distributed tracing

5 participants