Add OpenTelemetry middleware #769

Closed · wants to merge 11 commits

Conversation

@davidpdrsn (Member) commented Feb 18, 2022:

This adds axum_extra::middleware::opentelemetry_tracing_layer, which returns a TraceLayer that will use OpenTelemetry's conventional span field names.

Fixes #750
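
For context, a minimal usage sketch of the proposed layer (assuming the function name and module path described above; this is not code from the diff itself):

use axum::{routing::get, Router};
use axum_extra::middleware::opentelemetry_tracing_layer;

fn app() -> Router {
    // The layer is attached like any other tower layer; the returned
    // TraceLayer records spans using OpenTelemetry's conventional field names.
    Router::new()
        .route("/", get(|| async { "Hello, World!" }))
        .layer(opentelemetry_tracing_layer())
}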

@davidpdrsn changed the title from "WIP: OpenTelemetry middleware" to "Add OpenTelemetry middleware" on Feb 21, 2022
@davidpdrsn davidpdrsn marked this pull request as ready for review February 21, 2022 08:44
@jplatte (Member) left a comment

Cursory review, don't think I wanna fully review the middleware implementation.

axum-extra/src/middleware/opentelemetry.rs — review comment resolved
examples/opentelemetry-jaeger/src/main.rs — review comment resolved
@jplatte (Member) commented Feb 22, 2022:

@davidpdrsn w.r.t. the tracing initialization, how would you feel about a PR that gets rid of set_var for the whole repo at once?

@davidpdrsn (Member, Author) replied:

If you wanna grind it out then fine by me 😅

@jplatte (Member) left a comment

Since merging is blocked on review, I'll approve. Or could you self-approve your PR too? 😅

@davidpdrsn (Member, Author) commented:

I'm gonna ask around to see if someone familiar with opentelemetry wants to take a look as well.

@neoeinstein (Contributor) left a comment

I do have some questions about this approach to hooking into OpenTelemetry, however. In particular, I'd want to see how propagation to an underlying service would work (how do we continue the trace out to further backend services?). It may just be that I can't quite picture it compared to how the actix-web-opentelemetry middleware is set up. This proposal tries to keep a half foot in both camps by not being a pure OpenTelemetry implementation and trying to stuff the OpenTelemetry constructs through the tracing pipeline. That may not be the most effective way to accomplish that goal. Instead, it may be better to have a pure OpenTelemetry-focused middleware, and if there is a desire for logging into tracing as well, that should be its own middleware.

Comment on lines 150 to 154
let client_ip = req
.extensions()
.get::<ConnectInfo<SocketAddr>>()
.map(|ConnectInfo(client_ip)| Cow::from(client_ip.to_string()))
.unwrap_or_default();
Contributor commented:

Right now this only duplicates the functionality of a lower layer providing net.peer.ip, which could probably be done here or in a distinct transport-level layer. But if this is just providing the network peer's IP, the spec indicates that client_ip should not be provided. As deployed, this is most likely to just report the load balancer's IP (if one is used) rather than the IP of the client. If you were instead to inspect the X-Forwarded-For header to try to find the IP of the "true client", then this field becomes useful.

@davidpdrsn (Member, Author) replied Feb 23, 2022:

Good point! I've added a check for x-forwarded-for as well.

Honestly not sure if we should set net.peer.ip as well 🤔
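
For reference, a rough sketch of that fallback (a hypothetical helper, not the exact code from this PR):

use std::{borrow::Cow, net::SocketAddr};

use axum::extract::ConnectInfo;
use http::Request;

// Hypothetical helper: prefer the first entry of `x-forwarded-for` (the
// closest thing to the "true client" when behind a proxy) and fall back to
// the peer address from `ConnectInfo` otherwise.
fn client_ip<B>(req: &Request<B>) -> Cow<'static, str> {
    req.headers()
        .get("x-forwarded-for")
        .and_then(|value| value.to_str().ok())
        .and_then(|value| value.split(',').next())
        .map(|ip| Cow::from(ip.trim().to_owned()))
        .or_else(|| {
            req.extensions()
                .get::<ConnectInfo<SocketAddr>>()
                .map(|ConnectInfo(addr)| Cow::from(addr.ip().to_string()))
        })
        .unwrap_or_default()
}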

Contributor replied:

Here is probably a good place to do net.peer.ip and net.peer.port if available. I think that adding another layer deeper down the stack would just be more complicated, as you would need to be this high up in the stack to know about the trace context to attach the attribute.
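
A rough sketch of that idea (hypothetical helper; note that a tracing span can only record fields that were declared, e.g. as Empty, when the span was created):

use std::net::SocketAddr;

use axum::extract::ConnectInfo;
use http::Request;

// Hypothetical: read the peer address from the request extensions and record
// it on the already-created span as the conventional net.peer.* fields.
fn record_peer<B>(req: &Request<B>, span: &tracing::Span) {
    if let Some(ConnectInfo(addr)) = req.extensions().get::<ConnectInfo<SocketAddr>>() {
        span.record("net.peer.ip", &tracing::field::display(addr.ip()));
        span.record("net.peer.port", &(addr.port() as u64));
    }
}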

axum-extra/src/middleware/opentelemetry.rs — outdated, review comment resolved
Comment on lines +269 to +270
// assume there is no error, if there is `OtelOnFailure` will be called and override this
span.record("otel.status_code", &"OK");
Contributor commented:

According to the spec, the span status should only be set by the server when there was a 500 series error. Otherwise it should be left unset.

In general, instrumentation middleware should not set this value unless it is to report an error.

Suggested change (delete these lines):
// assume there is no error, if there is `OtelOnFailure` will be called and override this
span.record("otel.status_code", &"OK");

Member (Author) replied:

That spec doesn't mention otel.status_code. Where do you see it?

@neoeinstein (Contributor) replied Feb 23, 2022:

The StatusCode in an OpenTelemetry span is translated into otel.status_code by the various SDKs for serialization on the wire. The semantics of Ok > Error > Unset are the primary reason the field shouldn't be set directly, though some SDKs, such as .NET, do document setting it directly while also recommending the shim, where a specific SetStatus method is provided.

The other reason we shouldn't set this here is that this span is the current span when the handler for a request is called (presuming no intervening #[instrument] calls). If the user hits and reports their own error in the middle, that will appropriately trigger the Error status code for the span. Having this here will, at the end of the request, override that status and mark the span as Ok, which is why the spec calls this out as a MUST NOT.

Contributor commented:

I guess it refers to this part of the spec,

And otel.status_code is a tracing_opentelemetry special attribute for setting the OpenTelemetry StatusCode, so it seems we should not be setting it manually to OK 🤔

match failure {
ServerErrorsFailureClass::StatusCode(status) => {
if status.is_server_error() {
span.record("otel.status_code", &"ERROR");
Contributor commented:

It seems that this is a bit of a backward way to do this. There is a specific call on the normal OpenTelemetry API to set the status code, which should be preferred. The tracing_opentelemetry glue would need to expose a new function to allow setting this, because it is in there, just not exposed. In the absence of that, this is reasonable, but it won't properly enforce the semantics of Ok vs Error vs Unset that the API uses.

Member (Author) replied:

I'm not sure I understand what you mean. tracing_opentelemetry's docs mention that you can set these fields, but it doesn't set them automatically and it doesn't have APIs to set them.

Contributor replied:

Great. I missed that in the documentation for the crate. Because of the specific OTel semantics of "Ok overrides all, Error overrides all, Unset", it would probably be preferable to have this exposed on their extension trait, but if this is how they say to do it, it just pushes the responsibility for correctly implementing those semantics onto us.

Another reply:

The OpenTelemetry Span trait has a set_status method.
I've used it previously through the context's SpanRef.
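
For reference, a rough sketch of that approach (the exact set_status signature has changed between opentelemetry versions; this assumes the 0.17-era API where it takes a status code plus a description):

use opentelemetry::trace::{StatusCode, TraceContextExt};
use tracing_opentelemetry::OpenTelemetrySpanExt;

// Hypothetical: set the OTel span status through the Context's SpanRef
// instead of recording the otel.status_code field on the tracing span.
fn mark_error(description: &str) {
    let context = tracing::Span::current().context();
    context
        .span()
        .set_status(StatusCode::Error, description.to_string());
}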

@davidpdrsn (Member, Author) commented:

@neoeinstein Thanks for the feedback. I really appreciate it!

As mentioned on Discord, since it is possible, I'd like to use the tracing stuff because that's what users are most likely to already be familiar with and use. If someone wants to write a middleware that uses the opentelemetry SDK directly, and not tracing, then they can do that.

I don't see how that is keeping a foot in both camps. tracing_opentelemetry is a library that exists to bridge the gap; in fact, it is maintained by the same people who maintain the opentelemetry crate, so why not use it?

@sergeyshaykhullin commented:

Also, mixing them both will lead to parent-hierarchy problems (tokio-rs/tracing#1458) if otel is used after tracing.

Opinionated part: I prefer using tracing because of the brilliant instrument macro, which doesn't pollute fn scopes.

@floriantz (Contributor) commented:

I do agree on relying on tracing as well. It makes adding opentelemetry to a codebase already instrumented with tracing really simple. And I like not being entirely dependent on otel, as we might want other kinds of exporters as well.

But that's a very opinionated take from me.

@neoeinstein (Contributor) commented:

I am already a heavy user of both tracing and OpenTelemetry, though my current use primarily involves actix-web-opentelemetry, and I am also a big fan of being able to drop #[instrument] wherever I need it. My opinions are thus based on my current use of the technology in production.

actix-web-opentelemetry doesn't interact with the tracing world directly. Instead, it has one concern: the propagation of the tracing context. In my case, I will generally then throw a #[instrument] on the handler function immediately after, as sketched below. The tracing-opentelemetry subscriber then receives those tracing spans and passes them to OpenTelemetry to place in context within the propagated trace.
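
For illustration, that pattern looks roughly like this (hypothetical handler, not code from this PR):

// The propagation middleware only extracts and injects the trace context;
// the handler gets its own span from #[instrument], which the
// tracing-opentelemetry subscriber then attaches to the propagated trace.
#[tracing::instrument]
async fn get_user(id: u64) -> String {
    tracing::info!("looking up user");
    format!("user {}", id)
}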

I did pull down the PR here and play around with it. I did note that the recording of the trace_id in info_span! reports nothing if there was no incoming trace, even though a new trace root and id are getting generated by the OTel tracer. However when present, the trace_id from the traceparent header does get propagated into the tracing spans.

I also discovered that the OpenTelemetry "context" is not actually being set, even if a traceparent was given. This means that a later propagator calling Context::current() is not going to see the current span from tracing (I get an empty context in my testing) and instead generate a new root trace and span, unconnected to the inbound request. At a minimum it would be desirable to have the outgoing requests show as part of the same trace as is recorded for the incoming trace.

I did try doing something like tracing::Span::current().context().attach(), but that just creates a !Send context guard that reverts back once the guard is dropped. It also pollutes the handler, making it !Send if attempting to hold the guard across an await (in the code below, move other().await into the guarded block).

A potential story here is that outgoing propagation libraries would need to do a bit of a dance to get the current context. This seems easy enough to wrap in a #[cfg(feature = "tracing")] if tracing isn't being used, but it does mean that we introduce this concern into the ecosystem.

let context = Some(tracing::Span::current().context())
    .filter(|c| c.has_active_span())
    .unwrap_or_else(opentelemetry::Context::current);

// Or, in a more imperative form
let context = if tracing::Span::current().context().has_active_span() {
    tracing::Span::current().context()
} else {
    opentelemetry::Context::current()
};

Below is my code sample for handler in the example file:

async fn handler() -> &'static str {
    use opentelemetry::trace::TraceContextExt;
    use tracing_opentelemetry::OpenTelemetrySpanExt;
    let context = opentelemetry::Context::current();
    tracing::info!(
        context.span = %context.span().span_context().span_id(),
        context.trace = %context.span().span_context().trace_id(),
        context.is_recording = %context.span().is_recording(),
        "context pulled from opentelemetry::Context::current()"
    );
    let context = tracing::Span::current().context();
    tracing::info!(
        context.span = %context.span().span_context().span_id(),
        context.trace = %context.span().span_context().trace_id(),
        context.is_recording = %context.span().is_recording(),
        "context pulled from tracing::Span::current().context()"
    );
    {
        let _guard = context.attach();
        let context = opentelemetry::Context::current();
        tracing::info!(
            context.span = %context.span().span_context().span_id(),
            context.trace = %context.span().span_context().trace_id(),
            context.is_recording = %context.span().is_recording(),
            "context pulled from opentelemetry::Context::current() after .attach()"
        );
    }
    other().await;
    "Hello, World!"
}

async fn other() {
    tracing::info!("in other");
}

And some screenshots of the relevant Jaeger output and console messages.

[Screenshots: Jaeger UI trace view and console output]

.unwrap_or_default();

let span = tracing::info_span!(
"HTTP request",
@neoeinstein (Contributor) commented Feb 23, 2022:

This is going to form the "name" of the span. We do want it to be relatively low-cardinality (as per the spec), but the generally accepted form on the server side is something like {http.method} {http.route}, presuming a route matched (so only if MatchedPath came back as Some). In the event we don't have a matched path, this should just be something like the "HTTP request" you have here, or HTTP {http.method} request.

That keeps the cardinality in check but also makes the span more easily identifiable.

@tiagolobocastro commented Feb 23, 2022:

AFAIK non-constant values cannot be used to name a tracing span. I got around this in my tracing layer by using the opentelemetry span builder instead.

@floriantz (Contributor) commented Feb 28, 2022:

You can set the name by recording a value in the format {http.method} {http.route} to otel.name. I tested it with the otel exporter going to Jaeger afterwards, and I get the otel.name for the root span instead of HTTP request.

otel.name is a special attribute that overrides the span name when using the otel exporter; it exists specifically for this use case, if I understand correctly: https://docs.rs/tracing-opentelemetry/latest/tracing_opentelemetry/
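
A rough sketch of that approach (hypothetical make_span function; otel.name is the tracing-opentelemetry field that becomes the span name on export):

use axum::extract::MatchedPath;
use http::Request;

// Hypothetical: name the span "{http.method} {http.route}" when a route
// matched, and fall back to a generic name otherwise, via the special
// otel.name field.
fn make_span<B>(req: &Request<B>) -> tracing::Span {
    let name = match req.extensions().get::<MatchedPath>() {
        Some(path) => format!("{} {}", req.method(), path.as_str()),
        None => format!("HTTP {} request", req.method()),
    };
    tracing::info_span!(
        "HTTP request",
        otel.name = %name,
        http.method = %req.method(),
    )
}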

@pauljamescleary commented:

Any update on this PR? Looks exactly like what we need

@davidpdrsn (Member, Author) commented:

@pauljamescleary no news yet. Honestly got kinda overwhelmed with all the feedback. I do wanna finish it eventually though.

If this is something you need urgently I'd recommend you copy the code into your codebase and make the necessary adjustments.

@pauljamescleary replied:

@davidpdrsn thanks for the update :)

davidB added a commit to davidB/sandbox_axum_observability that referenced this pull request Apr 24, 2022
- based on previous work
- based on [Add OpenTelemetry middleware by davidpdrsn · Pull Request #769 · tokio-rs/axum](tokio-rs/axum#769) (code and comments)
@davidB (Contributor) commented May 26, 2022:

I copied the code of the middleware as part of an exploration of axum + reqwest + distributed tracing (with otlp + tempo or jaeger) at https://github.com/davidB/sandbox_axum_observability. I included some of your suggestions (thanks):

  • to have a name that includes the route, using otel.name
  • to have a trace_id before creating the tracing span, so that logs have a non-empty trace_id
  • to retrieve the trace_id in application code (for example, to include the trace_id in a response for errors, debugging, ...)

It's WIP. If you're OK with it, I can contribute back to this PR and collect feedback.
https://github.com/davidB/sandbox_axum_observability/blob/development/app/src/middleware/opentelemetry.rs

@davidpdrsn (Member, Author) commented:

I'm gonna close this for now. If someone wants to continue the work, please do, and post here to let people know. It's possible to implement this in a third-party crate, so there's no reason this must be built into axum.

@davidpdrsn davidpdrsn closed this Jun 6, 2022
@davidpdrsn davidpdrsn deleted the opentelemetry-middleware branch June 6, 2022 11:44
@davidB (Contributor) commented Jun 6, 2022:

@davidpdrsn Are you OK if I publish my fork of your code as a crate (mainly based on the repo I shared in my previous message)?

Thanks for your work

(Sorry for the double notification, I first sent it with my pro account.)

@davidpdrsn (Member, Author) replied:

@davidB go for it! You're also welcome to add it to ECOSYSTEM.md.

@davidB (Contributor) commented Jun 11, 2022:

I published a first version of the crate as https://crates.io/crates/axum-tracing-opentelemetry

Successfully merging this pull request may close these issues.

OpenTelemetry compatible tracing middleware
9 participants