Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Migrate from OpenTracing to OpenTelemetry instrumentation #3646

Merged
merged 14 commits into from
Sep 5, 2024

Conversation

andreasgerstmayr
Copy link
Contributor

@andreasgerstmayr andreasgerstmayr commented May 3, 2024

What this PR does:

Which issue(s) this PR fixes:
Fixes #2224

Checklist

  • Tests updated
  • Documentation added
  • CHANGELOG.md updated - the order of entries should be [CHANGE], [FEATURE], [ENHANCEMENT], [BUGFIX]

Signed-off-by: Andreas Gerstmayr <agerstmayr@redhat.com>
@andreasgerstmayr
Copy link
Contributor Author

This is a first draft to replace OpenTracing with OpenTelemetry SDK.

Some open questions

Signed-off-by: Andreas Gerstmayr <agerstmayr@redhat.com>
@zalegrala
Copy link
Contributor

Thanks for the PR, this looks like its on the right track to me.

  • Creating a new tracer per package does seem like the way to go for the transition to the OTEL SDK.
  • unit64 -> int64 attributes is likely fine. I don't think traceql knows the difference.
  • I don't worry too much about backwards compatibility here since I think the number of users who will care is small, but we need to mark it as a breaking change. This will impact us internally and we will need to make preparations for this to land in main.

One concern I have is whether we lose the server spans. I see the dskit change linked above, but I'm not clear this addresses the issue. Currently, many spans are generated on the dskit server, which connects various components since the server is used for all components for grpc and http, and these spans are using the opentracing SDK. Are you able to confirm the impact here? I can test this PR out once you give the nod that you're ready.

I will mention that there has been some effort in the past to align this between all databases, and since dskit is in the mix, we may not be able to make this change on our own. I've had a coupe branches of my own which look similar, but the concern above blocked forward movement. I expect that we can have updated information here next quarter, but the complexity of moving all databases gave us some pause previously, and so some coordination may be prudent.

@andreasgerstmayr
Copy link
Contributor Author

One concern I have is whether we lose the server spans. I see the dskit change linked above, but I'm not clear this addresses the issue. Currently, many spans are generated on the dskit server, which connects various components since the server is used for all components for grpc and http, and these spans are using the opentracing SDK. Are you able to confirm the impact here? I can test this PR out once you give the nod that you're ready.

So far my testing consisted of running two tempo instances (monolithic mode), one with this PR and one without, setting up two data sources in Grafana and opening the explore page in separate tabs, and then comparing random samples of traces. Initially the context propagation of the HTTP GET - api_search trace was broken for example, but this is fixed with grafana/dskit#518.

The HTTP <method> - <routename> spans are created in dskit using OpenTracing SDK, and they are correctly linked with the ones from the OTEL SDK by the OpenTracing bridge (@kvrhdn set up the bridge in #842, I didn't change this part).

Yes, you can test this PR already.
I didn't test the microservices mode yet, I'll do that next (therefore the PR is still in draft mode, and I want to compare more sample traces).

I will mention that there has been some effort in the past to align this between all databases, and since dskit is in the mix, we may not be able to make this change on our own. I've had a coupe branches of my own which look similar, but the concern above blocked forward movement. I expect that we can have updated information here next quarter, but the complexity of moving all databases gave us some pause previously, and so some coordination may be prudent.

I think it should work fine if dskit is instrumented with OpenTracing, as long as we don't remove the OpenTracing-OpenTelemetry bridge.

@andreasgerstmayr
Copy link
Contributor Author

I tested it in microservices mode today, looks good so far.
Please let me know if you find any broken traces.

@knylander-grafana
Copy link
Contributor

Greetings. Wondering if we'll need to update the docs for your PR changes. Will this migration impact users? For example, are there changes to configuration files? IF there are updates, we can create a doc issue for your work or you can add the doc updates to this or a separate PR.

@zalegrala
Copy link
Contributor

I took some time to deploy this, and it looks pretty good to me. Nice work. Here is a 6 hour window, and you can't really tell when the deployment happened.
image

Looking at alloy for the same period, you can see the transition in receiver protocol, but not much else.

image

It might be nice to include an option for using GRPC. Additionally, to @knylander-grafana 's point, we could call out the change in environment variables where it makes sense. I changed OTEL_EXPORTER_JAEGER_ENDPOINT -> OTEL_EXPORTER_OTLP_TRACES_ENDPOINT which was enough.

@andreasgerstmayr
Copy link
Contributor Author

@knylander-grafana Yes, there is impact for some users. I'll update the docs in this PR (this section: https://grafana.com/docs/tempo/latest/operations/monitor/#traces).

I'll also add gRPC support and update the env vars next week.

@zalegrala do you have a metric for the (average) number of spans per trace in your deployment? I'm mainly concerned about broken traces (broken context propagation), this should be visible if the number of spans per trace is different before and after applying this PR.

Signed-off-by: Andreas Gerstmayr <agerstmayr@redhat.com>
@zalegrala
Copy link
Contributor

I don't have that metric currently, but I think we can work that up, perhaps next week. I expect the change from grpc -> http will mean there is more negotiation and slightly increase in spans, but I share the concern of broken traces. If we do have some after merge, then we can follow up, but it would be good to identify as much as we can ahead of time.

@andreasgerstmayr
Copy link
Contributor Author

I moved from the otlptracehttp to autoexport, which lets us control the transport via OTEL_EXPORTER_OTLP_PROTOCOL (defaults to http/protobuf).

I also updated the docs, changelog, docker compose example file and resolved all merge conflicts.

Today I noticed one issue in the logs around OpenCensus/metrics:

level=error ts=2024-05-29T14:23:47.75153709Z caller=main.go:259 msg=OpenTelemetry.ErrorHandler err="starting span \"ExportMetrics\": unsupported sampler: 0x1e42800"

afaics it's due to the OpenCensus bridge at https://github.com/census-instrumentation/opencensus-go/blob/master/metric/metricexport/reader.go#L191
with sampler = https://github.com/census-instrumentation/opencensus-go/blob/master/metric/metricexport/reader.go#L30,
which the bridge can't handle: https://github.com/open-telemetry/opentelemetry-go/blob/89c32cbd368fc786f5d3963f6979aedce4b27bb6/bridge/opencensus/internal/oc2otel/tracer_start_options.go#L31-L33

I don't know where Tempo uses OpenCensus metrics though, afaics Tempo does use prometheus/client_golang for metrics.

@zalegrala
Copy link
Contributor

The error you point to isn't new with this change, so probably safe to ignore, though would be nice to clean up.

@andreasgerstmayr
Copy link
Contributor Author

Found the one occurrence of OpenCensus metrics: modules/distributor/receiver/metrics.go. It's registering a few metrics with the DefaultRegisterer of prometheus.

But I'm unsure what the code is doing, I can't find any place where a metric value is set. The return value of newMetricViews() is assigned to receiversShim.metricViews, but never read again. And the defined metrics don't show up at the tempo:3200/metrics endpoint (even without this PR).

Copy link
Contributor

@knylander-grafana knylander-grafana left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Doc updates look good. This approval only applies to the doc and not to code changes.

@knylander-grafana knylander-grafana added the type/docs Improvements or additions to documentation label Jul 23, 2024
@knylander-grafana
Copy link
Contributor

@andreasgerstmayr You have some good work in this PR but there are some conflicts that need to be resolved.

@andreasgerstmayr andreasgerstmayr mentioned this pull request Jul 26, 2024
3 tasks
@andreasgerstmayr
Copy link
Contributor Author

@andreasgerstmayr You have some good work in this PR but there are some conflicts that need to be resolved.

Yes, unfortunately this PR gets out of date very quickly because it modifies many files. I've split off the update of the dskit dependency to a separate PR #3915 to make this PR a bit smaller. Once the other PR is merged, I'll fix the merge conflicts of this PR again.

Signed-off-by: Andreas Gerstmayr <agerstmayr@redhat.com>
Signed-off-by: Andreas Gerstmayr <agerstmayr@redhat.com>
Signed-off-by: Andreas Gerstmayr <agerstmayr@redhat.com>
Signed-off-by: Andreas Gerstmayr <agerstmayr@redhat.com>
Signed-off-by: Andreas Gerstmayr <agerstmayr@redhat.com>
Signed-off-by: Andreas Gerstmayr <agerstmayr@redhat.com>
Signed-off-by: Andreas Gerstmayr <agerstmayr@redhat.com>
@@ -27,7 +27,6 @@ require (
github.com/gorilla/mux v1.8.1
github.com/grafana/dskit v0.0.0-20240801171758-736c44c85382
github.com/grafana/e2e v0.1.1
github.com/grpc-ecosystem/grpc-opentracing v0.0.0-20180507213350-8e809c8a8645
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice.

@zalegrala
Copy link
Contributor

Nice work @andreasgerstmayr. Thanks for following up on this.

@zalegrala zalegrala merged commit f4b2642 into grafana:main Sep 5, 2024
17 checks passed
@andreasgerstmayr
Copy link
Contributor Author

Thanks for the review & merging! :)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
type/docs Improvements or additions to documentation
Projects
None yet
Development

Successfully merging this pull request may close these issues.

migrate from jaeger-sdk to otel-sdk
3 participants