Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Tracing: Migrate Google Cloud (Stackdriver) client to OpenTelemetry #4838

Merged
merged 12 commits into from
Mar 17, 2022

Conversation

matej-g
Copy link
Collaborator

@matej-g matej-g commented Nov 7, 2021

  • I added CHANGELOG entry for this change.
  • Change is not relevant to the end user.

Changes

This is first in the series of PRs to start migrating our tracing from the OpenTracing packages to OpenTelemetry. This PR in particular moves Google Cloud provider to OpenTelemetry. In details:

  • Migrate Stackdriver to OpenTelemetry, rename to Google Cloud, while keeping the configuration backwards compatible
  • Introduce utilities (bridge tracer and sampler), which will be required to build the OpenTelemetry tracer(s) while keeping the current behavior and features (e.g. forcing tracing by HTTP header)
  • Adjust HTTP middleware
  • Update documentation

Partially addresses #1972.

Verification

Tests have been updated; the change has been also verified manually by sending traces to Google Cloud.

- Add method to create bridge tracer
- Implement a sampler which enables us to force tracing

Signed-off-by: Matej Gera <matejgera@gmail.com>
- Rename to Google Cloud, keep it backwards compatible
- Refactor and move to OTEL exporter
- Adjust factory to use bridge tracer for this provider

Signed-off-by: Matej Gera <matejgera@gmail.com>
- to ensure force tracing / populate trace ID header works for the bridge tracer as well

Signed-off-by: Matej Gera <matejgera@gmail.com>
Signed-off-by: Matej Gera <matejgera@gmail.com>
Signed-off-by: Matej Gera <matejgera@gmail.com>
Signed-off-by: Matej Gera <matejgera@gmail.com>
}

if len(os.Args) > 1 {
attr = append(attr, attribute.String("binary_cmd", os.Args[1]))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we need to have binary Infos to attributes? Maybe remove this or at least make this configurable?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Opentelemetry specification says lots of semantic_conventions , we should use it as possible. like process.executable.name. https://github.com/open-telemetry/opentelemetry-specification/blob/main/specification/resource/semantic_conventions/process.md#process

Copy link
Collaborator Author

@matej-g matej-g Nov 16, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Honestly, this was purely to keep the behavior, attribute name etc. the same as it is now, as we wanted to make the change as 'backwards compatible' as possible. But yes, OTEL has more detailed guidance on semantic conventions and I'm definitely planning to follow them as we move the rest of the instrumentation to OTEL. However, if the feeling is that this can be changed already here, I'm happy to follow through.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@matej-g
Copy link
Collaborator Author

matej-g commented Nov 29, 2021

ping @yeya24 @hanjm if you have interest and capacity to review by any chance, I'd be grateful 🙏

@hanjm
Copy link
Member

hanjm commented Nov 30, 2021

💪 LGTM

@hanjm
Copy link
Member

hanjm commented Dec 11, 2021

ping @yeya24

pkg/tracing/google_cloud/google_cloud.go Outdated Show resolved Hide resolved

resource, err := resource.New(ctx, resource.WithAttributes(collectAttributes(serviceName)...))
if err != nil {
level.Warn(logger).Log("msg", "detecting resources for tracing provider failed", "err", err)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If there is an error, should we return it or do we need to deal with it in line 67 as well?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If there's an error, we should still receive valid (although possibly empty) resource. I think best we can do is log warning and move on (or we can possibly change level to error). I'll add a comment as well.

@yeya24
Copy link
Contributor

yeya24 commented Dec 20, 2021

Would like to move this forward 😄

@matej-g
Copy link
Collaborator Author

matej-g commented Dec 21, 2021

Hey @yeya24 thanks for the ping. I updated the PR and also now updated OTEL to version 1.3.0.

However, after further testing, I'm noticing the traces get 'broken' between components 😞. For example when executing query -> receiver, it shows up as two distinct traces for each component - it gets 'split' on the store gRPC operation between client and server; I'm not sure exactly why, whether some changes are required to the middleware as well. I'll need more time to sort it out.

@hanjm
Copy link
Member

hanjm commented Dec 22, 2021

- Currently, bridge tracer supports only HTTP headers as a carrier.However, our instrumentation e.g. for gRPC uses metatada.MD as a carrier instead, breaking the propagatin. This fix works around it by 'converting' the carrier to HTTP header. See code docs for details.

Signed-off-by: Matej Gera <matejgera@gmail.com>
Signed-off-by: Matej Gera <matejgera@gmail.com>
@matej-g
Copy link
Collaborator Author

matej-g commented Jan 28, 2022

Apologies for the long delay on this one, but I needed to find a bit of time to pin this down.

@hanjm I could not at first understand why the data was not being propagated, as you point out, but in the end I discovered that the OpenTelemetry's BridgeTracer supports only HTTP headers as a carrier when propagating. Since in the gRPC middleware we are using the gRPC metadata as a carrier, the trace injection was always failing due to unsupported carrier. I created a wrapper for the bridge tracer to work around this. I have tested is as well on Google Cloud and now the traces are looking as expected:
trace_example

cc @yeya24

@matej-g
Copy link
Collaborator Author

matej-g commented Feb 24, 2022

Gentle ping @hanjm @yeya24, if you'd get a chance to re-review, thank you 🙏

@hanjm
Copy link
Member

hanjm commented Feb 24, 2022

Overall LGTM, Thanks you.

hanjm
hanjm previously approved these changes Feb 24, 2022
}

if len(os.Args) > 1 {
attr = append(attr, attribute.String("binary_cmd", os.Args[1]))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

go.mod Outdated
go.opentelemetry.io/otel v1.3.0
go.opentelemetry.io/otel/bridge/opentracing v1.3.0
go.opentelemetry.io/otel/sdk v1.3.0
go.opentelemetry.io/otel/trace v1.3.0
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

1.3.1 has released.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, I'll update to latest version!

pkg/tracing/client/factory.go Outdated Show resolved Hide resolved
@yeya24
Copy link
Contributor

yeya24 commented Mar 5, 2022

We should be good to go after fixing @hanjm 's comments 👍

Signed-off-by: Matej Gera <matejgera@gmail.com>
@matej-g
Copy link
Collaborator Author

matej-g commented Mar 17, 2022

Thanks @hanjm @yeya24, hopefully now this is good to go!

@yeya24 yeya24 enabled auto-merge (squash) March 17, 2022 20:47
Copy link
Contributor

@yeya24 yeya24 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@yeya24 yeya24 merged commit 72d63e3 into thanos-io:main Mar 17, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants