
Migrate from OpenTracing to OpenTelemetry #1548

Closed · wants to merge 1 commit

Conversation

rajatvig

@rajatvig rajatvig commented Dec 29, 2021

This moves the tracing setup from OpenTracing (now deprecated) to OpenTelemetry. The code is vendor-agnostic, and tracing data is sent over gRPC to the collector configured in the environment.

I've removed the code in tracetools and inlined the span creation and end in bootstrap.

There is a small demo setup of the OpenTelemetry Collector and Jaeger to see traces from the agents locally.

This would very likely break any consumers of the current Datadog-based tracing setup, as it relies on a running Collector. However, as documented here, it is easy to stand one up for Datadog.

@pda
Member

pda commented Jan 5, 2022

Thanks very much @rajatvig, that looks fantastic.

I'm hoping we can soon find time to review it, and think about compatibility/migration for existing Datadog/OpenTracing users.

@rajatvig
Author

rajatvig commented Jan 7, 2022

@pda Thanks for looking into it.

There is the option to send the traces to the deployed Datadog Agent as documented here. The endpoint can be configured using the OTEL_EXPORTER_OTLP_TRACES_ENDPOINT variable. Ideally, it should be a no-op from that perspective, but I don't have a Datadog setup to verify.

@cyrille-leclerc

cyrille-leclerc commented Jan 10, 2022

It's really exciting.

@fatmcgav
Contributor

@rajatvig Firstly, thanks for taking the time to open this PR :)

I've tried to test these changes locally, but I'm not seeing any data sent through to Jaeger.

FYI, I'm doing:

❯ env |grep OTEL
OTEL_EXPORTER_OTLP_TRACES_ENDPOINT=http://localhost:4317
OTEL_RESOURCE_ATTRIBUTES=deployment.environment=development
❯ scripts/build-binary.sh linux amd64 test1
Building buildkite-agent with:

GOOS=linux
GOARCH=amd64
BUILD_VERSION=test1


Done: pkg/buildkite-agent-linux-amd64 💪
❯ cd pkg
❯ ll
total 25M
lrwxrwxrwx 1 fatmcgav fatmcgav   27 Jan 11 14:12 buildkite-agent -> buildkite-agent-linux-amd64
-rwxr-xr-x 1 fatmcgav fatmcgav  25M Jan 11 14:21 buildkite-agent-linux-amd64
❯ ./buildkite-agent start --token $BUILDKITE_AGENT_TOKEN --tracing-backend otel --build-path "$PWD/builds" --tags "queue=test"

   _           _ _     _ _    _ _                                _
  | |         (_) |   | | |  (_) |                              | |
  | |__  _   _ _| | __| | | ___| |_ ___    __ _  __ _  ___ _ __ | |_
  | '_ \| | | | | |/ _` | |/ / | __/ _ \  / _` |/ _` |/ _ \ '_ \| __|
  | |_) | |_| | | | (_| |   <| | ||  __/ | (_| | (_| |  __/ | | | |_
  |_.__/ \__,_|_|_|\__,_|_|\_\_|\__\___|  \__,_|\__, |\___|_| |_|\__|
                                                 __/ |
 https://buildkite.com/agent                    |___/

2022-01-11 14:23:15 NOTICE Starting buildkite-agent v3.33.3 with PID: 3026700
2022-01-11 14:23:15 NOTICE The agent source code can be found here: https://github.com/buildkite/agent
2022-01-11 14:23:15 NOTICE For questions and support, email us at: hello@buildkite.com
2022-01-11 14:23:15 INFO   Registering agent with Buildkite...
2022-01-11 14:23:15 INFO   Successfully registered agent "fatmcgav-p1g2" with tags [queue=test]
2022-01-11 14:23:15 INFO   Starting 1 Agent(s)
2022-01-11 14:23:15 INFO   You can press Ctrl-C to stop the agents
2022-01-11 14:23:15 INFO   fatmcgav-p1g2 Connecting to Buildkite...
2022-01-11 14:23:16 INFO   fatmcgav-p1g2 Waiting for work...

However I don't see any traces hitting my local Jaeger instance...

I do see a couple of additional log lines in my build output which weren't there prior to these changes:

2022/01/11 14:23:42 context deadline exceeded
2022/01/11 14:23:57 context deadline exceeded

Any suggestions?

Ultimately, I want to test sending traces to an Elastic APM server instance, but thought I'd test with Jaeger first :)

Thanks

@rajatvig
Author

@fatmcgav Are you running the OTLP Collector as well? The error seems to suggest not. The port 4317 is served by the OTLP Collector and that should be configured to export to Jaeger. There is a sample setup of the collector and Jaeger here.

@fatmcgav
Contributor

> @fatmcgav Are you running the OTLP Collector as well? The error seems to suggest not. The port 4317 is served by the OTLP Collector and that should be configured to export to Jaeger. There is a sample setup of the collector and Jaeger here.

@rajatvig sorry, I should've said that I am using the example OTEL collector included in this PR...

@rajatvig
Author

@fatmcgav That seems odd. I just ran it locally and it works with almost the same steps as you outlined. The only difference is that I am on macOS and not on a Linux host. It might have something to do with how docker compose is binding the network.

Will try and run on a Linux machine tomorrow.

@fatmcgav
Contributor

So I've spent a bit more time debugging this one, and was able to get traces into my local Jaeger instance by tweaking bootstrap.go:

diff --git a/bootstrap/bootstrap.go b/bootstrap/bootstrap.go
index 17365239..31249947 100644
--- a/bootstrap/bootstrap.go
+++ b/bootstrap/bootstrap.go
@@ -272,7 +278,12 @@ func (b *Bootstrap) startTracing(ctx context.Context) (trace.Span, context.Conte
        var stopper stopper
        switch b.Config.TracingBackend {
        case "datadog", "otel":
-               client := otlptracegrpc.NewClient()
+               client := otlptracegrpc.NewClient(
+                       otlptracegrpc.WithInsecure(),
+                       otlptracegrpc.WithEndpoint("localhost:4317"),
+                       otlptracegrpc.WithDialOption(grpc.WithBlock()),
+                       otlptracegrpc.WithReconnectionPeriod(time.Second),
+               )
                exporter, err := otlptrace.New(ctx, client)
                if err != nil {
                        b.shell.Errorf("Error creating OTLP trace exporter %s", err)

So it seems like there's some kind of issue with loading the config from env vars, at least in my env...

I've confirmed that the env vars are set:

❯ env | grep OTEL
OTEL_RESOURCE_ATTRIBUTES=deployment.environment=development
OTEL_EXPORTER_OTLP_ENDPOINT=localhost:4317
OTEL_EXPORTER_OTLP_TRACES_ENDPOINT=localhost:4317

@rajatvig
Author

@fatmcgav The Go exporters code here seems to indicate it derives the options from the environment variables, which is why I did not set them explicitly in the code configuring the exporter.

@fatmcgav
Contributor

My thoughts as well, but it just isn't working for me...

I even went as far as building an agent from this PR, pushing a local docker image and testing on K8s with the relevant OTEL_ env vars set, but still no joy...

Unfortunately setting OTEL_LOG_LEVEL=debug doesn't provide any useful logging either, I'd guess due to the logging setup within the buildkite-agent...

@rajatvig
Author

@fatmcgav I am guessing the ENV variables are not getting passed to the Agent process you are running. Maybe add debugging from the Agent to spit out the environment variables it sees. Also, the endpoint does need to be prefixed with http:// else it does not set the insecure option.
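The scheme check described above can be sketched with a stdlib-only Go snippet. Note that `isInsecureEndpoint` is a hypothetical helper illustrating the behaviour rajatvig describes, not the actual exporter code:

```go
package main

import (
	"fmt"
	"net/url"
)

// isInsecureEndpoint is a hypothetical, stdlib-only illustration of the
// behaviour described above: an explicit http:// scheme is what switches
// the exporter to a plaintext (insecure) gRPC connection, https:// keeps
// TLS on, and a bare host:port never triggers the insecure option at all.
func isInsecureEndpoint(endpoint string) (bool, error) {
	u, err := url.Parse(endpoint)
	if err != nil {
		return false, err
	}
	switch u.Scheme {
	case "http":
		return true, nil // plaintext gRPC
	case "https":
		return false, nil // TLS stays enabled
	default:
		// "localhost:4317" parses with "localhost" as the scheme, so
		// neither branch matches and insecure is never set.
		return false, fmt.Errorf("no usable scheme on %q", endpoint)
	}
}

func main() {
	for _, e := range []string{"http://localhost:4317", "https://collector:4317", "localhost:4317"} {
		ok, err := isInsecureEndpoint(e)
		fmt.Printf("%-26s insecure=%v err=%v\n", e, ok, err)
	}
}
```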

@fatmcgav
Contributor

> @fatmcgav I am guessing the ENV variables are not getting passed to the Agent process you are running. Maybe add debugging from the Agent to spit out the environment variables it sees. Also, the endpoint does need to be prefixed with http:// else it does not set the insecure option.

So my test pipeline runs the following step:

trap 'kill -- $$' INT TERM QUIT; echo 'Hello world!' && env | grep OTEL | sort && sleep 60 && echo 'Done!'

And I can see the following output in Buildkite build log:

Hello world!
OTEL_EXPORTER_OTLP_ENDPOINT=http://<configured endpoint>
OTEL_EXPORTER_OTLP_INSECURE=true
OTEL_LOG_LEVEL=debug
OTEL_RESOURCE_ATTRIBUTES=deployment.environment=development
Done!

@rajatvig
Author

The environment variable is OTEL_EXPORTER_OTLP_TRACES_ENDPOINT. In the list you have, it is missing TRACES.

@fatmcgav
Contributor

fatmcgav commented Jan 15, 2022

> The environment variable is OTEL_EXPORTER_OTLP_TRACES_ENDPOINT. In the list you have, it is missing TRACES.

My understanding is that the TRACES value can be used as an override for the default endpoint specified by OTEL_EXPORTER_OTLP_ENDPOINT...

Either way, I'm pretty sure I tested both with and without the TRACES in the env name...

Edit: Yeh, so the logic in https://github.com/open-telemetry/opentelemetry-go/blob/main/exporters/otlp/otlptrace/internal/otlpconfig/envconfig.go#L63 looks for the TRACES_ENDPOINT value first, and then falls back to ENDPOINT...

Review comment on the diff:

> 2. Export
>
> ```shell
> export OTEL_EXPORTER_OTLP_TRACES_ENDPOINT=http://localhost:4317
> ```

rajatvig (Author) replied:

The code is using the gRPC-based endpoint, so it does not need that appended.
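The distinction can be sketched as follows; `otlpTraceTarget` is a hypothetical stdlib-only helper (not part of this PR) showing why nothing is appended for gRPC, while OTLP over HTTP posts to a per-signal `/v1/traces` path:

```go
package main

import (
	"fmt"
	"strings"
)

// otlpTraceTarget sketches the difference noted above: with OTLP over gRPC
// the exporter dials a bare host:port and the RPC method is fixed by the
// protobuf service definition, whereas OTLP over HTTP posts to a per-signal
// path (/v1/traces) appended to the configured endpoint.
func otlpTraceTarget(endpoint string, grpc bool) string {
	if grpc {
		return endpoint // nothing appended for gRPC
	}
	return strings.TrimRight(endpoint, "/") + "/v1/traces"
}

func main() {
	fmt.Println(otlpTraceTarget("localhost:4317", true))
	fmt.Println(otlpTraceTarget("http://localhost:4318", false))
}
```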

@pda
Member

pda commented Jan 24, 2022

Thanks so much @rajatvig @cyrille-leclerc @fatmcgav @cbandy for the thinking, testing and discussion around this.
We've got this queued in our current cycle, so stay tuned for some overdue action on our end 🙏

@yob
Contributor

yob commented Jan 28, 2022

> There is the option to send the traces to the deployed Datadog Agent as documented here.

... and

> Since versions 6.32.0 and 7.32.0, the Datadog Agent supports OTLP traces and metrics ingestion through both gRPC and HTTP.

This is very cool info, I didn't realise the datadog agent now ships with native support for ingestion of OTLP traces.

@ojkelly

ojkelly commented Feb 3, 2022

I was able to get this to work on the Elastic CI Stack for Buildkite with a bootstrap script by patching /etc/systemd/system/buildkite-agent.service

with Environment="OTEL_EXPORTER_OTLP_TRACES_ENDPOINT=http://localhost:4317"

and echo 'tracing-backend=otel' >> /etc/buildkite-agent/buildkite-agent.cfg

And by running an otel collector on the agent.

Looking forward to seeing this merged :)

@jsoo1

jsoo1 commented Mar 3, 2022

In #1563 I kinda assumed buildkite-agent would adopt OpenTelemetry. Here's a patch on this PR that is a rough draft of some more fields I would like to see on spans. I am very excited to see this merged, either way! Thank you!

From fc0e463fc21675bc4a5d155b4d4733f104303083 Mon Sep 17 00:00:00 2001
From: John Soo <john.soo@arista.com>
Date: Fri, 18 Feb 2022 07:04:20 -0800
Subject: [PATCH] tracing: Add buildkite_{base_branch,pull_request,label}
 attributes.

---
 bootstrap/bootstrap.go | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/bootstrap/bootstrap.go b/bootstrap/bootstrap.go
index 95e5365c..8d2a8939 100644
--- a/bootstrap/bootstrap.go
+++ b/bootstrap/bootstrap.go
@@ -240,6 +240,7 @@ func (b *Bootstrap) startTracing(ctx context.Context) (trace.Span, context.Conte
 	buildNumber, _ := b.shell.Env.Get("BUILDKITE_BUILD_NUMBER")
 	buildURL, _ := b.shell.Env.Get("BUILDKITE_BUILD_URL")
 	jobURL := fmt.Sprintf("%s#%s", buildURL, b.JobID)
+	pullRequestBaseBranch, _ := b.shell.Env.Get("BUILDKITE_PULL_REQUEST_BASE_BRANCH")
 	source, _ := b.shell.Env.Get("BUILDKITE_SOURCE")
 	label, hasLabel := b.shell.Env.Get("BUILDKITE_LABEL")
 	if !hasLabel {
@@ -288,12 +289,15 @@ func (b *Bootstrap) startTracing(ctx context.Context) (trace.Span, context.Conte
 			attribute.String("buildkite.org", b.OrganizationSlug),
 			attribute.String("buildkite.pipeline", b.PipelineSlug),
 			attribute.String("buildkite.branch", b.Branch),
+			attribute.String("buildkite.pull_request", b.PullRequest),
+			attribute.String("buildkite.pull_request_base_branch", pullRequestBaseBranch),
 			attribute.String("buildkite.job_id", b.JobID),
 			attribute.String("buildkite.job_url", jobURL),
 			attribute.String("buildkite.build_id", buildID),
 			attribute.String("buildkite.build_number", buildNumber),
 			attribute.String("buildkite.build_url", buildURL),
 			attribute.String("buildkite.source", source),
+			attribute.String("buildkite.label", label),
 			attribute.String("buildkite.retry", fmt.Sprintf("%v", retry)),
 			attribute.String("buildkite.parallel", fmt.Sprintf("%v", parallel)),
 			attribute.String("buildkite.rebuilt_from_id", rebuiltFromID),
-- 
2.34.0

@ojkelly

ojkelly commented Apr 27, 2022

Bumping this, is there anything I can help with to get it over the line and merged?

We've found this data very valuable, and would love to get it into the next release.

@moskyb
Contributor

moskyb commented Apr 28, 2022

Hey friends!

First off, an apology: we haven't given this PR the attention it deserves, and it's languished a bit. We've had competing priorities within the team, but we should've been clearer and more upfront about what was going on. We're working on processes within the team so that we can be better open-source stewards of Agent and Friends, but these things take time.

With that said, wrt. this PR, I have some good news and some bad news.

The good news is that OpenTelemetry support is a high priority for us at the moment. It's clear that this is something that people want in the agent, and they have good reason, so I'm going to be working on adding OpenTelemetry support to the agent over the next couple of weeks.

The bad news is that unfortunately, we won't be able to merge this PR. The PR is in good shape as it stands, but there are breaking changes in Datadog's behaviour between ingesting OpenTracing traces and OpenTelemetry traces. This means that were we to ship this as is, users currently using the (older) OpenTracing/Datadog integration would see differently shaped traces, and would need to do some extra work on their Datadog agents.

With that in mind, I'm going to work on a separate PR that includes OpenTelemetry support in such a way that existing workflows aren't broken - likely taking liberal inspiration from this PR.

I'm happy to chat about this in the comments here, otherwise, feel free to get in touch with me on either the Enterprise or Community slacks (@Benno Moskovitz) or via email at ben.m@buildkite.com.

As a matter of logistics, I'll close this PR 2022-05-06, or once we've reached consensus in the comments, whichever comes later.

@ojkelly

ojkelly commented Apr 28, 2022

Thanks for the update @moskyb, great to hear this is still happening 😃

@rajatvig
Author

Thanks for the update @moskyb. Really glad this is moving forward. I'm fine closing the PR as long as we get OpenTelemetry support in the Agent.

@moskyb
Contributor

moskyb commented May 5, 2022

as promised, i'm closing this now. thanks so much for your work @rajatvig!

You can follow along with my reimplementation (you might see some familiar code 👀) in #1630

10 participants