feat: Initial OpenTelemetry Spec #686

AbhiPrasad · 2022-09-13T11:43:24Z

https://develop-git-abhi-otel.sentry.dev/sdk/performance/opentelemetry/

This PR outlines an initial spec for OpenTelemetry usage in Sentry. It focuses on outlining the Span protocol (how to map from otel span -> sentry span), as well as mapping out some high level sections for us to work on here.

Open Questions

There are still some open questions we need to answer together at this stage.

At a high level we established that we would like the OpenTelemetry SDK to be responsible for creating transactions. Therefore we must decide how this is going to be done.

If there are two traces ongoing, do we create two transactions? If we do, which one becomes the active transaction? Do we store state about all ongoing transactions?

My opinion: We track all transactions in a set, and remove them when they finish. So instead of assigning to the active transaction on the scope, we assign to the transactions the span processor is aware of. Pseudo-algo:

# Ideally this is a weak map that holds weak references.
# Maps trace_id -> the transaction created for that trace.
transactions = {}

def on_span_start(otel_span):
    transaction = transactions.get(otel_span.trace_id)
    if transaction:
        return transaction

    new_transaction = transaction_from_span(otel_span)
    transactions[new_transaction.trace_id] = new_transaction
    # The first tracked transaction becomes the active one.
    if len(transactions) == 1:
        make_active_transaction(new_transaction)
    return new_transaction

def on_span_finish(otel_span):
    transaction = transactions.get(otel_span.trace_id)
    if transaction:
        if transaction.span_id == otel_span.span_id:
            # The root span finished, so finish and forget the transaction.
            transaction.finish(otel_span.end_timestamp)
            del transactions[transaction.trace_id]
            if get_active_transaction() == transaction:
                make_active_transaction(None)
        else:
            add_span_to_transaction(otel_span)
        return

    # If no transaction exists, the transaction was already finished and sent to Sentry.
    # Create a new transaction from this span and send it to Sentry.
    new_transaction = transaction_from_span(otel_span)
    new_transaction.finish(otel_span.end_timestamp)
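The "weak map" comment in the pseudo-algo could look roughly like this in Python. This is only a sketch: `Transaction` and `track` are hypothetical stand-ins, not SDK APIs, and it assumes something else (for example the scope) keeps a strong reference to a transaction while it is in flight.

```python
import gc
import weakref


class Transaction:
    """Hypothetical stand-in for a Sentry transaction (not the SDK class)."""

    def __init__(self, trace_id, span_id):
        self.trace_id = trace_id
        self.span_id = span_id


# Values are held weakly: an entry disappears once nothing else references
# the transaction, even if on_span_finish never ran for it.
transactions = weakref.WeakValueDictionary()


def track(transaction):
    """Register a transaction under its trace ID (hypothetical helper)."""
    transactions[transaction.trace_id] = transaction


txn = Transaction(trace_id="trace-a", span_id="span-1")
track(txn)
tracked_while_alive = "trace-a" in transactions

# Drop the only strong reference, as would happen if the root span were
# abandoned without ever finishing.
del txn
gc.collect()
tracked_after_release = "trace-a" in transactions
```

The trade-off is that the map alone no longer keeps a transaction alive, so whoever starts the transaction has to hold on to it until the root span finishes.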

Right now the spec only says briefly that we rely on a combination of OpenTelemetry span kind, span status, span attributes and span name to decide the Sentry span operation and span description. Getting a correct span operation is imperative for making use of Sentry product features (span operations, performance issues, etc.), so we'll need to make sure these are consistent.

OpenTelemetry has a list of semantic conventions that identify the type of span based on span attributes, but this is annoying to keep duplicating in every single SDK. The easiest solution here is just to push all of that work onto Relay.

So,

SDK maps all attributes -> tags
In Relay we detect a transaction as coming from opentelemetry:
In Relay, we process the tags to generate ops, descriptions and status for the spans (and also maybe transaction)

With dynamic sampling this is also not a problem, since we can just use the OpenTelemetry Span name as the transaction name, so everything should be consistent.

We did some of this mapping work in the Sentry Exporter. For the purposes of the MVP in Ruby/Python, we can probably do this mapping in the SDK though.
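To make the mapping idea concrete, here is a rough Python sketch of deriving an op and description from span attributes, wherever that logic ends up living (SDK or Relay). The function name, the returned op strings and the `"default"` fallback are assumptions for illustration. The attribute keys (`http.method`, `http.target`, `http.url`, `db.system`, `db.statement`, `rpc.service`, `messaging.system`) come from the OpenTelemetry semantic conventions.

```python
from typing import Mapping, Tuple


def op_and_description(
    name: str, kind: str, attributes: Mapping[str, object]
) -> Tuple[str, str]:
    """Derive a Sentry-style (op, description) pair from OpenTelemetry span data."""
    if "http.method" in attributes:
        # Span kind separates incoming requests from outgoing ones.
        op = "http.server" if kind == "SERVER" else "http.client"
        target = attributes.get("http.target") or attributes.get("http.url") or name
        return op, f"{attributes['http.method']} {target}"
    if "db.system" in attributes:
        return "db", str(attributes.get("db.statement", name))
    if "rpc.service" in attributes:
        return "rpc", name
    if "messaging.system" in attributes:
        return "message", name
    # No recognised convention: fall back to the span name.
    return "default", name


http_result = op_and_description(
    "GET", "CLIENT", {"http.method": "GET", "http.url": "https://example.com/users"}
)
db_result = op_and_description(
    "SELECT users",
    "CLIENT",
    {"db.system": "postgresql", "db.statement": "SELECT * FROM users"},
)
fallback_result = op_and_description("work", "INTERNAL", {})
```

Keeping this in one place is the point of the Relay proposal: every SDK would otherwise need its own copy of these branches.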

Final Thoughts

My goal here is to get this merged in after the protocol gets approved. We can then split up the work around each of the different sections and iterate on this together.

vercel · 2022-09-13T11:43:29Z

The latest updates on your projects. Learn more about Vercel for Git ↗︎

Name	Status	Preview	Updated
develop	✅ Ready (Inspect)	Visit Preview	Oct 17, 2022 at 3:24PM (UTC)

src/docs/sdk/performance/opentelemetry.mdx

vladanpaunovic

Thanks @AbhiPrasad!

I left a few minor details

vladanpaunovic · 2022-09-26T13:00:23Z

src/docs/sdk/performance/opentelemetry/index.mdx

+
+## Approach
+
+TODO: Talk about the approach we are using, based on Matt's hackweek project - https://github.com/getsentry/sentry-ruby/pull/1876


It would be good if @sl0thentr0py and @mjq-sentry would chime in a few words about the approach @mjq-sentry took. Can you guys do this?

src/docs/sdk/performance/opentelemetry/span-protocol.mdx

vladanpaunovic · 2022-09-26T13:42:13Z

src/docs/sdk/performance/opentelemetry/index.mdx

+
+## Transaction Protocol
+
+There is no concept of a transaction within OpenTelemetry, so we rely on promoting spans to become transactions. The span `description` becomes the transaction `name`, and the span `op` becomes the transaction `op`. Therefore, OpenTelemetry spans must be mapped to Sentry spans before they can be promoted to become a transaction.


When talking about this, I would make it clearer on what spans are Otel spans and what spans are Sentry spans.

I can kinda figure it out, but it feels like I am guessing.

+1 - maybe an illustration. I can help if needed.

I'll work on clarifying the wording, let me see what I can do about the illustration.
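Until the wording and illustration land, a small Python sketch may help separate the two steps. It assumes the OpenTelemetry span has already been mapped to a Sentry span; `SentrySpan`, `SentryTransaction` and `promote_to_transaction` are hypothetical names, not SDK types.

```python
from dataclasses import dataclass


@dataclass
class SentrySpan:
    """A Sentry span that was already mapped from an OpenTelemetry span."""

    op: str
    description: str
    span_id: str
    trace_id: str


@dataclass
class SentryTransaction:
    name: str
    op: str
    span_id: str
    trace_id: str


def promote_to_transaction(span: SentrySpan) -> SentryTransaction:
    """Promote an already-mapped Sentry span to a transaction."""
    return SentryTransaction(
        name=span.description,  # span description -> transaction name
        op=span.op,  # span op -> transaction op
        span_id=span.span_id,
        trace_id=span.trace_id,
    )


root = SentrySpan(
    op="http.server", description="GET /users", span_id="b0e6", trace_id="4bf9"
)
transaction = promote_to_transaction(root)
```

The order matters: the OTel-to-Sentry span mapping produces `op` and `description`, and only then can the promotion copy them over.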

vladanpaunovic · 2022-09-26T13:45:32Z

src/docs/sdk/performance/opentelemetry/index.mdx

+
+Aside from information from Spans and Transactions, OpenTelemetry has meta-level information about the SDK, resource, and service that generated spans. To track this information, we generate a new OpenTelemetry Event Context.
+
+The existence of this context on an event (transaction or error) is how the Sentry backend will know that the incoming event is an OpenTelemetry event.


You can link to event payload here and move the Otel context to event payload page.

I'm actually going to leave this in here so we don't confuse others reading the develop docs (new hires, other teams, etc.). Once we are comfortable with this new otel context schema, let's move this out into the main event payload page.
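For readers new to the idea, detecting an OpenTelemetry event by the presence of the context could be as small as the check below. The context key `"otel"` is an assumption made for this sketch; the thread does not fix the key name here.

```python
OTEL_CONTEXT_KEY = "otel"  # assumed key, not confirmed by the spec text quoted here


def is_otel_event(event: dict) -> bool:
    """Return True when the event (transaction or error) carries an OpenTelemetry context."""
    contexts = event.get("contexts") or {}
    return OTEL_CONTEXT_KEY in contexts


otel_event = {"type": "transaction", "contexts": {"otel": {}}}
plain_event = {"type": "transaction", "contexts": {"trace": {}}}
```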

vladanpaunovic · 2022-09-26T13:47:05Z

src/docs/sdk/performance/opentelemetry/index.mdx

+}
+```
+
+The reason sdk and service are split are so they can be indexed as top level fields in the future for easier usage within Sentry.


It would be good to capture how we are planning to fill in the service fields.

This is done by grabbing them off the OpenTelemetry resource. I can add this!
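A rough Python sketch of grabbing these fields off the resource is shown here. The `service.*` and `telemetry.sdk.*` attribute names are from the OpenTelemetry resource semantic conventions; the function names and the exact shape of the resulting `sdk` and `service` objects are assumptions, since the schema is still under review.

```python
from typing import Mapping


def _strip_prefix(attributes: Mapping[str, object], prefix: str) -> dict:
    """Collect attributes under `prefix`, keyed by the remainder of the name."""
    return {
        key[len(prefix):]: value
        for key, value in attributes.items()
        if key.startswith(prefix)
    }


def otel_context_from_resource(resource_attributes: Mapping[str, object]) -> dict:
    """Build the sdk/service parts of the context from resource attributes."""
    context = {}
    sdk = _strip_prefix(resource_attributes, "telemetry.sdk.")
    service = _strip_prefix(resource_attributes, "service.")
    if sdk:
        context["sdk"] = sdk
    if service:
        context["service"] = service
    return context


resource = {
    "service.name": "checkout",
    "service.version": "1.4.2",
    "telemetry.sdk.name": "opentelemetry",
    "telemetry.sdk.language": "python",
    "telemetry.sdk.version": "1.13.0",
    "host.name": "web-1",  # not part of sdk or service, so it is skipped
}
context = otel_context_from_resource(resource)
```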

danielkhan · 2022-09-26T19:41:47Z

@AbhiPrasad

If there are two traces ongoing, do we create two transactions? If we do, which one becomes the active transaction? Do we store state about all ongoing transactions?

Can you outline a scenario that would cause two ongoing traces within one execution?
Technically it's possible for Node but this should not be in the same context.
Within the same context, I would assume that there is only one active trace with unique trace ID.

My opinion: We track all transactions in a set, and remove them when they finish. So instead of assigning to the active transaction on the scope, we assign to the transactions the span processor is aware of.

Such operations often cause memory leaks, don't they?

OpenTelemetry has a list of semantic conventions that identify the type of span based on span attributes, but this is annoying to keep duplicating in every single SDK. The easiest solution here is just to push all of that work onto Relay.

Yes to avoid redundancies by centralizing on Relay if needed.

Generally, the copy is excellent but sometimes reads a bit like a dev spec to me. Are the docs really the right place for all of it?

danielkhan · 2022-09-26T19:44:46Z

src/docs/sdk/performance/opentelemetry/index.mdx

+
+When Sentry performance monitoring was initially introduced, OpenTelemetry was in its early stages. This led us to adopt a slightly different model from OpenTelemetry; notably, we have a concept of transactions that OpenTelemetry does not have. We've described this, and some more historical background, in our <Link to="/sdk/research/performance/">performance monitoring research document</Link>.
+
+TODO: Add history about OpenTelemetry Sentry Exporter: https://github.com/open-telemetry/opentelemetry-collector-contrib/tree/main/exporter/sentryexporter


Really needed in the docs? I would leave that to a README.

danielkhan · 2022-09-26T19:47:20Z

src/docs/sdk/performance/opentelemetry/index.mdx

+
+## Background
+
+When Sentry performance monitoring was initially introduced, OpenTelemetry was in its early stages. This led us to adopt a slightly different model from OpenTelemetry; notably, we have a concept of transactions that OpenTelemetry does not have. We've described this, and some more historical background, in our <Link to="/sdk/research/performance/">performance monitoring research document</Link>.


I would just mention that Sentry and OpenTelemetry have conceptual differences. This is normal and true for each vendor with an SDK that predates OTel. It doesn't matter why it is like that today. It's more important to a reader what the differences are.

danielkhan · 2022-09-26T19:49:00Z

src/docs/sdk/performance/opentelemetry/index.mdx

+
+## Approach
+
+TODO: Talk about the approach we are using, based on Matt's hackweek project - https://github.com/getsentry/sentry-ruby/pull/1876


Does this really need to go into the docs?
Should this document our way to a solution or the solution?

It should document the solution, I'll work with the team to get this in there.

src/docs/sdk/performance/opentelemetry/span-protocol.mdx

antonpirker · 2022-09-27T13:42:18Z

src/docs/sdk/performance/opentelemetry/index.mdx

+    };
+
+    // https://github.com/open-telemetry/opentelemetry-specification/blob/main/specification/resource/semantic_conventions/README.md#telemetry-sdk
+    sdk?: {


Maybe call this "otel_sdk" (or something similar) so we do not confuse it with Sentry SDK versions, making everything more explicit?

untitaker

I think this spec has some potentially big implications for storage so I would like somebody from S&S to review this as well.

untitaker · 2022-09-29T13:59:44Z

src/docs/sdk/performance/opentelemetry/index.mdx

+
+## Transaction Protocol
+
+There is no concept of a transaction within OpenTelemetry, so we rely on promoting spans to become transactions. The span `description` becomes the transaction `name`, and the span `op` becomes the transaction `op`. Therefore, OpenTelemetry spans must be mapped to Sentry spans before they can be promoted to become a transaction.


The span description becomes the transaction name

the span description is in some cases extremely high cardinality. Are we going to introduce more transaction name sources to mitigate this?

untitaker · 2022-09-29T14:04:21Z

src/docs/sdk/performance/opentelemetry/span-protocol.mdx

+
+In Sentry, we have two options for how to treat span events. First, we can add them as breadcrumbs to the transaction the span belongs to. Second, we can create an artificial "point-in-time" span (a span with 0 duration), and add it to the span tree. TODO on what approach we take here.
+
+In the special case that the span event is an exception span, [where the `name` of the span event is `exception`](https://opentelemetry.io/docs/reference/specification/trace/semantic_conventions/exceptions/), we also have the possibility of generating a Sentry error from an exception. In this case, we can create this [exception based on the attributes of an event](https://opentelemetry.io/docs/reference/specification/trace/semantic_conventions/exceptions/#attributes), which include the error message and stacktrace. This exception can also inherit all other attributes of the span event + span as tags on the event.


which and how many attributes do we expect there? if there are a ton of default attributes attached to an exception span, can we define which of them actually end up in event tags?

there is a storage concern about adding too many default tags to the event payload, and a product one as well considering user-defined tags and tags inherited from otel count against the same size limit
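One way to bound the tag count is an explicit allowlist, sketched below in Python. The allowlist contents, the function name and the simplified return shape are all hypothetical (this is not the real Sentry event schema). The `exception.type`, `exception.message` and `exception.stacktrace` keys are from the OpenTelemetry exception semantic conventions.

```python
from typing import Mapping, Optional

# Hypothetical allowlist: only these attributes may become event tags.
TAG_ALLOWLIST = ("http.method", "http.status_code")


def error_from_span_event(
    event_name: str,
    event_attributes: Mapping[str, object],
    span_attributes: Mapping[str, object],
) -> Optional[dict]:
    """Turn an `exception` span event into a simplified error payload."""
    if event_name != "exception":
        return None
    # Span event attributes win over span attributes on key conflicts.
    merged = {**span_attributes, **event_attributes}
    tags = {key: str(merged[key]) for key in TAG_ALLOWLIST if key in merged}
    return {
        "exception": {
            "type": event_attributes.get("exception.type"),
            "value": event_attributes.get("exception.message"),
            "stacktrace": event_attributes.get("exception.stacktrace"),
        },
        "tags": tags,
    }


error = error_from_span_event(
    "exception",
    {
        "exception.type": "ValueError",
        "exception.message": "bad input",
        "exception.stacktrace": "Traceback (most recent call last): ...",
    },
    {"http.method": "POST", "http.status_code": 500, "thread.id": 7},
)
```

Anything outside the allowlist, such as `thread.id` above, never reaches the tags, so the number of inherited tags stays fixed regardless of how many attributes the span carries.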

fpacifici · 2022-10-02T23:46:01Z

My opinion: We track all transactions in a set, and remove them when they finish. So instead of assigning to the active transaction on the scope, we assign to the transactions the span processor is aware of. Pseudo-algo:

If I get this correctly, in this system, a sequence of spans at the root level of a trace would create one new transaction per span.
In the current Sentry protocol you could fit all of them in the same transaction.
Switching from the old model to the new model can multiply the number of produced transactions. Do we have a sense of the potential impact this would have on customers' quotas and ingestion capacity?

fpacifici · 2022-10-02T23:47:52Z

Since this could have very large cross-system implications, have you considered proposing an RFC?
https://github.com/getsentry/rfcs/

AbhiPrasad · 2022-10-11T13:53:56Z

@fpacifici @untitaker Thanks for taking a look! I think we need to do some more testing before we can better understand storage and ingestion ramifications. For example, theoretically this change should send transactions at the same rate for both an OpenTelemetry SDK and a Sentry SDK (the OTel SDK should behave the same as the Sentry SDK), but until we test various scenarios we cannot confirm this.

The primary purpose of this document was to outline a high level approach so we could start experimenting with the SDKs. Once we get to a comfortable spot and have an SDK prototype, we will open a more formal RFC so we can discuss infrastructure ramifications. There we can also bring up changes we'll need to make to Relay/ingestion to support this.

untitaker · 2022-10-11T14:06:44Z

There we can also bring up changes we'll need to make to Relay/ingestion to support this.

What if SDK changes are needed? Are those then still possible? If the answer is no, this is IMO not a sustainable approach to prototyping.

For example I see a 50:50 chance that we cannot send and store this data as tags because there are too many. Would your proposal be to later detect those payloads in Relay and selectively move tags elsewhere? I'm sure that's convenient for SDK authors but painful to maintain for ingest.

danielkhan · 2022-10-11T14:26:28Z

@untitaker

I'm sure that's convenient for SDK authors but painful to maintain for ingest.

Yet, if we need to do a translation of values and tags, it would be redundant to do that in the individual SDKs. Instead, I would rather, if possible, vote for a dedicated mapping component that does this. This component could maybe also be maintained by the SDK team.

untitaker · 2022-10-11T14:53:19Z

Yet, if we need to do a translation of values and tags, it would be redundant to do that in the individual SDKs.

I don't have an issue with putting more things in Relay. I mainly take issue with the idea that questions concerning infrastructure and software architecture are entirely left unanswered until we have shipped a prototype potentially already used by customers. I don't have a clear suggestion for what we should do (because we don't have an answer to those questions), and I suspect the concern about tags is not the biggest concern with this proposal anyway. If our answer is that SDK changes are still possible after prototyping (incl. breaking changes), that's sufficient to me.

Regarding tags specifically

So I think this is going too much into detail, but some concerns with tags off the top of my head:

generically named otel attributes can conflict with user-defined tags
tags have a single static limit applied to them. if there are more than n tags, we have no way of knowing which tags to trim first unless we build a static map of tag names we consider otel-specific
tags can only be strings (so greater-than or smaller-than queries are not possible)

then there are combinations of those problems. for example, let's say the UI is built to interpret foo.bar tag a certain way (because foo.bar has meaning in otel), eg assume an integer value. What happens if the user sends a custom tag of the same name?

you can fix all of those problems by making relay convert tags into some other structure in the payload (which is something we currently don't do). There are two downsides to that:

the event json visible in the UI further diverges from what the user sends
tag limits are still applied beforehand depending on customer relay version

one other option would be to dump otel attributes into a new custom context, and then let relay figure out which of those attributes can be indexed and which limits to apply. More time would be needed (and an answer to the above questions) to determine whether that's a good idea.

Or, again, you're saying upfront it's fine to change the schema in breaking ways post-prototyping.

untitaker · 2022-10-11T14:58:43Z

Instead, I would rather, if possible, vote for a dedicated mapping component that does this.

I think what you're suggesting here is that we should maintain two separate schemas: ingestion schema maintained by SDKs, and storage schema maintained by infra teams. I think that idea has potential but it's not where we're at right now, and it would break some assumptions both customers (and tooling built by customers) and our internals have. One needs to be a superset of the other, I don't think it's feasible to have them be maintained by two separate teams

danielkhan · 2022-10-12T13:45:03Z

I mainly take issue with the idea that questions concerning infrastructure and software architecture are entirely left unanswered until we have shipped a prototype potentially already used by customers.

We won't release a prototype to customers, @untitaker.
This right here is the right place to raise all the concerns and discuss solutions and once this is all figured out, we will go ahead and productize this. Nothing will be released that isn't production ready and signed off by all teams this topic touches.

I don't think it's feasible to have them be maintained by two separate teams

That's fine for me. It was just an idea but if we aren't there yet, it's just that.

AbhiPrasad · 2022-10-17T13:12:14Z

one other option would be to dump otel attributes into a new custom context, and then let relay figure out which of those attributes can be indexed and which limits to apply. More time would be needed (and an answer to the above questions) to determine whether that's a good idea.

So for now, we are going to go with this approach - and then we can evaluate what to do as we see data coming in. So nothing stored in tags for now for transactions.
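As a rough illustration of what "nothing stored in tags" means for the payload, the Python sketch below puts span attributes into a dedicated context and leaves user tags alone. The `"otel"` context key, the `"attributes"` field and the function name are assumptions for this sketch.

```python
def attach_otel_attributes(event: dict, attributes: dict) -> dict:
    """Store span attributes in a dedicated context and leave tags untouched."""
    contexts = event.setdefault("contexts", {})
    otel = contexts.setdefault("otel", {})  # assumed context key
    otel["attributes"] = dict(attributes)  # values keep their original types
    return event


# A user-defined tag that shares its name with an OpenTelemetry attribute.
event = {"tags": {"foo.bar": "set-by-user"}}
attach_otel_attributes(event, {"foo.bar": 42, "http.method": "GET"})
```

This sidesteps two of the tag concerns raised above: the user's `foo.bar` tag is not overwritten, and the attribute value stays an integer instead of being coerced to a string.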

I'm going to revamp this based on some convos we had, and merge this in as a starting point. The next step we need to take here is to document the SDK API so that we have a baseline to work off of.

Once we're comfortable with that, we can start RFCs / wider convos re: logic in Relay and product implications.

untitaker · 2022-10-17T14:08:09Z

Sounds good, thanks @AbhiPrasad!

vercel bot had a problem deploying to Preview September 13, 2022 11:44 Failure

feat: OpenTelemetry Spec

faa85a8

AbhiPrasad force-pushed the abhi-otel branch from c67b805 to faa85a8 Compare September 13, 2022 12:25

vercel bot had a problem deploying to Preview September 13, 2022 12:26 Failure

vladanpaunovic reviewed Sep 13, 2022

AbhiPrasad self-assigned this Sep 14, 2022

remove broken links

4465be8

vercel bot had a problem deploying to Preview September 15, 2022 13:51 Failure

remove table comment that borks the mdx transpiler 🤔

b146918

vercel bot deployed to Preview September 15, 2022 13:56 View deployment

get span protocol down

6d374fc

vercel bot deployed to Preview September 19, 2022 09:00 View deployment

AbhiPrasad added 3 commits September 19, 2022 11:47

clean up naming

85881b7

add docs around span events

0894725

remove todos around span protocol

5343eed

vercel bot deployed to Preview September 19, 2022 10:08 View deployment

add otel context spec

67ee7cd

vercel bot deployed to Preview September 19, 2022 11:43 View deployment

add transaction protocol notes

b44fa21

AbhiPrasad force-pushed the abhi-otel branch from b442499 to b44fa21 Compare September 19, 2022 12:02

vercel bot deployed to Preview September 19, 2022 12:03 View deployment

AbhiPrasad marked this pull request as ready for review September 19, 2022 12:41

AbhiPrasad changed the title ~~feat: OpenTelemetry Spec~~ feat: Initial OpenTelemetry Spec Sep 19, 2022

AbhiPrasad requested a review from mjq-sentry September 19, 2022 12:42

simplify span descriptions

6d75ff9

vercel bot deployed to Preview September 20, 2022 09:00 View deployment

AbhiPrasad mentioned this pull request Sep 23, 2022

fix: Update span ops getsentry/sentry-symfony#655

Merged

vladanpaunovic reviewed Sep 26, 2022

danielkhan reviewed Sep 26, 2022

antonpirker reviewed Sep 27, 2022

souredoutlook mentioned this pull request Sep 27, 2022

Is there any interest in supporting Open Telemetry? getsentry/sentry#39231

Closed

untitaker reviewed Sep 29, 2022

smeubank mentioned this pull request Oct 3, 2022

Ruby SDK basic OTEL Support getsentry/sentry-ruby#1907

Closed

1 task

smeubank mentioned this pull request Oct 17, 2022

Basic OpenTelemetry (OTEL) Support getsentry/sentry-python#1687

Closed

AbhiPrasad added 3 commits October 17, 2022 15:16

Merge branch 'master' into abhi-otel

0c4a40d

include info about attribtues, mention no tags in transactions

69db3e0

make the tags ultra clear

4cfca01

vercel bot had a problem deploying to Preview October 17, 2022 14:00 Failure

vercel bot had a problem deploying to Preview October 17, 2022 14:01 Failure

fix link

cbcd399

vercel bot deployed to Preview October 17, 2022 15:24 View deployment

AbhiPrasad merged commit 5a8a175 into master Oct 17, 2022

AbhiPrasad deleted the abhi-otel branch October 17, 2022 15:26

smeubank mentioned this pull request Oct 30, 2022

[GO] Add basic OpenTelemetry support getsentry/sentry-go#486

Closed

smeubank mentioned this pull request Nov 25, 2022

OpenTelemetry Support getsentry/sentry-dotnet#2066

Closed

