Skip to content
This repository has been archived by the owner on Aug 14, 2024. It is now read-only.

feat: Initial OpenTelemetry Spec #686

Merged
merged 14 commits into from
Oct 17, 2022
Merged

feat: Initial OpenTelemetry Spec #686

merged 14 commits into from
Oct 17, 2022

Conversation

AbhiPrasad
Copy link
Member

@AbhiPrasad AbhiPrasad commented Sep 13, 2022

https://develop-git-abhi-otel.sentry.dev/sdk/performance/opentelemetry/

This PR outlines an initial spec for OpenTelemetry usage in Sentry. It focuses on outlining the Span protocol (how to map from otel span -> sentry span), as well as mapping out some high level sections for us to work on here.

Open Questions

There are still some open questions we need to answer together at this stage.

  1. At a high level we established that we would like the OpenTelemetry SDK to be responsible to create transactions. Therefore we must decide how this is going to be done

If there are two traces ongoing, do we create two transactions? If we do, which one becomes the active transaction? Do we store state about all ongoing transactions?

My opinion: We track all transactions in a set, and remove them when we get finished. So instead of assigning to the active transaction on the scope, we assign to transactions the span processor is aware about. Pseudo-algo:

# Ideally it's a Weak Map that holds WeakRef
map = new Map();

def on_span_start(otel_span):
  transaction = map.get(otel_span.trace_id)
  if transaction:
    return transaction

  new_transaction = transaction_from_span(otel_span)
  map.add(new_transaction.trace_id, new_transaction)
  if (map.length === 1):
    make_active_transaction(new_transaction)
    
def on_span_finish(otel_span):
  transaction = map.get(otel_span.trace_id)
  if transaction
    if transaction.span_id == otel_span.span_id:
      transaction.finish(otel_span.end_timestamp)
      map.delete(transaction.trace_id)
      if get_active_transaction(transaction) == transaction
        make_active_transaction(nil)
    else:
      add_span_to_transaction(otel_span)
  # if no transaction exists, that means the transaction was already sent to Sentry and finished. 
  # Create a new transaction from this span and send it to Sentry
  new_transaction = transaction_from_span(otel_span)
  new_transaction.finish(otel_span.end_timestamp)
  1. Right now I've briefly just said we rely on a combination of opentelemetry span kind, span status, span attributes and span name to decide sentry span operation and span description. Getting a correct span operation is imperative to making sure we can make use of Sentry product features (span operations, performance issues, etc) - so we'll need to make sure these are consistent.

OpenTelemetry has a list of semantic conventions that identify the type of span based on span attributes, but this is annoying to keep duplicating in every single SDK. The easiest solution here is just to push all of that work onto Relay.

So,

  1. SDK maps all attributes -> tags
  2. In Relay we detect a transaction as coming from opentelemetry:
  3. In Relay, we process the tags to generate ops, descriptions and status for the spans (and also maybe transaction)

With dynamic sampling this is also not a problem, since we can just use the OpenTelemetry Span name as the transaction name, so everything should be consistent.

We did some of this mapping work in the Sentry Exporter. For the purposes of the MVP in Ruby/Python, we can prob do this mapping in the SDK though.

Final Thoughts

My goal here is to get this merged in after the protocol gets approved. We can then split up the work around each of the different sections and iterate on this together.

@vercel
Copy link

vercel bot commented Sep 13, 2022

The latest updates on your projects. Learn more about Vercel for Git ↗︎

Name Status Preview Updated
develop ✅ Ready (Inspect) Visit Preview Oct 17, 2022 at 3:24PM (UTC)

@AbhiPrasad AbhiPrasad self-assigned this Sep 14, 2022
@AbhiPrasad AbhiPrasad marked this pull request as ready for review September 19, 2022 12:41
@AbhiPrasad AbhiPrasad changed the title feat: OpenTelemetry Spec feat: Initial OpenTelemetry Spec Sep 19, 2022
Copy link
Contributor

@vladanpaunovic vladanpaunovic left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @AbhiPrasad!

I left a few minor details


## Approach

TODO: Talk about the approach we are using, based on Matt's hackweek project - https://github.com/getsentry/sentry-ruby/pull/1876
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be good if @sl0thentr0py and @mjq-sentry would chime in a few words about the approach @mjq-sentry took. Can you guys do this?


## Transaction Protocol

There is no concept of a transaction within OpenTelemetry, so we rely on promoting spans to become transactions. The span `description` becomes the transaction `name`, and the span `op` becomes the transaction `op`. Therefore, OpenTelemetry spans must be mapped to Sentry spans before they can be promoted to become a transaction.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When talking about this, I would make it clearer on what spans are Otel spans and what spans are Sentry spans.

I can kinda figure it out, but it feels like I am guessing.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 - maybe an illustration. I can help if needed.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll work on clarifying the wording, let me see what I can do about the illustration.


Aside from information from Spans and Transactions, OpenTelemetry has meta-level information about the SDK, resource, and service that generated spans. To track this information, we generate a new OpenTelemetry Event Context.

The existence of this context on an event (transaction or error) is how the Sentry backend will know that the incoming event is an OpenTelemetry event.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can link to event payload here and move the Otel context to event payload page.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm actually going to leave this in here so we don't confuse others reading the develop docs (new hires, other teams, etc.). once we are comfortable with this new otel context schema, let's move this outside into the main event payload page.

}
```

The reason sdk and service are split are so they can be indexed as top level fields in the future for easier usage within Sentry.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be good to capture how are we planning to fill in the service fields

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is done using by grabbing them off the OpenTelemetry resource, but I can do this!

@danielkhan
Copy link

danielkhan commented Sep 26, 2022

@AbhiPrasad

If there are two traces ongoing, do we create two transactions? If we do, which one becomes the active transaction? Do we store state about all ongoing transactions?

Can you outline a scenario that would cause two ongoing traces within one execution?
Technically it's possible for Node but this should not be in the same context.
Within the same context, I would assume that there is only one active trace with unique trace ID.

My opinion: We track all transactions in a set, and remove them when we get finished. So instead of assigning to the active transaction on the scope, we assign to transactions the span processor is aware about.

Such operations often cause memory leaks, don't they?

OpenTelemetry has a list of semantic conventions that identify the type of span based on span attributes, but this is annoying to keep duplicating in every single SDK. The easiest solution here is just to push all of that work onto Relay.

Yes to avoid redundancies by centralizing on Relay if needed.

Generally, the copy is excellent but sometimes reads a bit like a dev spec to me. Are the docs really the right place for all of it?


When Sentry performance monitoring was initially introduced, OpenTelemetry was in early stages. This lead to us adopt a slightly different model from OpenTelemetry, notably we have this concept of transactions that OpenTelemetry does not have. We've described this, and some more historical background, in our <Link to="/sdk/research/performance/">performance monitoring research document</Link>.

TODO: Add history about OpenTelemetry Sentry Exporter: https://github.com/open-telemetry/opentelemetry-collector-contrib/tree/main/exporter/sentryexporter

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Really needed in the docs? I would leave that to a README.


## Background

When Sentry performance monitoring was initially introduced, OpenTelemetry was in early stages. This lead to us adopt a slightly different model from OpenTelemetry, notably we have this concept of transactions that OpenTelemetry does not have. We've described this, and some more historical background, in our <Link to="/sdk/research/performance/">performance monitoring research document</Link>.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would just mention that Sentry and OpenTelemetry have conceptual differences. This is normal and true for each vendor with an SDK that predates OTel. It doesn't matter why it is like that today. It's more important to a reader what the differences are.


## Approach

TODO: Talk about the approach we are using, based on Matt's hackweek project - https://github.com/getsentry/sentry-ruby/pull/1876

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this really need to go into the docs?
Should this document our way to a solution or the solution?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It should document the solution, I'll work with the team to get this in there.

};

// https://github.com/open-telemetry/opentelemetry-specification/blob/main/specification/resource/semantic_conventions/README.md#telemetry-sdk
sdk?: {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe call this "otel_sdk" (or something similar) so we do not confuse it with sentry sdk versions, making just everything just more explicit?

Copy link
Member

@untitaker untitaker left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this spec has some potentially big implications for storage so I would like somebody from S&S to review this as well.


## Transaction Protocol

There is no concept of a transaction within OpenTelemetry, so we rely on promoting spans to become transactions. The span `description` becomes the transaction `name`, and the span `op` becomes the transaction `op`. Therefore, OpenTelemetry spans must be mapped to Sentry spans before they can be promoted to become a transaction.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The span description becomes the transaction name

the span description is in some cases extremely high cardinality. Are we going to introduce more transaction name sources to mitigate this?


In Sentry, we have two options for how to treat span events. First, we can add them as breadcrumbs to the transaction the span belongs to. Second, we can create an artificial "point-in-time" span (a span with 0 duration), and add it to the span tree. TODO on what approach we take here.

In the special case that the span event is an exception span, [where the `name` of the span event is `exception`](https://opentelemetry.io/docs/reference/specification/trace/semantic_conventions/exceptions/), we also have the possibility of generating a Sentry error from an exception. In this case, we can create this [exception based on the attributes of an event](https://opentelemetry.io/docs/reference/specification/trace/semantic_conventions/exceptions/#attributes), which include the error message and stacktrace. This exception can also inherit all other attributes of the span event + span as tags on the event.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

which and how many attributes do we expect there? if there are a ton of default attributes attached to an exception span, can we define which of them actually end up in event tags?

there is a storage concern about adding too many default tags to the event payload, and a product one as well considering user-defined tags and tags inherited from otel count against the same size limit

@fpacifici
Copy link

My opinion: We track all transactions in a set, and remove them when we get finished. So instead of assigning to the active transaction on the scope, we assign to transactions the span processor is aware about. Pseudo-algo:

If I get this correctly, in this system, a sequence of spans at the root level of a trace would create one new transaction per span.
In the current Sentry protocol you could fit all of them in the same transaction.
Switching from the old model to the new model can multiply the number of produced transactions. Do we have a sense of the potential impact this would have on customer's quotas and ingestion capacity ?

@fpacifici
Copy link

Since this could have very large cross system implications have you considered discussing proposing an rfc ?
https://github.com/getsentry/rfcs/

@AbhiPrasad
Copy link
Member Author

@fpacifici @untitaker Thanks for taking a look! I think we need to do some more testing before we can better understand storage and ingestion ramifications. For example, theoretically this change should send transactions at the same rate fir both an OpenTelemetry and Sentry SDK (otel sdk behaves === sentry sdk), but until we test various scenarios we cannot confirm this.

The primary purpose of this document was to outline a high level approach so we could start experimenting with the SDKs. Once we get to a comfortable spot and have an SDK prototype, we will open a more formal RFC so we can discuss infrastructure ramifications. There we can also bring up changes we'll need to make to Relay/ingestion to support this.

@untitaker
Copy link
Member

There we can also bring up changes we'll need to make to Relay/ingestion to support this.

What if SDK changes are needed? Are those then still possible? If the answer is no, this is IMO not a sustainable approach to prototyping.

For example I see a 50:50 chance that we cannot send and store this data as tags because there are too many. Would your proposal be to later detect those payloads in Relay and selectively move tags elsewhere? I'm sure that's convenient for SDK authors but painful to maintain for ingest.

@danielkhan
Copy link

@untitaker

I'm sure that's convenient for SDK authors but painful to maintain for ingest.

Yet, if we need to do a translation of values and tags, it would be redundant to do that in the individual SDKs. Instead, I would rather - if possible - vow for a dedicated mapping component that does this. This component could maybe also be maintained by the SDK team.

@untitaker
Copy link
Member

untitaker commented Oct 11, 2022

Yet, if we need to do a translation of values and tags, it would be redundant to do that in the individual SDKs.

I don't have an issue with putting more things in Relay. I mainly take issue with the idea that questions concerning infrastructure and software architecture are entirely left unanswered until we have shipped a prototype potentially already used by customers. I don't have a clear suggestion for what we should do (because we don't have an answer to those questions), and I suspect the concern about tags is not the biggest concern with this proposal anyway. If our answer is that SDK changes are still possible after prototyping (incl breakign changes) that's sufficient to me.


Regarding tags specifically

So I think this is going too much into detail, but some concerns with tags off the top of my head:

  1. generically named otel attributes can conflict with user-defined tags
  2. tags have a single static limit applied to them. if there are more than n tags, we have no way of knowing which tags to trim first unless we build a static map of tag names we consider otel-specific
  3. tags can only be strings (so greater-than or smaller-than queries are not possible)

then there are combinations of those problems. for example, let's say the UI is built to interpret foo.bar tag a certain way (because foo.bar has meaning in otel), eg assume an integer value. What happens if the user sends a custom tag of the same name?

you can fix all of those problems by making relay convert tags into some other structure in the payload (which is something we currently don't do). There are two downsides to that:

  • the event json visible in the UI further diverges from what the user sends
  • tag limits are still applied beforehand depending on customer relay version

one other option would be to dump otel attributes into a new custom context, and then let relay figure out which of those attributes can be indexed and which limits to apply. More time would be needed (and an answer to the above questions) to determine whether that's a good idea.

Or, again, you're saying upfront it's fine to change the schema in breaking ways post-prototyping.

@untitaker
Copy link
Member

untitaker commented Oct 11, 2022

Instead, I would rather - if possible - vow for a dedicated mapping component that does this.

I think what you're suggesting here is that we should maintain two separate schemas: ingestion schema maintained by SDKs, and storage schema maintained by infra teams. I think that idea has potential but it's not where we're at right now, and it would break some assumptions both customers (and tooling built by customers) and our internals have. One needs to be a superset of the other, I don't think it's feasible to have them be maintained by two separate teams

@danielkhan
Copy link

danielkhan commented Oct 12, 2022

I mainly take issue with the idea that questions concerning infrastructure and software architecture are entirely left unanswered until we have shipped a prototype potentially already used by customers.

We won't release a prototype to customers, @untitaker.
This right here is the right place to raise all the concerns and discuss solutions and once this is all figured out, we will go ahead and productize this. Nothing will be released that isn't production ready and signed off by all teams this topic touches.

I don't think it's feasible to have them be maintained by two separate teams

That's fine for me. It was just an idea but if we aren't there yet, it's just that.

@AbhiPrasad
Copy link
Member Author

one other option would be to dump otel attributes into a new custom context, and then let relay figure out which of those attributes can be indexed and which limits to apply. More time would be needed (and an answer to the above questions) to determine whether that's a good idea.

So for now, we are going to go with this approach - and then we can evaluate what to do as we see data coming in. So nothing stored in tags for now for transactions.

I'm going to revamps this based on some convos we had, and merge this in as a starting point. The next step we need to take here is to document the SDK API so that we have a baseline to work off of.

Once we're comfortable with that, we can start RFCs / wider convos re: logic in Relay and product implications.

@untitaker
Copy link
Member

Sounds good, thanks @AbhiPrasad!

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
Development

Successfully merging this pull request may close these issues.

7 participants