Starfish Tracing Model #83

Draft · wants to merge 6 commits into `main`
`text/0083-starfish-tracing-model.md` (214 additions, 0 deletions)
- Start Date: 2023-03-29
- RFC Type: feature
- RFC PR: [#83](https://github.com/getsentry/rfcs/pull/83)
- RFC Status: draft

# Summary

This RFC proposes a new tracing model for Sentry's performance product to better
address future product improvements. The goal of this model is to allow
**storing entire traces without gaps**, to support **dynamic sampling** and
**indexing of spans**, and to **extract metrics from spans pre-sampling**. This is an
evolution of the current transaction-based approach.

# Motivation

Today Sentry has a strong concept of a "transaction" which appears in multiple parts of the
product. It is at once the transport for span data, the billable entity, and the only indexed
part of the product experience. This means that spans existing outside of a transaction
cannot be represented, and that spans within a transaction are not indexed.

The existing model has worked well for us to get started with evolving the Sentry errors
product to capture performance traces, but it has restricted our ability to evolve the product.
It has created some restrictions on the SDK technology side (high memory pressure, payload
size limits) and has promoted a separation of transactions and spans at the API layer which is
atypical for tracing products. It also means that Sentry has challenges accepting traces
coming directly from an OpenTelemetry exporter, since the transaction concept does not exist
in OpenTelemetry.

We want to set a future direction that enables more flexible product choices and that we can
move towards from our existing tracing model. The goals are:

* Support capturing entire traces
* Have a data model that allows higher compatibility with OpenTelemetry. Specifically,
  we want a model that would permit us to ingest OpenTelemetry data right from an exporter
* Have a clear story for indexing and extracting metrics on a per-span level
* Unify spans and transactions from an SDK perspective
* Enable a path that allows clients to send 100% of spans outwards, at least to a local aggregator,
  preferably a remote relay
> **Review comment (on lines +37 to +38):** Which types of aggregation do you have in mind here? Is that the usual metrics extraction or something else?

* To better and directly support dynamic sampling in the core tracing model

We want to lay out a better path forward in which:

* entire traces are captured
* a browser tab equals a trace
* metrics are indexed and extracted on a span level
* clients send 100% of metrics
> **Review comment (Member):** Maybe this is in the TODO, but how does this proposal help with this "100% sending" or the memory-pressure/payload-size issue listed above? What about the current CPU overhead? It is helpful for me to remember that our own sampling limits, even after dynamic sampling has been implemented, are often set extremely low for performance reasons: https://github.com/getsentry/sentry/blob/cc47128efd5a00e66b2475acc318ff57860fdf94/src/sentry/utils/sdk.py#L35
>
> **Review comment (@JoshFerge, Apr 3, 2023):** Even with a local relay I don't understand how we could ever have metrics sampled enough to be useful for SLAs/alerts without severely degrading the performance of medium-throughput and above processes. It seems to me the only way to accomplish this is metrics collection directly in the process.
>
> **Review comment (Member):** Ah, maybe this is what "extract metrics from spans pre-sampling" is going to address?

* dynamic sampling narrows in on traces

# Terms

The new tracing model is an extension to our existing tracing model. As such we try to adhere
to some of the existing terms. Note that this document is intentionally glossing over some of
the details to better describe the desired end result. Individual RFCs will have to be written
to narrow down on specific schema definitions.

## Session

A session is an optional concept that applies when talking about user actors on the system. A
session outlives one or more traces and is used exclusively when talking about human
interactions with a system.

## Trace

A trace has no end. It bundles spans together; some of those spans are organized
into segments (marked in red in the graph below). The user experience does not center
around the trace, which is really an internal way to bundle things together; rather,
it narrows down on segments within the trace.

```mermaid
gantt
title Example Starfish Trace
dateFormat x
axisFormat %S.%L

section Frontend
/checkout :crit, 0, 1500ms
GET /api/session :150, 170ms
POST /api/analytics :190, 70ms
GET /api/checkout/state :200, 500ms
GET /api/checkout/cart :1100, 140ms
<App/> :1300, 180ms
POST /api/analytics :done, 1450, 70ms
GET /assistant/poll :done, 1450, 120ms
POST /api/analytics :done, 1580, 70ms

section API Service
/api/checkout/state :crit, 240, 440ms
cache.get session#58;[redacted] :360, 10ms
db.query select from session :370, 20ms
db.query select from user :390, 20ms
db.query select from checkout :410, 20ms
http.request GET http#58;//payments/poll :450, 210ms
thread.spawn refresh-checkout-cache :done, 670, 220ms

section Payment Service
/poll :crit, 470, 180ms
db.query select from payment :490, 30ms
db.query update payment :530, 60ms
```

## Span

Spans behave very similarly to how they function today, but they get elevated to a more significant
level. They largely follow the general semantics in the wider tracing ecosystem. To drive
our product ideas we are going to ensure that the quality of the spans is high and that they
provide at least the following pieces of information:

* `op`: defines the core operation that the span represents (eg: `db.query`)
* `description`: the most significant description of what this operation is (eg: the database query).
> **Review comment (Member):** I would heavily be in favor of renaming this to `name`. For those unfamiliar, we currently have an inconsistency in the Sentry schema where transactions have a `name` attribute and spans have a `description` (functionally identical). I think most users are familiar with `name` as an attribute (closer to OTel and other vendors as well), and it would require less delta for us to approach. We even alias span descriptions to names in the UI.
>
> **Review comment (Member Author):** I think if we want to get rid of `description` as the primary field then we need to rethink how we do the UX around some of those pieces. We currently stash into `description` what others for instance stash into `http.route` etc.
>
> **Review comment (Member):** I don't think we need to get rid of `description`, simply call it `name` instead. It can work the exact same way. The description also gets processed and cleaned in the process.
>
> **Review comment:** How do we expect to use this? When we experimented with span search and when we built suspect spans, one of the major issues was that all the useful information was stored unstructured in this field and all queries would have been regexes. It did not end well. @nikhars, if this is going to be a searchable field you may want to look into ngram indexes.

* `trace_id`: a span relates to one trace by ID
* `parent_span_id`: a span optionally points back to a parent span, which could be from a different
  service
* `segment_id`: a span that is part of a segment always refers back to it by the segment's
  `span_id`.
* `is_segment`: when set to `true`, this span is a segment.
> **Review comment (Member):** If a segment is meant to be either a transaction or an interaction, should we build it into the spec here? So `segment: "transaction" | "interaction" | null`, where `null` says that the span is not a segment.

* `start_time` and `end_time`: give the span its time information.
* `tags`: key/value pairs of arbitrary tags that are set per span. Conceptually, however, spans
  also inherit the tags of the segment they are contained in.
* `measurements`: span-specific metrics that are stored with the span. There is one natural
  measurement of a span, `duration`, which is automatically calculated as the difference between
  the end and start timestamps.

Spans always belong to a trace, but not all spans belong to a segment. Spans that do not belong
to a segment are referred to as "detached" spans; spans that belong to a segment are "attached"
spans. A span may only be attached to a segment if it belongs to the same process and service.
Remote spans must never be attached to a segment.
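
To make the field list concrete, here is a minimal sketch of a segment span and an attached
child span under this model. All IDs and values are invented, and the actual wire schema is
left to the follow-up RFCs mentioned above:

```python
# Hypothetical example payloads illustrating the span fields listed above.
# All IDs and values are invented; the real schema is not specified here.

segment_span = {
    "op": "http.server",
    "description": "/api/checkout/state",
    "trace_id": "4bf92f3577b34da6a3ce929d0e0e4736",
    "span_id": "a1b2c3d4e5f60718",
    "parent_span_id": "00f067aa0ba902b7",  # remote parent in the frontend
    "segment_id": None,                    # a segment does not point at another segment
    "is_segment": True,
    "start_time": 1680000000.240,
    "end_time": 1680000000.680,
    "tags": {"release": "backend@1.3.0", "environment": "prod"},
    "measurements": {"duration": 440.0},
}

attached_child = {
    "op": "db.query",
    "description": "select from checkout",
    "trace_id": segment_span["trace_id"],
    "span_id": "b2c3d4e5f6071829",
    "parent_span_id": segment_span["span_id"],
    "segment_id": segment_span["span_id"],  # attached: refers to the segment's span_id
    "is_segment": False,
    "start_time": 1680000000.410,
    "end_time": 1680000000.430,
    "tags": {},  # conceptually inherits release/environment from the segment
    "measurements": {"duration": 20.0},
}
```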

## Transaction

Previously, "transaction" referred to the container that held spans. In the future, a "transaction"
refers to the name of a specific type of segment-level span that describes a meaningful activity
which starts with a request and results in some meaningful response. By definition, a span that
holds a transaction tag becomes a "segment". Other than that, it's just a regular span.
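
Under this definition a transaction needs no container of its own; a hypothetical sketch
(where exactly the transaction name lives, tag vs. dedicated field, is an assumption here,
not something this RFC specifies):

```python
# Hypothetical: a "transaction" in the new model is just a segment-level
# span carrying a transaction name.
transaction_span = {
    "op": "http.server",
    "description": "GET /api/checkout/state",
    "is_segment": True,  # holding a transaction name makes this span a segment
    "tags": {"transaction": "/api/checkout/state"},
}
```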

## Segment

A segment is a special type of span that represents a "logical" activity in a service. For instance, a
segment can be the endpoint implementation of an API request, a task that is processed
by a task worker, a navigation a user performs in a UI, or a screen transition.
Conceptually, segments fall into two categories: "transactions", which are quite mechanical and
clearly defined operations, and "interactions", which are user-triggered operations. The difference
is that an "interaction" has a human user as an actor who might influence it, whereas a
"transaction" is unlikely to be interrupted once started. A user, for instance, is quite likely to
navigate again even before the previous interaction finished, whereas a task is more likely than not
to conclude, even if whatever triggered the task is no longer interested in its result.

The primary user experience in the product can narrow down on certain segments and make the trace
explorable via that segment.

**Locality:** All the spans that are attached to a segment must thus be local to the same service
and process. It is still possible for a span to relate to a child of a segment, or to a segment
directly, via `parent_span_id`, but in that case `segment_id` must not be set.

**Joining:** At the end of a segment an implicit join takes place. Any span
that has not concluded yet is detached from the segment. In the following example the `<App/>`
span is still part of the `/checkout` segment, whereas the HTTP request related
spans that did not finish when the `/checkout` segment ended are detached:

```mermaid
gantt
title Trace Showing Attached and Detached Spans
dateFormat x
axisFormat %S.%L

section Frontend
/checkout :crit, 0, 500ms
<App/> :300, 180ms
POST /api/analytics :done, 450, 70ms
GET /assistant/poll :done, 450, 120ms
POST /api/analytics :done, 580, 70ms
```
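
One way an SDK could implement this implicit join; a minimal sketch assuming an in-memory
list of child spans per segment (none of these names are real SDK APIs):

```python
import time

def finish_segment(segment, children):
    """Finish a segment and detach children that are still running.

    Assumes simple span objects with `end_time` and `segment_id` attributes;
    this mirrors no existing SDK and only illustrates the join semantics.
    """
    segment.end_time = time.time()
    for child in children:
        if child.end_time is None:
            # The child outlived the segment: it stays in the trace (and keeps
            # its parent_span_id) but no longer refers back to the segment.
            child.segment_id = None
```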

**Logical Tag Promotion:** Tags attached to a segment logically also belong to the
spans contained within it. This does not mean that tags are actually duplicated down
to all child spans; it means, for instance, that if a `release` or `environment`
tag is attached to a segment, it automatically extends to all the child
spans that are attached to that segment. This is particularly relevant for metrics
extraction.
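
For metrics extraction, resolving the effective tags of an attached span could then amount
to something like this sketch (a hypothetical helper, not part of any spec):

```python
def effective_tags(segment_tags, span_tags):
    """Resolve the tags an attached span is logically covered by: segment
    tags apply unless the span overrides them. No duplication on the wire."""
    return {**segment_tags, **span_tags}

# Example: a db span inside a segment tagged with release/environment.
assert effective_tags(
    {"release": "backend@1.3.0", "environment": "prod"},
    {"db.system": "postgres"},
) == {"release": "backend@1.3.0", "environment": "prod", "db.system": "postgres"}
```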

**Logical Metrics Promotion:** Certain metrics relating to the composition of attached
child spans are promoted to the segment. For instance, the breakdowns
(how much time was spent in db vs. http) that previously were a transaction-level
property are now calculated onto the segment. Likewise, we might consider counting
the number of spans of a certain category per segment and having these counters
promoted into the segment.
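
As an illustration, such an op breakdown could be derived from the attached spans roughly
like this (a sketch that deliberately ignores overlap between concurrent spans, which a real
breakdown would need to handle):

```python
from collections import defaultdict

def op_breakdown(attached_spans):
    """Sum span durations per op category, e.g. "db.query" counts as "db".

    Spans are assumed to be dicts shaped like the earlier examples; overlap
    between concurrent spans is ignored for simplicity.
    """
    totals = defaultdict(float)
    for span in attached_spans:
        category = span["op"].split(".", 1)[0]
        totals[category] += span["measurements"]["duration"]
    return dict(totals)  # e.g. {"db": 60.0, "http": 210.0}
```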

## Batches
> **Review comment (Member):** I have an interest in a tracing model that actually works for batched workloads. I'm trying to figure out how a tracing story could work for arroyo, and so far it seems that one would both:
>
> 1. want a transaction wrapped around a function that operates on a batch
> 2. want to be able to correlate that transaction to the origin of each message in that batch
>
> That implies to me that one transaction may have multiple parent traces. It's the only solution I could make sense of, but it sucks for a lot of reasons, so I'm curious whether we have a better plan here.


TODO: document me.

# Metrics

The Starfish tracing model does not enable metrics ingestion, but it allows attaching
metrics to spans. Span-bound metrics are called "measurements". Every span gets a
default measurement called `duration` attached to it, but other measurements can be
added. For instance, LCP and other important web vitals can be attached directly to a
segment as measurements. SDKs can add further measurements to spans if they see
value in it. For instance, if the SDK knows how long it waited to establish a
connection, it could attach a `socket.connect_time` or similar number. Other examples
might be things like `cache.result_size` etc.
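
In SDK terms this could look roughly like the following sketch. Today's SDKs expose
`set_measurement` on transactions; extending it to arbitrary spans, as shown here, is an
assumption about this RFC's direction rather than an existing API:

```python
# Hypothetical span-level measurement API; today `set_measurement` exists on
# transactions, and extending it to every span is what this model implies.
class Span:
    def __init__(self, op, description):
        self.op = op
        self.description = description
        self.measurements = {}

    def set_measurement(self, name, value, unit=None):
        self.measurements[name] = {"value": value, "unit": unit}

span = Span("http.client", "GET http://payments/poll")
span.set_measurement("socket.connect_time", 12.5, unit="millisecond")
span.set_measurement("cache.result_size", 2048, unit="byte")
```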

Measurements are ingested broken down by segment- and span-level tags before dynamic
sampling throws away data.

> **Review comment:** I am a little concerned about breaking down by all tags and letting these be arbitrary. We have plenty of unique values in tags today. Spans currently represent on average 25x the volume of transactions. If the cardinality is extreme (like unique tag values being propagated), pre-aggregation will be worthless independently of the storage used, so we need to be careful here.
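
To make the ordering explicit, here is a rough sketch of "extract, then sample" in the
ingestion pipeline; the bucketing key and all names are invented for illustration:

```python
from collections import defaultdict

def extract_then_sample(spans, keep):
    """Illustrative ordering only: aggregate measurements into metric buckets
    from 100% of spans first, then apply the lossy sampling decision.

    `keep` is any sampling predicate; the bucket key (metric name, op, tags)
    is an invented example of a segment/span-level tag breakdown.
    """
    buckets = defaultdict(list)
    for span in spans:
        for name, value in span["measurements"].items():
            key = (name, span["op"], tuple(sorted(span["tags"].items())))
            buckets[key].append(value)  # metrics see every span
    stored = [span for span in spans if keep(span)]  # indexing sees a sample
    return dict(buckets), stored
```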

# Drawbacks

# TODO

* metrics extraction
* tag propagation
* transitional mapping