Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Return weight value alongside a sampling decision. #180

Closed

Conversation

lizthegrey
Copy link
Member

@lizthegrey lizthegrey commented Jul 11, 2019

See https://www.honeycomb.io/blog/dynamic-sampling-by-example/2/ for info on why this is necessary.

RFC in flight: open-telemetry/oteps#24

cc @maplebed on our end who also will be interested in this.

(this change primarily introduced to force re-running of the CLA test)
@@ -190,7 +190,7 @@ The OpenTelemetry `SpanContext` representation conforms to the [w3c TraceContext

`SpanId` A valid span identifier is an 8-byte array with at least one non-zero byte.

`TraceOptions` contain details about the trace. Unlike Tracestate values, TraceOptions are present in all traces. Currently, the only TraceOption is a boolean `recorded` [flag](https://www.w3.org/TR/trace-context/#recorded-flag-00000001).
`TraceOptions` contain details about the trace. Unlike Tracestate values, TraceOptions are present in all traces. Currently, TraceOption includes a boolean `recorded` [flag](https://www.w3.org/TR/trace-context/#recorded-flag-00000001) and the `sampleWeight`, which is the approximate number of potentially unsampled events a recorded span represents.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TraceOptions is a bitmask now following the W3C spec, why not put the sample weight in Tracestate which carries vendor-specific info?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The link https://www.w3.org/TR/trace-context/#recorded-flag-00000001 seems to point to non-existing recorded-flag-00000001 anchor.

Copy link
Member Author

@lizthegrey lizthegrey Aug 20, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Try https://www.w3.org/TR/trace-context/#trace-flags

Yes, as per our discussion, putting sample rate in tracestate seems fine as a "vendor-specific" thing specific to OpenTelemetry SDK.

However, I don't want to have a Honeycomb-specific sample rate that doesn't interoperate with the other opentelemetry SDK samplers, etc.


For instance, its value should be `1 / samplingProbability` for a ProbabilitySampler, or `observedQps / targetQps` for a `RateLimitingSampler`. With a sampling probability of `0.01`, `COUNT(spans) WHERE service=foo` will return `49*100=4900` rather than the unweighted value `49` and `sum(request_count{service=foo})` will return an exact number like `4930`.

This value should be passed along on the wire between different OpenTelemetry-using applications as part of the trace header, along with the `recorded` bit that `IsSampled()` might influence.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is the expected behavior on services that implements W3C TraceContext, but is not using OpenTelemetry?
The W3C TraceContext doesn't guarantee that things in tracestate will be passed along (e.g. in case of truncation).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@lizthegrey is there a known way for passing this value (and other sampling decisions, like the sampling algorithm used) via the wire? Are there any known standard ways and wire formats for doing this?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The sampling algorithm used, in general, doesn't matter; only the value matters for being able to reconstruct the data statistically.

There is no standard 😢

https://github.com/census-instrumentation/opencensus-specs/blob/master/trace/Sampling.md is the closest we come, and as @bogdandrutu detailed to me last week, the way OpenCensus does this is that you can look at the bits of trace ID to figure out what fraction of the sample space it occupied (e.g. if the lowest N bits were all 1, then we know it was sampled at least at rate 1 in 2**N), but this isn't statistically as reliable as explicitly passing the value because you might mis-estimate a trace as being worth more than it is, by chance...

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it requires additional justification to explain why this should be passed on the wire.

A system that records the sampling weight of its root span in the root span's SpanData could reconstruct the weight of the trace, with no propagation of weights.

The one use for propagating weights that I'm aware of: you can force a child to start a root span if the parent was not sampled (due to low probability) and the child explicitly has set a higher probability sampler. Then the child wants to know what rate the parent was sampled at, in order to adjust its own sampling rate accordingly. Is this what you're looking for @lizthegrey?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That assumes that root spans will never be lost (e.g. if the root span isn't async, and there's a panic before it can complete), which seems like not so great of a strategy :)

Copy link
Member

@reyang reyang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Propagating sampling rate across the wire is beyond API spec, we might need to explore this in RFC.

@tedsuo
Copy link
Contributor

tedsuo commented Aug 13, 2019

@lizthegrey I feel this needs further discussion - I list two reasons below.

Sampling is tricky because it relates more to analysis than other parts of the API. In general, exposing a sampling API is problematic, as it focused on how the observability system works internally. As opposed to other APIs, which are focused on allowing the user to report what their code is doing. Exposing internals is concerning to me, as it locks us in to a particular implementation in an area where there is rapid ongoing development. An RFC which describes the reasons why an end user would need to access sampling information programmatically would be helpful. It's clear how an observability system uses weights and other sampling information, but it's less clear why an end user would need this info.

The second issues relates to propagating this information. If we do end up wanting to expose a weight, we need to propagate it somewhere other than tracestate. So the w3c Trace-Context spec would first need to be modified before we expose it here.

I hope these reasons make sense; please let me know if you agree or have any questions. If you do agree, I recommend we close this PR for now, and address this first as an RFC and a W3C proposal.

@bogdandrutu
Copy link
Member

I think the proposal is not to expose this to the user but to the exporter for every Span. I see no reason to not be able to use the Tracestate for this.

@tedsuo
Copy link
Contributor

tedsuo commented Oct 4, 2019

Hi @lizthegrey, just changing in on this PR. Do you have any further thoughts?

@lizthegrey
Copy link
Member Author

We're waiting on open-telemetry/oteps#42 (comment) to be resolved, then we can proceed here.

@SergeyKanzhelev SergeyKanzhelev added this to the Alpha v0.3 milestone Oct 16, 2019
@lizthegrey
Copy link
Member Author

As per OTel Spec SIG meeting, this requires an OTEP first and sample code in one language SDK before we can make this change in the spec. cc @SergeyKanzhelev and @akvanhar whom I'll be collaborating with on this, and @bogdandrutu who had some opinions here.

@@ -55,6 +55,16 @@ be displayed on debug pages or in the logs. Example:

Return sampling decision whether span should be sampled or not.

### GetSampleWeight
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

perhaps we can allow sampler to set and modify more tracestate fileds than samplingWeight. So maybe API will be GetTraceState that will return a new tracestate object

Copy link
Member

@SergeyKanzhelev SergeyKanzhelev left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think Decision can carry the entire tracestate if willing to modify it. And sampleWeight will be defined in Probability sampler separately as one of the possible keys.

@SergeyKanzhelev
Copy link
Member

Related PR: #331

@jmacd
Copy link
Contributor

jmacd commented Jan 22, 2020

Let's close this for now. We can open a new PR when we have a more concrete proposal.

@jmacd jmacd closed this Jan 22, 2020
TuckTuckFloof pushed a commit to TuckTuckFloof/opentelemetry-specification that referenced this pull request Oct 15, 2020
carlosalberto pushed a commit to carlosalberto/opentelemetry-specification that referenced this pull request Oct 21, 2024
The OTEP text correctly says:
>In addition it is also possible to optionally specify spans and events that the transformation should apply to.

However, the block after that text is missing the apply_to_spans
that the text alludes to. I wrote the text correctly but made a
mistake when copy/pasting the yaml from my experimental codebase.

I consider this to be a non-breaking change. It is a bug fix since
the tag in yaml is supposed to be there according to the text.
carlosalberto pushed a commit to carlosalberto/opentelemetry-specification that referenced this pull request Oct 23, 2024
The OTEP text correctly says:
>In addition it is also possible to optionally specify spans and events that the transformation should apply to.

However, the block after that text is missing the apply_to_spans
that the text alludes to. I wrote the text correctly but made a
mistake when copy/pasting the yaml from my experimental codebase.

I consider this to be a non-breaking change. It is a bug fix since
the tag in yaml is supposed to be there according to the text.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

8 participants