Return weight value alongside a sampling decision. #180

lizthegrey · 2019-07-11T18:02:41Z

See https://www.honeycomb.io/blog/dynamic-sampling-by-example/2/ for info on why this is necessary.

cc @maplebed on our end who also will be interested in this.

See https://www.honeycomb.io/blog/dynamic-sampling-by-example/2/ for info on why this is necessary.

(this change primarily introduced to force re-running of the CLA test)

c24t · 2019-07-12T20:24:46Z

specification/tracing-api.md

@@ -190,7 +190,7 @@ The OpenTelemetry `SpanContext` representation conforms to the [w3c TraceContext

 `SpanId` A valid span identifier is an 8-byte array with at least one non-zero byte.

-`TraceOptions` contain details about the trace. Unlike Tracestate values, TraceOptions are present in all traces. Currently, the only TraceOption is a boolean `recorded` [flag](https://www.w3.org/TR/trace-context/#recorded-flag-00000001).
+`TraceOptions` contain details about the trace. Unlike Tracestate values, TraceOptions are present in all traces. Currently, TraceOption includes a boolean `recorded` [flag](https://www.w3.org/TR/trace-context/#recorded-flag-00000001) and the `sampleWeight`, which is the approximate number of potentially unsampled events a recorded span represents.


TraceOptions is a bitmask now following the W3C spec, why not put the sample weight in Tracestate which carries vendor-specific info?

The link https://www.w3.org/TR/trace-context/#recorded-flag-00000001 seems to point to non-existing recorded-flag-00000001 anchor.

Try https://www.w3.org/TR/trace-context/#trace-flags

Yes, as per our discussion, putting sample rate in tracestate seems fine as a "vendor-specific" thing specific to OpenTelemetry SDK.

However, I don't want to have a Honeycomb-specific sample rate that doesn't interoperate with the other opentelemetry SDK samplers, etc.

reyang · 2019-07-16T15:34:13Z

specification/sampling-api.md

+
+For instance, its value should be `1 / samplingProbability` for a ProbabilitySampler, or `observedQps / targetQps` for a `RateLimitingSampler`.  With a sampling probability of `0.01`, `COUNT(spans) WHERE service=foo` will return `49*100=4900` rather than the unweighted value `49` and `sum(request_count{service=foo})` will return an exact number like `4930`.
+
+This value should be passed along on the wire between different OpenTelemetry-using applications as part of the trace header, along with the `recorded` bit that `IsSampled()` might influence.


What is the expected behavior on services that implements W3C TraceContext, but is not using OpenTelemetry?
The W3C TraceContext doesn't guarantee that things in tracestate will be passed along (e.g. in case of truncation).

@lizthegrey is there a known way for passing this value (and other sampling decisions, like the sampling algorithm used) via the wire? Are there any known standard ways and wire formats for doing this?

The sampling algorithm used, in general, doesn't matter; only the value matters for being able to reconstruct the data statistically.

There is no standard 😢

https://github.com/census-instrumentation/opencensus-specs/blob/master/trace/Sampling.md is the closest we come, and as @bogdandrutu detailed to me last week, the way OpenCensus does this is that you can look at the bits of trace ID to figure out what fraction of the sample space it occupied (e.g. if the lowest N bits were all 1, then we know it was sampled at least at rate 1 in 2**N), but this isn't statistically as reliable as explicitly passing the value because you might mis-estimate a trace as being worth more than it is, by chance...

I think it requires additional justification to explain why this should be passed on the wire.

A system that records the sampling weight of its root span in the root span's SpanData could reconstruct the weight of the trace, with no propagation of weights.

The one use for propagating weights that I'm aware of: you can force a child to start a root span if the parent was not sampled (due to low probability) and the child explicitly has set a higher probability sampler. Then the child wants to know what rate the parent was sampled at, in order to adjust its own sampling rate accordingly. Is this what you're looking for @lizthegrey?

That assumes that root spans will never be lost (e.g. if the root span isn't async, and there's a panic before it can complete), which seems like not so great of a strategy :)

reyang

Propagating sampling rate across the wire is beyond API spec, we might need to explore this in RFC.

tedsuo · 2019-08-13T03:56:40Z

@lizthegrey I feel this needs further discussion - I list two reasons below.

Sampling is tricky because it relates more to analysis than other parts of the API. In general, exposing a sampling API is problematic, as it focused on how the observability system works internally. As opposed to other APIs, which are focused on allowing the user to report what their code is doing. Exposing internals is concerning to me, as it locks us in to a particular implementation in an area where there is rapid ongoing development. An RFC which describes the reasons why an end user would need to access sampling information programmatically would be helpful. It's clear how an observability system uses weights and other sampling information, but it's less clear why an end user would need this info.

The second issues relates to propagating this information. If we do end up wanting to expose a weight, we need to propagate it somewhere other than tracestate. So the w3c Trace-Context spec would first need to be modified before we expose it here.

I hope these reasons make sense; please let me know if you agree or have any questions. If you do agree, I recommend we close this PR for now, and address this first as an RFC and a W3C proposal.

bogdandrutu · 2019-08-20T00:59:28Z

I think the proposal is not to expose this to the user but to the exporter for every Span. I see no reason to not be able to use the Tracestate for this.

tedsuo · 2019-10-04T04:35:07Z

Hi @lizthegrey, just changing in on this PR. Do you have any further thoughts?

lizthegrey · 2019-10-08T15:34:13Z

We're waiting on open-telemetry/oteps#42 (comment) to be resolved, then we can proceed here.

lizthegrey · 2019-11-12T16:47:19Z

As per OTel Spec SIG meeting, this requires an OTEP first and sample code in one language SDK before we can make this change in the spec. cc @SergeyKanzhelev and @akvanhar whom I'll be collaborating with on this, and @bogdandrutu who had some opinions here.

SergeyKanzhelev · 2019-11-12T16:54:46Z

specification/sampling-api.md

@@ -55,6 +55,16 @@ be displayed on debug pages or in the logs. Example:

 Return sampling decision whether span should be sampled or not.

+### GetSampleWeight


perhaps we can allow sampler to set and modify more tracestate fileds than samplingWeight. So maybe API will be GetTraceState that will return a new tracestate object

SergeyKanzhelev

I think Decision can carry the entire tracestate if willing to modify it. And sampleWeight will be defined in Probability sampler separately as one of the possible keys.

SergeyKanzhelev · 2019-11-12T16:56:28Z

Related PR: #331

jmacd · 2020-01-22T16:28:58Z

Let's close this for now. We can open a new PR when we have a more concrete proposal.

The OTEP text correctly says: >In addition it is also possible to optionally specify spans and events that the transformation should apply to. However, the block after that text is missing the apply_to_spans that the text alludes to. I wrote the text correctly but made a mistake when copy/pasting the yaml from my experimental codebase. I consider this to be a non-breaking change. It is a bug fix since the tag in yaml is supposed to be there according to the text.

Return weight value alongside a sampling decision.

36bd3ca

See https://www.honeycomb.io/blog/dynamic-sampling-by-example/2/ for info on why this is necessary.

lizthegrey requested review from AloisReitbauer, bogdandrutu, c24t, carlosalberto, iredelmeier, reyang, SergeyKanzhelev, songy23, tedsuo and yurishkuro as code owners July 11, 2019 18:02

lizthegrey added 2 commits July 11, 2019 14:11

Also propagate sampleWeight in the SpanContext

f1c7ddc

Update sampling-api.md

6e58ed2

(this change primarily introduced to force re-running of the CLA test)

lizthegrey mentioned this pull request Jul 12, 2019

API: Sampling Decision may need to update tracestate, not only attributes #66

Closed

c24t reviewed Jul 12, 2019

View reviewed changes

reyang reviewed Jul 16, 2019

View reviewed changes

reyang requested changes Jul 16, 2019

View reviewed changes

lizthegrey mentioned this pull request Aug 20, 2019

REQUEST: New membership for lizthegrey open-telemetry/community#160

Closed

6 tasks

SergeyKanzhelev added this to the Alpha v0.3 milestone Oct 16, 2019

SergeyKanzhelev reviewed Nov 12, 2019

View reviewed changes

SergeyKanzhelev suggested changes Nov 12, 2019

View reviewed changes

jmacd mentioned this pull request Nov 15, 2019

Should we support fractional count in SummaryValue open-telemetry/opentelemetry-proto#59

Closed

jmacd closed this Jan 22, 2020

TuckTuckFloof pushed a commit to TuckTuckFloof/opentelemetry-specification that referenced this pull request Oct 15, 2020

Context api alternative implementation (open-telemetry#180)

37b275c

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Return weight value alongside a sampling decision. #180

Return weight value alongside a sampling decision. #180

lizthegrey commented Jul 11, 2019 •

edited

Loading

c24t Jul 12, 2019

tigrannajaryan Aug 20, 2019

lizthegrey Aug 20, 2019 •

edited

Loading

reyang Jul 16, 2019

tigrannajaryan Aug 20, 2019

lizthegrey Aug 20, 2019

jmacd Oct 24, 2019

lizthegrey Oct 25, 2019

reyang left a comment

tedsuo commented Aug 13, 2019

bogdandrutu commented Aug 20, 2019

tedsuo commented Oct 4, 2019

lizthegrey commented Oct 8, 2019

lizthegrey commented Nov 12, 2019

SergeyKanzhelev Nov 12, 2019

SergeyKanzhelev left a comment

SergeyKanzhelev commented Nov 12, 2019

jmacd commented Jan 22, 2020


		For instance, its value should be `1 / samplingProbability` for a ProbabilitySampler, or `observedQps / targetQps` for a `RateLimitingSampler`. With a sampling probability of `0.01`, `COUNT(spans) WHERE service=foo` will return `49*100=4900` rather than the unweighted value `49` and `sum(request_count{service=foo})` will return an exact number like `4930`.

		This value should be passed along on the wire between different OpenTelemetry-using applications as part of the trace header, along with the `recorded` bit that `IsSampled()` might influence.

		@@ -55,6 +55,16 @@ be displayed on debug pages or in the logs. Example:

		Return sampling decision whether span should be sampled or not.

		### GetSampleWeight

Return weight value alongside a sampling decision. #180

Return weight value alongside a sampling decision. #180

Conversation

lizthegrey commented Jul 11, 2019 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

lizthegrey Aug 20, 2019 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

reyang left a comment

Choose a reason for hiding this comment

tedsuo commented Aug 13, 2019

bogdandrutu commented Aug 20, 2019

tedsuo commented Oct 4, 2019

lizthegrey commented Oct 8, 2019

lizthegrey commented Nov 12, 2019

Choose a reason for hiding this comment

SergeyKanzhelev left a comment

Choose a reason for hiding this comment

SergeyKanzhelev commented Nov 12, 2019

jmacd commented Jan 22, 2020

lizthegrey commented Jul 11, 2019 •

edited

Loading

lizthegrey Aug 20, 2019 •

edited

Loading