-
Notifications
You must be signed in to change notification settings - Fork 889
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Return weight value alongside a sampling decision. #180
Conversation
See https://www.honeycomb.io/blog/dynamic-sampling-by-example/2/ for info on why this is necessary.
(this change primarily introduced to force re-running of the CLA test)
@@ -190,7 +190,7 @@ The OpenTelemetry `SpanContext` representation conforms to the [w3c TraceContext | |||
|
|||
`SpanId` A valid span identifier is an 8-byte array with at least one non-zero byte. | |||
|
|||
`TraceOptions` contain details about the trace. Unlike Tracestate values, TraceOptions are present in all traces. Currently, the only TraceOption is a boolean `recorded` [flag](https://www.w3.org/TR/trace-context/#recorded-flag-00000001). | |||
`TraceOptions` contain details about the trace. Unlike Tracestate values, TraceOptions are present in all traces. Currently, TraceOption includes a boolean `recorded` [flag](https://www.w3.org/TR/trace-context/#recorded-flag-00000001) and the `sampleWeight`, which is the approximate number of potentially unsampled events a recorded span represents. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
TraceOptions
is a bitmask now following the W3C spec, why not put the sample weight in Tracestate
which carries vendor-specific info?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The link https://www.w3.org/TR/trace-context/#recorded-flag-00000001 seems to point to non-existing recorded-flag-00000001
anchor.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Try https://www.w3.org/TR/trace-context/#trace-flags
Yes, as per our discussion, putting sample rate in tracestate seems fine as a "vendor-specific" thing specific to OpenTelemetry SDK.
However, I don't want to have a Honeycomb-specific sample rate that doesn't interoperate with the other opentelemetry SDK samplers, etc.
|
||
For instance, its value should be `1 / samplingProbability` for a ProbabilitySampler, or `observedQps / targetQps` for a `RateLimitingSampler`. With a sampling probability of `0.01`, `COUNT(spans) WHERE service=foo` will return `49*100=4900` rather than the unweighted value `49` and `sum(request_count{service=foo})` will return an exact number like `4930`. | ||
|
||
This value should be passed along on the wire between different OpenTelemetry-using applications as part of the trace header, along with the `recorded` bit that `IsSampled()` might influence. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is the expected behavior on services that implements W3C TraceContext, but is not using OpenTelemetry?
The W3C TraceContext doesn't guarantee that things in tracestate
will be passed along (e.g. in case of truncation).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@lizthegrey is there a known way for passing this value (and other sampling decisions, like the sampling algorithm used) via the wire? Are there any known standard ways and wire formats for doing this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The sampling algorithm used, in general, doesn't matter; only the value matters for being able to reconstruct the data statistically.
There is no standard 😢
https://github.com/census-instrumentation/opencensus-specs/blob/master/trace/Sampling.md is the closest we come, and as @bogdandrutu detailed to me last week, the way OpenCensus does this is that you can look at the bits of trace ID to figure out what fraction of the sample space it occupied (e.g. if the lowest N bits were all 1, then we know it was sampled at least at rate 1 in 2**N), but this isn't statistically as reliable as explicitly passing the value because you might mis-estimate a trace as being worth more than it is, by chance...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it requires additional justification to explain why this should be passed on the wire.
A system that records the sampling weight of its root span in the root span's SpanData could reconstruct the weight of the trace, with no propagation of weights.
The one use for propagating weights that I'm aware of: you can force a child to start a root span if the parent was not sampled (due to low probability) and the child explicitly has set a higher probability sampler. Then the child wants to know what rate the parent was sampled at, in order to adjust its own sampling rate accordingly. Is this what you're looking for @lizthegrey?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That assumes that root spans will never be lost (e.g. if the root span isn't async, and there's a panic before it can complete), which seems like not so great of a strategy :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Propagating sampling rate across the wire is beyond API spec, we might need to explore this in RFC.
@lizthegrey I feel this needs further discussion - I list two reasons below. Sampling is tricky because it relates more to analysis than other parts of the API. In general, exposing a sampling API is problematic, as it focused on how the observability system works internally. As opposed to other APIs, which are focused on allowing the user to report what their code is doing. Exposing internals is concerning to me, as it locks us in to a particular implementation in an area where there is rapid ongoing development. An RFC which describes the reasons why an end user would need to access sampling information programmatically would be helpful. It's clear how an observability system uses weights and other sampling information, but it's less clear why an end user would need this info. The second issues relates to propagating this information. If we do end up wanting to expose a weight, we need to propagate it somewhere other than I hope these reasons make sense; please let me know if you agree or have any questions. If you do agree, I recommend we close this PR for now, and address this first as an RFC and a W3C proposal. |
I think the proposal is not to expose this to the user but to the exporter for every Span. I see no reason to not be able to use the Tracestate for this. |
Hi @lizthegrey, just changing in on this PR. Do you have any further thoughts? |
We're waiting on open-telemetry/oteps#42 (comment) to be resolved, then we can proceed here. |
As per OTel Spec SIG meeting, this requires an OTEP first and sample code in one language SDK before we can make this change in the spec. cc @SergeyKanzhelev and @akvanhar whom I'll be collaborating with on this, and @bogdandrutu who had some opinions here. |
@@ -55,6 +55,16 @@ be displayed on debug pages or in the logs. Example: | |||
|
|||
Return sampling decision whether span should be sampled or not. | |||
|
|||
### GetSampleWeight |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
perhaps we can allow sampler to set and modify more tracestate
fileds than samplingWeight. So maybe API will be GetTraceState
that will return a new tracestate object
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think Decision can carry the entire tracestate if willing to modify it. And sampleWeight will be defined in Probability sampler separately as one of the possible keys.
Related PR: #331 |
Let's close this for now. We can open a new PR when we have a more concrete proposal. |
The OTEP text correctly says: >In addition it is also possible to optionally specify spans and events that the transformation should apply to. However, the block after that text is missing the apply_to_spans that the text alludes to. I wrote the text correctly but made a mistake when copy/pasting the yaml from my experimental codebase. I consider this to be a non-breaking change. It is a bug fix since the tag in yaml is supposed to be there according to the text.
The OTEP text correctly says: >In addition it is also possible to optionally specify spans and events that the transformation should apply to. However, the block after that text is missing the apply_to_spans that the text alludes to. I wrote the text correctly but made a mistake when copy/pasting the yaml from my experimental codebase. I consider this to be a non-breaking change. It is a bug fix since the tag in yaml is supposed to be there according to the text.
See https://www.honeycomb.io/blog/dynamic-sampling-by-example/2/ for info on why this is necessary.
RFC in flight: open-telemetry/oteps#24
cc @maplebed on our end who also will be interested in this.