Non-power-of-two consistent tail probability sampling in TraceState #226
# Non-power-of-two Probability Sampling using 56 random TraceID bits

## Motivation

The existing, experimental [specification for probability sampling
using
TraceState](https://github.com/open-telemetry/opentelemetry-specification/blob/main/specification/trace/tracestate-probability-sampling.md)
supporting Span-to-Metrics pipelines is limited to power-of-two
probabilities and is designed to work without making assumptions about
TraceID randomness. The existing mechanism could only achieve
non-power-of-two sampling using interpolation between powers of two,
which was only possible at head-sampling time. It could not be used
with non-power-of-two sampling probabilities for span sampling in the
rest of the collection path. This proposal aims to address these two
limitations for a couple of reasons:

1. Certain customers want support for non-power-of-two probabilities
   (e.g., a 10% sampling rate or a 75% sampling rate), and it should
   be possible to support them cleanly irrespective of where the
   sampling is happening.
2. There is a need for consistent sampling in the collection path
   (outside of the head-sampling paths), and using the inherent
   randomness in the TraceID is a less-expensive solution than
   referencing a custom "r-value" from the tracestate in every span.

In this proposal, we will cover how this new mechanism can be used in
both head-based sampling and different forms of tail-based sampling.

The term "Tail sampling" is in common use to describe _various_ forms | ||
of sampling that take place after a span starts. The term "Tail" in | ||
this phrase distinguishes other techniques from head sampling, however | ||
the term is only broadly descriptive. | ||
|
||
Head sampling requires the use of TraceState to propagate context
about sampling decisions from parent spans to child spans. With
sampling information included in the TraceState, spans can be labeled
with their effective adjusted count, making it possible to count spans
as they arrive at their destination in real time, that is, before
assembling complete traces.

Here, the term Intermediate Span Sampling is used to describe sampling
performed on individual spans at any point in their collection path.
Like Head sampling, Intermediate Span Sampling benefits from being
consistent, because it makes recovery of complete traces possible
after spans have been independently sampled. On the other hand, when
"Tail sampling" refers to sampling of complete traces, sampling
consistency is not an important property.

Intermediate Span Sampling is exemplified by the
[OpenTelemetry-Collector-Contrib's `probabilisticsampler`
processor](https://pkg.go.dev/github.com/open-telemetry/opentelemetry-collector-contrib/processor/probabilisticsamplerprocessor).
This proposal is motivated by wanting to compute Span-to-Metrics from
spans that have been sampled by such a processor.

> **Review comment:** It will be good to cover (later in the document)
> what should happen to the Adjusted Counts in the spans when both
> these mechanisms (head sampling and tail sampling) are present. Not
> sure if such a hybrid usage is a common scenario, but in this case,
> if the trace is tail-sampled, ideally we would need each span's (of
> the sampled trace) adjusted count to be updated accordingly, correct?
> That might be challenging as it involves processing/updating every
> span's data.

> **Reply:** If you are doing probabilistic head sampling and some sort
> of non-TraceID-based sampling in the tail (keep 10% of spans with
> http 400 errors), then you need to multiply the probabilities. If you
> are doing purely probabilistic sampling at both head and tail, based
> on TraceID, you should update the adjusted count to the minimum of
> the two probabilities. Although this is open to a little bit of
> interpretation in what actually happens, because you may be intending
> to reduce overall traffic regardless of what happened in the head.
> Example: You sample at the head by 1 in 10 for ServiceA, a
> particularly high-volume service. And then you add a tail sampler
> setup to probabilistically sample by 1 in 4 for all your traffic. You
> might find it surprising that the volume for ServiceA doesn't change!
> I can imagine someone expecting that ServiceA is now sampled at 1 in
> 40. This may require a few detailed scenarios in the discussion.

This proposal makes use of the [draft-standard W3C tracecontext
`random`
flag](https://w3c.github.io/trace-context/#random-trace-id-flag),
which is an indicator that 56 bits of true randomness are available
for probability sampler decisions. The benefit is that this inherently
random value can be used by intermediate span samplers to make
_consistent_ sampling decisions. It is a less-expensive solution than
the earlier proposal of looking up the r-value from the tracestate of
each span.

This proposal creates a specification with support for 56-bit
precision consistent Head and Intermediate Span sampling. Because
this proposal is also meant for use with Head sampling, a new member
of the OpenTelemetry TraceState field will be defined. Intermediate
Span Samplers will modify the TraceState field of spans they sample.

Note also there is interest in probabilistic sampling of OpenTelemetry
Log Records on the collection path too. This proposal recommends the
creation of a new field in the OpenTelemetry Log Record with
equivalent use and interpretation as the (W3C trace-context)
TraceState field. It would be appropriate to name this field
`LogState`.

> **Review comment:** Would there be any context propagation involved
> here, or is it just standardizing a field/column in this record to
> help extrapolate metrics? If it is only the latter, the statement
> "with equivalent use and interpretation as the (W3C trace-context)"
> may need to be reworded.

> **Reply:** OK, probably should be reworded. But I do think that if
> you were not recording spans and you were recording logs, you'd
> record the TraceState in your log record. At that point, maybe it
> should still be called TraceState.

This proposal makes r-value an optional 56-bit number as opposed to a
required 6-bit number. When the r-value is supplied, it acts as an
alternative source of randomness, which allows tail samplers to
support versions of tracecontext without the `random` bit as well as
more advanced use-cases. For example, independent traces can be
consistently sampled by starting them with identical r-values.

This proposal deprecates the experimental p-value. For existing
stored data, the specification may recommend replacing `p:X` with an
equivalent t-value; for example, `p:2` can be replaced by `t:4` and
`p:20` can be replaced by `t:0x1p-20`.

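A minimal Go sketch of this replacement rule, assuming `p:X` encodes
the power-of-two sampling probability 2^-X; the cutoff between the
integer and hexadecimal-float encodings is an illustrative choice, not
part of the proposal:

```go
package sampling

import "fmt"

// pToT converts a deprecated p-value into an equivalent t-value
// string. Small exponents read naturally as integer adjusted counts
// (p:2 -> t:4); larger ones as hexadecimal-float probabilities
// (p:20 -> t:0x1p-20).
func pToT(p uint) string {
	if p <= 10 {
		return fmt.Sprintf("t:%d", uint64(1)<<p)
	}
	return fmt.Sprintf("t:0x1p-%d", p)
}
```
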
## Explanation

This document proposes a new OpenTelemetry-specific tracestate value
called t-value. This t-value encodes either the sampling probability
(a floating point value) directly or the "adjusted count" of a span
(an integer). The letter "t" here is a shorthand for "threshold". The
value encoded here can be mapped to a threshold value that a sampler
can compare to a value formed using the rightmost 7 bytes of the
TraceID.

The syntax of the r-value changes in this proposal, as it contains 56
bits of information. The recommended syntax is to use 14 hexadecimal
characters (e.g., `r:1a2b3c4d5e6f78`). The specification will
recommend samplers drop invalid r-values, so that existing
implementations of r-value are not mistakenly sampled.

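For illustration, a hedged sketch of such validation in Go; the
function name and the exact rejection rules are assumptions for this
example, not normative text:

```go
package sampling

import (
	"regexp"
	"strconv"
)

// rValuePattern matches the recommended syntax: exactly 14 hexadecimal
// characters, i.e., 56 bits.
var rValuePattern = regexp.MustCompile(`^[0-9a-f]{14}$`)

// parseRValue returns the 56-bit random value carried by an r-value,
// or false for anything invalid (such as a legacy 6-bit r-value),
// which per this proposal should be dropped.
func parseRValue(s string) (uint64, bool) {
	if !rValuePattern.MatchString(s) {
		return 0, false
	}
	v, err := strconv.ParseUint(s, 16, 64)
	return v, err == nil
}
```
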
Like the existing specification, r-values will be synthesized as
necessary. However, the specification will recommend that r-values
not be synthesized automatically when the W3C tracecontext `random`
flag is set. To achieve the advanced use-case involving multiple
traces with the same r-value, users should set the r-value in the
tracestate before starting correlated trace root spans.

### Detailed design

Let's look at the details of how this threshold is calculated. This
proposal defines the sampling "threshold" as a 7-byte string used to
make consistent sampling decisions, as follows:

1. When the r-value is present and parses as a 56-bit random value,
   use it; otherwise, bytes 10-16 of the TraceID (counting from 1,
   i.e., the rightmost 7 bytes) are interpreted as a 56-bit random
   value in big-endian byte order.
2. The sampling probability (range `[0x1p-56, 1]`, i.e., `[2^-56, 1]`)
   is multiplied by `0x1p+56` (i.e., 2^56), yielding an unsigned
   Threshold value in the range `[1, 0x1p+56]`.
3. If the unsigned TraceID random value (range `[0, 0x1p+56)`) is less
   than the sampling Threshold, the span is sampled; otherwise it is
   discarded.

> **Review comment (on step 1):** It is worth specifying whether this
> counts from 0 or 1, or, even better, including an annotated traceID
> here, just for clarity.

> **Review comment (on step 2):** I think most readers will be
> unfamiliar with floating point hex notation (I was) and this is
> probably needlessly terse. Similarly below, I might say 2^56 rather
> than using the hex notation.

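For concreteness, a minimal Go sketch of the three steps above; the
TraceID bytes and all names are illustrative, not taken from the
specification:

```go
package main

import (
	"encoding/binary"
	"fmt"
	"math"
)

// randomValue interprets the rightmost 7 bytes of a 16-byte TraceID
// (bytes 10-16, counting from 1) as an unsigned 56-bit big-endian
// value.
func randomValue(traceID [16]byte) uint64 {
	var buf [8]byte
	copy(buf[1:], traceID[9:])
	return binary.BigEndian.Uint64(buf[:])
}

// threshold maps a sampling probability in [0x1p-56, 1] to an unsigned
// Threshold in [1, 0x1p+56], rounding to the nearest integer.
func threshold(probability float64) uint64 {
	return uint64(math.Round(probability * 0x1p+56))
}

// shouldSample applies step 3: sample when the random value is less
// than the Threshold.
func shouldSample(traceID [16]byte, probability float64) bool {
	return randomValue(traceID) < threshold(probability)
}

func main() {
	traceID := [16]byte{
		0x4b, 0xf9, 0x2f, 0x35, 0x77, 0xb3, 0x4d, 0xa6,
		0xa3, 0xce, 0x92, 0x9d, 0x0e, 0x0e, 0x47, 0x36,
	}
	fmt.Println(shouldSample(traceID, 0.1)) // false for this TraceID
}
```
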
For head samplers, there is an opportunity to synthesize a new r-value
when the tracecontext does not set the `random` bit (as the existing
specification recommends synthesizing r-values for head samplers
whenever one is missing). However, this opportunity is not available
to tail samplers.

To calculate the sampling Threshold, we begin with an IEEE 754
standard double-precision floating point number. With a 52-bit
significand field (53 bits of effective precision) and a floating
exponent, the probability value used to calculate a threshold may be
capable of representing more or less precision than the sampler can
execute.

> **Review comment:** Double-precision floating-point values have a
> 52-bit mantissa but are able to represent 53-bit significands (except
> for subnormal values). See
> https://cs.stackexchange.com/a/152267/102560.

We have many ways of encoding a floating point number as a string,
some of which result in loss of precision. This specification dictates
exactly how to calculate a sampling threshold from a floating point
number, and it is the sampling threshold that determines the exact
effective sampling probability. The conversion between sampling
probability and threshold is not always reversible, so to determine
the sampling probability exactly from an encoded t-value, first
compute the exact sampling threshold, then use the threshold to derive
the exact sampling probability.

> **Review comment:** Nit: This is the first reference to t-value in
> this document, but t-value hasn't been introduced yet. Update: Above,
> I have proposed a short high-level introduction to t-value. [Overall
> my general feedback is that it would be good to first explain the
> 10,000 foot view of the new proposal before this section, which dives
> too much into the low-level details of the exact calculation
> approach.]

From the exact sampling probability, we are able to compute (subject
to machine precision) the adjusted count of each span. For example,
given a sampling probability encoded as "0.1", we first compute the
nearest base-2 floating point value, which is exactly
0x1.999999999999ap-04, or approximately 0.10000000000000000555. The
exact quantity in this example, 0x1.999999999999ap-04, is multiplied
by `0x1p+56` and rounded to an unsigned integer (7205759403792794).
This specification says that to carry out sampling probability "0.1",
we should keep Traces whose least-significant 56 bits form an unsigned
value less than 7205759403792794.

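A short check of this arithmetic in Go, whose `%x` verb prints a float
in hexadecimal floating point notation:

```go
package main

import (
	"fmt"
	"math"
)

func main() {
	probability := 0.1
	// The nearest base-2 floating point value to 0.1:
	fmt.Printf("%x\n", probability) // 0x1.999999999999ap-04
	// Multiplied by 2^56 and rounded to an unsigned integer:
	fmt.Println(uint64(math.Round(probability * 0x1p+56))) // 7205759403792794
}
```
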
## T-value encoding for adjusted counts

> **Review comment:** It will be good to define the mutation rules and
> propagation rules for t-value.

> **Reply:** Not quite answering your question, but I've prototyped
> open-telemetry/opentelemetry-collector-contrib#22058 with a different
> sort of answer. In this case referring to span data records, where
> there are multiple collectors in a pipeline. The first collector may
> sample at 1/10; when a subsequent collector samples at 1/20, the
> t-value of the selected spans will be updated. If the subsequent
> collector samples at 1/2, however, it is being less selective than
> the first collector, so it should not modify the t-value. That is to
> say that t-value adjusted counts should not fall and t-value
> probabilities should not rise. See the logic here:
> https://github.com/open-telemetry/opentelemetry-collector-contrib/pull/22058/files#diff-33f10350e2875f926dd2be6fc4c6bb88cfd8043cf6ac6d100295cf654771d90dR210-R219

> **Reply:** I think there's a problem with such sampling behavior.
> Let's assume that the previous collector in the chain sampled all
> traces with errors with probability 1, and all remaining traces with
> 1/100. If the next collector in the chain is configured with 1/10, it
> will not touch the healthy traces, but will decimate the traces with
> errors. So any stratified sampling logic must be known and repeated
> by all collectors in the pipeline. Even if we prohibit stratified
> sampling, to set up a collector sampling probability in any
> meaningful way we have to know the minimum sampling probability of
> all the preceding collectors.

The example used sampling probability "0.1", which is a concisely
rounded value but not exactly a power of two. The use of decimal
floating point in this case conceals the fact that there is an integer
reciprocal, and when there is an integer reciprocal there are good
reasons to preserve it. Rather than encoding "0.1", it is appealing
to encode the adjusted count (i.e., "10") because it conveys exactly
the user's intention.

This suggests that the t-value encoding be designed to accept either
the sampling probability or the adjusted count, depending on how the
sampling probability was derived. Thus, the proposed t-value shall be
parsed as a floating point or integer number using any POSIX-supported
printf format specifier. Values in the range [0x1p-56, 0x1p+56] are
valid. Values in the range [0x1p-56, 1] are interpreted as a sampling
probability, while values in the range [1, 0x1p+56] are interpreted as
an adjusted count. Adjusted count values must be integers, while
sampling probability values can be arbitrary floating point values.

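A minimal sketch of this parsing rule in Go; note that
`strconv.ParseFloat` accepts both decimal and hexadecimal-float
inputs, and the function name and error handling here are
illustrative:

```go
package sampling

import (
	"fmt"
	"math"
	"strconv"
)

// parseTValue interprets a t-value: values in [0x1p-56, 1] are
// sampling probabilities, and integer values in [1, 0x1p+56] are
// adjusted counts. Both forms are returned.
func parseTValue(s string) (probability, adjustedCount float64, err error) {
	v, err := strconv.ParseFloat(s, 64) // accepts "0.1", "10", "0x1p-20"
	if err != nil {
		return 0, 0, err
	}
	switch {
	case v >= 0x1p-56 && v <= 1:
		return v, 1 / v, nil // stated as a sampling probability
	case v > 1 && v <= 0x1p+56:
		if v != math.Trunc(v) {
			return 0, 0, fmt.Errorf("adjusted count must be an integer: %q", s)
		}
		return 1 / v, v, nil // stated as an adjusted count
	default:
		return 0, 0, fmt.Errorf("t-value out of range: %q", s)
	}
}
```
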
Whether to encode sampling probability or adjusted count is a choice.
In both cases, the interpreted value translates into an exact
threshold, which determines the exact inclusion probability. From the
exact inclusion probability, we can determine the adjusted count to
use in a span-to-metrics pipeline. When the t-value is _stated_ as an
adjusted count (as opposed to a sampling probability), implementations
can use the integer value in a span-to-metrics pipeline. Otherwise,
implementations should use an adjusted count of 1 divided by the
sampling probability.

> **Review comment:** This is a minor thing, but perhaps a section
> describing how to encode powers of two sample probabilities would be
> helpful. Since I am not 100% familiar with the POSIX-supported printf
> format, I wonder what would be the most efficient way.

> **Review comment (continued):** The reason I ask is that powers of
> two sampling probabilities are a natural discretization for me, since
> this is the only discretization that results in integer adjusted
> counts while the relative spacing is constant. Thus, I believe we
> will often see t-values that are powers of two. Therefore, it might
> be useful to define a more compact representation of the t-value if
> it is a power of two. Possibly it makes sense to keep the p-value?

## Where to store t-value in a Span and/or Log Record

As specified, t-value should be encoded in the TraceState field of the
span. Probabilistic head samplers should set t-value in propagated
contexts so that children using ParentBased samplers are correctly
counted.

Although prepared as a solution for Head and Intermediate Span
sampling, the t-value encoding scheme could also be used to convey
Logs sampling. This document proposes to add an optional `LogState`
string to the OTLP LogRecord, defined identically to the W3C
tracecontext `TraceState` field.

## Re-sampling with t-value

It is possible to re-sample spans that have already been sampled,
according to their t-value. This allows a processor to further reduce
the volume of data it is sending by lowering the sampling threshold.

> **Review comment:** I want to understand the expected user behavior
> here. Let's take this example (a 10% head sampler followed by a 5%
> intermediate sampler): 1000 spans are emitted, and ~100 spans are
> sampled by the head sampler (based on the above 10% rate). Now when
> they come to the intermediate sampler, isn't the user expectation
> that only ~5 out of these ~100 spans are sampled (and the rest 95
> discarded)? If my understanding is correct, I was thinking that the
> t-value would need to be updated for those non-discarded spans;
> however, the below proposal seems to be based on a different user
> behavior.

> **Reply:** By my understanding, the head sampler with a rate of 10%
> will export 100 spans (of 1000), and then the 5% sampler will export
> 50 of those spans if it's a consistent sampler. My latest update to
> this OTEP includes an "s-value" discussed in last week's SIG which
> would allow non-consistent independent sampling to be reported, so
> the 5% simple probability sampler could attach an s-value instead.

> **Reply:** For the record, I agree with @kalyanaj that a 5% sampling
> rate for intermediate or tail sampling should mean that the volume of
> spans after sampling is about 20 times smaller than the input volume.
> Otherwise, assuming possible different sampling rates for the spans
> from the input stream, the user cannot really predict what the effect
> of 5% sampling would be.

In such a sampler, the incoming span will be inspected for an existing
t-value. If found, the incoming t-value is converted to a sampling
threshold and compared against the new threshold. There are two cases
(a code sketch follows the list):

- If the Threshold calculated from the incoming t-value is less than
  or equal to the current sampler's Threshold, the outgoing t-value is
  copied from the incoming t-value. In this case, the span had already
  been sampled with a less-than-or-equal probability compared with the
  current sampler, so for consistency the span simply passes through.
- If the Threshold calculated from the incoming t-value is larger than
  the current sampler's Threshold, the current sampler's Threshold is
  re-applied; if the TraceID random value is less than the current
  sampler's Threshold, the span passes through with the current
  sampler's t-value; otherwise the span is discarded.

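A minimal Go sketch of these two cases, operating on already-decoded
thresholds; the names are illustrative:

```go
package sampling

// resample applies the two cases above, given the span's 56-bit
// TraceID random value, the Threshold decoded from the incoming
// t-value, and the current sampler's Threshold. It reports whether the
// span passes and which Threshold the outgoing t-value should encode.
func resample(traceRandom, incoming, current uint64) (keep bool, outgoing uint64) {
	if incoming <= current {
		// Already sampled with a less-than-or-equal probability:
		// pass through unchanged.
		return true, incoming
	}
	// Re-apply the current sampler's threshold.
	if traceRandom < current {
		return true, current
	}
	return false, 0
}
```
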
## S-value encoding for non-consistent adjusted counts

There are cases where sampling does not need to be consistent or is
intentionally not consistent. Existing samplers often apply a simple
probability test, for example. This specification recommends
introducing a new tracestate member `s-value` for conveying the
accumulated adjusted count due to independent sampling stages.

Unlike resampling with `t-value`, independent non-consistent samplers
will multiply the effect of their sampling into `s-value`.

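A one-line sketch of this multiplication rule; the function name is
illustrative:

```go
package sampling

// updateSValue folds one non-consistent sampling stage into the
// incoming s-value (use 1 when no s-value is present). For example,
// two successive 10% stages yield 0.1 * 0.1 = 0.01, i.e., "ot=s:0.01".
func updateSValue(incoming, stageProbability float64) float64 {
	return incoming * stageProbability
}
```
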
## Examples

> **Review comment:** It would be good to add two more examples that
> show how consistent probability sampling can be achieved across
> multiple participants.

> **Reply:** These examples sound good to me! Will do.

### 90% consistent intermediate span sampling

A span that has been sampled at 90% by an intermediate processor will
have `ot=t:0.9` added to its TraceState field in the Span record. The
sampling threshold is `0.9 * 0x1p+56`.

### 90% head consistent sampling

A span that has been sampled at 90% by a head sampler will add
`ot=t:0.9` to the TraceState context propagated to its children and
record the same in its Span record. The sampling threshold is `0.9 *
0x1p+56`.

### 1-in-3 consistent sampling

The tracestate value `ot=t:3` corresponds with 1-in-3 sampling. The
sampling threshold is `1/3 * 0x1p+56`.

### 30% simple probability sampling

The tracestate value `ot=s:0.3` corresponds with 30% sampling by one
or more sampling stages. This would be the tracestate recorded by
`probabilisticsampler` when using a `HashSeed` configuration instead
of the consistent approach.

### 10% probability sampling twice

The tracestate value `ot=s:0.01` corresponds with 10% sampling by one
stage and then 10% sampling by a second stage: the first stage records
`ot=s:0.1`, and the second stage multiplies its own 10% into the
incoming value, yielding `ot=s:0.01`.

> **Review comment:** Maybe expand this to show how the tracestate
> would be modified at each stage?

## Trade-offs and mitigations

Support for encoding t-value as either a probability or an adjusted
count is meant to give the user control over loss of precision. At
the same time, it can be read by humans.

Floating point numbers can be encoded exactly to avoid ambiguity, for
example, using hexadecimal floating point representation. Likewise,
adjusted counts can be encoded exactly as integers to convey the
user's intended sampling probability without floating point conversion
loss.

## Prior art and alternatives

> **Review comment:** Towards the end, we may want to call out that one
> benefit of the r-value based randomness was that it could be used to
> get consistent sampling across multiple traces (e.g., all traces
> started within a time window by a participant). It would be good to
> call out that it should be possible to support it in the future as a
> complement to the current proposal.

> **Reply:** If we decide to use arbitrary sampling probabilities, we
> should not use the current definition of the r-value. It makes no
> sense to have different discretizations for the r-value (powers of
> two) and for the t-value (56-bit values). Therefore, the r-value
> should rather be a 14-digit hex value that overrides the random bits
> of the trace ID, if present. This way we could also handle traces
> where the random flag is not set in the trace context. If the flag is
> not set and there is also no r-value, we could require consistent
> samplers to set the r-value by generating a 56-bit random value.

The existing p-value, r-value mechanism could only achieve
non-power-of-two sampling using interpolation between powers of two,
which was only possible at the head. That specification could not be
used for Intermediate Span sampling using non-power-of-two sampling
probabilities.

There is a case to be made that users who apply simple probability
sampling with hard-coded probabilities are not asking for what they
really want, which is to apply a rate limit in their sampler. It is
true that rate-limited sampling can be achieved while confined to
power-of-two sampling probabilities, but we feel this does not
diminish the case for simply supporting non-power-of-two
probabilities.

> **Review comment (on "sampling consistency is not an important
> property" for whole-trace tail sampling):** Isn't consistent sampling
> a good goal to have even with tail-sampling? Maybe it is easier to
> achieve (if a system has assembled all the spans of a trace), but I
> didn't get why it is "not an important property".

> **Reply:** I think the point is that when "tail sampling" means
> "making the decision when you have the entire trace available",
> you're by definition making a self-consistent decision since there's
> only one instance making the decision. You either get all the spans
> or none of them. It doesn't really matter very much in this context
> whether a different instance would have come to the same decision.
> It's also the case that we are sometimes talking about
> non-probabilistic decisions like "is there an exception span event in
> this trace?"