Non-power-of-two consistent tail probability sampling in TraceState #226

Closed
295 changes: 295 additions & 0 deletions text/trace/0226-sampling-random-traceids.md
# Non-power-of-two Probability Sampling using 56 random TraceID bits

## Motivation

The existing, experimental [specification for probability sampling
using
TraceState](https://github.com/open-telemetry/opentelemetry-specification/blob/main/specification/trace/tracestate-probability-sampling.md)
supporting Span-to-Metrics pipelines is limited to powers-of-two
probabilities and is designed to work without making assumptions about
TraceID randomness. The existing mechanism could only achieve
non-power-of-two sampling using interpolation between powers of two,
which was only possible at head-sampling time; it could not be used
with non-power-of-two sampling probabilities for span sampling
elsewhere in the collection path. This proposal aims to address these
two limitations, for two reasons:

1. Certain customers want support for non-power-of-two probabilities
   (e.g., a 10% or 75% sampling rate), and it should be possible to
   apply them cleanly irrespective of where the sampling is happening.
2. There is a need for consistent sampling in the collection path
   (outside of the head-sampling paths), and using the inherent
   randomness in the traceID is a less expensive solution than
   referencing a custom "r-value" from the tracestate in every span.

In this proposal, we will cover how this new mechanism can be used in
both head-based sampling and different forms of tail-based sampling.

The term "Tail sampling" is in common use to describe _various_ forms
of sampling that take place after a span starts. The term "Tail" in
this phrase distinguishes other techniques from head sampling, however
the term is only broadly descriptive.

Head sampling requires the use of TraceState to propagate context
about sampling decisions from parent spans to child spans. With sampling
information included in the TraceState, spans can be labeled with their
effective adjusted count, making it possible to count spans as they
arrive at their destination in real time, meaning before assembling
complete traces.

Here, the term Intermediate Span Sampling is used to describe sampling
performed on individual spans at any point in their collection path.
Like Head sampling, Intermediate Span Sampling benefits from being
consistent, because consistency makes it possible to recover complete
traces after spans have been independently sampled. On the other hand,
when "Tail sampling" refers to sampling of complete traces, sampling
consistency is not an important property.

> **Review comment:** Isn't consistent sampling a good goal to have even with tail-sampling? Maybe it is easier to achieve (if a system has assembled all the spans of a trace), but I didn't get why it is "not an important property".
>
> **Reply:** I think the point is that when "tail sampling" means "making the decision when you have the entire trace available", you're by definition making a self-consistent decision since there's only one instance making the decision. You either get all the spans or none of them. It doesn't really matter very much in this context whether a different instance would have come to the same decision.
>
> It's also the case that we are sometimes talking about non-probabilistic decisions like "is there an exception span event in this trace?"

Intermediate Span Sampling is exemplified by the
[OpenTelemetry-Collector-Contrib's `probabilisticsampler`
processor](https://pkg.go.dev/github.com/open-telemetry/opentelemetry-collector-contrib/processor/probabilisticsamplerprocessor).
This proposal is motivated by the need to compute Span-to-Metrics from
spans that have been sampled by such a processor.

> **Review comment:** It will be good to cover (later in the document) what should happen to the Adjusted Counts in the spans when both these mechanisms are present. Not sure if such a hybrid usage is a common scenario, but in this case, if the trace is tail-sampled, ideally we would need each span's (of the sampled trace) adjusted count to be updated accordingly, correct? That might be challenging as it involves processing/updating every span's data.
>
> Reference: https://github.com/open-telemetry/opentelemetry-collector-contrib/blob/main/processor/tailsamplingprocessor/README.md#probabilistic-sampling-processor-compared-to-the-tail-sampling-processor-with-the-probabilistic-policy
>
> **Reply (@kentquirk, May 16, 2023):** If you are doing probabilistic head sampling and some sort of non-traceID-based sampling in the tail (keep 10% of spans with http 400 errors) then you need to multiply the probabilities.
>
> If you are doing purely probabilistic sampling at both head and tail, based on TraceID, you should update the adjusted count to the minimum of the two probabilities. Although this is open to a little bit of interpretation in what actually happens, because you may be intending to reduce overall traffic regardless of what happened in the head.
>
> Example: You sample at the head by 1 in 10 for ServiceA, a particularly high-volume service. And then you add a tail sampler setup to probabilistically sample by 1 in 4 for all your traffic. You might find it surprising that the volume for ServiceA doesn't change! I can imagine someone expecting that ServiceA is now sampled at 1 in 40.
>
> This may require a few detailed scenarios in the discussion.


This proposal makes use of the [draft-standard W3C tracecontext
`random`
flag](https://w3c.github.io/trace-context/#random-trace-id-flag),
which is an indicator that 56 bits of true randomness are available
for probability sampler decisions. The benefit is that this inherently
random value can be used by intermediate span samplers to make
_consistent_ sampling decisions, and it is a less expensive solution
than the earlier proposal of looking up the r-value from the
tracestate of each span.

This document proposes a specification with support for 56-bit
precision consistent Head and Intermediate Span sampling. Because
this proposal is also meant for use with Head sampling, a new member
of the OpenTelemetry TraceState field will be defined. Intermediate
Span Samplers will modify the TraceState field of spans they sample.

Note also that there is interest in probabilistic sampling of
OpenTelemetry Log Records on the collection path. This proposal
recommends the creation of a new field in the OpenTelemetry Log Record
with equivalent use and interpretation as the (W3C trace-context)
TraceState field. It would be appropriate to name this field
`LogState`.

> **Review comment:** Would there be any context propagation involved here, or is it just standardizing a field/column in this record to help extrapolate metrics? If it is only the latter, the statement "with equivalent use and interpretation as the (W3C trace-context)" may need to be reworded.
>
> **Reply (author):** OK, probably should be reworded. But I do think that if you were not recording spans and you were recording logs, you'd record the TraceState in your log record. At that point, maybe it should still be called TraceState.

This proposal makes r-value an optional 56-bit number as opposed
to a required 6-bit number. When the r-value is supplied, it acts as
an alternative source of randomness, which allows tail samplers to
support versions of tracecontext without the `random` bit as well as
more advanced use-cases. For example, independent traces can be
consistently sampled by starting them with identical r-values.

This proposal deprecates the experimental p-value. For existing
stored data, the specification may recommend replacing `p:X` with an
equivalent t-value; for example, `p:2` can be replaced by `t:4` and
`p:20` can be replaced by `t:0x1p-20`.
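A minimal Go sketch of this migration rule follows. The helper name and
the cutoff for preferring an integer adjusted count over hexadecimal
notation are illustrative assumptions, not part of the proposal:

```go
package sampling

import (
	"fmt"
	"strconv"
)

// pValueToTValue sketches the suggested migration from the deprecated
// p-value: p:X means a power-of-two sampling probability of 2^-X, which
// can be rewritten either as an integer adjusted count (2^X) or as a
// hexadecimal floating-point probability. The cutoff of 10 is arbitrary
// and chosen only for illustration.
func pValueToTValue(p uint) string {
	if p <= 10 {
		return strconv.FormatUint(uint64(1)<<p, 10) // e.g., p:2 -> t:4
	}
	return fmt.Sprintf("0x1p-%d", p) // e.g., p:20 -> t:0x1p-20
}
```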

## Explanation

This document proposes a new OpenTelemetry specific tracestate value
called t-value. This t-value encodes either the sampling probability
(a floating point value) directly or the "adjusted count" of a span
(an integer). The letter "t" here is a shorthand for "threshold". The
value encoded here can be mapped to a threshold value that a sampler
can compare to a value formed using the rightmost 7 bytes of the
traceID.

The syntax of the r-value changes in this proposal, as it contains 56
bits of information. The recommended syntax is to use 14 hexadecimal
characters (e.g., `r:1a2b3c4d5e6f78`). The specification will
recommend that samplers drop invalid r-values, so that r-values
written by existing implementations are not mistakenly used for
sampling decisions.
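A minimal validation sketch in Go, assuming a hypothetical
`parseRValue` helper (the name and error handling are illustrative):

```go
package sampling

import (
	"fmt"
	"strconv"
)

// parseRValue illustrates the proposed r-value syntax: exactly 14
// hexadecimal characters carrying 56 bits of randomness. Invalid values
// are rejected, so that r-values written by older implementations are
// dropped rather than misinterpreted.
func parseRValue(value string) (uint64, error) {
	if len(value) != 14 {
		return 0, fmt.Errorf("r-value must be 14 hex characters: %q", value)
	}
	r, err := strconv.ParseUint(value, 16, 64)
	if err != nil {
		return 0, fmt.Errorf("invalid r-value %q: %w", value, err)
	}
	return r, nil // in the range [0, 2^56)
}
```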

Like the existing specification, r-values will be synthesized as
necessary. However, the specification will recommend that r-values
not be synthesized automatically when the W3C tracecontext `random`
flag is set. To achieve the advanced use-case involving multiple
traces with the same r-value, users should set the `r-value` in the
tracestate before starting correlated trace root spans.

### Detailed design

Let's look at the details of how this threshold can be calculated.
This proposal defines the sampling "threshold" as a 7-byte string used
to make consistent sampling decisions, as follows.

1. When the r-value is present and parses as a 56-bit random value,
   use it; otherwise, bytes 10-16 of the TraceID (counting from 1,
   i.e., the rightmost 7 bytes) are interpreted as a 56-bit random
   value in big-endian byte order.
2. The sampling probability (range `[0x1p-56, 1]`) is multiplied by
   `0x1p+56`, yielding an unsigned Threshold value in the range
   `[1, 0x1p+56]`.
3. If the unsigned TraceID random value (range `[0, 0x1p+56)`) is
   less than the sampling Threshold, the span is sampled; otherwise it
   is discarded.

> **Review comment (on step 1):** It is worth specifying whether this counts from 0 or 1, or, even better, including an annotated traceID here, just for clarity.
>
> **Review comment (on step 2):** I think most readers will be unfamiliar with floating point hex notation (I was) and this is probably needlessly terse. One way to express it would be (0, 1], but that also might be too confusing. Perhaps "greater than 0 and less than or equal to 1" or even 0 < n <= 1? Similarly below, I might say 2^56 rather than using the hex notation.
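To make the three numbered steps above concrete, here is a small Go
sketch. The function names are illustrative, and rounding the scaled
probability to the nearest integer follows the worked example later in
this document rather than any normative text:

```go
package sampling

import "math"

// randomnessFromTraceID interprets the rightmost 7 bytes of the 16-byte
// TraceID (bytes 10-16, counting from 1) as a 56-bit big-endian unsigned
// value, per step 1 above, when no r-value is present.
func randomnessFromTraceID(traceID [16]byte) uint64 {
	var random uint64
	for _, b := range traceID[9:] {
		random = random<<8 | uint64(b)
	}
	return random
}

// shouldSample applies steps 2 and 3: scale the probability into a
// Threshold in [1, 2^56], then keep the span when the randomness is
// below that Threshold.
func shouldSample(randomness uint64, probability float64) bool {
	threshold := uint64(math.Round(probability * 0x1p56))
	return randomness < threshold
}
```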

For head samplers, there is an opportunity to synthesize a new r-value
when the tracecontext does not set the `random` bit (as the existing
specification recommends synthesizing r-values for head samplers
whenever there is none). However, this opportunity is not available
to tail samplers.

To calculate the Sampling threshold, we begin with an IEEE-754
standard double-precision floating point number. With its 52-bit
mantissa (53 significant bits for normal values) and a floating
exponent, the probability value used to calculate a threshold may
carry more or less precision than the sampler can actually apply.

> **Review comment (@oertl, Jun 1, 2023):** Double-precision floating-point values have a 52-bit mantissa but are able to represent 53-bit significands (except for subnormal values). See https://cs.stackexchange.com/a/152267/102560.

We have many ways of encoding a floating point number as a string,
some of which result in loss of precision. This specification dictates
exactly how to calculate a sampling threshold from a floating point
number, and it is the sampling threshold that determines the exact
effective sampling probability. The conversion between sampling
probability and threshold is not always reversible, so to determine
the sampling probability exactly from an encoded t-value, first
compute the exact sampling threshold, then use the threshold to derive
the exact sampling probability.

> **Review comment:** Nit: This is the first reference to t-value in this document, but t-value hasn't been introduced yet. Update: Above, I have proposed a short high-level introduction to t-value.
>
> [Overall my general feedback is that it would be good to first explain the 10,000 foot view of the new proposal before this section, which dives too much into the low-level details of the exact calculation approach.]

From the exact sampling probability, we are able to compute (subject
to machine precision) the adjusted count of each span. For example,
given a sampling probability encoded as "0.1", we first compute the
nearest base-2 floating point value, which is exactly
0x1.999999999999ap-04, or approximately 0.10000000000000000555. The exact quantity in
this example, 0x1.999999999999ap-04, is multiplied by `0x1p+56` and
rounded to an unsigned integer (7205759403792794). This specification
says that to carry out sampling probability "0.1", we should keep
Traces whose least-significant 56 bits form an unsigned value less
than 7205759403792794.
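The arithmetic in this paragraph can be reproduced with a few lines of
Go; the expected outputs shown in the comments are an illustration
only:

```go
package main

import (
	"fmt"
	"math"
)

func main() {
	probability := 0.1
	// The nearest base-2 floating point value to 0.1, in hex notation.
	fmt.Printf("%x\n", probability) // expect 0x1.999999999999ap-04
	// Scaling by 2^56 and rounding yields the sampling threshold.
	fmt.Printf("%d\n", uint64(math.Round(probability*0x1p56))) // expect 7205759403792794
}
```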

## T-value encoding for adjusted counts

> **Review comment:** It will be good to define the mutation rules and propagation rules for t-value. E.g., something along the lines of:
>
> - if a participant is doing parent-based sampling, it should propagate the t-value from its parent.
> - if a participant is doing consistent probability sampling using its own sampling rate, it should mutate the t-value to set the new adjusted count / sampling rate.
>
> **Reply (author):** Not quite answering your question, but I've prototyped open-telemetry/opentelemetry-collector-contrib#22058 with a different sort of answer to your question.
>
> In this case referring to span data records, where there are multiple collectors in a pipeline. The first collector may sample at 1/10; when a subsequent collector samples at 1/20, the t-value of the selected spans will be updated. If the subsequent collector samples at 1/2, however, it is being less selective than the first collector, so it should not modify the t-value. That is to say that t-value adjusted counts should not fall and t-value probabilities should not rise.
>
> See the logic here: https://github.com/open-telemetry/opentelemetry-collector-contrib/pull/22058/files#diff-33f10350e2875f926dd2be6fc4c6bb88cfd8043cf6ac6d100295cf654771d90dR210-R219
>
> **Review comment:** I think there's a problem with such sampling behavior. Let's assume that the previous collector in the chain sampled all traces with errors with probability 1, and all remaining traces with 1/100. If the next collector in the chain is configured with 1/10, it will not touch the healthy traces, but will decimate the traces with errors. So any stratified sampling logic must be known and repeated by all collectors in the pipeline. Even if we prohibit stratified sampling, to set up a collector sampling probability in any meaningful way we have to know the minimum sampling probability of all the preceding collectors.


The example used sampling probability "0.1", which is a concisely
rounded value but not exactly a power of two. The use of decimal
floating point in this case conceals the fact that there is an integer
reciprocal, and when there is an integer reciprocal there are good
reasons to preserve it. Rather than encoding "0.1", it is appealing
to encode the adjusted count (i.e., "10") because it conveys exactly
the user's intention.

This suggests that the t-value encoding be designed to accept either
the sampling probability or the adjusted count, depending on how the
sampling probability was derived. Thus, the proposed t-value shall be
parsed as a floating point or integer number expressed in any
POSIX-supported printf format. Values in the range [0x1p-56, 0x1p+56]
are valid. Values in the range [0x1p-56, 1] are interpreted as a
sampling probability, while values in the range [1, 0x1p+56] are
interpreted as an adjusted count. Adjusted count values must be
integers, while sampling probability values can be arbitrary floating
point values.
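A hypothetical Go helper illustrating these interpretation rules (the
function name, return values, and error messages are assumptions made
for this sketch):

```go
package sampling

import (
	"fmt"
	"math"
	"strconv"
)

// parseTValue interprets a t-value string as either a sampling
// probability or an adjusted count. Go's strconv.ParseFloat accepts
// decimal, integer, and hexadecimal floating-point forms (e.g. "0.1",
// "10", "0x1p-20"), covering the representations a POSIX printf can
// produce.
func parseTValue(value string) (probability, adjustedCount float64, err error) {
	v, err := strconv.ParseFloat(value, 64)
	if err != nil {
		return 0, 0, fmt.Errorf("invalid t-value %q: %w", value, err)
	}
	switch {
	case v < 0x1p-56 || v > 0x1p56:
		return 0, 0, fmt.Errorf("t-value %q out of range", value)
	case v <= 1:
		// Values in [2^-56, 1] state a sampling probability.
		return v, 1 / v, nil
	default:
		// Values in (1, 2^56] state an adjusted count and must be integers.
		if v != math.Trunc(v) {
			return 0, 0, fmt.Errorf("adjusted count %q must be an integer", value)
		}
		return 1 / v, v, nil
	}
}
```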

Whether to encode the sampling probability or the adjusted count is a
choice. In both cases, the interpreted value translates into an exact
threshold, which determines the exact inclusion probability. From the
exact inclusion probability, we can determine the adjusted count to
use in a span-to-metrics pipeline. When the t-value is _stated_ as an
adjusted count (as opposed to a sampling probability), implementations
can use the integer value directly in a span-to-metrics pipeline.
Otherwise, implementations should use an adjusted count of 1 divided
by the sampling probability.

> **Review comment:** This is a minor thing, but perhaps a section describing how to encode powers of two sample probabilities would be helpful. Since I am not 100% familiar with the POSIX-supported printf format, I wonder what would be the most efficient way. For example, if the sampling probability is 2^(-20) (corresponding to p=20), we could write t=0x1p-20 or t=1048576, but would t=0x1p+20 or even t=0x1p20 be allowed?
>
> **Review comment (continued):** The reason I ask is that powers of two sampling probabilities are a natural discretization for me, since this is the only discretization that results in integer adjusted counts while the relative spacing is constant. Thus, I believe we will often see t-values that are powers of two. Therefore, it might be useful to define a more compact representation of the t-value if it is a power of two. Possibly it makes sense to keep the p-value?

## Where to store t-value in a Span and/or Log Record

As specified, t-value should be encoded in the TraceState field of the
span. Probabilistic head samplers should set t-value in propagated
contexts so that children using ParentBased samplers are correctly
counted.

Although prepared as a solution for Head and Intermediate Span
sampling, the t-value encoding scheme could also be used to convey
Logs sampling. This document proposes to add an optional `LogState`
string to the OTLP LogRecord, defined identically to the W3C
tracecontext `TraceState` field.

## Re-sampling with t-value

It is possible to re-sample spans that have already been sampled,
according to their t-value. This allows a processor to further reduce
the volume of data it is sending by lowering the sampling threshold.

> **Review comment:** I want to understand the expected user behavior here. Let's take this example:
>
> - A head sampler is used with a sampling rate of say 10%,
> - Let's say an intermediate sampler is also specified with a sampling rate of say 5%.
>
> Let's say 1000 spans are emitted. Let's say ~100 spans are sampled by the head sampler (based on the above 10% rate). Now when they come to the intermediate sampler, isn't the user expectation that only ~5 out of these ~100 spans are sampled (and the rest 95 discarded)? If my understanding is correct, I was thinking that the t-value would need to be updated for those non-discarded spans, however the below proposal seems to be based on a different user behavior.
>
> **Reply (author):** By my understanding, the head sampler with a rate of 10% will export 100 spans (of 1000), and then the 5% sampler will export 50 of those spans if it's a consistent sampler.
>
> My latest update to this OTEP includes an "s-value" discussed in last week's SIG which would allow non-consistent independent sampling to be reported, so the 5% simple probability sampler could attach s:20 or s:0.05.
>
> **Review comment:** For the record, I agree with @kalyanaj that 5% sampling rate for intermediate or tail sampling should mean that the volume of spans after sampling is about 20 times smaller than the input volume. Otherwise, assuming possible different sampling rates for the spans from the input stream, the user cannot really predict what the effect of 5% sampling would be.


In such a sampler, the incoming span will be inspected for an existing
t-value. If found, the incoming t-value is converted to a sampling
threshold and compared against the new threshold. There are two cases
(a sketch of this logic follows the list):

- If the Threshold calculated from the incoming t-value is less than
or equal to the current sampler's Threshold, the outgoing t-value is
copied from the incoming t-value. In this case, the span had
already been sampled with a less-than-or-equal probability compared
with the current sampler, so for consistency the span simply passes
through.
- If the Threshold calculated from the incoming t-value is larger than
the current sampler's Threshold, the current sampler's Threshold is
re-applied; if the TraceID random value is less than the current
Sampler's threshold, the span passes through with the current
sampler's t-value, otherwise the span is discarded.
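A minimal sketch of this re-sampling logic, expressed over the 56-bit
thresholds described earlier; the function and parameter names are
illustrative, not taken from the proposal:

```go
package sampling

// resample covers the two cases above for an intermediate sampler.
// randomness is the 56-bit value derived from the r-value or the
// rightmost 7 bytes of the TraceID.
func resample(randomness, incomingThreshold, samplerThreshold uint64) (keep bool, outgoingThreshold uint64) {
	if incomingThreshold <= samplerThreshold {
		// Already sampled with a smaller-or-equal probability: pass the
		// span through and keep the incoming t-value unchanged.
		return true, incomingThreshold
	}
	// The current sampler is more selective: re-apply its threshold.
	if randomness < samplerThreshold {
		return true, samplerThreshold // rewrite the t-value
	}
	return false, 0 // discard the span
}
```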

## S-value encoding for non-consistent adjusted counts

There are cases where sampling does not need to be consistent or is
intentionally not consistent. Existing samplers often apply a simple
probability test, for example. This specification recommends
introducing a new tracestate member `s-value` for conveying the
accumulation of adjusted count due to independent sampling stages.

Unlike resampling with `t-value`, independent non-consistent samplers
will multiply the effect of their sampling into `s-value`.
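As an illustration only, a non-consistent stage might fold its
probability into the s-value roughly as follows (the helper and its
signature are assumptions made for this sketch):

```go
package sampling

import "math/rand"

// independentSample keeps a span with the given stage probability and
// folds that probability into any s-value already present.
func independentSample(incomingSValue float64, hasIncoming bool, stageProbability float64) (keep bool, outgoingSValue float64) {
	if rand.Float64() >= stageProbability {
		return false, 0 // discarded; no s-value to report
	}
	if !hasIncoming {
		return true, stageProbability
	}
	// Accumulate: e.g., 10% after 10% yields s:0.01.
	return true, incomingSValue * stageProbability
}
```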

## Examples

> **Review comment:** It would be good to add two more examples that show how consistent probability sampling can be achieved across multiple participants.
>
> Example 1:
>
> - Upstream participant samples at 10% probability (ot=t:0.1 is sent as part of tracestate)
> - Downstream participant does parent-based sampling. It uses the sampled flag to make the decision, gets the t-value from the parent context and emits it as part of its context (ot=t:0.1 is sent as part of tracestate to further downstream participants)
>
> Example 2:
>
> - Upstream participant samples at 10% probability (ot=t:0.1 is sent as part of tracestate)
> - Downstream participant samples at 5% probability - it calculates a threshold based on its sampling rate and compares it with the traceID's last 7 bytes to make the sampling decision (ot=t:20 is sent as part of tracestate).
> - Downstream participant does parent-based sampling (uses the sampled flag to make the decision, gets the t-value from the parent context and emits it as part of its context)
>
> **Reply (author):** These examples sound good to me! Will do.


### 90% consistent intermediate span sampling

A span that has been sampled at 90% by an intermediate processor will
have `ot=t:0.9` added to its TraceState field in the Span record. The
sampling threshold is `0.9 * 0x1p+56`.

### 90% head consistent sampling

A span that has been sampled at 90% by a head sampler will add
`ot=t:0.9` to the TraceState context propagated to its children and
record the same in its Span record. The sampling threshold is `0.9 *
0x1p+56`.

### 1-in-3 consistent sampling

The tracestate value `ot=t:3` corresponds with 1-in-3 sampling. The
sampling threshold is `1/3 * 0x1p+56`.

### 30% simple probability sampling

The tracestate value `ot=s:0.3` corresponds with 30% sampling by one
or more sampling stages. This would be the tracestate recorded by
`probabilisticsampler` when using a `HashSeed` configuration instead
of the consistent approach.

### 10% probability sampling twice

The tracestate value `ot=s:0.01` corresponds with 10% sampling by one
stage and then 10% sampling by a second stage.

> **Review comment:** Maybe expand this to show how the tracestate would be modified at each stage?
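For illustration, following the multiplication rule above and assuming
no other tracestate members: after the first 10% stage the span would
carry `ot=s:0.1`, and the second 10% stage, finding that value, would
multiply in its own probability and rewrite it as `ot=s:0.01`.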

## Trade-offs and mitigations

Support for encoding t-value as either a probability or an adjusted
count is meant to give the user control over loss of precision, while
keeping the value human-readable.

Floating point numbers can be encoded exactly to avoid ambiguity, for
example, using hexadecimal floating point representation. Likewise,
adjusted counts can be encoded exactly as integers to convey the
user's intended sampling probability without floating point conversion
loss.

## Prior art and alternatives

> **Review comment:** Towards the end, we may want to call out that one benefit of the r-value based randomness was that it could be used to get consistent sampling across multiple traces (e.g., all traces started within a time window by a participant) - it would be good to call out that it should be possible to support it in the future as a complement to the current proposal.
>
> **Review comment:** If we decide to use arbitrary sampling probabilities, we should not use the current definition of the r-value. It makes no sense to have different discretizations for the r-value (powers of two) and for the t-value (56-bit values). Therefore, the r-value should rather be a 14-digit hex value that overrides the random bits of the trace ID, if present. This way we could also handle traces where the random flag is not set in the trace context. If the flag is not set and there is also no r-value, we could require consistent samplers to set the r-value by generating a 56-bit random value.


The existing p-value, r-value mechanism could only achieve
non-power-of-two sampling using interpolation between powers of two,
which was only possible at the head. That specification could not be
used for Intermediate Span sampling using non-power-of-two sampling
probabilities.

There is a case to be made that users who apply simple probability
sampling with hard-coded probabilities are not asking for what they
really want, which is to apply a rate limit in their sampler. It is
true that rate-limited sampling can be achieved while confined to
power-of-two sampling probabilities, but we feel this does not
diminish the case for simply supporting non-power-of-two
probabilities.