# Add probability sampler details #331

Changes from 4 commits
@@ -122,8 +122,67 @@ These are the default samplers implemented in the OpenTelemetry SDK:

#### Probability Sampler algorithm

-TODO: Add details about how the probability sampler is implemented as a function
-of the `TraceID`.

The `ProbabilitySampler` makes a deterministic sampling decision for each span
based on the span's trace ID. It samples a fixed percentage of all spans
following a user-supplied sampling rate.

According to the [W3C Trace Context
spec](https://www.w3.org/TR/trace-context/#trace-id), vendor-supplied trace IDs
may include both random and non-random components. To avoid sampling based on
the non-random component, the sampler should consider only the leftmost portion
of the trace ID. Implementations MAY allow the user to configure the number of
bits of the trace ID that the sampler considers.

The `ProbabilitySampler` should generally only sample traces with trace IDs
less than a certain value. This value can be computed from the sampling rate,
and implementations may choose to cache it. The sampling decision should follow
the formula:

```
to_sample = traceID >> (128 - precision) < round(rate * pow(2, precision))
```

Where:

- `rate` is the sampling rate, in `[0.0, 1.0]`
- `precision` is the number of bits of the trace ID to consider, in `[1, 128]`
- `traceID` is the integer trace ID of the span to be created, assuming
  big-endian byte order
- `round(float f)` is a function that rounds `f` to the nearest integer

Review comments on the `traceID` definition:

> Would it be possible to describe an algorithm that operates on the trace ID
> as bytes?

> Because we shouldn't describe the trace ID as an integer here when it appears
> in the API as a byte array, or because we want to avoid the conversion from
> bytes to int? If it's just that the description is wrong, then this is easy
> to reword. If we're trying to avoid the conversion, then we could convert the
> bound to a byte array (once, at sampler creation time) and do a bytewise
> comparison with the leftmost bytes of the trace ID to make the sampling
> decision. It's not clear that this would be much of an improvement.

> I described keeping the trace ID a byte array in 3b3d321, but this makes the
> spec a bit more verbose and difficult to read. If this isn't what you were
> looking for, I'm happy to revert it and find another way to say this.

Review comments on the big-endian assumption:

> Weirdly, the W3C spec says "vendors MUST keep the random part of trace-id on
> the left side" without taking a stance on trace ID endianness, so the random
> part may be high- or low-order bytes. We may just want to declare trace ID to
> be a big-endian unsigned int somewhere and never worry about this again.

> The W3C spec is not really "weird", it just follows common algorithms for
> UUIDs (https://en.wikipedia.org/wiki/Universally_unique_identifier). We
> should ensure that our specified algorithm works with those.

> The problem of endianness is only introduced in this PR because it attempts
> to interpret a byte sequence as a number. The W3C spec does not make such
> assumptions, so the question of endianness does not arise there.

> @Oberon00 this actually doesn't work with v1 UUIDs, since the leftmost
> portion is set from the timestamp, and this assumes that it's random.

> Maybe that assumption is too much. The W3C spec doesn't even recommend using
> random values: https://www.w3.org/TR/trace-context/#trace-id.

> Not the entire value, just the part we use to make the sampling decision.

> The sentence you quoted only applies if the tracing system operates with IDs
> that are shorter than 128/64 bits. A tracing system may use the whole 128/64
> bits to fill it with deterministic data that is unique (e.g. a globally
> unique process identifier plus a sequential ID for the created trace) but
> not random.

> Whoops, not exactly, it applies to the quoted sentence. So it merely says:
> "If you have a random part in your ID, keep it left."

Note that the effective sampling rate is the number closest to `rate` that can
be expressed as a multiple of `2^-precision`. As a consequence, it's not
possible to set arbitrarily low sampling rates, even on platforms that support
arbitrary-precision arithmetic.

A `ProbabilitySampler` with rate `0.0` MUST NOT choose to sample any traces,
even if the leftmost precision-many bits of the trace ID are all `0`.
Similarly, a `ProbabilitySampler` with rate `1.0` MUST choose to sample all
traces, even if the leftmost precision-many bits of the trace ID are all `1`.
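
As a rough illustration of the algorithm described above, here is a minimal
Python sketch. The class and method names are illustrative, not part of this
spec, and a real implementation would validate its inputs more carefully:

```python
class ProbabilitySampler:
    def __init__(self, rate: float, precision: int = 16):
        assert 0.0 <= rate <= 1.0
        assert 1 <= precision <= 128
        self.rate = rate
        self.precision = precision
        # Cache the bound: round(rate * 2^precision).
        self.bound = round(rate * (1 << precision))

    def should_sample(self, trace_id: bytes) -> bool:
        # Rate 0.0 must never sample and rate 1.0 must always sample,
        # regardless of the trace ID bits.
        if self.rate == 0.0:
            return False
        if self.rate == 1.0:
            return True
        # Interpret the 16-byte trace ID as a big-endian integer and
        # keep only its leftmost `precision` bits.
        x = int.from_bytes(trace_id, "big") >> (128 - self.precision)
        return x < self.bound
```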

**Example:**

Consider a `ProbabilitySampler` with rate `.25` and 16-bit precision.

First, find the lowest truncated trace ID that will not be sampled. This number
represents the 25th percentile of the range of possible values:

```
.25 * 2^16 = 0x4000
```

Make a sampling decision for a given trace by comparing this number to the
leftmost 16 bits of its 128-bit trace ID:

```
x = traceID >> (128 - 16)
to_sample = x < 0x4000
```

This sampler should sample traces with truncated trace ID in the range
`[0x0000, 0x4000)`. Assuming `x` is uniformly distributed over `[0x0000,
0xffff]`, this should account for 25% of all traces.
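
The same numbers fall out of the illustrative `ProbabilitySampler` sketch
above, which can serve as a quick sanity check:

```python
sampler = ProbabilitySampler(rate=0.25, precision=16)
assert sampler.bound == 0x4000

# Leftmost 16 bits are 0x3fff, which is below the bound: sampled.
assert sampler.should_sample(bytes.fromhex("3fff" + "00" * 14))
# Leftmost 16 bits are 0x4000, which is not below the bound: not sampled.
assert not sampler.should_sample(bytes.fromhex("4000" + "00" * 14))
```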

## Tracer Creation

> Basing the algorithm on trace id should not be a requirement. It's an
> optimization that is possible in some languages, but could be pretty
> expensive in others, considering that trace id is just a blob of bytes that
> needs to be interpreted as a number.

> I think this is not a performance optimization (at least not primarily).
> Using the trace ID as the only input (and the same algorithm) ensures that
> all spans of the trace have the same sampling decision, which is required to
> avoid broken traces.

> I would assume that the root span of a trace might use a probability
> sampler, and all the descendants would use the sampled bit from context, so
> there's only one decision, and I agree with @yurishkuro.

> I'd also expect the sampler to respect the sampled bit by default, but we
> say here that it should be possible to configure the sampler to apply to all
> spans, or any span with a remote parent.
>
> L119 actually says the default behavior should be to make a new sampling
> decision for spans with remote parents, but I don't know the motivation for
> this, and think there's a good argument for restricting this to root spans
> by default.
>
> One benefit of this sampling algorithm is that lower-probability samplers
> always sample a subset of the traces of higher-probability samplers. Because
> of this, each service could set its own sampling rate and we'd still get
> complete traces for a percentage of requests equal to the lowest sampling
> rate. For example, if we have three services with sampling rates `.3`, `.2`,
> and `.1` respectively, we'd get complete traces for `.1` of requests. With
> independent random samplers we'd get traces for `.006` of requests.
>
> Even if we didn't want to make it possible to change the sampling rate along
> the call chain, there are some clear benefits to making the sampling
> decision deterministic. E.g. checking that telemetry is working.
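
The subset property the comment describes is easy to demonstrate with a short
simulation; this hypothetical snippet is not from the PR, and assumes the
threshold decision described in the diff:

```python
import os

def sampled(trace_id: bytes, rate: float, precision: int = 16) -> bool:
    # Threshold decision on the leftmost `precision` bits of the trace ID.
    x = int.from_bytes(trace_id, "big") >> (128 - precision)
    return x < round(rate * (1 << precision))

trials = 100_000
complete = 0
for _ in range(trials):
    trace_id = os.urandom(16)
    # A trace is "complete" only if every service's sampler keeps it.
    if all(sampled(trace_id, r) for r in (0.3, 0.2, 0.1)):
        complete += 1

# Expect roughly 10% complete traces, versus ~0.6% (.3 * .2 * .1) if each
# service made an independent random decision.
print(complete / trials)
```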

> These are good arguments. 👍

> As I understand it, @yurishkuro's complaint is that we have to convert the
> byte array into an int to do the comparison. FWIW an implementation could
> compute the bound as a byte array instead and do a left-to-right bytewise
> comparison (see this gist for an example in python), but it's not clear that
> this would be faster than converting to an int, especially if the language
> boxes the bytes to do the comparison.
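
The referenced gist isn't reproduced here, but the bytewise variant might look
roughly like this sketch, which assumes a precision that is a whole number of
bytes to keep the comparison simple:

```python
def make_bound(rate: float, precision_bytes: int = 2) -> bytes:
    # Precompute the bound once, at sampler creation time. Rate 1.0 would
    # overflow `precision_bytes` bytes and needs special-casing, as does 0.0.
    bound = round(rate * (1 << (8 * precision_bytes)))
    return bound.to_bytes(precision_bytes, "big")

def should_sample(trace_id: bytes, bound: bytes) -> bool:
    # Python compares bytes lexicographically, which matches big-endian
    # numeric order for sequences of equal length.
    return trace_id[:len(bound)] < bound

bound = make_bound(0.25)  # b'\x40\x00'
print(should_sample(bytes.fromhex("3fff" + "00" * 14), bound))  # True
```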

> `DeterministicProbabilitySampler` is a mouthful. I'd prefer
> `PercentageSampler` or `RateSampler`, but I think `ProbabilitySampler` is a
> fine description, and has the benefit of having the same name in OpenCensus.
>
> I can think of two obvious risks for people doing statistical inference on
> sampled traces: one is that trace IDs aren't even nominally randomly
> distributed. For example, if the trace ID includes a timestamp fragment, the
> sampled request rate might look cyclical or bursty even if the actual
> request rate is constant. The other is that the service itself handles
> requests differently depending on the trace ID. But these are bugs in the
> service or instrumentation, not problems of insufficient randomness.
>
> AFAICT we don't actually need strong statistical randomness here. We just
> need the "random" part of the trace ID to be roughly uniformly distributed
> over all possible values, and not to be correlated with the stats we're
> monitoring (latency, error rate, etc.). The fact that PRNGs are more P than
> R doesn't seem unusually bad in this context. And in any case a
> `RandomProbabilitySampler` would suffer from the same problems.

> How about if the sampler does not depend on the trace ID at all? There is no
> need to consistently select the same trace for sampling, as the sampling
> decision is mostly made when the root span is created.
>
> The sampler can simply increment an atomic counter by 1 every time a root
> span is created. If the sampling rate is 0.01, then a trace is sampled if
> the count is a multiple of 100. The counter can be initialized to a random
> value once, so that different instances/processes are not synchronized. This
> sounds more like a RateSampler, though.
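
A rough sketch of that counter-based idea (hypothetical code, not from the PR;
a lock stands in for an atomic counter):

```python
import random
import threading

class CountingSampler:
    """Sample every Nth root span, starting from a random phase."""

    def __init__(self, rate: float):
        self.modulus = round(1 / rate)  # e.g. rate 0.01 -> every 100th trace
        self.count = random.randrange(self.modulus)  # desynchronize processes
        self.lock = threading.Lock()

    def should_sample(self) -> bool:
        with self.lock:
            self.count += 1
            return self.count % self.modulus == 0
```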

> I addressed this above in #331 (comment). We want to be able to make new
> sampling decisions for remote parent spans, which is why it's helpful to
> sample consistently.
>
> Now that I read your description, `RateSampler` is clearly a better name for
> that than the behavior I'm describing.

> `RateSampler` would imply time dependency, like rate limiting; this is more
> of a `RatioSampler`.