Description
Component(s)
processor/tailsampling
What happened?
Description
Within the algorithm for processing of subpolicies within the composite policy, the following may happen:
- the current subpolicy being processed has set the current trace as
Sampled
- we check if allowing this sampling would put this subpolicy over its specific spans/sec limit or the global limit
- if either check evaluates to true, we return
NotSampled
and exit the function, thereby never checking any of the other subpolicies.
My personal opinion is that this is a bug, but I can see how this might be construed as a feature.
Steps to Reproduce
- start up an otelcol with the provided config
- send a bunch of traces in (for example, by using xk6-client-tracing)
Expected Result
The always sample subpolicy (ASSP) receives the traces first, and any unsampled traces then move on to the probabilistic subpolicy (PSP).
If a ramping increase of spans/sec is used, we will at first see a 100% sampling rate for this Composite policy followed by an inflection point where the ASSP saturates at 100 spans/sec and 80 % of spans above this point are not sampled. Similarly, when we look at the exporter's rate/sec rate, we can see that there is an inflection at around this time, where the slope of the graph after the inflection is 20% of the slop before.
Actual Result
ASSP always receives all traces and PSP never receives any traces. The spans per second flatlines at the maximum number of spans per second to be allocated to ASSP
Collector version
v0.116.0
Environment information
Environment
OS: Ubuntu 22.04
Compiler(if manually compiled): ocb@v0.116.0 go1.23.4
OpenTelemetry Collector configuration
receivers:
otlp:
protocols:
grpc:
endpoint: "localhost:14317"
processors:
tail_sampling:
policies:
- name: composite-policy
type: composite
composite:
max_total_spans_per_second: 10_000
policy_order:
- always-sample-policy
- probabilistic-policy
composite_sub_policy:
- name: always-sample-policy
type: always_sample
- name: probabilistic-policy
type: probabilistic
probabilistic:
sampling_percentage: 20
rate_allocation:
- policy: always-sample-policy
percent: 5
- policy: probabilistic-policy
percent: 95
exporters:
otlp:
endpoint: "localhost:4317"
tls:
insecure: true
debug:
verbosity: detailed
service:
telemetry:
logs:
level: DEBUG
metrics:
level: detailed
readers:
- pull:
exporter:
prometheus:
host: "0.0.0.0"
port: 8888
pipelines:
traces:
receivers: [otlp]
processors: [tail_sampling]
exporters: [otlp]
Log output
2024-12-24T17:29:51.319-0500 debug tailsamplingprocessor@v0.116.0/processor.go:271 Sampling Policy Evaluation ticked {"kind": "processor", "name": "tail_sampling", "pipeline": "traces"}
2024-12-24T17:29:51.320-0500 debug sampling/always_sample.go:29 Evaluating spans in always-sample filter {"kind": "processor", "name": "tail_sampling", "pipeline": "traces", "policy": "always_sample"}
2024-12-24T17:29:51.320-0500 debug sampling/always_sample.go:29 Evaluating spans in always-sample filter {"kind": "processor", "name": "tail_sampling", "pipeline": "traces", "policy": "always_sample"}
2024-12-24T17:29:51.320-0500 debug sampling/always_sample.go:29 Evaluating spans in always-sample filter {"kind": "processor", "name": "tail_sampling", "pipeline": "traces", "policy": "always_sample"}
2024-12-24T17:29:51.320-0500 debug sampling/always_sample.go:29 Evaluating spans in always-sample filter {"kind": "processor", "name": "tail_sampling", "pipeline": "traces", "policy": "always_sample"}
2024-12-24T17:29:51.320-0500 debug sampling/always_sample.go:29 Evaluating spans in always-sample filter {"kind": "processor", "name": "tail_sampling", "pipeline": "traces", "policy": "always_sample"}
2024-12-24T17:29:51.320-0500 debug sampling/always_sample.go:29 Evaluating spans in always-sample filter {"kind": "processor", "name": "tail_sampling", "pipeline": "traces", "policy": "always_sample"}
2024-12-24T17:29:51.320-0500 debug sampling/always_sample.go:29 Evaluating spans in always-sample filter {"kind": "processor", "name": "tail_sampling", "pipeline": "traces", "policy": "always_sample"}
2024-12-24T17:29:51.320-0500 debug sampling/always_sample.go:29 Evaluating spans in always-sample filter {"kind": "processor", "name": "tail_sampling", "pipeline": "traces", "policy": "always_sample"}
2024-12-24T17:29:51.320-0500 debug sampling/always_sample.go:29 Evaluating spans in always-sample filter {"kind": "processor", "name": "tail_sampling", "pipeline": "traces", "policy": "always_sample"}
2024-12-24T17:29:51.320-0500 debug sampling/always_sample.go:29 Evaluating spans in always-sample filter {"kind": "processor", "name": "tail_sampling", "pipeline": "traces", "policy": "always_sample"}
### Additional context
We need two changes
## Return On Global Max Right Away
We should move the check for
spansInSecondIfSampled <= c.maxTotalSPS
Out of the loop. There's no need to bother with processing any of the subpolicies if the trace will put the global system over global limit regardless of which subpolicy samples it.
I actually think this inequality might be a mistake.
`spansInSecondIfSampled` [only takes into account the number of sampled spans *for this subpolicy* and the number of spans in the current trace](https://github.com/open-telemetry/opentelemetry-collector-contrib/blob/v0.116.0/processor/tailsamplingprocessor/internal/sampling/composite.go#L106). It doesn't make sense to compare this to the `maxTotalSPS` of the full `Composite` policy.
## Modify Logic for `Sampled`/`NotSampled` for Subpolicies
[This logic](https://github.com/open-telemetry/opentelemetry-collector-contrib/blob/v0.116.0/processor/tailsamplingprocessor/internal/sampling/composite.go#L116-L120) is counter-intuitive and -- I would argue -- restricts the utility of the `Composite` policy. It's saying that if any one of the subpolicies is over its limit then we should give up on this trace.
We should get rid of this `return NotSampled, nil` and allow the remaining subpolicies to process the trace. We still have a `return NotSampled, nil` at the end of the function in case no policy wants to sample the trace or no policy is able to sample the trace based on its SPS limits.
Activity