Tail sampling processor "count_traces_sampled" metric is inaccurate #25882

Closed
lbogdan opened this issue Aug 18, 2023 · 7 comments · Fixed by #26724
Labels
help wanted · priority:p2 · processor/tailsampling

Comments

lbogdan commented Aug 18, 2023

Component(s)

processor/tailsampling

Describe the issue you're reporting

It looks like the "count_traces_sampled" metric of the tail sampling processor is inaccurate, or at least it's not calculated the way I would expect.

Specifically, given two policies named "policy1" and "policy2", the metric values for {policy="policy1",sampled="false"} and {policy="policy2",sampled="false"} are always equal, and likewise for {policy="policy1",sampled="true"} and {policy="policy2",sampled="true"}, although I would expect them to differ between policies.

Indeed, looking at the code, they are all calculated based on finalDecision, and not each policy's decision:

```
for _, p := range tsp.policies {
	switch finalDecision {
	case sampling.Sampled:
		// any single policy that decides to sample will cause the decision to be sampled
		// the nextConsumer will get the context from the first matching policy
		if matchingPolicy == nil {
			matchingPolicy = p
		}
		_ = stats.RecordWithTags(
			p.ctx,
			[]tag.Mutator{tag.Upsert(tagSampledKey, "true")},
			statCountTracesSampled.M(int64(1)),
		)
		metrics.decisionSampled++
	case sampling.NotSampled:
		_ = stats.RecordWithTags(
			p.ctx,
			[]tag.Mutator{tag.Upsert(tagSampledKey, "false")},
			statCountTracesSampled.M(int64(1)),
		)
		metrics.decisionNotSampled++
	}
}
```

The reason I think this is a bug, and not intended behavior, is that per-policy decisions are actually stored in trace.Decisions[i] during evaluation, but are then never used anywhere (that I could find).

So I think

```
	for _, p := range tsp.policies {
		switch finalDecision {
```

should actually be

```
	for i, p := range tsp.policies {
		switch trace.Decisions[i] {
```
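
For clarity, here is a minimal sketch of what the whole corrected loop might look like. This is not the actual patch from #26724; it simply reuses the identifiers from the snippet above and keeps the matchingPolicy and decision-count bookkeeping unchanged, which the real fix may handle differently:

```
// Sketch only: switch on each policy's own decision instead of finalDecision.
for i, p := range tsp.policies {
	switch trace.Decisions[i] {
	case sampling.Sampled:
		// remember the first policy that decided to sample, so the
		// nextConsumer can still be called with that policy's context
		if matchingPolicy == nil {
			matchingPolicy = p
		}
		_ = stats.RecordWithTags(
			p.ctx,
			[]tag.Mutator{tag.Upsert(tagSampledKey, "true")},
			statCountTracesSampled.M(int64(1)),
		)
		metrics.decisionSampled++
	case sampling.NotSampled:
		_ = stats.RecordWithTags(
			p.ctx,
			[]tag.Mutator{tag.Upsert(tagSampledKey, "false")},
			statCountTracesSampled.M(int64(1)),
		)
		metrics.decisionNotSampled++
	}
}
```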
lbogdan added the needs triage label Aug 18, 2023
github-actions bot added the processor/tailsampling label Aug 18, 2023
@github-actions (Contributor)

Pinging code owners:

See Adding Labels via Comments if you do not have permissions to add labels yourself.

@jpkrohling (Member)

Sounds reasonable. Would you be open to sending in a PR?

bryan-aguilar added the help wanted and priority:p2 labels and removed the needs triage label Aug 23, 2023

lbogdan commented Aug 24, 2023

Sure, I'll get to it this weekend!


jmsnll commented Sep 13, 2023

How is the count_traces_sampled metric meant to be interpreted for each policy? Would you expect each sampling decision to be recorded only once, by the policy that resulted in the trace being sampled? Or once for each policy that would have resulted in the trace being sampled?

For example, with two probabilistic policies like so, assuming a perfect distribution:

```
processors:
  tail_sampling:
    policies:
      - name: 50-percent-probabilistic
        type: probabilistic
        probabilistic:
          sampling_percentage: 50
      - name: 20-percent-probabilistic
        type: probabilistic
        probabilistic:
          sampling_percentage: 20
```

If 100 traces passed through, would you expect the metric counts to read like this, with each decision counted only once?

```
otelcol_processor_tail_sampling_count_traces_sampled{policy="50-percent-probabilistic",sampled="false"} 50
otelcol_processor_tail_sampling_count_traces_sampled{policy="50-percent-probabilistic",sampled="true"} 50
otelcol_processor_tail_sampling_count_traces_sampled{policy="20-percent-probabilistic",sampled="false"} 40
otelcol_processor_tail_sampling_count_traces_sampled{policy="20-percent-probabilistic",sampled="true"} 10
```

Or multiple times?

```
otelcol_processor_tail_sampling_count_traces_sampled{policy="50-percent-probabilistic",sampled="false"} 50
otelcol_processor_tail_sampling_count_traces_sampled{policy="50-percent-probabilistic",sampled="true"} 50
otelcol_processor_tail_sampling_count_traces_sampled{policy="20-percent-probabilistic",sampled="false"} 80
otelcol_processor_tail_sampling_count_traces_sampled{policy="20-percent-probabilistic",sampled="true"} 20
```

lbogdan commented Sep 14, 2023

With my suggested code change, it would be the latter: every decision would increment the metric for every policy, either the sampled="true" or the sampled="false" series, depending on that policy's outcome.

Also, thinking about it, a total metric (without a policy tag, or with policy="final-decision" or similar) would be helpful too, since after this change the total could no longer be inferred from the per-policy metrics; currently, all policy metrics equal this total.
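
For illustration, a minimal sketch of that idea: record the final decision once per trace under a separate measure, alongside the per-policy series. statGlobalCountTracesSampled and tsp.ctx are hypothetical names for this sketch, not existing identifiers:

```
// Sketch only: record the final decision once per trace, with no policy tag,
// so the overall total remains available once the per-policy metrics start
// reflecting each policy's own decision.
// statGlobalCountTracesSampled and tsp.ctx are assumed names for this sketch.
switch finalDecision {
case sampling.Sampled:
	_ = stats.RecordWithTags(
		tsp.ctx,
		[]tag.Mutator{tag.Upsert(tagSampledKey, "true")},
		statGlobalCountTracesSampled.M(int64(1)),
	)
case sampling.NotSampled:
	_ = stats.RecordWithTags(
		tsp.ctx,
		[]tag.Mutator{tag.Upsert(tagSampledKey, "false")},
		statGlobalCountTracesSampled.M(int64(1)),
	)
}
```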

@edwintye (Contributor)

While we (@jmsnll and I) were trying to work out a solution internally (to see the various outputs of the metric), we realized a potential bug: the tail sampler does not check for uniqueness of policy names. More concretely, if we use a config like the one below, where someone accidentally uses an identical name for two different policies,

```
receivers:
  otlp:
    protocols:
      grpc:
processors:
  tail_sampling:
    decision_wait: 5s
    policies:
      - name: status-code-policy
        type: status_code
        status_code:
          status_codes: [ERROR]
      - name: status-code-policy
        type: probabilistic
        probabilistic:
          sampling_percentage: 50
exporters:
  logging:
service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [tail_sampling]
      exporters: [logging]
```

then firing off 100 traces with STATUS_CODE_UNSET results in

```
otelcol_processor_tail_sampling_count_traces_sampled{policy="status-code-policy",sampled="true",service_name="otelcontribcol",service_version="0.84.0-dev"} 49
```

which is correct as per the config, but misleading to anyone who doesn't look at the config. As the label is taken from the policy context in stats.RecordWithTags and there is no check during the initial unmarshal, we can in theory give every single policy the same name, which results in the metric being incorrect. This issue with non-unique policy names remains regardless of whether this issue is fixed (and may have wider impact).

One could argue that no engineer should make the mistake of using non-unique policy names. However, if a new (silent) placeholder value such as final-decision is introduced for the policy tag, it may clash with a config that uses that exact policy name. I am not sure how likely this scenario is, but I thought it worth flagging as we have already made that mistake ourselves.
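
A minimal sketch of the kind of uniqueness check that could run at config-validation time; PolicyCfg and validatePolicyNames are hypothetical names for this illustration, not the processor's actual types:

```
package main

import "fmt"

// PolicyCfg is a hypothetical stand-in for the processor's policy config type.
type PolicyCfg struct {
	Name string
}

// validatePolicyNames returns an error if two policies share the same name.
func validatePolicyNames(policies []PolicyCfg) error {
	seen := make(map[string]struct{}, len(policies))
	for _, p := range policies {
		if _, ok := seen[p.Name]; ok {
			return fmt.Errorf("duplicate policy name %q", p.Name)
		}
		seen[p.Name] = struct{}{}
	}
	return nil
}

func main() {
	policies := []PolicyCfg{{Name: "status-code-policy"}, {Name: "status-code-policy"}}
	fmt.Println(validatePolicyNames(policies)) // duplicate policy name "status-code-policy"
}
```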

@jpkrohling (Member)

> bug in that the tail sampler does not check for uniqueness of policy names

😱, I'll send a quick PR to get that fixed. Thanks for pointing this out.

> One could argue that no engineer should make the mistake of using non-unique policy names.

That should not be an assumption.

jpkrohling pushed a commit that referenced this issue Sep 27, 2023
…ampled` metric (#26724)

**Description:** 

Previously, the calculation of `count_traces_sampled` was incorrect,
resulting in all per-policy metric counts being the same.

```
count_traces_sampled{policy="policy-a", sampled="false"} 50
count_traces_sampled{policy="policy-a", sampled="true"} 50
count_traces_sampled{policy="policy-b", sampled="false"} 50
count_traces_sampled{policy="policy-b", sampled="true"} 50
```

This issue stemmed from the metrics being incremented based on the final
decision of trace sampling, rather than being attributed to the policies
causing the sampling.

With the fix implemented, the total sampling count can no longer be
inferred from the metrics alone. To address this, a new metric,
`global_count_traces_sampled`, was introduced to accurately capture the
global count of sampled traces.

```
count_traces_sampled{policy="policy-a", sampled="false"} 50
count_traces_sampled{policy="policy-a", sampled="true"} 50
count_traces_sampled{policy="policy-b", sampled="false"} 24
count_traces_sampled{policy="policy-b", sampled="true"} 76
global_count_traces_sampled{sampled="false"} 24
global_count_traces_sampled{sampled="true"} 76
```

Reusing the `count_traces_sampled` policy metric for this purpose would
have meant either:

1. Leaving the policy field empty, which didn't feel very explicit.
2. Setting a global placeholder policy name (such as `final-decision`),
which could then potentially collide with existing user-configured
policy names without validation.

**Link to tracking Issue:** Fixes #25882

**Testing:** 

Tested with various combinations of `probabilistic`, `latency`,
`status_code` and attribute policies (including combinations of `and`
and `composite` policies).

No unit tests were added; I'm open to suggestions for adding tests for
metrics, but I couldn't find any examples of it being done elsewhere.

**Documentation:** No documentation changes.
jmsnll added a commit to jmsnll/opentelemetry-collector-contrib that referenced this issue Nov 12, 2023
…ampled` metric (open-telemetry#26724)
