metric otelcol_processor_tail_sampling_count_traces_sampled showing wrong values based on policy, sampled labels #27567

Closed
0x006EA1E5 opened this issue Oct 9, 2023 · 7 comments
Labels
bug Something isn't working processor/tailsampling Tail sampling processor

Comments

@0x006EA1E5

0x006EA1E5 commented Oct 9, 2023

Component(s)

processor/tailsampling

What happened?

Description

For the tail sampling processor's metric otelcol_processor_tail_sampling_count_traces_sampled, we get the same value for every policy. I believe sampled = true or false should be counted separately for each policy, indicating whether that policy matched a given trace.

Steps to Reproduce

Create a collector config file with the tail_sampling processor and two distinct policies, for example one for error traces and one for high-latency traces. Also enable the Prometheus scrape endpoint. Something like:

processors:
  tail_sampling:
    decision_wait: 10s
    expected_new_traces_per_sec: 100
    policies:
      [
        {
          name: sample-all-error-traces,
          type: status_code,
          status_code: {status_codes: [ERROR]}
        },
        {
          name: sample-all-high-latency,
          type: latency,
          latency: { threshold_ms: 3000 }
        }
      ]

service:
  telemetry:
    metrics:
      level: detailed
      address: "0.0.0.0:8888"

Send a low-latency error trace (a sketch of one way to send the test traces follows at the end of this section).
Check the metrics endpoint, e.g. curl localhost:8888/metrics | grep otelcol_processor_tail_sampling_count_traces_sampled. You may have to wait a while before anything appears (the decision_wait time?).
Eventually you should see two lines:

otelcol_processor_tail_sampling_count_traces_sampled{policy="sample-all-error-traces",sampled="true",service_instance_id="b8487c8c-4c5c-4334-9228-ee8f0389f8c0",service_name="otelcol-ocado",service_version="0.86.0"} 1
otelcol_processor_tail_sampling_count_traces_sampled{policy="sample-all-high-latency",sampled="true",service_instance_id="b8487c8c-4c5c-4334-9228-ee8f0389f8c0",service_name="otelcol-ocado",service_version="0.86.0"} 1

Note that for both policies sampled="true" and the count is 1, even though only one policy should have matched.

Now send a "high latency" trace that should match the other policy.
When the metrics update, both policies (with the label sampled="true") will read 2.
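
A minimal sketch of one way to send the two test traces, assuming the Python OpenTelemetry SDK with the OTLP gRPC exporter is installed and the collector's OTLP gRPC receiver is listening on localhost:4317 (the span names and sleep duration are arbitrary):

# Sketch only: one low-latency ERROR span and one slow (> 3000 ms) OK span,
# so that each trace should match exactly one of the two policies above.
import time

from opentelemetry import trace
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.trace import Status, StatusCode

provider = TracerProvider()
provider.add_span_processor(
    BatchSpanProcessor(OTLPSpanExporter(endpoint="localhost:4317", insecure=True))
)
trace.set_tracer_provider(provider)
tracer = trace.get_tracer("tail-sampling-repro")

# Trace 1: fast, but with ERROR status -> should match only sample-all-error-traces.
with tracer.start_as_current_span("low-latency-error") as span:
    span.set_status(Status(StatusCode.ERROR))

# Trace 2: no error, but longer than the 3000 ms threshold -> should match only sample-all-high-latency.
with tracer.start_as_current_span("high-latency-ok"):
    time.sleep(3.5)

provider.shutdown()  # flush the batch processor before exiting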

Expected Result

In the above scenario, when we have sent only the single error trace, the metric for policy="sample-all-error-traces",sampled="true" should be 1, and policy="sample-all-high-latency",sampled="false" (note sampled is false) should be 1.

Then, when we send the high-latency trace, we should see four lines, two for each policy: one with sampled="true" and one with sampled="false".
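
For illustration, assuming the high-latency trace is not also an error (the resource labels from the output above are elided here), the scrape would then look something like:

otelcol_processor_tail_sampling_count_traces_sampled{policy="sample-all-error-traces",sampled="true",...} 1
otelcol_processor_tail_sampling_count_traces_sampled{policy="sample-all-error-traces",sampled="false",...} 1
otelcol_processor_tail_sampling_count_traces_sampled{policy="sample-all-high-latency",sampled="true",...} 1
otelcol_processor_tail_sampling_count_traces_sampled{policy="sample-all-high-latency",sampled="false",...} 1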

Actual Result

For every trace that matches any policy (i.e., the trace is sampled), the sampled="true" counter increments for all policies.
Likewise, for every trace that matches no policy (i.e., the trace is not sampled), the sampled="false" counter increments for all policies.
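
That is, after the two test traces above, the scrape looks something like this (resource labels elided), rather than the four lines expected:

otelcol_processor_tail_sampling_count_traces_sampled{policy="sample-all-error-traces",sampled="true",...} 2
otelcol_processor_tail_sampling_count_traces_sampled{policy="sample-all-high-latency",sampled="true",...} 2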

Collector version

0.86.0

Environment information

Environment

docker image /otel/opentelemetry-collector:0.86.0@sha256:b8733b43d9061e3f332817f9f373ba5bf59803e67edfc4e70f280cb0afb49dd5

OpenTelemetry Collector configuration

extensions:
  memory_ballast:
    size_in_percentage: 34
  zpages:
    endpoint: 0.0.0.0:55679
  health_check:

receivers:
  otlp:
    protocols:
      grpc:
      http:
  prometheus:
    config:
      scrape_configs:
        - job_name: "otel-collector"
          scrape_interval: 15s
          static_configs:
            - targets: ["localhost:8888"]

processors:
  batch/metrics:
  batch/traces:
    send_batch_max_size: 1500
    send_batch_size: 1000
  tail_sampling:
    decision_wait: 30s
    expected_new_traces_per_sec: 100
    policies:
      [
        {
          name: sample-10%-all-traces,
          type: probabilistic,
          probabilistic: {sampling_percentage: 10}
        },
        {
          name: sample-all-error-traces,
          type: status_code,
          status_code: {status_codes: [ERROR]}
        },
        {
          name: sample-all-high-latency,
          type: latency,
          latency: { threshold_ms: 3000 }
        }
      ]
  memory_limiter:
    check_interval: 1s          # 1s is the default value
    limit_percentage: 80        # This defines the hard limit
    spike_limit_percentage: 20  # Soft limit = limit_percentage - spike_limit_percentage


exporters:
  logging:
  otlp/grafana:
    endpoint: http://grafana
    headers:
      authorization: Basic ${TEMPO_BASIC_AUTH}
  prometheusremotewrite:
    endpoint: http://prometheus
    remote_write_queue:
      enabled: true
      queue_size: 5000  # increase the queue size to 5000
    target_info:
      enabled: false
    resource_to_telemetry_conversion:
      enabled: true

service:
  telemetry:
    metrics:
      level: detailed
      address: "0.0.0.0:8888"
  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter, tail_sampling, batch/traces]
      exporters: [otlp/grafana, logging]
    metrics:
      receivers: [otlp, prometheus]
      processors: [memory_limiter, batch/metrics]
      exporters: [prometheusremotewrite]
  extensions: [memory_ballast, zpages, health_check]

Log output

No response

Additional context

No response

@0x006EA1E5 0x006EA1E5 added bug Something isn't working needs triage New item requiring triage labels Oct 9, 2023
@crobert-1 crobert-1 added the processor/tailsampling Tail sampling processor label Oct 9, 2023
@github-actions
Contributor

github-actions bot commented Oct 9, 2023

Pinging code owners for processor/tailsampling: @jpkrohling. See Adding Labels via Comments if you do not have permissions to add labels yourself.

@0x006EA1E5
Author

/label processor/tailsampling

@github-actions
Contributor

github-actions bot commented Oct 9, 2023

Pinging code owners for processor/tailsampling: @jpkrohling. See Adding Labels via Comments if you do not have permissions to add labels yourself.

0x006EA1E5 added a commit to 0x006EA1E5/opentelemetry-collector-contrib that referenced this issue Oct 10, 2023
@0x006EA1E5
Author

Oh, I guess this is a duplicate of #25882 🤦

@crobert-1
Member

@0x006EA1E5 Should we close this issue as a duplicate, or do you want to re-test on v0.87.0 first to make sure the issue has been resolved?

@0x006EA1E5
Author

Looks like this is already fixed in 0.87.0, but I will check on Monday if that's okay?

@crobert-1 crobert-1 removed the needs triage New item requiring triage label Oct 12, 2023
@0x006EA1E5
Author

closing as fixed
