probabilisticsampler processor stops sampling at 'sampling_percentage: 60' #30079

Open
Sakib37 opened this issue Dec 19, 2023 · 9 comments
Labels: bug (Something isn't working), processor/probabilisticsampler (Probabilistic Sampler processor), Stale

Comments


Sakib37 commented Dec 19, 2023

Component(s)

processor/probabilisticsampler

What happened?

Description

I am trying to control the percentage of logs that are shipped to the backend using the probabilisticsampler processor. During this test the volume of logs produced in the cluster did not change (i.e. no new pods were added to the cluster).

I am using the following config

      probabilistic_sampler/logs:
        hash_seed: 22
        sampling_percentage: 98
        attribute_source: record
        from_attribute: "cluster" # This attribute is added via 'resource' processor

With this config, I get around 1.8K logs in the Datadog dashboard. I then gradually reduce sampling_percentage from 98 to 90, 80, 70, 65, and 60. Down to sampling_percentage 65 I see no significant effect from the sampling in Datadog, and the total amount of logs stays almost the same.

However, when I set sampling_percentage to 60, no logs arrive in the backend (Datadog) at all. I tried the following two configs as well:

  probabilistic_sampler/logs:
    hash_seed: 22
    sampling_percentage: 98

  probabilistic_sampler/logs:
    sampling_percentage: 98

In every case, when I set sampling_percentage to 60, there are no logs in the backend.
My log pipeline in the otel collector is as follows:

  logs/datadog:
    exporters:
      - debug
      - datadog
    processors:
      - resource/common
      - k8sattributes
      - memory_limiter
      - probabilistic_sampler/logs
      - batch/logs
      - transform/filelog_labels
    receivers:
      - filelog

Steps to Reproduce

Try to sample logs using the probabilisticsampler processor with sampling_percentage set to 60 or below.

Expected Result

I expect accurate sampling based on the configured percentage. If I get ~1k logs with 65% sampling, then with 60% sampling I should still get roughly ~900 log lines in the backend.

Actual Result

No logs in the backend after setting sampling_percentage to 60

Collector version

0.91.0

Environment information

Environment

OS: (e.g., "Ubuntu 20.04")
Compiler (if manually compiled): (e.g., "go 14.2")

OpenTelemetry Collector configuration

receivers:
    filelog:
      exclude: []
      include:
      - /var/log/pods/*/*/*.log
      include_file_name: false
      include_file_path: true
      operators:
      - id: get-format
        routes:
        - expr: body matches "^\\{"
          output: parser-docker
        - expr: body matches "^[^ Z]+ "
          output: parser-crio
        - expr: body matches "^[^ Z]+Z"
          output: parser-containerd
        type: router
      - id: parser-crio
        regex: ^(?P<time>[^ Z]+) (?P<stream>stdout|stderr) (?P<logtag>[^ ]*) ?(?P<log>.*)$
        timestamp:
          layout: 2006-01-02T15:04:05.999999999Z07:00
          layout_type: gotime
          parse_from: attributes.time
        type: regex_parser
      - combine_field: attributes.log
        combine_with: ""
        id: crio-recombine
        is_last_entry: attributes.logtag == 'F'
        max_log_size: 102400
        output: extract_metadata_from_filepath
        source_identifier: attributes["log.file.path"]
        type: recombine
      - id: parser-containerd
        regex: ^(?P<time>[^ ^Z]+Z) (?P<stream>stdout|stderr) (?P<logtag>[^ ]*) ?(?P<log>.*)$
        timestamp:
          layout: '%Y-%m-%dT%H:%M:%S.%LZ'
          parse_from: attributes.time
        type: regex_parser
      - combine_field: attributes.log
        combine_with: ""
        id: containerd-recombine
        is_last_entry: attributes.logtag == 'F'
        max_log_size: 102400
        output: extract_metadata_from_filepath
        source_identifier: attributes["log.file.path"]
        type: recombine
      - id: parser-docker
        output: extract_metadata_from_filepath
        timestamp:
          layout: '%Y-%m-%dT%H:%M:%S.%LZ'
          parse_from: attributes.time
        type: json_parser
      - id: extract_metadata_from_filepath
        parse_from: attributes["log.file.path"]
        regex: ^.*\/(?P<namespace>[^_]+)_(?P<pod_name>[^_]+)_(?P<uid>[a-f0-9\-]+)\/(?P<container_name>[^\._]+)\/(?P<restart_count>\d+)\.log$
        type: regex_parser
      - from: attributes.stream
        to: attributes["log.iostream"]
        type: move
      - from: attributes.container_name
        to: resource["k8s.container.name"]
        type: move
      - from: attributes.namespace
        to: resource["k8s.namespace.name"]
        type: move
      - from: attributes.pod_name
        to: resource["k8s.pod.name"]
        type: move
      - from: attributes.restart_count
        to: resource["k8s.container.restart_count"]
        type: move
      - from: attributes.uid
        to: resource["k8s.pod.uid"]
        type: move
      - from: attributes.log
        to: body
        type: move

  batch/logs:
      # send_batch_max_size must be greater or equal to send_batch_size
      send_batch_max_size: 11000
      send_batch_size: 10000
      timeout: 10s

  transform/filelog_labels:
          log_statements:
          - context: log
            statements:
            # For the index
            - set(resource.attributes["service.name"], "integrations/kubernetes/logs")
            - set(resource.attributes["cluster"], attributes["cluster"])
            - set(resource.attributes["pod"], resource.attributes["k8s.pod.name"])
            - set(resource.attributes["container"], resource.attributes["k8s.container.name"])
            - set(resource.attributes["namespace"], resource.attributes["k8s.namespace.name"])
            - set(resource.attributes["filename"], attributes["log.file.path"])
            - set(resource.attributes["loki.resource.labels"], "pod, namespace, container, cluster, filename")
            # For the body
            - set(resource.attributes["loki.format"], "raw")
            - >
              set(body, Concat([
                Concat(["name", resource.attributes["k8s.object.name"]], "="),
                Concat(["kind", resource.attributes["k8s.object.kind"]], "="),
                Concat(["action", attributes["k8s.event.action"]], "="),
                Concat(["objectAPIversion", resource.attributes["k8s.object.api_version"]], "="),
                Concat(["objectRV", resource.attributes["k8s.object.resource_version"]], "="),
                Concat(["reason", attributes["k8s.event.reason"]], "="),
                Concat(["type", severity_text], "="),
                Concat(["count", resource.attributes["k8s.event.count"]], "="),
                Concat(["msg", body], "=")
              ], " "))

  exporters:
    debug: {}
    datadog:
      api:
        key: $${env:DATADOG_API_KEY}
        site: datadoghq.com

  service:
    extensions:
      - health_check
      - memory_ballast

    pipelines:
      logs/datadog:
        receivers:
          - filelog
        processors:
          - resource/common
          - k8sattributes
          - memory_limiter
          - probabilistic_sampler/logs
          - batch/logs
          #- transform/filelog_labels
        exporters:
          - debug
          - datadog

Log output

2023-12-19 10:43:45,447 INFO app [trace_id=de6328f2336ce1f7feeee7b512330250 span_id=1dcf616458576690 resource.service.name=ping_pong] waitress-2 : custom log 2
2023-12-19 10:43:45,959 INFO app [trace_id=a5e9e494204dd4feef2d5b1b90a04d7a span_id=dc3a55ba5efb65ab resource.service.name=ping_pong] waitress-3 : custom log 1
2023-12-19 10:43:45,959 INFO app [trace_id=a5e9e494204dd4feef2d5b1b90a04d7a span_id=e25a65fd90261b82 resource.service.name=ping_pong] waitress-3 : custom log 2
2023-12-19 10:43:46,484 INFO app [trace_id=c92a5ae80e33c1bc7f072067d204e2e5 span_id=08a8f332bb7eea3a resource.service.name=ping_pong] waitress-1 : custom log 1
2023-12-19 10:43:46,490 INFO app [trace_id=c92a5ae80e33c1bc7f072067d204e2e5 span_id=88b75c683989c2ef resource.service.name=ping_pong] waitress-1 : custom log 2
2023-12-19 10:43:47,024 INFO app [trace_id=ba552851ef2df3eaf9fddeae77ed40c2 span_id=d5a840450ec02c9e resource.service.name=ping_pong] waitress-0 : custom log 1

Additional context

No response

Sakib37 added the bug and needs triage labels on Dec 19, 2023
github-actions bot added the processor/probabilisticsampler label on Dec 19, 2023
github-actions bot (Contributor) commented

Pinging code owners:

See Adding Labels via Comments if you do not have permissions to add labels yourself.

atoulme (Contributor) commented Dec 23, 2023

That all depends on the record attribute ("cluster") that you use as the source of the sampling decision. It looks like its value is not evenly distributed across records, and therefore you get uneven results.
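
To illustrate the effect when every record carries the same sampling key (for example a single "cluster" value): the decision is a pure function of the hashed key, so the whole stream is either kept or dropped, and lowering the percentage produces a cliff rather than a gradual reduction. Below is a rough sketch in Go of hash-seed percentage sampling, not the processor's actual code; the FNV hash, the seed byte layout, the bucket count, and the "my-cluster" value are all assumptions for illustration.

package main

import (
	"encoding/binary"
	"fmt"
	"hash/fnv"
)

// Illustrative constant only; the real processor may use a different value.
const numBuckets = 0x4000 // 16384 hash buckets

// keep reports whether a record whose sampling key is `key` would be kept at
// the given sampling percentage under a hash-seed scheme of this kind.
func keep(key string, hashSeed uint32, samplingPercentage float64) bool {
	h := fnv.New32a()
	seed := make([]byte, 4)
	binary.LittleEndian.PutUint32(seed, hashSeed)
	h.Write(seed)        // mix in the configured hash_seed
	h.Write([]byte(key)) // then the sampling key (here, the "cluster" value)
	bucket := h.Sum32() % numBuckets
	threshold := uint32(samplingPercentage / 100 * numBuckets)
	return bucket < threshold
}

func main() {
	// Every record with the same "cluster" value lands in the same bucket,
	// so the result flips from all-kept to all-dropped at a single percentage.
	for _, pct := range []float64{98, 90, 80, 70, 65, 60} {
		fmt.Printf("sampling_percentage=%v -> keep=%v\n", pct, keep("my-cluster", 22, pct))
	}
}

If that is what is happening here, a sampling key that varies per record (or the default trace-ID source) should restore proportional sampling.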

atoulme removed the needs triage label on Dec 23, 2023
pierzapin commented

Just adding a +1 to this report. Similar barebones config and binary outcome.
I see that @Sakib37 experienced this without the "attribute_source: record" configuration (as did I), which would indicate that @atoulme's observation is unlikely to be the only factor.

jpkrohling (Member) commented

Would you please provide the state of the count_logs_sampled metric, as well as the receiver's accepted and the exporter's sent log-record counts? That would help us understand where the problem might be.
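
One way to read those counters is to scrape the collector's internal telemetry endpoint (Prometheus format, localhost:8888/metrics by default) while logs are flowing. A small helper sketch in Go; the endpoint and the metric-name substrings below are assumptions about a default setup, so adjust them to whatever your collector actually exposes.

package main

import (
	"bufio"
	"fmt"
	"net/http"
	"strings"
)

func main() {
	// Default internal-telemetry endpoint; change it if service.telemetry.metrics
	// is configured differently in your collector.
	resp, err := http.Get("http://localhost:8888/metrics")
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	// Substrings assumed to appear in the sampler, receiver, and exporter
	// log metrics we are interested in.
	wanted := []string{"count_logs_sampled", "accepted_log_records", "sent_log_records"}

	scanner := bufio.NewScanner(resp.Body)
	for scanner.Scan() {
		line := scanner.Text()
		for _, w := range wanted {
			if strings.Contains(line, w) {
				fmt.Println(line)
				break
			}
		}
	}
	if err := scanner.Err(); err != nil {
		panic(err)
	}
}

Comparing receiver-accepted records against exporter-sent records before and after changing the sampler setting should show whether the drop happens at the sampler or further down the pipeline.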

github-actions bot (Contributor) commented Apr 1, 2024

This issue has been inactive for 60 days. It will be closed in 60 days if there is no activity. To ping code owners by adding a component label, see Adding Labels via Comments, or if you are unsure of which component this issue relates to, please ping @open-telemetry/collector-contrib-triagers. If this issue is still relevant, please ping the code owners or leave a comment explaining why it is still relevant. Otherwise, please close it.

Pinging code owners:

See Adding Labels via Comments if you do not have permissions to add labels yourself.

github-actions bot added the Stale label on Apr 1, 2024
jpkrohling removed the Stale label on Apr 30, 2024
jpkrohling self-assigned this on Apr 30, 2024

github-actions bot added the Stale label on Jul 1, 2024
jpkrohling removed the Stale label on Jul 8, 2024
jpkrohling (Member) commented

@jmacd, do you have time to look into this one?


github-actions bot added the Stale label on Sep 9, 2024
jpkrohling removed the Stale label on Sep 9, 2024

github-actions bot added the Stale label on Nov 11, 2024