
[Connector/Servicegraph] samples have been rejected because of same timestamp, but a different value (err-mimir-sample-duplicate-timestamp) #34169

Open
VijayPatil872 opened this issue Jul 19, 2024 · 9 comments
Labels
bug Something isn't working connector/servicegraph

Comments

@VijayPatil872

Component(s)

connector/servicegraph

What happened?

Description

Currently we are facing an issue with the OpenTelemetry Collector's servicegraph connector: some samples are rejected because another sample with the same timestamp, but a different value, has already been ingested (err-mimir-sample-duplicate-timestamp) when the metrics are pushed to Mimir.

We are using the servicegraph connector to build a service graph. We have deployed a layer of Collectors containing the load-balancing exporter in front of the trace Collectors that do the spanmetrics and servicegraph connector processing. The load-balancing exporter hashes the trace ID consistently and determines which Collector backend should receive the spans for that trace.
The servicegraph metrics are exported to Grafana Mimir with the prometheusremotewrite exporter. The Mimir distributor fails to ingest some of the metrics and reports the following error:

ts=2024-07-19T07:26:46.442694833Z caller=push.go:171 level=error user=default-processor-servicegraph msg="push error" err="failed pushing to ingester mimir-ingester-zone-a-2: user=default-processor-servicegraph: the sample has been rejected because another sample with the same timestamp, but a different value, has already been ingested (err-mimir-sample-duplicate-timestamp). The affected sample has timestamp 2024-07-19T07:26:46.23Z and is from series traces_service_graph_request_client_seconds_bucket{client=\"claims-service\", connection_type=\"virtual_node\", failed=\"false\", le=\"0.1\", server=\"xxxxx.redis.cache.windows.net\"}"

Could someone please help with eliminating this error?
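
For reference, the load-balancing tier described above is configured roughly along these lines (a minimal sketch, not our exact config; the resolver service name is a placeholder):

exporters:
  loadbalancing:
    routing_key: traceID                # hash spans by trace ID
    protocol:
      otlp:
        tls:
          insecure: true
    resolver:
      k8s:
        service: otel-collector-headless.observability   # placeholder headless service

service:
  pipelines:
    traces:
      exporters:
        - loadbalancing
      receivers:
        - otlp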

Steps to Reproduce

Expected Result

Metric ingestion failures should be zero; all samples should be accepted by Mimir.

Actual Result

We see metrics failing because of the above-mentioned error on the OpenTelemetry dashboard, as shown below.
[screenshot: failed metric points on the OpenTelemetry Collector dashboard]

Collector version

0.104.0

Environment information

No response

OpenTelemetry Collector configuration

config:        
  exporters:


    prometheusremotewrite/mimir-default-processor-spanmetrics:
      endpoint: 
      headers:
        x-scope-orgid: 
      resource_to_telemetry_conversion:
        enabled: true
      timeout: 30s
      tls:
        insecure: true
      remote_write_queue:
        enabled: true
        queue_size: 100000
        num_consumers: 500        

    prometheusremotewrite/mimir-default-servicegraph:
      endpoint: 
      headers:
        x-scope-orgid: 
      resource_to_telemetry_conversion:
        enabled: true
      timeout: 30s  
      tls:
        insecure: true
      remote_write_queue:
        enabled: true
        queue_size: 100000
        num_consumers: 500


  connectors:
    spanmetrics:
      histogram:
        explicit:
          buckets: [100ms, 500ms, 2s, 5s, 10s, 20s, 30s]
      aggregation_temporality: "AGGREGATION_TEMPORALITY_CUMULATIVE"
      metrics_flush_interval: 15s
      metrics_expiration: 5m
      exemplars:
        enabled: false
      dimensions:
        - name: http.method
        - name: http.status_code
        - name: cluster
        - name: collector.hostname
      events:
        enabled: true
        dimensions:
          - name: exception.type
      resource_metrics_key_attributes:
        - service.name
        - telemetry.sdk.language
        - telemetry.sdk.name
    servicegraph:
      latency_histogram_buckets: [100ms, 250ms, 1s, 5s, 10s]
      store:
        ttl: 2s
        max_items: 10

  receivers:
    otlp:
      protocols:
        http:
          endpoint: ${env:MY_POD_IP}:4318
        grpc:
          endpoint: ${env:MY_POD_IP}:4317
  service:


    pipelines:
      traces/connector-pipeline:
        exporters:
          - otlphttp/tempo-processor-default
          - spanmetrics
          - servicegraph
        processors:
          - batch          
          - memory_limiter
        receivers:
          - otlp
     
      metrics/spanmetrics:
        exporters:
          - debug
          - prometheusremotewrite/mimir-default-processor-spanmetrics
        processors:
          - batch          
          - memory_limiter
        receivers:
          - spanmetrics

      metrics/servicegraph:
        exporters:
          - debug
          - prometheusremotewrite/mimir-default-servicegraph
        processors:
          - batch          
          - memory_limiter
        receivers:
          - servicegraph

Log output

No response

Additional context

No response

@VijayPatil872 VijayPatil872 added bug Something isn't working needs triage New item requiring triage labels Jul 19, 2024
Contributor

Pinging code owners:

See Adding Labels via Comments if you do not have permissions to add labels yourself.

@mapno
Contributor

mapno commented Jul 19, 2024

Hi @VijayPatil872. Yes, this is a very unfortunate side effect of horizontally scaling the connector. A workaround is to add a label to the metrics that corresponds to the collector pod name, or some other value that makes the series unique across instances.
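
For example, a resource processor on the metrics pipelines can stamp every series with the pod name (a sketch; resource/collector-id, collector.pod.name and MY_POD_NAME are placeholder names, and the environment variable is assumed to be injected via the Kubernetes downward API):

processors:
  resource/collector-id:
    attributes:
      - key: collector.pod.name        # placeholder label name
        value: ${env:MY_POD_NAME}      # assumed to be set via the downward API
        action: insert

service:
  pipelines:
    metrics/servicegraph:
      exporters:
        - debug
        - prometheusremotewrite/mimir-default-servicegraph
      processors:
        - resource/collector-id
        - batch
        - memory_limiter
      receivers:
        - servicegraph

Since resource_to_telemetry_conversion is enabled on the prometheusremotewrite exporter, the resource attribute then ends up as a label on every exported series.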

@crobert-1 crobert-1 removed the needs triage New item requiring triage label Jul 19, 2024
@VijayPatil872
Author

Hi @mapno, could you please elaborate on how to add a label to the metrics that corresponds to the collector pod name, or something else that makes the series unique across instances?

@mapno
Contributor

mapno commented Jul 22, 2024

Hi @VijayPatil872. I believe something like the k8sattributesprocessor should work for that. With it, you can add a label like k8s.pod.name to your metrics and make the series unique between instances.
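
A minimal sketch of what that could look like next to the pipelines above (assuming the Collector's service account is allowed to read pod metadata):

processors:
  k8sattributes:
    auth_type: serviceAccount
    extract:
      metadata:
        - k8s.pod.name

service:
  pipelines:
    metrics/servicegraph:
      exporters:
        - debug
        - prometheusremotewrite/mimir-default-servicegraph
      processors:
        - k8sattributes
        - batch
        - memory_limiter
      receivers:
        - servicegraph

With resource_to_telemetry_conversion enabled on the exporter, the k8s.pod.name resource attribute should then show up as a metric label.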

@VijayPatil872
Author

Hi @mapno, I tried the workaround with the k8sattributesprocessor. The labels mentioned in the configuration show up in the OTel Collector logs, but the issue still persists. It has not worked for me.

@mapno
Contributor

mapno commented Jul 24, 2024

Do the metrics now have k8s.pod.name as a label, and do you still get the same errors?

@VijayPatil872
Author

Hi @mapno, the k8sattributesprocessor was added with the following configuration:

k8sattributes:
  auth_type: "serviceAccount"
  passthrough: false
  extract:
    metadata:
      - k8s.namespace.name
      - k8s.deployment.name
      - k8s.statefulset.name
      - k8s.daemonset.name
      - k8s.cronjob.name
      - k8s.job.name
      - k8s.node.name
      - k8s.pod.name
      - k8s.pod.uid
      - k8s.pod.start_time
  pod_association:
    - sources:
        - from: resource_attribute
          name: k8s.namespace.name
        - from: resource_attribute
          name: k8s.pod.name

It can be seen in the OpenTelemetry Collector logs that the labels are being added wherever they are available, but the issue still persists.

Contributor

This issue has been inactive for 60 days. It will be closed in 60 days if there is no activity. To ping code owners by adding a component label, see Adding Labels via Comments, or if you are unsure of which component this issue relates to, please ping @open-telemetry/collector-contrib-triagers. If this issue is still relevant, please ping the code owners or leave a comment explaining why it is still relevant. Otherwise, please close it.

Pinging code owners:

See Adding Labels via Comments if you do not have permissions to add labels yourself.

@github-actions github-actions bot added the Stale label Sep 24, 2024
@MayurCXone

not stale.

@github-actions github-actions bot removed the Stale label Oct 22, 2024