
[Connector/Servicegraph] samples have been rejected because of same timestamp, but a different value (err-mimir-sample-duplicate-timestamp) #34169

Open
VijayPatil872 opened this issue Jul 19, 2024 · 9 comments
Labels
bug Something isn't working connector/servicegraph

Comments

@VijayPatil872

Component(s)

connector/servicegraph

What happened?

Description

Currently we are facing an issue with the OpenTelemetry Collector's servicegraph connector: some samples are rejected because another sample with the same timestamp, but a different value, has already been ingested (err-mimir-sample-duplicate-timestamp) when the metrics are pushed to Mimir.

We are using the servicegraph connector to build a service graph. We have deployed a layer of Collectors containing the load-balancing exporter in front of the trace Collectors that do the spanmetrics and servicegraph connector processing. The load-balancing exporter hashes the trace ID consistently and determines which Collector backend should receive the spans for that trace.
The servicegraph metrics are exported to Grafana Mimir with the prometheusremotewrite exporter. The Mimir distributor fails to ingest some of the metrics and reports the following error:

ts=2024-07-19T07:26:46.442694833Z caller=push.go:171 level=error user=default-processor-servicegraph msg="push error" err="failed pushing to ingester mimir-ingester-zone-a-2: user=default-processor-servicegraph: the sample has been rejected because another sample with the same timestamp, but a different value, has already been ingested (err-mimir-sample-duplicate-timestamp). The affected sample has timestamp 2024-07-19T07:26:46.23Z and is from series traces_service_graph_request_client_seconds_bucket{client=\"claims-service\", connection_type=\"virtual_node\", failed=\"false\", le=\"0.1\", server=\"xxxxx.redis.cache.windows.net\"}"

Could someone please help with eliminating this error?
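
For reference, the load-balancing tier described above is configured roughly along these lines (a minimal sketch, not our exact config; the resolver service name is a placeholder):

exporters:
  loadbalancing:
    routing_key: traceID                # hash spans by trace ID
    protocol:
      otlp:
        tls:
          insecure: true
    resolver:
      k8s:
        service: otel-collector-headless.observability   # placeholder headless service

service:
  pipelines:
    traces:
      exporters:
        - loadbalancing
      receivers:
        - otlp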

Steps to Reproduce

Expected Result

Metric ingestion failures should be zero; all samples should be accepted by Mimir.

Actual Result

We see metrics failing because of the above-mentioned error on the OpenTelemetry dashboard, as shown below.
[screenshot: failed metric points on the OpenTelemetry Collector dashboard]

Collector version

0.104.0

Environment information

No response

OpenTelemetry Collector configuration

config:        
  exporters:


    prometheusremotewrite/mimir-default-processor-spanmetrics:
      endpoint: 
      headers:
        x-scope-orgid: 
      resource_to_telemetry_conversion:
        enabled: true
      timeout: 30s
      tls:
        insecure: true
      remote_write_queue:
        enabled: true
        queue_size: 100000
        num_consumers: 500        

    prometheusremotewrite/mimir-default-servicegraph:
      endpoint: 
      headers:
        x-scope-orgid: 
      resource_to_telemetry_conversion:
        enabled: true
      timeout: 30s  
      tls:
        insecure: true
      remote_write_queue:
        enabled: true
        queue_size: 100000
        num_consumers: 500


  connectors:
    spanmetrics:
      histogram:
        explicit:
          buckets: [100ms, 500ms, 2s, 5s, 10s, 20s, 30s]
      aggregation_temporality: "AGGREGATION_TEMPORALITY_CUMULATIVE"
      metrics_flush_interval: 15s
      metrics_expiration: 5m
      exemplars:
        enabled: false
      dimensions:
        - name: http.method
        - name: http.status_code
        - name: cluster
        - name: collector.hostname
      events:
        enabled: true
        dimensions:
          - name: exception.type
      resource_metrics_key_attributes:
        - service.name
        - telemetry.sdk.language
        - telemetry.sdk.name
    servicegraph:
      latency_histogram_buckets: [100ms, 250ms, 1s, 5s, 10s]
      store:
        ttl: 2s
        max_items: 10

  receivers:
    otlp:
      protocols:
        http:
          endpoint: ${env:MY_POD_IP}:4318
        grpc:
          endpoint: ${env:MY_POD_IP}:4317
  service:


    pipelines:
      traces/connector-pipeline:
        exporters:
          - otlphttp/tempo-processor-default
          - spanmetrics
          - servicegraph
        processors:
          - batch          
          - memory_limiter
        receivers:
          - otlp
     
      metrics/spanmetrics:
        exporters:
          - debug
          - prometheusremotewrite/mimir-default-processor-spanmetrics
        processors:
          - batch          
          - memory_limiter
        receivers:
          - spanmetrics

      metrics/servicegraph:
        exporters:
          - debug
          - prometheusremotewrite/mimir-default-servicegraph
        processors:
          - batch          
          - memory_limiter
        receivers:
          - servicegraph

Log output

No response

Additional context

No response

@VijayPatil872 VijayPatil872 added bug Something isn't working needs triage New item requiring triage labels Jul 19, 2024
Contributor

Pinging code owners:

See Adding Labels via Comments if you do not have permissions to add labels yourself.

@mapno
Contributor

mapno commented Jul 19, 2024

Hi @VijayPatil872. Yes, this is a very unfortunate side effect of horizontally scaling the connector. A workaround is to add a label to the metrics that corresponds to the collector pod name, or some other value that makes the series unique across instances.
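
For example, a resource processor on the metrics pipelines can stamp every series with the pod name (a sketch; resource/collector-id, collector.pod.name and MY_POD_NAME are placeholder names, and the environment variable is assumed to be injected via the Kubernetes downward API):

processors:
  resource/collector-id:
    attributes:
      - key: collector.pod.name        # placeholder label name
        value: ${env:MY_POD_NAME}      # assumed to be set via the downward API
        action: insert

service:
  pipelines:
    metrics/servicegraph:
      exporters:
        - debug
        - prometheusremotewrite/mimir-default-servicegraph
      processors:
        - resource/collector-id
        - batch
        - memory_limiter
      receivers:
        - servicegraph

Since resource_to_telemetry_conversion is enabled on the prometheusremotewrite exporter, the resource attribute then ends up as a label on every exported series.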

@crobert-1 crobert-1 removed the needs triage New item requiring triage label Jul 19, 2024
@VijayPatil872
Author

Hi @mapno, could you please elaborate on how to add a label to the metrics that corresponds to the collector pod name, or something else that makes the series unique across instances?

@mapno
Contributor

mapno commented Jul 22, 2024

Hi @VijayPatil872. I believe something like the k8sattributesprocessor should work for that. With it, you can add a label like k8s.pod.name to your metrics and make the series unique between instances.
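
A minimal sketch of what that could look like next to the pipelines above (assuming the Collector's service account is allowed to read pod metadata):

processors:
  k8sattributes:
    auth_type: serviceAccount
    extract:
      metadata:
        - k8s.pod.name

service:
  pipelines:
    metrics/servicegraph:
      exporters:
        - debug
        - prometheusremotewrite/mimir-default-servicegraph
      processors:
        - k8sattributes
        - batch
        - memory_limiter
      receivers:
        - servicegraph

With resource_to_telemetry_conversion enabled on the exporter, the k8s.pod.name resource attribute should then show up as a metric label.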

@VijayPatil872
Author

Hi @mapno, I tried the workaround with the k8sattributesprocessor. The labels mentioned in the configuration show up in the OTel Collector logs, but the issue still persists. It has not worked for me.

@mapno
Contributor

mapno commented Jul 24, 2024

Do the metrics now have k8s.pod.name as a label, and do you still get the same errors?

@VijayPatil872
Author

Hi @mapno, the k8sattributesprocessor was added with the following configuration:

k8sattributes:
  auth_type: "serviceAccount"
  passthrough: false
  extract:
    metadata:
      - k8s.namespace.name
      - k8s.deployment.name
      - k8s.statefulset.name
      - k8s.daemonset.name
      - k8s.cronjob.name
      - k8s.job.name
      - k8s.node.name
      - k8s.pod.name
      - k8s.pod.uid
      - k8s.pod.start_time
  pod_association:
    - sources:
        - from: resource_attribute
          name: k8s.namespace.name
        - from: resource_attribute
          name: k8s.pod.name

It can be seen in the OpenTelemetry Collector logs that the labels are being added wherever they are available, but the issue still persists.

Contributor

This issue has been inactive for 60 days. It will be closed in 60 days if there is no activity. To ping code owners by adding a component label, see Adding Labels via Comments, or if you are unsure of which component this issue relates to, please ping @open-telemetry/collector-contrib-triagers. If this issue is still relevant, please ping the code owners or leave a comment explaining why it is still relevant. Otherwise, please close it.

Pinging code owners:

See Adding Labels via Comments if you do not have permissions to add labels yourself.

@github-actions github-actions bot added the Stale label Sep 24, 2024
@MayurCXone

not stale.

@github-actions github-actions bot removed the Stale label Oct 22, 2024