
vSphere Input - Prometheus : Error on ingesting samples with different value but same timestamp #9705

Closed
6monlambert opened this issue Sep 1, 2021 · 1 comment
Labels
area/prometheus, bug (unexpected problem or unintended behavior)

Comments

@6monlambert (Contributor)

Relevant telegraf.conf:

[agent]
  flush_interval = "30s"
  interval = "30s"
  round_interval = true
  metric_batch_size = 5000
  metric_buffer_limit = 100000
  debug = true
  logfile = "/var/log/telegraf.log"
[[outputs.prometheus_client]]
  ## Address to listen on.
  listen = ":9273"

  ## Metric version controls the mapping from Telegraf metrics into
  ## Prometheus format.  When using the prometheus input, use the same value in
  ## both plugins to ensure metrics are round-tripped without modification.

  export_timestamp = true
  data_format = "prometheus"
  path = "/metrics"
  metric_version = 2

## Realtime instance
[[inputs.vsphere]]
  interval = "1m"
  vcenters = [ "XXX" ]
  username = "XXX"
  password = "XXX"

  insecure_skip_verify = true
  force_discover_on_init = true

  # Exclude all historical metrics
  datastore_metric_exclude = ["*"]
  cluster_metric_exclude = ["*"]
  datacenter_metric_exclude = ["*"]
  #collect_concurrency = 5
  #discover_concurrency = 5
  vm_metric_include = [
    "cpu.usagemhz.average",
    "cpu.usage.average",
    "cpu.used.summation",
    "cpu.wait.summation",
    "mem.active.average",
    "mem.granted.average",
    "mem.usage.average",
    "mem.consumed.average",
    "net.usage.average",
    "power.power.average",
    "sys.uptime.latest",
  ]

  host_metric_include = [
    "cpu.coreUtilization.average",
    "cpu.usage.average",
    "cpu.usagemhz.average",
    "cpu.used.summation",
    "cpu.utilization.average",
    "cpu.wait.summation",
    "disk.read.average",
    "disk.write.average",
    "mem.active.average",
    "mem.totalCapacity.average",
    "mem.usage.average",
    "net.usage.average",
    "power.power.average",
    "sys.uptime.latest",
  ]

# Historical instance
[[inputs.vsphere]]

  interval = "300s"

  vcenters = [ "XXX" ]
  username = "XXX"
  password = "XXX"

  insecure_skip_verify = true
  force_discover_on_init = true
  host_metric_exclude = ["*"] # Exclude realtime metrics
  vm_metric_exclude = ["*"] # Exclude realtime metrics
  cluster_metric_exclude = ["vpxd.stats.maxQueryMetrics"]
  #max_query_metrics = 256
  #collect_concurrency = 3

System info:

Linux XXX 5.4.0-81-generic #91-Ubuntu SMP Thu Jul 15 19:09:17 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux
Telegraf 1.19.0 (git: HEAD ed412d7)

Steps to reproduce:

I get a duplicate timestamp error on my Prometheus server when it scrapes the metrics returned by the vSphere plugin.
The message is: error on ingesting samples with different value but same timestamp
When I enable debug mode in Prometheus, every metric returns this error: msg="Duplicate sample for timestamp"

My prometheus.yml:

global:
  scrape_interval:     15s # By default, scrape targets every 15 seconds.

  # Attach these labels to any time series or alerts when communicating with
  # external systems (federation, remote storage, Alertmanager).
  external_labels:
    monitor: 'codelab-monitor'

# A scrape configuration containing exactly one endpoint to scrape:
# Here it's Prometheus itself.
scrape_configs:
  # The job name is added as a label `job=<job_name>` to any timeseries scraped from this config.
  - job_name: 'prometheus'

    # Override the global default and scrape targets from this job every 5 seconds.
    scrape_interval: 5s

    static_configs:
      - targets: ['localhost:9090']

  - job_name: 'telegraf'

    # Override the global default and scrape targets from this job every 5 seconds.
    scrape_interval: 15s
    basic_auth:
      username: 'XXX'
      password: 'XXX'
    static_configs:
      - targets: ['XXX:9273', 'XXX:9273']
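For context, with export_timestamp = true each exposed sample carries an explicit timestamp in milliseconds after the value, in the Prometheus text exposition format. A made-up example (metric name, label and value are illustrative, not copied from my actual /metrics output):

  vsphere_vm_cpu_usage_average{vmname="example-vm"} 12.5 1630491720000

Prometheus rejects a sample with "duplicate sample for timestamp" when it already has a sample for that series at the same timestamp but with a different value.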

I tried to remove export_timestamp from my Telegraf configuration, but this causes data loss.
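For reference, the change I tried looks like this (same output section as above, with the timestamp export dropped):

[[outputs.prometheus_client]]
  listen = ":9273"
  data_format = "prometheus"
  path = "/metrics"
  metric_version = 2
  # export_timestamp = true   <- removed, so Prometheus assigns the scrape time instead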

Expected behavior:

No duplicate timestamp errors, or at least no data loss when the timestamp is not exported.

Actual behavior:

When I remove the timestamp from the exporter, I get these holes in my graphs:

[screenshot: graphs showing gaps in the collected series]

Additional info:

Sep  1 12:22:03 pue-poller telegraf[3004]: 2021-09-01T10:22:03Z D! [outputs.prometheus_client] Buffer fullness: 6348 / 100000 metrics
Sep  1 12:22:04 pue-poller telegraf[3004]: 2021-09-01T10:22:04Z D! [inputs.vsphere] CollectChunk for vm returned 30052 metrics
Sep  1 12:22:04 pue-poller telegraf[3004]: 2021-09-01T10:22:04Z D! [inputs.vsphere] CollectChunk for vm returned 30052 metrics
Sep  1 12:22:04 pue-poller telegraf[3004]: 2021-09-01T10:22:04Z D! [inputs.vsphere] Latest sample for vm set to 2021-09-01 10:22:00 +0000 UTC
Sep  1 12:22:04 pue-poller telegraf[3004]: 2021-09-01T10:22:04Z D! [inputs.vsphere] purged timestamp cache. 0 deleted with 5695 remaining
Sep  1 12:22:04 pue-poller telegraf[3004]: 2021-09-01T10:22:04Z D! [outputs.prometheus_client] Wrote batch of 5000 metrics in 661.371467ms
Sep  1 12:22:04 pue-poller telegraf[3004]: 2021-09-01T10:22:04Z D! [outputs.prometheus_client] Buffer fullness: 9514 / 100000 metrics
Sep  1 12:22:04 pue-poller telegraf[3004]: 2021-09-01T10:22:04Z D! [outputs.prometheus_client] Wrote batch of 5000 metrics in 126.269194ms
Sep  1 12:22:04 pue-poller telegraf[3004]: 2021-09-01T10:22:04Z D! [outputs.prometheus_client] Buffer fullness: 4514 / 100000 metrics
Sep  1 12:22:15 pue-poller telegraf[3004]: 2021-09-01T10:22:15Z D! [aggregators.basicstats] Updated aggregation range [2021-09-01 12:22:15 +0200 CEST, 2021-09-01 12:22:30 +0200 CEST]
Sep  1 12:22:15 pue-poller telegraf[3004]: 2021-09-01T10:22:15Z D! [aggregators.basicstats] Updated aggregation range [2021-09-01 12:22:15 +0200 CEST, 2021-09-01 12:22:30 +0200 CEST]
Sep  1 12:22:30 pue-poller telegraf[3004]: 2021-09-01T10:22:30Z D! [aggregators.basicstats] Updated aggregation range [2021-09-01 12:22:30 +0200 CEST, 2021-09-01 12:22:45 +0200 CEST]
Sep  1 12:22:30 pue-poller telegraf[3004]: 2021-09-01T10:22:30Z D! [aggregators.basicstats] Updated aggregation range [2021-09-01 12:22:30 +0200 CEST, 2021-09-01 12:22:45 +0200 CEST]
Sep  1 12:22:34 pue-poller telegraf[3004]: 2021-09-01T10:22:34Z D! [outputs.prometheus_client] Wrote batch of 4544 metrics in 158.030895ms
Sep  1 12:22:34 pue-poller telegraf[3004]: 2021-09-01T10:22:34Z D! [outputs.prometheus_client] Buffer fullness: 0 / 100000 metrics
Sep  1 12:22:45 pue-poller telegraf[3004]: 2021-09-01T10:22:45Z D! [aggregators.basicstats] Updated aggregation range [2021-09-01 12:22:45 +0200 CEST, 2021-09-01 12:23:00 +0200 CEST]
Sep  1 12:22:45 pue-poller telegraf[3004]: 2021-09-01T10:22:45Z D! [aggregators.basicstats] Updated aggregation range [2021-09-01 12:22:45 +0200 CEST, 2021-09-01 12:23:00 +0200 CEST]
Sep  1 12:23:00 pue-poller telegraf[3004]: 2021-09-01T10:23:00Z D! [aggregators.basicstats] Updated aggregation range [2021-09-01 12:23:00 +0200 CEST, 2021-09-01 12:23:15 +0200 CEST]
Sep  1 12:23:00 pue-poller telegraf[3004]: 2021-09-01T10:23:00Z D! [aggregators.basicstats] Updated aggregation range [2021-09-01 12:23:00 +0200 CEST, 2021-09-01 12:23:15 +0200 CEST]
Sep  1 12:23:00 pue-poller telegraf[3004]: 2021-09-01T10:23:00Z D! [inputs.vsphere] Raw interval 58.582334725s, padded: 1m8.582334725s, estimated: 1m0s
Sep  1 12:23:00 pue-poller telegraf[3004]: 2021-09-01T10:23:00Z D! [inputs.vsphere] Interval estimated to 1m0s
Sep  1 12:23:00 pue-poller telegraf[3004]: 2021-09-01T10:23:00Z D! [inputs.vsphere] Latest: 2021-09-01 10:22:00 +0000 UTC, elapsed: 65.014441, resource: vm
Sep  1 12:23:00 pue-poller telegraf[3004]: 2021-09-01T10:23:00Z D! [inputs.vsphere] Collecting metrics for 505 objects of type vm for XXX
Sep  1 12:23:00 pue-poller telegraf[3004]: 2021-09-01T10:23:00Z D! [inputs.vsphere] Submitting job for vm: 505 objects, 5555 metrics
Sep  1 12:23:00 pue-poller telegraf[3004]: 2021-09-01T10:23:00Z D! [inputs.vsphere] Query for vm has 505 QuerySpecs
Sep  1 12:23:00 pue-poller telegraf[3004]: 2021-09-01T10:23:00Z D! [inputs.vsphere] Raw interval 58.58388584s, padded: 1m8.58388584s, estimated: 1m0s
Sep  1 12:23:00 pue-poller telegraf[3004]: 2021-09-01T10:23:00Z D! [inputs.vsphere] Interval estimated to 1m0s
Sep  1 12:23:00 pue-poller telegraf[3004]: 2021-09-01T10:23:00Z D! [inputs.vsphere] Latest: 2021-09-01 10:22:00 +0000 UTC, elapsed: 65.030854, resource: host
Sep  1 12:23:00 pue-poller telegraf[3004]: 2021-09-01T10:23:00Z D! [inputs.vsphere] Collecting metrics for 10 objects of type host for XXX
Sep  1 12:23:00 pue-poller telegraf[3004]: 2021-09-01T10:23:00Z D! [inputs.vsphere] Submitting job for host: 10 objects, 140 metrics
Sep  1 12:23:00 pue-poller telegraf[3004]: 2021-09-01T10:23:00Z D! [inputs.vsphere] Query for host has 10 QuerySpecs
Sep  1 12:23:00 pue-poller telegraf[3004]: 2021-09-01T10:23:00Z D! [inputs.vsphere] Query for host returned metrics for 10 objects
Sep  1 12:23:00 pue-poller telegraf[3004]: 2021-09-01T10:23:00Z D! [inputs.vsphere] CollectChunk for host returned 2370 metrics
Sep  1 12:23:00 pue-poller telegraf[3004]: 2021-09-01T10:23:00Z D! [inputs.vsphere] CollectChunk for host returned 2370 metrics
Sep  1 12:23:00 pue-poller telegraf[3004]: 2021-09-01T10:23:00Z D! [inputs.vsphere] Latest sample for host set to 2021-09-01 10:23:00 +0000 UTC
Sep  1 12:23:02 pue-poller telegraf[3004]: 2021-09-01T10:23:02Z D! [inputs.vsphere] Query for vm returned metrics for 505 objects
Sep  1 12:23:02 pue-poller telegraf[3004]: 2021-09-01T10:23:02Z D! [inputs.vsphere] CollectChunk for vm returned 28805 metrics
Sep  1 12:23:02 pue-poller telegraf[3004]: 2021-09-01T10:23:02Z D! [inputs.vsphere] CollectChunk for vm returned 28805 metrics
Sep  1 12:23:02 pue-poller telegraf[3004]: 2021-09-01T10:23:02Z D! [inputs.vsphere] Latest sample for vm set to 2021-09-01 10:23:00 +0000 UTC
Sep  1 12:23:02 pue-poller telegraf[3004]: 2021-09-01T10:23:02Z D! [inputs.vsphere] purged timestamp cache. 0 deleted with 5695 remaining
Sep  1 12:23:02 pue-poller telegraf[3004]: 2021-09-01T10:23:02Z D! [outputs.prometheus_client] Wrote batch of 5000 metrics in 187.67811ms
Sep  1 12:23:02 pue-poller telegraf[3004]: 2021-09-01T10:23:02Z D! [outputs.prometheus_client] Buffer fullness: 13153 / 100000 metrics
Sep  1 12:23:02 pue-poller telegraf[3004]: 2021-09-01T10:23:02Z D! [outputs.prometheus_client] Wrote batch of 5000 metrics in 25.151929ms
Sep  1 12:23:02 pue-poller telegraf[3004]: 2021-09-01T10:23:02Z D! [outputs.prometheus_client] Buffer fullness: 8153 / 100000 metrics
6monlambert added the bug label on Sep 1, 2021
@6monlambert (Contributor, Author)

I managed to deal with this issue.

First, I migrated to VictoriaMetrics (https://github.com/VictoriaMetrics/VictoriaMetrics), which scales better than Prometheus and uses the same query language, so I did not have to modify my dashboards.

Then I removed the timestamps from the Prometheus exporter and adjusted the scrape interval on my Prometheus and Grafana dashboards. As I collect the metrics every 30 seconds from my vCenters, I set the scrape interval to 30s in Grafana but left Prometheus scraping at its default on my VM; that (I think) was the problem.
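In other words, the idea is to keep the Telegraf collection interval and the scrape interval aligned. On the scrape side that would look something like this (a sketch, not my exact config; the targets are placeholders, and VictoriaMetrics' scraper accepts the same Prometheus-compatible scrape_configs syntax):

  - job_name: 'telegraf'
    # match the 30s Telegraf agent interval so each scrape sees a fresh sample
    scrape_interval: 30s
    static_configs:
      - targets: ['XXX:9273']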
