Relevant telegraf.conf:
# Read metrics from VMware vCenter
[[inputs.vsphere]]
interval = "20s"
## List of vCenter URLs to be monitored. These three lines must be uncommented
## and edited for the plugin to work.
vcenters = [ "https://example.com:8989/sdk" ]
username = "user@corp.local"
password = "secret"
## VMs
## Typical VM metrics (if omitted or empty, all metrics are collected)
vm_metric_include = ["*"]
# vm_instances = true ## true by default
## Hosts
## Typical host metrics (if omitted or empty, all metrics are collected)
host_metric_include = [ "*" ]
# host_metric_exclude = [] ## Nothing excluded by default
# host_instances = true ## true by default
## Clusters
cluster_metric_include = [ "*" ] ## if omitted or empty, all metrics are collected
# cluster_metric_exclude = [] ## Nothing excluded by default
# cluster_instances = true ## true by default
## Datastores
datastore_metric_include = [ "*" ] ## if omitted or empty, all metrics are collected
# datastore_metric_exclude = [] ## Nothing excluded by default
# datastore_instances = false ## false by default for Datastores only
## Datacenters
datacenter_metric_include = [ "*" ] ## if omitted or empty, all metrics are collected
# datacenter_instances = false ## false by default
## Plugin Settings
## separator character to use for measurement and field names (default: "_")
# separator = "_"
## number of objects to retrieve per query for realtime resources (vms and hosts)
## set to 64 for vCenter 5.5 and 6.0 (default: 256)
# max_query_objects = 256
## number of metrics to retrieve per query for non-realtime resources (clusters and datastores)
## set to 64 for vCenter 5.5 and 6.0 (default: 256)
# max_query_metrics = 256
## number of goroutines to use for collection and discovery of objects and metrics
collect_concurrency = 3
discover_concurrency = 3
## whether or not to force discovery of new objects on initial gather call before collecting metrics
## when true, this may cause errors in large environments because of the time elapsed while collecting metrics
## when false (default), the first collection cycle may result in no or limited metrics while objects are discovered
# force_discover_on_init = false
## the interval before (re)discovering objects subject to metrics collection (default: 300s)
# object_discovery_interval = "300s"
## timeout applies to any of the api requests made to vcenter
# timeout = "20s"
## Optional SSL Config
# ssl_ca = "/path/to/cafile"
# ssl_cert = "/path/to/certfile"
# ssl_key = "/path/to/keyfile"
## Use SSL but skip chain & host verification
insecure_skip_verify = true
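As an aside on the collect_concurrency and discover_concurrency settings above: they bound the number of goroutines the plugin uses for collection and discovery. The following is a minimal, hypothetical sketch of that bounded worker pool pattern (the helper name runPool is invented here; this is not the plugin's actual workerpool.go):

package main

import (
	"fmt"
	"sync"
)

// runPool processes every job using at most "concurrency" goroutines,
// roughly what a setting like collect_concurrency = 3 controls.
func runPool(concurrency int, jobs []string, fn func(string)) {
	var wg sync.WaitGroup
	ch := make(chan string)

	for i := 0; i < concurrency; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for job := range ch {
				fn(job)
			}
		}()
	}

	for _, job := range jobs {
		ch <- job
	}
	close(ch)
	wg.Wait()
}

func main() {
	resources := []string{"vm", "host", "cluster", "datastore", "datacenter"}
	runPool(3, resources, func(r string) {
		fmt.Println("collecting metrics for resource type:", r)
	})
}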
System info:
Ubuntu 16.04 AWS "Small" configuration.
Telegraf 1.8
Steps to reproduce:
Very hard to reproduce. You have to get a network error at exactly the right time. This happened while I was deliberately trying to overload an undersized system.
See the logfile for information on how this happened.
Expected behavior:
Data should be collected without error.
Actual behavior:
Panic in workerpool.go due to an attempt to unlock an unlocked Mutex. See the attached logfile!
Additional info:
This bug is due to a typo in the code that handles errors from the goroutine querying for metadata. If you lose your network connection at the exact moment the metadata query is issued, you hit a section of code where a mutex.Lock was accidentally mistyped as an Unlock.
telegraf/plugins/inputs/vsphere/endpoint.go
Lines 660 to 666 in af0ef55
Logfile: https://gist.github.com/prydin/82976e39378434bc2cc97cbdddf806fc
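For illustration only, here is a minimal, self-contained Go sketch of that failure mode, using hypothetical names (errorCollector, addErrBuggy); it is not the actual endpoint.go code. Calling Unlock on a sync.Mutex that is not held makes the Go runtime abort with "sync: unlock of unlocked mutex", which is the panic reported above:

package main

import (
	"errors"
	"fmt"
	"sync"
)

type errorCollector struct {
	mux  sync.Mutex
	errs []error
}

// addErrBuggy mimics the typo: Unlock is called on a mutex that is not held,
// which crashes the process at runtime.
func (c *errorCollector) addErrBuggy(err error) {
	c.mux.Unlock() // BUG: should be c.mux.Lock()
	defer c.mux.Unlock()
	c.errs = append(c.errs, err)
}

// addErrFixed is the corrected version: take the lock before touching the
// shared slice, and release it when done.
func (c *errorCollector) addErrFixed(err error) {
	c.mux.Lock()
	defer c.mux.Unlock()
	c.errs = append(c.errs, err)
}

func main() {
	c := &errorCollector{}
	c.addErrFixed(errors.New("metadata query failed: network error"))
	fmt.Println("collected errors:", len(c.errs))
	// c.addErrBuggy(...) would crash the whole process:
	// fatal error: sync: unlock of unlocked mutex
}

The fix is correspondingly small: on the error path, acquire the lock (as in addErrFixed) instead of releasing a lock that was never taken.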
What a coincidence! Just yesterday I tried the new plugin and it gave me a panic when trying to collect data. I created a gist with the panic, but didn't have enough time to write up the issue!
Reviewing your log, it seems to be the same panic, but to be sure, here it is:
It seems it was happening every 300s, when the agent was trying to retrieve metrics from cluster resources:
...
2018-09-27T08:43:00Z D! [input.vsphere]: Latest: 2018-09-27 10:38:00.22902964 +0200 CEST m=+53.726632359, elapsed: 304.788621, resource: datacenter
2018-09-27T08:43:00Z D! [input.vsphere]: Start of sample period deemed to be 2018-09-27 10:38:00.22902964 +0200 CEST m=+53.726632359
2018-09-27T08:43:00Z D! [input.vsphere]: Collecting metrics for 1 objects of type datacenter for myvcenter.mydomain.com
2018-09-27T08:43:00Z D! [input.vsphere]: Query returned 20 metrics
2018-09-27T08:43:00Z D! [input.vsphere]: Latest: 2018-09-27 10:38:00.229495562 +0200 CEST m=+53.727098115, elapsed: 305.015600, resource: cluster
2018-09-27T08:43:00Z D! [input.vsphere]: Start of sample period deemed to be 2018-09-27 10:38:00.229495562 +0200 CEST m=+53.727098115
2018-09-27T08:43:00Z D! [input.vsphere]: Collecting metrics for 1 objects of type cluster for myvcenter.mydomain.com
2018-09-27T08:43:00Z D! [input.vsphere]: Query returned 0 metrics
[telegraf stopped]
I will try your fix and will give you some feedback!
@sbengo Hmmm... I thought it would take longer for someone to hit that bug, but OK. :) Do you get any other error just before the panic? This bug is in the error handling code, so some other issue must have triggered it.
Also, if you want, I can build you a "hotfix" binary with that bug fixed.