vSphere input plugin: Panic if network error when calling a specific vCenter method #4764

Closed
prydin opened this issue Sep 27, 2018 · 3 comments

@prydin
Contributor

prydin commented Sep 27, 2018

Relevant telegraf.conf:

# Read metrics from VMware vCenter
[[inputs.vsphere]]
  interval = "20s"
  ## List of vCenter URLs to be monitored. These three lines must be uncommented
  ## and edited for the plugin to work.
  vcenters = [ "https://example.com:8989/sdk" ]
  username = "user@corp.local"
  password = "secret"

  ## VMs
  ## Typical VM metrics (if omitted or empty, all metrics are collected)
  vm_metric_include = ["*"]
  
  # vm_instances = true ## true by default

  ## Hosts
  ## Typical host metrics (if omitted or empty, all metrics are collected)
  host_metric_include = [ "*" ]
  
  # host_metric_exclude = [] ## Nothing excluded by default
  # host_instances = true ## true by default

  ## Clusters
  cluster_metric_include = [ "*" ] ## if omitted or empty, all metrics are collected
  # cluster_metric_exclude = [] ## Nothing excluded by default
  # cluster_instances = true ## true by default

  ## Datastores
  datastore_metric_include = [ "*" ] ## if omitted or empty, all metrics are collected
  # datastore_metric_exclude = [] ## Nothing excluded by default
  # datastore_instances = false ## false by default for Datastores only

  ## Datacenters
  datacenter_metric_include = [ "*" ] ## if omitted or empty, all metrics are collected
  # datacenter_instances = false ## false by default for Datastores only

  ## Plugin Settings
  ## separator character to use for measurement and field names (default: "_")
  # separator = "_"

  ## number of objects to retrieve per query for realtime resources (vms and hosts)
  ## set to 64 for vCenter 5.5 and 6.0 (default: 256)
  # max_query_objects = 256

  ## number of metrics to retrieve per query for non-realtime resources (clusters and datastores)
  ## set to 64 for vCenter 5.5 and 6.0 (default: 256)
  # max_query_metrics = 256

  ## number of go routines to use for collection and discovery of objects and metrics
  collect_concurrency = 3
  discover_concurrency = 3

  ## whether or not to force discovery of new objects on initial gather call before collecting metrics
  ## when true for large environments this may cause errors for time elapsed while collecting metrics
  ## when false (default) the first collection cycle may result in no or limited metrics while objects are discovered
  # force_discover_on_init = false

  ## the interval before (re)discovering objects subject to metrics collection (default: 300s)
  # object_discovery_interval = "300s"

  ## timeout applies to any of the API requests made to vCenter
  # timeout = "20s"

  ## Optional SSL Config
  # ssl_ca = "/path/to/cafile"
  # ssl_cert = "/path/to/certfile"
  # ssl_key = "/path/to/keyfile"
  ## Use SSL but skip chain & host verification
  insecure_skip_verify = true

System info:

Ubuntu 16.04 AWS "Small" configuration.
Telegraf 1.8

Steps to reproduce:

Very hard to reproduce. You have to get a network error at exactly the right time. This happened when I was deliberately trying to overload an undersized system.

See the logfile for details on how this happened.

Expected behavior:

Data should be collected without error.

Actual behavior:

Panic in workerpool.go, due to unlocking of an unlocked Mutex. See attached logfile!

Additional info:

This bug is due to a typo in the code that handles errors from the goroutine querying for metadata. If you lose your network connection at the exact moment the metadata query is issued, you hit a section of code where a mux.Lock() call was accidentally typed as mux.Unlock(). The mutex is therefore unlocked while it is not held, which causes the panic.

var mux sync.Mutex
err := make(multiError, 0)
wp.Drain(ctx, func(ctx context.Context, in interface{}) bool {
    if in != nil {
        mux.Unlock() // BUG: this should be mux.Lock()
        defer mux.Unlock()
        err = append(err, in.(error))

Logfile: https://gist.github.com/prydin/82976e39378434bc2cc97cbdddf806fc

@russorat russorat added this to the 1.8.1 milestone Sep 27, 2018
@russorat russorat added the bug unexpected problem or unintended behavior label Sep 27, 2018
@sbengo

sbengo commented Sep 28, 2018

Hi again @prydin !

What a coincidence! Just yesterday I tried the new plugin and it panicked while collecting data. I created a gist with the panic, but didn't have enough time to write up the issue!

Reviewing your log, it seems to be the same panic, but to be sure, here it is:

It seems to happen every 300s, when the agent tries to retrieve metrics from cluster resources:

...
2018-09-27T08:43:00Z D! [input.vsphere]: Latest: 2018-09-27 10:38:00.22902964 +0200 CEST m=+53.726632359, elapsed: 304.788621, resource: datacenter
2018-09-27T08:43:00Z D! [input.vsphere]: Start of sample period deemed to be 2018-09-27 10:38:00.22902964 +0200 CEST m=+53.726632359
2018-09-27T08:43:00Z D! [input.vsphere]: Collecting metrics for 1 objects of type datacenter for myvcenter.mydomain.com
2018-09-27T08:43:00Z D! [input.vsphere]: Query returned 20 metrics
2018-09-27T08:43:00Z D! [input.vsphere]: Latest: 2018-09-27 10:38:00.229495562 +0200 CEST m=+53.727098115, elapsed: 305.015600, resource: cluster
2018-09-27T08:43:00Z D! [input.vsphere]: Start of sample period deemed to be 2018-09-27 10:38:00.229495562 +0200 CEST m=+53.727098115
2018-09-27T08:43:00Z D! [input.vsphere]: Collecting metrics for 1 objects of type cluster for myvcenter.mydomain.com
2018-09-27T08:43:00Z D! [input.vsphere]: Query returned 0 metrics

[telegraf stopped]

I will try your fix and will give you some feedback!

@prydin
Contributor Author

prydin commented Sep 28, 2018

@sbengo Hmmm... I thought it would take longer for someone to hit that bug, but OK. :) Do you get any other error just before the panic? This bug is in the error handling code, so some other issue must have triggered it.

Also, if you want, I can build you a "hotfix" binary with that bug fixed.

@prydin
Contributor Author

prydin commented Oct 2, 2018

Unofficial hotfix. Linux only. Let me know if you need anything else. (Also fixes #4783)
https://github.com/prydin/telegraf/releases/tag/PRYDIN-HOTFIX-4783
