-
Notifications
You must be signed in to change notification settings - Fork 5.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Scalability overhaul of [inputs.vsphere] #5113
Scalability overhaul of [inputs.vsphere] #5113
Conversation
* Don't stop collection on partial error * Compute average if query returned multiple samples for a metric
…roblems with some versions of vCenter
if p := recover(); p != nil { | ||
log.Printf("E! [input.vsphere] PANIC (recovered): %s", p) | ||
} | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We have a strict no-panic handling rule in the plugins, if this is required in order to work it will need updated to work without panicing.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I added this because much of the data collection happens in separate go-routines, which, if they panic, will crash the entire process if the panic isn't recovered. I can remove this, but there's a chance panics will crash the entire Telegraf process.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some more detail: Telegraf gracefully handles panics in the main thread, which is great. However, panics in goroutines will crash Telegraf for the reason that's illustrated in this example: https://play.golang.org/p/jQHzZm5zMQB
This is caused by the fact that panic handlers don't cross goroutine boundaries.
I get that you want to handle panics in a single place so you can log them correctly, but I also would like to make sure Telegraf doesn't crash due to temporary glitches. The reason I implemented this in the first place was a strange, very intermittent problem related to vCenter. It's supposed to return two arrays that are guaranteed to have the same length. At some point, the arrays came back with one being longer than the other, which caused a panic. Of course, I should have checked the length and not assumed that they would be equal, but this illustrates the problem with intermittent panics that are hard to predict and that should, in my opinion, be handled in a graceful and recoverable way.
OK. I'll get off the soap box now. If you want me to toss out the panic handler, I will do so.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Recovery in many cases isn't really possible, due to resource leaks and there will always be goroutines that you don't control in libraries. I also don't want to hide programming errors by handling them. I think Go best practice here is to only handle panics if you are expecting them to occur for some reason.
I'm also torn on if I should change the existing panic handling to log where to report bugs and then re-panic to crash Telegraf, or even just remove it altogether. For now I've left it as is, but I remember noticing that it doesn't actually work very well.
So yeah, could you please remove the panic handling?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK. What you're saying makes sense. My concern was consistency, since removing the panic handler means that some parts of the code are "safe" from panics and some aren't (i.e. code that runs in goroutines).
Anyway, I'll drop the handling now. Stand by!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done!
…g from GetClient(), fixed discovered object counting
Required for all PRs:
This PR resolves some issues found during scalability testing and testing under heavy load. The following changes have been made:
Addresses the following issues:
#5057 Vphere-Input for Telegraf stops working
#5041 [Inputs.vsphere] Error in plugin: ServerFaultCode: XML document element count exceeds configured maximum 500000
#5037 [inputs.vsphere]: Error in plugin: ServerFaultCode: This operation is restricted by the administrator
#5034 [input.vsphere] Option query for maxQueryMetrics failed. Using default