fatal error: systemstack called from unexpected goroutine #2849

Closed
scjoiner opened this issue May 24, 2017 · 14 comments
Labels: area/ping, bug (unexpected problem or unintended behavior), panic (issue that results in panics from Telegraf)

Comments

@scjoiner

System info:

Docker 1.12.6 on CoreOS 4.9.24
Telegraf 1.3

telegraf.log.txt

elctelegraf.conf.txt

Config and log attached.

We recently switched from a Synology to CoreOS as our Docker host. We are using the same plugins and devices, but Telegraf is crashing every few minutes with systemstack errors.

@danielnelson
Contributor

Can you test if temporarily removing the ping input helps?
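For example, commenting the section out in your telegraf.conf and restarting should be enough (the hosts below are placeholders, not taken from your attached config):

# [[inputs.ping]]
#   ## Temporarily disabled while isolating the crash
#   urls = ["host1.example.com", "host2.example.com"]
#   count = 1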

@danielnelson
Contributor

Also, are you using the same Telegraf version between OSs?

@scjoiner
Author

scjoiner commented May 26, 2017 via email

@scjoiner
Author

I am also getting this error on our second server (which still has ping active):

2017-05-25T16:16:10Z E! Error in plugin [inputs.influxdb]: took longer to collect than collection interval (10s)
2017-05-25T16:16:10Z E! Error in plugin [inputs.ping]: Fatal error processing ping output
---container restart---
2017-05-25T17:01:03Z E! Error in plugin [inputs.ping]: , Command timed out.
2017-05-25T17:01:03Z E! Error in plugin [inputs.ping]: Fatal error processing ping output

@danielnelson
Contributor

I tried to reproduce this on our CoreOS cluster today, which is running the same version of CoreOS as you, but was unsuccessful in recreating the crash.

Here are some follow up questions that could help us reproduce:

  • Are you running inside a container on CoreOS or on the host?
  • If a container, what docker image are you using? Is it an official one, a third party one, or a custom image?
  • If a custom docker image, what Telegraf package are you installing into it?
  • Can you describe the networking configuration of your container?

On the [inputs.influxdb] error: it means the input did not finish collecting before the start of the next interval. Make sure that neither InfluxDB nor Telegraf is overloaded, and that you can fetch the /debug/vars endpoint (http://100.122.149.210:8086/debug/vars) from the machine running Telegraf. I would also check the size of that response, as it can sometimes become rather large.
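For instance, curl can report the size and timing of that response directly (adjust the address if your InfluxDB instance listens elsewhere):

curl -s -o /dev/null -w "%{size_download} bytes in %{time_total}s\n" http://100.122.149.210:8086/debug/vars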

Also, I'm pretty curious, is this you? https://www.youtube.com/user/scjoiner

@danielnelson added the bug (unexpected problem or unintended behavior) and panic (issue that results in panics from Telegraf) labels May 27, 2017
@scjoiner
Author

scjoiner commented May 27, 2017 via email

@danielnelson
Contributor

Usually watching CPU usage is enough; often the CPU is low overall but spikes at each collection interval. However, even if the CPU is low, it is possible that Telegraf is blocked on I/O.

We have a few other tools for checking performance. You can enable the "internal" plugin, which outputs metrics about Telegraf's own operation, such as how long each plugin took to gather its metrics. Another option is to enable the profiling server with --pprof-addr :6060.
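As a sketch, enabling the internal plugin is a one-section change in telegraf.conf (the option name comes from the plugin's sample config):

[[inputs.internal]]
  ## Also collect Go runtime memory statistics
  collect_memstats = true

and the profiler is just a command-line flag (adjust the config path to your setup):

telegraf --config /etc/telegraf/telegraf.conf --pprof-addr :6060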

On the InfluxDB input: I recently looked at a case where the endpoint was returning 15 MB of data (#2807). Since this is the input that is now complaining, I would look at the response size when you fetch the /debug/vars endpoint with curl, and check the number of metrics output when you run telegraf with the --test flag.
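For the metric count, running just that input in test mode and counting lines is a reasonable approximation (the config path shown is the usual package default, adjust as needed; --test prints roughly one line per metric and does not write anything to the outputs):

telegraf --config /etc/telegraf/telegraf.conf --input-filter influxdb --test | wc -l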

72 servers is on the high side for the ping input, since it forks a ping process for each server every interval. We have some ideas that would make this much better (#2833), and from looking at the code today I see some aspects that are poorly optimized, such as looking up the location of the ping binary every interval.

I think your snmp usage is probably only a minor load; there have been some reports of snmp performance issues, but those involve scraping hundreds of hosts (#1665).

Another thing to check is how long it is taking to send to the outputs. If this takes too long it can block the input plugins.

You can always increase the interval for individual plugins as well.
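For example, the override is just an interval setting added inside the existing plugin sections of telegraf.conf (the values here are illustrative, not taken from your config):

[[inputs.ping]]
  ## Gather pings less often than the agent-level interval
  interval = "60s"

[[inputs.snmp]]
  interval = "120s"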

I'm mostly familiar with your old Minecraft videos, back when everyone wanted their own EATS road. I remember Etho and Docm used to always link to your stuff.

@scjoiner
Author

scjoiner commented May 30, 2017 via email

@danielnelson
Contributor

@scjoiner I recently updated master (the development branch) to use Go 1.9, and I wonder if this helps at all with your issue. Is this something you could test?
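If you want to give it a try, building master from source is roughly the following (assuming Go 1.9 and a standard GOPATH; the exact steps are in the repo's README):

go get -d github.com/influxdata/telegraf
cd $GOPATH/src/github.com/influxdata/telegraf
make

and then point the freshly built telegraf binary at your existing config.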

@scjoiner
Author

scjoiner commented Sep 22, 2017 via email

@Hipska
Contributor

Hipska commented Jan 4, 2021

Hi @scjoiner, is this issue still relevant? If not, or if there is no response, we will close it.

@Hipska self-assigned this Jan 4, 2021
@Hipska
Contributor

Hipska commented Jan 25, 2021

Hi @scjoiner, this is a kind reminder to clarify if this issue is still relevant.

@scjoiner
Author

scjoiner commented Jan 25, 2021 via email

@Hipska
Contributor

Hipska commented Jan 25, 2021

Thanks, feel free to reopen or create a new issue if this problem occurs again.

@Hipska closed this as completed Jan 25, 2021