fatal error: systemstack called from unexpected goroutine #2849
Comments
Can you test if temporarily removing the ping input helps?
Also, are you using the same Telegraf version between OSs?
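For reference, disabling the ping input is just a matter of commenting out (or removing) its block in telegraf.conf and restarting Telegraf; a minimal sketch, with placeholder targets standing in for the real list in the attached config:

```toml
# Temporarily disable the ping input by commenting out its whole block
# [[inputs.ping]]
#   urls = ["192.0.2.10", "192.0.2.11"]   # placeholder targets
#   count = 1                             # pings sent per interval
```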
Disabled ping at your suggestion and it’s been stable for approx. 24 hours now.
We have 2 CoreOS servers each running the same version, each experiencing the same issue.
I am also getting this error on our 2nd server (which still has ping active): 2017-05-25T16:16:10Z E! Error in plugin [inputs.influxdb]: took longer to collect than collection interval (10s)
I tried to reproduce this on our CoreOS cluster today, which is running the same version of CoreOS as you, but was unsuccessful in recreating the crash. Here are some follow-up questions that could help us reproduce:
Are you running inside a container on CoreOS or on the host?
If a container, what Docker image are you using? Is it an official one, a third-party one, or a custom image?
If a custom Docker image, what Telegraf package are you installing into it?
Can you describe the networking configuration of your container?
On the [inputs.influxdb] error, this means that the input did not finish processing by the start of the next interval. Make sure that both InfluxDB and Telegraf are not overloaded, and that you can fetch the output of the /debug/vars endpoint (http://100.122.149.210:8086/debug/vars) from the machine running Telegraf. I would also check the size of that response, as it can sometimes become rather large.
Also, I'm pretty curious, is this you? https://www.youtube.com/user/scjoiner
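For concreteness, one quick way to check both the size of that endpoint and the number of metrics Telegraf gathers is sketched below; the InfluxDB address comes from the comment above, while the config path is an assumption:

```sh
# Report how many bytes the /debug/vars endpoint returns
curl -sS -o /dev/null -w '%{size_download} bytes\n' http://100.122.149.210:8086/debug/vars

# One-shot gather: print the metrics Telegraf would emit and count them
telegraf --config /etc/telegraf/telegraf.conf --test | wc -l
```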
Daniel-
We are running the official telegraf:latest container on CoreOS. The server has a docker0 bridge and one outbound NIC that the ping is traversing to hit the ~72 servers we are pinging. Additionally, as you can see, there are a number of SNMP OIDs we GET. Do you think with the 10s interval we are gathering too much data? What is the best way to evaluate whether we are overtaxing telegraf?
That is indeed my youtube — how did you become familiar with it?
Thanks!
Stephen
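As an aside, a deployment like the one described above (the official image on a Docker host with the config bind-mounted in) typically looks something like this sketch; the host-side path and the use of the default bridge network are assumptions, not details from this thread:

```sh
# Run the official Telegraf image with a host-side config file
docker run -d --name telegraf \
  -v /etc/telegraf/telegraf.conf:/etc/telegraf/telegraf.conf:ro \
  telegraf:latest
```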
Usually watching the CPU usage is enough; often the CPU is low but spikes on each interval. However, even if the CPU is low, it is possible that Telegraf is blocked on IO.
We have a few other tools for checking performance. You can enable the "internal" plugin, which will output metrics about various internal processes, such as how long it took to gather the metrics for a plugin. Another option is to enable the profiling server (https://blog.golang.org/profiling-go-programs) with --pprof-addr :6060.
On the InfluxDB input, I recently looked at a case where the endpoint was returning 15MB of data (#2807). Since this is the input that is now complaining, I would take a look at the response size when you fetch the /debug/vars endpoint with curl, and check the number of metrics output when you run telegraf with the --test flag.
72 servers is on the high side for ping usage, since it forks a ping process for each server every interval. We have some ideas that would make this much better (#2833), and from looking at the code today I see some aspects that are poorly optimized, such as looking up the location of the ping executable every interval.
I think your usage of snmp is probably only a minor load; there have been some reports of snmp performance issues, but those involve scraping hundreds of hosts (#1665).
Another thing to check is how long it is taking to send to the outputs. If this takes too long it can block the input plugins. You can always increase the interval for individual plugins as well.
I'm mostly familiar with your old Minecraft videos, back when everyone wanted their own EATS road. I remember Etho and Docm used to always link to your stuff.
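A minimal sketch of the two config-side suggestions (enabling the internal plugin and raising a per-plugin interval); the 60s value and the target list are illustrative only, not recommendations from the thread:

```toml
# Report Telegraf's own runtime and per-plugin gather statistics
[[inputs.internal]]
  collect_memstats = true

# Override the global 10s interval for the expensive ping input only
[[inputs.ping]]
  interval = "60s"                      # illustrative value
  urls = ["192.0.2.10", "192.0.2.11"]   # placeholder targets
```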
Daniel — thanks for the tips, I’m going to look into this this week.
I’m impressed you remember that far back. Those were the days!
@scjoiner I recently updated master (the development branch) to use Go 1.9, and I wonder if this helps at all with your issue. Is this something you could test?
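For anyone following along, a rough sketch of testing a master build under a Go 1.9 toolchain; the GOPATH-style layout, dependency handling of that era, and config path are assumptions, and the project's own build docs would be authoritative:

```sh
# Fetch the source into GOPATH and build the current master branch
go get -d github.com/influxdata/telegraf
cd "$GOPATH/src/github.com/influxdata/telegraf"
go build ./cmd/telegraf                     # produces a ./telegraf binary

# Quick smoke test against the existing configuration
./telegraf --config /etc/telegraf/telegraf.conf --test
```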
Yeah I’d be happy to. I’ll spin up an instance in the AM and let you know.
Hi @scjoiner, is this issue still relevant? If not, or if there is no response, we will close it.
Hi @scjoiner, this is a kind reminder to clarify if this issue is still relevant.
Unsure - go ahead and close
Thanks, feel free to reopen or create a new issue if this problem occurs again.
System info (from the original issue report):
Docker 1.12.6 on CoreOS 4.9.24
Telegraf 1.3
telegraf.log.txt
elctelegraf.conf.txt
Config and log attached.
We recently switched from using a Synology to CoreOS as our Docker host. We are using the same plugins and devices, but Telegraf is crashing every few minutes with "fatal error: systemstack called from unexpected goroutine".