fatal error: systemstack called from unexpected goroutine #2849
Comments
Can you test if temporarily removing the ping input helps?
Also, are you using the same Telegraf version between OSs?
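For reference, disabling the ping input is just a matter of commenting out (or removing) its block in telegraf.conf and restarting Telegraf; a minimal sketch, with placeholder targets standing in for the real list in the attached config:

```toml
# Temporarily disable the ping input by commenting out its whole block
# [[inputs.ping]]
#   urls = ["192.0.2.10", "192.0.2.11"]   # placeholder targets
#   count = 1                             # pings sent per interval
```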
Disabled ping at your suggestion and it’s been stable for approx. 24 hours now.
We have 2 CoreOS servers each running the same version, each experiencing the same issue.
I am also getting this error on our 2nd server (which still has ping active): 2017-05-25T16:16:10Z E! Error in plugin [inputs.influxdb]: took longer to collect than collection interval (10s)
I tried to reproduce this on our CoreOS cluster today, which is running the same version of CoreOS as you, but was unsuccessful in recreating the crash. Here are some follow-up questions that could help us reproduce:
Are you running inside a container on CoreOS or on the host?
If a container, what Docker image are you using? Is it an official one, a third-party one, or a custom image?
If a custom Docker image, what Telegraf package are you installing into it?
Can you describe the networking configuration of your container?
On the [inputs.influxdb] error, this means that the input did not finish processing by the start of the next interval. Make sure that both InfluxDB and Telegraf are not overloaded, and that you can fetch the output of the /debug/vars endpoint (http://100.122.149.210:8086/debug/vars) from the machine running Telegraf. I would also check the size of that response, as it can sometimes become rather large.
Also, I'm pretty curious, is this you? https://www.youtube.com/user/scjoiner
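For concreteness, one quick way to check both the size of that endpoint and the number of metrics Telegraf gathers is sketched below; the InfluxDB address comes from the comment above, while the config path is an assumption:

```sh
# Report how many bytes the /debug/vars endpoint returns
curl -sS -o /dev/null -w '%{size_download} bytes\n' http://100.122.149.210:8086/debug/vars

# One-shot gather: print the metrics Telegraf would emit and count them
telegraf --config /etc/telegraf/telegraf.conf --test | wc -l
```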
Daniel-
We are running the official telegraf:latest container on CoreOS. The server has a docker0 bridge and one outbound NIC that the ping is traversing to hit the ~72 servers we are pinging. Additionally, as you can see, there are a number of SNMP OIDs we GET. Do you think with the 10s interval we are gathering too much data? What is the best way to evaluate whether we are overtaxing telegraf?
That is indeed my youtube — how did you become familiar with it?
Thanks!
Stephen
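As an aside, a deployment like the one described above (the official image on a Docker host with the config bind-mounted in) typically looks something like this sketch; the host-side path and the use of the default bridge network are assumptions, not details from this thread:

```sh
# Run the official Telegraf image with a host-side config file
docker run -d --name telegraf \
  -v /etc/telegraf/telegraf.conf:/etc/telegraf/telegraf.conf:ro \
  telegraf:latest
```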
Usually watching the CPU usage is enough; often the CPU is low but spikes on each interval. However, even if the CPU is low, it is possible that Telegraf is blocked on IO.
We have a few other tools for checking performance. You can enable the "internal" plugin, which will output metrics about various internal processes, such as how long it took to gather the metrics for a plugin. Another option is to enable the profiling server (https://blog.golang.org/profiling-go-programs) with --pprof-addr :6060.
On the InfluxDB input, I recently looked at a case where the endpoint was returning 15MB of data (#2807). Since this is the input that is now complaining, I would take a look at the response size when you fetch the /debug/vars endpoint with curl, and check the number of metrics output when you run telegraf with the --test flag.
72 servers is on the high side for ping usage, since it forks a ping process for each server every interval. We have some ideas that would make this much better (#2833), and from looking at the code today I see some aspects that are poorly optimized, such as looking up the location of the ping executable every interval.
I think your usage of snmp is probably only a minor load; there have been some reports of snmp performance issues, but those involve scraping hundreds of hosts (#1665).
Another thing to check is how long it is taking to send to the outputs. If this takes too long it can block the input plugins. You can always increase the interval for individual plugins as well.
I'm mostly familiar with your old Minecraft videos, back when everyone wanted their own EATS road. I remember Etho and Docm used to always link to your stuff.
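A minimal sketch of the two config-side suggestions (enabling the internal plugin and raising a per-plugin interval); the 60s value and the target list are illustrative only, not recommendations from the thread:

```toml
# Report Telegraf's own runtime and per-plugin gather statistics
[[inputs.internal]]
  collect_memstats = true

# Override the global 10s interval for the expensive ping input only
[[inputs.ping]]
  interval = "60s"                      # illustrative value
  urls = ["192.0.2.10", "192.0.2.11"]   # placeholder targets
```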
Daniel — thanks for the tips, I’m going to look into this this week.
I’m impressed you remember that far back. Those were the days!
@scjoiner I recently updated master (the development branch) to use Go 1.9, and I wonder if this helps at all with your issue. Is this something you could test?
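For anyone following along, a rough sketch of testing a master build under a Go 1.9 toolchain; the GOPATH-style layout, dependency handling of that era, and config path are assumptions, and the project's own build docs would be authoritative:

```sh
# Fetch the source into GOPATH and build the current master branch
go get -d github.com/influxdata/telegraf
cd "$GOPATH/src/github.com/influxdata/telegraf"
go build ./cmd/telegraf                     # produces a ./telegraf binary

# Quick smoke test against the existing configuration
./telegraf --config /etc/telegraf/telegraf.conf --test
```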
Yeah I’d be happy to. I’ll spin up an instance in the AM and let you know.
Hi @scjoiner, is this issue still relevant? If not, or if there is no response, we will close it.
Hi @scjoiner, this is a kind reminder to clarify if this issue is still relevant.
Unsure - go ahead and close
Thanks, feel free to reopen or create a new issue if this problem occurs again.
System info (from the original issue report):
Docker 1.12.6 on CoreOS 4.9.24
Telegraf 1.3
telegraf.log.txt
elctelegraf.conf.txt
Config and log attached.
We recently switched from using a Synology to CoreOS as our Docker host. We are using the same plugins and devices, but Telegraf is crashing every few minutes with "fatal error: systemstack called from unexpected goroutine".