telegraf just stops working #1230
Comments
Oh forgot, this happens with 0.12.1 and 0.13.0 |
can you provide your configuration? |
recently in 0.13 some safeguards were added to prevent lockups when running exec commands, but these have not all been patched up yet, as some of Telegraf's dependencies could still be running commands without timeouts. |
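Not Telegraf's actual code, but a minimal sketch of the kind of safeguard being described here: bounding an exec-style command with a timeout so a stuck child process cannot wedge collection forever. The `runWithTimeout` helper and the `uptime` command are illustrative assumptions.

```go
package main

import (
	"context"
	"fmt"
	"os/exec"
	"time"
)

// runWithTimeout runs an external command but gives up (and kills the child)
// once the timeout expires, so a hung command cannot block collection forever.
func runWithTimeout(name string, args []string, timeout time.Duration) ([]byte, error) {
	ctx, cancel := context.WithTimeout(context.Background(), timeout)
	defer cancel()

	// CommandContext kills the process when the context deadline is exceeded.
	out, err := exec.CommandContext(ctx, name, args...).Output()
	if ctx.Err() == context.DeadlineExceeded {
		return nil, fmt.Errorf("command %q timed out after %s", name, timeout)
	}
	return out, err
}

func main() {
	out, err := runWithTimeout("uptime", nil, 5*time.Second)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("%s", out)
}
```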
This is basically the default config with:
https://gist.github.com/RainerW/94c11ac7f4ce42ea17a12ee7e40257eb |
On the Server with 0.13.0
https://gist.github.com/RainerW/586508bc90bca51e9d8cab0dfe98d97a |
Hint: the Apache status check is not active on all instances. On one server I had the default 0.12.1 config (only nfs and glusterfs excludes) https://gist.github.com/RainerW/d3a5f38c1b69d13f69c75e1b1778ccee with the same effect |
If you could SIGQUIT (kill -3) the hung process and post the stack trace it writes to the logs, that would help. |
Not totally sure how to do that with a service. "kill -3 6413" just quits the process, but the log does not contain a stack trace (at least on a restarted instance I tried, which was not hung; I would like to see a stack trace first before using it on a hung instance): |
That's odd, when I send a SIGQUIT I get a stack trace in the logs... I'm not sure then what's going on with yours. Is the disk full or locked to where it can't even write? (Or maybe a permissions problem?) |
since it's happening with the default config, it's likely related to #1215. It could even be the exact same problem. |
You could get hung child processes while the telegraf process is hung with something like: |
Currently the input interface does not have any methods for killing a running Gather call, so there is nothing we can do but log a "FATAL ERROR" and move on. This will at least give some visibility into the plugin that is acting up.
Open questions:
- should the telegraf process die and exit when this happens? This might be a better idea than leaving around the dead process.
- should the input interface have a Kill() method? I suspect not, since most inputs wouldn't have a way of killing themselves anyways.
closes #1230
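A rough, hypothetical sketch of the "log an error and move on" behaviour that commit message describes: Gather runs in its own goroutine and, if it has not returned by a deadline, the agent reports which plugin is hung instead of blocking. The `Input` interface, `gatherWithWatchdog`, and `slowInput` names are simplified assumptions, not Telegraf's real internals.

```go
package main

import (
	"log"
	"time"
)

// Input is a stand-in for the real plugin interface; only Gather matters here.
type Input interface {
	Gather() error
}

// gatherWithWatchdog runs Gather in its own goroutine. If it does not return
// within the timeout, we cannot kill it; we can only log which plugin hung
// and move on, which is the visibility the commit message mentions.
func gatherWithWatchdog(name string, in Input, timeout time.Duration) {
	done := make(chan error, 1) // buffered so a late Gather can still finish
	go func() { done <- in.Gather() }()

	select {
	case err := <-done:
		if err != nil {
			log.Printf("ERROR in input [%s]: %v", name, err)
		}
	case <-time.After(timeout):
		log.Printf("ERROR: input [%s] took longer than %s and was left behind", name, timeout)
	}
}

// slowInput simulates a plugin whose Gather call hangs past the deadline.
type slowInput struct{}

func (slowInput) Gather() error {
	time.Sleep(2 * time.Second)
	return nil
}

func main() {
	gatherWithWatchdog("slow_demo", slowInput{}, 500*time.Millisecond)
}
```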
been kicking this around in the back of my head for a bit, and this seems like as good a reason as any to get it implemented: #1235 |
lol, got a stack trace after fixing a crashed nfs server/mount ... which simultaneously fixed the remaining, not-yet-restarted servers. So now all telegraf instances are running again. Sadly, the one server which had the problem but did not have that nfs mount had been restarted earlier, so I cannot provide a stack trace of a hung instance. But it seems at least to be a problem while accessing the disk statistics, even though I had been excluding "nfs" (or because of it?) |
Now one telegraf instance stopped which did not have that nfs mount: |
I'm seeing this problem as well. I am running kubernetes 1.2 and have telegraf running as a daemonset (0.12.2-1). Last night 2 of 4 instances just stopped reporting. The logs are empty. Restarting fixes the problem. I kept one hung process hanging around to debug. |
My restarted pod died and threw the following: |
@jvalencia that problem is not related to this, that was fixed in Telegraf 0.13. |
I updated, will see if I get hanging behaviour again. |
Changing the internal behavior around running plugins. Each plugin will now have its own goroutine with its own ticker. This means that a hung plugin will not block any other plugins. When a plugin is hung, we will log an error message every interval, letting users know which plugin is hung. Currently the input interface does not have any methods for killing a running Gather call, so there is nothing we can do but log an "ERROR" and move on. This will give some visibility into the plugin that is acting up. closes #1230
Changing the internal behavior around running plugins. Each plugin will now have its own goroutine with its own ticker. This means that a hung plugin will not block any other plugins. When a plugin is hung, we will log an error message every interval, letting users know which plugin is hung. Currently the input interface does not have any methods for killing a running Gather call, so there is nothing we can do but log an "ERROR" and move on. This will give some visibility into the plugin that is acting up. closes #1230 fixes #479
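The per-plugin ticker model described in those commit messages could look roughly like the following: each input gets its own goroutine and ticker, and whenever the previous Gather call is still in flight at the next tick, an error naming the hung plugin is logged. Again, `Input`, `runInput`, and `hangingInput` are illustrative assumptions, not the actual Telegraf agent code.

```go
package main

import (
	"log"
	"sync/atomic"
	"time"
)

// Input is a stand-in for the real plugin interface.
type Input interface {
	Gather() error
}

// runInput gives one plugin its own ticker and goroutine. A hung Gather call
// only affects this plugin: every interval while it is still running, we log
// an error naming it, while every other plugin keeps collecting normally.
func runInput(name string, in Input, interval time.Duration, stop <-chan struct{}) {
	var inFlight int32 // 1 while a Gather call has not yet returned
	ticker := time.NewTicker(interval)
	defer ticker.Stop()

	for {
		select {
		case <-stop:
			return
		case <-ticker.C:
			if !atomic.CompareAndSwapInt32(&inFlight, 0, 1) {
				log.Printf("ERROR: input [%s] did not complete within its interval", name)
				continue
			}
			go func() {
				defer atomic.StoreInt32(&inFlight, 0)
				if err := in.Gather(); err != nil {
					log.Printf("ERROR in input [%s]: %v", name, err)
				}
			}()
		}
	}
}

// hangingInput simulates a plugin whose Gather call never returns.
type hangingInput struct{}

func (hangingInput) Gather() error { select {} }

func main() {
	stop := make(chan struct{})
	go runInput("hang_demo", hangingInput{}, time.Second, stop)
	time.Sleep(3500 * time.Millisecond) // long enough to see the repeated error
	close(stop)
}
```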
I'm experiencing this too, version 1.0.1. EDIT: Disregard, it appears to be a client networking issue. |
I'm testing telegraf on different systems, and it seems to stop working after some time.
I guess there was some kind of network hiccup, because it stops working on multiple servers at around the same time. But some servers just continued to work, so I'm sure neither influx nor grafana had a problem.
The systems in question are mostly Ubuntu 11, 14, or 16, but one of the Ubuntu 16 servers continued to work.
In all cases the logfiles just stopped containing anything, but the process continues to run. My guess is that there is no safeguard around the metric collection, so when it starts hanging for whatever reason, telegraf stops working?
Last log entries are:
That is the point in time where telegraf stopped reporting.
It seems to be still running:
After a service restart everything is working fine again. But this happened before, so I guess it will also happen again.
I know this ticket is very broad, but I have nothing to pin it down to; there should at least be safeguards in place to prevent telegraf from stopping completely.