Input execd - process are not closed when Telegraf (service) stops #7876
Comments
Telegraf closes the stdin of the process when it's done writing metrics out and wants the process to shut down. |
I can confirm the issue. Something different happens when Telegraf is stopped with Ctrl+C than when the service is stopped. |
OK. I'm aware of some issues with the Ctrl-C interrupt being passed on to the child process (which should ideally ignore it and wait for stdin to close), but I'd like to set Ctrl-C aside and focus on the process not shutting down correctly. I'll try to replicate the problem on Windows and get back to you. |
I don't typically work in Windows. What do I need to do to "run Telegraf as a service (on Windows)"? As far as I know, the Telegraf binary isn't equipped by itself to run as a Windows service, so are you using something to bridge that gap? So far my theory is that something prevents closing the process's input from triggering the process to shut down, which would cause the shutdown to hang waiting for the child processes to exit. This sounds like it's related to being run as a service; is stdin disconnected? Not sure why closing it would have no effect. |
Telegraf itself is well equipped to run as a Windows service. See the documentation. I guess the only way to discover what is happening is to debug it on Windows once it has been started as a service. I don't know Go well enough to attach to an existing process and debug it, I'm sorry. |
@spaghettidba 🏅 Thanks for the docs link. :D ❤️ I should be able to reproduce this. Will test and update. |
I am running into this same issue on 1.19.1, is there a good workaround? |
As a workaround, I run a PowerShell command (using the exec plugin) every n hours
to end all the processes that I started with execd.
This way I ensure there are no zombie processes around.
When an execd process is killed, it restarts within a minute by default.
This is of course not optimal, but in my case it was a suitable solution.
|
Thanks, I just wrote one myself that I think is pretty safe. You'd need to update it to match the command you're running, but,
Sorry for the one-liner, but basically this script says: give me all of the processes named powershell.exe whose parent process ID does not match any currently running process matching "telegraf*". It then filters by sessionID=0 (don't use this if you're not running as a service, but I'm assuming you are if you're looking at this issue), and finally by the actual command-line value, which should resemble the exec command in the telegraf config. Use at your own risk! |
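For readers who want the filtering logic spelled out, here is a rough Go sketch of the same three filters. It is not the PowerShell one-liner from the comment: the `procInfo` record and the `findOrphans` function are hypothetical, and the process list is hard-coded where real code would query the operating system.

```go
package main

import (
	"fmt"
	"strings"
)

// procInfo is a hypothetical record; real code would fill it from the
// operating system's process list.
type procInfo struct {
	PID, ParentPID, SessionID int
	Name, CommandLine         string
}

// findOrphans returns processes named childName whose parent is not a running
// telegraf process, that run in session 0 (services), and whose command line
// contains cmdFragment.
func findOrphans(procs []procInfo, childName, cmdFragment string) []procInfo {
	// Collect the PIDs of every running process whose name starts with "telegraf".
	telegrafPIDs := map[int]bool{}
	for _, p := range procs {
		if strings.HasPrefix(strings.ToLower(p.Name), "telegraf") {
			telegrafPIDs[p.PID] = true
		}
	}

	var orphans []procInfo
	for _, p := range procs {
		switch {
		case !strings.EqualFold(p.Name, childName):
			continue // not the kind of child we are looking for
		case telegrafPIDs[p.ParentPID]:
			continue // parent telegraf is still alive, so not an orphan
		case p.SessionID != 0:
			continue // only consider processes started by a service
		case !strings.Contains(p.CommandLine, cmdFragment):
			continue // command line does not match the configured command
		}
		orphans = append(orphans, p)
	}
	return orphans
}

func main() {
	// Made-up sample data for illustration only.
	procs := []procInfo{
		{PID: 100, ParentPID: 1, SessionID: 0, Name: "telegraf.exe", CommandLine: "telegraf.exe --service"},
		{PID: 200, ParentPID: 100, SessionID: 0, Name: "powershell.exe", CommandLine: "powershell.exe -File collect.ps1"},
		{PID: 300, ParentPID: 999, SessionID: 0, Name: "powershell.exe", CommandLine: "powershell.exe -File collect.ps1"},
		{PID: 400, ParentPID: 999, SessionID: 1, Name: "powershell.exe", CommandLine: "powershell.exe -File collect.ps1"},
	}
	for _, p := range findOrphans(procs, "powershell.exe", "collect.ps1") {
		fmt.Println("orphan PID:", p.PID)
	}
}
```

The command-line filter is the important safety net: without it, any unrelated PowerShell process running in session 0 would also be selected.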
I think I've found the issue here. When Telegraf runs in interactive/console mode, the main loop (reloadLoop) runs on the main thread/goroutine, but when it runs as a service, reloadLoop runs in a child thread/goroutine.
In interactive/console mode, when you interrupt the process with Ctrl-C, reloadLoop detects this and all plugins get to shut down cleanly on the main thread/goroutine. However, when Telegraf runs as a Windows service, the main thread/goroutine finishes without waiting for the reloadLoop thread/goroutine to complete, which causes all child goroutines to stop running. So the plugins are not shut down cleanly when the Telegraf service is stopped. The logs show that the plugin shutdown sequence begins but never completes; it just ends abruptly.
As such, this issue affects all plugins and is not specific to execd. For example, judging by your example logs, it's likely that output plugins are also not being flushed and we are losing metrics.
I have a fork of Telegraf where I've tested a "fix" for this (additional channel-based signaling between the two goroutines), in which the main goroutine waits for the reloadLoop goroutine to finish before it can exit. However, I don't quite understand the reason for parts of the current Telegraf concurrency implementation for the Windows service, so I suspect there may be a more "correct" solution. I'll submit a PR and get some feedback. |
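The channel-based signaling described in the comment above can be sketched roughly as follows. This is not the code from the fork or from Telegraf: `runService`, the `stopRequests` channel, and the `loop` callback are hypothetical names, with `loop` standing in for reloadLoop.

```go
package main

import "fmt"

// runService is a hypothetical stand-in for a service entry point. It runs
// loop in its own goroutine, forwards a stop request to it, and then blocks
// until the loop has finished its cleanup before returning.
func runService(stopRequests <-chan struct{}, loop func(stop <-chan struct{})) {
	stop := make(chan struct{})
	done := make(chan struct{})

	go func() {
		defer close(done) // signals that the loop and its cleanup are finished
		loop(stop)
	}()

	<-stopRequests // the service manager asked us to stop
	close(stop)    // tell the loop to begin shutting down
	<-done         // without this wait, returning here could cut cleanup short
}

func main() {
	stopRequests := make(chan struct{})
	cleaned := false

	loop := func(stop <-chan struct{}) {
		<-stop
		cleaned = true // stand-in for stopping plugins and flushing outputs
	}

	close(stopRequests) // simulate an immediate stop request
	runService(stopRequests, loop)
	fmt.Println("cleanup finished before exit:", cleaned)
}
```

The receive on `done` is the part that matters here: when the main goroutine returns, the Go runtime exits the process without waiting for other goroutines, so any cleanup still in progress is abandoned unless the main goroutine waits for it explicitly.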
Relevant telegraf.conf:
System info:
Steps to reproduce:
You will see that the started processes are still up and running.
The same does not happen when running Telegraf as a console command (cmd): if the cmd command is stopped using Ctrl+C, all the started processes are also closed.
Expected behavior:
If the Telegraf service is stopped (on Windows), the processes started by execd are closed.
Actual behavior:
When the Telegraf service is stopped (on Windows), the processes started by execd are not closed.
Additional info:
Here are the logs generated by telegraf
Running as a service
Running as a cmd command
As you can see in the logs, the last message in the "service log" is
D! [agent] Stopping service inputs
, whereas the "cmd command log" contains additional messages confirming that the child processes have been closed.