Input execd - process are not closed when Telegraf (service) stops #7876

Trovalo · 2020-07-22T14:21:21Z

Relevant telegraf.conf:

[[inputs.execd]]

   command = [
       'powershell',
       '-File',
       'C:\Monitoring\telegraf\Configuration\input_scripts\xesmarttarget\xe_qmonitor.ps1',
       '-ServerName',
       'SQLCSRV04\SQL2017'
   ]

  signal = "none"
  restart_delay = "1m"
  data_format = "influx"
  
[[inputs.execd]]

  command = [
      'powershell',
      '-File',
      'C:\Monitoring\telegraf\Configuration\input_scripts\xesmarttarget\xe_qmonitor.ps1',
      '-ServerName',
      'SQLCSRV04\ITEC'
  ]

  signal = "none"
  restart_delay = "1m"
  data_format = "influx"

System info:

Telegraf Nightly (last build)
Windows Server 2016

Steps to reproduce:

run Telegraf as a service (on windows)
use the execd plugin to start one or more processes
stop the telegraf service

You will see that the started processes are still up and running.

The same does not happen when running telegraf as a console command (cmd), if the cmd command is stopped using ctrl+c all the started processes are also closed

Expected behavior:

If the telegraf service is stopped (on windows), the processes started by execd are closed

Actual behavior:

when the telegraf service is stopped (on windows) the execd started processes are not closed

Additional info:

Here are the logs generated by telegraf
Running as a service

2020-07-22T14:10:02Z I! Loaded inputs: exec execd execd internal sqlserver sqlserver
2020-07-22T14:10:02Z I! Loaded aggregators: 
2020-07-22T14:10:02Z I! Loaded processors: strings converter strings
2020-07-22T14:10:02Z I! Loaded outputs: influxdb
2020-07-22T14:10:02Z I! Tags enabled: company=quantumdatis_demo host=QDSRVMONITOR
2020-07-22T14:10:02Z I! [agent] Config: Interval:15s, Quiet:false, Hostname:"QDSRVMONITOR", Flush Interval:10s
2020-07-22T14:10:02Z D! [agent] Initializing plugins
2020-07-22T14:10:02Z D! [agent] Connecting outputs
2020-07-22T14:10:02Z D! [agent] Attempting connection to [outputs.influxdb]
2020-07-22T14:10:02Z D! [agent] Successfully connected to outputs.influxdb
2020-07-22T14:10:02Z D! [agent] Starting service inputs
2020-07-22T14:10:02Z I! [inputs.execd] Starting process: C:\Program Files\XESmartTarget\XESmartTarget.exe [--File C:\Monitoring\telegraf\Configuration\input_scripts\xesmarttarget\telegraf.json --NoLogo --Quiet --GlobalVariables ServerName=SQLCSRV04\SQL2017]
2020-07-22T14:10:02Z I! [inputs.execd] Starting process: powershell [-File C:\Monitoring\telegraf\Configuration\input_scripts\xesmarttarget\xe_qmonitor.ps1 -ServerName SQLCSRV04\ITEC]
2020-07-22T14:10:12Z D! [outputs.influxdb] Buffer fullness: 0 / 15000 metrics
2020-07-22T14:10:22Z D! [outputs.influxdb] Wrote batch of 1230 metrics in 313.4507ms
2020-07-22T14:10:22Z D! [outputs.influxdb] Buffer fullness: 0 / 15000 metrics
2020-07-22T14:10:30Z D! [agent] Stopping service inputs

Running as a cmd command

2020-07-22T14:11:36Z I! [inputs.execd] Starting process: powershell [-File C:\Monitoring\telegraf\Configuration\input_scripts\xesmarttarget\xe_qmonitor.ps1 -ServerName SQLCSRV04\ITEC]
2020-07-22T14:11:47Z D! [outputs.influxdb] Wrote batch of 1229 metrics in 292.0335ms
2020-07-22T14:11:47Z D! [outputs.influxdb] Buffer fullness: 0 / 15000 metrics
2020-07-22T14:11:54Z D! [agent] Stopping service inputs
2020-07-22T14:11:57Z D! [outputs.influxdb] Buffer fullness: 0 / 15000 metrics
2020-07-22T14:11:59Z I! [inputs.execd] Process C:\Program Files\XESmartTarget\XESmartTarget.exe shut down
2020-07-22T14:12:04Z E! [inputs.execd] Error in plugin: read |0: file already closed
2020-07-22T14:12:04Z E! [inputs.execd] Error in plugin: error reading stderr: read |0: file already closed
2020-07-22T14:12:04Z I! [inputs.execd] Process C:\Windows\System32\WindowsPowerShell\v1.0\powershell.exe shut down
2020-07-22T14:12:04Z D! [agent] Input channel closed
2020-07-22T14:12:04Z D! [agent] Processor channel closed
2020-07-22T14:12:04Z D! [agent] Processor channel closed
2020-07-22T14:12:04Z D! [agent] Processor channel closed
2020-07-22T14:12:04Z I! [agent] Hang on, flushing any cached metrics before shutdown
2020-07-22T14:12:04Z D! [outputs.influxdb] Buffer fullness: 0 / 15000 metrics
2020-07-22T14:12:04Z D! [agent] Stopped Successfully

As you can see in the logs, the "service log" last message is D! [agent] Stopping service inputs, meanwhile in the "cmd command log" there are additional messages confirming that the child processes have been closed

The text was updated successfully, but these errors were encountered:

ssoroka · 2020-07-23T22:01:22Z

Telegraf closes the stdin of the process when it's done writing metrics out and wants the process to shut down.
When stdin closes, the process should finish processing metrics, write them to stdout, then shut itself down.
If this isn't happening, it sounds like powershell isn't managing the stdin/stdout pipes properly?

spaghettidba · 2020-07-28T14:07:45Z

I can confirm the issue. There is something different happening when telegraf is stopped with CTRL+C or with stop service.
@ssoroka I don't think it has to do with the I/O streams because the processes are shut down correctly when telegraf is stopped with CTRL+C. The child processes continue running when the service is stopped instead.

ssoroka · 2020-08-06T18:25:01Z

k. I'm aware of some issues with the ctrl-c interrupt being passed on to the child process (which it should ideally ignore and wait for stdin to close), but I'd like to ignore the aspect of ctrl-c and focus on the process not shutting down correctly.

I'll try to replicate the problem on windows and get back to you.

ssoroka · 2020-08-18T15:26:12Z

I don't typically work in Windows. What do I need to do to "run Telegraf as a service (on windows)"? as far as i know the Telegraf binary isn't equipped by itself to run as a Windows service, so are you using something to bridge that gap?

So far my theory is that something is happening that causes closing the process's input to not trigger the process to shut down, which would cause the shutdown process to hang waiting for the child processes to exit. This sounds like it's related to being run as a service; is stdin disconnected? Not sure why it would have no effect.

spaghettidba · 2020-08-18T16:08:59Z

Telegraf itself is well equipped to run as a Windows service. See the documentation

I guess that the only way to discover what is happening is to debug it in windows once started as a service. I don't have enough knowledge in GO to attach to an existing process and debug it, I'm sorry.

ssoroka · 2020-08-19T21:43:47Z

@spaghettidba 🏅 Thanks for the docs link. :D ❤️

I should be able to reproduce this. Will test and update.

sumptersmartt · 2021-08-10T17:34:59Z

I am running into this same issue on 1.19.1, is there a good workaround?

Trovalo · 2021-08-10T17:46:56Z

As a workaround, I run a powershell command (using the exec plugin) to end all the processes that I started with the execd every n hours. This way I ensure there are no zombie processes around. When an execd process is killed it will restart in a minute by default. This of course is not optimal, but in my case was a suitable solution

…

On Tue, 10 Aug 2021, 19:35 Sumpter Smartt, ***@***.***> wrote: I am running into this same issue on 1.19.1, is there a good workaround? — You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub <#7876 (comment)>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/AC5J5KVT4LGZ35YQPKZ6HNTT4FPM5ANCNFSM4PEYKENQ> . Triage notifications on the go with GitHub Mobile for iOS <https://apps.apple.com/app/apple-store/id1477376905?ct=notification-email&mt=8&pt=524675> or Android <https://play.google.com/store/apps/details?id=com.github.android&utm_campaign=notification-email> .

sumptersmartt · 2021-08-10T18:13:31Z

Thanks, I just wrote one myself that I think is pretty safe. You'd need to update it to match the command you're running, but,

Get-WmiObject win32_process -Filter "name = 'powershell.exe'" | ? {$($_.ParentProcessID -notin $((Get-Process -Name telegraf).id)) -and $($_.SessionID -eq 0) -and $($_.commandline -like "C:\\scripts\\")} | stop-process

Sorry for the one liner, but basically this script says "give me all of the processes named powershell.exe that don't have a parent process ID that matches any currently running a process matching "telegraf*". It then filters by sessionID=0 (don't use this if you're not running as a service, but I'm assuming you are if you're looking at this issue), and finally by the actual 'command line' value that should resemble the actual exec command in the telegraf config.

Use at your own risk!

jhychan · 2021-08-11T13:56:29Z

I think I've found the issue here - when Telegraf runs interactively/console mode the main loop (reloadLoop) runs on the main thread/goroutine, but when running as a service the main telegraf loop (reloadLoop) runs in a child thread/goroutine.

In interactive/console mode when you interrupt the process with Control-C the reloadLoop detects this and all plugins get to cleanly shutdown on the main thread/goroutine.

However, when telegraf runs as a Windows service the main thread/goroutine is finishing without waiting for the reloadLoop thread/goroutine to complete, causing all child goroutines to stop running. So the plugins are not being cleanly shut down when stopping the telegraf service. The logs show that plugin shutdown sequence begins but it never completes, it just ends abruptly. As such, this issue affects all plugins and is not specific to execd. For example, judging by your example logs, it's likely that output plugins are also not being flushed and we are losing metrics.

I have a fork of telegraf where I've tested a "fix" for this (additional channel-based signaling between the 2 goroutines), where the main goroutine waits for the reloadLoop goroutine to finish before it can exit. However I don't quite understand the reason for parts of the current telegraf concurrency implementation for the Windows service, so I suspect there may be a more "correct" solution. I'll submit a PR and get some feedback.

@ssoroka

ssoroka self-assigned this Aug 6, 2020

ssoroka added the platform/windows label Jan 7, 2021

helenosheaa added the area/execd Issues related to execd or plugins that would be better suited to be used through execd label Jan 29, 2021

jhychan mentioned this issue Aug 11, 2021

fix: windows service - graceful shutdown of telegraf #9616

Merged

3 tasks

reimda closed this as completed in #9616 Dec 2, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Input execd - process are not closed when Telegraf (service) stops #7876

Input execd - process are not closed when Telegraf (service) stops #7876

Trovalo commented Jul 22, 2020

ssoroka commented Jul 23, 2020

spaghettidba commented Jul 28, 2020

ssoroka commented Aug 6, 2020

ssoroka commented Aug 18, 2020 •

edited

Loading

spaghettidba commented Aug 18, 2020

ssoroka commented Aug 19, 2020

sumptersmartt commented Aug 10, 2021

Trovalo commented Aug 10, 2021 via email

sumptersmartt commented Aug 10, 2021

jhychan commented Aug 11, 2021 •

edited

Loading

Input execd - process are not closed when Telegraf (service) stops #7876

Input execd - process are not closed when Telegraf (service) stops #7876

Comments

Trovalo commented Jul 22, 2020

Relevant telegraf.conf:

System info:

Steps to reproduce:

Expected behavior:

Actual behavior:

Additional info:

ssoroka commented Jul 23, 2020

spaghettidba commented Jul 28, 2020

ssoroka commented Aug 6, 2020

ssoroka commented Aug 18, 2020 • edited Loading

spaghettidba commented Aug 18, 2020

ssoroka commented Aug 19, 2020

sumptersmartt commented Aug 10, 2021

Trovalo commented Aug 10, 2021 via email

sumptersmartt commented Aug 10, 2021

jhychan commented Aug 11, 2021 • edited Loading

ssoroka commented Aug 18, 2020 •

edited

Loading

jhychan commented Aug 11, 2021 •

edited

Loading