Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Input execd - process are not closed when Telegraf (service) stops #7876

Closed
Trovalo opened this issue Jul 22, 2020 · 10 comments · Fixed by #9616
Closed

Input execd - process are not closed when Telegraf (service) stops #7876

Trovalo opened this issue Jul 22, 2020 · 10 comments · Fixed by #9616
Assignees
Labels
area/execd Issues related to execd or plugins that would be better suited to be used through execd platform/windows

Comments

@Trovalo
Copy link
Collaborator

Trovalo commented Jul 22, 2020

Relevant telegraf.conf:

[[inputs.execd]]

   command = [
       'powershell',
       '-File',
       'C:\Monitoring\telegraf\Configuration\input_scripts\xesmarttarget\xe_qmonitor.ps1',
       '-ServerName',
       'SQLCSRV04\SQL2017'
   ]

  signal = "none"
  restart_delay = "1m"
  data_format = "influx"
  
[[inputs.execd]]

  command = [
      'powershell',
      '-File',
      'C:\Monitoring\telegraf\Configuration\input_scripts\xesmarttarget\xe_qmonitor.ps1',
      '-ServerName',
      'SQLCSRV04\ITEC'
  ]

  signal = "none"
  restart_delay = "1m"
  data_format = "influx"

System info:

  • Telegraf Nightly (last build)
  • Windows Server 2016

Steps to reproduce:

  1. run Telegraf as a service (on windows)
  2. use the execd plugin to start one or more processes
  3. stop the telegraf service

You will see that the started processes are still up and running.

The same does not happen when running telegraf as a console command (cmd), if the cmd command is stopped using ctrl+c all the started processes are also closed

Expected behavior:

If the telegraf service is stopped (on windows), the processes started by execd are closed

Actual behavior:

when the telegraf service is stopped (on windows) the execd started processes are not closed

Additional info:

Here are the logs generated by telegraf
Running as a service

2020-07-22T14:10:02Z I! Loaded inputs: exec execd execd internal sqlserver sqlserver
2020-07-22T14:10:02Z I! Loaded aggregators: 
2020-07-22T14:10:02Z I! Loaded processors: strings converter strings
2020-07-22T14:10:02Z I! Loaded outputs: influxdb
2020-07-22T14:10:02Z I! Tags enabled: company=quantumdatis_demo host=QDSRVMONITOR
2020-07-22T14:10:02Z I! [agent] Config: Interval:15s, Quiet:false, Hostname:"QDSRVMONITOR", Flush Interval:10s
2020-07-22T14:10:02Z D! [agent] Initializing plugins
2020-07-22T14:10:02Z D! [agent] Connecting outputs
2020-07-22T14:10:02Z D! [agent] Attempting connection to [outputs.influxdb]
2020-07-22T14:10:02Z D! [agent] Successfully connected to outputs.influxdb
2020-07-22T14:10:02Z D! [agent] Starting service inputs
2020-07-22T14:10:02Z I! [inputs.execd] Starting process: C:\Program Files\XESmartTarget\XESmartTarget.exe [--File C:\Monitoring\telegraf\Configuration\input_scripts\xesmarttarget\telegraf.json --NoLogo --Quiet --GlobalVariables ServerName=SQLCSRV04\SQL2017]
2020-07-22T14:10:02Z I! [inputs.execd] Starting process: powershell [-File C:\Monitoring\telegraf\Configuration\input_scripts\xesmarttarget\xe_qmonitor.ps1 -ServerName SQLCSRV04\ITEC]
2020-07-22T14:10:12Z D! [outputs.influxdb] Buffer fullness: 0 / 15000 metrics
2020-07-22T14:10:22Z D! [outputs.influxdb] Wrote batch of 1230 metrics in 313.4507ms
2020-07-22T14:10:22Z D! [outputs.influxdb] Buffer fullness: 0 / 15000 metrics
2020-07-22T14:10:30Z D! [agent] Stopping service inputs

Running as a cmd command

2020-07-22T14:11:36Z I! [inputs.execd] Starting process: powershell [-File C:\Monitoring\telegraf\Configuration\input_scripts\xesmarttarget\xe_qmonitor.ps1 -ServerName SQLCSRV04\ITEC]
2020-07-22T14:11:47Z D! [outputs.influxdb] Wrote batch of 1229 metrics in 292.0335ms
2020-07-22T14:11:47Z D! [outputs.influxdb] Buffer fullness: 0 / 15000 metrics
2020-07-22T14:11:54Z D! [agent] Stopping service inputs
2020-07-22T14:11:57Z D! [outputs.influxdb] Buffer fullness: 0 / 15000 metrics
2020-07-22T14:11:59Z I! [inputs.execd] Process C:\Program Files\XESmartTarget\XESmartTarget.exe shut down
2020-07-22T14:12:04Z E! [inputs.execd] Error in plugin: read |0: file already closed
2020-07-22T14:12:04Z E! [inputs.execd] Error in plugin: error reading stderr: read |0: file already closed
2020-07-22T14:12:04Z I! [inputs.execd] Process C:\Windows\System32\WindowsPowerShell\v1.0\powershell.exe shut down
2020-07-22T14:12:04Z D! [agent] Input channel closed
2020-07-22T14:12:04Z D! [agent] Processor channel closed
2020-07-22T14:12:04Z D! [agent] Processor channel closed
2020-07-22T14:12:04Z D! [agent] Processor channel closed
2020-07-22T14:12:04Z I! [agent] Hang on, flushing any cached metrics before shutdown
2020-07-22T14:12:04Z D! [outputs.influxdb] Buffer fullness: 0 / 15000 metrics
2020-07-22T14:12:04Z D! [agent] Stopped Successfully

As you can see in the logs, the "service log" last message is D! [agent] Stopping service inputs, meanwhile in the "cmd command log" there are additional messages confirming that the child processes have been closed

@ssoroka
Copy link
Contributor

ssoroka commented Jul 23, 2020

Telegraf closes the stdin of the process when it's done writing metrics out and wants the process to shut down.
When stdin closes, the process should finish processing metrics, write them to stdout, then shut itself down.
If this isn't happening, it sounds like powershell isn't managing the stdin/stdout pipes properly?

@spaghettidba
Copy link
Contributor

I can confirm the issue. There is something different happening when telegraf is stopped with CTRL+C or with stop service.
@ssoroka I don't think it has to do with the I/O streams because the processes are shut down correctly when telegraf is stopped with CTRL+C. The child processes continue running when the service is stopped instead.

@ssoroka
Copy link
Contributor

ssoroka commented Aug 6, 2020

k. I'm aware of some issues with the ctrl-c interrupt being passed on to the child process (which it should ideally ignore and wait for stdin to close), but I'd like to ignore the aspect of ctrl-c and focus on the process not shutting down correctly.

I'll try to replicate the problem on windows and get back to you.

@ssoroka ssoroka self-assigned this Aug 6, 2020
@ssoroka
Copy link
Contributor

ssoroka commented Aug 18, 2020

I don't typically work in Windows. What do I need to do to "run Telegraf as a service (on windows)"? as far as i know the Telegraf binary isn't equipped by itself to run as a Windows service, so are you using something to bridge that gap?

So far my theory is that something is happening that causes closing the process's input to not trigger the process to shut down, which would cause the shutdown process to hang waiting for the child processes to exit. This sounds like it's related to being run as a service; is stdin disconnected? Not sure why it would have no effect.

@spaghettidba
Copy link
Contributor

Telegraf itself is well equipped to run as a Windows service. See the documentation

I guess that the only way to discover what is happening is to debug it in windows once started as a service. I don't have enough knowledge in GO to attach to an existing process and debug it, I'm sorry.

@ssoroka
Copy link
Contributor

ssoroka commented Aug 19, 2020

@spaghettidba 🏅 Thanks for the docs link. :D ❤️

I should be able to reproduce this. Will test and update.

@helenosheaa helenosheaa added the area/execd Issues related to execd or plugins that would be better suited to be used through execd label Jan 29, 2021
@sumptersmartt
Copy link

I am running into this same issue on 1.19.1, is there a good workaround?

@Trovalo
Copy link
Collaborator Author

Trovalo commented Aug 10, 2021 via email

@sumptersmartt
Copy link

Thanks, I just wrote one myself that I think is pretty safe. You'd need to update it to match the command you're running, but,

Get-WmiObject win32_process -Filter "name = 'powershell.exe'" | ? {$($_.ParentProcessID -notin $((Get-Process -Name telegraf).id)) -and $($_.SessionID -eq 0) -and $($_.commandline -like "C:\\scripts\\")} | stop-process

Sorry for the one liner, but basically this script says "give me all of the processes named powershell.exe that don't have a parent process ID that matches any currently running a process matching "telegraf*". It then filters by sessionID=0 (don't use this if you're not running as a service, but I'm assuming you are if you're looking at this issue), and finally by the actual 'command line' value that should resemble the actual exec command in the telegraf config.

Use at your own risk!

@jhychan
Copy link
Contributor

jhychan commented Aug 11, 2021

I think I've found the issue here - when Telegraf runs interactively/console mode the main loop (reloadLoop) runs on the main thread/goroutine, but when running as a service the main telegraf loop (reloadLoop) runs in a child thread/goroutine.

In interactive/console mode when you interrupt the process with Control-C the reloadLoop detects this and all plugins get to cleanly shutdown on the main thread/goroutine.

However, when telegraf runs as a Windows service the main thread/goroutine is finishing without waiting for the reloadLoop thread/goroutine to complete, causing all child goroutines to stop running. So the plugins are not being cleanly shut down when stopping the telegraf service. The logs show that plugin shutdown sequence begins but it never completes, it just ends abruptly. As such, this issue affects all plugins and is not specific to execd. For example, judging by your example logs, it's likely that output plugins are also not being flushed and we are losing metrics.

I have a fork of telegraf where I've tested a "fix" for this (additional channel-based signaling between the 2 goroutines), where the main goroutine waits for the reloadLoop goroutine to finish before it can exit. However I don't quite understand the reason for parts of the current telegraf concurrency implementation for the Windows service, so I suspect there may be a more "correct" solution. I'll submit a PR and get some feedback.

@ssoroka

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/execd Issues related to execd or plugins that would be better suited to be used through execd platform/windows
Projects
None yet
Development

Successfully merging a pull request may close this issue.

6 participants