Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: windows service - graceful shutdown of telegraf #9616

Merged
merged 2 commits into from
Dec 2, 2021

Conversation

jhychan
Copy link
Contributor

@jhychan jhychan commented Aug 11, 2021

Required for all PRs:

  • Updated associated README.md. (I don't think there's any relevant readme to update here)
  • Wrote appropriate unit tests. (not sure this is relevant either)
  • Pull request title or commits are in conventional commit format (e.g. feat: or fix:)

Resolves #7876, however the issue affects telegraf agent behaviour across all plugins when telegraf runs as service, not just the execd plugin.

The current telegraf service implementation does not appear to wait for the main telegraf agent loop to gracefully finish before the main thread exits. This causes an abrupt exit and plugins do not have a chance to shut down / clean up properly (ie. outputs are not flushed, inputs with external dependencies are not closed properly).

I'm submitting this PR initially as a review - I am not sure this is the right solution to cover all scenarios. However from testing, it does work as intended under ideal scenarios - the main goroutine thread is blocked until the agent thread runs to completion.

The code changes are fairly simple. When the service stop control signal is triggered in Windows, the service interface Stop is called (see https://github.com/kardianos/service/blob/v1.0.0/service_windows.go#L282). The Stop interface sends a struct to the stop channel so that the reloadLoop runs the context cancel function to commence agent shutdown. While shutdown is in progress, the Stop interface is blocked waiting for the stop channel to be closed. Once reloadLoop completes, we close the stop channel to allow the main thread to complete.

My hesitancy with this PR being a full fix is that I do not fully understand the windows service specific concurrency implementation, so I'm unsure if this PR opens up telegraf to other potential race condition or deadlocks.

@ssoroka

@telegraf-tiger telegraf-tiger bot added the fix pr to fix corresponding bug label Aug 11, 2021
@jhychan
Copy link
Contributor Author

jhychan commented Sep 7, 2021

Any maintainers able to review/comment on this draft PR?

@jhychan jhychan marked this pull request as ready for review September 14, 2021 15:21
@jhychan jhychan force-pushed the winservice-gracefulstop branch from 17d0f26 to d1c4047 Compare October 13, 2021 20:43
@jhychan jhychan force-pushed the winservice-gracefulstop branch from d1c4047 to ee0fc31 Compare October 13, 2021 21:09
@jhychan
Copy link
Contributor Author

jhychan commented Nov 20, 2021

Bump - can we get this reviewed?

Copy link
Contributor

@powersj powersj left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've been trying to play with this on Windows 10 and it does seem to do a better job of cleaning up the Telegraf and stopping more gracefully.

@powersj powersj added the ready for final review This pull request has been reviewed and/or tested by multiple users and is ready for a final review. label Nov 22, 2021
@reimda reimda merged commit b954e56 into influxdata:master Dec 2, 2021
MyaLongmire pushed a commit that referenced this pull request Dec 8, 2021
powersj pushed a commit to powersj/telegraf that referenced this pull request Jan 21, 2022
phemmer added a commit to phemmer/telegraf that referenced this pull request Feb 18, 2022
* origin/master: (133 commits)
  chore: restart service if it is already running and upgraded via RPM (influxdata#9970)
  feat: update etc/telegraf.conf and etc/telegraf_windows.conf (influxdata#10237)
  fix: Handle duplicate registration of protocol-buffer files gracefully. (influxdata#10188)
  fix(http_listener_v2): fix panic on close (influxdata#10132)
  feat: add Vault input plugin (influxdata#10198)
  feat: support aws managed service for prometheus (influxdata#10202)
  fix: Make telegraf compile on Windows with golang 1.16.2 (influxdata#10246)
  Update changelog
  feat: Modbus add per-request tags (influxdata#10231)
  fix: Implement NaN and inf handling for elasticsearch output (influxdata#10196)
  feat: add nomad input plugin (influxdata#10106)
  fix: Print loaded plugins and deprecations for once and test (influxdata#10205)
  fix: eliminate MIB dependency for ifname processor (influxdata#10214)
  feat: Optimize locking for SNMP MIBs loading. (influxdata#10206)
  feat: Add SMART plugin concurrency configuration option, nvme-cli v1.14+ support and lint fixes. (influxdata#10150)
  feat: update configs (influxdata#10236)
  fix(inputs/kube_inventory): set TLS server name config properly (influxdata#9975)
  fix: Sudden close of Telegraf caused by OPC UA input plugin (influxdata#10230)
  fix: bump github.com/eclipse/paho.mqtt.golang from 1.3.0 to 1.3.5 (influxdata#9913)
  fix: json_v2 parser timestamp setting (influxdata#10221)
  fix: ensure graylog spec fields not prefixed with '_' (influxdata#10209)
  docs: remove duplicate links in CONTRIBUTING.md (influxdata#10218)
  fix: pool detection and metrics gathering for ZFS >= 2.1.x (influxdata#10099)
  fix: parallelism fix for ifname processor (influxdata#10007)
  chore: Forbids "log" package only for aggregators, inputs, outputs, parsers and processors (influxdata#10191)
  docs: address documentation gap when running telegraf in k8s (influxdata#10215)
  feat: update etc/telegraf.conf and etc/telegraf_windows.conf (influxdata#10211)
  fix: mqtt topic extracting no longer requires all three fields (influxdata#10208)
  fix: windows service - graceful shutdown of telegraf (influxdata#9616)
  feat: update etc/telegraf.conf and etc/telegraf_windows.conf (influxdata#10201)
  feat: Modbus support multiple slaves (gateway feature) (influxdata#9279)
  fix: Revert unintented corruption of the Makefile from influxdata#10200. (influxdata#10203)
  chore: remove triggering update-config bot in CI (influxdata#10195)
  Update changelog
  feat: Implement deprecation infrastructure (influxdata#10200)
  fix: extra lock on init for safety (influxdata#10199)
  fix: resolve influxdata#10027 (influxdata#10112)
  fix: register bigquery to output plugins influxdata#10177 (influxdata#10178)
  fix: sysstat use unique temp file vs hard-coded (influxdata#10165)
  refactor: snmp to use gosmi (influxdata#9518)
  ...
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
fix pr to fix corresponding bug ready for final review This pull request has been reviewed and/or tested by multiple users and is ready for a final review.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Input execd - process are not closed when Telegraf (service) stops
3 participants