Close running outputs when reloading #8769

viperstars · 2021-01-28T13:48:29Z

closes #8726, closes #8185

As we know, Telegraf reloads its configuration when it receives a SIGHUP signal. Looking deeper into the reload loop, we can see that "runAgent" is called again without closing the outputs that are already running.

This causes issues such as reload failures and connection leaks.

Reload failure

Add a prometheus_client output to the configuration:

[[outputs.prometheus_client]]
  listen = ":9273"
  path = "/metrics"
  metric_version = 2

Send a SIGHUP to the process.
The reload fails with an "address already in use" error:

2021-01-28T08:47:59Z I! Starting Telegraf 
2021-01-28T08:47:59Z I! Loaded inputs: x509_cert
2021-01-28T08:47:59Z I! Loaded aggregators: 
2021-01-28T08:47:59Z I! Loaded processors: 
2021-01-28T08:47:59Z I! Loaded outputs: prometheus_client
2021-01-28T08:47:59Z I! Tags enabled: host=Zachary-MBP.local
2021-01-28T08:47:59Z I! [agent] Config: Interval:20s, Quiet:false, Hostname:"Zachary-MBP.local", Flush Interval:20s
2021-01-28T08:47:59Z I! [outputs.prometheus_client] Listening on http://[::]:9273/metrics
2021-01-28T08:48:12Z I! Reloading Telegraf config
2021-01-28T08:48:12Z I! [agent] Hang on, flushing any cached metrics before shutdown
2021-01-28T08:48:12Z I! Starting Telegraf 
2021-01-28T08:48:12Z I! Loaded inputs: x509_cert
2021-01-28T08:48:12Z I! Loaded aggregators: 
2021-01-28T08:48:12Z I! Loaded processors: 
2021-01-28T08:48:12Z I! Loaded outputs: prometheus_client
2021-01-28T08:48:12Z I! Tags enabled: host=Zachary-MBP.local
2021-01-28T08:48:12Z I! [agent] Config: Interval:20s, Quiet:false, Hostname:"Zachary-MBP.local", Flush Interval:20s
2021-01-28T08:48:12Z E! [agent] Failed to connect to [outputs.prometheus_client], retrying in 15s, error was 'listen tcp :9273: bind: address already in use'
2021-01-28T08:48:27Z E! [telegraf] Error running agent: connecting output outputs.prometheus_client: Error connecting to output "outputs.prometheus_client": listen tcp :9273: bind: address already in use
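The bind error above can be reproduced outside Telegraf. The following is a minimal sketch, not Telegraf code: `rebind` is a hypothetical helper that listens on a free local port and then tries to listen on the same address again, the way a reloaded output would. It only assumes the standard `net` package.

```go
package main

import (
	"fmt"
	"net"
)

// rebind (hypothetical helper) opens a listener on a free local port,
// optionally closes it, and then tries to listen on the same address again.
func rebind(closeFirst bool) error {
	first, err := net.Listen("tcp", "127.0.0.1:0")
	if err != nil {
		return err
	}
	addr := first.Addr().String()

	if closeFirst {
		// Release the port before the second listener asks for it.
		if err := first.Close(); err != nil {
			return err
		}
	} else {
		// Keep the first listener alive until the function returns,
		// like an output that is never closed on reload.
		defer first.Close()
	}

	second, err := net.Listen("tcp", addr)
	if err != nil {
		return err
	}
	return second.Close()
}

func main() {
	fmt.Println("old listener left open:", rebind(false))
	fmt.Println("old listener closed:   ", rebind(true))
}
```

With the first listener left open, the second `net.Listen` call is expected to fail with a bind error like the one in the log; once the first listener is closed, the same address can be bound again.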

Connection leaking

Add an influxdb_v2 output:

[[outputs.influxdb_v2]]
urls = ["http://127.0.0.1:8086"]

Send SIGHUP several times.
More than one TCP connection to the InfluxDB server is now open:

Zachary-MBP:telegraf zachary$ lsof -p 34680
COMMAND    PID    USER   FD     TYPE             DEVICE  SIZE/OFF                NODE NAME
telegraf 34680 zachary  cwd      DIR                1,4      1408            26952554 /Users/zachary/Projects/golang/telegraf
telegraf 34680 zachary  txt      REG                1,4 117061680            27081706 /Users/zachary/Projects/golang/telegraf/telegraf
telegraf 34680 zachary  txt      REG                1,4     34032            24991620 /Library/Preferences/Logging/.plist-cache.eRVHsOgp
telegraf 34680 zachary  txt      REG                1,4     62747            26934247 /private/var/db/analyticsd/events.whitelist
telegraf 34680 zachary  txt      REG                1,4   2528384 1152921500312783762 /usr/lib/dyld
telegraf 34680 zachary  txt      REG                1,4  30148944 1152921500312795328 /usr/share/icu/icudt66l.dat
telegraf 34680 zachary    0u     CHR               16,1   0t14589                1483 /dev/ttys001
telegraf 34680 zachary    1u     CHR               16,1   0t14589                1483 /dev/ttys001
telegraf 34680 zachary    2u     CHR               16,1   0t14589                1483 /dev/ttys001
telegraf 34680 zachary    3     PIPE 0xdbb151f6d2284bc1     16384                     ->0x500c3504a5abf149
telegraf 34680 zachary    4     PIPE 0x500c3504a5abf149     16384                     ->0xdbb151f6d2284bc1
telegraf 34680 zachary    5u  KQUEUE                                                  count=0, state=0xa
telegraf 34680 zachary    6     PIPE 0x7060bc4282b87de5     16384                     ->0x58c782c11bf250d
telegraf 34680 zachary    7     PIPE  0x58c782c11bf250d     16384                     ->0x7060bc4282b87de5
telegraf 34680 zachary    8u    unix 0xba652578907083cf       0t0                     ->0xba6525787cd608e7
telegraf 34680 zachary    9r     CHR               17,1    0t4096                 593 /dev/urandom
telegraf 34680 zachary   10u    IPv4 0xba6525789e21f5d7       0t0                 TCP localhost:64693->localhost:8086 (ESTABLISHED)
telegraf 34680 zachary   11u    IPv4 0xba65257897816267       0t0                 TCP localhost:64841->localhost:8086 (ESTABLISHED)
telegraf 34680 zachary   12u    IPv4 0xba65257888979267       0t0                 TCP localhost:64847->localhost:8086 (ESTABLISHED)
telegraf 34680 zachary   13u    IPv4 0xba6525789ad5b267       0t0                 TCP localhost:64853->localhost:8086 (ESTABLISHED)
telegraf 34680 zachary   14u    IPv4 0xba65257898eb0fef       0t0                 TCP localhost:64859->localhost:8086 (ESTABLISHED)
telegraf 34680 zachary   15u    IPv4 0xba6525789e129697       0t0                 TCP localhost:64865->localhost:8086 (ESTABLISHED)
telegraf 34680 zachary   16u    IPv4 0xba6525789ad5d0af       0t0                 TCP localhost:64870->localhost:8086 (ESTABLISHED)
telegraf 34680 zachary   17u    IPv4 0xba6525788bc6f267       0t0                 TCP localhost:64876->localhost:8086 (ESTABLISHED)
telegraf 34680 zachary   18u    IPv4 0xba6525789e13fc7f       0t0                 TCP localhost:64884->localhost:8086 (ESTABLISHED)

This issue was mentioned before. "client.Close()" was added to the influxdb output, but the "(i *InfluxDB) Close()" method is never called during a reload.

I added some code to fix this.
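To illustrate why the connections pile up, here is a small simulation rather than the actual fix. `fakeOutput` and `reloadN` are hypothetical names: each "reload" creates and connects a new output, and the old one is closed only when `closeOld` is set.

```go
package main

import "fmt"

// fakeOutput is a hypothetical stand-in for an output plugin that holds
// one connection while it is connected.
type fakeOutput struct {
	open      *int // shared count of currently open connections
	connected bool
}

func (o *fakeOutput) Connect() error {
	*o.open++
	o.connected = true
	return nil
}

func (o *fakeOutput) Close() error {
	if o.connected {
		*o.open--
		o.connected = false
	}
	return nil
}

// reloadN simulates an initial start followed by n reloads and returns the
// number of connections still open afterwards.
func reloadN(n int, closeOld bool) int {
	open := 0
	var current *fakeOutput
	for i := 0; i <= n; i++ {
		if closeOld && current != nil {
			_ = current.Close() // release the previous output's connection
		}
		current = &fakeOutput{open: &open}
		_ = current.Connect()
	}
	return open
}

func main() {
	fmt.Println("reloads without Close:", reloadN(8, false))
	fmt.Println("reloads with Close:   ", reloadN(8, true))
}
```

Without the `Close()` call, every reload adds one more open connection, which matches the growing list of ESTABLISHED sockets in the lsof output; with it, only the current output's connection remains.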

Required for all PRs:

Associated README.md updated.
Has appropriate unit tests.

telegraf-tiger

🤝 ✅ CLA has been signed. Thank you!

cmd/telegraf/telegraf.go

srebhan

Please set the global agent variable directly from within runAgent() and please also think about a better name for the global variable. How about runningAgent or something similar?

viperstars · 2021-02-06T03:48:27Z

Please set the global agent variable directly from within runAgent() and please also think about a better name for the global variable. How about runningAgent or something similar?

I initialized a global variable named "runningAgent" and moved the closing of outputs into runAgent(), before the call to agent.NewAgent(c).

cmd/telegraf/telegraf.go

srebhan

This looks very good already. Just one more minor request: please set the agent to a defined state in case agent.NewAgent() fails. I know this call cannot fail currently, but it might fail in the future, and I don't want to introduce hard-to-find issues later by assuming the above function returns something sensible in case of an error.
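One way to read "a defined state" is sketched below. This is an assumption about the intent, not the code from the PR: `agent`, `newAgent`, and `startAgent` are hypothetical, and the global is reset to nil whenever construction fails so that later code never sees a half-built or stale value.

```go
package main

import (
	"errors"
	"fmt"
)

// agent is a hypothetical stand-in for the real agent type.
type agent struct{ name string }

// runningAgent is the package-level handle; nil means "no agent is running".
var runningAgent *agent

// newAgent is a hypothetical constructor that, unlike the current one
// described in the review, can fail.
func newAgent(name string) (*agent, error) {
	if name == "" {
		return nil, errors.New("empty agent name")
	}
	return &agent{name: name}, nil
}

// startAgent replaces the global only on success and resets it on failure.
func startAgent(name string) error {
	ag, err := newAgent(name)
	if err != nil {
		runningAgent = nil // defined state: never keep a stale agent around
		return fmt.Errorf("creating agent: %w", err)
	}
	runningAgent = ag
	return nil
}

func main() {
	_ = startAgent("first")
	err := startAgent("")
	fmt.Println("error:", err)
	fmt.Println("runningAgent is nil:", runningAgent == nil)
}
```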

srebhan

Looks good to me.

viperstars · 2021-02-10T06:38:07Z

related issue: #8726 #8185

srebhan · 2021-02-10T12:27:40Z

@viperstars can you please add closes #8726, #8185 to your PR description to allow automatic closing of the tickets!?

srebhan

Fine with me.

ssoroka

This is not the right approach. The change has to be internal to the agent after the output has finished processing its writes.
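The ordering requested here, closing only after pending writes are done, could look roughly like the sketch below. It is not the agent's real code: `recordingOutput` and `runOutputs` are hypothetical, and a `sync.WaitGroup` stands in for whatever mechanism the agent uses to know that writes have finished.

```go
package main

import (
	"fmt"
	"sync"
)

// recordingOutput is a hypothetical output that records the order of calls.
type recordingOutput struct {
	mu     sync.Mutex
	events []string
}

func (o *recordingOutput) Write(batch string) {
	o.mu.Lock()
	defer o.mu.Unlock()
	o.events = append(o.events, "write:"+batch)
}

func (o *recordingOutput) Close() {
	o.mu.Lock()
	defer o.mu.Unlock()
	o.events = append(o.events, "close")
}

// runOutputs writes all batches concurrently, waits for every write to
// finish, and only then closes the output.
func runOutputs(out *recordingOutput, batches []string) []string {
	var wg sync.WaitGroup
	for _, b := range batches {
		wg.Add(1)
		go func(batch string) {
			defer wg.Done()
			out.Write(batch)
		}(b)
	}
	wg.Wait()   // all pending writes are done
	out.Close() // safe to release the connection now
	return out.events
}

func main() {
	events := runOutputs(&recordingOutput{}, []string{"a", "b", "c"})
	fmt.Println(events[len(events)-1]) // the last recorded event is the close
}
```

Keeping the close inside the component that owns the write loop means the caller in cmd/telegraf does not have to guess when the outputs are idle.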

agent/agent.go

ssoroka

much better, thank you!

srebhan

Fine with me if you remove the extra log-message as @ssoroka requested/suggested.

srebhan

Sorry for one more request, but can you move the body of stopRunningOutputs() to the location of the function call? That is, dissolve stopRunningOutputs() and fold it into the gatherLoop() function. Only a for loop survived, and wrapping it in a function is not worth it.

viperstars · 2021-03-16T09:34:41Z

Sorry for one more request, but can you move the body of stopRunningOutputs() to the location of the function call? That is, dissolve stopRunningOutputs() and fold it into the gatherLoop() function. Only a for loop survived, and wrapping it in a function is not worth it.

I think we should keep it. I added "stopRunningOutputs" modeled on the existing function "stopServiceInputs". The two functions work in the same way, so they should be defined and called in the same way.
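The symmetry being argued for can be pictured with the following sketch. Only the two function names come from this thread; the interfaces, the concrete types, and the function bodies are assumptions made for illustration and are not copied from agent/agent.go.

```go
package main

import "fmt"

// Hypothetical, simplified plugin interfaces.
type input interface{ Gather() }

type serviceInput interface {
	input
	Stop()
}

type output interface{ Close() error }

// pollInput is a plain input with nothing to stop.
type pollInput struct{}

func (pollInput) Gather() {}

// listenerInput is a service input that must be stopped explicitly.
type listenerInput struct{ stopped bool }

func (l *listenerInput) Gather() {}
func (l *listenerInput) Stop()   { l.stopped = true }

// sink is an output that holds a resource until it is closed.
type sink struct{ closed bool }

func (s *sink) Close() error { s.closed = true; return nil }

// stopServiceInputs stops only the inputs that are service inputs.
func stopServiceInputs(inputs []input) {
	for _, in := range inputs {
		if si, ok := in.(serviceInput); ok {
			si.Stop()
		}
	}
}

// stopRunningOutputs closes every output; same shape as the helper above.
func stopRunningOutputs(outputs []output) {
	for _, out := range outputs {
		_ = out.Close()
	}
}

func main() {
	l, s := &listenerInput{}, &sink{}
	stopServiceInputs([]input{pollInput{}, l})
	stopRunningOutputs([]output{s})
	fmt.Println(l.stopped, s.closed)
}
```

Both helpers are a single loop, so the choice between a named function and an inlined loop is mostly about keeping the two call sites reading the same way.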

srebhan

Agreed. Looks good to me. @ssoroka any comments?

(cherry picked from commit 71757e8)

Add closing running outputs while reloading

5149c4f

telegraf-tiger bot approved these changes Jan 28, 2021

ivorybilled reviewed Jan 29, 2021