Description
Relevant telegraf.conf:
[[outputs.influxdb]]
  urls = ["http://127.0.0.1:8086"]
  database = "telegraf"
  namedrop = ["_test"]
  [outputs.influxdb.tagdrop]
    influxdb_database = [""]

[[outputs.influxdb]]
  urls = ["http://192.168.x.y:8086/"]
  database = "telegraf"
  namedrop = ["_test"]
  metric_buffer_limit = 100000
  [outputs.influxdb.tagdrop]
    influxdb_database = [""]

[[outputs.influxdb]]
  urls = ["http://127.0.0.1:8086"]
  database = "pmi"
  tagexclude = ["influxdb_database"]
  [outputs.influxdb.tagpass]
    influxdb_database = ["pmi"]

[[outputs.influxdb]]
  urls = ["http://192.168.x.y:8086/"]
  database = "pmi"
  metric_buffer_limit = 100000
  tagexclude = ["influxdb_database"]
  [outputs.influxdb.tagpass]
    influxdb_database = ["pmi"]

[[outputs.influxdb]]
  urls = ["http://127.0.0.1:8086"]
  database = "telegraf"
  tagexclude = ["influxdb_database"]
  [outputs.influxdb.tagpass]
    influxdb_database = ["idc"]

[[outputs.influxdb]]
  urls = ["http://192.168.x.y:8086/"]
  database = "telegraf"
  tagexclude = ["influxdb_database"]
  [outputs.influxdb.tagpass]
    influxdb_database = ["idc"]
System info:
telegraf 1.12.1-1.12.4 (earlier versions are probably affected as well; only tested with 1.12.1 and 1.12.4)
Steps to reproduce:
- Configure multiple outputs for the same metric stream
- Start ingesting metrics into Telegraf
- Take one of the outputs away ungracefully (e.g. drop its traffic with a firewall, pull the plug, etc.); a rough way to simulate this is sketched right after this list
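For reference, one rough way to approximate the ungraceful loss without touching a firewall is a black-hole listener that accepts connections on the InfluxDB port but never answers. This is my own hypothetical helper, not part of Telegraf, and it is not identical to a firewall DROP, but it produces the same client-side timeouts:

// blackhole.go: accept TCP connections on the InfluxDB port and never respond,
// so the influxdb output's POSTs hang until the HTTP client timeout fires
// ("Client.Timeout exceeded while awaiting headers").
package main

import (
	"log"
	"net"
)

func main() {
	ln, err := net.Listen("tcp", ":8086")
	if err != nil {
		log.Fatal(err)
	}
	log.Println("black-hole listener on :8086; connections are accepted but never answered")
	var conns []net.Conn
	for {
		conn, err := ln.Accept()
		if err != nil {
			log.Fatal(err)
		}
		// Keep a reference so the connection stays open; never read or write.
		conns = append(conns, conn)
	}
}

Pointing one of the urls above at such a listener while metrics are flowing produces the same timeout errors as in the log excerpt further down.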
Expected behavior:
I would expect all other outputs to continue functioning unaffected, and the output that went away to queue up metrics until its buffer limit (metric_buffer_limit) is reached.
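To make that expectation concrete, here is a hypothetical sketch (my own illustration, not Telegraf's actual implementation) of the isolation I have in mind: each output owns a bounded buffer and drains it independently, so a stalled output only overflows its own buffer and never affects the others.

package main

import (
	"fmt"
	"time"
)

type output struct {
	name   string
	buffer chan string // bounded, analogous to metric_buffer_limit
}

func (o *output) add(m string) {
	select {
	case o.buffer <- m:
		// Queued for this output.
	default:
		// This output's buffer limit is reached: drop for this output only.
		fmt.Printf("%s: buffer full, dropping %s\n", o.name, m)
	}
}

func main() {
	healthy := &output{name: "local", buffer: make(chan string, 100000)}
	// Tiny buffer so the overflow is visible in this sketch.
	stalled := &output{name: "remote-over-vpn", buffer: make(chan string, 3)}

	// Only the healthy output is drained; the stalled one is never written out.
	go func() {
		for m := range healthy.buffer {
			fmt.Println("local: wrote", m)
		}
	}()

	for i := 0; i < 5; i++ {
		m := fmt.Sprintf("metric-%d", i)
		healthy.add(m)
		stalled.add(m)
	}
	time.Sleep(100 * time.Millisecond)
}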
Actual behavior:
If one output goes away ungracefully (in my case a VPN tunnel going down; a firewall drop produces the same result) and its requests run into timeouts, Telegraf starts dropping metrics for all outputs. When the lost output comes back, Telegraf panics and is restarted by systemd, so all buffered metrics are lost for all outputs. This happens with both the influxdb and influxdb_v2 outputs.
Additional info:
Nov 18 11:01:00 telegraf[11714]: 2019-11-18T10:01:00Z D! [outputs.influxdb] Wrote batch of 51 metrics in 61.231974ms
Nov 18 11:01:00 telegraf[11714]: 2019-11-18T10:01:00Z D! [outputs.influxdb] Buffer fullness: 1 / 100000 metrics
Nov 18 11:01:05 telegraf[11714]: 2019-11-18T10:01:05Z E! [outputs.influxdb] when writing to [http://192.168.x.y:8086/]: Post http://192.168.x.y:8086/write?db=telegraf: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)
Nov 18 11:01:05 telegraf[11714]: 2019-11-18T10:01:05Z D! [outputs.influxdb] Buffer fullness: 787 / 100000 metrics
Nov 18 11:01:05 telegraf[11714]: 2019-11-18T10:01:05Z E! [agent] Error writing to outputs.influxdb: could not write any address
Nov 18 11:01:05 telegraf[11714]: 2019-11-18T10:01:05Z E! [outputs.influxdb] when writing to [http://192.168.x.y:8086/]: Post http://192.168.x.y:8086/write?db=telegraf: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)
Nov 18 11:01:05 telegraf[11714]: 2019-11-18T10:01:05Z D! [outputs.influxdb] Buffer fullness: 946 / 100000 metrics
Nov 18 11:01:05 telegraf[11714]: 2019-11-18T10:01:05Z E! [agent] Error writing to outputs.influxdb: could not write any address
Nov 18 11:01:05 telegraf[11714]: 2019-11-18T10:01:05Z E! [outputs.influxdb] when writing to [http://192.168.x.y:8086/]: Post http://192.168.x.y:8086/write?db=pmi: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)
Nov 18 11:01:05 telegraf[11714]: 2019-11-18T10:01:05Z D! [outputs.influxdb] Buffer fullness: 1000 / 100000 metrics
Nov 18 11:01:05 telegraf[11714]: 2019-11-18T10:01:05Z E! [agent] Error writing to outputs.influxdb: could not write any address
Nov 18 11:01:10 telegraf[11714]: 2019-11-18T10:01:10Z D! [outputs.influxdb] Buffer fullness: 0 / 100000 metrics
Nov 18 11:01:10 telegraf[11714]: 2019-11-18T10:01:10Z D! [outputs.influxdb] Wrote batch of 23 metrics in 96.375817ms
Nov 18 11:01:10 telegraf[11714]: 2019-11-18T10:01:10Z D! [outputs.influxdb] Buffer fullness: 17 / 100000 metrics
This keeps happening, and data is not written to any of the outputs; I see a gap in the graphs.
This is what happens when the lost output comes back:
Nov 18 11:03:40 telegraf[11714]: panic: channel is full
Nov 18 11:03:40 telegraf[11714]: goroutine 10357 [running]:
Nov 18 11:03:40 telegraf[11714]: github.com/influxdata/telegraf/agent.(*trackingAccumulator).onDelivery(0xc000292780, 0x2c11e80, 0xc002bff860)
Nov 18 11:03:40 telegraf[11714]: #11/go/src/github.com/influxdata/telegraf/agent/accumulator.go:167 +0x7a
Nov 18 11:03:40 telegraf[11714]: github.com/influxdata/telegraf/metric.(*trackingData).notify(...)
Nov 18 11:03:40 telegraf[11714]: #11/go/src/github.com/influxdata/telegraf/metric/tracking.go:73
Nov 18 11:03:40 telegraf[11714]: github.com/influxdata/telegraf/metric.(*trackingMetric).decr(0xc00174ea60)
Nov 18 11:03:40 telegraf[11714]: #11/go/src/github.com/influxdata/telegraf/metric/tracking.go:163 +0x9e
Nov 18 11:03:40 telegraf[11714]: github.com/influxdata/telegraf/metric.(*trackingMetric).Accept(0xc00174ea60)
Nov 18 11:03:40 telegraf[11714]: #11/go/src/github.com/influxdata/telegraf/metric/tracking.go:144 +0x3a
Nov 18 11:03:40 telegraf[11714]: github.com/influxdata/telegraf/internal/models.(*Buffer).metricWritten(0xc0001c2fa0, 0x2c72240, 0xc00174ea60)
Nov 18 11:03:40 telegraf[11714]: #11/go/src/github.com/influxdata/telegraf/internal/models/buffer.go:93 +0x72
Nov 18 11:03:40 telegraf[11714]: github.com/influxdata/telegraf/internal/models.(*Buffer).Accept(0xc0001c2fa0, 0xc002092000, 0x30, 0x30)
Nov 18 11:03:40 telegraf[11714]: #11/go/src/github.com/influxdata/telegraf/internal/models/buffer.go:179 +0xa6
Nov 18 11:03:40 telegraf[11714]: github.com/influxdata/telegraf/internal/models.(*RunningOutput).Write(0xc0001aa280, 0x0, 0xc000560660)
Nov 18 11:03:40 telegraf[11714]: #11/go/src/github.com/influxdata/telegraf/internal/models/running_output.go:190 +0xf7
Nov 18 11:03:40 telegraf[11714]: github.com/influxdata/telegraf/agent.(*Agent).flushOnce.func1(0xc001755b00, 0xc0016d7bc0)
Nov 18 11:03:40 telegraf[11714]: #11/go/src/github.com/influxdata/telegraf/agent/agent.go:597 +0x27
Nov 18 11:03:40 telegraf[11714]: created by github.com/influxdata/telegraf/agent.(*Agent).flushOnce
Nov 18 11:03:40 telegraf[11714]: #11/go/src/github.com/influxdata/telegraf/agent/agent.go:596 +0xc8
Nov 18 11:03:40 systemd[1]: telegraf.service: Main process exited, code=exited, status=2/INVALIDARGUMENT
Nov 18 11:03:40 systemd[1]: telegraf.service: Unit entered failed state.
Nov 18 11:03:40 systemd[1]: telegraf.service: Failed with result 'exit-code'.
Nov 18 11:03:40 systemd[1]: telegraf.service: Service hold-off time over, scheduling restart.
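The trace points at onDelivery in agent/accumulator.go:167, and "panic: channel is full" suggests a non-blocking send into a bounded delivery channel that panics instead of blocking or dropping when the consumer lags. A minimal, self-contained sketch of that pattern (my own illustration, not the actual Telegraf source; the channel capacity is made up) looks like this:

package main

import "fmt"

type deliveryInfo struct{ id int }

func main() {
	// Assume a small fixed-capacity delivery channel; the real capacity in
	// Telegraf is an internal detail and not visible in this report.
	delivered := make(chan deliveryInfo, 3)

	onDelivery := func(info deliveryInfo) {
		select {
		case delivered <- info:
			// Notification queued for the consumer.
		default:
			// Nobody drains the channel fast enough: this reproduces the
			// "panic: channel is full" seen when the lost output comes back
			// and acknowledges a large backlog at once.
			panic("channel is full")
		}
	}

	// Simulate a burst of acknowledgements after the output reconnects,
	// with no consumer reading from `delivered`.
	for i := 0; i < 10; i++ {
		fmt.Println("delivering", i)
		onDelivery(deliveryInfo{id: i})
	}
}

If that is the mechanism, the burst of acknowledgements when the lost output reconnects fills the channel faster than it is drained and crashes the whole agent, which matches the behaviour described above.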