Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

inputs.mqtt_consumer reconnects after network interruption but only yields timeouts thereafter #8722

Closed
mkarlesky opened this issue Jan 20, 2021 · 4 comments · Fixed by #8821
Assignees
Labels
area/mqtt bug unexpected problem or unintended behavior

Comments

@mkarlesky
Copy link

Relevant telegraf.conf:

Note: The MQTT Broker is AWS IoT Core.

[[inputs.mqtt_consumer]]
  servers = ["ssl://<snip>-ats.iot.us-east-1.amazonaws.com:8883"]

  ## Topics that will be subscribed to.
  topics = [
  "<snip>",
  "<snip>",
  ]

  qos = 0

  connection_timeout = "30s"

  client_id = "<snip>"

  tls_ca = "/etc/telegraf/ca.pem"
  tls_cert = "/etc/telegraf/cert.pem"
  tls_key = "/etc/telegraf/key.pem"

  name_override = "Relative Humidity"

  data_format = "json"

  fieldpass = ["value"]

  json_time_key = "timestamp"
  json_time_format = "unix_ms"

  [inputs.mqtt_consumer.tags]
    Units = "%"

System info:

Telegraf 1.17
Ubuntu 14.04.5

Docker

docker-compose.yaml:

  telegraf:
    image: telegraf:1.17-alpine
    restart: always
    environment:
      - INFLUXDB_V2_URL=https://us-west-2-1.aws.cloud2.influxdata.com
      - INFLUX_TOKEN=<snip>
      - OUR_ORGANIZATION=<snip>
      - INFLUXDB_BUCKET=<snip>
    volumes:
      - …/aws-mqtt-influxcloud-v2.conf:/etc/telegraf/telegraf.conf:ro
      - …/AmazonRootCA1.pem:/etc/telegraf/ca.pem
      - …/certificate.pem.crt:/etc/telegraf/cert.pem
      - …/private.pem.key:/etc/telegraf/key.pem

Steps to reproduce:

  1. Start Telegraf with MQTT consumer configuration for AWS IoT Core and allow to run successfully for a day.
  2. LAN device automatically reboots early in the morning, and Telegraf disconnects.
  3. Telegraf reconnects to MQTT broker after LAN device recovers.
  4. Telegraf MQTT collection timeouts take the place of metrics gathering thereafter.

Expected behavior:

Recovery of MQTT subscriptions after broker reconnection following network interruption.

Actual behavior:

After a connection interruption Telegraf successfully reconnects to the MQTT broker but collection results in timeouts. Only restarting Telegraf fixes the problem.

This problem was reliably reproducible each morning at 6am when a network device automatically rebooted and interrupted the connection to the AWS IoT Core MQTT Broker. Removing the automatic device reboot solved the problem.

Additional info:

 | 2021-01-13T12:23:16Z I! Starting Telegraf 1.17.0
 | 2021-01-13T12:23:16Z I! Using config file: /etc/telegraf/telegraf.conf
 | 2021-01-13T12:23:16Z I! Loaded inputs: mqtt_consumer (2x)
 | 2021-01-13T12:23:16Z I! Loaded aggregators:
 | 2021-01-13T12:23:16Z I! Loaded processors: rename
 | 2021-01-13T12:23:16Z I! Loaded outputs: influxdb_v2
 | 2021-01-13T12:23:16Z I! Tags enabled:
 | 2021-01-13T12:23:16Z I! [agent] Config: Interval:20s, Quiet:false, Hostname:"738565c8ccdc", Flush Interval:10s
 | 2021-01-13T12:23:16Z I! [inputs.mqtt_consumer] Connected [ssl://<snip>-ats.iot.us-east-1.amazonaws.com:8883]
 | 2021-01-13T12:23:16Z I! [inputs.mqtt_consumer] Connected [ssl://<snip>-ats.iot.us-east-1.amazonaws.com:8883]

 …

 | 2021-01-14T11:00:12Z E! [outputs.influxdb_v2] when writing to [https://us-west-2-1.aws.cloud2.influxdata.com]: Post "https://us-west-2-1.aws.cloud2.influxdata.com/api/v2/write?…": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
 | 2021-01-14T11:00:12Z E! [agent] Error writing to outputs.influxdb_v2: Post "https://us-west-2-1.aws.cloud2.influxdata.com/api/v2/write?…": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
 | 2021-01-14T11:00:22Z E! [outputs.influxdb_v2] when writing to [https://us-west-2-1.aws.cloud2.influxdata.com]: Post "https://us-west-2-1.aws.cloud2.influxdata.com/api/v2/write?…": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
 | 2021-01-14T11:00:22Z E! [agent] Error writing to outputs.influxdb_v2: Post "https://us-west-2-1.aws.cloud2.influxdata.com/api/v2/write?…": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
 | 2021-01-14T11:01:21Z E! [inputs.mqtt_consumer] Error in plugin: connection lost: pingresp not received, disconnecting

 …

 | 2021-01-14T11:02:40Z I! [inputs.mqtt_consumer] Connected [ssl://<snip>-ats.iot.us-east-1.amazonaws.com:8883]
 | 2021-01-14T11:02:41Z I! [inputs.mqtt_consumer] Connected [ssl://<snip>-ats.iot.us-east-1.amazonaws.com:8883]
 | 2021-01-14T11:04:00Z E! [inputs.mqtt_consumer] Error in plugin: connection lost: pingresp not received, disconnecting
 | 2021-01-14T11:04:01Z E! [inputs.mqtt_consumer] Error in plugin: connection lost: pingresp not received, disconnecting
 | 2021-01-14T11:04:20Z I! [inputs.mqtt_consumer] Connected [ssl://<snip>-ats.iot.us-east-1.amazonaws.com:8883]
 | 2021-01-14T11:04:20Z I! [inputs.mqtt_consumer] Connected [ssl://<snip>-ats.iot.us-east-1.amazonaws.com:8883]
 | 2021-01-14T11:04:40Z W! [inputs.mqtt_consumer] Collection took longer than expected; not complete after interval of 20s
 | 2021-01-14T11:04:40Z W! [inputs.mqtt_consumer] Collection took longer than expected; not complete after interval of 20s
 | 2021-01-14T11:05:00Z W! [inputs.mqtt_consumer] Collection took longer than expected; not complete after interval of 20s
 | 2021-01-14T11:05:00Z W! [inputs.mqtt_consumer] Collection took longer than expected; not complete after interval of 20s
 | 2021-01-14T11:05:20Z W! [inputs.mqtt_consumer] Collection took longer than expected; not complete after interval of 20s
 | 2021-01-14T11:05:20Z W! [inputs.mqtt_consumer] Collection took longer than expected; not complete after interval of 20s
 | 2021-01-14T11:05:40Z W! [inputs.mqtt_consumer] Collection took longer than expected; not complete after interval of 20s
@mkarlesky mkarlesky added the bug unexpected problem or unintended behavior label Jan 20, 2021
@AndreasUfert
Copy link

AndreasUfert commented Jan 20, 2021

This bug affects us, too.
Here, three mqtt_consumer inputs connect to one mqtt broker reading three different topics. At one time or another all three inputs get stuck leading to this log output:

Jan 19 02:37:40 myfqdn telegraf[253559]: 2021-01-19T01:37:40Z W! [inputs.mqtt_consumer] Collection took longer than expected; not complete after interval of 10s
Jan 19 02:37:40 myfqdn telegraf[253559]: 2021-01-19T01:37:40Z D! [inputs.mqtt_consumer] Previous collection has not completed; scheduled collection skipped
Jan 19 02:37:40 myfqdn telegraf[253559]: 2021-01-19T01:37:40Z W! [inputs.mqtt_consumer] Collection took longer than expected; not complete after interval of 10s
Jan 19 02:37:40 myfqdn telegraf[253559]: 2021-01-19T01:37:40Z D! [inputs.mqtt_consumer] Previous collection has not completed; scheduled collection skipped
Jan 19 02:37:40 myfqdn telegraf[253559]: 2021-01-19T01:37:40Z D! [inputs.mqtt_consumer] Previous collection has not completed; scheduled collection skipped
Jan 19 02:37:40 myfqdn telegraf[253559]: 2021-01-19T01:37:40Z W! [inputs.mqtt_consumer] Collection took longer than expected; not complete after interval of 10s
Jan 19 02:37:46 myfqdn telegraf[253559]: 2021-01-19T01:37:46Z D! [outputs.influxdb] Buffer fullness: 0 / 10000 metrics
Jan 19 02:37:46 myfqdn telegraf[253559]: 2021-01-19T01:37:46Z D! [outputs.influxdb] Buffer fullness: 0 / 10000 metrics
Jan 19 02:37:46 myfqdn telegraf[253559]: 2021-01-19T01:37:46Z D! [outputs.socket_writer] Buffer fullness: 0 / 10000 metrics

We can't tell if it is caused by any network interruptions. Restarting telegraf helps but telegraf doesn't immediately react to SIGTERM, instead it must be terminated with SIGKILL.

The Problem might be related to #8562 which had been merged into 1.17.

@netmax79
Copy link

netmax79 commented Jan 25, 2021

I'm also affected by this bug. I can confirm exact same behaviour from @AndreasUfert

Edit: i just hit it again, here is a recent log output:

2021-01-25T14:55:09Z E! [inputs.mqtt_consumer] Error in plugin: connection lost: EOF 2021-01-25T14:55:09Z D! [inputs.mqtt_consumer] Disconnected [ssl://mqtt.XYZ:8883] 2021-01-25T14:55:10Z D! [inputs.mqtt_consumer] Connecting [ssl://mqtt.XYZ:8883] 2021-01-25T14:55:10Z I! [inputs.mqtt_consumer] Connected [ssl://mqtt.XYZ:8883] 2021-01-25T14:56:25Z E! [inputs.mqtt_consumer] Error in plugin: connection lost: pingresp not received, disconnecting 2021-01-25T14:56:25Z D! [inputs.mqtt_consumer] Disconnected [ssl://mqtt.XYZ:8883] 2021-01-25T14:56:30Z D! [inputs.mqtt_consumer] Connecting [ssl://mqtt.XYZ:8883] 2021-01-25T14:56:30Z I! [inputs.mqtt_consumer] Connected [ssl://mqtt.XYZ:8883] 2021-01-25T14:56:40Z D! [inputs.mqtt_consumer] Previous collection has not completed; scheduled collection skipped 2021-01-25T14:56:40Z W! [inputs.mqtt_consumer] Collection took longer than expected; not complete after interval of 10s 2021-01-25T14:56:50Z D! [inputs.mqtt_consumer] Previous collection has not completed; scheduled collection skipped 2021-01-25T14:56:50Z W! [inputs.mqtt_consumer] Collection took longer than expected; not complete after interval of 10s 2021-01-25T14:57:00Z D! [inputs.mqtt_consumer] Previous collection has not completed; scheduled collection skipped

@zjl1985
Copy link

zjl1985 commented Jan 27, 2021

Yes, I also encountered it. only restarting telegraf can solve it

2021-01-13T07:52:05Z I! [inputs.mqtt_consumer] Connected [tcp://xx3]
2021-01-21T09:48:26Z E! [inputs.mqtt_consumer] Error in plugin: connection lost: pingresp not received, disconnecting
2021-01-21T09:48:30Z I! [inputs.mqtt_consumer] Connected [tcp://xxx]
2021-01-21T09:50:45Z E! [inputs.mqtt_consumer] Error in plugin: connection lost: pingresp not received, disconnecting
2021-01-21T09:50:50Z I! [inputs.mqtt_consumer] Connected [tcp://xxx]
2021-01-21T09:51:00Z W! [inputs.mqtt_consumer] Collection took longer than expected; not complete after interval of 10s
2021-01-21T09:51:10Z W! [inputs.mqtt_consumer] Collection took longer than expected; not complete after interval of 10s
2021-01-21T09:51:20Z W! [inputs.mqtt_consumer] Collection took longer than expected; not complete after interval of 10s
2021-01-21T09:51:30Z W! [inputs.mqtt_consumer] Collection took longer than expected; not complete after interval of 10s
2021-01-21T09:51:40Z W! [inputs.mqtt_consumer] Collection took longer than expected; not complete after interval of 10s
2021-01-21T09:51:50Z W! [inputs.mqtt_consumer] Collection took longer than expected; not complete after interval of 10s
2021-01-21T09:52:00Z W! [inputs.mqtt_consumer] Collection took longer than expected; not complete after interval of 10s
2021-01-21T09:52:10Z W! [inputs.mqtt_consumer] Collection took longer than expected; not complete after interval of 10s
2021-01-21T09:52:20Z W! [inputs.mqtt_consumer] Collection took longer than expected; not complete after interval of 10s
2021-01-21T09:52:30Z W! [inputs.mqtt_consumer] Collection took longer than expected; not complete after interval of 10s
2021-01-21T09:52:40Z W! [inputs.mqtt_consumer] Collection took longer than expected; not complete after interval of 10s
2021-01-21T09:52:50Z W! [inputs.mqtt_consumer] Collection took longer than expected; not complete after interval of 10s
2021-01-21T09:53:00Z W! [inputs.mqtt_consumer] Collection took longer than expected; not complete after interval of 10s
2021-01-21T09:53:10Z W! [inputs.mqtt_consumer] Collection took longer than expected; not complete after interval of 10s
2021-01-21T09:53:20Z W! [inputs.mqtt_consumer] Collection took longer than expected; not complete after interval of 10s
2021-01-21T09:53:30Z W! [inputs.mqtt_consumer] Collection took longer than expected; not complete after interval of 10s

@mkarlesky
Copy link
Author

mkarlesky commented Feb 3, 2021

To add on to my own initial report at the top of this issue, I can say we've also seen Telegraf fall into this timeout behavior after simply running for an extended period of time. Our issue is reliably reproducible upon a specific network interruption (daily network device reboot). But, we've since noticed that if Telegraf runs for 4 or 5 days even without a network interruption, MQTT metric subscriptions eventually fall into a mode of consistent, repeating timeouts exactly as what happens when a network interruption occurs.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/mqtt bug unexpected problem or unintended behavior
Projects
None yet
Development

Successfully merging a pull request may close this issue.

6 participants