Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

KNX Listener stops receiving messages #14695

Closed
alexander-zimmermann opened this issue Feb 7, 2024 · 15 comments · Fixed by #14959
Closed

KNX Listener stops receiving messages #14695

alexander-zimmermann opened this issue Feb 7, 2024 · 15 comments · Fixed by #14959
Assignees
Labels
bug unexpected problem or unintended behavior

Comments

@alexander-zimmermann
Copy link
Contributor

Relevant telegraf.conf

[agent]
  interval = "10s"
  round_interval = true
  metric_batch_size = 1000
  metric_buffer_limit = 10000
  collection_jitter = "0s"
  flush_interval = "10s"
  flush_jitter = "0s"
  precision = "0s"
  debug = true

[[inputs.knx_listener]]
  ## Type of KNX-IP interface.
  ## Can be either "tunnel" or "router".
  service_type = "tunnel_tcp"
   
  ## Address of the KNX-IP interface.
  service_address = "192.168.1.173:3671"

  [[inputs.knx_listener.measurement]]
    name = "Versorgungstechnik.Energiezähler.Strom.Verbrauch.Zählerstand-Tagtarif"
    dpt = "13.013"
    addresses = ["13/1/0"]

  ...

Logs from Telegraf

...
2024-02-04T20:51:15Z D! [inputs.knx_listener] Matched GA "13/3/22" to measurement "Versorgungstechnik.KWL.Luftstrom" with value 242 m^3/h
2024-02-04T20:51:15Z D! [inputs.knx_listener] Matched GA "13/1/3" to measurement "Versorgungstechnik.Energiezähler.Strom.Verbrauch.Wirkleistung-L1" with value 231.00 W
2024-02-04T20:51:16Z D! [inputs.knx_listener] Matched GA "13/3/22" to measurement "Versorgungstechnik.KWL.Luftstrom" with value 240 m^3/h
2024-02-04T20:51:16Z D! [inputs.knx_listener] Matched GA "7/2/125" to measurement "Sensorik.EG.Esszimmer.Sensor.Luftfeuchte-Alarm-1" with value Off
2024-02-04T20:51:16Z D! [inputs.knx_listener] Matched GA "2/2/147" to measurement "Strom.EG.Küche.K4-L1.Geschirrspüler.Stromwert" with value 502 mA
2024-02-04T20:51:17Z D! [inputs.knx_listener] Matched GA "13/1/5" to measurement "Versorgungstechnik.Energiezähler.Strom.Verbrauch.Wirkleistung-L3" with value 521.00 W
2024-02-04T20:51:17Z D! [inputs.knx_listener] Matched GA "13/1/14" to measurement "Versorgungstechnik.Energiezähler.Strom.Stromstärke-L3" with value 2.47 A
2024-02-04T20:51:17Z D! [inputs.knx_listener] Matched GA "2/2/147" to measurement "Strom.EG.Küche.K4-L1.Geschirrspüler.Stromwert" with value 312 mA
2024-02-04T20:51:18Z D! [inputs.knx_listener] Matched GA "13/3/22" to measurement "Versorgungstechnik.KWL.Luftstrom" with value 241 m^3/h
2024-02-04T20:51:21Z D! [inputs.knx_listener] Matched GA "7/5/53" to measurement "Sensorik.A.Dach.Wetterstation.Regen" with value On
2024-02-04T20:51:22Z D! [outputs.influxdb_v2] Wrote batch of 21 metrics in 13.314648ms
2024-02-04T20:51:22Z D! [outputs.influxdb_v2] Buffer fullness: 0 / 10000 metrics
2024-02-04T20:51:32Z D! [outputs.influxdb_v2] Buffer fullness: 0 / 10000 metrics
2024-02-04T20:51:42Z D! [outputs.influxdb_v2] Buffer fullness: 0 / 10000 metrics
2024-02-04T20:51:52Z D! [outputs.influxdb_v2] Buffer fullness: 0 / 10000 metrics
2024-02-04T20:52:02Z D! [outputs.influxdb_v2] Buffer fullness: 0 / 10000 metrics
...

And today the same
...
-Gregory.Sensor.Luftfeuchte" with value 51.32 %
2024-02-06T20:59:17Z D! [inputs.knx_listener] Matched GA "5/3/21" to measurement "Heizung.OG.Abstellraum.FBH.Stellwert-Status" with value 27.84%
2024-02-06T20:59:17Z D! [inputs.knx_listener] Matched GA "7/3/200" to measurement "Sensorik.OG.Begehbarer-Schrank.BWM.Temperatur" with value 19.48 °C
2024-02-06T20:59:17Z D! [inputs.knx_listener] Matched GA "13/3/22" to measurement "Versorgungstechnik.KWL.Luftstrom" with value 241 m^3/h
2024-02-06T20:59:17Z D! [inputs.knx_listener] Matched GA "7/1/73" to measurement "Sensorik.KG.Technikraum.Sensor.CO2-Alarm-2" with value Off
2024-02-06T20:59:18Z D! [inputs.knx_listener] Matched GA "5/3/101" to measurement "Heizung.OG.Schlafzimmer-Eltern.FBH.Stellwert-Status" with value 7.45%
2024-02-06T20:59:18Z D! [inputs.knx_listener] Matched GA "5/2/101" to measurement "Heizung.EG.Küche.FBH.Stellwert-Status" with value 46.27%
2024-02-06T20:59:18Z D! [inputs.knx_listener] Matched GA "7/2/95" to measurement "Sensorik.EG.Wohnzimmer.Sensor.Luftfeuchte-Alarm-1" with value Off
2024-02-06T20:59:22Z D! [outputs.influxdb_v2] Wrote batch of 20 metrics in 13.403163ms
2024-02-06T20:59:22Z D! [outputs.influxdb_v2] Buffer fullness: 0 / 10000 metrics
2024-02-06T20:59:32Z D! [outputs.influxdb_v2] Buffer fullness: 0 / 10000 metrics
2024-02-06T20:59:42Z D! [outputs.influxdb_v2] Buffer fullness: 0 / 10000 metrics
2024-02-06T20:59:52Z D! [outputs.influxdb_v2] Buffer fullness: 0 / 10000 metrics
2024-02-06T21:00:02Z D! [outputs.influxdb_v2] Buffer fullness: 0 / 10000 metrics
2024-02-06T21:00:12Z D! [outputs.influxdb_v2] Buffer fullness: 0 / 10000 metrics
2024-02-06T21:00:22Z D! [outputs.influxdb_v2] Buffer fullness: 0 / 10000 metrics
2024-02-06T21:00:32Z D! [outputs.influxdb_v2] Buffer fullness: 0 / 10000 metrics
2024-02-06T21:00:42Z D! [outputs.influxdb_v2] Buffer fullness: 0 / 10000 metrics
...

System info

Host Windows 11 Pro Hyper-V, Ubuntu 22.04 VM, Docker 25.0.2

Docker

telegraf:
image: telegraf
container_name: telegraf
hostname: telegraf
restart: unless-stopped
volumes:
- ./telegraf/config/telegraf.conf:/etc/telegraf/telegraf.conf:ro
- ./telegraf/config/inputs.knx.conf:/etc/telegraf/telegraf.d/inputs.knx.conf:ro
- /etc/ssl/certs/ca-certificates.crt:/etc/ssl/ca-certificates.crt:ro
secrets:
- telegraf_influxdb_token
command:
- --config=/etc/telegraf/telegraf.conf
- --config-directory=/etc/telegraf/telegraf.d/
depends_on:
- influxdb

Steps to reproduce

  1. docker start telegraf
  2. wait couple of hours
  3. docker logs -f telegraf
    ...

Expected behavior

See incoming messages...

Actual behavior

The KNX bus is up-and-running. I also run xknx on the Ubuntu VM in parallel . No issues.

Additional info

No response

@alexander-zimmermann alexander-zimmermann added the bug unexpected problem or unintended behavior label Feb 7, 2024
@alexander-zimmermann alexander-zimmermann changed the title KNX Listener stops receives messages KNX Listener stops receiving messages Feb 7, 2024
@powersj
Copy link
Contributor

powersj commented Feb 7, 2024

Hi,

Are you using any processors? This is why we ask for your full config ;)

I am asking because the listener uses tracking metrics and will only consume up to the max number of messages config option. If a processor is not correctly handling tracking metrics, then it can result in this apparent hang.

@powersj
Copy link
Contributor

powersj commented Feb 7, 2024

Also what version of telegraf?

@powersj powersj added the waiting for response waiting for response from contributor label Feb 7, 2024
@alexander-zimmermann
Copy link
Contributor Author

Version: docker latest (see my docker compose):

alexander@ubuntu:~$ docker exec -it telegraf bash
root@telegraf:/# telegraf --version
Telegraf 1.29.4 (git: HEAD@4441c4ed)

@telegraf-tiger telegraf-tiger bot removed the waiting for response waiting for response from contributor label Feb 7, 2024
@alexander-zimmermann
Copy link
Contributor Author

Hi,

Are you using any processors? This is why we ask for your full config ;)

I am asking because the listener uses tracking metrics and will only consume up to the max number of messages config option. If a processor is not correctly handling tracking metrics, then it can result in this apparent hang.

No processors.

Full config: https://gist.github.com/alexander-zimmermann/e663d8810416555374dc05a63074d8d1
KNX Input config: https://gist.github.com/alexander-zimmermann/070d6add7219328d2c0607af442724fe

@powersj
Copy link
Contributor

powersj commented Feb 7, 2024

I am going to have to build you a debug version to understand what is going on then. If I put up a PR with build artifacts, can I have you try using it?

@powersj powersj self-assigned this Feb 7, 2024
@alexander-zimmermann
Copy link
Contributor Author

I am going to have to build you a debug version to understand what is going on then. If I put up a PR with build artifacts, can I have you try using it?

sure.

@alexander-zimmermann
Copy link
Contributor Author

@powersj do you have something I can test?

@powersj
Copy link
Contributor

powersj commented Feb 28, 2024

@alexander-zimmermann thanks for the ping, I completely forgot about this. You can find a debug build artifacts at #14906 in 20-30mins.

Essentially, I have added some additional logging around message handling to see if we were silently dropping something.

Because this is a listener you may need to get a network trace to see if any messages are actually coming in should this debug build not provide any additional information.

I would also suggest trying to run this outside of docker if possible to ensure that is also not an issue.

@powersj powersj added the waiting for response waiting for response from contributor label Feb 28, 2024
@alexander-zimmermann
Copy link
Contributor Author

@powersj I run your build artifact for +- 24 hours now. No problems at all. Any idea how we can further debug the problem?

@telegraf-tiger telegraf-tiger bot removed the waiting for response waiting for response from contributor label Mar 4, 2024
@powersj
Copy link
Contributor

powersj commented Mar 4, 2024

Any idea how we can further debug the problem?

Hmm that is interesting as all that PR does is add some additionally logging. Would you be willing to share the logs? You don't have to post them here, you can email me at jpowers at influxdata.com.

Are you seeing any ignoring message logs?

@powersj powersj added the waiting for response waiting for response from contributor label Mar 4, 2024
@alexander-zimmermann
Copy link
Contributor Author

I shared the logs w/ you.

Cron will restart my docker telegraf every hour currently. Let me disable that over the weekend and see if the problem still exist w/ telegraf 1.29.5.

My currently guess is that knx-go looses the connection and does not reestablish it. See also: vapourismo/knx-go#70

But if this is the case, why I don't have it w/o docker not (plain telegraf).

@telegraf-tiger telegraf-tiger bot removed the waiting for response waiting for response from contributor label Mar 8, 2024
@powersj
Copy link
Contributor

powersj commented Mar 8, 2024

Thanks for the logs. It does seem to continue to get messages, unlike the original logs. Thanks also for the knx-go thread. I found this message interesting. Sounded like there was an opportunity to improve the retry connection logic in knx-go and/or somehow for telegraf to tell the Inbound has closed and attempt to reopen?

Let me know how the tests go over the weekend.

@alexander-zimmermann
Copy link
Contributor Author

It's stable since Friday. I continuously get messages

@srebhan
Copy link
Member

srebhan commented Mar 11, 2024

@alexander-zimmermann and @powersj I think I know what happened here... The plugin connects to the tunnel in Start() and then starts listening. If, for any reason, the TCP connection gets interrupted (e.g. due to a switch restart as in my case) listen() will exit and the plugin is left dead because we do not make any attempt to reconnect to the tunnel device... :-(

@srebhan
Copy link
Member

srebhan commented Mar 11, 2024

@alexander-zimmermann please try the binary in #14959 available once CI finished the tests. You can trigger the reconnect by pulling and replugging the network cable or otherwise interrupt the network connection between Telegraf and the KNX interface. Let me know if this fixes your issue!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug unexpected problem or unintended behavior
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants