Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[inputs.opcua] Panic on sudden disconnect #10595

Closed
R290 opened this issue Feb 6, 2022 · 8 comments · Fixed by #10974
Closed

[inputs.opcua] Panic on sudden disconnect #10595

R290 opened this issue Feb 6, 2022 · 8 comments · Fixed by #10974
Assignees
Labels
area/opcua bug unexpected problem or unintended behavior

Comments

@R290
Copy link
Contributor

R290 commented Feb 6, 2022

Relevant telegraf.conf

[agent]
  interval = "10s"
  round_interval = true
  metric_batch_size = 1000
  metric_buffer_limit = 10000
  collection_jitter = "0s"
  flush_interval = "10s"
  flush_jitter = "0s"
  precision = ""
  debug = true

[[outputs.file]]
  files = ["stdout"]
  data_format = "influx"

[[inputs.opcua]]
  endpoint = "opc.tcp://opcua:4840"
  security_policy = "None"
  nodes = [
    {name="the_answer", namespace="1", identifier_type="s", identifier="the.answer"},
    {name="float", namespace="1", identifier_type="i", identifier="51028"},
  ]

Logs from Telegraf

2022-02-06T19:18:52Z I! Starting Telegraf 1.21.3
2022-02-06T19:18:52Z I! Using config file: /etc/telegraf/telegraf.conf
2022-02-06T19:18:52Z I! Loaded inputs: opcua
2022-02-06T19:18:52Z I! Loaded aggregators: 
2022-02-06T19:18:52Z I! Loaded processors: 
2022-02-06T19:18:52Z I! Loaded outputs: file
2022-02-06T19:18:52Z I! Tags enabled: host=12b9589239aa
2022-02-06T19:18:52Z I! [agent] Config: Interval:10s, Quiet:false, Hostname:"12b9589239aa", Flush Interval:10s
2022-02-06T19:18:52Z D! [agent] Initializing plugins
2022-02-06T19:18:52Z W! [inputs.opcua] Failed to load certificate: open /etc/telegraf/cert.pem: no such file or directory
2022-02-06T19:18:52Z D! [agent] Connecting outputs
2022-02-06T19:18:52Z D! [agent] Attempting connection to [outputs.file]
2022-02-06T19:18:52Z D! [agent] Successfully connected to outputs.file
2022-02-06T19:18:52Z D! [agent] Starting service inputs
2022-02-06T19:19:02Z D! [outputs.file] Wrote batch of 2 metrics in 235.1µs
2022-02-06T19:19:02Z D! [outputs.file] Buffer fullness: 0 / 10000 metrics
opcua,host=12b9589239aa the_answer=42i,Quality="OK (0x0)" 1644175140000000000
opcua,host=12b9589239aa float=0,Quality="OK (0x0)" 1644175140000000000
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0x26e76f6]

goroutine 117 [running]:
github.com/gopcua/opcua/uacp.(*Conn).Close(0xc0000d5d10)
        /go/pkg/mod/github.com/gopcua/opcua@v0.2.3/uacp/conn.go:211 +0x36
github.com/gopcua/opcua.(*Client).Close(0xc0000d5c70)
        /go/pkg/mod/github.com/gopcua/opcua@v0.2.3/client.go:523 +0x7d
github.com/influxdata/telegraf/plugins/inputs/opcua.disconnect(0xc000406b60)
        /go/src/github.com/influxdata/telegraf/plugins/inputs/opcua/opcua_client.go:525 +0x74
github.com/influxdata/telegraf/plugins/inputs/opcua.(*OpcUA).Gather(0xc000406b60, {0x5960588, 0xc0006a8280})
        /go/src/github.com/influxdata/telegraf/plugins/inputs/opcua/opcua_client.go:551 +0xbc
github.com/influxdata/telegraf/models.(*RunningInput).Gather(0xc000dba320, {0x5960588, 0xc0006a8280})
        /go/src/github.com/influxdata/telegraf/models/running_input.go:117 +0x5a
github.com/influxdata/telegraf/agent.(*Agent).gatherOnce.func1()
        /go/src/github.com/influxdata/telegraf/agent/agent.go:469 +0x2e
created by github.com/influxdata/telegraf/agent.(*Agent).gatherOnce
        /go/src/github.com/influxdata/telegraf/agent/agent.go:468 +0x12f

System info

Telegraf 1.21.3, Ubuntu 20.04.3, Docker 20.10.12

Docker

services:
  telegraf:
    image: telegraf:1.21.3
    volumes:
      - ./telegraf.conf:/etc/telegraf/telegraf.conf:ro
    depends_on:
      - opcua
  opcua:
    image: open62541/open62541:latest
    container_name: opcua
    ports:
      - '4840:4840'

Steps to reproduce

  1. docker-compose up -d
  2. docker kill opcua

The docker kill is meant to mimic a sudden disconnect event. This could also be something physical like unplugging a network cable.

Expected behavior

Doing something similar using the regread.go example from github.com/gopcua/opcua does not result in panic and shows the following error:

RegisterNodes failed: The operation could not complete because the client is not connected to the server. StatusBadServerNotConnected (0x800D0000)

Actual behavior

Panic, see the Telegraf log.

Additional info

The cause is probably somewhere in the code of the Telegraf plugin since the panic is not reproducible with the example from the gopcua library.

@R290 R290 added the bug unexpected problem or unintended behavior label Feb 6, 2022
@powersj
Copy link
Contributor

powersj commented Feb 7, 2022

Hi,

When you tried to reproduce this with the client library what version of the library did you use? Were you using the same v0.2.3 Telegraf uses?

To break down the last three calls that were made:

github.com/influxdata/telegraf/plugins/inputs/opcua.disconnect(0xc000406b60)
        /go/src/github.com/influxdata/telegraf/plugins/inputs/opcua/opcua_client.go:525 +0x74

This is the last call from Telegraf code and it is from here. Telegraf is attempting to disconnect and calls the OPC UA library close function.

github.com/gopcua/opcua.(*Client).Close(0xc0000d5c70)
        /go/pkg/mod/github.com/gopcua/opcua@v0.2.3/client.go:523 +0x7d

We are now outside telegraf in the OPC UA client.go also trying to close the connection.

github.com/gopcua/opcua/uacp.(*Conn).Close(0xc0000d5d10)
        /go/pkg/mod/github.com/gopcua/opcua@v0.2.3/uacp/conn.go:211 +0x36

The final line of code in the trace is the UACP connection close.

The path has had a few updates, including one only a couple of days ago, which is why I was interested in what version of the library you used.

Thanks!

@R290
Copy link
Contributor Author

R290 commented Feb 10, 2022

I tried both v0.2.3 and the latest v0.3.1, same result. The example does not panic since it does not call client.Close() (it does not contain any state management). After killing the container o.client.conn is nil which results in the expected panic. The main cause for this seems to be the assumption that is made here:

err := o.getData()
if err != nil && o.state == Connected {
o.state = Disconnected
// Ignore returned error to not mask the original problem
//nolint:errcheck,revive
disconnect(o)
return err
}

We assume that if we get an error, we still need to close the connection manually, while it is probably already closed. The error set by o.getData() matches the error from the example. A possible solution is to not close the client at all or create an issue at the opcua library to add a check for c.conn != nil at: https://github.com/gopcua/opcua/blob/486ed60873a3c6ae265036adc18d0ea7280e8663/client.go#L588

@powersj
Copy link
Contributor

powersj commented Feb 11, 2022

@magiconair what do you think about this issue? I am inclined to request that this get fixed upstream, but if you think Telegraf should not try to disconnect, we can also make that change.

Thanks!

@powersj
Copy link
Contributor

powersj commented Mar 7, 2022

I am of the opinion that this should be done in the library as others could run into this very issue. I have put up gopcua/opcua#562 (after fighting some weird spacing tab/space issue).

@powersj powersj closed this as completed Mar 7, 2022
@magiconair
Copy link

magiconair commented Apr 8, 2022

I've merged the fix today and released https://github.com/gopcua/opcua/releases/tag/v0.3.3

@powersj
Copy link
Contributor

powersj commented Apr 8, 2022

Thank you!

@R290
Copy link
Contributor Author

R290 commented Apr 10, 2022

@powersj should we open a pull request to update gopcua to v0.3.3? In the current state the issue is not really fixed from the Telegraf perspective.

@powersj
Copy link
Contributor

powersj commented Apr 10, 2022

@R290 please do

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/opcua bug unexpected problem or unintended behavior
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants