OPCUA Plugin crashes Telegraf (again) #13834

Closed
default-student opened this issue Aug 28, 2023 · 4 comments · Fixed by #13840
Labels
bug (unexpected problem or unintended behavior), help wanted (request for community participation, code, contribution), size/s (1 day effort, great beginner issue)

Comments

@default-student

Relevant telegraf.conf

# Configuration for telegraf agent
[agent]
  ## Default data collection interval for all inputs
  interval = "10s"
  ## Rounds collection interval to 'interval'
  ## ie, if interval="10s" then always collect on :00, :10, :20, etc.
  round_interval = true

  ## Telegraf will send metrics to outputs in batches of at most
  ## metric_batch_size metrics.
  ## This controls the size of writes that Telegraf sends to output plugins.
  metric_batch_size = 1000

  ## For failed writes, telegraf will cache metric_buffer_limit metrics for each
  ## output, and will flush this buffer on a successful write. Oldest metrics
  ## are dropped first when this buffer fills.
  ## This buffer only fills when writes fail to output plugin(s).
  metric_buffer_limit = 10000

  ## Collection jitter is used to jitter the collection by a random amount.
  ## Each plugin will sleep for a random time within jitter before collecting.
  ## This can be used to avoid many plugins querying things like sysfs at the
  ## same time, which can have a measurable effect on the system.
  collection_jitter = "0s"

  ## Default flushing interval for all outputs. Maximum flush_interval will be
  ## flush_interval + flush_jitter
  flush_interval = "10s"
  ## Jitter the flush interval by a random amount. This is primarily to avoid
  ## large write spikes for users running a large number of telegraf instances.
  ## ie, a jitter of 5s and interval 10s means flushes will happen every 10-15s
  flush_jitter = "0s"

  ## By default or when set to "0s", precision will be set to the same
  ## timestamp order as the collection interval, with the maximum being 1s.
  ##   ie, when interval = "10s", precision will be "1s"
  ##       when interval = "250ms", precision will be "1ms"
  ## Precision will NOT be used for service inputs. It is up to each individual
  ## service input to set the timestamp at the appropriate precision.
  ## Valid time units are "ns", "us" (or "µs"), "ms", "s".
  precision = ""

  ## Logging configuration:
  ## Run telegraf with debug log messages.
  debug = false
  ## Run telegraf in quiet mode (error log messages only).
  quiet = false
  ## Specify the log file name. The empty string means to log to stderr.
  logfile = ""

  ## Override default hostname, if empty use os.Hostname()
  hostname = ""
  ## If set to true, do not set the "host" tag in the telegraf agent.
  omit_hostname = false


[[outputs.influxdb_v2]]	
  ## The URLs of the InfluxDB cluster nodes.
  ##
  ## Multiple URLs can be specified for a single cluster, only ONE of the
  ## urls will be written to each interval.
  ## urls exp: http://127.0.0.1:8086
  urls = ["http://influxdb:8086"]
  ## works because of the docker network dns and container name tag in docker compose

  ## Token for authentication.
  token = "${DOCKER_INFLUXDB_INIT_ADMIN_TOKEN}"
  
  ## Organization is the name of the organization you wish to write to; must exist.
  organization = "${DOCKER_INFLUXDB_INIT_ORG}"
  
  ## Destination bucket to write into.
  bucket = "${DOCKER_INFLUXDB_INIT_BUCKET}"

  insecure_skip_verify = true

[[inputs.mqtt_consumer]]
  servers = ["mqtt://172.16.160.171:1883"]
  topics = [
    "#",
  ]
  ## The message topic will be stored in a tag specified by this value.  If set
  ## to the empty string no topic tag will be created.
  topic_tag = "mqtt"
#   username = "mqtt"
#   password = "mccutee"
  ## https://github.com/influxdata/telegraf/blob/master/docs/DATA_FORMATS_INPUT.md
  # data_format = "influx"
  data_format = "value"
  data_type = "string"
  ## Enable extracting tag values from MQTT topics
  ## _ denotes an ignored entry in the topic path
  # [[inputs.mqtt_consumer.topic_parsing]]
  #   topic = ""
  #   measurement = ""
  #   tags = ""
  #   fields = ""
  ## Value supported is int, float, uint
  #   [[inputs.mqtt_consumer.topic.types]]
  #      key = type
  
# Retrieve data from OPCUA devices
[[inputs.opcua]]
  ## Metric name
  # name = "opcua"
  #
  ## OPC UA Endpoint URL
  endpoint = "opc.tcp://172.16.184.15:4840"
  security_policy = "None"
  security_mode = "None"
  #
  ## Path to cert.pem. Required when security mode or policy isn't "None".
  ## If cert path is not supplied, self-signed cert and key will be generated.
  certificate = "None"
  #
  ## Path to private key.pem. Required when security mode or policy isn't "None".
  ## If key path is not supplied, self-signed cert and key will be generated.
  # private_key = "/etc/telegraf/key.pem"
  #
  ## Authentication Method, one of "Certificate", "UserName", or "Anonymous".  To
  ## authenticate using a specific ID, select 'Certificate' or 'UserName'
  auth_method = "Anonymous"
  #
  ## Username. Required for auth_method = "UserName"
  username = "None"
  #
  ## Password. Required for auth_method = "UserName"
  password = "None"
  #
  ## Option to select the metric timestamp to use. Valid options are:
  ##     "gather" -- uses the time of receiving the data in telegraf
  ##     "server" -- uses the timestamp provided by the server
  ##     "source" -- uses the timestamp provided by the source
  # timestamp = "gather"
  #
  ## Node ID configuration
  ## name              - field name to use in the output
  ## namespace         - OPC UA namespace of the node (integer value 0 thru 3)
  ## identifier_type   - OPC UA ID type (s=string, i=numeric, g=guid, b=opaque)
  ## identifier        - OPC UA ID (tag as shown in opcua browser)
  ## tags              - extra tags to be added to the output metric (optional)
  ## Example:
  ## {name="ProductUri", namespace="0", identifier_type="i", identifier="2262", tags=[["tag1","value1"],["tag2","value2"]]}
  # nodes = [
  #  {name="", namespace="", identifier_type="", identifier=""},
  #  {name="", namespace="", identifier_type="", identifier=""},
  #]
  #
  ## Node Group
  ## Sets defaults for OPC UA namespace and ID type so they aren't required in
  ## every node.  A group can also have a metric name that overrides the main
  ## plugin metric name.
  ##
  ## Multiple node groups are allowed
  #[[inputs.opcua.group]]
  ## Group Metric name. Overrides the top level name.  If unset, the
  ## top level name is used.
  # name =
  #
  ## Group default namespace. If a node in the group doesn't set its
  ## namespace, this is used.
  # namespace =
  #
  ## Group default identifier type. If a node in the group doesn't set its
  ## identifier type, this is used.
  # identifier_type =
  #
  ## Node ID Configuration.  Array of nodes with the same settings as above.
  # nodes = [
  #  {name="", namespace="", identifier_type="", identifier=""},
  #  {name="", namespace="", identifier_type="", identifier=""},
  #]

Logs from Telegraf

influxdb-telegraf | running telegraf from the config with the id: $id
influxdb-telegraf | 2023-08-28T13:10:20Z I! Loading config: http://influxdb:8086/api/v2/telegrafs/0bbb1067d9354000
influxdb-telegraf | 2023-08-28T13:10:20Z I! Starting Telegraf 1.27.4
influxdb-telegraf | 2023-08-28T13:10:20Z I! Available plugins: 237 inputs, 9 aggregators, 28 processors, 23 parsers, 59 outputs, 4 secret-stores
influxdb-telegraf | 2023-08-28T13:10:20Z I! Loaded inputs: mqtt_consumer opcua
influxdb-telegraf | 2023-08-28T13:10:20Z I! Loaded aggregators: 
influxdb-telegraf | 2023-08-28T13:10:20Z I! Loaded processors: 
influxdb-telegraf | 2023-08-28T13:10:20Z I! Loaded secretstores: 
influxdb-telegraf | 2023-08-28T13:10:20Z I! Loaded outputs: influxdb_v2
influxdb-telegraf | 2023-08-28T13:10:20Z I! Tags enabled: host=fa058f66ce32
influxdb-telegraf | 2023-08-28T13:10:20Z I! [agent] Config: Interval:10s, Quiet:false, Hostname:"fa058f66ce32", Flush Interval:10s
influxdb-telegraf | 2023-08-28T13:10:20Z I! [inputs.mqtt_consumer] Connected [mqtt://172.16.160.171:1883]
influxdb2   | ts=2023-08-28T13:10:29.661448Z lvl=info msg="index opened with 8 partitions" log_id=0jw46I9G000 service=storage-engine index=tsi
influxdb2   | ts=2023-08-28T13:10:29.662652Z lvl=info msg="loading changes (start)" log_id=0jw46I9G000 service=storage-engine engine=tsm1 op_name="field indices" op_event=start
influxdb2   | ts=2023-08-28T13:10:29.662966Z lvl=info msg="loading changes (end)" log_id=0jw46I9G000 service=storage-engine engine=tsm1 op_name="field indices" op_event=end op_elapsed=0.316ms
influxdb2   | ts=2023-08-28T13:10:29.663916Z lvl=info msg="Reindexing TSM data" log_id=0jw46I9G000 service=storage-engine engine=tsm1 db_shard_id=1
influxdb2   | ts=2023-08-28T13:10:29.663928Z lvl=info msg="Reindexing WAL data" log_id=0jw46I9G000 service=storage-engine engine=tsm1 db_shard_id=1
influxdb2   | ts=2023-08-28T13:10:29.702091Z lvl=info msg="saving field index changes (start)" log_id=0jw46I9G000 service=storage-engine engine=tsm1 op_name=MeasurementFieldSet op_event=start
influxdb2   | ts=2023-08-28T13:10:29.704149Z lvl=info msg="saving field index changes (end)" log_id=0jw46I9G000 service=storage-engine engine=tsm1 op_name=MeasurementFieldSet op_event=end op_elapsed=2.075ms
influxdb-telegraf | 2023-08-28T13:10:30Z W! [inputs.opcua] Failed to load certificate: open None: no such file or directory
influxdb-telegraf | 2023-08-28T13:10:30Z E! [inputs.opcua] Error in plugin: registering nodes failed: There was nothing to do because the client passed a list of operations with no elements. StatusBadNothingToDo (0x800F0000)
influxdb-telegraf | panic: runtime error: invalid memory address or nil pointer dereference
influxdb-telegraf | [signal SIGSEGV: segmentation violation code=0x1 addr=0x20 pc=0x2fb5055]
influxdb-telegraf | 
influxdb-telegraf | goroutine 74 [running]:
influxdb-telegraf | github.com/gopcua/opcua.(*Client).ReadWithContext(0x10?, {0x805dcc8, 0xc0001c0040}, 0x0)
influxdb-telegraf |     /go/pkg/mod/github.com/gopcua/opcua@v0.4.0/client.go:1046 +0x75
influxdb-telegraf | github.com/gopcua/opcua.(*Client).Read(0x10?, 0xc000101800?)
influxdb-telegraf |     /go/pkg/mod/github.com/gopcua/opcua@v0.4.0/client.go:1040 +0x2a
influxdb-telegraf | github.com/influxdata/telegraf/plugins/inputs/opcua.(*ReadClient).read(0xc000136a00)
influxdb-telegraf |     /go/src/github.com/influxdata/telegraf/plugins/inputs/opcua/read_client.go:136 +0x33
influxdb-telegraf | github.com/influxdata/telegraf/plugins/inputs/opcua.(*ReadClient).CurrentValues(0xc000136a00)
influxdb-telegraf |     /go/src/github.com/influxdata/telegraf/plugins/inputs/opcua/read_client.go:112 +0xd2
influxdb-telegraf | github.com/influxdata/telegraf/plugins/inputs/opcua.(*OpcUA).Gather(0xc001ebe780?, {0x808a380, 0xc000ac79a0})
influxdb-telegraf |     /go/src/github.com/influxdata/telegraf/plugins/inputs/opcua/opcua.go:38 +0x2e
influxdb-telegraf | github.com/influxdata/telegraf/models.(*RunningInput).Gather(0xc00012c460, {0x808a380, 0xc000ac79a0})
influxdb-telegraf |     /go/src/github.com/influxdata/telegraf/models/running_input.go:144 +0x5a
influxdb-telegraf | github.com/influxdata/telegraf/agent.(*Agent).gatherOnce.func1()
influxdb-telegraf |     /go/src/github.com/influxdata/telegraf/agent/agent.go:575 +0x2e
influxdb-telegraf | created by github.com/influxdata/telegraf/agent.(*Agent).gatherOnce
influxdb-telegraf |     /go/src/github.com/influxdata/telegraf/agent/agent.go:574 +0x12a
influxdb-telegraf exited with code 2
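
A reading of the trace above, offered as an interpretation rather than a confirmed diagnosis: the last argument to `ReadWithContext` is printed as `0x0`, which suggests the read request was nil when it was dereferenced. One plausible path is that the request is only built after node registration succeeds, so a failed registration leaves it unset and the next gather dereferences it. The Go sketch below reproduces that pattern with hypothetical stand-in types (`reader`, `readRequest`), not the gopcua or Telegraf types, and shows a guard that returns an error instead of panicking.

```go
package main

import (
	"errors"
	"fmt"
)

// readRequest and reader are hypothetical stand-ins for illustration only.
type readRequest struct {
	nodeIDs []string
}

type reader struct {
	req *readRequest // stays nil until connect succeeds
}

// connect builds the request only after "registration" succeeds.
// An empty node list fails, mirroring the StatusBadNothingToDo case.
func (r *reader) connect(nodeIDs []string) error {
	if len(nodeIDs) == 0 {
		return errors.New("registering nodes failed: nothing to do")
	}
	r.req = &readRequest{nodeIDs: nodeIDs}
	return nil
}

// read checks for a request that was never built and reports an error
// instead of dereferencing a nil pointer.
func (r *reader) read() (int, error) {
	if r.req == nil {
		return 0, errors.New("read request not initialized: connect did not complete")
	}
	return len(r.req.nodeIDs), nil
}

func main() {
	r := &reader{}
	if err := r.connect(nil); err != nil {
		fmt.Println("connect:", err)
	}
	// Without the guard in read, this call would be the crash site.
	if _, err := r.read(); err != nil {
		fmt.Println("read:", err)
	}
}
```

With a guard like this the plugin would log an error on each gather rather than take down the whole agent.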

System info

Telegraf 1.27.4, Docker version 20.10.17, build 100c701

Docker

influxdata/influxdata-docker#703

Steps to reproduce

Start my docker compose.

Expected behavior

Collection of metrics from the OPC UA server

Actual behavior

Error in plugin: registering nodes failed: There was nothing to do because the client passed a list of operations with no elements. StatusBadNothingToDo (0x800F0000)

Telegraf crashes

Additional info

Same fault, but triggered by unplugging and replugging the network connection:
#13260
Similar fault, solved there by downgrading (that doesn't work here):
#10140

default-student added the bug label Aug 28, 2023
@powersj
Contributor

powersj commented Aug 28, 2023

Hi,

Error in plugin: registering nodes failed: There was nothing to do because the client passed a list of operations with no elements. StatusBadNothingToDo (0x800F0000)

As the error message states, you have not provided the client with anything to do. There are no nodes or groups to monitor. That said, we should catch this during the Init phase and fail to start, rather than crash.
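
A minimal sketch of the kind of startup check described here, assuming simplified configuration types. `Config`, `GroupSettings`, `NodeSettings` and `Validate` are hypothetical names for illustration; the plugin's real structs and the actual fix may look different.

```go
package main

import (
	"errors"
	"fmt"
)

// NodeSettings, GroupSettings and Config are hypothetical, simplified
// stand-ins for the plugin's configuration; real field names may differ.
type NodeSettings struct {
	Name       string
	Identifier string
}

type GroupSettings struct {
	Name  string
	Nodes []NodeSettings
}

type Config struct {
	RootNodes []NodeSettings
	Groups    []GroupSettings
}

var errNoNodes = errors.New("no nodes or groups configured: nothing to read")

// Validate rejects a configuration that gives the client nothing to read,
// so the problem surfaces at startup rather than at the first gather.
func (c *Config) Validate() error {
	if len(c.RootNodes) > 0 {
		return nil
	}
	for _, g := range c.Groups {
		if len(g.Nodes) > 0 {
			return nil
		}
	}
	return errNoNodes
}

func main() {
	empty := &Config{}
	fmt.Println("empty config:", empty.Validate())

	ok := &Config{Groups: []GroupSettings{{
		Name:  "Flow Sensor",
		Nodes: []NodeSettings{{Name: "Energy", Identifier: "6012"}},
	}}}
	fmt.Println("one group node:", ok.Validate())
}
```

Note that a group with an empty node list is treated the same as no group at all, since it also leaves the client with nothing to read.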

powersj added the help wanted and size/s labels Aug 28, 2023
@default-student
Author

default-student commented Aug 29, 2023

Thank you. The crash is quite bad for production, but at least your hint about the missing config fixed my issue.
However, I should note that there seem to be many cases that make Telegraf exit, as well as missing error handling around the opcua plugin. We will try to work around those.

For future reference: although it is not stated explicitly in the example config, many of its lines are required:

[[inputs.opcua]]
  name = "opcua"
  endpoint = "opc.tcp://172.16.184.15:4840"
  security_policy = "None"
  security_mode = "None"
  auth_method = "Anonymous"
  password = "None"
  timestamp = "gather"
  [[inputs.opcua.group]]
    name = "Flow Sensor"
    namespace = "2"
    identifier_type = "i"
    nodes = [
        {name="Energy", identifier="6012"},
        {name="FlowVelocity", identifier="6003"},
        {name="Mass", identifier="6004"},
        {name="MassFlowRate", identifier="6002"},
        {name="Pressure", identifier="6013"},
        {name="Temperature", identifier="6014"},
        {name="Volume", identifier="6038"},
        {name="VolumentricFlowRate", identifier="6010"},
    ]
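
To illustrate how the group defaults in a config like the one above resolve, here is a small Go sketch. It assumes that a node inherits `namespace` and `identifier_type` from its group when it leaves them unset, as the sample config comments describe, and formats the result in the standard OPC UA node ID notation (`ns=2;i=6012`). The `Group` and `Node` types and the `NodeIDs` method are hypothetical and are not the plugin's own code.

```go
package main

import "fmt"

// Node and Group are hypothetical, simplified types used only for this sketch.
type Node struct {
	Name           string
	Namespace      string
	IdentifierType string
	Identifier     string
}

type Group struct {
	Name           string
	Namespace      string
	IdentifierType string
	Nodes          []Node
}

// NodeIDs applies the group's defaults to nodes that leave namespace or
// identifier type unset and returns OPC UA node ID strings.
func (g Group) NodeIDs() []string {
	ids := make([]string, 0, len(g.Nodes))
	for _, n := range g.Nodes {
		ns, idType := n.Namespace, n.IdentifierType
		if ns == "" {
			ns = g.Namespace
		}
		if idType == "" {
			idType = g.IdentifierType
		}
		ids = append(ids, fmt.Sprintf("ns=%s;%s=%s", ns, idType, n.Identifier))
	}
	return ids
}

func main() {
	g := Group{
		Name:           "Flow Sensor",
		Namespace:      "2",
		IdentifierType: "i",
		Nodes: []Node{
			{Name: "Energy", Identifier: "6012"},
			{Name: "Temperature", Identifier: "6014"},
		},
	}
	for _, id := range g.NodeIDs() {
		fmt.Println(id)
	}
}
```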

@powersj
Contributor

powersj commented Aug 29, 2023

However I should note that there seem to be many cases that crash telegraf completely.

Many?

powersj added a commit to powersj/telegraf that referenced this issue Aug 29, 2023
Provides a check to a user's config to ensure that they have provided a
group and or root node to collect data from.

fixes: influxdata#13834
@default-student
Author

default-student commented Aug 30, 2023

Many?

Excuse my generalization; Telegraf is an impressive feat with an enormous number of options. I was referring to the missing error handling around the whole opcua plugin, as mentioned in other issues, as well as the exit on error (discussed in #11313 (comment) and #10694 (comment)). I will correct this.

Although the fact that Telegraf crashes when the plugin meets an error persists, I would agree to close this issue now.
