When a RR does not exist anymore telegraf inputs.dns_query produces errors in its logfile. #14941

maintain3r · 2024-03-05T21:25:08Z

Relevant telegraf.conf

[[outputs.prometheus_client]]
  listen = ":9303"
  path = "/metrics"
  collectors_exclude = ["gocollector", "process"]
  export_timestamp = false

[agent]
  interval = "10s"
  round_interval = true
  metric_batch_size = 1000
  metric_buffer_limit = 10000
  collection_jitter = "0s"
  flush_interval = "5s"
  flush_jitter = "0s"
  precision = ""
  debug = false
  quiet = false
  logfile = ""
  hostname = ""
  omit_hostname = false

[[inputs.dns_query]]
  servers = ["1.1.1.1"]
  network = "udp"
  timeout = 1
  port = 53
  record_type = "A"
  domains = ["qwe.example.com"]

Logs from Telegraf

2024-03-05T21:09:20Z E! [inputs.dns_query] Error in plugin: Invalid answer (NXDOMAIN) from 1.1.1.1 after A query for qwe.example.com

System info

Telegraf 1.12.6

Docker

No response

Steps to reproduce

Get Telegraf 1.12.6 and use config provided in this ticket.
Run Telegraf with the config provided; the Telegraf Docker image works too.
Check for the logs coming out of telegraf.

Expected behavior

Expected behaviour is just to reflect the fact in the metric dns_query_rcode_value with rcode="NXDOMAIN" and result="error".
Maybe there's a way to prevent the plugin from logging?

Actual behavior

2024-03-05T21:09:20Z E! [inputs.dns_query] Error in plugin: Invalid answer (NXDOMAIN) from 1.1.1.1 after A query for qwe.example.com

Additional info

Many thanks!

powersj · 2024-03-05T21:29:17Z

Hi,

Get Telegraf 1.12.6 and use config provided in this ticket.

That version is many, many years old. Can you please update the version you are using? Based on my local build of master I get the following results:

dns_query,domain=qwe.example.com,host=ryzen,rcode=NXDOMAIN,record_type=A,result=error,server=1.1.1.1 query_time_ms=105.99883,result_code=2i,rcode_value=3i 1709674110000000000
2024-03-05T21:28:29Z D! [outputs.file] Wrote batch of 1 metrics in 21.2µs
2024-03-05T21:28:29Z D! [outputs.file] Buffer fullness: 0 / 10000 metrics

maintain3r · 2024-03-08T17:11:52Z

Hi @powersj thanks for getting back. I checked with the same config and the latest version of telegraf and get the same results. Please have a look at the following:

root@node_exp:~# cat /tmp/aaa.log
2024-03-08T17:08:00Z I! Starting Telegraf 1.29.5 brought to you by InfluxData the makers of InfluxDB
2024-03-08T17:08:00Z I! Available plugins: 241 inputs, 9 aggregators, 30 processors, 24 parsers, 60 outputs, 6 secret-stores
2024-03-08T17:08:00Z I! Loaded inputs: dns_query (2x)
2024-03-08T17:08:00Z I! Loaded aggregators:
2024-03-08T17:08:00Z I! Loaded processors:
2024-03-08T17:08:00Z I! Loaded secretstores:
2024-03-08T17:08:00Z I! Loaded outputs: prometheus_client
2024-03-08T17:08:00Z I! Tags enabled: host=node_exp
2024-03-08T17:08:00Z I! [agent] Config: Interval:10s, Quiet:false, Hostname:"node_exp", Flush Interval:5s
2024-03-08T17:08:00Z I! [outputs.prometheus_client] Listening on http://[::]:4444/metrics
2024-03-08T17:08:10Z E! [inputs.dns_query] Error in plugin: invalid answer (NXDOMAIN) from 1.1.1.1 after A query for qwe.example.com
2024-03-08T17:08:20Z E! [inputs.dns_query] Error in plugin: invalid answer (NXDOMAIN) from 1.1.1.1 after A query for qwe.example.com
2024-03-08T17:08:30Z E! [inputs.dns_query] Error in plugin: invalid answer (NXDOMAIN) from 1.1.1.1 after A query for qwe.example.com
2024-03-08T17:08:40Z E! [inputs.dns_query] Error in plugin: invalid answer (NXDOMAIN) from 1.1.1.1 after A query for qwe.example.com
2024-03-08T17:08:50Z E! [inputs.dns_query] Error in plugin: invalid answer (NXDOMAIN) from 1.1.1.1 after A query for qwe.example.com
2024-03-08T17:09:00Z E! [inputs.dns_query] Error in plugin: invalid answer (NXDOMAIN) from 1.1.1.1 after A query for qwe.example.com

$ telegraf --version
Telegraf 1.29.5 (git: HEAD@138d0d54)

powersj · 2024-03-08T17:28:26Z

To remind myself of this issue I re-ran this, and I do now see:

2024-03-08T17:25:20Z E! [inputs.dns_query] Error in plugin: invalid answer (NXDOMAIN) from 1.1.1.1 after A query for qwe.example.com
dns_query,domain=qwe.example.com,rcode=NXDOMAIN,record_type=A,result=error,server=1.1.1.1 query_time_ms=101.723123,result_code=2i,rcode_value=3i 1709918720000000000
2024-03-08T17:25:25Z D! [outputs.file] Wrote batch of 1 metrics in 49.11µs
2024-03-08T17:25:25Z D! [outputs.file] Buffer fullness: 0 / 10000 metrics
2024-03-08T17:25:30Z E! [inputs.dns_query] Error in plugin: invalid answer (NXDOMAIN) from 1.1.1.1 after A query for qwe.example.com
dns_query,domain=qwe.example.com,rcode=NXDOMAIN,record_type=A,result=error,server=1.1.1.1 query_time_ms=4.939483,result_code=2i,rcode_value=3i 1709918730000000000
2024-03-08T17:25:35Z D! [outputs.file] Wrote batch of 1 metrics in 45.97µs
2024-03-08T17:25:35Z D! [outputs.file] Buffer fullness: 0 / 10000 metrics

It appears that the plugin will produce an error any time there is a non-success response code. I'm still not clear that this is actually an issue, though, as we are taking the opportunity to tell the user that something is up. In this case, you are trying to look up something that doesn't exist, so it probably makes sense to error, no?

srebhan · 2024-03-11T17:55:51Z

@powersj maybe we should log the error only once?

powersj · 2024-03-11T18:03:05Z

@powersj maybe we should log the error only once?

I am still not clear on what the use-case of this issue is. Does @maintain3r know the endpoint does not exist and does not want errors in the logs? Or does this domain come and go? Why would you not want the error in the first place?

If anything, my initial thought was to potentially allow filtering out certain error codes from the check. So if the user did not want to see NXDOMAIN, we would skip printing an error?

maintain3r · 2024-03-13T14:21:08Z

Hello Team,
Sorry if I wasn't clear, will do better! My goal is to avoid polluting the log file with lines where telegraf complains that it's not able to resolve a specific name. Like in the config I provided above, let's pretend I have qwe.example.com that I want to keep track of. Simply put, I want to make sure that this A RR always exists. And in the case where this RR gets deleted I should not rely on logs; I want to rely on a value my metrics provide. If I see a change from A->B I trigger an alert.
I'll also attach some gifs to make things clear.

root@node_exp:/etc/telegraf# telegraf --version
Telegraf 1.31.0-f9d24e96 (git: pull/14979@f9d24e96)
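
For illustration, the metrics-based check described in this comment (alert when the value changes from A to B) might look like the sketch below. It is not part of Telegraf. The function names `parse_line` and `rcode_changed` are hypothetical, the parser handles only simple line-protocol lines without escaped spaces or commas, and the `BEFORE` sample is an assumed successful result; only the `AFTER` line is taken from the output earlier in this thread.

```python
def parse_line(line: str) -> dict:
    """Parse one simple line-protocol line (no escaped spaces or commas)."""
    head, field_part, timestamp = line.strip().split(" ")
    measurement, *tag_pairs = head.split(",")
    tags = dict(pair.split("=", 1) for pair in tag_pairs)
    fields = {}
    for pair in field_part.split(","):
        key, raw = pair.split("=", 1)
        # Integer fields carry a trailing "i"; the others here are floats.
        fields[key] = int(raw[:-1]) if raw.endswith("i") else float(raw)
    return {
        "measurement": measurement,
        "tags": tags,
        "fields": fields,
        "timestamp": int(timestamp),
    }


def rcode_changed(previous: dict, current: dict) -> bool:
    """Return True when the DNS response code differs between two samples."""
    return previous["fields"]["rcode_value"] != current["fields"]["rcode_value"]


# Assumed sample of a successful lookup (hypothetical, not from the thread).
BEFORE = (
    "dns_query,domain=qwe.example.com,rcode=NOERROR,record_type=A,"
    "result=success,server=1.1.1.1 "
    "query_time_ms=5.1,result_code=0i,rcode_value=0i 1709918710000000000"
)
# Sample copied from the output posted earlier in this thread.
AFTER = (
    "dns_query,domain=qwe.example.com,rcode=NXDOMAIN,record_type=A,"
    "result=error,server=1.1.1.1 "
    "query_time_ms=101.723123,result_code=2i,rcode_value=3i 1709918720000000000"
)

if __name__ == "__main__":
    before, after = parse_line(BEFORE), parse_line(AFTER)
    if rcode_changed(before, after):
        print("ALERT:", after["tags"]["domain"], "now returns", after["tags"]["rcode"])
```

The comparison uses the numeric `rcode_value` field, so the alert does not depend on anything written to the log file.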

powersj · 2024-03-13T15:11:08Z

My goal is to avoid polluting the log file with lines where telegraf complains that it's not able to resolve a specific name.

Why?

From my perspective it is just another line in the log that you can ignore if you don't need to worry about it. Other users who typo a domain name may really need to see that log message so they can go in and fix the hostname.

If I see a change from A->B I trigger an alert.

Are your alerts based on logs or metrics?

maintain3r · 2024-03-13T15:23:36Z

Unless I'm missing something, I see no value in periodically writing an error string to the log file that repeats what is already visible in the metric. In the example I provided I used only the Cloudflare DNS server and a single name to check, but if I want to test the same record against 3-4 different external dns servers I get even more noise, not to mention that I can have more RRs which, over time or by mistake, may get deleted by DNS zone admins, making things even noisier.
And, yeah, in my case alerts are based on the metrics I extract.

Hope it makes sense :)
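
To put a rough number on the noise described here, the sketch below estimates how many error lines accumulate per hour. It assumes one error line per failing (server, domain) pair per collection interval, which matches the logs posted in this thread; the function name `error_lines_per_hour` is hypothetical.

```python
def error_lines_per_hour(servers: int, failing_domains: int, interval_seconds: int) -> int:
    """Estimate error lines per hour, assuming one line per
    (server, failing domain) pair on every collection interval."""
    gathers_per_hour = 3600 // interval_seconds
    return servers * failing_domains * gathers_per_hour


if __name__ == "__main__":
    # The config in this ticket: one server, one missing record, 10s interval.
    print(error_lines_per_hour(servers=1, failing_domains=1, interval_seconds=10))
    # The same record checked against four servers.
    print(error_lines_per_hour(servers=4, failing_domains=1, interval_seconds=10))
```

Under that assumption, the single-server config from this ticket yields 360 lines per hour, and four servers yield 1440.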

powersj · 2024-03-13T15:29:09Z

3-4 different external dns servers I get even more noise,

I can understand, except this noise to you is a legitimate call to action for another user.

Even if we only logged the error once my concern is this starts to hide or make it more difficult for others to view legit issues because they see it once in the logs and then go "well it must have gone away and isn't an issue anymore".

I am not convinced we should remove the messages, nor should we log only once. I would consider a filter, because that is opt-in.
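
As a rough sketch of the opt-in filter idea: the metric is always emitted, and an error is logged for every non-success response code unless the user has explicitly listed that code to be ignored. This is not Telegraf's implementation (Telegraf is written in Go), and `Answer`, `ignored_rcodes`, and `errors_to_log` are hypothetical names chosen for illustration only.

```python
from dataclasses import dataclass


@dataclass
class Answer:
    domain: str
    server: str
    record_type: str
    rcode: str  # e.g. "NOERROR", "NXDOMAIN", "SERVFAIL"


def error_message(answer: Answer) -> str:
    """Format the message in the style seen in the logs above."""
    return (
        f"invalid answer ({answer.rcode}) from {answer.server} "
        f"after {answer.record_type} query for {answer.domain}"
    )


def errors_to_log(answers, ignored_rcodes=frozenset()):
    """Return the error messages that would still be logged.

    By default nothing is ignored, so existing users keep seeing every
    non-success response; filtering happens only when the user opts in.
    """
    messages = []
    for answer in answers:
        if answer.rcode == "NOERROR":
            continue  # success: nothing to report
        if answer.rcode in ignored_rcodes:
            continue  # user opted out of this code; the metric still carries it
        messages.append(error_message(answer))
    return messages


if __name__ == "__main__":
    answers = [
        Answer("qwe.example.com", "1.1.1.1", "A", "NXDOMAIN"),
        Answer("example.com", "1.1.1.1", "A", "SERVFAIL"),
    ]
    for line in errors_to_log(answers, ignored_rcodes={"NXDOMAIN"}):
        print(line)
```

Because the default is an empty ignore set, a user who mistypes a domain still sees the NXDOMAIN error, while a user who alerts on metrics can silence it.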

maintain3r · 2024-03-13T15:34:45Z

@powersj That's fair! And maybe having a knob that will turn the logging on and off (on the plugin level only) will make things better?

maintain3r added the bug unexpected problem or unintended behavior label Mar 5, 2024

maintain3r changed the title ~~When a RR does not exist anymore telegraf produces errors in its logfile.~~ When a RR does not exist anymore telegraf inputs.dns_query produces errors in its logfile. Mar 5, 2024

powersj added the waiting for response waiting for response from contributor label Mar 5, 2024

telegraf-tiger bot removed the waiting for response waiting for response from contributor label Mar 8, 2024

powersj added the waiting for response waiting for response from contributor label Mar 8, 2024

telegraf-tiger bot removed the waiting for response waiting for response from contributor label Mar 11, 2024

powersj added the waiting for response waiting for response from contributor label Mar 12, 2024

telegraf-tiger bot removed the waiting for response waiting for response from contributor label Mar 13, 2024

powersj added the waiting for response waiting for response from contributor label Mar 13, 2024

telegraf-tiger bot removed the waiting for response waiting for response from contributor label Mar 13, 2024

powersj added a commit to powersj/telegraf that referenced this issue Mar 14, 2024

feat(inputs.dns_query): Allow ignoring errors of specific types

ac306bd

Allows the user to specify ignoring certain error types from printing in the logs. fixes: influxdata#14941

powersj mentioned this issue Mar 14, 2024

feat(inputs.dns_query): Allow ignoring errors of specific types #14992

Merged

1 task

powersj added a commit to powersj/telegraf that referenced this issue Mar 19, 2024

fix(inputs.dns_query): Omit certain errors from logs

534df73

fixes: influxdata#14941

DStrand1 closed this as completed in #14992 Mar 20, 2024
