When a RR does not exist anymore telegraf inputs.dns_query produces errors in its logfile. #14941

Closed
maintain3r opened this issue Mar 5, 2024 · 10 comments · Fixed by #14992
Labels
bug unexpected problem or unintended behavior

Comments

@maintain3r

Relevant telegraf.conf

[[outputs.prometheus_client]]
  listen = ":9303"
  path = "/metrics"
  collectors_exclude = ["gocollector", "process"]
  export_timestamp = false

[agent]
  interval = "10s"
  round_interval = true
  metric_batch_size = 1000
  metric_buffer_limit = 10000
  collection_jitter = "0s"
  flush_interval = "5s"
  flush_jitter = "0s"
  precision = ""
  debug = false
  quiet = false
  logfile = ""
  hostname = ""
  omit_hostname = false

[[inputs.dns_query]]
  servers = ["1.1.1.1"]
  network = "udp"
  timeout = 1
  port = 53
  record_type = "A"
  domains = [
              "qwe.example.com",
            ]

Logs from Telegraf

2024-03-05T21:09:20Z E! [inputs.dns_query] Error in plugin: Invalid answer (NXDOMAIN) from 1.1.1.1 after A query for qwe.example.com

System info

Telegraf 1.12.6

Docker

No response

Steps to reproduce

Get Telegraf 1.12.6 and use config provided in this ticket.
Run telegraf with the config provided. It can be a telegraf docker img.
Check for the logs coming out of telegraf.

Expected behavior

Expected behaviour is just to reflect the fact in the metric dns_query_rcode_value with rcode="NXDOMAIN" and result="error".
Maybe there's a way to prevent the plugin from logging?

Actual behavior

2024-03-05T21:09:20Z E! [inputs.dns_query] Error in plugin: Invalid answer (NXDOMAIN) from 1.1.1.1 after A query for qwe.example.com

Additional info

Many thanks!

@maintain3r maintain3r added the bug unexpected problem or unintended behavior label Mar 5, 2024
@maintain3r maintain3r changed the title When a RR does not exist anymore telegraf produces errors in its logfile. When a RR does not exist anymore telegraf inputs.dns_query produces errors in its logfile. Mar 5, 2024
@powersj
Contributor

powersj commented Mar 5, 2024

Hi,

Get Telegraf 1.12.6 and use config provided in this ticket.

That version is many, many years old. Can you please update the version you are using? Based on my local build of master, I get the following results:

dns_query,domain=qwe.example.com,host=ryzen,rcode=NXDOMAIN,record_type=A,result=error,server=1.1.1.1 query_time_ms=105.99883,result_code=2i,rcode_value=3i 1709674110000000000
2024-03-05T21:28:29Z D! [outputs.file] Wrote batch of 1 metrics in 21.2µs
2024-03-05T21:28:29Z D! [outputs.file] Buffer fullness: 0 / 10000 metrics

@powersj powersj added the waiting for response waiting for response from contributor label Mar 5, 2024
@maintain3r
Author

Hi @powersj thanks for getting back. I checked with the same config and the latest version of telegraf and get the same results. Please have a look at the following:

root@node_exp:~# cat /tmp/aaa.log
2024-03-08T17:08:00Z I! Starting Telegraf 1.29.5 brought to you by InfluxData the makers of InfluxDB
2024-03-08T17:08:00Z I! Available plugins: 241 inputs, 9 aggregators, 30 processors, 24 parsers, 60 outputs, 6 secret-stores
2024-03-08T17:08:00Z I! Loaded inputs: dns_query (2x)
2024-03-08T17:08:00Z I! Loaded aggregators:
2024-03-08T17:08:00Z I! Loaded processors:
2024-03-08T17:08:00Z I! Loaded secretstores:
2024-03-08T17:08:00Z I! Loaded outputs: prometheus_client
2024-03-08T17:08:00Z I! Tags enabled: host=node_exp
2024-03-08T17:08:00Z I! [agent] Config: Interval:10s, Quiet:false, Hostname:"node_exp", Flush Interval:5s
2024-03-08T17:08:00Z I! [outputs.prometheus_client] Listening on http://[::]:4444/metrics
2024-03-08T17:08:10Z E! [inputs.dns_query] Error in plugin: invalid answer (NXDOMAIN) from 1.1.1.1 after A query for qwe.example.com
2024-03-08T17:08:20Z E! [inputs.dns_query] Error in plugin: invalid answer (NXDOMAIN) from 1.1.1.1 after A query for qwe.example.com
2024-03-08T17:08:30Z E! [inputs.dns_query] Error in plugin: invalid answer (NXDOMAIN) from 1.1.1.1 after A query for qwe.example.com
2024-03-08T17:08:40Z E! [inputs.dns_query] Error in plugin: invalid answer (NXDOMAIN) from 1.1.1.1 after A query for qwe.example.com
2024-03-08T17:08:50Z E! [inputs.dns_query] Error in plugin: invalid answer (NXDOMAIN) from 1.1.1.1 after A query for qwe.example.com
2024-03-08T17:09:00Z E! [inputs.dns_query] Error in plugin: invalid answer (NXDOMAIN) from 1.1.1.1 after A query for qwe.example.com

$ telegraf --version
Telegraf 1.29.5 (git: HEAD@138d0d54)

@telegraf-tiger telegraf-tiger bot removed the waiting for response waiting for response from contributor label Mar 8, 2024
@powersj
Contributor

powersj commented Mar 8, 2024

To remind myself of this issue I re-ran this again and do now see:

2024-03-08T17:25:20Z E! [inputs.dns_query] Error in plugin: invalid answer (NXDOMAIN) from 1.1.1.1 after A query for qwe.example.com
dns_query,domain=qwe.example.com,rcode=NXDOMAIN,record_type=A,result=error,server=1.1.1.1 query_time_ms=101.723123,result_code=2i,rcode_value=3i 1709918720000000000
2024-03-08T17:25:25Z D! [outputs.file] Wrote batch of 1 metrics in 49.11µs
2024-03-08T17:25:25Z D! [outputs.file] Buffer fullness: 0 / 10000 metrics
2024-03-08T17:25:30Z E! [inputs.dns_query] Error in plugin: invalid answer (NXDOMAIN) from 1.1.1.1 after A query for qwe.example.com
dns_query,domain=qwe.example.com,rcode=NXDOMAIN,record_type=A,result=error,server=1.1.1.1 query_time_ms=4.939483,result_code=2i,rcode_value=3i 1709918730000000000
2024-03-08T17:25:35Z D! [outputs.file] Wrote batch of 1 metrics in 45.97µs
2024-03-08T17:25:35Z D! [outputs.file] Buffer fullness: 0 / 10000 metrics

It appears that the plugin will produce an error any time there is a non-success response code. I'm still not convinced this is actually an issue, though, since the error is our opportunity to tell the user that something is up. In this case, you are trying to look up something that doesn't exist, so it probably makes sense to error, no?
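
For readers following along, the behavior described here can be sketched roughly as below. This is an illustrative Go sketch, not Telegraf's actual plugin code: `evaluate`, `result`, and `rcodeNames` are hypothetical names, and only the tag/field names and the error text are taken from the output shown in this thread. The point is that the metric and the error come out of the same check, so any non-success rcode yields both.

```go
package main

import "fmt"

// rcodeNames maps a few standard DNS response codes to their names.
var rcodeNames = map[int]string{
	0: "NOERROR",
	1: "FORMERR",
	2: "SERVFAIL",
	3: "NXDOMAIN",
	5: "REFUSED",
}

// result is a hypothetical stand-in for part of one dns_query metric.
type result struct {
	Rcode      string // tag: rcode
	Result     string // tag: result
	RcodeValue int    // field: rcode_value
}

// evaluate turns a raw response code into a metric and, for any
// non-success code, an error that the caller would log.
func evaluate(server, domain, recordType string, rcode int) (result, error) {
	name, ok := rcodeNames[rcode]
	if !ok {
		name = fmt.Sprintf("RCODE%d", rcode)
	}
	res := result{Rcode: name, Result: "success", RcodeValue: rcode}
	if rcode != 0 {
		res.Result = "error"
		return res, fmt.Errorf("invalid answer (%s) from %s after %s query for %s",
			name, server, recordType, domain)
	}
	return res, nil
}

func main() {
	res, err := evaluate("1.1.1.1", "qwe.example.com", "A", 3)
	fmt.Printf("rcode=%s result=%s rcode_value=%d\n", res.Rcode, res.Result, res.RcodeValue)
	if err != nil {
		fmt.Println("E! [inputs.dns_query] Error in plugin:", err)
	}
}
```

In this sketch the metric is still returned alongside the error, which matches the output above where both the error line and the `dns_query` metric appear in each interval.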

@powersj powersj added the waiting for response waiting for response from contributor label Mar 8, 2024
@srebhan
Copy link
Member

srebhan commented Mar 11, 2024

@powersj maybe we should log the error only once?

@telegraf-tiger telegraf-tiger bot removed the waiting for response waiting for response from contributor label Mar 11, 2024
@powersj
Contributor

powersj commented Mar 11, 2024

@powersj maybe we should log the error only once?

I am still not clear on what the use-case of this issue is. Does @maintain3r know the endpoint does not exist and does not want errors in the logs? Or does this domain come and go? Why would you not want the error in the first place?

If anything, my initial thought was to potentially allow filtering out certain error codes from the check. So if the user did not want to see NXDOMAIN, we would skip printing that error?
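
A minimal sketch of what such an opt-in filter could look like, assuming the user supplies a list of rcode names to silence. The function names and the idea of a per-plugin list are hypothetical and are not taken from Telegraf's code or configuration; only the log line is filtered, while the metric would still be emitted.

```go
package main

import (
	"fmt"
	"strings"
)

// newIgnoreSet builds a case-insensitive lookup set from a list of rcode
// names, e.g. the value of a hypothetical per-plugin config option.
func newIgnoreSet(names []string) map[string]struct{} {
	set := make(map[string]struct{}, len(names))
	for _, n := range names {
		set[strings.ToUpper(strings.TrimSpace(n))] = struct{}{}
	}
	return set
}

// shouldLog reports whether an error for the given rcode should be logged.
// With an empty set, every error is logged, so the filter is opt-in.
func shouldLog(rcode string, ignore map[string]struct{}) bool {
	_, ignored := ignore[strings.ToUpper(rcode)]
	return !ignored
}

func main() {
	ignore := newIgnoreSet([]string{"nxdomain"})
	for _, rcode := range []string{"NXDOMAIN", "SERVFAIL"} {
		fmt.Printf("%s: log=%t\n", rcode, shouldLog(rcode, ignore))
	}
}
```

Because the default set is empty, users who typo a domain name would still see the error unless they explicitly ask for it to be hidden.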

@powersj powersj added the waiting for response waiting for response from contributor label Mar 12, 2024
@maintain3r
Author

maintain3r commented Mar 13, 2024

Hello Team,
Sorry if I wasn't clear, will do better! My goal is to avoid polluting the log file with lines where telegraf complains that it's not able to resolve a specific name. As in the config I provided above, let's say I have qwe.example.com that I want to keep track of. Simply put, I want to make sure that this A RR always exists. And in the case where this RR gets deleted, I should not rely on logs; I want to rely on a value my metrics provide. If I see a change from A->B I trigger an alert.
I'll also attach some gifs to make things clear.
[GIF: telegraf-dns]
[GIF: telegraf-dns2]

root@node_exp:/etc/telegraf# telegraf --version
Telegraf 1.31.0-f9d24e96 (git: pull/14979@f9d24e96)

@telegraf-tiger telegraf-tiger bot removed the waiting for response waiting for response from contributor label Mar 13, 2024
@powersj
Contributor

powersj commented Mar 13, 2024

My goal is to avoid polluting the log file with lines where telegraf complains that it's not able to resolve a specific name.

Why?

From my perspective it is just another line in the log that you can ignore if you don't need to worry about it. Other users who typo a domain name may really need to see that log message so they can go in and fix the hostname.

If I see a change from A->B I trigger an alert.

Are your alerts based on logs or metrics?

@powersj powersj added the waiting for response waiting for response from contributor label Mar 13, 2024
@maintain3r
Author

Unless I'm missing something, I see no value in periodically throwing an error string into the log file that basically repeats what's already visible in the metric. In the example I provided I used only the Cloudflare DNS server and just one name to check, but if I want to test the same record against 3-4 different external dns servers I get even more noise, not to mention that I can have more RRs which, over time or by mistake, may get deleted by DNS zone admins, making things even noisier.
And, yeah, in my case alerts are based on the metrics I extract.

Hope it makes sense :)
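
To make the metrics-based approach concrete, here is a rough Go sketch of change detection on the rcode per server and domain. It is only an illustration of the idea: `tracker`, `key`, and `observe` are hypothetical names, and a real setup would more likely express this as an alert rule in the monitoring system that scrapes the metrics.

```go
package main

import "fmt"

// key identifies one monitored record on one DNS server.
type key struct {
	Server string
	Domain string
}

// tracker remembers the last rcode seen for each key.
type tracker struct {
	last map[key]string
}

func newTracker() *tracker {
	return &tracker{last: make(map[key]string)}
}

// observe records the rcode and reports whether it differs from the
// previously observed rcode for the same server and domain. The first
// observation for a key never counts as a change.
func (t *tracker) observe(server, domain, rcode string) bool {
	k := key{Server: server, Domain: domain}
	prev, seen := t.last[k]
	t.last[k] = rcode
	return seen && prev != rcode
}

func main() {
	t := newTracker()
	samples := []string{"NOERROR", "NOERROR", "NXDOMAIN", "NXDOMAIN"}
	for _, rcode := range samples {
		if t.observe("1.1.1.1", "qwe.example.com", rcode) {
			fmt.Println("alert: rcode changed to", rcode)
		}
	}
}
```

With this approach the alert fires once, on the transition, while a log line for the same condition would repeat every collection interval.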

@telegraf-tiger telegraf-tiger bot removed the waiting for response waiting for response from contributor label Mar 13, 2024
@powersj
Contributor

powersj commented Mar 13, 2024

3-4 different external dns servers I get even more noise,

I can understand, except this noise to you is a legitimate call to action for another user.

Even if we only logged the error once, my concern is that this starts to hide legit issues, or makes them harder for others to spot, because they see the message once in the logs and then go "well it must have gone away and isn't an issue anymore".

I am not convinced we should remove the messages, nor should we log only once. I would consider a filter, because that is opt-in.

@maintain3r
Author

@powersj That's fair! And maybe having a knob that will turn the logging on and off (on the plugin level only) will make things better?

powersj added a commit to powersj/telegraf that referenced this issue Mar 14, 2024
Allows the user to specify ignoring certain error types from printing in
the logs.

fixes: influxdata#14941
powersj added a commit to powersj/telegraf that referenced this issue Mar 19, 2024
3 participants