Thanos ruler fails to start with "cannot unmarshal DNS message" errors #4204
Comments
Thanks for your report. I assume you use the hidden flag. I would be in favor of not changing the default DNS resolver: many users have not changed the resolver, and changing the default could mean new issues for them, forcing them to switch back explicitly to their previously used resolver. However, it might be valuable to unhide the flag and document it better, so more people know it exists. Especially for OpenShift users. Would love to hear your thoughts 👍
Yes, my rationale is that the Go native resolver currently has known limitations with SRV records (as stated in #1015 already) and that people deploying Thanos on Kubernetes probably need miekgdns as the default. Having said that, I fully understand that changing the default might trigger issues for another fraction of users.
That would definitely work for me. FWIW we had to switch the default resolver to miekgdns in OpenShift because the prometheus operator has no support for the
Just describing the two options that IMO should be discussed.
Question 1: Should … Pros: seems like better integration, fewer quirks/bugs.
Question 2: Should we keep … Pros: less config overhead for regular usage.
We've finally identified the change that caused the issue with the Thanos ruler pods. It was the cluster DNS operator that enabled the bufsize plugin for CoreDNS. As a result, the server started to send UDP messages > 512 bytes, which the native Go resolver can't handle (see openshift/cluster-dns-operator#266, golang/go#6464 and golang/go#13561). The plugin configuration was eventually modified to avoid breaking clients using the native Go resolver in openshift/cluster-dns-operator#276, but I bet that other DNS server implementations (e.g. Bind, PowerDNS) may also be configured in a way that would irritate the native Go resolver.
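Purely to illustrate the mechanism the bufsize plugin relies on (this sketch is mine, not from the thread): a plain DNS query implicitly caps UDP responses at the classic 512-byte limit, while an EDNS0 OPT pseudo-record (RFC 6891) advertises a larger buffer, which is what allowed CoreDNS to send the oversized responses the native Go resolver then choked on. The packet layout is the standard wire format; the helper names are hypothetical.

```python
import struct

def build_query(qname, edns_bufsize=None):
    """Build a minimal DNS query for an SRV record; optionally append an
    EDNS0 OPT pseudo-record advertising a UDP payload size (RFC 6891)."""
    header = struct.pack(
        "!HHHHHH",
        0x1234,                     # transaction id (arbitrary)
        0x0100,                     # flags: standard query, recursion desired
        1,                          # QDCOUNT: one question
        0, 0,                       # ANCOUNT, NSCOUNT
        1 if edns_bufsize else 0,   # ARCOUNT: one OPT record if EDNS0 is used
    )
    question = b"".join(
        bytes([len(label)]) + label.encode() for label in qname.split(".")
    ) + b"\x00"
    question += struct.pack("!HH", 33, 1)  # QTYPE=SRV, QCLASS=IN
    packet = header + question
    if edns_bufsize:
        # OPT pseudo-record: root name, TYPE=41; the CLASS field carries
        # the advertised UDP payload size; TTL=0, no RDATA.
        packet += b"\x00" + struct.pack("!HHIH", 41, edns_bufsize, 0, 0)
    return packet

def advertised_bufsize(packet):
    """Return the EDNS0 payload size a query advertises, or 512 (the
    classic DNS limit) if the query carries no OPT record."""
    arcount = struct.unpack("!H", packet[10:12])[0]
    if arcount == 0:
        return 512
    # In this minimal query the OPT record is the trailing 11 bytes.
    _, rtype, size = struct.unpack("!BHH", packet[-11:-6])
    assert rtype == 41
    return size
```

A server that honors the advertised size (e.g. 1232 bytes, the value commonly recommended since DNS Flag Day 2020) can return SRV answer sets that would never fit in 512 bytes.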
Nice, thanks for this! Just to recap: the miekg resolver was working fine for everyone, but the Go native one was not, right? In this case, I would indeed unhide it and make it the default TBH - it has proven to work well for many. We can actually make even better use of it, e.g. finally use TTL metadata.
This is my understanding. I'd be 👍 on your idea @bwplotka
Hello 👋 Looks like there was no activity on this issue for the last two months.
Closed by #4519 |
Related to issue: #4204
Problem: in PR #4519 the default sd-dns-resolver for querier was set to miekgdns, but this change was never ported to ruler.
Solution: this PR brings the same default to ruler to make it consistent.
Signed-off-by: JoaoBraveCoding <jmarcal@redhat.com>
Thanos, Prometheus and Golang version used:
Thanos v0.19.0 and Prometheus v2.26.0.
Object Storage Provider:
N/A
What happened:
The Thanos ruler pods fail to start because Thanos can't resolve the SRV records for the Alertmanager endpoints ("cannot unmarshal DNS message" error message in the logs). The situation doesn't resolve by itself and the pods keep crashlooping.
What you expected to happen:
Thanos ruler pods start successfully.
How to reproduce it (as minimally and precisely as possible):
The issue can be seen with OpenShift 4.7 and 4.8 (at least) but I couldn't reproduce with vanilla Kubernetes v1.20 (installed with minikube).
Full logs to relevant components:
Anything else we need to know:
As far as I can tell, this isn't a Thanos bug by itself, since the root cause is probably golang/go#36718 (TL;DR: the Go resolver rejects DNS responses that don't comply exactly with the RFCs, e.g. compressed names in SRV resource records). It's also worth noting that standard tools like dig have no issue resolving the records.
We've worked around the issue by switching the DNS resolver to miekgdns (instead of the default native Go resolver), but it might be worth making miekgdns the default?
Additional details: https://bugzilla.redhat.com/show_bug.cgi?id=1953518
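To make the "compressed names" part of golang/go#36718 concrete (a sketch of the standard RFC 1035 §4.1.4 wire format, not Thanos code; the helper name is mine): name compression replaces a repeated domain suffix with a two-byte pointer whose top two bits are set, and it is such pointers inside SRV records that the Go resolver rejected while tools like dig decode them without complaint.

```python
import struct

def decode_name(message, offset):
    """Decode a possibly-compressed DNS name (RFC 1035 section 4.1.4).
    Returns (name, offset just past the name at its original location)."""
    labels = []
    jumped = False
    end = offset
    while True:
        length = message[offset]
        if length & 0xC0 == 0xC0:
            # Compression pointer: the 14 low bits are an absolute offset
            # into the message where the remaining labels live.
            pointer = struct.unpack("!H", message[offset:offset + 2])[0] & 0x3FFF
            if not jumped:
                end = offset + 2  # the pointer ends the name in-place
            offset = pointer
            jumped = True
        elif length == 0:
            if not jumped:
                end = offset + 1
            return ".".join(labels), end
        else:
            labels.append(message[offset + 1:offset + 1 + length].decode())
            offset += 1 + length
```

For example, a message containing "alertmanager.example.org" at offset 0 can later encode "am0.example.org" as just the label "am0" followed by a pointer to offset 13, where "example.org" already appears; a resolver that refuses to follow such pointers cannot unmarshal the message.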