DNS Query for SRV Answer Truncated to Nothing #1931

Closed
bkendall opened this issue Apr 10, 2016 · 7 comments

@bkendall

I was working with a small cluster this evening and registered 9 redis machines into my consul services catalog. When I went to dig for the SRV records, however, I noticed that I did not receive any answer section:

$ dig @127.0.0.1 -p 8600 redis-cache-redis.service.consul SRV

; <<>> DiG 9.9.5-3ubuntu0.6-Ubuntu <<>> @127.0.0.1 -p 8600 redis-cache-redis.service.consul SRV
; (1 server found)
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 52069
;; flags: qr aa rd; QUERY: 1, ANSWER: 0, AUTHORITY: 1, ADDITIONAL: 9
;; WARNING: recursion requested but not available

;; QUESTION SECTION:
;redis-cache-redis.service.consul. IN   SRV

;; AUTHORITY SECTION:
consul.                 0       IN      SOA     ns.consul. postmaster.consul. 1460264240 3600 600 86400 0

;; ADDITIONAL SECTION:
ip-10-0-1-185.node.dc1.consul. 0 IN     A       10.0.1.185
ip-10-0-1-16.node.dc1.consul. 0 IN      A       10.0.1.16
ip-10-0-1-59.node.dc1.consul. 0 IN      A       10.0.1.59
ip-10-0-1-59.node.dc1.consul. 0 IN      A       10.0.1.59
ip-10-0-1-185.node.dc1.consul. 0 IN     A       10.0.1.185
ip-10-0-1-59.node.dc1.consul. 0 IN      A       10.0.1.59
ip-10-0-1-16.node.dc1.consul. 0 IN      A       10.0.1.16
ip-10-0-1-185.node.dc1.consul. 0 IN     A       10.0.1.185
ip-10-0-1-16.node.dc1.consul. 0 IN      A       10.0.1.16

;; Query time: 5 msec
;; SERVER: 127.0.0.1#8600(127.0.0.1)
;; WHEN: Sun Apr 10 04:57:20 UTC 2016
;; MSG SIZE  rcvd: 517

All of the machines (nodes) in the ADDITIONAL section are correct. After a bit of searching I found the enable_truncate flag and set it. A UDP query still came back with no ANSWER section, but because the response was now marked as truncated, dig was smart enough to retry over TCP and received the entire ANSWER section.

This leads me to the following question: is removing the entire ANSWER section appropriate? I found the logic in the code (dns.go) that sees that the message is too long and removes ANSWER entries until either there are no ANSWERs left or the message is under the size limit, but I don't know enough about DNS and its implementation to know the correct answer to that. It seems bizarre to me to completely remove the ANSWER section in favor of the ADDITIONAL section. I would (maybe naïvely) expect both sections to be trimmed down to the same hosts/entries, so that each keeps the same, largest possible number of them.

Any thoughts?

For what it's worth, here's the response after enable_truncate is in place. Working perfectly now:

$ dig @127.0.0.1 -p 8600 redis-cache-redis.service.consul SRV
;; Truncated, retrying in TCP mode.

; <<>> DiG 9.9.5-3ubuntu0.6-Ubuntu <<>> @127.0.0.1 -p 8600 redis-cache-redis.service.consul SRV
; (1 server found)
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 11627
;; flags: qr aa rd; QUERY: 1, ANSWER: 9, AUTHORITY: 0, ADDITIONAL: 9
;; WARNING: recursion requested but not available

;; QUESTION SECTION:
;redis-cache-redis.service.consul. IN   SRV

;; ANSWER SECTION:
redis-cache-redis.service.consul. 0 IN  SRV     1 1 36518 ip-10-0-1-185.node.dc1.consul.
redis-cache-redis.service.consul. 0 IN  SRV     1 1 31562 ip-10-0-1-59.node.dc1.consul.
redis-cache-redis.service.consul. 0 IN  SRV     1 1 29888 ip-10-0-1-185.node.dc1.consul.
redis-cache-redis.service.consul. 0 IN  SRV     1 1 32688 ip-10-0-1-16.node.dc1.consul.
redis-cache-redis.service.consul. 0 IN  SRV     1 1 48075 ip-10-0-1-59.node.dc1.consul.
redis-cache-redis.service.consul. 0 IN  SRV     1 1 56648 ip-10-0-1-59.node.dc1.consul.
redis-cache-redis.service.consul. 0 IN  SRV     1 1 28523 ip-10-0-1-185.node.dc1.consul.
redis-cache-redis.service.consul. 0 IN  SRV     1 1 33573 ip-10-0-1-16.node.dc1.consul.
redis-cache-redis.service.consul. 0 IN  SRV     1 1 27304 ip-10-0-1-16.node.dc1.consul.

;; ADDITIONAL SECTION:
ip-10-0-1-185.node.dc1.consul. 0 IN     A       10.0.1.185
ip-10-0-1-59.node.dc1.consul. 0 IN      A       10.0.1.59
ip-10-0-1-185.node.dc1.consul. 0 IN     A       10.0.1.185
ip-10-0-1-16.node.dc1.consul. 0 IN      A       10.0.1.16
ip-10-0-1-59.node.dc1.consul. 0 IN      A       10.0.1.59
ip-10-0-1-59.node.dc1.consul. 0 IN      A       10.0.1.59
ip-10-0-1-185.node.dc1.consul. 0 IN     A       10.0.1.185
ip-10-0-1-16.node.dc1.consul. 0 IN      A       10.0.1.16
ip-10-0-1-16.node.dc1.consul. 0 IN      A       10.0.1.16

;; Query time: 3 msec
;; SERVER: 127.0.0.1#8600(127.0.0.1)
;; WHEN: Sun Apr 10 05:07:44 UTC 2016
;; MSG SIZE  rcvd: 1172
@slackpad slackpad added the type/bug Feature does not function as expected label Apr 10, 2016
@slackpad
Contributor

Hi @bkendall, can you confirm which version of Consul you are using? There were some recent changes in 0.6.4 in this area, but I don't think they would have caused this. You are correct that we should probably trim the ADDITIONAL section in parallel to prevent a situation like this.

@bkendall
Author

I first found this behavior on 0.6.4, and then I was getting the same behavior on master @ 16f34bf when I was building it locally to debug further.

@slackpad
Contributor

Ah ok this is the trimming behavior added in 0.6.4, since we used to just trim the answer list to a fixed size before. Thanks for the report - we will get this fixed!

@bkendall
Author

You're very welcome. For those interested in possibly making a PR (I may be), would the desired behavior be to trim down the ADDITIONAL section to match the ANSWER section? Or, de-duplicate the ADDITIONAL section and remove hosts from both when trying to trim down to the desired length?

@slackpad
Contributor

This is a little tricky because you may legitimately have multiple entries for the same host in the ANSWER section, so you have to do some bookkeeping in order to clean ADDITIONAL out. De-duping is a great idea. I think you'd maybe want to do something like this in the trim routine:

  1. De-dup the ADDITIONAL section. If the message fits now you are good.
  2. Make one pass through the ANSWER records and make an index with a count of how many times that host appears.
  3. While the message doesn't fit, remove the next ANSWER record, decrement the count for that host, and if the count is now 0, remove the corresponding ADDITIONAL record.
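
The three steps above could be sketched roughly as follows. This is a self-contained Go illustration, not Consul's actual code: the Record and Message types, the Len size estimate, and the trim and removeByName helpers are all hypothetical stand-ins, and the per-record sizes in main are made-up numbers chosen only to push the message over a 512-byte limit.

```go
package main

import "fmt"

// Record is a simplified, hypothetical stand-in for a DNS resource record.
type Record struct {
	Name   string // owner name (the node name for A records)
	Target string // SRV target host; empty for A records
	Data   string // record data, e.g. the IP address of an A record
	Size   int    // assumed encoded size in bytes
}

// Message holds only the two sections that matter for trimming.
type Message struct {
	Answer []Record
	Extra  []Record // the ADDITIONAL section
}

// Len is a rough size estimate: a 12-byte header plus the record sizes.
func (m *Message) Len() int {
	n := 12
	for _, r := range m.Answer {
		n += r.Size
	}
	for _, r := range m.Extra {
		n += r.Size
	}
	return n
}

// removeByName drops every record owned by name.
func removeByName(rs []Record, name string) []Record {
	out := rs[:0]
	for _, r := range rs {
		if r.Name != name {
			out = append(out, r)
		}
	}
	return out
}

// trim shrinks m until it fits in limit bytes or no ANSWER records remain.
func trim(m *Message, limit int) {
	// Step 1: de-duplicate ADDITIONAL on (name, data).
	seen := make(map[string]bool)
	deduped := m.Extra[:0]
	for _, r := range m.Extra {
		key := r.Name + "|" + r.Data
		if !seen[key] {
			seen[key] = true
			deduped = append(deduped, r)
		}
	}
	m.Extra = deduped

	// Step 2: count how many ANSWER records point at each host.
	refs := make(map[string]int)
	for _, r := range m.Answer {
		refs[r.Target]++
	}

	// Step 3: drop ANSWER records from the end; once a host is no longer
	// referenced, drop its ADDITIONAL record as well.
	for m.Len() > limit && len(m.Answer) > 0 {
		last := m.Answer[len(m.Answer)-1]
		m.Answer = m.Answer[:len(m.Answer)-1]
		refs[last.Target]--
		if refs[last.Target] == 0 {
			m.Extra = removeByName(m.Extra, last.Target)
		}
	}
}

func main() {
	// Nine SRV records over three hosts, mirroring the shape of the report.
	hosts := []string{"ip-10-0-1-185", "ip-10-0-1-59", "ip-10-0-1-16"}
	m := &Message{}
	for i := 0; i < 9; i++ {
		h := hosts[i%3]
		m.Answer = append(m.Answer, Record{Name: "redis", Target: h, Size: 60})
		m.Extra = append(m.Extra, Record{Name: h, Data: h, Size: 45})
	}
	trim(m, 512)
	fmt.Println(len(m.Answer), len(m.Extra), m.Len())
}
```

With this bookkeeping the ADDITIONAL section can never outlive the ANSWER records that reference it, so a response that ends up with zero answers also carries no additional records.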

@slackpad slackpad added the dns label Apr 13, 2016
@xytis

xytis commented Jul 22, 2016

Are there any developments on this bug?
Is the current 'solution' to downgrade consul?

@slackpad
Contributor

Hi @xytis this will be fixed for sure by the next release of Consul. Current workaround would be to use DNS over TCP or an older version of the agent. Sorry about that!
