Add NS records and A records for each server. Constructs ns host name… #3353

Merged
merged 14 commits into from
Aug 8, 2017


preetapan
Contributor

…s using the advertise address of the server.

@preetapan
Contributor Author

Example output

preetha@preetha-work ~/go/src/github.com/hashicorp/consul (issue_1301) $dig @127.0.0.1 -p 8600 redis.service.mydc.consul. 

; <<>> DiG 9.10.3-P4-Ubuntu <<>> @127.0.0.1 -p 8600 redis.service.mydc.consul.
; (1 server found)
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 31418
;; flags: qr aa rd; QUERY: 1, ANSWER: 1, AUTHORITY: 1, ADDITIONAL: 2
;; WARNING: recursion requested but not available

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 4096
;; QUESTION SECTION:
;redis.service.mydc.consul.	IN	A

;; ANSWER SECTION:
redis.service.mydc.consul. 0	IN	A	172.17.0.1

;; AUTHORITY SECTION:
consul.			0	IN	NS	ns.172.17.0.1.consul.

;; ADDITIONAL SECTION:
ns.172.17.0.1.consul.	0	IN	A	172.17.0.1

;; Query time: 0 msec
;; SERVER: 127.0.0.1#8600(127.0.0.1)
;; WHEN: Wed Aug 02 16:46:13 CDT 2017
;; MSG SIZE  rcvd: 114

agent/dns.go Outdated
Name: d.domain,
Rrtype: dns.TypeNS,
Class: dns.ClassINET,
Ttl: 0,
Contributor Author

open question around TTL for this part. Should it be > 0?

Note that the A record below with the ip address mapped to the ns record name does set a TTL

@preetapan
Contributor Author

Reached out to person that reported this first to ask for testing it -
#1301 (comment)

@magiconair
Contributor

Still need to read the RFC on what this needs to look like but at a first glance I'd use a different name for the NS name since dots usually denote subdomains. Maybe like this:

;; AUTHORITY SECTION:
consul.			0	IN	NS	server-172-17-0-1.consul. 
consul.			0	IN	NS	server-2001-db8--1.consul.

;; ADDITIONAL SECTION:
server-172-17-0-1.consul.	0	IN	A	172.17.0.1
server-2001-db8--1.consul.	0	IN	AAAA	2001:db8::1

Contributor

@magiconair magiconair left a comment

Left some comments on style and format. Still need to read RFCs on whether the solution is correct.

}

var ret []string
// Try each manager until we get a server.
Contributor

drop comment or explain why not every manager is a server

agent/dns.go Outdated
@@ -673,6 +682,40 @@ RPC:
}
}

// addNSAndARecordsForDomain uses the agent's advertise address to
func (d *DNSServer) addNSAndARecordsForDomain(msg *dns.Msg) {
Contributor

s/addNSAndARecordsForDomain/addAuthority/g

Also the comment isn't done yet.

agent/dns.go Outdated
func (d *DNSServer) addNSAndARecordsForDomain(msg *dns.Msg) {
serverAddrs := d.agent.delegate.ServerAddrs()
for _, addr := range serverAddrs {
ipAddrStr := strings.Split(addr, ":")[0]
Contributor

See separate comment on nsName format.

agent/dns.go Outdated
}
msg.Ns = append(msg.Ns, ns)

//add an A record for the NS record
Contributor

add space between // and comment

@magiconair
Contributor

magiconair commented Aug 4, 2017

TL;DR

  • The response to the SOA query is wrong IMO since the SOA record should be in the Answer section instead of Authority
  • Consul should provide an NS record
  • The SOA response can contain NS records in the Authority section and glue records in the Additional (Extra) section. It should contain CNAMEs for ns.<domain> in the Extra section
  • The Authority section should not be added to every response since it makes the responses too big for large clusters
  • We should make the values for refresh, retry, expire and minttl configurable.
  • We should also make the RNAME (mailbox name) configurable but default to hostmaster.<domain>postmaster.<domain>
  • I'd also add a table driven test for the DNS name regex
  • My patch does not handle AAAA glue records for IPv6 yet (!)

I still need to test proper delegation from bind/named. So the last patch is my idea of the correct behavior of consul for this use case after reviewing the RFCs and comparing to other DNS responses. Therefore, additional eyes are welcome.

Long version

I've reviewed the RFCs and some additional documents to review the implemented solution.

The IANA nameserver guidelines [1] are useful in how to configure an authoritative nameserver but not all requirements apply since consul is usually not deployed as an internet facing DNS server. For example, we will usually not be able to provide name servers in different BGP networks and ip addresses are usually in private networks. The main purpose for consul is to handle DNS resolution for registered services on an internal network.

To embed consul into a larger DNS setup it should be possible to delegate a zone authoritatively to consul. For this the responses need to have the AA bit (authoritative answer) set and provide name servers to the parent zone.

The main problem is that consul does not provide NS (nameserver) records at all. We should at least respond to NS queries with a list of name servers and we can add them to the Authority section of an SOA response. It is not necessary IMO to add the NS records and A glue records to every response for several reasons: It makes the response larger which could become an issue (512 byte limit) for larger server clusters. In a seven node cluster with consul listening on both IPv4 and IPv6 the additional response size could be around 500 bytes.

Therefore, responding to NS queries should be sufficient and adding the NS records (plus glue) to the SOA response is helpful.

In the current implementation I've limited the number of servers returned for NS queries to three and also randomized them. I am not sure whether this is a good idea for a delegated zone since the parent server needs to know about the slave servers' ip addresses. Therefore, they cannot change at will since the parent server config would have to be updated as well.
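
For illustration, the "up to three, randomized" selection can be sketched like this. `pickServers` and the server names are made up for the example and are not the code in the patch.

```go
package main

import (
	"fmt"
	"math/rand"
)

// pickServers returns up to limit entries from servers in random order.
// The input slice is copied first, so the caller's slice is not modified.
func pickServers(servers []string, limit int) []string {
	out := make([]string, len(servers))
	copy(out, servers)
	rand.Shuffle(len(out), func(i, j int) { out[i], out[j] = out[j], out[i] })
	if len(out) > limit {
		out = out[:limit]
	}
	return out
}

func main() {
	// Hypothetical server names, for illustration only.
	all := []string{"consul1", "consul2", "consul3", "consul4", "consul5"}
	fmt.Println(pickServers(all, 3))
}
```

Because the selection changes between responses, a parent server with a fixed list of addresses may see name servers it was not configured with, which is the concern raised above.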

Setting the TTLs to zero is a good idea for services since we don't want to cache this. Whether this is also a good idea for SOA and NS records is a different discussion. In any case, we should make the zone parameters refresh, retry, expire and minttl as well as the MBOX username configurable.
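
A rough sketch of what configurable zone parameters could look like. The `SOAConfig` type and its field names are hypothetical; the default values are the ones visible in the dig output in this thread (3600 600 86400 0 and the postmaster mailbox).

```go
package main

import "fmt"

// SOAConfig is a hypothetical container for the zone parameters that
// could be made configurable.
type SOAConfig struct {
	Refresh uint32 // seconds between refresh checks
	Retry   uint32 // seconds before retrying a failed refresh
	Expire  uint32 // seconds until the zone data expires
	MinTTL  uint32 // minimum TTL field of the SOA record
	Mbox    string // mailbox label used to build the RNAME
}

// defaultSOA mirrors the values seen in the dig output in this thread.
func defaultSOA() SOAConfig {
	return SOAConfig{Refresh: 3600, Retry: 600, Expire: 86400, MinTTL: 0, Mbox: "postmaster"}
}

// RData renders the SOA record data in the order dig prints it.
// domain is expected to end with a dot, e.g. "consul.".
func (c SOAConfig) RData(domain string, serial uint32) string {
	return fmt.Sprintf("ns.%s %s.%s %d %d %d %d %d",
		domain, c.Mbox, domain, serial, c.Refresh, c.Retry, c.Expire, c.MinTTL)
}

func main() {
	fmt.Println(defaultSOA().RData("consul.", 1501845830))
}
```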

I've also dropped the server-<ip address>.<domain> approach for the servers since we already have a node name that we can use. Unfortunately, the node name already contains the datacenter but the canonical name for the server in consul DNS is NAME.node.DC.DOMAIN so we have to do some mangling.
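
The mangling could look roughly like the following. This is a sketch that assumes the server name arrives as `NAME.DC`; `nodeFQDN` is a hypothetical helper and the exact input format in the patch may differ.

```go
package main

import (
	"fmt"
	"strings"
)

// nodeFQDN builds NAME.node.DC.DOMAIN from a server name that may already
// carry a ".DC" suffix (an assumption for this sketch).
// domain is expected to end with a dot, e.g. "consul.".
func nodeFQDN(name, datacenter, domain string) string {
	name = strings.TrimSuffix(name, "."+datacenter)
	return fmt.Sprintf("%s.node.%s.%s", name, datacenter, domain)
}

func main() {
	fmt.Println(nodeFQDN("hashibook.dc1", "dc1", "consul."))
}
```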

The response for the SOA query was provided in the wrong section IMO and I've refactored it to be in the Answer section when asking for the SOA record. It also contains a CNAME for the ns.<domain> entry we put in the MNAME field although I'm not sure whether that is strictly necessary. The RFC says that the MNAME is the name of the primary name server for this zone so this seems reasonable. This refactor also removes the duplicate SOA record in the SOA response.

This needs to be tested by us/the community whether it works with zone delegation.

References

Consul 0.9.0 behavior:

~/consul-0.9.0/consul agent -dev

$ dig @127.0.0.1 -p 8600 soa consul. +norecurse

; <<>> DiG 9.8.3-P1 <<>> @127.0.0.1 -p 8600 soa consul. +norecurse
; (1 server found)
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NXDOMAIN, id: 33579
;; flags: qr aa; QUERY: 1, ANSWER: 0, AUTHORITY: 2, ADDITIONAL: 0

;; QUESTION SECTION:
;consul.				IN	SOA

;; AUTHORITY SECTION:
consul.			0	IN	SOA	ns.consul. postmaster.consul. 1501845830 3600 600 86400 0
consul.			0	IN	SOA	ns.consul. postmaster.consul. 1501845830 3600 600 86400 0

;; Query time: 0 msec
;; SERVER: 127.0.0.1#8600(127.0.0.1)
;; WHEN: Fri Aug  4 13:23:50 2017
;; MSG SIZE  rcvd: 110
$ dig @127.0.0.1 -p 8600 ns consul. +norecurse

; <<>> DiG 9.8.3-P1 <<>> @127.0.0.1 -p 8600 ns consul. +norecurse
; (1 server found)
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NXDOMAIN, id: 13735
;; flags: qr aa; QUERY: 1, ANSWER: 0, AUTHORITY: 1, ADDITIONAL: 0

;; QUESTION SECTION:
;consul.				IN	NS

;; AUTHORITY SECTION:
consul.			0	IN	SOA	ns.consul. postmaster.consul. 1501846384 3600 600 86400 0

;; Query time: 0 msec
;; SERVER: 127.0.0.1#8600(127.0.0.1)
;; WHEN: Fri Aug  4 13:33:04 2017
;; MSG SIZE  rcvd: 74
$ dig @127.0.0.1 -p 8600 srv consul.service.consul. +norecurse

; <<>> DiG 9.8.3-P1 <<>> @127.0.0.1 -p 8600 srv consul.service.consul. +norecurse
; (1 server found)
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 59653
;; flags: qr aa; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 1

;; QUESTION SECTION:
;consul.service.consul.		IN	SRV

;; ANSWER SECTION:
consul.service.consul.	0	IN	SRV	1 1 8300 hashibook.node.dc1.consul.

;; ADDITIONAL SECTION:
hashibook.node.dc1.consul. 0	IN	A	127.0.0.1

;; Query time: 0 msec
;; SERVER: 127.0.0.1#8600(127.0.0.1)
;; WHEN: Fri Aug  4 13:33:27 2017
;; MSG SIZE  rcvd: 94

Latest patch behavior

$ dig @127.0.0.1 -p 8600 soa consul. +norecurse

; <<>> DiG 9.8.3-P1 <<>> @127.0.0.1 -p 8600 soa consul. +norecurse
; (1 server found)
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 50880
;; flags: qr aa; QUERY: 1, ANSWER: 1, AUTHORITY: 1, ADDITIONAL: 2

;; QUESTION SECTION:
;consul.				IN	SOA

;; ANSWER SECTION:
consul.			0	IN	SOA	ns.consul. postmaster.consul. 1501846452 3600 600 86400 0

;; AUTHORITY SECTION:
consul.			0	IN	NS	hashibook.node.dc1.consul.

;; ADDITIONAL SECTION:
hashibook.node.dc1.consul. 0	IN	A	127.0.0.1
ns.consul.		0	IN	CNAME	hashibook.node.dc1.consul.

;; Query time: 0 msec
;; SERVER: 127.0.0.1#8600(127.0.0.1)
;; WHEN: Fri Aug  4 13:34:12 2017
;; MSG SIZE  rcvd: 137
$ dig @127.0.0.1 -p 8600 ns consul. +norecurse

; <<>> DiG 9.8.3-P1 <<>> @127.0.0.1 -p 8600 ns consul. +norecurse
; (1 server found)
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 54832
;; flags: qr aa; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 1

;; QUESTION SECTION:
;consul.				IN	NS

;; ANSWER SECTION:
consul.			0	IN	NS	hashibook.node.dc1.consul.

;; ADDITIONAL SECTION:
hashibook.node.dc1.consul. 0	IN	A	127.0.0.1

;; Query time: 0 msec
;; SERVER: 127.0.0.1#8600(127.0.0.1)
;; WHEN: Fri Aug  4 13:34:32 2017
;; MSG SIZE  rcvd: 73
$ dig @127.0.0.1 -p 8600 srv consul.service.consul. +norecurse

; <<>> DiG 9.8.3-P1 <<>> @127.0.0.1 -p 8600 srv consul.service.consul. +norecurse
; (1 server found)
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 34561
;; flags: qr aa; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 1

;; QUESTION SECTION:
;consul.service.consul.		IN	SRV

;; ANSWER SECTION:
consul.service.consul.	0	IN	SRV	1 1 8300 hashibook.node.dc1.consul.

;; ADDITIONAL SECTION:
hashibook.node.dc1.consul. 0	IN	A	127.0.0.1

;; Query time: 0 msec
;; SERVER: 127.0.0.1#8600(127.0.0.1)
;; WHEN: Fri Aug  4 13:34:44 2017
;; MSG SIZE  rcvd: 94

@preetapan
Contributor Author

@magiconair This looks close to being right, except I don't think we need the additional section when it's queried for the NS record, as long as the record returned resolves correctly in a subsequent query. Example with google:

preetha@preetha-work ~/go/src/github.com/hashicorp/raft (issue_229) $dig -t ns google.com

; <<>> DiG 9.10.3-P4-Ubuntu <<>> -t ns google.com
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 55731
;; flags: qr rd ra; QUERY: 1, ANSWER: 4, AUTHORITY: 0, ADDITIONAL: 1

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 4096
;; QUESTION SECTION:
;google.com.			IN	NS

;; ANSWER SECTION:
google.com.		4134	IN	NS	ns3.google.com.
google.com.		4134	IN	NS	ns4.google.com.
google.com.		4134	IN	NS	ns1.google.com.
google.com.		4134	IN	NS	ns2.google.com.

;; Query time: 30 msec
;; SERVER: 127.0.1.1#53(127.0.1.1)
;; WHEN: Fri Aug 04 10:02:30 CDT 2017
;; MSG SIZE  rcvd: 111

Note the lack of an additional section above^. But if I query for one of the name servers like below, I do get an IP back.

preetha@preetha-work ~/go/src/github.com/hashicorp/raft (issue_229) $dig -t ANY ns1.google.com

; <<>> DiG 9.10.3-P4-Ubuntu <<>> -t ANY ns1.google.com
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 14515
;; flags: qr rd ra; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 1

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 4096
;; QUESTION SECTION:
;ns1.google.com.			IN	ANY

;; ANSWER SECTION:
ns1.google.com.		319172	IN	A	216.239.32.10

;; Query time: 41 msec
;; SERVER: 127.0.1.1#53(127.0.1.1)
;; WHEN: Fri Aug 04 10:02:34 CDT 2017
;; MSG SIZE  rcvd: 59

@preetapan
Contributor Author

preetapan commented Aug 4, 2017

I pushed another commit that does the following. I think this is pretty close now:

  • Fixes the tests
  • Adds a table driven test for InvalidDNSRegexp
  • Removes the A record from the additional section when Type NS is requested. This is because I looked at Google (see above) and a few other sites, and it doesn't seem necessary because now we return a node name that already resolves correctly.
  • Makes the loop that looks for 3 servers ignore any servers whose names don't pass the valid DNS check. Seems like we should fail much faster (like during join, so that when DNS is enabled and the node name has special characters, we refuse to join). That's a much bigger change. As implemented, it now tries to find other server nodes that are valid, and warns on invalid nodes.
  • I don't think we need to worry about adding AAAA glue records given my changes, because that's handled already when you do a node lookup on the name we returned in the answer for the NS record.

I would suggest that the other configurability changes you suggested (like ttl and postmaster address being configurable) should be addressed in a followup issue, to keep the scope of this small. The SOA changes looked right to me after I read the specs above, so I left them alone.

@magiconair
Contributor

The difference is that the google NS records have a TTL > 0. They are also running a real DNS server on the internet with fixed addresses and rather static databases. Consul is a different use case and I don't think that you can compare them one-to-one.

Adding the A records to the Extra section means that the caller does not have to make another request to resolve the records. Also, if you are providing more than one NS record the caller has to make multiple calls to resolve them. Adding the glue records reduces latency and traffic. The main point is not to add the NS records and the glue records to all responses. I'd leave them in the SOA and NS responses.

@preetapan
Contributor Author

@magiconair I added the glue records back, and expanded on the test.

One note about unit testing - I couldn't come up with a way to create a test agent server that has an IPv6 address. I tried converting 127.0.0.1 into IPv6, but net.ParseIP parses it back so the agent sees it as an IPv4 address (https://play.golang.org/p/7F1wflOlQZ). (This was so that we could verify AAAA records in the NS response.)

@magiconair
Contributor

@preetapan Wouldn't ::1 work? I can have a look tomorrow
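
A small sketch of the behavior discussed in the two comments above, using only the standard library. An IPv4-mapped IPv6 address still converts with `To4()`, so code that checks `To4()` treats it as IPv4, while `::1` does not. `isIPv4` is a helper written for this example.

```go
package main

import (
	"fmt"
	"net"
)

// isIPv4 reports whether s parses as an address that Go can represent as
// IPv4. This includes IPv4-mapped IPv6 forms such as ::ffff:127.0.0.1.
func isIPv4(s string) bool {
	ip := net.ParseIP(s)
	return ip != nil && ip.To4() != nil
}

func main() {
	for _, s := range []string{"127.0.0.1", "::ffff:127.0.0.1", "::1"} {
		fmt.Printf("%-18s IPv4=%v\n", s, isIPv4(s))
	}
}
```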

t.Fatalf("err: %v", err)
}

if len(in.Answer) != 1 {
Contributor

I think using verify.Values works better in this case than comparing individual fields. I can have a quick look.

in string
invalid bool
}{
{"Valid Hostname", "testnode", false},
Contributor

try to include edge cases like "", ".", "a.b", " ", ...

I usually try to use generic values like a, ab, a-b and so forth to encode a pattern of what I'm testing. This way you don't have to invent names and you can use the pattern to group certain cases.

Feel free to disregard both suggestions if you feel that this is sufficient.
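
A table along the lines suggested above might look like the following. This is a sketch: the pattern and the `isInvalidLabel` helper are assumptions for illustration and may not match the regex in the patch, and treating the empty string as invalid is a choice made here.

```go
package main

import (
	"fmt"
	"regexp"
)

// invalidLabelRe is an assumed pattern for this sketch: it matches any
// character that is not a letter, a digit, or a hyphen.
var invalidLabelRe = regexp.MustCompile(`[^A-Za-z0-9\-]+`)

// isInvalidLabel reports whether s is unusable as a single DNS label
// under the assumed pattern. The empty string is treated as invalid.
func isInvalidLabel(s string) bool {
	return s == "" || invalidLabelRe.MatchString(s)
}

func main() {
	tests := []struct {
		in      string
		invalid bool
	}{
		{"", true},
		{".", true},
		{" ", true},
		{"a.b", true},
		{"a_b", true},
		{"a", false},
		{"ab", false},
		{"a-b", false},
	}
	for _, tt := range tests {
		if got := isInvalidLabel(tt.in); got != tt.invalid {
			fmt.Printf("FAIL %q: got %v, want %v\n", tt.in, got, tt.invalid)
		}
	}
}
```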

@preetapan
Contributor Author

@magiconair I didn't know about ::1, thanks. That does work, and is a good test case because it exposes that our parsing of the address returned by memberlist doesn't handle the `[ ]` brackets added by the String() method of TCPAddr (https://golang.org/src/net/ipsock.go#L186). I am working on an IPv6 specific test case now.
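
To illustrate the bracket problem: splitting on `:` breaks for IPv6 addresses, while `net.SplitHostPort` from the standard library handles both forms. This is a sketch of one possible fix, not necessarily what the final commit does. `hostOf` and `naiveHost` are names made up for the example, and port 8300 is only a sample value.

```go
package main

import (
	"fmt"
	"net"
	"strings"
)

// hostOf extracts the host part of a "host:port" string, including
// bracketed IPv6 forms such as "[::1]:8300".
func hostOf(addr string) (string, error) {
	host, _, err := net.SplitHostPort(addr)
	return host, err
}

// naiveHost reproduces the split-on-colon approach for comparison.
func naiveHost(addr string) string {
	return strings.Split(addr, ":")[0]
}

func main() {
	for _, addr := range []string{"127.0.0.1:8300", "[::1]:8300"} {
		host, err := hostOf(addr)
		fmt.Printf("addr=%q naive=%q splitHostPort=%q err=%v\n",
			addr, naiveHost(addr), host, err)
	}
}
```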

Preetha Appan and others added 12 commits August 7, 2017 11:11
…s using the advertise address of the server.
This patch changes the behavior of the DNS server as follows:

* The SOA response contains the SOA record in the Answer section instead
  of the Authority section. It also contains NS records in the Authority
  and the corresponding A glue records in the Extra section.
  In addition, CNAMEs are added to the Extra section to make the
  MNAME of the SOA record resolvable.

  AAAA glue records are not yet supported.

* The NS response returns up to three random servers from the
  consul cluster in the Answer section and the glue A
  records in the Extra section.

  AAAA glue records are not yet supported.
…d already resolves correctly. Also fixed all the unit tests, and ignored hostnames that don't meet valid dns hostname criteria
…ed same function used in node lookup for adding A/AAAA records in the extra section of the NS response
…to use verify library and other code review feedback
@magiconair
Contributor

I've rebased the branch and did a force push. Tests with running a slave zone on bind showed that bind had a format error FORMERR for the SOA record. I've dropped the CNAMEs and the complex node names in the node.DC.DOMAIN and then things looked better. Since we don't support zone transfers (AXFR) and shouldn't in the future I've added a NOTIMP response to AXFR queries.

I didn't manage to get a bind server configured with forwarding but the following worked:

zone "consul.example.com" in {
  type static-stub;
  server-addresses { 192.168.33.13; }; // this doesn't allow custom ports though.
};

@preetapan The changes broke the tests and I didn't get around to fixing them yet.

@magiconair
Contributor

I've completed the tests. Delegation via static-stub already works with consul 0.9.0 and bind 9.10.3. However, the responses the consul DNS server provides can be better, which this PR fixes. It also removes the FORMERR errors if you set up consul with slave delegation, which you should not do.

I did not manage to set up a forward zone. The following config is not sufficient to get this to work, although this is what the documentation says should do it:

zone "consul.example.com" {
    type forward;
    forward only;
    forwarders { 192.168.33.13; };
};

I ran a tcpdump on consul dns port but there was no traffic. I'm sure I am missing something trivial but I don't see it.

I've also found out that the Primary Master in the SOA record (ns.DOMAIN) is only relevant for Dynamic DNS updates since this is the server a caller should use to update the DNS database. Hence, any server should be fine but my suggestion is to leave it at ns.DOMAIN since we don't support dynamic DNS updates.

test setup

bind setup on 192.168.33.11

# install bind
sudo apt-get -y install bind9

configure bind (disable DNSSEC for this test)

sudo bash -c 'cat << EOF > /etc/bind/named.conf.options
options {
  directory "/var/cache/bind";
  dnssec-validation no;
  auth-nxdomain no;    # conform to RFC1035
  listen-on-v6 { any; };
};
EOF'
cat /etc/bind/named.conf.options

configure master and consul zone

sudo bash -c 'cat << EOF > /etc/bind/named.conf.local
zone "example.com" in {
  type master;
  file "master/example.com";
};
zone "consul.example.com" in {
  type static-stub;
  server-addresses { 192.168.33.13; };
};
EOF'
cat /etc/bind/named.conf.local

create master zone

sudo mkdir -p /var/cache/bind/master
sudo bash -c 'cat << EOF > /var/cache/bind/master/example.com
\$TTL 2d
\$ORIGIN example.com.
@   IN  SOA ns1.example.com. hostmaster.example.com. (
    2017070100
    2h
    15M
    3w12h
    2h20M
    )

    IN  NS  ns1.example.com.
ns1 IN  A   192.168.33.11
EOF'
cat /var/cache/bind/master/example.com
sudo chown -R bind:bind /var/cache/bind/master
ls -la /var/cache/bind/master/example.com

reload changes

sudo service bind9 restart
systemctl status bind9

test bind setup

$ dig @192.168.33.11 ns example.com.

; <<>> DiG 9.10.3-P4-Ubuntu <<>> @192.168.33.11 ns example.com.
; (1 server found)
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 8309
;; flags: qr aa rd ra; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 2

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 4096
;; QUESTION SECTION:
;example.com.           IN  NS

;; ANSWER SECTION:
example.com.        172800  IN  NS  ns1.example.com.

;; ADDITIONAL SECTION:
ns1.example.com.    172800  IN  A   192.168.33.11

;; Query time: 0 msec
;; SERVER: 192.168.33.11#53(192.168.33.11)
;; WHEN: Tue Aug 08 09:11:34 UTC 2017
;; MSG SIZE  rcvd: 74

consul setup

$ consul agent -server -data-dir data -domain consul.example.com -bootstrap-expect 3 -retry-join 192.168.33.13 -retry-interval 1s -bind 192.168.33.11 -client 192.168.33.11

$ consul agent -server -data-dir data -domain consul.example.com -bootstrap-expect 3 -retry-join 192.168.33.13 -retry-interval 1s -bind 192.168.33.12 -client 192.168.33.12

# on 192.168.33.13 add `-dns-port 53` since `static-stub` does not allow custom ports
$ sudo consul agent -server -data-dir data -domain consul.example.com -bootstrap-expect 3 -retry-join 192.168.33.13 -retry-interval 1s -bind 192.168.33.13 -client 192.168.33.13 -dns-port 53

# wait for "consul: New leader elected: consulXXX"

test consul direct

query SOA

$ dig @192.168.33.13 soa consul.example.com.

# consul 0.9.0: buggy response

    ;; OPT PSEUDOSECTION:
    ; EDNS: version: 0, flags:; udp: 4096
    ;; QUESTION SECTION:
    ;consul.example.com.        IN  SOA

    ;; AUTHORITY SECTION:
    consul.example.com. 0   IN  SOA ns.consul.example.com. postmaster.consul.example.com. 1502184707 3600 600 86400 0
    consul.example.com. 0   IN  SOA ns.consul.example.com. postmaster.consul.example.com. 1502184707 3600 600 86400 0

# consul 0.9.1rc1: OK

    ;; ANSWER SECTION:
    consul.example.com. 0   IN  SOA ns.consul.example.com. hostmaster.consul.example.com. 1502185620 3600 600 86400 0

    ;; AUTHORITY SECTION:
    consul.example.com. 0   IN  NS  consul2.node.dc1.consul.example.com.
    consul.example.com. 0   IN  NS  consul3.node.dc1.consul.example.com.
    consul.example.com. 0   IN  NS  consul1.node.dc1.consul.example.com.

    ;; ADDITIONAL SECTION:
    consul2.node.dc1.consul.example.com. 0 IN A 192.168.33.12
    consul3.node.dc1.consul.example.com. 0 IN A 192.168.33.13
    consul1.node.dc1.consul.example.com. 0 IN A 192.168.33.11

query NS

$ dig @192.168.33.13 ns consul.example.com.

# consul 0.9.0: FAIL

    ;; OPT PSEUDOSECTION:
    ; EDNS: version: 0, flags:; udp: 4096
    ;; QUESTION SECTION:
    ;consul.example.com.        IN  NS

    ;; AUTHORITY SECTION:
    consul.example.com. 0   IN  SOA ns.consul.example.com. postmaster.consul.example.com. 1502184728 3600 600 86400 0

# consul 0.9.1rc1: OK

    ;; OPT PSEUDOSECTION:
    ; EDNS: version: 0, flags:; udp: 4096
    ;; QUESTION SECTION:
    ;consul.example.com.        IN  NS

    ;; ANSWER SECTION:
    consul.example.com. 0   IN  NS  consul3.node.dc1.consul.example.com.
    consul.example.com. 0   IN  NS  consul1.node.dc1.consul.example.com.
    consul.example.com. 0   IN  NS  consul2.node.dc1.consul.example.com.

    ;; ADDITIONAL SECTION:
    consul3.node.dc1.consul.example.com. 0 IN A 192.168.33.13
    consul1.node.dc1.consul.example.com. 0 IN A 192.168.33.11
    consul2.node.dc1.consul.example.com. 0 IN A 192.168.33.12

query service

$ dig @192.168.33.13 consul.service.consul.example.com.

# consul 0.9.0: OK

    ;; OPT PSEUDOSECTION:
    ; EDNS: version: 0, flags:; udp: 4096
    ;; QUESTION SECTION:
    ;consul.service.consul.example.com. IN  A

    ;; ANSWER SECTION:
    consul.service.consul.example.com. 0 IN A   192.168.33.11
    consul.service.consul.example.com. 0 IN A   192.168.33.13
    consul.service.consul.example.com. 0 IN A   192.168.33.12

# consul 0.9.1rc1: OK

    ;; OPT PSEUDOSECTION:
    ; EDNS: version: 0, flags:; udp: 4096
    ;; QUESTION SECTION:
    ;consul.service.consul.example.com. IN  A

    ;; ANSWER SECTION:
    consul.service.consul.example.com. 0 IN A   192.168.33.12
    consul.service.consul.example.com. 0 IN A   192.168.33.11
    consul.service.consul.example.com. 0 IN A   192.168.33.13

test consul via bind

query SOA

$ dig @192.168.33.11 soa consul.example.com.

# consul 0.9.0: buggy response

    ;; OPT PSEUDOSECTION:
    ; EDNS: version: 0, flags:; udp: 4096
    ;; QUESTION SECTION:
    ;consul.example.com.        IN  SOA

    ;; AUTHORITY SECTION:
    consul.example.com. 0   IN  SOA ns.consul.example.com. postmaster.consul.example.com. 1502184799 3600 600 86400 0
    consul.example.com. 0   IN  SOA ns.consul.example.com. postmaster.consul.example.com. 1502184799 3600 600 86400 0

# consul 0.9.1rc1: OK

    ;; OPT PSEUDOSECTION:
    ; EDNS: version: 0, flags:; udp: 4096
    ;; QUESTION SECTION:
    ;consul.example.com.        IN  SOA

    ;; ANSWER SECTION:
    consul.example.com. 0   IN  SOA ns.consul.example.com. hostmaster.consul.example.com. 1502185704 3600 600 86400 0

    ;; AUTHORITY SECTION:
    consul.example.com. 0   IN  NS  consul2.node.dc1.consul.example.com.
    consul.example.com. 0   IN  NS  consul1.node.dc1.consul.example.com.
    consul.example.com. 0   IN  NS  consul3.node.dc1.consul.example.com.

    ;; ADDITIONAL SECTION:
    consul1.node.dc1.consul.example.com. 0 IN A 192.168.33.11
    consul2.node.dc1.consul.example.com. 0 IN A 192.168.33.12
    consul3.node.dc1.consul.example.com. 0 IN A 192.168.33.13

query NS

$ dig @192.168.33.11 ns consul.example.com.

# consul 0.9.0: FAIL

    ;; OPT PSEUDOSECTION:
    ; EDNS: version: 0, flags:; udp: 4096
    ;; QUESTION SECTION:
    ;consul.example.com.        IN  NS

    ;; AUTHORITY SECTION:
    consul.example.com. 0   IN  SOA ns.consul.example.com. postmaster.consul.example.com. 1502184819 3600 600 86400 0

# consul 0.9.1rc1: OK

    ;; OPT PSEUDOSECTION:
    ; EDNS: version: 0, flags:; udp: 4096
    ;; QUESTION SECTION:
    ;consul.example.com.        IN  NS

    ;; ANSWER SECTION:
    consul.example.com. 0   IN  NS  consul3.node.dc1.consul.example.com.
    consul.example.com. 0   IN  NS  consul1.node.dc1.consul.example.com.
    consul.example.com. 0   IN  NS  consul2.node.dc1.consul.example.com.

    ;; ADDITIONAL SECTION:
    consul1.node.dc1.consul.example.com. 0 IN A 192.168.33.11
    consul2.node.dc1.consul.example.com. 0 IN A 192.168.33.12
    consul3.node.dc1.consul.example.com. 0 IN A 192.168.33.13

query service

$ dig @192.168.33.11 consul.service.consul.example.com.

# consul 0.9.0: OK

    ;; OPT PSEUDOSECTION:
    ; EDNS: version: 0, flags:; udp: 4096
    ;; QUESTION SECTION:
    ;consul.service.consul.example.com. IN  A

    ;; ANSWER SECTION:
    consul.service.consul.example.com. 0 IN A   192.168.33.12
    consul.service.consul.example.com. 0 IN A   192.168.33.13
    consul.service.consul.example.com. 0 IN A   192.168.33.11

    ;; AUTHORITY SECTION:
    consul.example.com. 86400   IN  NS  consul.example.com.

# consul 0.9.1rc1: OK

    ;; OPT PSEUDOSECTION:
    ; EDNS: version: 0, flags:; udp: 4096
    ;; QUESTION SECTION:
    ;consul.service.consul.example.com. IN  A

    ;; ANSWER SECTION:
    consul.service.consul.example.com. 0 IN A   192.168.33.13
    consul.service.consul.example.com. 0 IN A   192.168.33.12
    consul.service.consul.example.com. 0 IN A   192.168.33.11

    ;; AUTHORITY SECTION:
    consul.example.com. 86400   IN  NS  consul.example.com.

@danparsons

Did this patch ever make it into consul? I'm running v1.0.1 here and desperately need this functionality, but the above functionality is definitely NOT in place.

@slackpad
Contributor

@danparsons this should be in 1.0.1 for sure - can you provide some more details about what you are expecting that's not there?

@preetapan
Contributor Author

Did a quick test with a local Consul agent, and definitely see NS records:

preetha@preetha-work ~ $dig @127.0.0.1 -p 8600 websvc.service.consul NS

; <<>> DiG 9.10.3-P4-Ubuntu <<>> @127.0.0.1 -p 8600 websvc.service.consul NS
; (1 server found)
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 65463
;; flags: qr aa rd; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 2
;; WARNING: recursion requested but not available

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 4096
;; QUESTION SECTION:
;websvc.service.consul.		IN	NS

;; ANSWER SECTION:
consul.			0	IN	NS	preetha-work.node.dc1.consul.

;; ADDITIONAL SECTION:
preetha-work.node.dc1.consul. 0	IN	A	127.0.0.1

;; Query time: 0 msec
;; SERVER: 127.0.0.1#8600(127.0.0.1)
;; WHEN: Tue Dec 12 10:46:04 CST 2017
;; MSG SIZE  rcvd: 102

@rodbellg

@preetapan , @danparsons ,

Hi,
I had the same problem with NS resolution. I just ran the following two tests on the same VM with the same consul configuration: one test with version 0.9.1 and the other with version 1.0.1.
It's OK with the older version (0.9.1)
but it does not work with the newer version (1.0.1)

Below the results of the dig tests I did

Consul v0.9.1
Protocol 2 spoken by default, understands 2 to 3 (agent will automatically use protocol >2 when speaking to compatible agents)


[root@vm-glbsquid-1-cs-2 ~]# dig @127.0.0.1 -p 8600 pxy.service.consul A

; <<>> DiG 9.9.4-RedHat-9.9.4-51.el7 <<>> @127.0.0.1 -p 8600 pxy.service.consul A
; (1 server found)
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 38846
;; flags: qr aa rd; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 1
;; WARNING: recursion requested but not available

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 4096
;; QUESTION SECTION:
;pxy.service.consul.		IN	A

;; ANSWER SECTION:
pxy.service.consul.	0	IN	A	10.19.20.95

;; Query time: 0 msec
;; SERVER: 127.0.0.1#8600(127.0.0.1)
;; WHEN: Wed Dec 13 14:09:33 CET 2017
;; MSG SIZE  rcvd: 63

[root@vm-glbsquid-1-cs-2 ~]# dig @127.0.0.1 -p 8600 pxy.service.consul NS

; <<>> DiG 9.9.4-RedHat-9.9.4-51.el7 <<>> @127.0.0.1 -p 8600 pxy.service.consul NS
; (1 server found)
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 51396
;; flags: qr aa rd; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 2
;; WARNING: recursion requested but not available

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 4096
;; QUESTION SECTION:
;pxy.service.consul.		IN	NS

;; ANSWER SECTION:
consul.			0	IN	NS	vm-glbsquid-1-cs-2.node.opsk35.consul.

;; ADDITIONAL SECTION:
vm-glbsquid-1-cs-2.node.opsk35.consul. 0 IN A	192.168.0.108

;; Query time: 0 msec
;; SERVER: 127.0.0.1#8600(127.0.0.1)
;; WHEN: Wed Dec 13 14:09:36 CET 2017
;; MSG SIZE  rcvd: 108

Consul v1.0.1
Protocol 2 spoken by default, understands 2 to 3 (agent will automatically use protocol >2 when speaking to compatible agents)

[root@vm-glbsquid-1-cs-2 ~]# dig @127.0.0.1 -p 8600 pxy.service.consul A

; <<>> DiG 9.9.4-RedHat-9.9.4-51.el7 <<>> @127.0.0.1 -p 8600 pxy.service.consul A
; (1 server found)
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 52848
;; flags: qr aa rd; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 1
;; WARNING: recursion requested but not available

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 4096
;; QUESTION SECTION:
;pxy.service.consul.		IN	A

;; ANSWER SECTION:
pxy.service.consul.	0	IN	A	10.19.20.95

;; Query time: 0 msec
;; SERVER: 127.0.0.1#8600(127.0.0.1)
;; WHEN: Wed Dec 13 14:07:05 CET 2017
;; MSG SIZE  rcvd: 63


[root@vm-glbsquid-1-cs-2 ~]# dig @127.0.0.1 -p 8600 pxy.service.consul NS

; <<>> DiG 9.9.4-RedHat-9.9.4-51.el7 <<>> @127.0.0.1 -p 8600 pxy.service.consul NS
; (1 server found)
;; global options: +cmd
;; connection timed out; no servers could be reached

@danparsons

@rodbellg Fascinating! Thanks for telling me it works on v0.9.1. I'm going to try that version right now and see if I get better results.

@preetapan
Contributor Author

The ";; connection timed out; no servers could be reached" line in the output above from @rodbellg is suspicious. Was the agent temporarily down at that time?

@danparsons

@preetapan Perhaps I have a mistake in my configuration that is causing this problem for me? Here's my config:

{
  "addresses": {
    "dns": "0.0.0.0",
    "http": "0.0.0.0"
  },
  "bind_addr": "0.0.0.0",
  "bootstrap_expect": 3,
  "data_dir": "/var/consul",
  "datacenter": "us-west-2",
  "domain": "consul.my.companys.domain",
  "enable_debug": true,
  "encrypt": "13 bytes blah blah",
  "log_level": "info",
  "ports": {
    "dns": 53
  },
  "recursors": [
    "my.aws.dns.server"
  ],
  "retry_interval": "15s",
  "retry_join": [
    "provider=aws tag_key=consul tag_value=usw2"
  ],
  "server": true,
  "ui": true
}

I get this output:

$ dig @consul-a.my.companys.domain consul.my.companys.domain

; <<>> DiG 9.9.7-P3 <<>> @consul-a.my.companys.domain consul.my.companys.domain
; (1 server found)
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NXDOMAIN, id: 36405
;; flags: qr aa rd ra; QUERY: 1, ANSWER: 0, AUTHORITY: 1, ADDITIONAL: 1

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 4096
;; QUESTION SECTION:
;consul.my.companys.domain. IN A

;; AUTHORITY SECTION:
consul.my.companys.domain. 0 IN SOA ns.consul.my.companys.domain. hostmaster.consul.my.companys.domain. 1513211804 3600 600 86400 0

;; Query time: 50 msec
;; SERVER: 10.45.98.119#53(10.45.98.119)
;; WHEN: Wed Dec 13 16:36:44 PST 2017
;; MSG SIZE rcvd: 97

See where it says IN SOA "ns.consul.my.companys.domain"? Where is it getting that value from? None of my consul servers have that hostname. Their names are consul-a.my.companys.domain etc.

In addition to the wrong SOA line, Consul will also not respond to queries for the NS records for consul.my.companys.domain:

$ host -t ns consul.my.companys.domain consul-a.my.companys.domain
Using domain server:
Name: consul-a.my.companys.domain
Address: 10.45.98.119#53
Aliases:

consul.my.companys.domain has no NS record

This is completely counter to the RFC for how NS records work, and because of this, Consul breaks NS delegation. The way it's supposed to work:

(1) The DNS server (Route53) for my.companys.domain has 3 NS records for "consul.my.companys.domain", one for each of my 3 Consul servers. This part works correctly because it's Route53.

(2) Consul ALSO has to serve (authoritatively) NS records for each of the 3 consul servers in the cluster. Above where it says "has no NS record", it should be showing e.g.:

consul.my.companys.domain. 300 IN NS consul-a.my.companys.domain.
consul.my.companys.domain. 300 IN NS consul-b.my.companys.domain.
consul.my.companys.domain. 300 IN NS consul-c.my.companys.domain.

But it doesn't. And because the two servers don't have identical output, all well-written DNS libraries ignore the NS records entirely, and then delegation to Consul fails.
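The mismatch described above can be made concrete with a small sketch. It assumes the NS names returned by the parent zone and by Consul have already been collected (for example from `host -t ns` output); `normalize`, `delegation_consistent` and the sample lists are illustrative names, not part of Consul or of any DNS library.

```python
def normalize(name: str) -> str:
    """Lower-case a DNS name and make sure it ends with a trailing dot."""
    name = name.strip().lower()
    return name if name.endswith(".") else name + "."


def delegation_consistent(parent_ns, child_ns):
    """Return True when parent and child zones advertise the same NS set."""
    return {normalize(n) for n in parent_ns} == {normalize(n) for n in child_ns}


# NS records held by the parent zone (Route53 in the comment above).
parent = [
    "consul-a.my.companys.domain.",
    "consul-b.my.companys.domain.",
    "consul-c.my.companys.domain.",
]
# What Consul answered for the same name: no NS records at all.
child = []

print(delegation_consistent(parent, child))   # False
print(delegation_consistent(parent, parent))  # True
```

With an empty answer from Consul the two sets can never match, which is the inconsistency being reported here.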

What am I doing wrong here? Thanks for reading!!!

@danparsons

BTW, I tried on 0.9.1 and saw identical behavior.

@preetapan
Contributor Author

The "ns" in that SOA record comes from a hardcoded default value. It is prepended to whatever you set 'domain' to in the config. We haven't made that part configurable yet.
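A minimal sketch of that naming rule, inferred only from the dig output earlier in this thread: the "ns" and "hostmaster" labels are placed in front of the configured domain. `soa_names` is a hypothetical helper for illustration and is not Consul's actual code.

```python
def soa_names(domain: str):
    """Build the SOA primary-server and mailbox names for a configured domain.

    The "ns" and "hostmaster" labels mirror the dig output in this thread;
    treating them as fixed prefixes is an assumption based on that output.
    """
    base = domain.strip(".") + "."
    return "ns." + base, "hostmaster." + base


mname, rname = soa_names("consul.my.companys.domain")
print(mname)  # ns.consul.my.companys.domain.
print(rname)  # hostmaster.consul.my.companys.domain.
```

Under this reading the ns.<domain> name is derived from the domain setting alone, so it does not have to match the hostname of any Consul server.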

For your second question about responding to NS records: Consul currently only responds to node, service and prepared query lookups, as per the documentation. So it won't understand "consul-a.my.companys.domain", but it will respond with NS records if you ask about "consul-a.node.consul.my.companys.domain" or "my-service.service.consul.my.companys.domain". It works this way because Consul is only responsible for DNS lookups for things it knows about, which are nodes, services and prepared queries.

Consul also responds with NS records for any prefix added to what you set "domain" to. So in your example you should be able to get NS records if you look up "consul-a.consul.my.companys.domain.".

See my examples below. I used a local agent again and configured the agent's domain to "consul.my.test.domain".

Example with node lookup:

$dig @127.0.0.1 -p 8600 my.node.consul.my.test.domain NS 

; <<>> DiG 9.10.3-P4-Ubuntu <<>> @127.0.0.1 -p 8600 my.node.consul.my.test.domain NS
; (1 server found)
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 46219
;; flags: qr aa rd; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 2
;; WARNING: recursion requested but not available

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 4096
;; QUESTION SECTION:
;my.node.consul.my.test.domain.	IN	NS

;; ANSWER SECTION:
consul.my.test.domain.	0	IN	NS	preetha-work.node.mydc.consul.my.test.domain.

;; ADDITIONAL SECTION:
preetha-work.node.mydc.consul.my.test.domain. 0	IN A 127.0.0.1

;; Query time: 0 msec
;; SERVER: 127.0.0.1#8600(127.0.0.1)
;; WHEN: Wed Dec 13 20:27:29 CST 2017
;; MSG SIZE  rcvd: 111

Example with service lookup:

$dig @127.0.0.1 -p 8600 my.service.consul.my.test.domain NS 

; <<>> DiG 9.10.3-P4-Ubuntu <<>> @127.0.0.1 -p 8600 my.service.consul.my.test.domain NS
; (1 server found)
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 19752
;; flags: qr aa rd; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 2
;; WARNING: recursion requested but not available

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 4096
;; QUESTION SECTION:
;my.service.consul.my.test.domain. IN	NS

;; ANSWER SECTION:
consul.my.test.domain.	0	IN	NS	preetha-work.node.mydc.consul.my.test.domain.

;; ADDITIONAL SECTION:
preetha-work.node.mydc.consul.my.test.domain. 0	IN A 127.0.0.1

;; Query time: 0 msec
;; SERVER: 127.0.0.1#8600(127.0.0.1)
;; WHEN: Wed Dec 13 20:27:40 CST 2017
;; MSG SIZE  rcvd: 114

Example with a random prefix "consul-test". Consul does not return any records for this one (status: SERVFAIL):

$dig @127.0.0.1 -p 8600 consul-test.my.test.domain NS 

; <<>> DiG 9.10.3-P4-Ubuntu <<>> @127.0.0.1 -p 8600 consul-test.my.test.domain NS
; (1 server found)
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: SERVFAIL, id: 26345
;; flags: qr rd; QUERY: 1, ANSWER: 0, AUTHORITY: 0, ADDITIONAL: 0
;; WARNING: recursion requested but not available

;; QUESTION SECTION:
;consul-test.my.test.domain.	IN	NS

;; Query time: 0 msec
;; SERVER: 127.0.0.1#8600(127.0.0.1)
;; WHEN: Wed Dec 13 20:28:20 CST 2017
;; MSG SIZE  rcvd: 44
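
One way to read the three examples above: the first two query names end in the configured domain, while the third does not. The sketch below performs that suffix check label by label. `under_domain` is a hypothetical helper, and this is not necessarily how Consul's DNS handler decides what to answer.

```python
def under_domain(qname: str, domain: str) -> bool:
    """Return True if qname equals the domain or is a subdomain of it."""
    q = qname.strip(".").lower().split(".")
    d = domain.strip(".").lower().split(".")
    # Compare whole labels so "consul-test" is never mistaken for "consul".
    return len(q) >= len(d) and q[-len(d):] == d


DOMAIN = "consul.my.test.domain"  # the agent's configured domain in the examples

for name in (
    "my.node.consul.my.test.domain",
    "my.service.consul.my.test.domain",
    "consul-test.my.test.domain",
):
    print(name, under_domain(name, DOMAIN))
```

Only the first two names fall under the configured domain, which lines up with the two NOERROR answers and the one SERVFAIL above.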

@danparsons

No matter what I do, I can't get consul to emit a single NS record, ever. Even following your examples.

For reference purposes, here's my 'consul members':

[root@danvpn ~]# consul members
Node                  Address             Status  Type    Build  Protocol  DC         Segment
consul-a.my.companys.domain  10.45.98.119:8301   alive   server  1.0.2  2         us-west-2  <all>
consul-b.my.companys.domain  10.45.152.128:8301  alive   server  1.0.2  2         us-west-2  <all>
consul-c.my.companys.domain  10.45.185.39:8301   alive   server  1.0.2  2         us-west-2  <all>
danvpn.my.companys.domain    10.45.126.195:8301  alive   client  1.0.2  2         us-west-2  <default>

Now, some digs:

First, I'd like to demonstrate that consul will at least return A records:

[root@danvpn ~]# dig @localhost -p 8600 consul-a.my.companys.domain.node.consul.my.companys.domain

; <<>> DiG 9.8.2rc1-RedHat-9.8.2-0.62.rc1.56.amzn1 <<>> @localhost -p 8600 consul-a.my.companys.domain.node.consul.my.companys.domain
; (1 server found)
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 47946
;; flags: qr aa rd; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 0
;; WARNING: recursion requested but not available

;; QUESTION SECTION:
;consul-a.my.companys.domain.node.consul.my.companys.domain. IN A

;; ANSWER SECTION:
consul-a.my.companys.domain.node.consul.my.companys.domain. 0	IN A 10.45.98.119

;; Query time: 1 msec
;; SERVER: 127.0.0.1#8600(127.0.0.1)
;; WHEN: Mon Dec 18 02:08:49 2017
;; MSG SIZE  rcvd: 78

Following your first example in your last post produces no NS records:

[root@danvpn ~]# dig @localhost -p 8600 consul-a.my.companys.domain.node.consul.my.companys.domain NS

; <<>> DiG 9.8.2rc1-RedHat-9.8.2-0.62.rc1.56.amzn1 <<>> @localhost -p 8600 consul-a.my.companys.domain.node.consul.my.companys.domain NS
; (1 server found)
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 27712
;; flags: qr aa rd; QUERY: 1, ANSWER: 0, AUTHORITY: 0, ADDITIONAL: 0
;; WARNING: recursion requested but not available

;; QUESTION SECTION:
;consul-a.my.companys.domain.node.consul.my.companys.domain. IN NS

;; Query time: 1 msec
;; SERVER: 127.0.0.1#8600(127.0.0.1)
;; WHEN: Mon Dec 18 02:09:15 2017
;; MSG SIZE  rcvd: 62

Why is this happening? Why is your consul able to emit NS records, but mine isn't?

@danparsons

Furthermore, how does your consul know to emit an NS record with "preetha-work.node.mydc.consul.my.test.domain."?

@TheVendiniPhil

TheVendiniPhil commented Jan 18, 2018

I've just upgraded my Consul to 1.0.2 and I'm not receiving NS answers.

Yes, I'm using the default .consul domain.

I do see SOA

dig @a002 -p 8600 consul.service.consul SOA

; <<>> DiG 9.10.3-P4-Ubuntu <<>> @a002 -p 8600 consul.service.consul SOA
; (1 server found)
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 45562
;; flags: qr aa rd; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 1
;; WARNING: recursion requested but not available

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 4096
;; QUESTION SECTION:
;consul.service.consul.		IN	SOA

;; ANSWER SECTION:
consul.			0	IN	SOA	ns.consul. hostmaster.consul. 1516267799 3600 600 86400 0

;; Query time: 1 msec
;; SERVER: 10.10.1.108#8600(10.10.1.108)
;; WHEN: Thu Jan 18 09:29:59 UTC 2018
;; MSG SIZE  rcvd: 100

But not NS

dig @a002 -p 8600 consul.service.consul NS

; <<>> DiG 9.10.3-P4-Ubuntu <<>> @a002 -p 8600 consul.service.consul NS
; (1 server found)
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 19740
;; flags: qr aa rd; QUERY: 1, ANSWER: 0, AUTHORITY: 0, ADDITIONAL: 1
;; WARNING: recursion requested but not available

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 4096
;; QUESTION SECTION:
;consul.service.consul.		IN	NS

;; Query time: 1 msec
;; SERVER: 10.10.1.108#8600(10.10.1.108)
;; WHEN: Thu Jan 18 09:33:28 UTC 2018
;; MSG SIZE  rcvd: 50

@slackpad
Contributor

Hi @TheVendiniPhil I'm not sure what's different about your setup, other than my version of dig on the Mac is a little older. If I run consul agent -dev with no other configuration then I get this:

$ dig @127.0.0.1 -p 8600 consul.service.consul NS

; <<>> DiG 9.9.7-P3 <<>> @127.0.0.1 -p 8600 consul.service.consul NS
; (1 server found)
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 24137
;; flags: qr aa rd; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 2
;; WARNING: recursion requested but not available

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 4096
;; QUESTION SECTION:
;consul.service.consul.         IN      NS

;; ANSWER SECTION:
consul.                 0       IN      NS      workpad.node.dc1.consul.

;; ADDITIONAL SECTION:
workpad.node.dc1.consul. 0      IN      A       127.0.0.1

;; Query time: 0 msec
;; SERVER: 127.0.0.1#8600(127.0.0.1)
;; WHEN: Wed Jan 24 17:37:05 PST 2018
;; MSG SIZE  rcvd: 97

@TheVendiniPhil

TheVendiniPhil commented Jan 25, 2018

From one of the Consul Servers (the cluster leader):

$ dig @127.0.0.1 -p 8600 consul.service.consul NS

; <<>> DiG 9.10.3-P4-Ubuntu <<>> @127.0.0.1 -p 8600 consul.service.consul NS
; (1 server found)
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 12764
;; flags: qr aa rd; QUERY: 1, ANSWER: 0, AUTHORITY: 0, ADDITIONAL: 1
;; WARNING: recursion requested but not available

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 4096
;; QUESTION SECTION:
;consul.service.consul.		IN	NS

;; Query time: 0 msec
;; SERVER: 127.0.0.1#8600(127.0.0.1)
;; WHEN: Thu Jan 25 16:39:18 UTC 2018
;; MSG SIZE  rcvd: 50

$ consul --version
Consul v1.0.2
Protocol 2 spoken by default, understands 2 to 3 (agent will automatically use protocol >2 when speaking to compatible agents)

$ uname -a
Linux a002 4.4.0-59-generic #80-Ubuntu SMP Fri Jan 6 17:47:47 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux

Consul is run from systemd

$ cat /etc/systemd/system/consul.service
[Unit]
Description=consul agent
Documentation=https://github.com/hashicorp/consul
Requires=network-online.target
After=network-online.target

[Service]
User=consul
Environment=GOMAXPROCS=2
Restart=on-failure
ExecStart=/usr/local/sbin/consul agent -config-dir=/etc/consul.d
ExecReload=/bin/kill -HUP $MAINPID
KillSignal=SIGINT

[Install]
WantedBy=multi-user.target

And the consul configuration is:

$ cat /etc/consul.d/*
{
  "enable_script_checks": true,
  "data_dir": "/mnt/consul",
  "ui": true,
  "log_level": "INFO",
  "enable_syslog": true,
  "client_addr": "0.0.0.0"
}
{
  "datacenter": "<dc name>",
  "bootstrap_expect": 3
}
{
  "server": true,
  "dns_config": {
      "allow_stale": true,
      "node_ttl": "15s",
      "service_ttl": {
        "*": "15s"
        },
      "enable_truncate": true,
      "only_passing": true
    },
    "performance": {
      "raft_multiplier": 1
    }
}
{
  "retry_join": [<lan join IPs>],
  "retry_join_wan": [<WAN join IPs>]
}
{
  "bind_addr": "<my IP>",
  "node_name": "<my fqdn>"
}

@TheVendiniPhil

TheVendiniPhil commented Jan 25, 2018

So I went to another server (in our lab) and stopped the running, configured Consul service there.

I then started Consul as a standalone -dev instance with zero configuration, as per your example.

/usr/local/sbin/consul agent -dev

And I still get the same lack of NS answer.

dig @127.0.0.1 -p 8600 consul.service.consul NS

; <<>> DiG 9.10.3-P4-Ubuntu <<>> @127.0.0.1 -p 8600 consul.service.consul NS
; (1 server found)
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 34020
;; flags: qr aa rd; QUERY: 1, ANSWER: 0, AUTHORITY: 0, ADDITIONAL: 1
;; WARNING: recursion requested but not available

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 4096
;; QUESTION SECTION:
;consul.service.consul.		IN	NS

;; Query time: 4 msec
;; SERVER: 127.0.0.1#8600(127.0.0.1)
;; WHEN: Thu Jan 25 16:58:25 UTC 2018
;; MSG SIZE  rcvd: 50

@TheVendiniPhil

TheVendiniPhil commented Jan 25, 2018

Further investigation shows that Consul wants to provide an NS answer, but is unhappy about the node name:

Jan 25 17:07:57 a002 consul[10929]: 2018/01/25 17:07:57 [WARN] dns: Skipping invalid node "a004.oak.vendini.com" for NS records

What's invalid about that?

It's the actual hostname, that hostname IS the nodename as configured in consul, and it's a valid name in DNS.

Aargh! Apparently "node_name" is NOT allowed to be the FQDN; it needs to be the value from "hostname -s". (Sometimes the Consul documentation could be a little more explicit.)

@preetapan
Contributor Author

preetapan commented Feb 2, 2018

@TheVendiniPhil We validate that the node name matches RFC 1123. We internally use "." to separate out other aspects of the name (like adding the datacenter name to the node name), so having dots in the name breaks things.
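
A rough sketch of why a dotted node name is a problem, assuming the check amounts to "the node name must be a single RFC 1123 label" and that NS names follow the <node>.node.<datacenter>.<domain> pattern seen in the dig output earlier in this thread. The regular expression and both helper names are illustrative; Consul's actual validation may differ in detail.

```python
import re

# One RFC 1123 label: letters, digits and hyphens, 1 to 63 characters,
# starting and ending with a letter or digit. Dots are not allowed.
LABEL_RE = re.compile(r"[A-Za-z0-9]([A-Za-z0-9-]{0,61}[A-Za-z0-9])?")


def is_single_label(node_name: str) -> bool:
    """Return True if node_name is usable as a single DNS label."""
    return LABEL_RE.fullmatch(node_name) is not None


def node_dns_name(node: str, datacenter: str, domain: str) -> str:
    """Compose <node>.node.<datacenter>.<domain>. as seen in the dig output."""
    return f"{node}.node.{datacenter}.{domain.strip('.')}."


print(is_single_label("a004"))                  # True
print(is_single_label("a004.oak.vendini.com"))  # False
print(node_dns_name("preetha-work", "mydc", "consul.my.test.domain"))
# preetha-work.node.mydc.consul.my.test.domain.
```

Because the dot is the separator between the node, "node", datacenter and domain parts, a node name that itself contains dots cannot be told apart from those parts, which is consistent with the "Skipping invalid node" warning quoted above.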

You are right that the documentation does not call this out and that the agent accepts node names with dots in them. I have created issue #3854 to track this.
