Description
TL;DR: I think landrush's recursive DNS limitations cause pain on Linux when dnsmasq is in use by both libvirt and NetworkManager and guests are, by default, redirected to landrush via iptables.
Key issue from VM guest with landrush defaults
$ dig -p 10053 @127.0.0.1 www.google.com
...
;; ->>HEADER<<- opcode: QUERY, status: SERVFAIL, id: 11678
;; flags: qr aa rd; QUERY: 1, ANSWER: 0, AUTHORITY: 0, ADDITIONAL: 0
;; WARNING: recursion requested but not available
...
Workaround
Setting config.landrush.guest_redirect_dns = false avoids the pain!
But then, when set to false:
- VM guest can't find the VM host IP.
  - Resolving the VM host's name now incorrectly returns a localhost address, not the real VM interface! (The guest talks to itself instead of the VM host.)
  - Previously, when set to true, the VM guest failed to resolve the VM host at all instead of incorrectly returning the localhost address.
- Host can still find VMs.
  - Resolving the guest VM's FQDN from the host works.
- VMs can still find each other.
  - Resolving another guest VM's FQDN from within a guest works.
Workaround side-effect on VM server host resolution
Here's an example where, with guest_redirect_dns set to false, the VM host server IP doesn't resolve correctly from within a guest.
$ nslookup <VM server hostname>
Server: 192.168.121.1
Address: 192.168.121.1#53
Name: <VM server hostname>
Address: 127.0.1.1
And with the default, true:
$ nslookup <VM server hostname>
Server: 192.168.121.1
Address: 192.168.121.1#53
** server can't find <VM server hostname>: SERVFAIL
Potentially Related Issues
Originally, I had the same symptoms as #198. No matter which host I pinged, landrush seemed to end up 'wildcarding' the FQDNs of external hosts and appending the configured 'local' TLD (in my case, vagrant.test). This might be to do with search vagrant.test being put into /etc/resolv.conf for guests...
And then there are extra complications noted... which relate more to #252 and possibly #174.
More or less default / minimal config causes this upstream DNS resolution bug
A fair bit of verbose context/info - jump down to the dig command that backs up what I saw in network packet captures. landrush DNS (with my stack) can't handle recursive queries.
Vagrantfile:
config.landrush.enabled = true
config.landrush.tld = 'vagrant.test'
/etc/NetworkManager/dnsmasq.d/vagrant-landrush (because Ubuntu, like Fedora, ships with NetworkManager, which already has dnsmasq plugged in):
server=/vagrant.test/127.0.0.1#10053
libvirt provides DNS on the virbr1 network spun up by the vagrant libvirt provider. On the guest VM:
$ cat /etc/resolv.conf
# Generated by NetworkManager
search vagrant.test
nameserver 192.168.121.1
libvirt is also using dnsmasq... So yay, three layers of DNS (two of them dnsmasq) that need to play nice together, landrush -> libvirt -> NetworkManager :-/
On the host, various DNS services are listening
$ sudo netstat -lntp | grep 53
tcp 0 0 0.0.0.0:10053 0.0.0.0:* LISTEN 5179/ruby
tcp 0 0 192.168.121.1:53 0.0.0.0:* LISTEN 5155/dnsmasq
tcp 0 0 127.0.1.1:53 0.0.0.0:* LISTEN 3321/dnsmasq
By the way, I'm not sure why landrush decides to listen on all interfaces (0.0.0.0) rather than just the network vagrant is provisioning (i.e. 192.168.121.1). Maybe it has something to do with config.landrush.host_redirect_dns (I should probably file a separate bug for this, but I digress).
Checking what happened with iptables on the VM host shows another potential mess with multiple allows for both UDP and TCP.
Chain INPUT (policy ACCEPT)
target prot opt source destination
ACCEPT udp -- 0.0.0.0/0 0.0.0.0/0 udp dpt:53
ACCEPT tcp -- 0.0.0.0/0 0.0.0.0/0 tcp dpt:53
...
And on the guest
# iptables -t nat -L -n
Chain PREROUTING (policy ACCEPT)
target prot opt source destination
Chain INPUT (policy ACCEPT)
target prot opt source destination
Chain OUTPUT (policy ACCEPT)
target prot opt source destination
DNAT tcp -- 0.0.0.0/0 192.168.121.1 tcp dpt:53 to:192.168.121.1:10053
DNAT udp -- 0.0.0.0/0 192.168.121.1 udp dpt:53 to:192.168.121.1:10053
Chain POSTROUTING (policy ACCEPT)
target prot opt source destination
On the host, external names resolve fine via libvirt's DNS, e.g.
$ nslookup www.google.co.za 192.168.121.1
Server: 192.168.121.1
Address: 192.168.121.1#53
Non-authoritative answer:
Name: www.google.co.za
Address: 216.58.223.3
On the libvirt guest, it fails, oddly, with the TLD appended:
# nslookup www.google.com
Server: 192.168.121.1
Address: 192.168.121.1#53
** server can't find www.google.com.vagrant.test: SERVFAIL
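The appended TLD is consistent with how stub resolvers treat the search line in /etc/resolv.conf. As a rough sketch only, assuming the rules documented in resolv.conf(5) with the default ndots of 1 (search_candidates is a hypothetical helper written for illustration, not part of landrush or any resolver), the names tried for a query can be listed like this:

```shell
# Print, in order, the names a stub resolver would try for a query,
# following the search/ndots rules described in resolv.conf(5).
# Usage: search_candidates NAME NDOTS [SEARCH_DOMAIN...]
search_candidates() {
  name=$1; ndots=$2; shift 2          # remaining arguments are search domains
  case $name in
    *.) echo "$name"; return 0 ;;     # trailing dot: already fully qualified
  esac
  only_dots=$(printf '%s' "$name" | tr -cd '.')
  dots=${#only_dots}                  # number of dots in the name
  # Enough dots: the name is tried as-is before the search list.
  [ "$dots" -ge "$ndots" ] && echo "$name."
  for domain in "$@"; do echo "$name.$domain."; done
  # Too few dots: the bare name is only tried after the search list.
  [ "$dots" -lt "$ndots" ] && echo "$name."
  return 0
}

search_candidates www.google.com 1 vagrant.test
# www.google.com.
# www.google.com.vagrant.test.
```

If the first candidate fails and the tool reports the last name it tried, that would explain why the error mentions www.google.com.vagrant.test rather than www.google.com.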
When doing a packet trace on the virbr1 (vagrant provisioned) interface of the VM host, with nslookup run from the guest (192.168.121.102), I observed multiple DNS query attempts:
- 1st go (doesn't append the landrush TLD), e.g. 192.168.121.102 -> 192.168.121.1:10053
  - DNS query from guest IP for www.google.com: type A, class IN to landrush DNS on host (port 10053) listening on all interfaces, including the VM host interface (192.168.121.1)
    - has 0x0100 flags
      - asking for recursion
      - indicating non-authenticated data is unacceptable
  - DNS response from landrush DNS on host seems to suggest that a recursive DNS query is not permitted
    - has 0x8502 flags
      - recursion not allowed
      - answer not authenticated
- 2nd go (does append the landrush TLD)
  - Same as above, except now DNS query from guest IP for www.google.com.vagrant.test: type A, class IN
  - probably default code logic to try appending the landrush TLD if the first attempt failed?
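To double-check the flag values from the packet trace, here is a minimal sketch that decodes the 16-bit DNS header flags word using the bit layout from RFC 1035, section 4.1.1. The function name decode_dns_flags is made up for this example; it only covers the bits relevant here.

```shell
# Decode a DNS header flags word (RFC 1035, section 4.1.1) into the
# short flag names that dig prints, plus the numeric response code.
decode_dns_flags() {
  flags=$(( $1 ))                     # accepts hex input such as 0x8502
  out=""
  [ $(( flags & 0x8000 )) -ne 0 ] && out="${out}qr "   # message is a response
  [ $(( flags & 0x0400 )) -ne 0 ] && out="${out}aa "   # authoritative answer
  [ $(( flags & 0x0200 )) -ne 0 ] && out="${out}tc "   # truncated
  [ $(( flags & 0x0100 )) -ne 0 ] && out="${out}rd "   # recursion desired
  [ $(( flags & 0x0080 )) -ne 0 ] && out="${out}ra "   # recursion available
  echo "${out}rcode=$(( flags & 0x000F ))"             # rcode 2 is SERVFAIL
}

decode_dns_flags 0x0100   # rd rcode=0
decode_dns_flags 0x8502   # qr aa rd rcode=2
```

The response word has rd set but not ra, with rcode 2 (SERVFAIL), which lines up with the "qr aa rd" flags and the "recursion requested but not available" warning that dig reports.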
Queries didn't make it to 127.0.1.1:53 (NetworkManager's dnsmasq; later I also tested upstream).
When using nslookup from the host, I noticed this (working) behaviour where queries did make it to 127.0.1.1:53 (NetworkManager's dnsmasq):
- DNS query from host via host to itself on the virbr1 interface 192.168.121.1 -> 192.168.121.1:53
- flags in response from DNS service say recursion is allowed!
- Triggers a forwarded (recursive) DNS query from the dnsmasq part on 127.0.0.1 to 127.0.1.1:53
- 127.0.1.1 must have then queried the upstream DNS (as managed by NetworkManager) and responded correctly
- 192.168.121.1 responds to itself.
Reading the man page for dnsmasq, I noticed the following:
Dnsmasq is a DNS query forwarder: it is not capable of recursively answering arbitrary queries starting from the root servers but forwards such queries to a fully recursive upstream DNS server which is typically provided by an ISP.
So at a guess, landrush -> libvirt -> NetworkManager causes issues with a recursive DNS query? To confirm this, I also poked at landrush from the VM host:
$ dig -p 10053 @127.0.0.1 www.google.com
; <<>> DiG 9.10.3-P4-Ubuntu <<>> -p 10053 @127.0.0.1 www.google.com
; (1 server found)
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: SERVFAIL, id: 11678
;; flags: qr aa rd; QUERY: 1, ANSWER: 0, AUTHORITY: 0, ADDITIONAL: 0
;; WARNING: recursion requested but not available
;; QUESTION SECTION:
;www.google.com. IN A
;; Query time: 2003 msec
;; SERVER: 127.0.0.1#10053(127.0.0.1)
;; WHEN: Thu Oct 20 22:59:54 SAST 2016
;; MSG SIZE rcvd: 32
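To compare the three listeners (landrush on 10053, libvirt's dnsmasq on 192.168.121.1, NetworkManager's dnsmasq on 127.0.1.1) quickly, a small filter over dig output might help. This is only a sketch: summarise_dig is a hypothetical helper, and it assumes the header layout shown in the dig output above.

```shell
# Read dig output on stdin and print the status, the header flags, and
# whether the "ra" (recursion available) flag was set in the response.
summarise_dig() {
  response=$(cat)
  status=$(printf '%s\n' "$response" | sed -n 's/.*status: \([A-Z]*\),.*/\1/p')
  flags=$(printf '%s\n' "$response" | sed -n 's/^;; flags: \([a-z ]*\);.*/\1/p')
  case " $flags " in
    *" ra "*) recursion="available" ;;
    *)        recursion="not available" ;;
  esac
  echo "status=$status flags=[$flags] recursion=$recursion"
}

# Against a live server (not run here):
#   dig -p 10053 @127.0.0.1 www.google.com | summarise_dig

# With the header lines captured above:
summarise_dig <<'EOF'
;; ->>HEADER<<- opcode: QUERY, status: SERVFAIL, id: 11678
;; flags: qr aa rd; QUERY: 1, ANSWER: 0, AUTHORITY: 0, ADDITIONAL: 0
EOF
# status=SERVFAIL flags=[qr aa rd] recursion=not available
```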
I also hacked in config.landrush.upstream '127.0.1.1' to explicitly get landrush to target NetworkManager's dnsmasq, but no luck. I also tried the real upstream DNS servers found via:
for d in $(nmcli device show | grep -E "^IP4.DNS" | grep -oP '(\d{1,3}\.){3}\d{1,3}'); do echo $d; done
Doesn't work. Seems landrush doesn't pass on recursive DNS, even directly to upstream!
All of the above was with the following setup (I try to keep to base/stable repos as far as possible):
- Ubuntu 16.04.1 LTS
- landrush (1.1.2)
- Vagrant 1.8.1
- Libvirt version: 1.3.1