
DNS resolution fails if default search domain has a wildcard match #17316

Open
ikus060 opened this issue Nov 14, 2017 · 13 comments
Labels
component/networking, kind/bug, lifecycle/frozen, priority/P2

Comments

@ikus060

ikus060 commented Nov 14, 2017

Name resolution from inside the pod appears to be broken because of multiple factors.

Version
# oc version
oc v3.7.0-rc.0+e92d5c5
kubernetes v1.7.6+a08f5eeb62
features: Basic-Auth GSSAPI Kerberos SPNEGO

Server https://127.0.0.1:8443
openshift v3.7.0-rc.0+e92d5c5
kubernetes v1.7.6+a08f5eeb62
Steps To Reproduce

It looks like the /etc/resolv.conf file generated by OpenShift does not work in every scenario.

First, to show that resolution works with a plain configuration:

# cat /etc/resolv.conf
nameserver 8.8.8.8
search patrikdufresne.com

# nslookup -debug dl-cdn.alpinelinux.org
Server:		8.8.8.8
Address:	8.8.8.8#53

------------
    QUESTIONS:
	dl-cdn.alpinelinux.org, type = A, class = IN
    ANSWERS:
    ->  dl-cdn.alpinelinux.org
	canonical name = global.prod.fastly.net.
	ttl = 59
    ->  global.prod.fastly.net
	internet address = 151.101.0.249
	ttl = 19
    ->  global.prod.fastly.net
	internet address = 151.101.64.249
	ttl = 19
    ->  global.prod.fastly.net
	internet address = 151.101.128.249
	ttl = 19
    ->  global.prod.fastly.net
	internet address = 151.101.192.249
	ttl = 19
    AUTHORITY RECORDS:
    ADDITIONAL RECORDS:
------------
Non-authoritative answer:
dl-cdn.alpinelinux.org	canonical name = global.prod.fastly.net.
Name:	global.prod.fastly.net
Address: 151.101.0.249
Name:	global.prod.fastly.net
Address: 151.101.64.249
Name:	global.prod.fastly.net
Address: 151.101.128.249
Name:	global.prod.fastly.net
Address: 151.101.192.249

This is the /etc/resolv.conf generated in the pod; resolution does not work:

# cat /etc/resolv.conf 
nameserver 8.8.8.8
search default.svc.cluster.local svc.cluster.local cluster.local patrikdufresne.com
options ndots:5

# nslookup -debug dl-cdn.alpinelinux.org
Server:		8.8.8.8
Address:	8.8.8.8#53

------------
    QUESTIONS:
	dl-cdn.alpinelinux.org.default.svc.cluster.local, type = A, class = IN
    ANSWERS:
    AUTHORITY RECORDS:
    ->  .
	origin = a.root-servers.net
	mail addr = nstld.verisign-grs.com
	serial = 2017111401
	refresh = 1800
	retry = 900
	expire = 604800
	minimum = 86400
	ttl = 86385
    ADDITIONAL RECORDS:
------------
** server can't find dl-cdn.alpinelinux.org.default.svc.cluster.local: NXDOMAIN
Server:		8.8.8.8
Address:	8.8.8.8#53

------------
    QUESTIONS:
	dl-cdn.alpinelinux.org.svc.cluster.local, type = A, class = IN
    ANSWERS:
    AUTHORITY RECORDS:
    ->  .
	origin = a.root-servers.net
	mail addr = nstld.verisign-grs.com
	serial = 2017111401
	refresh = 1800
	retry = 900
	expire = 604800
	minimum = 86400
	ttl = 86394
    ADDITIONAL RECORDS:
------------
** server can't find dl-cdn.alpinelinux.org.svc.cluster.local: NXDOMAIN
Server:		8.8.8.8
Address:	8.8.8.8#53

------------
    QUESTIONS:
	dl-cdn.alpinelinux.org.cluster.local, type = A, class = IN
    ANSWERS:
    AUTHORITY RECORDS:
    ->  .
	origin = a.root-servers.net
	mail addr = nstld.verisign-grs.com
	serial = 2017111401
	refresh = 1800
	retry = 900
	expire = 604800
	minimum = 86400
	ttl = 86378
    ADDITIONAL RECORDS:
------------
** server can't find dl-cdn.alpinelinux.org.cluster.local: NXDOMAIN
Server:		8.8.8.8
Address:	8.8.8.8#53

------------
    QUESTIONS:
	dl-cdn.alpinelinux.org.patrikdufresne.com, type = A, class = IN
    ANSWERS:
    AUTHORITY RECORDS:
    ->  patrikdufresne.com
	origin = ns2.no-ip.com
	mail addr = hostmaster.no-ip.com
	serial = 2010091255
	refresh = 10800
	retry = 1800
	expire = 604800
	minimum = 1800
	ttl = 1799
    ADDITIONAL RECORDS:
------------
Non-authoritative answer:
*** Can't find dl-cdn.alpinelinux.org: No answer

If I remove my domain name patrikdufresne.com, resolution works:

# cat /etc/resolv.conf 
nameserver 8.8.8.8
search default.svc.cluster.local svc.cluster.local cluster.local
options ndots:5
root@tymara:/home/ikus060# nslookup dl-cdn.alpinelinux.org
Server:		8.8.8.8
Address:	8.8.8.8#53

Non-authoritative answer:
dl-cdn.alpinelinux.org	canonical name = global.prod.fastly.net.
Name:	global.prod.fastly.net
Address: 151.101.0.249
Name:	global.prod.fastly.net
Address: 151.101.64.249
Name:	global.prod.fastly.net
Address: 151.101.128.249
Name:	global.prod.fastly.net
Address: 151.101.192.249

It also works if I remove ndots:5:

# cat /etc/resolv.conf 
nameserver 8.8.8.8
search default.svc.cluster.local svc.cluster.local cluster.local patrikdufresne.com
root@tymara:/home/ikus060# nslookup dl-cdn.alpinelinux.org
Server:		8.8.8.8
Address:	8.8.8.8#53

Non-authoritative answer:
dl-cdn.alpinelinux.org	canonical name = global.prod.fastly.net.
Name:	global.prod.fastly.net
Address: 151.101.0.249
Name:	global.prod.fastly.net
Address: 151.101.64.249
Name:	global.prod.fastly.net
Address: 151.101.128.249
Name:	global.prod.fastly.net
Address: 151.101.192.249
@pweil- added the component/networking, kind/bug, and priority/P2 labels on Nov 15, 2017
@johnfosborneiii

I ran into this exact same issue with a fresh installation of OCP 3.7 on a RHEL 7.4 VM.

The outbound networking worked from the VM. The outbound networking also worked when I ran a container out of band from Kubernetes (using docker run). When OCP ran the container, the outbound networking broke, but it could be fixed by removing either options ndots:5 or "search josborne.com". I couldn't figure out where "search josborne.com" was even coming from, because I didn't set that anywhere in the Ansible advanced installation. I changed my /etc/hostname file from openshift.josborne.com to openshift and rebooted. At that point "search josborne.com" was removed from the pod /etc/resolv.conf and everything started working. Is this user error or a bug? I've installed every release of OCP from scratch using an FQDN in my /etc/hostname file, and it first broke in either 3.6 or 3.7, so I think something has changed in the platform.

@danwinship changed the title from "DNS resolution is failing" to "DNS resolution fails if default search domain has a wildcard match" on Feb 13, 2018
@danwinship
Contributor

Right, so the problem is that if the domain listed in the search line does wildcard matching, then because of ndots:5, basically all hostnames end up being treated as subdomains of the default domain. E.g., *.josborne.com appears to resolve to a particular AWS hostname, so if you look up, say, github.com, it ends up matching as github.com.josborne.com, which resolves to the AWS IP.
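
To make this concrete: "github.com" has one dot, which is fewer than ndots:5, so the libc resolver appends each search suffix before ever trying the literal name. Under the pod search list from the report above, a lookup such as getent hosts github.com issues these queries in order (a sketch, assuming the *.josborne.com wildcard):

github.com.default.svc.cluster.local.   -> NXDOMAIN, try next suffix
github.com.svc.cluster.local.           -> NXDOMAIN, try next suffix
github.com.cluster.local.               -> NXDOMAIN, try next suffix
github.com.josborne.com.                -> answered by the wildcard record (wrong address)
github.com.                             -> never queried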

I guess the search field in the pod resolv.conf is set automatically from the node hostname?

What we really want is to make service name lookups behave like ndots:5, but make other lookups not do that. We can't make the libc resolver do that, but in cases where we're running a DNS server inside the cluster, we could do the ndots-like special-casing inside that server, and then we could give the pods a resolv.conf without ndots.
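
One existing example of this approach is CoreDNS's autopath plugin, which completes the search path on the server side so pods no longer need ndots:5 to resolve short service names. A minimal Corefile sketch, assuming the cluster DNS is CoreDNS (which is not the case for the OpenShift 3.x releases discussed here):

.:53 {
    errors
    kubernetes cluster.local in-addr.arpa ip6.arpa {
        pods verified            # autopath needs to map the client IP to a pod namespace
    }
    autopath @kubernetes         # follow the search path server-side
    forward . /etc/resolv.conf   # everything else goes to the upstream resolver
    cache 30
}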

The other possibility would be to stop including the node's domain in the pod resolv.conf's search field, but that would break any existing pods that were depending on the current behavior, so we'd need some sort of compatibility option.
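
As a per-pod mitigation on Kubernetes versions that support dnsConfig (1.10+), a pod can also opt out of the generated search list entirely; a sketch, with placeholder values for the nameserver, pod name, and image:

apiVersion: v1
kind: Pod
metadata:
  name: dns-opt-out                    # hypothetical example pod
spec:
  dnsPolicy: "None"                    # ignore the node-derived resolv.conf
  dnsConfig:
    nameservers:
      - 10.96.0.10                     # cluster DNS service IP (placeholder; use your cluster's value)
    searches:
      - default.svc.cluster.local      # keep only the cluster suffixes,
      - svc.cluster.local              # dropping the node's own domain
      - cluster.local
    options:
      - name: ndots
        value: "1"                     # external names are tried as-is first
  containers:
    - name: shell
      image: busybox                   # placeholder image
      command: ["sleep", "3600"]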

@ikus060
Author

ikus060 commented Feb 13, 2018

Since the way to install OpenShift is with the Ansible playbook, I would add extra validation in Ansible to make sure the provided DNS domain behaves as expected. If not, the playbook should fail and warn the user.
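
Such a validation could be as simple as probing a random label under the configured domain before the install proceeds; a sketch, not an existing playbook task (host comes from bind-utils, and the probe label is made up):

# If a random label under the search domain resolves, the domain has a
# wildcard record and will shadow external lookups once ndots:5 is in play.
if host "wildcard-probe-$RANDOM.patrikdufresne.com" > /dev/null 2>&1; then
    echo "FAIL: wildcard DNS detected on the default search domain"
    exit 1
fi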

@openshift-bot
Contributor

Issues go stale after 90d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle stale

@openshift-ci-robot added the lifecycle/stale label on May 14, 2018
@openshift-bot
Contributor

Stale issues rot after 30d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle rotten
/remove-lifecycle stale

@openshift-ci-robot added the lifecycle/rotten label and removed the lifecycle/stale label on Jun 13, 2018
@gbraad
Contributor

gbraad commented Jul 8, 2018

This is still an issue.
/remove-lifecycle rotten

@openshift-ci-robot removed the lifecycle/rotten label on Jul 8, 2018
@gbraad
Contributor

gbraad commented Jul 8, 2018

For Minishift this is an issue with some hypervisors that force a search entry through the DHCP offer. E.g., Hyper-V on the "default switch" uses search mshome.net, which can cause lookups to github.com during S2I to fail.

@gbraad
Contributor

gbraad commented Jul 9, 2018

Note: the options ndots:5 setting has been part of Kubernetes since about 2015 => kubernetes/kubernetes@23caf44#diff-0db82891d463ba14dd59da9c77f4776eR66 (ref: kubernetes/kubernetes#10266)

@xpflying

Same issue with an Ansible install of OpenShift 3.10.

@shadowlord017

Same for me:
ndots:5 makes the resolver append the domains from the search line before checking the original address.
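
For reference, the resolver rule behind this (resolv.conf(5)): a name with fewer dots than the ndots value is tried with each search suffix appended before being tried as an absolute name, and a trailing dot marks the name as absolute so the search list is skipped. Inside a pod with the configuration from this report:

# nslookup dl-cdn.alpinelinux.org     <- 2 dots < 5: search suffixes tried first, wildcard can hijack the lookup
# nslookup dl-cdn.alpinelinux.org.    <- trailing dot: absolute name, search list bypassed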

@openshift-bot
Contributor

Issues go stale after 90d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle stale

@openshift-ci-robot added the lifecycle/stale label on Jan 30, 2019
@danwinship
Contributor

/remove-lifecycle stale
/lifecycle frozen

@openshift-ci-robot added the lifecycle/frozen label and removed the lifecycle/stale label on Jan 30, 2019
@sponte

sponte commented Oct 26, 2020

Hello, is there a workaround for this? I seem to be facing the same issue with k8s 1.19 and CoreDNS: my external domain, which is part of the DNS search path, has a wildcard match.
