cannot resolve domain/FQDN while joining a exist cluster #1507

Open
subchen opened this issue Dec 16, 2015 · 13 comments
Labels
theme/internal-cleanup: Used to identify tech debt, testing improvements, code refactoring, and non-impactful optimization
type/bug: Feature does not function as expected

Comments

@subchen

subchen commented Dec 16, 2015

I use the following CLI commands to start a Consul server and join an existing cluster.

$ ping node1
PING node1 (192.168.1.21): 56 data bytes
64 bytes from 192.168.1.21: icmp_seq=0 ttl=64 time=0.053 ms
64 bytes from 192.168.1.21: icmp_seq=0 ttl=64 time=0.053 ms
...

$ consul agent -server -node node2 -retry-join node1

Error logs (the resolved IP address is wrong):

2015/12/14 06:45:23 [INFO] agent: (LAN) joined: 0 Err: dial tcp 220.250.64.225:8301: i/o timeout
2015/12/14 06:45:23 [WARN] agent: Join failed: dial tcp 220.250.64.225:8301: i/o timeout, retrying in 30s

node2 fails to join the node1 cluster.
If I change node1 to 192.168.1.21, it works.

$ consul agent -server -node node2 -retry-join 192.168.1.21

My Consul version is 0.5.2.

Also, consul join <FQDN> does not work.

@slackpad
Contributor

Hi @subchen, the DNS resolution is done down in here: https://github.com/hashicorp/memberlist/blob/master/memberlist.go#L218-L234. Is it possible that the host has multiple IPs registered with DNS and Go is picking a different one to try?

@subchen
Author

subchen commented Dec 21, 2015

Hi @slackpad, I only added some records in /etc/hosts to resolve the domain/FQDN.

192.168.1.21    node1
192.168.1.22    node2

I don't know whether the DNS resolution library only queries real DNS and skips /etc/hosts.

@slackpad slackpad self-assigned this Jan 9, 2016
@slackpad
Contributor

slackpad commented Jan 9, 2016

Will have to dig into Go a little bit to see what it does.

@kaskavalci
Contributor

@slackpad We are experiencing the same issue. It seems tcpLookupIP goes to the DNS server directly and bypasses /etc/hosts. IMO this is wrong because you should respect resolv.conf in host lookups. This behavior makes Consul fail to connect to other agents in an Azure environment. (In our case, Azure FQDNs are resolved to a public IP address by the DNS server, but internal IPs are saved in /etc/hosts. When Consul resolves the FQDN from DNS, it gets a public IP that is firewalled and that Consul is not bound to. Hence, Consul fails to join.)

Is there a true benefit to performing tcpLookupIP instead of net.LookupIP? Can we simply drop that logic and use Go's net package?
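
For comparison, here is a minimal sketch of a Go-only lookup using net.LookupIP from the standard library. On Unix-like systems Go's resolver normally consults /etc/hosts before DNS, so an /etc/hosts entry for node1 would be expected to win over a public DNS record. The helper name resolveWithSystem is hypothetical and is not part of memberlist or Consul.

```go
package main

import (
	"fmt"
	"net"
)

// resolveWithSystem is a hypothetical helper (not memberlist code).
// It resolves host with Go's standard resolver, which on Unix-like
// systems normally checks /etc/hosts before querying DNS.
// IP literals are returned as-is without any lookup.
func resolveWithSystem(host string) ([]string, error) {
	ips, err := net.LookupIP(host)
	if err != nil {
		return nil, err
	}
	out := make([]string, 0, len(ips))
	for _, ip := range ips {
		out = append(out, ip.String())
	}
	return out, nil
}

func main() {
	// "node1" is the hostname used in this thread; substitute your own.
	addrs, err := resolveWithSystem("node1")
	if err != nil {
		fmt.Println("lookup failed:", err)
		return
	}
	fmt.Println(addrs)
}
```

Running this on the affected host and comparing the result with the address in the join error is a quick way to check whether the two resolution paths disagree.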

@slackpad
Contributor

slackpad commented Feb 3, 2017

@kaskavalci ok this makes sense now. We want to keep the behavior of using TCP to get the largest possible list of hosts, but you are right that it breaks /etc/hosts. I think the best thing here would be to use Go's lookup and then tcpLookupIP and then merge + dedup the lists.

@kaskavalci
Contributor

@slackpad hmm, wouldn't that include multiple IP addresses for the same host? Assume the following /etc/hosts file:

127.0.0.1 google.com

We expect only the loopback address when we use Go's lookup alone, but tcpLookupIP will also return Google's address, which will cause Join errors.

@slackpad
Contributor

slackpad commented Feb 3, 2017

That's true for that example, though you'd get both addresses so the join would still work. Maybe we just need a way to turn off this TCP behavior.

@kaskavalci
Contributor

Yes, Join will work, but with error messages because of this line: https://github.com/hashicorp/memberlist/blob/master/memberlist.go#L190. Maybe errors for the same host should not be printed as long as one working IP is found? Or we could just go back to the Go implementation.

@kaskavalci
Contributor

Hi @slackpad, are you OK with using only the Go implementation? I can send a PR for that as well.

@slackpad
Contributor

We added the TCP feature in response to folks who needed the full list of servers to join, so I don't think we want to take that away. I think if we changed the code in resolveAddr() to skip this clause (https://github.com/hashicorp/memberlist/blob/master/memberlist.go#L308-L310) and then added a dedup pass, it would work. It's ok if there are join errors as long as any of the joins worked, so the pathological Google example should still be ok.

@kaskavalci
Contributor

Sounds OK to me. Is there an ETA for this?

@AlexLov

AlexLov commented Apr 6, 2017

Hi,
Any news on a fix for this issue?

@kaskavalci
Contributor

I made the following change myself as an easy fix: https://github.com/kaskavalci/memberlist. Confirmed to work.

@slackpad slackpad added the type/bug Feature does not function as expected label May 5, 2017
@slackpad slackpad added the theme/internal-cleanup Used to identify tech debt, testing improvements, code refactoring, and non-impactful optimization label May 25, 2017
4 participants