This repository has been archived by the owner on Aug 25, 2021. It is now read-only.

Talking to agent with downward API fails when hosts have multiple IPs (ex. EKS w/ aws-cni) #43

Closed
popopanda opened this issue Oct 17, 2018 · 9 comments
Labels
area/sync (Related to catalog sync), bug (Something isn't working), theme/host-network (Questions or PRs about enabling host networking for Consul clients)

Comments

@popopanda

Hello,

I have an EKS cluster with 3 workers, and deployed Consul and RabbitMQ via Helm. Both services are running, and I am able to access the Consul UI.

Each EKS worker in AWS has a primary private IP as well as a few secondary IPs, which are used for Pods.

I am attempting to get syncCatalog working, but I am getting connection refused.

2018-10-17T23:52:23.572Z [WARN ] to-consul/sink: error registering service: node-name=ip-10-20-132-50.ec2.internal service-name=halting-penguin-rabbitmq err="Put http://10.20.62.49:8500/v1/catalog/register: dial tcp 10.20.62.49:8500: connect: connection refused"

10.20.62.49 is the primary IP of one of my EKS workers; however, the pod that runs the Consul server (port 8500) is actually running on a secondary IP, 10.20.61.120, on that same worker.

kubectl describe pod consul-server-2
Name:           consul-server-2
Namespace:      default
Node:           ip-10-20-62-49.ec2.internal/10.20.62.49
Start Time:     Wed, 17 Oct 2018 16:12:01 -0700
Labels:         app=consul
                chart=consul-0.1.0
                component=server
                controller-revision-hash=consul-server-66c8b8459c
                hasDNS=true
                release=consul
                statefulset.kubernetes.io/pod-name=consul-server-2
Annotations:    consul.hashicorp.com/connect-inject: false
Status:         Running
IP:             10.20.61.120

How can I register services with the Consul cluster via the pod IP address, which is one of my EKS worker's secondary IPs, instead of connecting to the primary IP?

Please let me know if I need to provide more information.

Thank you

@popopanda
Author

Also, this is the values.yml I used:

global:
  enabled: true

  # Domain to register the Consul DNS server to listen for.
  domain: consul
  image: "consul:1.3.0"
  imageK8S: "hashicorp/consul-k8s:0.2.0"
  datacenter: dc1

server:
  enabled: true
  image: null
  replicas: 3
  bootstrapExpect: 3 # Should be <= replicas count

  storage: 10Gi
  storageClass: null

  connect: true

  resources: {}
  updatePartition: 0
  disruptionBudget:
    enabled: true
    maxUnavailable: null
  extraConfig: |
    {}
  extraVolumes: []
    # - type: secret (or "configMap")
    #   name: my-secret
    #   load: false # if true, will add to `-config-dir` to load by Consul
client:
  enabled: "-"
  image: null
  join: null
  grpc: false
  resources: {}
  extraConfig: |
    {}
  extraVolumes: []
    # - type: secret (or "configMap")
    #   name: my-secret
    #   load: false # if true, will add to `-config-dir` to load by Consul

dns:
  enabled: "-"

ui:
  enabled: true
  service:
    enabled: true
    type: NodePort

syncCatalog:
  # True if you want to enable the catalog sync. "-" for default.
  enabled: true
  image: null
  toConsul: true
  toK8S: true

  k8sPrefix: null

connectInject:
  enabled: false
  image: null # image for consul-k8s that contains the injector
  default: false # true will inject by default, otherwise requires annotation
  imageConsul: null
  imageEnvoy: null

  namespaceSelector: null

  certs:
    secretName: null
    caBundle: ""
    certName: tls.crt
    keyName: tls.key

@jipperinbham

The issue here is that the consul-sync-catalog deployment is being configured with the hostIP of the pod when it should just be using the consul-server Service. I manually changed the deployment to use the following command:

        - /bin/sh
        - -ec
        - |
          consul-k8s sync-catalog \
            -http-addr=consul-server:8500 \
            -consul-domain=consul \
            -k8s-write-namespace=${NAMESPACE}

I'm not sure what the thinking is behind using the hostIP for the -http-addr flag.

@mitchellh
Contributor

@jipperinbham Consul tooling should use the local agent wherever possible. The servers aren't meant to field cluster-wide API requests on their own (though they can to a fairly high scale). The local agent performs aggressive API batching, pipelining, and TCP multiplexing along with local caching to make Consul a lot more performant. Additionally, in the case of the catalog sync, it provides a local agent to register a service against in some cases which provides proper node health checking.

Please ensure you have the Consul clients installed using the Helm chart. This will expose the agent HTTP API on port 8500 on the host IP.

@jipperinbham

Gotcha, that definitely makes sense. It looks like the Consul DaemonSet is currently missing the following in its spec:

dnsPolicy: ClusterFirstWithHostNet
hostNetwork: true

so it's not actually binding to the host ports specified, which explains why both my setup and @popopanda's are experiencing the issue.

@ptariche

ptariche commented Feb 6, 2019

This same issue occurs with Digital Ocean's Kubernetes offering. The only way to resolve it is to change the sync catalog deployment so that it points at the internal DNS name of the consul-server Service rather than piggybacking off the IP that is injected via hostIP.

https://github.com/hashicorp/consul-helm/blob/master/templates/sync-catalog-deployment.yaml#L36
https://github.com/hashicorp/consul-helm/blob/master/templates/sync-catalog-deployment.yaml#L49

            -http-addr=consul-server:8500 \

Perhaps this should be optional in the values.yml.

@mitchellh

For others looking, a current workaround is to override the sync-catalog-deployment.yaml file so that it uses the Consul service name in place of the host IP.

@lkysow
Member

lkysow commented Jun 7, 2019

Just for me to refer to later: the issue is that we're using the downward API to inject the HOST_IP environment variable so we can talk to the agent running on our node. In some deployments, e.g. EKS with the aws-cni plugin, which uses multiple ENIs, the HOST_IP is only one of several IPs, so we might not be able to talk to the agent on that specific HOST_IP.

I'm not sure why we're not using hostNetwork and will need to ask the team.
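
For readers unfamiliar with the pattern described above, here is a minimal sketch of how a pod can receive its node's IP through the downward API and pass it to the sync command. The container name and overall layout are illustrative assumptions, not copied from the chart; the image and the consul-k8s flags are the ones quoted earlier in this thread.

```yaml
# Sketch only: an illustrative fragment of a pod spec, not the chart's actual template.
containers:
  - name: consul-sync-catalog        # hypothetical container name
    image: "hashicorp/consul-k8s:0.2.0"
    env:
      # Downward API: Kubernetes fills in the IP of the node this pod is scheduled on.
      - name: HOST_IP
        valueFrom:
          fieldRef:
            fieldPath: status.hostIP
      # Downward API: the namespace the pod is running in.
      - name: NAMESPACE
        valueFrom:
          fieldRef:
            fieldPath: metadata.namespace
    command:
      - /bin/sh
      - -ec
      - |
        consul-k8s sync-catalog \
          -http-addr=${HOST_IP}:8500 \
          -consul-domain=consul \
          -k8s-write-namespace=${NAMESPACE}
```

status.hostIP reports a single address for the node, so on hosts with several IPs the agent has to be reachable on that particular address for this to work.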

@lkysow lkysow added the bug Something isn't working label Jun 7, 2019
@lkysow lkysow changed the title from "Consul on EKS, receiving connection refused" to "Talking to agent with downward API fails when hosts have multiple IPs (ex. EKS w/ aws-cni)" Jun 7, 2019
@lkysow lkysow added the area/sync Related to catalog sync label Sep 17, 2019
@lkysow
Member

lkysow commented Oct 11, 2019

> Gotcha, that definitely makes sense. It looks like the Consul DaemonSet is currently missing the following in its spec:
>
> dnsPolicy: ClusterFirstWithHostNet
> hostNetwork: true
>
> so it's not actually binding to the host ports specified, which explains why both my setup and @popopanda's are experiencing the issue.

@jipperinbham we are running with hostPort: 8500 so hostNetwork should not be required
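
To make the distinction concrete, here is a rough sketch of a DaemonSet fragment that publishes the agent's HTTP port with hostPort, with the hostNetwork alternative shown as comments. The container and port names are assumptions for illustration and are not taken from the chart's template.

```yaml
# Sketch only: an illustrative DaemonSet fragment, not the chart's actual template.
spec:
  template:
    spec:
      containers:
        - name: consul               # hypothetical container name
          image: "consul:1.3.0"
          ports:
            - name: http             # hypothetical port name
              containerPort: 8500
              # Asks Kubernetes to map port 8500 on the node to this container.
              hostPort: 8500
      # The alternative suggested earlier in the thread would instead put the
      # pod directly in the node's network namespace:
      # hostNetwork: true
      # dnsPolicy: ClusterFirstWithHostNet
```

Whether a hostPort mapping is actually reachable on a given node IP can depend on the cluster's network plugin, which may be relevant to the multi-IP hosts discussed in this issue.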

@lkysow
Member

lkysow commented Oct 11, 2019

> This same issue occurs with Digital Ocean's Kubernetes offering. The only way to resolve it is to change the sync catalog deployment so that it points at the internal DNS name of the consul-server Service rather than piggybacking off the IP that is injected via hostIP.
>
> https://github.com/hashicorp/consul-helm/blob/master/templates/sync-catalog-deployment.yaml#L36
> https://github.com/hashicorp/consul-helm/blob/master/templates/sync-catalog-deployment.yaml#L49
>
>             -http-addr=consul-server:8500 \
>
> Perhaps this should be optional in the values.yml.
>
> @mitchellh
>
> For others looking, a current workaround is to override the sync-catalog-deployment.yaml file so that it uses the Consul service name in place of the host IP.

@ptariche I just tested this chart with Digital Ocean and the sync deployment could reach the local consul agent using its hostPort. Maybe something has changed on DO's side?

@lkysow
Member

lkysow commented Oct 11, 2019

@popopanda from my testing, this should be working. Since this ticket is a year old, it's likely things have changed. It's totally our fault that this has sat here for so long, but I'm going to close it now.

If folks are still having issues with the sync catalog pod not being able to talk to the local Consul agent, please open a new issue.

@lkysow lkysow closed this as completed Oct 11, 2019
@ishustava ishustava added the theme/host-network Questions or PRs about enabling host networking for Consul clients label May 5, 2020

6 participants