
TASK [etcd : Configure | Check if etcd cluster is healthy] is failing #5550

Closed
lauatic opened this issue Jan 16, 2020 · 33 comments
Labels
kind/bug Categorizes issue or PR as related to a bug. lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed.

Comments

lauatic commented Jan 16, 2020

I "git clone" the kubespray project on 1/16/20 @ 8 AM PST. I got past the "repoquery container-selinux" failure that was reported earlier. I am seeing this failure and wonder there is a quick fix that I can apply. My cluster has one master and two worker nodes. The etcd check on the worker nodes is fine but not the master.

ansible-playbook -i inventory/ks89/hosts.yaml --become --become-user=root cluster.yml

TASK [etcd : Configure | Check if etcd cluster is healthy] ******************************************************************************************************************************
Thursday 16 January 2020 07:51:25 -0800 (0:00:00.103) 0:11:01.407 ******
FAILED - RETRYING: Configure | Check if etcd cluster is healthy (4 retries left).
FAILED - RETRYING: Configure | Check if etcd cluster is healthy (3 retries left).
FAILED - RETRYING: Configure | Check if etcd cluster is healthy (2 retries left).
FAILED - RETRYING: Configure | Check if etcd cluster is healthy (1 retries left).
fatal: [node1 -> 10.25.26.100]: FAILED! => {"attempts": 4, "changed": false, "cmd": "/usr/local/bin/etcdctl --no-sync --endpoints=https://10.25.26.100:2379 cluster-health | grep -q 'cluster is healthy'", "delta": "0:00:02.022447", "end": "2020-01-16 15:51:47.503910", "msg": "non-zero return code", "rc": 1, "start": "2020-01-16 15:51:45.481463", "stderr": "Error: client: etcd cluster is unavailable or misconfigured; error #0: client: endpoint https://10.25.26.100:2379 exceeded header timeout\n\nerror #0: client: endpoint https://10.25.26.100:2379 exceeded header timeout", "stderr_lines": ["Error: client: etcd cluster is unavailable or misconfigured; error #0: client: endpoint https://10.25.26.100:2379 exceeded header timeout", "", "error #0: client: endpoint https://10.25.26.100:2379 exceeded header timeout"], "stdout": "", "stdout_lines": []}

NO MORE HOSTS LEFT **********************************************************************************************************************************************************************
to retry, use: --limit @/home/ngs/ks89/kubespray/cluster.retry

PLAY RECAP ******************************************************************************************************************************************************************************
localhost : ok=1 changed=0 unreachable=0 failed=0
node1 : ok=460 changed=56 unreachable=0 failed=1
node2 : ok=367 changed=47 unreachable=0 failed=0
node3 : ok=366 changed=47 unreachable=0 failed=0

My hosts.yaml file looks like this:

cat hosts.yaml

all:
  hosts:
    node1:
      ansible_host: 10.25.26.100
      ip: 10.25.26.100
      access_ip: 10.25.26.100
    node2:
      ansible_host: 10.25.26.101
      ip: 10.25.26.101
      access_ip: 10.25.26.101
    node3:
      ansible_host: 10.25.26.102
      ip: 10.25.26.102
      access_ip: 10.25.26.102
  children:
    kube-master:
      hosts:
        node1:
    kube-node:
      hosts:
        node2:
        node3:
    etcd:
      hosts:
        node1:
        node2:
        node3:
    k8s-cluster:
      children:
        kube-master:
        kube-node:
    calico-rr:
      hosts: {}

Environment:
My environment is CentOS 7.

cat /etc/os-release

NAME="CentOS Linux"
VERSION="7 (Core)"
ID="centos"
ID_LIKE="rhel fedora"
VERSION_ID="7"
PRETTY_NAME="CentOS Linux 7 (Core)"
ANSI_COLOR="0;31"
CPE_NAME="cpe:/o:centos:centos:7"
HOME_URL="https://www.centos.org/"
BUG_REPORT_URL="https://bugs.centos.org/"

CENTOS_MANTISBT_PROJECT="CentOS-7"
CENTOS_MANTISBT_PROJECT_VERSION="7"
REDHAT_SUPPORT_PRODUCT="centos"
REDHAT_SUPPORT_PRODUCT_VERSION="7"

ansible --version

ansible 2.7.12
$ git rev-parse --short HEAD
d640a57

@lauatic lauatic added the kind/bug Categorizes issue or PR as related to a bug. label Jan 16, 2020
@MartynasPuskunigis

Having the same issue now. I'm trying to add new worker nodes after updating hosts.yaml.

hk1313 commented Feb 4, 2020

All tasks pass except one:
TASK [etcd : Configure | Check if etcd cluster is healthy] ********************************************************************************************************************************************************
Tuesday 04 February 2020 16:59:28 +0530 (0:00:00.142) 0:56:26.113 ******
FAILED - RETRYING: Configure | Check if etcd cluster is healthy (4 retries left).
FAILED - RETRYING: Configure | Check if etcd cluster is healthy (3 retries left).
FAILED - RETRYING: Configure | Check if etcd cluster is healthy (2 retries left).
FAILED - RETRYING: Configure | Check if etcd cluster is healthy (1 retries left).
fatal: [node1 -> 10.20.10.10]: FAILED! => {"attempts": 4, "changed": false, "cmd": "/usr/local/bin/etcdctl --no-sync --endpoints=https://10.20.10.10:2379 cluster-health | grep -q 'cluster is healthy'", "delta": "0:00:02.015515", "end": "2020-02-04 11:30:09.904893", "msg": "non-zero return code", "rc": 1, "start": "2020-02-04 11:30:07.889378", "stderr": "Error: client: etcd cluster is unavailable or misconfigured; error #0: client: endpoint https://10.20.10.10:2379 exceeded header timeout\n\nerror #0: client: endpoint https://10.20.10.10:2379 exceeded header timeout", "stderr_lines": ["Error: client: etcd cluster is unavailable or misconfigured; error #0: client: endpoint https://10.20.10.10:2379 exceeded header timeout", "", "error #0: client: endpoint https://10.20.10.10:2379 exceeded header timeout"], "stdout": "", "stdout_lines": []}

NO MORE HOSTS LEFT ************************************************************************************************************************************************************************************************
to retry, use: --limit @/home/testuser/Downloads/kubespray/cluster.retry

PLAY RECAP ********************************************************************************************************************************************************************************************************
localhost : ok=1 changed=0 unreachable=0 failed=0
node1 : ok=479 changed=69 unreachable=0 failed=1
node2 : ok=435 changed=66 unreachable=0 failed=0
node3 : ok=366 changed=60 unreachable=0 failed=0

Tuesday 04 February 2020 17:00:09 +0530 (0:00:41.193) 0:57:07.307 ******
===============================================================================

pskindel commented Feb 13, 2020

Same here:

TASK [etcd : Configure | Check if etcd cluster is healthy] ****************************************************************************************************************************************
Thursday 13 February 2020  11:22:06 +0000 (0:00:00.099)       0:07:47.219 *****
FAILED - RETRYING: Configure | Check if etcd cluster is healthy (4 retries left).
FAILED - RETRYING: Configure | Check if etcd cluster is healthy (3 retries left).
FAILED - RETRYING: Configure | Check if etcd cluster is healthy (2 retries left).
FAILED - RETRYING: Configure | Check if etcd cluster is healthy (1 retries left).
fatal: [node1 -> 10.23.167.108]: FAILED! => {"attempts": 4, "changed": false, "cmd": "/usr/local/bin/etcdctl --no-sync --endpoints=https://10.23.167.108:2379 cluster-health | grep -q 'cluster is healthy'", "delta": "0:00:02.017285", "end": "2020-02-13 03:22:36.870594", "msg": "non-zero return code", "rc": 1, "start": "2020-02-13 03:22:34.853309", "stderr": "Error:  client: etcd cluster is unavailable or misconfigured; error #0: client: endpoint https://10.23.167.108:2379 exceeded header timeout\n\nerror #0: client: endpoint https://10.23.167.108:2379 exceeded header timeout", "stderr_lines": ["Error:  client: etcd cluster is unavailable or misconfigured; error #0: client: endpoint https://10.23.167.108:2379 exceeded header timeout", "", "error #0: client: endpoint https://10.23.167.108:2379 exceeded header timeout"], "stdout": "", "stdout_lines": []}

but running it manually works:

root@node1:~# env | grep -i etcd
ETCDCTL_CA_FILE=/etc/ssl/etcd/ssl/ca.pem
ETCDCTL_KEY_FILE=/etc/ssl/etcd/ssl/admin-node1-key.pem
ETCDCTL_CERT_FILE=/etc/ssl/etcd/ssl/admin-node1.pem
root@node1:~# env | grep -i no_proxy
no_proxy=localhost,127.0.0.1,10.23.167.108
root@node1:~# /usr/local/bin/etcdctl --no-sync --endpoints=https://10.23.167.108:2379 cluster-health
member 7bf8e20ae905a2e7 is healthy: got healthy result from https://10.23.167.108:2379
cluster is healthy

A problem with no_proxy? 10.23.167.108 is the address of one of the nodes, passed to ansible-playbook. I also added it to additional_no_proxy in kubespray/inventory/mycluster/group_vars/all/all.yml, but that didn't help.

lauatic commented Feb 13, 2020

I tried with "retries: 16" (changed in roles/etcd/tasks/configure.yml), thinking it might be a timing issue. I still see the same failure with retries set to 16.

TASK [etcd : Configure | Check if etcd cluster is healthy] *******************************************************************************************
Wednesday 12 February 2020  23:01:37 -0800 (0:00:00.110)       0:10:07.732 ****
FAILED - RETRYING: Configure | Check if etcd cluster is healthy (16 retries left).
FAILED - RETRYING: Configure | Check if etcd cluster is healthy (15 retries left).
FAILED - RETRYING: Configure | Check if etcd cluster is healthy (14 retries left).
FAILED - RETRYING: Configure | Check if etcd cluster is healthy (13 retries left).
FAILED - RETRYING: Configure | Check if etcd cluster is healthy (12 retries left).
FAILED - RETRYING: Configure | Check if etcd cluster is healthy (11 retries left).
FAILED - RETRYING: Configure | Check if etcd cluster is healthy (10 retries left).
FAILED - RETRYING: Configure | Check if etcd cluster is healthy (9 retries left).
FAILED - RETRYING: Configure | Check if etcd cluster is healthy (8 retries left).
FAILED - RETRYING: Configure | Check if etcd cluster is healthy (7 retries left).
FAILED - RETRYING: Configure | Check if etcd cluster is healthy (6 retries left).
FAILED - RETRYING: Configure | Check if etcd cluster is healthy (5 retries left).
FAILED - RETRYING: Configure | Check if etcd cluster is healthy (4 retries left).
FAILED - RETRYING: Configure | Check if etcd cluster is healthy (3 retries left).
FAILED - RETRYING: Configure | Check if etcd cluster is healthy (2 retries left).
FAILED - RETRYING: Configure | Check if etcd cluster is healthy (1 retries left).
fatal: [trydccf1 -> 10.100.15.100]: FAILED! => {"attempts": 16, "changed": false, "cmd": "/usr/local/bin/etcdctl --no-sync --endpoints=https://10.100.15.100:2379 cluster-health | grep -q 'cluster is healthy'", "delta": "0:00:02.020718", "end": "2020-02-13 07:04:22.020116", "msg": "non-zero return code", "rc": 1, "start": "2020-02-13 07:04:19.999398", "stderr": "Error:  client: etcd cluster is unavailable or misconfigured; error #0: client: endpoint https://10.100.15.100:2379 exceeded header timeout\n\nerror #0: client: endpoint https://10.100.15.100:2379 exceeded header timeout", "stderr_lines": ["Error:  client: etcd cluster is unavailable or misconfigured; error #0: client: endpoint https://10.100.15.100:2379 exceeded header timeout", "", "error #0: client: endpoint https://10.100.15.100:2379 exceeded header timeout"], "stdout": "", "stdout_lines": []}

lauatic commented Feb 13, 2020

I SSHed to the node and ran the etcdctl command, and got the same output. 10.100.15.100 is the master node's IP.

# /usr/local/bin/etcdctl --no-sync --endpoints=https://10.100.15.100:2379 cluster-health
cluster may be unhealthy: failed to list members
Error:  client: etcd cluster is unavailable or misconfigured; error #0: client: endpoint https://10.100.15.100:2379 exceeded header timeout

error #0: client: endpoint https://10.100.15.100:2379 exceeded header timeout

@saikaushik-itsmyworld

Any update on this issue? I ran into the same problem and am not sure what the resolution is.

lauatic commented Apr 14, 2020

Any update on this issue? I ran into the same problem and am not sure what the resolution is.

My environment uses a proxy to reach the Internet. Because the {http,https}_proxy environment variables were not set, the ansible tasks failed. Setting the proxy environment variables fixed the issue for us.
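
For anyone else in a proxied environment, a minimal sketch of the relevant Kubespray settings in inventory/ks89/group_vars/all/all.yml (the proxy URL is a placeholder, not our real one; the node IPs are the ones from my hosts.yaml at the top of this issue):

# proxy used to fetch binaries/images (placeholder URL)
http_proxy: "http://proxy.example.com:3128"
https_proxy: "http://proxy.example.com:3128"
# keep the etcd/node addresses out of the proxy path
additional_no_proxy: "10.25.26.100,10.25.26.101,10.25.26.102"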

@williamleferrand

We're experiencing the same issue and at least for us it looks permissions-related.

  • the ansible task fails

  • on the server:

    • sudo openssl s_client -showcerts -cert /etc/ssl/etcd/ssl/admin-<hostname>.pem -key /etc/ssl/etcd/ssl/admin-<hostname>.pem -connect <our ip>:2379 works
    • openssl s_client -showcerts -cert /etc/ssl/etcd/ssl/admin-<hostname>.pem -key /etc/ssl/etcd/ssl/admin-<hostname>.pem -connect <our ip>:2379 fails with unable to load client certificate private key file

We're using CentOS 7; the permissions on /etc/ssl/etcd/ssl/* are -rw-r-----. 1 kube root, and ansible runs with the --become --become-user=root --user=centos options.
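
A quick way to check whether this is the problem on a given node (generic commands, not taken from the playbook; the admin-<hostname> naming matches the paths shown earlier in this thread):

# who owns the etcd client certs/keys?
ls -l /etc/ssl/etcd/ssl/
# without sudo this should fail with "Permission denied" if the key is 0640 kube:root
openssl rsa -in /etc/ssl/etcd/ssl/admin-$(hostname)-key.pem -check -noout
# with sudo it should print "RSA key ok"
sudo openssl rsa -in /etc/ssl/etcd/ssl/admin-$(hostname)-key.pem -check -noout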

poussa commented May 26, 2020

In proxy environments, for some reason kubespray does not use the (autogenerated) no_proxy variable when it checks etcd health. If you add the IP address of the master/etcd node to /etc/environment on that node, the issue goes away.

# the last line is relevant here
cat /etc/environment
export https_proxy=http://192.168.177.2:911
export http_proxy=http://192.168.177.2:911
export no_proxy=192.168.177.108

EDIT: The above is not entirely true. Kubespray uses the autogenerated no_proxy, but only if your no_proxy environment variable is not set. So make sure to unset no_proxy before running the playbook.
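
In other words, something like this before the run (a sketch; adjust the inventory path to yours):

unset no_proxy NO_PROXY
ansible-playbook -i inventory/mycluster/hosts.yaml --become --become-user=root cluster.yml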

wayne-o commented May 30, 2020

Could this be because there is an even number of master nodes?

hk1313 commented Jul 17, 2020

All tasks pass except [etcd : Configure | Check if etcd cluster is healthy]; the full output is in my earlier comment from Feb 4.

For me, it was resolved by adding a task to disable the firewall/iptables; then everything works.

mvdpoel commented Aug 1, 2020

For me, it was resolved by adding a task to disable the firewall/iptables; then everything works.

same for me!

poussa commented Aug 11, 2020

I think this is the correct fix for this issue: #6346

Svendegroote91 commented Aug 13, 2020

For me, it was resolved by adding a task to disable the firewall/iptables; then everything works.

Can you elaborate on how I can add a task to disable firewall/iptables?

@mujibishola

For me, it was resolved by adding a task to disable the firewall/iptables; then everything works.

Can you elaborate on how I can add a task to disable firewall/iptables?

It doesn't feel right to stop the firewall, but after stopping it everything works OK.

Add a new task to kubespray/roles/etcd/tasks/configure.yml:

- name: Configure | Stop firewalld
  service: name=firewalld state=stopped
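
A less drastic alternative (just a sketch, not something tried in this thread) is to leave firewalld running and only open the etcd client and peer ports on the etcd nodes:

# allow etcd client (2379) and peer (2380) traffic through firewalld
firewall-cmd --permanent --add-port=2379-2380/tcp
firewall-cmd --reload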

@jaurestchoffo

Same problem with CentOS 7, but in my case I have the impression that the wrong IP is being used for the cluster health check.
Can someone tell me how to specify the right network interface? It uses eth0 instead of eth1.
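
From what I can tell, the way to point Kubespray at a specific interface is to set ip (and access_ip) per host in hosts.yaml to that interface's address; a sketch with made-up addresses, following the inventory format from the top of this issue:

node1:
  ansible_host: 192.0.2.10     # address Ansible connects over (e.g. eth0)
  ip: 10.0.1.10                # eth1 address that etcd/kubelet should bind to and advertise
  access_ip: 10.0.1.10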

rsleem commented Oct 17, 2020

I am facing the same issue; my error is below.
I think it may be related to the tasks not installing etcd on all three nodes as configured in inventroy/mycluster/hosts.yml.
In summary, how could node1's etcd reach node2's etcd when etcd has only been installed on node1?

I then ran kubespray again, but this time I only included node1 in the etcd group, so etcd is only available on node1.
I am still getting an error on the same task: TASK [etcd : Configure | Wait for etcd cluster to be healthy].

x.x.x.231 is node1
x.x.x.175 is node2

inventroy/mycluster/hosts.yml: (configured for three nodes)

  ...children:
    kube-master:
      hosts:
        node1:
        node2:
    kube-node:
      hosts:
        node1:
        node2:
        node3:
        node4:
        node5:
    etcd:
      hosts:
        node1:
        node2:
        node3:
    k8s-cluster:
      children:
        kube-master:
        kube-node:
TASK [etcd : Configure | Ensure etcd is running] *************************************************************************************
changed: [node1]
Saturday 17 October 2020  15:03:27 +0000 (0:00:00.899)       0:05:38.352 ******
Saturday 17 October 2020  15:03:28 +0000 (0:00:00.037)       0:05:38.390 ******
FAILED - RETRYING: Configure | Wait for etcd cluster to be healthy (4 retries left).
FAILED - RETRYING: Configure | Wait for etcd cluster to be healthy (3 retries left).
FAILED - RETRYING: Configure | Wait for etcd cluster to be healthy (2 retries left).
FAILED - RETRYING: Configure | Wait for etcd cluster to be healthy (1 retries left).

TASK [etcd : Configure | Wait for etcd cluster to be healthy] ************************************************************************
fatal: [node1]: FAILED! => {"attempts": 4, "changed": false, "cmd": "set -o pipefail && /usr/local/bin/etcdctl endpoint --cluster status && /usr/local/bin/etcdctl endpoint --cluster health 2>&1 | grep -v 'Error: unhealthy cluster' >/dev/null", "delta": "0:00:05.018648", "end": "2020-10-17 15:04:10.352453", "msg": "non-zero return code", "rc": 1, "start": "2020-10-17 15:04:05.333805", "stderr": "{\"level\":\"warn\",\"ts\":\"2020-10-17T15:04:10.350Z\",\"caller\":\"clientv3/retry_interceptor.go:62\",\"msg\":\"retrying of unary invoker failed\",\"target\":\"endpoint://client-0e77535e-de90-46b3-8ea2-628c21f39bde/x.x.x.231:2379\",\"attempt\":0,\"error\":\"rpc error: code = DeadlineExceeded desc = latest connection error: connection error: desc = \\\"transport: Error while dialing dial tcp x.x.x.231:2379: connect: connection refused\\\"\"}\nError: failed to fetch endpoints from etcd cluster member list: context deadline exceeded", "stderr_lines": ["{\"level\":\"warn\",\"ts\":\"2020-10-17T15:04:10.350Z\",\"caller\":\"clientv3/retry_interceptor.go:62\",\"msg\":\"retrying of unary invoker failed\",\"target\":\"endpoint://client-0e77535e-de90-46b3-8ea2-628c21f39bde/x.x.x.231:2379\",\"attempt\":0,\"error\":\"rpc error: code = DeadlineExceeded desc = latest connection error: connection error: desc = \\\"transport: Error while dialing dial tcp x.x.x.175:2379: connect: connection refused\\\"\"}", "Error: failed to fetch endpoints from etcd cluster member list: context deadline exceeded"], "stdout": "", "stdout_lines": []}

inventroy/mycluster/hosts.yml: (configures with one node)

  ...children:
    kube-master:
      hosts:
        node1:
        node2:
    kube-node:
      hosts:
        node1:
        node2:
        node3:
        node4:
        node5:
    etcd:
      hosts:
        node1:
    k8s-cluster:
      children:
        kube-master:
        kube-node:
fatal: [node1]: FAILED! => {"attempts": 4, "changed": false, "cmd": "set -o pipefail && /usr/local/bin/etcdctl endpoint --cluster status && /usr/local/bin/etcdctl endpoint --cluster health 2>&1 | grep -q -v 'Error: unhealthy cluster'", "delta": "0:00:05.018039", "end": "2020-10-17 19:14:09.393315", "msg": "non-zero return code", "rc": 1, "start": "2020-10-17 19:14:04.375276", "stderr": "{\"level\":\"warn\",\"ts\":\"2020-10-17T19:14:09.391Z\",\"caller\":\"clientv3/retry_interceptor.go:61\",\"msg\":\"retrying of unary invoker failed\",\"target\":\"endpoint://client-96ff840c-421a-47be-a96c-5b0246723180/x.x.x.231:2379\",\"attempt\":0,\"error\":\"rpc error: code = DeadlineExceeded desc = context deadline exceeded\"}\nError: failed to fetch endpoints from etcd cluster member list: context deadline exceeded", "stderr_lines": ["{\"level\":\"warn\",\"ts\":\"2020-10-17T19:14:09.391Z\",\"caller\":\"clientv3/retry_interceptor.go:61\",\"msg\":\"retrying of unary invoker failed\",\"target\":\"endpoint://client-96ff840c-421a-47be-a96c-5b0246723180/x.x.x.231:2379\",\"attempt\":0,\"error\":\"rpc error: code = DeadlineExceeded desc = context deadline exceeded\"}", "Error: failed to fetch endpoints from etcd cluster member list: context deadline exceeded"], "stdout": "", "stdout_lines": []}

I even tested the etcd health manually:

devuser@node1:/etc/kubernetes/ssl$ sudo ETCDCTL_API=3 /usr/local/bin/etcdctl  endpoint health  --endpoints=https://x.x.x.231:2379 --cacert=/etc/ssl/etcd/ssl/ca.pem --cert=/etc/ssl/etcd/ssl/admin-node1.pem --key=/etc/ssl/etcd/ssl/admin-node1-key.pem
{"level":"warn","ts":"2020-10-17T21:12:06.869Z","caller":"clientv3/retry_interceptor.go:61","msg":"retrying of unary invoker failed","target":"endpoint://client-b33f5xxx-2c80-xxxx-xxxx-3377d7axxx45/x.x.x.231:2379","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = context deadline exceeded"}
https://x.x.x.231:2379 is unhealthy: failed to commit proposal: context deadline exceeded
Error: unhealthy cluster

@elgamal2020

+1

manas86 commented Jan 17, 2021

Any update? I'm having the same issue

albheim commented Jan 20, 2021

I have the same issue. I tested some of the solutions suggested here and none of them worked for me.

I ended up taking a look at the docker container running the etcd service, and it seems to be stuck in a crash/restart loop.

ubuntu@master1:~$ sudo docker ps -a
CONTAINER ID        IMAGE                         COMMAND                 CREATED             STATUS                     PORTS               NAMES
656986bd5bfd        quay.io/coreos/etcd:v3.4.13   "/usr/local/bin/etcd"   13 seconds ago      Exited (1) 7 seconds ago                       etcd1

The first thing I wonder here is that no ports are published on the container; is that not needed?

Then I checked the logs to see what caused this, and this is what I get:

ubuntu@master1:~$ sudo docker logs etcd1
...
2021-01-20 09:33:05.408665 I | etcdmain: etcd Version: 3.4.13
2021-01-20 09:33:05.408674 I | etcdmain: Git SHA: ae9734ed2
2021-01-20 09:33:05.408680 I | etcdmain: Go Version: go1.12.17
2021-01-20 09:33:05.408687 I | etcdmain: Go OS/Arch: linux/amd64
2021-01-20 09:33:05.408694 I | etcdmain: setting maximum number of CPUs to 4, total number of available CPUs is 4
2021-01-20 09:33:05.411131 N | etcdmain: the server is already initialized as member before, starting as etcd member...
[WARNING] Deprecated '--logger=capnslog' flag is set; use '--logger=zap' flag instead
2021-01-20 09:33:05.411205 I | embed: peerTLS: cert = /etc/ssl/etcd/ssl/member-master1.pem, key = /etc/ssl/etcd/ssl/member-master1-key.pem, trusted-ca = /etc/ssl/etcd/ssl/ca.pem, client-cert-auth = true, crl-file = 
2021-01-20 09:33:05.412217 I | embed: name = etcd1
2021-01-20 09:33:05.412237 I | embed: data dir = /var/lib/etcd
2021-01-20 09:33:05.412246 I | embed: member dir = /var/lib/etcd/member
2021-01-20 09:33:05.412253 I | embed: heartbeat = 250ms
2021-01-20 09:33:05.412260 I | embed: election = 5000ms
2021-01-20 09:33:05.412274 I | embed: snapshot count = 10000
2021-01-20 09:33:05.412286 I | embed: advertise client URLs = https://<global ip of master1>:2379
2021-01-20 09:33:05.412335 W | pkg/fileutil: check file permission: directory "/var/lib/etcd" exist, but the permission is "drwxr-xr-x". The recommended permission is "-rwx------" to prevent possible unprivileged access to the data.
2021-01-20 09:33:05.417294 C | etcdmain: --initial-cluster has etcd1=https://<local ip of master1>:2380 but missing from --initial-advertise-peer-urls=https://<global ip of master1>:2380 ("https://<global ip of master1>:2380"(resolved from "https://<global ip of master1>:2380") != "https://<local ip of master1>:2380"(resolved from "https://<local ip of master1>:2380"))

To me it seems like the global and local IPs of the master node have been mixed up in some way, and etcd is not happy about that.
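
A sketch of the workaround this suggests (using the hosts.yaml format from the top of the issue, with the same placeholders as in the log above): make access_ip the local address, so the advertised peer URL matches --initial-cluster:

node1:
  ansible_host: <global ip of master1>
  ip: <local ip of master1>
  access_ip: <local ip of master1>   # instead of the global ip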

alexmang85 commented Feb 9, 2021

Good day! I had a similar error with a three-node etcd; there were no problems with a single etcd. My stack is Ubuntu 16.04 as the hypervisor, with the nodes running as KVM virtual machines behind a network bridge (three masters, one etcd instance on each master, three workers, and one ingress node). I understood the problem was network connectivity, but I could not see where: all nodes could ping each other, connections to etcd via telnet succeeded, and the firewall was disabled on all nodes. However, to give all cluster nodes Internet access, a masquerading rule was written on the hypervisor:

iptables -t nat -A POSTROUTING -s 10.208.0.0/24 -j MASQUERADE

(all chains are allowed and packet forwarding is enabled in the kernel)

It turned out that, by default, packets passing through the network bridge are handed to iptables for processing.

Solution: add to /etc/sysctl.conf:

net.bridge.bridge-nf-call-iptables=0
net.bridge.bridge-nf-call-arptables=0
net.bridge.bridge-nf-call-ip6tables=0

Then apply it:

sysctl -p
systemctl restart libvirtd

After that, the cluster deployed without errors in etcd. I hope this helps someone!
P.S. Article about bridge-nf-call:
https://wiki.libvirt.org/page/Net.bridge.bridge-nf-call_and_sysctl.conf

@MengS1024

+1
I ran into the same error. The etcd container is stuck in the "Created" state on the master node.

CONTAINER ID        IMAGE                         COMMAND                 CREATED                  STATUS              PORTS               NAMES
c730e2cd64a7        quay.io/coreos/etcd:v3.3.10   "/usr/local/bin/etcd"   Less than a second ago   Created                                 etcd1

@MengS1024


Here is my solution. I tried running docker run hello-world on the master node, but got the following error:

docker: Error response from daemon: OCI runtime create failed: unable to retrieve OCI runtime error (open /run/containerd/io.containerd.runtime.v1.linux/moby/ec5230f20c55068a7603dbfb553a5799ad3902698a64be2ddd7748d2fcfc65e9/log.json: no such file or directory): fork/exec /usr/bin/nvidia-container-runtime: no such file or directory: unknown.
ERRO[0000] error waiting for container: context canceled

So I installed nvidia-container-runtime on it, and the etcd issue was solved after that.
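
If I read that error right, Docker was configured to use the nvidia runtime by default while the binary was missing. A place to check for that (an assumption on my part, not verified in this thread) is /etc/docker/daemon.json:

cat /etc/docker/daemon.json
# a config like the following would make every container, including etcd1,
# depend on /usr/bin/nvidia-container-runtime:
# {
#   "default-runtime": "nvidia",
#   "runtimes": {
#     "nvidia": { "path": "/usr/bin/nvidia-container-runtime", "runtimeArgs": [] }
#   }
# }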

@alexmang85

(quoting the previous comment about the etcd container stuck in "Created" and the nvidia-container-runtime fix)

Wow, are you using nvidia-container-runtime? Interesting case)

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jun 16, 2021
@fejta-bot

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Jul 17, 2021
@whatnick

Still happens against master, with a similar setup to all the cases above. (Screenshot of the failing task attached in the original issue as evidence.)

whatnick commented Jul 23, 2021

Further investigation reveals that the container is running but stuck in a state where its peer is unreachable.
LOGS:

$sudo docker logs etcd2
...
2021-07-23 11:02:43.794915 W | rafthttp: health check for peer ff0bb6fbc61533bb could not connect: dial tcp 192.168.1.23:2380: i/o timeout
2021-07-23 11:02:48.795253 W | rafthttp: health check for peer ff0bb6fbc61533bb could not connect: dial tcp 192.168.1.23:2380: i/o timeout
2021-07-23 11:02:48.795305 W | rafthttp: health check for peer ff0bb6fbc61533bb could not connect: dial tcp 192.168.1.23:2380: i/o timeout

DOCKER Container status:

$ sudo docker ps -a
CONTAINER ID   IMAGE                         COMMAND                 CREATED       STATUS       PORTS     NAMES
d29d9c7a4f78   quay.io/coreos/etcd:v3.4.13   "/usr/local/bin/etcd"   2 hours ago   Up 2 hours             etcd2

ETCD Launch command:

$ps -ef | grep etcd
root      144722       1  0 19:29 ?        00:00:00 /bin/bash /usr/local/bin/etcd
root      144723  144722  0 19:29 ?        00:00:01 /usr/bin/docker run --restart=on-failure:5 --env-file=/etc/etcd.env --net=host -v /etc/ssl/certs:/etc/ssl/certs:ro -v /etc/ssl/etcd/ssl:/etc/ssl/etcd/ssl:ro -v /var/lib/etcd:/var/lib/etcd:rw --memory=512M --blkio-weight=1000 --name=etcd2 quay.io/coreos/etcd:v3.4.13 /usr/local/bin/etcd
root      144783  144762  3 19:29 ?        00:03:09 /usr/local/bin/etcd

TELNET to port 2379/2380:

$ telnet 192.168.1.22 2380
Trying 192.168.1.22...
Connected to 192.168.1.22.
Escape character is '^]'.

$ telnet 192.168.1.22 2379
Trying 192.168.1.22...
Connected to 192.168.1.22.
Escape character is '^]'.

RESOLUTION:

The underlying cause was a misconfigured UFW on node3; allowing traffic from the rest of the nodes in the cluster makes etcd healthy.
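
For reference, a sketch of the kind of UFW rule that covers etcd's client (2379) and peer (2380) ports; the subnet is taken from the addresses in the logs above and may need adjusting:

sudo ufw allow from 192.168.1.0/24 to any port 2379:2380 proto tcp
sudo ufw reload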

arno01 commented Jul 29, 2021

To me it seems like the global and local IPs of the master node have been mixed up in some way, and etcd is not happy about that.

Yep, it looks like a bug in a template somewhere, since with a single-node deployment (etcd_deployment_type: host) /etc/etcd.env gets this:

# cat /etc/etcd.env
...
ETCD_ADVERTISE_CLIENT_URLS=https://PUBLIC_IP:2379
ETCD_INITIAL_ADVERTISE_PEER_URLS=https://PUBLIC_IP:2380
ETCD_LISTEN_CLIENT_URLS=https://INTERNAL_IP:2379,https://127.0.0.1:2379
ETCD_LISTEN_PEER_URLS=https://INTERNAL_IP:2380
ETCD_INITIAL_CLUSTER=etcd1=https://INTERNAL_IP:2380
ETCDCTL_ENDPOINTS=https://127.0.0.1:2379

Hence we get the following error, where etcd is confused by the PUBLIC_IP it sees in the peer and client URLs:

2021-07-29 14:11:57.269323 C | etcdmain: --initial-cluster has etcd1=https://INTERNAL_IP:2380 but missing from --initial-advertise-peer-urls=https://PUBLIC_IP:2380 ("https://PUBLIC_IP:2380"(resolved from "https://PUBLIC_IP:2380") != "https://INTERNAL_IP:2380"(resolved from "https://INTERNAL_IP:2380"))

This is how a working etcd deployment should look with 3 etcd nodes (with internal IPs 10.0.0.{1..3} respectively):

root@node1:~# cat /etc/etcd.env | grep http
ETCD_ADVERTISE_CLIENT_URLS=https://10.0.0.1:2379
ETCD_INITIAL_ADVERTISE_PEER_URLS=https://10.0.0.1:2380
ETCD_LISTEN_CLIENT_URLS=https://10.0.0.1:2379,https://127.0.0.1:2379
ETCD_LISTEN_PEER_URLS=https://10.0.0.1:2380
ETCD_INITIAL_CLUSTER=etcd1=https://10.0.0.1:2380,etcd2=https://10.0.0.2:2380,etcd3=https://10.0.0.3:2380
ETCDCTL_ENDPOINTS=https://127.0.0.1:2379
root@node2:~# cat /etc/etcd.env | grep http
ETCD_ADVERTISE_CLIENT_URLS=https://10.0.0.2:2379
ETCD_INITIAL_ADVERTISE_PEER_URLS=https://10.0.0.2:2380
ETCD_LISTEN_CLIENT_URLS=https://10.0.0.2:2379,https://127.0.0.1:2379
ETCD_LISTEN_PEER_URLS=https://10.0.0.2:2380
ETCD_INITIAL_CLUSTER=etcd1=https://10.0.0.1:2380,etcd2=https://10.0.0.2:2380,etcd3=https://10.0.0.3:2380
ETCDCTL_ENDPOINTS=https://127.0.0.1:2379
root@node3:~# cat /etc/etcd.env | grep http
ETCD_ADVERTISE_CLIENT_URLS=https://10.0.0.3:2379
ETCD_INITIAL_ADVERTISE_PEER_URLS=https://10.0.0.3:2380
ETCD_LISTEN_CLIENT_URLS=https://10.0.0.3:2379,https://127.0.0.1:2379
ETCD_LISTEN_PEER_URLS=https://10.0.0.3:2380
ETCD_INITIAL_CLUSTER=etcd1=https://10.0.0.1:2380,etcd2=https://10.0.0.2:2380,etcd3=https://10.0.0.3:2380
ETCDCTL_ENDPOINTS=https://127.0.0.1:2379

@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue or PR with /reopen
  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close

@k8s-ci-robot

@k8s-triage-robot: Closing this issue.

(In response to the triage comment above.)

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

patvdleer commented Dec 21, 2021

I've run into something very similar.

The health check gives: Error while dialing dial tcp 192.168.112.6:2379: connect: connection refused

I have:

# 0|1 bastion nodes
number_of_bastions = 1

# masters
number_of_k8s_masters = 1
number_of_k8s_masters_no_etcd = 0
number_of_k8s_masters_no_floating_ip = 0
number_of_k8s_masters_no_floating_ip_no_etcd = 0

# nodes
number_of_k8s_nodes = 0
number_of_k8s_nodes_no_floating_ip = 3

On the master, sudo docker logs etcd1 shows:

 etcdmain: --initial-cluster has etcd1=https://192.168.112.6:2380 but missing from --initial-advertise-peer-urls=https://89.xx.xx.xxx:2380 ("https://89.xx.xx.xxx:2380"(resolved from "https://89.xx.xx.xxx:2380") != "https://192.168.112.6:2380"(resolved from "https://192.168.112.6:2380"))

Running contrib/terraform/openstack/hosts --list --pretty shows:

 "k8s-gripop-k8s-master-1": {
                "access_ip_v4": "89.xx.xx.xxx",
                "access_ip_v6": "",
                "access_ip": "89.xx.xx.xxx",
                "ip": "192.168.112.6",

In /roles/kubespray-defaults/defaults/main.yaml, replace the lines below to make it work:

etcd_access_address: "{{ access_ip | default(etcd_address) }}"
etcd_events_access_address: "{{ access_ip | default(etcd_address) }}"

with:

etcd_access_address: "{{ ip | default(etcd_address) }}"
etcd_events_access_address: "{{ ip | default(etcd_address) }}"

@Kirkirillka

I'm wondering when this proposal will be merged? It seems to work fine.
