
TASK [etcd : Configure | Check if etcd cluster is healthy] is failing #5550

Closed
lauatic opened this issue Jan 16, 2020 · 33 comments
Labels
kind/bug Categorizes issue or PR as related to a bug. lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed.

Comments

lauatic commented Jan 16, 2020

I "git clone" the kubespray project on 1/16/20 @ 8 AM PST. I got past the "repoquery container-selinux" failure that was reported earlier. I am seeing this failure and wonder there is a quick fix that I can apply. My cluster has one master and two worker nodes. The etcd check on the worker nodes is fine but not the master.

ansible-playbook -i inventory/ks89/hosts.yaml --become --become-user=root cluster.yml

TASK [etcd : Configure | Check if etcd cluster is healthy] ******************************************************************************************************************************
Thursday 16 January 2020 07:51:25 -0800 (0:00:00.103) 0:11:01.407 ******
FAILED - RETRYING: Configure | Check if etcd cluster is healthy (4 retries left).
FAILED - RETRYING: Configure | Check if etcd cluster is healthy (3 retries left).
FAILED - RETRYING: Configure | Check if etcd cluster is healthy (2 retries left).
FAILED - RETRYING: Configure | Check if etcd cluster is healthy (1 retries left).
fatal: [node1 -> 10.25.26.100]: FAILED! => {"attempts": 4, "changed": false, "cmd": "/usr/local/bin/etcdctl --no-sync --endpoints=https://10.25.26.100:2379 cluster-health | grep -q 'cluster is healthy'", "delta": "0:00:02.022447", "end": "2020-01-16 15:51:47.503910", "msg": "non-zero return code", "rc": 1, "start": "2020-01-16 15:51:45.481463", "stderr": "Error: client: etcd cluster is unavailable or misconfigured; error #0: client: endpoint https://10.25.26.100:2379 exceeded header timeout\n\nerror #0: client: endpoint https://10.25.26.100:2379 exceeded header timeout", "stderr_lines": ["Error: client: etcd cluster is unavailable or misconfigured; error #0: client: endpoint https://10.25.26.100:2379 exceeded header timeout", "", "error #0: client: endpoint https://10.25.26.100:2379 exceeded header timeout"], "stdout": "", "stdout_lines": []}

NO MORE HOSTS LEFT **********************************************************************************************************************************************************************
to retry, use: --limit @/home/ngs/ks89/kubespray/cluster.retry

PLAY RECAP ******************************************************************************************************************************************************************************
localhost : ok=1 changed=0 unreachable=0 failed=0
node1 : ok=460 changed=56 unreachable=0 failed=1
node2 : ok=367 changed=47 unreachable=0 failed=0
node3 : ok=366 changed=47 unreachable=0 failed=0

My hosts.yaml file looks like this:

cat hosts.yaml

all:
  hosts:
    node1:
      ansible_host: 10.25.26.100
      ip: 10.25.26.100
      access_ip: 10.25.26.100
    node2:
      ansible_host: 10.25.26.101
      ip: 10.25.26.101
      access_ip: 10.25.26.101
    node3:
      ansible_host: 10.25.26.102
      ip: 10.25.26.102
      access_ip: 10.25.26.102
  children:
    kube-master:
      hosts:
        node1:
    kube-node:
      hosts:
        node2:
        node3:
    etcd:
      hosts:
        node1:
        node2:
        node3:
    k8s-cluster:
      children:
        kube-master:
        kube-node:
    calico-rr:
      hosts: {}

Environment:
My environment is CentOS 7.

cat /etc/os-release

NAME="CentOS Linux"
VERSION="7 (Core)"
ID="centos"
ID_LIKE="rhel fedora"
VERSION_ID="7"
PRETTY_NAME="CentOS Linux 7 (Core)"
ANSI_COLOR="0;31"
CPE_NAME="cpe:/o:centos:centos:7"
HOME_URL="https://www.centos.org/"
BUG_REPORT_URL="https://bugs.centos.org/"

CENTOS_MANTISBT_PROJECT="CentOS-7"
CENTOS_MANTISBT_PROJECT_VERSION="7"
REDHAT_SUPPORT_PRODUCT="centos"
REDHAT_SUPPORT_PRODUCT_VERSION="7"

ansible --version

ansible 2.7.12
$ git rev-parse --short HEAD
d640a57

@lauatic lauatic added the kind/bug Categorizes issue or PR as related to a bug. label Jan 16, 2020
@MartynasPuskunigis

Having the same issue now. I'm trying to add new worker nodes after updating hosts.yaml.

hk1313 commented Feb 4, 2020

All tasks pass except one:
TASK [etcd : Configure | Check if etcd cluster is healthy] ********************************************************************************************************************************************************
Tuesday 04 February 2020 16:59:28 +0530 (0:00:00.142) 0:56:26.113 ******
FAILED - RETRYING: Configure | Check if etcd cluster is healthy (4 retries left).
FAILED - RETRYING: Configure | Check if etcd cluster is healthy (3 retries left).
FAILED - RETRYING: Configure | Check if etcd cluster is healthy (2 retries left).
FAILED - RETRYING: Configure | Check if etcd cluster is healthy (1 retries left).
fatal: [node1 -> 10.20.10.10]: FAILED! => {"attempts": 4, "changed": false, "cmd": "/usr/local/bin/etcdctl --no-sync --endpoints=https://10.20.10.10:2379 cluster-health | grep -q 'cluster is healthy'", "delta": "0:00:02.015515", "end": "2020-02-04 11:30:09.904893", "msg": "non-zero return code", "rc": 1, "start": "2020-02-04 11:30:07.889378", "stderr": "Error: client: etcd cluster is unavailable or misconfigured; error #0: client: endpoint https://10.20.10.10:2379 exceeded header timeout\n\nerror #0: client: endpoint https://10.20.10.10:2379 exceeded header timeout", "stderr_lines": ["Error: client: etcd cluster is unavailable or misconfigured; error #0: client: endpoint https://10.20.10.10:2379 exceeded header timeout", "", "error #0: client: endpoint https://10.20.10.10:2379 exceeded header timeout"], "stdout": "", "stdout_lines": []}

NO MORE HOSTS LEFT ************************************************************************************************************************************************************************************************
to retry, use: --limit @/home/testuser/Downloads/kubespray/cluster.retry

PLAY RECAP ********************************************************************************************************************************************************************************************************
localhost : ok=1 changed=0 unreachable=0 failed=0
node1 : ok=479 changed=69 unreachable=0 failed=1
node2 : ok=435 changed=66 unreachable=0 failed=0
node3 : ok=366 changed=60 unreachable=0 failed=0

Tuesday 04 February 2020 17:00:09 +0530 (0:00:41.193) 0:57:07.307 ******
===============================================================================

pskindel commented Feb 13, 2020

Same here:

TASK [etcd : Configure | Check if etcd cluster is healthy] ****************************************************************************************************************************************
Thursday 13 February 2020  11:22:06 +0000 (0:00:00.099)       0:07:47.219 *****
FAILED - RETRYING: Configure | Check if etcd cluster is healthy (4 retries left).
FAILED - RETRYING: Configure | Check if etcd cluster is healthy (3 retries left).
FAILED - RETRYING: Configure | Check if etcd cluster is healthy (2 retries left).
FAILED - RETRYING: Configure | Check if etcd cluster is healthy (1 retries left).
fatal: [node1 -> 10.23.167.108]: FAILED! => {"attempts": 4, "changed": false, "cmd": "/usr/local/bin/etcdctl --no-sync --endpoints=https://10.23.167.108:2379 cluster-health | grep -q 'cluster is healthy'", "delta": "0:00:02.017285", "end": "2020-02-13 03:22:36.870594", "msg": "non-zero return code", "rc": 1, "start": "2020-02-13 03:22:34.853309", "stderr": "Error:  client: etcd cluster is unavailable or misconfigured; error #0: client: endpoint https://10.23.167.108:2379 exceeded header timeout\n\nerror #0: client: endpoint https://10.23.167.108:2379 exceeded header timeout", "stderr_lines": ["Error:  client: etcd cluster is unavailable or misconfigured; error #0: client: endpoint https://10.23.167.108:2379 exceeded header timeout", "", "error #0: client: endpoint https://10.23.167.108:2379 exceeded header timeout"], "stdout": "", "stdout_lines": []}

but running it manually works:

root@node1:~# env | grep -i etcd
ETCDCTL_CA_FILE=/etc/ssl/etcd/ssl/ca.pem
ETCDCTL_KEY_FILE=/etc/ssl/etcd/ssl/admin-node1-key.pem
ETCDCTL_CERT_FILE=/etc/ssl/etcd/ssl/admin-node1.pem
root@node1:~# env | grep -i no_proxy
no_proxy=localhost,127.0.0.1,10.23.167.108
root@node1:~# /usr/local/bin/etcdctl --no-sync --endpoints=https://10.23.167.108:2379 cluster-health
member 7bf8e20ae905a2e7 is healthy: got healthy result from https://10.23.167.108:2379
cluster is healthy

A problem with no_proxy? 10.23.167.108 is the address of one of the nodes, passed to ansible-playbook. I also added it to additional_no_proxy in kubespray/inventory/mycluster/group_vars/all/all.yml, but that didn't help.

lauatic commented Feb 13, 2020

I tried with "retries: 16" (changed in roles/etcd/tasks/configure.yml), thinking it might be a timing issue. I still see the same failure with retries set to 16.

TASK [etcd : Configure | Check if etcd cluster is healthy] *******************************************************************************************
Wednesday 12 February 2020  23:01:37 -0800 (0:00:00.110)       0:10:07.732 ****
FAILED - RETRYING: Configure | Check if etcd cluster is healthy (16 retries left).
FAILED - RETRYING: Configure | Check if etcd cluster is healthy (15 retries left).
FAILED - RETRYING: Configure | Check if etcd cluster is healthy (14 retries left).
FAILED - RETRYING: Configure | Check if etcd cluster is healthy (13 retries left).
FAILED - RETRYING: Configure | Check if etcd cluster is healthy (12 retries left).
FAILED - RETRYING: Configure | Check if etcd cluster is healthy (11 retries left).
FAILED - RETRYING: Configure | Check if etcd cluster is healthy (10 retries left).
FAILED - RETRYING: Configure | Check if etcd cluster is healthy (9 retries left).
FAILED - RETRYING: Configure | Check if etcd cluster is healthy (8 retries left).
FAILED - RETRYING: Configure | Check if etcd cluster is healthy (7 retries left).
FAILED - RETRYING: Configure | Check if etcd cluster is healthy (6 retries left).
FAILED - RETRYING: Configure | Check if etcd cluster is healthy (5 retries left).
FAILED - RETRYING: Configure | Check if etcd cluster is healthy (4 retries left).
FAILED - RETRYING: Configure | Check if etcd cluster is healthy (3 retries left).
FAILED - RETRYING: Configure | Check if etcd cluster is healthy (2 retries left).
FAILED - RETRYING: Configure | Check if etcd cluster is healthy (1 retries left).
fatal: [trydccf1 -> 10.100.15.100]: FAILED! => {"attempts": 16, "changed": false, "cmd": "/usr/local/bin/etcdctl --no-sync --endpoints=https://10.100.15.100:2379 cluster-health | grep -q 'cluster is healthy'", "delta": "0:00:02.020718", "end": "2020-02-13 07:04:22.020116", "msg": "non-zero return code", "rc": 1, "start": "2020-02-13 07:04:19.999398", "stderr": "Error:  client: etcd cluster is unavailable or misconfigured; error #0: client: endpoint https://10.100.15.100:2379 exceeded header timeout\n\nerror #0: client: endpoint https://10.100.15.100:2379 exceeded header timeout", "stderr_lines": ["Error:  client: etcd cluster is unavailable or misconfigured; error #0: client: endpoint https://10.100.15.100:2379 exceeded header timeout", "", "error #0: client: endpoint https://10.100.15.100:2379 exceeded header timeout"], "stdout": "", "stdout_lines": []}

lauatic commented Feb 13, 2020

I SSHed to the node and ran the etcdctl command, and got the same output. 10.100.15.100 is the master node's IP.

# /usr/local/bin/etcdctl --no-sync --endpoints=https://10.100.15.100:2379 cluster-health
cluster may be unhealthy: failed to list members
Error:  client: etcd cluster is unavailable or misconfigured; error #0: client: endpoint https://10.100.15.100:2379 exceeded header timeout

error #0: client: endpoint https://10.100.15.100:2379 exceeded header timeout

@saikaushik-itsmyworld

Any update on this issue? I ran into the same problem and am not sure what the resolution is.

lauatic commented Apr 14, 2020

Any update on this issue? I ran into the same problem and am not sure what the resolution is.

My environment uses a proxy to reach the Internet. Because the {http,https}_proxy environment variables were not set, the ansible tasks failed. Setting the proxy environment variables fixed the issue for us.
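
For anyone else in a proxied environment, a minimal sketch of the relevant Kubespray settings in inventory/ks89/group_vars/all/all.yml (the proxy URL is a placeholder, not our real one; the node IPs are the ones from my hosts.yaml at the top of this issue):

# proxy used to fetch binaries/images (placeholder URL)
http_proxy: "http://proxy.example.com:3128"
https_proxy: "http://proxy.example.com:3128"
# keep the etcd/node addresses out of the proxy path
additional_no_proxy: "10.25.26.100,10.25.26.101,10.25.26.102"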

@williamleferrand

We're experiencing the same issue and at least for us it looks permissions-related.

  • the ansible task fails

  • on the server:

    • sudo openssl s_client -showcerts -cert /etc/ssl/etcd/ssl/admin-<hostname>.pem -key /etc/ssl/etcd/ssl/admin-<hostname>.pem -connect <our ip>:2379 works
    • openssl s_client -showcerts -cert /etc/ssl/etcd/ssl/admin-<hostname>.pem -key /etc/ssl/etcd/ssl/admin-<hostname>.pem -connect <our ip>:2379 fails with unable to load client certificate private key file

We're using CentOS 7; the permissions on /etc/ssl/etcd/ssl/* are -rw-r-----. 1 kube root, and ansible runs with the --become --become-user=root --user=centos options.
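
A quick way to check whether this is the problem on a given node (generic commands, not taken from the playbook; the admin-<hostname> naming matches the paths shown earlier in this thread):

# who owns the etcd client certs/keys?
ls -l /etc/ssl/etcd/ssl/
# without sudo this should fail with "Permission denied" if the key is 0640 kube:root
openssl rsa -in /etc/ssl/etcd/ssl/admin-$(hostname)-key.pem -check -noout
# with sudo it should print "RSA key ok"
sudo openssl rsa -in /etc/ssl/etcd/ssl/admin-$(hostname)-key.pem -check -noout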

poussa commented May 26, 2020

In proxy environments, for some reason kubespray does not use the (autogenerated) no_proxy variable when it checks etcd health. If you add the IP address of the master/etcd node to /etc/environment on that node, the issue goes away.

# the last line is relevant here
cat /etc/environment
export https_proxy=http://192.168.177.2:911
export http_proxy=http://192.168.177.2:911
export no_proxy=192.168.177.108

EDIT: The above is not entirely true. Kubespray uses the autogenerated no_proxy, but only if your no_proxy environment variable is not set. So make sure to unset no_proxy before running the playbook.
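
In other words, something like this before the run (a sketch; adjust the inventory path to yours):

unset no_proxy NO_PROXY
ansible-playbook -i inventory/mycluster/hosts.yaml --become --become-user=root cluster.yml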

wayne-o commented May 30, 2020

Could this be because there is an even number of master nodes?

hk1313 commented Jul 17, 2020

All tasks pass except [etcd : Configure | Check if etcd cluster is healthy]; the full output is in my earlier comment from Feb 4.

For me, it was resolved by adding a task to disable the firewall/iptables; then everything works.

mvdpoel commented Aug 1, 2020

For me, it was resolved by adding a task to disable the firewall/iptables; then everything works.

same for me!

poussa commented Aug 11, 2020

I think this is the correct fix for this issue: #6346

Svendegroote91 commented Aug 13, 2020

For me, it was resolved by adding a task to disable the firewall/iptables; then everything works.

Can you elaborate on how I can add a task to disable firewall/iptables?

@mujibishola

For me, it was resolved by adding a task to disable the firewall/iptables; then everything works.

Can you elaborate on how I can add a task to disable firewall/iptables?

It doesn't feel right to stop the firewall, but after stopping it everything works OK.

Add a new task to kubespray/roles/etcd/tasks/configure.yml:

- name: Configure | Stop firewalld
  service: name=firewalld state=stopped
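
A less drastic alternative (just a sketch, not something tried in this thread) is to leave firewalld running and only open the etcd client and peer ports on the etcd nodes:

# allow etcd client (2379) and peer (2380) traffic through firewalld
firewall-cmd --permanent --add-port=2379-2380/tcp
firewall-cmd --reload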

@jaurestchoffo

Same problem with CentOS 7, but in my case I have the impression that the wrong IP is being used for the cluster health check.
Can someone tell me how to specify the right network interface? It uses eth0 instead of eth1.
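
From what I can tell, the way to point Kubespray at a specific interface is to set ip (and access_ip) per host in hosts.yaml to that interface's address; a sketch with made-up addresses, following the inventory format from the top of this issue:

node1:
  ansible_host: 192.0.2.10     # address Ansible connects over (e.g. eth0)
  ip: 10.0.1.10                # eth1 address that etcd/kubelet should bind to and advertise
  access_ip: 10.0.1.10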

rsleem commented Oct 17, 2020

I am facing the same issue; my error is below.
I think it may be related to the tasks not installing etcd on all three nodes as configured in inventroy/mycluster/hosts.yml.
In summary, how could node1's etcd reach node2's etcd when etcd has only been installed on node1?

I then ran kubespray again, but this time I only included node1 in the etcd group, so etcd is only available on node1.
I am still getting an error on the same task: TASK [etcd : Configure | Wait for etcd cluster to be healthy].

x.x.x.231 is node1
x.x.x.175 is node2

inventroy/mycluster/hosts.yml: (configured for three nodes)

  ...children:
    kube-master:
      hosts:
        node1:
        node2:
    kube-node:
      hosts:
        node1:
        node2:
        node3:
        node4:
        node5:
    etcd:
      hosts:
        node1:
        node2:
        node3:
    k8s-cluster:
      children:
        kube-master:
        kube-node:
TASK [etcd : Configure | Ensure etcd is running] *************************************************************************************
changed: [node1]
Saturday 17 October 2020  15:03:27 +0000 (0:00:00.899)       0:05:38.352 ******
Saturday 17 October 2020  15:03:28 +0000 (0:00:00.037)       0:05:38.390 ******
FAILED - RETRYING: Configure | Wait for etcd cluster to be healthy (4 retries left).
FAILED - RETRYING: Configure | Wait for etcd cluster to be healthy (3 retries left).
FAILED - RETRYING: Configure | Wait for etcd cluster to be healthy (2 retries left).
FAILED - RETRYING: Configure | Wait for etcd cluster to be healthy (1 retries left).

TASK [etcd : Configure | Wait for etcd cluster to be healthy] ************************************************************************
fatal: [node1]: FAILED! => {"attempts": 4, "changed": false, "cmd": "set -o pipefail && /usr/local/bin/etcdctl endpoint --cluster status && /usr/local/bin/etcdctl endpoint --cluster health 2>&1 | grep -v 'Error: unhealthy cluster' >/dev/null", "delta": "0:00:05.018648", "end": "2020-10-17 15:04:10.352453", "msg": "non-zero return code", "rc": 1, "start": "2020-10-17 15:04:05.333805", "stderr": "{\"level\":\"warn\",\"ts\":\"2020-10-17T15:04:10.350Z\",\"caller\":\"clientv3/retry_interceptor.go:62\",\"msg\":\"retrying of unary invoker failed\",\"target\":\"endpoint://client-0e77535e-de90-46b3-8ea2-628c21f39bde/x.x.x.231:2379\",\"attempt\":0,\"error\":\"rpc error: code = DeadlineExceeded desc = latest connection error: connection error: desc = \\\"transport: Error while dialing dial tcp x.x.x.231:2379: connect: connection refused\\\"\"}\nError: failed to fetch endpoints from etcd cluster member list: context deadline exceeded", "stderr_lines": ["{\"level\":\"warn\",\"ts\":\"2020-10-17T15:04:10.350Z\",\"caller\":\"clientv3/retry_interceptor.go:62\",\"msg\":\"retrying of unary invoker failed\",\"target\":\"endpoint://client-0e77535e-de90-46b3-8ea2-628c21f39bde/x.x.x.231:2379\",\"attempt\":0,\"error\":\"rpc error: code = DeadlineExceeded desc = latest connection error: connection error: desc = \\\"transport: Error while dialing dial tcp x.x.x.175:2379: connect: connection refused\\\"\"}", "Error: failed to fetch endpoints from etcd cluster member list: context deadline exceeded"], "stdout": "", "stdout_lines": []}

inventroy/mycluster/hosts.yml: (configures with one node)

  ...children:
    kube-master:
      hosts:
        node1:
        node2:
    kube-node:
      hosts:
        node1:
        node2:
        node3:
        node4:
        node5:
    etcd:
      hosts:
        node1:
    k8s-cluster:
      children:
        kube-master:
        kube-node:
fatal: [node1]: FAILED! => {"attempts": 4, "changed": false, "cmd": "set -o pipefail && /usr/local/bin/etcdctl endpoint --cluster status && /usr/local/bin/etcdctl endpoint --cluster health 2>&1 | grep -q -v 'Error: unhealthy cluster'", "delta": "0:00:05.018039", "end": "2020-10-17 19:14:09.393315", "msg": "non-zero return code", "rc": 1, "start": "2020-10-17 19:14:04.375276", "stderr": "{\"level\":\"warn\",\"ts\":\"2020-10-17T19:14:09.391Z\",\"caller\":\"clientv3/retry_interceptor.go:61\",\"msg\":\"retrying of unary invoker failed\",\"target\":\"endpoint://client-96ff840c-421a-47be-a96c-5b0246723180/x.x.x.231:2379\",\"attempt\":0,\"error\":\"rpc error: code = DeadlineExceeded desc = context deadline exceeded\"}\nError: failed to fetch endpoints from etcd cluster member list: context deadline exceeded", "stderr_lines": ["{\"level\":\"warn\",\"ts\":\"2020-10-17T19:14:09.391Z\",\"caller\":\"clientv3/retry_interceptor.go:61\",\"msg\":\"retrying of unary invoker failed\",\"target\":\"endpoint://client-96ff840c-421a-47be-a96c-5b0246723180/x.x.x.231:2379\",\"attempt\":0,\"error\":\"rpc error: code = DeadlineExceeded desc = context deadline exceeded\"}", "Error: failed to fetch endpoints from etcd cluster member list: context deadline exceeded"], "stdout": "", "stdout_lines": []}

I even tested the etcd health manually:

devuser@node1:/etc/kubernetes/ssl$ sudo ETCDCTL_API=3 /usr/local/bin/etcdctl  endpoint health  --endpoints=https://x.x.x.231:2379 --cacert=/etc/ssl/etcd/ssl/ca.pem --cert=/etc/ssl/etcd/ssl/admin-node1.pem --key=/etc/ssl/etcd/ssl/admin-node1-key.pem
{"level":"warn","ts":"2020-10-17T21:12:06.869Z","caller":"clientv3/retry_interceptor.go:61","msg":"retrying of unary invoker failed","target":"endpoint://client-b33f5xxx-2c80-xxxx-xxxx-3377d7axxx45/x.x.x.231:2379","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = context deadline exceeded"}
https://x.x.x.231:2379 is unhealthy: failed to commit proposal: context deadline exceeded
Error: unhealthy cluster

@elgamal2020

+1

manas86 commented Jan 17, 2021

Any update? I'm having the same issue

albheim commented Jan 20, 2021

I have the same issue. I tested some of the solutions suggested here and none of them worked for me.

I ended up taking a look at the docker container running the etcd service, and it seems to be stuck in a crash/restart loop.

ubuntu@master1:~$ sudo docker ps -a
CONTAINER ID        IMAGE                         COMMAND                 CREATED             STATUS                     PORTS               NAMES
656986bd5bfd        quay.io/coreos/etcd:v3.4.13   "/usr/local/bin/etcd"   13 seconds ago      Exited (1) 7 seconds ago                       etcd1

The first thing I wonder here is that no ports are published on the container; is that not needed?

Then I checked the logs to see what caused this, and this is what I get:

ubuntu@master1:~$ sudo docker logs etcd1
...
2021-01-20 09:33:05.408665 I | etcdmain: etcd Version: 3.4.13
2021-01-20 09:33:05.408674 I | etcdmain: Git SHA: ae9734ed2
2021-01-20 09:33:05.408680 I | etcdmain: Go Version: go1.12.17
2021-01-20 09:33:05.408687 I | etcdmain: Go OS/Arch: linux/amd64
2021-01-20 09:33:05.408694 I | etcdmain: setting maximum number of CPUs to 4, total number of available CPUs is 4
2021-01-20 09:33:05.411131 N | etcdmain: the server is already initialized as member before, starting as etcd member...
[WARNING] Deprecated '--logger=capnslog' flag is set; use '--logger=zap' flag instead
2021-01-20 09:33:05.411205 I | embed: peerTLS: cert = /etc/ssl/etcd/ssl/member-master1.pem, key = /etc/ssl/etcd/ssl/member-master1-key.pem, trusted-ca = /etc/ssl/etcd/ssl/ca.pem, client-cert-auth = true, crl-file = 
2021-01-20 09:33:05.412217 I | embed: name = etcd1
2021-01-20 09:33:05.412237 I | embed: data dir = /var/lib/etcd
2021-01-20 09:33:05.412246 I | embed: member dir = /var/lib/etcd/member
2021-01-20 09:33:05.412253 I | embed: heartbeat = 250ms
2021-01-20 09:33:05.412260 I | embed: election = 5000ms
2021-01-20 09:33:05.412274 I | embed: snapshot count = 10000
2021-01-20 09:33:05.412286 I | embed: advertise client URLs = https://<global ip of master1>:2379
2021-01-20 09:33:05.412335 W | pkg/fileutil: check file permission: directory "/var/lib/etcd" exist, but the permission is "drwxr-xr-x". The recommended permission is "-rwx------" to prevent possible unprivileged access to the data.
2021-01-20 09:33:05.417294 C | etcdmain: --initial-cluster has etcd1=https://<local ip of master1>:2380 but missing from --initial-advertise-peer-urls=https://<global ip of master1>:2380 ("https://<global ip of master1>:2380"(resolved from "https://<global ip of master1>:2380") != "https://<local ip of master1>:2380"(resolved from "https://<local ip of master1>:2380"))

To me it seems like the global and local IPs of the master node have been mixed up in some way, and etcd is not happy about that.
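
A sketch of the workaround this suggests (using the hosts.yaml format from the top of the issue, with the same placeholders as in the log above): make access_ip the local address, so the advertised peer URL matches --initial-cluster:

node1:
  ansible_host: <global ip of master1>
  ip: <local ip of master1>
  access_ip: <local ip of master1>   # instead of the global ip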

alexmang85 commented Feb 9, 2021

Good day! I had a similar error with a three-node etcd; there were no problems with a single etcd. My stack is Ubuntu 16.04 as the hypervisor, with the nodes running as KVM virtual machines behind a network bridge (three masters, one etcd instance on each master, three workers, and one ingress node). I understood the problem was network connectivity, but I could not see where: all nodes could ping each other, connections to etcd via telnet succeeded, and the firewall was disabled on all nodes. However, to give all cluster nodes Internet access, a masquerading rule was written on the hypervisor:

iptables -t nat -A POSTROUTING -s 10.208.0.0/24 -j MASQUERADE

(all chains are allowed and packet forwarding is enabled in the kernel)

It turned out that, by default, packets passing through the network bridge are handed to iptables for processing.

Solution: add to /etc/sysctl.conf:

net.bridge.bridge-nf-call-iptables=0
net.bridge.bridge-nf-call-arptables=0
net.bridge.bridge-nf-call-ip6tables=0

Then apply it:

sysctl -p
systemctl restart libvirtd

After that, the cluster deployed without errors in etcd. I hope this helps someone!
P.S. Article about bridge-nf-call:
https://wiki.libvirt.org/page/Net.bridge.bridge-nf-call_and_sysctl.conf

@MengS1024

+1
I ran into the same error. The etcd container is stuck in the "Created" state on the master node.

CONTAINER ID        IMAGE                         COMMAND                 CREATED                  STATUS              PORTS               NAMES
c730e2cd64a7        quay.io/coreos/etcd:v3.3.10   "/usr/local/bin/etcd"   Less than a second ago   Created                                 etcd1

@MengS1024


Here is my solution. I tried running docker run hello-world on the master node, but got the following error:

docker: Error response from daemon: OCI runtime create failed: unable to retrieve OCI runtime error (open /run/containerd/io.containerd.runtime.v1.linux/moby/ec5230f20c55068a7603dbfb553a5799ad3902698a64be2ddd7748d2fcfc65e9/log.json: no such file or directory): fork/exec /usr/bin/nvidia-container-runtime: no such file or directory: unknown.
ERRO[0000] error waiting for container: context canceled

So I installed nvidia-container-runtime on it, and the etcd issue was solved after that.
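
If I read that error right, Docker was configured to use the nvidia runtime by default while the binary was missing. A place to check for that (an assumption on my part, not verified in this thread) is /etc/docker/daemon.json:

cat /etc/docker/daemon.json
# a config like the following would make every container, including etcd1,
# depend on /usr/bin/nvidia-container-runtime:
# {
#   "default-runtime": "nvidia",
#   "runtimes": {
#     "nvidia": { "path": "/usr/bin/nvidia-container-runtime", "runtimeArgs": [] }
#   }
# }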

@alexmang85

(quoting the previous comment about the etcd container stuck in "Created" and the nvidia-container-runtime fix)

Wow, are you using nvidia-container-runtime? Interesting case)

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jun 16, 2021
@fejta-bot

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Jul 17, 2021
@whatnick

Still happens against master, with a similar setup to all the cases above. (Screenshot of the failing task attached in the original issue as evidence.)

whatnick commented Jul 23, 2021

Further investigation reveals that the container is running but stuck in a state where its peer is unreachable.
LOGS:

$sudo docker logs etcd2
...
2021-07-23 11:02:43.794915 W | rafthttp: health check for peer ff0bb6fbc61533bb could not connect: dial tcp 192.168.1.23:2380: i/o timeout
2021-07-23 11:02:48.795253 W | rafthttp: health check for peer ff0bb6fbc61533bb could not connect: dial tcp 192.168.1.23:2380: i/o timeout
2021-07-23 11:02:48.795305 W | rafthttp: health check for peer ff0bb6fbc61533bb could not connect: dial tcp 192.168.1.23:2380: i/o timeout

DOCKER Container status:

$ sudo docker ps -a
CONTAINER ID   IMAGE                         COMMAND                 CREATED       STATUS       PORTS     NAMES
d29d9c7a4f78   quay.io/coreos/etcd:v3.4.13   "/usr/local/bin/etcd"   2 hours ago   Up 2 hours             etcd2

ETCD Launch command:

$ps -ef | grep etcd
root      144722       1  0 19:29 ?        00:00:00 /bin/bash /usr/local/bin/etcd
root      144723  144722  0 19:29 ?        00:00:01 /usr/bin/docker run --restart=on-failure:5 --env-file=/etc/etcd.env --net=host -v /etc/ssl/certs:/etc/ssl/certs:ro -v /etc/ssl/etcd/ssl:/etc/ssl/etcd/ssl:ro -v /var/lib/etcd:/var/lib/etcd:rw --memory=512M --blkio-weight=1000 --name=etcd2 quay.io/coreos/etcd:v3.4.13 /usr/local/bin/etcd
root      144783  144762  3 19:29 ?        00:03:09 /usr/local/bin/etcd

TELNET to port 2379/2380:

$ telnet 192.168.1.22 2380
Trying 192.168.1.22...
Connected to 192.168.1.22.
Escape character is '^]'.

$ telnet 192.168.1.22 2379
Trying 192.168.1.22...
Connected to 192.168.1.22.
Escape character is '^]'.

RESOLUTION:

The underlying cause was a misconfigured UFW on node3; allowing traffic from the rest of the nodes in the cluster makes etcd healthy.
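
For reference, a sketch of the kind of UFW rule that covers etcd's client (2379) and peer (2380) ports; the subnet is taken from the addresses in the logs above and may need adjusting:

sudo ufw allow from 192.168.1.0/24 to any port 2379:2380 proto tcp
sudo ufw reload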

arno01 commented Jul 29, 2021

To me it seems like the global and local IPs of the master node have been mixed up in some way, and etcd is not happy about that.

Yep, it looks like a bug in a template somewhere, since with a single-node deployment (etcd_deployment_type: host) /etc/etcd.env gets this:

# cat /etc/etcd.env
...
ETCD_ADVERTISE_CLIENT_URLS=https://PUBLIC_IP:2379
ETCD_INITIAL_ADVERTISE_PEER_URLS=https://PUBLIC_IP:2380
ETCD_LISTEN_CLIENT_URLS=https://INTERNAL_IP:2379,https://127.0.0.1:2379
ETCD_LISTEN_PEER_URLS=https://INTERNAL_IP:2380
ETCD_INITIAL_CLUSTER=etcd1=https://INTERNAL_IP:2380
ETCDCTL_ENDPOINTS=https://127.0.0.1:2379

Hence we get the following error, where etcd is confused by the PUBLIC_IP it sees in the peer and client URLs:

2021-07-29 14:11:57.269323 C | etcdmain: --initial-cluster has etcd1=https://INTERNAL_IP:2380 but missing from --initial-advertise-peer-urls=https://PUBLIC_IP:2380 ("https://PUBLIC_IP:2380"(resolved from "https://PUBLIC_IP:2380") != "https://INTERNAL_IP:2380"(resolved from "https://INTERNAL_IP:2380"))

This is how a working etcd deployment should look with 3 etcd nodes (with internal IPs 10.0.0.{1..3} respectively):

root@node1:~# cat /etc/etcd.env | grep http
ETCD_ADVERTISE_CLIENT_URLS=https://10.0.0.1:2379
ETCD_INITIAL_ADVERTISE_PEER_URLS=https://10.0.0.1:2380
ETCD_LISTEN_CLIENT_URLS=https://10.0.0.1:2379,https://127.0.0.1:2379
ETCD_LISTEN_PEER_URLS=https://10.0.0.1:2380
ETCD_INITIAL_CLUSTER=etcd1=https://10.0.0.1:2380,etcd2=https://10.0.0.2:2380,etcd3=https://10.0.0.3:2380
ETCDCTL_ENDPOINTS=https://127.0.0.1:2379
root@node2:~# cat /etc/etcd.env | grep http
ETCD_ADVERTISE_CLIENT_URLS=https://10.0.0.2:2379
ETCD_INITIAL_ADVERTISE_PEER_URLS=https://10.0.0.2:2380
ETCD_LISTEN_CLIENT_URLS=https://10.0.0.2:2379,https://127.0.0.1:2379
ETCD_LISTEN_PEER_URLS=https://10.0.0.2:2380
ETCD_INITIAL_CLUSTER=etcd1=https://10.0.0.1:2380,etcd2=https://10.0.0.2:2380,etcd3=https://10.0.0.3:2380
ETCDCTL_ENDPOINTS=https://127.0.0.1:2379
root@node3:~# cat /etc/etcd.env | grep http
ETCD_ADVERTISE_CLIENT_URLS=https://10.0.0.3:2379
ETCD_INITIAL_ADVERTISE_PEER_URLS=https://10.0.0.3:2380
ETCD_LISTEN_CLIENT_URLS=https://10.0.0.3:2379,https://127.0.0.1:2379
ETCD_LISTEN_PEER_URLS=https://10.0.0.3:2380
ETCD_INITIAL_CLUSTER=etcd1=https://10.0.0.1:2380,etcd2=https://10.0.0.2:2380,etcd3=https://10.0.0.3:2380
ETCDCTL_ENDPOINTS=https://127.0.0.1:2379

@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue or PR with /reopen
  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close

@k8s-ci-robot

@k8s-triage-robot: Closing this issue.

(In response to the triage comment above.)

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

patvdleer commented Dec 21, 2021

I've run into something very similar.

The health check gives: Error while dialing dial tcp 192.168.112.6:2379: connect: connection refused

I have:

# 0|1 bastion nodes
number_of_bastions = 1

# masters
number_of_k8s_masters = 1
number_of_k8s_masters_no_etcd = 0
number_of_k8s_masters_no_floating_ip = 0
number_of_k8s_masters_no_floating_ip_no_etcd = 0

# nodes
number_of_k8s_nodes = 0
number_of_k8s_nodes_no_floating_ip = 3

On the master, sudo docker logs etcd1 shows:

 etcdmain: --initial-cluster has etcd1=https://192.168.112.6:2380 but missing from --initial-advertise-peer-urls=https://89.xx.xx.xxx:2380 ("https://89.xx.xx.xxx:2380"(resolved from "https://89.xx.xx.xxx:2380") != "https://192.168.112.6:2380"(resolved from "https://192.168.112.6:2380"))

Running contrib/terraform/openstack/hosts --list --pretty shows:

 "k8s-gripop-k8s-master-1": {
                "access_ip_v4": "89.xx.xx.xxx",
                "access_ip_v6": "",
                "access_ip": "89.xx.xx.xxx",
                "ip": "192.168.112.6",

In /roles/kubespray-defaults/defaults/main.yaml, replace the lines below to make it work:

etcd_access_address: "{{ access_ip | default(etcd_address) }}"
etcd_events_access_address: "{{ access_ip | default(etcd_address) }}"

with:

etcd_access_address: "{{ ip | default(etcd_address) }}"
etcd_events_access_address: "{{ ip | default(etcd_address) }}"

@Kirkirillka

I'm wondering when this proposal will be merged? It seems to work fine.
