TASK [etcd : Configure | Check if etcd cluster is healthy] is failing #5550
Comments
Having the same issue now. I'm trying to add new worker nodes after updating hosts.yaml.
All tasks pass except this one:

NO MORE HOSTS LEFT *************************************************************

PLAY RECAP *********************************************************************
Tuesday 04 February 2020 17:00:09 +0530 (0:00:41.193)       0:57:07.307 ******
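For reference, adding worker nodes in kubespray is normally done with the scale.yml playbook rather than a full cluster.yml run; a sketch of the invocation (the inventory path is an example):

  ansible-playbook -i inventory/mycluster/hosts.yaml --become --become-user=root scale.yml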
Same here:
but doing it manually:
Problem with no_proxy? The
I tried with "retries: 16" (changed it in roles/etcd/tasks/configure.yml), thinking it might be related to a timing issue. I still see the same failure with retries set to 16.
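For context, a rough sketch of what that retry bump looks like in roles/etcd/tasks/configure.yml; only the retries value was changed, and the surrounding fields and variable names here are approximations rather than copies from the repo:

  - name: Configure | Check if etcd cluster is healthy
    shell: "{{ bin_dir }}/etcdctl --no-sync --endpoints={{ etcd_access_addresses }} cluster-health | grep -q 'cluster is healthy'"
    register: etcd_cluster_is_healthy
    until: etcd_cluster_is_healthy.rc == 0
    retries: 16        # bumped from the default (4) to rule out a timing issue
    delay: 5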
I SSHed into the node and ran the etcdctl command and got the same output. 10.100.15.100 is the master node's IP.
Any update on this issue? I ran into the same issue and am not sure what the resolution is.
In my environment, a proxy is used to get out to the Internet. Because the {http,https}_proxy environment variables were not set, the ansible tasks failed. Setting the proxy environment variables fixed the issue for us.
We're experiencing the same issue, and at least for us it looks permissions-related.
We're using CentOS 7 and
On proxy environments, for some reason kubespray does not use the (autogenerated) no_proxy variable when it is checking the etcd health. If you add the IP address of the master/etcd node to no_proxy, the check passes:

cat /etc/environment
export https_proxy=http://192.168.177.2:911
export http_proxy=http://192.168.177.2:911
export no_proxy=192.168.177.108   # the last line is relevant here

EDIT: The above is not entirely true. Kubespray uses the autogenerated no_proxy, but only if your
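If you drive the proxy settings through kubespray itself rather than /etc/environment, the inventory group_vars are the usual place; a sketch with example values, and variable names that should be checked against your kubespray version:

  # inventory/mycluster/group_vars/all/all.yml (values are examples)
  http_proxy: "http://192.168.177.2:911"
  https_proxy: "http://192.168.177.2:911"
  additional_no_proxy: "192.168.177.108"   # appended to the autogenerated no_proxy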
Could this be because there is an even number of master nodes?
For me, it got resolved by adding a task to disable the firewall/iptables; after that everything worked.
Same for me!
I think this is the correct fix for this issue: #6346
Can you elaborate on how I can add a task to disable the firewall/iptables?
It doesn't feel right to stop the firewall, but after stopping the firewall everything works OK. Add a new task to the config file kubespray/roles/etcd/tasks/configure.yml.
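A minimal sketch of such a workaround task, assuming firewalld on CentOS 7; this is not an upstream kubespray task, and stopping the firewall should only be a diagnostic step:

  - name: Stop and disable firewalld (diagnostic workaround only)
    systemd:
      name: firewalld
      state: stopped
      enabled: false
    become: true
    ignore_errors: true   # the service may not exist on every node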
Same problem with CentOS 7, but for my part I have the impression that it happens because it does not use the correct IP for the cluster health verification.
I am facing the same issue; my error is provided below. I then ran kubespray again, but this time I only included node1 in the etcd group, so etcd is only available on node1 (x.x.x.231 is node1).

inventory/mycluster/hosts.yml (configured for three nodes):

inventory/mycluster/hosts.yml (configured with one node):

I even tested the etcd health manually:
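For anyone wanting to reproduce that manual check, a sketch of the v2-API health command the kubespray task runs, with a placeholder endpoint and the certificate paths kubespray typically generates (verify the exact file names on your etcd node):

  /usr/local/bin/etcdctl --no-sync \
    --endpoints=https://x.x.x.231:2379 \
    --ca-file=/etc/ssl/etcd/ssl/ca.pem \
    --cert-file=/etc/ssl/etcd/ssl/admin-node1.pem \
    --key-file=/etc/ssl/etcd/ssl/admin-node1-key.pem \
    cluster-health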
+1
Any update? I'm having the same issue.
I have the same issue. I tested some of the solutions suggested here and none worked for me. I ended up taking a look at the docker container running the etcd service, and it seems to be stuck in some crash/reboot loop.
The first thing I wonder about is that there are no ports attached to the container; is that not needed? I then checked the logs to see what caused this, and this is what I get.
To me it seems like the global and local IPs of the master nodes have been mixed up in some way, and etcd is not happy about that.
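If you want to confirm the crash loop yourself, standard docker commands are enough; the container name etcd1 is taken from the docker ps output quoted later in this thread:

  docker ps -a --filter name=etcd                                  # container state
  docker logs --tail 100 etcd1                                     # why etcd keeps restarting
  docker inspect -f '{{.State.Status}} {{.RestartCount}}' etcd1    # status and restart count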
Good day! I had a similar error when etcd ran on three nodes; with a single etcd node there were no problems. My stack: Ubuntu 16.04 as a hypervisor, with the nodes running as KVM virtual machines on a network bridge (three masters, one etcd instance on each master, three workers, and one ingress node). I understood that the problem was network connectivity, but I could not figure out where, since all the nodes could ping each other, connections to etcd via telnet succeeded, and the firewall was disabled on all nodes. However, in order to give all cluster nodes Internet access, a masquerading rule had been written on the hypervisor
+1

CONTAINER ID   IMAGE                         COMMAND                 CREATED                  STATUS    PORTS   NAMES
c730e2cd64a7   quay.io/coreos/etcd:v3.3.10   "/usr/local/bin/etcd"   Less than a second ago   Created           etcd1
Here is my solution. I tried running docker and got:

Error response from daemon: OCI runtime create failed: unable to retrieve OCI runtime error (open /run/containerd/io.containerd.runtime.v1.linux/moby/ec5230f20c55068a7603dbfb553a5799ad3902698a64be2ddd7748d2fcfc65e9/log.json: no such file or directory): fork/exec /usr/bin/nvidia-container-runtime: no such file or directory: unknown.
ERRO[0000] error waiting for container: context canceled

So I installed nvidia-container-runtime on it, and the etcd issue was solved after this.
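For anyone hitting the same error, it is worth checking which runtime docker is configured to use before reinstalling anything; these are standard docker CLI checks, and the daemon.json path is the usual default:

  docker info --format '{{.DefaultRuntime}}'          # e.g. runc, or nvidia
  grep -n 'default-runtime' /etc/docker/daemon.json   # a missing runtime binary here breaks every container, including etcd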
Wow, are you using nvidia-container-runtime? Interesting case.
Issues go stale after 90d of inactivity. If this issue is safe to close now please do so with /close. Send feedback to sig-contributor-experience at kubernetes/community.
Stale issues rot after 30d of inactivity. If this issue is safe to close now please do so with /close. Send feedback to sig-contributor-experience at kubernetes/community.
Further investigation reveals that the container is running but its ports are unreachable.

Docker container status:
etcd launch command:
Telnet to ports 2379/2380:

RESOLUTION: the underlying cause was a misconfigured UFW on node3; allowing traffic from the rest of the nodes in the cluster makes etcd healthy.
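A sketch of the corresponding UFW fix, assuming the cluster nodes sit on one subnet (the subnet below is only an example; etcd uses 2379 for clients and 2380 for peer traffic):

  ufw allow from 10.25.26.0/24 to any port 2379:2380 proto tcp
  ufw status numbered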
Yep, looks like a bug in the template somewhere, as with the single-node deployment
Hence we are getting the following error, where
This is how the working etcd deployment should look with 3 etcd nodes (with the
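For comparison, the initial-cluster setting of a healthy 3-node etcd typically lists all three members; the member names and URLs below are placeholders, not values taken from this thread:

  ETCD_INITIAL_CLUSTER=etcd1=https://10.25.26.100:2380,etcd2=https://10.25.26.101:2380,etcd3=https://10.25.26.102:2380
  ETCD_INITIAL_CLUSTER_STATE=new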
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs. This bot triages issues and PRs according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Reopen this issue with /reopen
- Mark this issue as fresh with /remove-lifecycle rotten
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/close
@k8s-triage-robot: Closing this issue. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
I've run into something very similar; the healthcheck gives
I have:
on master
running
In
by
I'm wondering when this proposal will be merged? It seems to work fine.
I "git clone" the kubespray project on 1/16/20 @ 8 AM PST. I got past the "repoquery container-selinux" failure that was reported earlier. I am seeing this failure and wonder there is a quick fix that I can apply. My cluster has one master and two worker nodes. The etcd check on the worker nodes is fine but not the master.
ansible-playbook -i inventory/ks89/hosts.yaml --become --become-user=root cluster.yml
TASK [etcd : Configure | Check if etcd cluster is healthy] ******************************************************************************************************************************
Thursday 16 January 2020 07:51:25 -0800 (0:00:00.103) 0:11:01.407 ******
FAILED - RETRYING: Configure | Check if etcd cluster is healthy (4 retries left).
FAILED - RETRYING: Configure | Check if etcd cluster is healthy (3 retries left).
FAILED - RETRYING: Configure | Check if etcd cluster is healthy (2 retries left).
FAILED - RETRYING: Configure | Check if etcd cluster is healthy (1 retries left).
fatal: [node1 -> 10.25.26.100]: FAILED! => {"attempts": 4, "changed": false, "cmd": "/usr/local/bin/etcdctl --no-sync --endpoints=https://10.25.26.100:2379 cluster-health | grep -q 'cluster is healthy'", "delta": "0:00:02.022447", "end": "2020-01-16 15:51:47.503910", "msg": "non-zero return code", "rc": 1, "start": "2020-01-16 15:51:45.481463", "stderr": "Error: client: etcd cluster is unavailable or misconfigured; error #0: client: endpoint https://10.25.26.100:2379 exceeded header timeout\n\nerror #0: client: endpoint https://10.25.26.100:2379 exceeded header timeout", "stderr_lines": ["Error: client: etcd cluster is unavailable or misconfigured; error #0: client: endpoint https://10.25.26.100:2379 exceeded header timeout", "", "error #0: client: endpoint https://10.25.26.100:2379 exceeded header timeout"], "stdout": "", "stdout_lines": []}
NO MORE HOSTS LEFT **********************************************************************************************************************************************************************
to retry, use: --limit @/home/ngs/ks89/kubespray/cluster.retry
PLAY RECAP ******************************************************************************************************************************************************************************
localhost : ok=1 changed=0 unreachable=0 failed=0
node1 : ok=460 changed=56 unreachable=0 failed=1
node2 : ok=367 changed=47 unreachable=0 failed=0
node3 : ok=366 changed=47 unreachable=0 failed=0
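A few checks that are usually worth running on the failing master in this situation; they assume kubespray's docker-based etcd deployment of that era, where etcd runs in a container managed by a systemd unit named etcd:

  systemctl status etcd
  journalctl -u etcd --no-pager -n 50
  docker ps -a --filter name=etcd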
My hosts.yaml file looks like this:
cat hosts.yaml
all:
  hosts:
    node1:
      ansible_host: 10.25.26.100
      ip: 10.25.26.100
      access_ip: 10.25.26.100
    node2:
      ansible_host: 10.25.26.101
      ip: 10.25.26.101
      access_ip: 10.25.26.101
    node3:
      ansible_host: 10.25.26.102
      ip: 10.25.26.102
      access_ip: 10.25.26.102
  children:
    kube-master:
      hosts:
        node1:
    kube-node:
      hosts:
        node2:
        node3:
    etcd:
      hosts:
        node1:
        node2:
        node3:
    k8s-cluster:
      children:
        kube-master:
        kube-node:
    calico-rr:
      hosts: {}
Environment:
My environment is CentOS 7.
cat /etc/os-release
NAME="CentOS Linux"
VERSION="7 (Core)"
ID="centos"
ID_LIKE="rhel fedora"
VERSION_ID="7"
PRETTY_NAME="CentOS Linux 7 (Core)"
ANSI_COLOR="0;31"
CPE_NAME="cpe:/o:centos:centos:7"
HOME_URL="https://www.centos.org/"
BUG_REPORT_URL="https://bugs.centos.org/"
CENTOS_MANTISBT_PROJECT="CentOS-7"
CENTOS_MANTISBT_PROJECT_VERSION="7"
REDHAT_SUPPORT_PRODUCT="centos"
REDHAT_SUPPORT_PRODUCT_VERSION="7"
ansible --version
ansible 2.7.12
$ git rev-parse --short HEAD
d640a57