
Restore cluster with embedded etcd datastore snapshot failed #5334

Closed
chinasoftgit opened this issue Mar 25, 2022 · 8 comments

Comments

@chinasoftgit

Environmental info:
K3s Version: v1.22.7+k3s1
K3s Cluster:
NAME STATUS ROLES AGE VERSION
server-virtualbox Ready control-plane,etcd,master 21m v1.22.7+k3s1
server2-virtualbox Ready control-plane,etcd,master 17m v1.22.7+k3s1
server3-virtualbox Ready control-plane,etcd,master 13m v1.22.7+k3s1

Restore Snapshot:

  1. stop the first master server:
    systemctl stop k3s
  2. restore the server from the snapshot:
    k3s server --cluster-reset --cluster-reset-restore-path=/var/lib/rancher/k3s/server/db/snapshots/etcd-snapshot-serverXXXX
  3. on each other master server, stop k3s, delete the etcd data, and start k3s:
    systemctl stop k3s
    rm -rf /var/lib/rancher/k3s/db
    systemctl start k3s
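For reference, the full flow described in these steps can be sketched as one procedure (the snapshot name is a placeholder from this report, and the data dir assumes a default install):

```shell
# On the first server: stop k3s and reset the cluster from the snapshot.
systemctl stop k3s
k3s server --cluster-reset \
  --cluster-reset-restore-path=/var/lib/rancher/k3s/server/db/snapshots/etcd-snapshot-serverXXXX
# After the reset completes, start k3s normally (without --cluster-reset).
systemctl start k3s

# On each remaining server: stop k3s, back up and remove the old etcd data
# (${datadir}/server/db, per the log message below), then start k3s so the
# node rejoins the restored cluster.
systemctl stop k3s
mv /var/lib/rancher/k3s/server/db /var/lib/rancher/k3s/server/db.bak
systemctl start k3s
```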

Expected behavior:
cluster is healthy

Describe the bug:
On the first master, the k3s server is not active/running. Logs:

INFO[0011] Failed to set etcd role label: failed to register CRDs: Get "https://127.0.0.1:6444/apis/apiextensions.k8s.io/v1/customresourcedefinitions": dial tcp 127.0.0.1:6444: connect: connection refused
INFO[0012] etcd data store connection OK
INFO[0012] ETCD server is now running
INFO[0012] k3s is up and running
WARN[0012] failed to unmarshal etcd key: unexpected end of JSON input
WARN[0012] bootstrap key already exists
INFO[0012] Reconciling etcd snapshot data in k3s-etcd-snapshots ConfigMap
INFO[0012] Reconciling bootstrap data between datastore and disk
INFO[0012] Cluster reset: backing up certificates directory to /var/lib/rancher/k3s/server/tls-1648174319
INFO[0012] Etcd is running, restart without --cluster-reset flag now. Backup and delete ${datadir}/server/db on each peer etcd server and rejoin the nodes

and the other master servers are not running. How can I correctly restore a healthy cluster from an embedded etcd datastore snapshot?

@brandond
Member

Did you see the instructions at the end of the log?

INFO[0012] Etcd is running, restart without --cluster-reset flag now. Backup and delete ${datadir}/server/db on each peer etcd server and rejoin the nodes

@chinasoftgit
Author

chinasoftgit commented Mar 25, 2022

Did you see the instructions at the end of the log?

INFO[0012] Etcd is running, restart without --cluster-reset flag now. Backup and delete ${datadir}/server/db on each peer etcd server and rejoin the nodes

How do I restart without the --cluster-reset flag? On the first master node, I executed “k3s server --cluster-reset --cluster-reset-restore-path=/var/lib/rancher/k3s/server/db/snapshots/etcd-snapshot-serverXXXX”, and then executed “systemctl restart k3s”, but the k3s server is inactive and not running.

cat /etc/systemd/system/k3s.service
[Unit]
Description=Lightweight Kubernetes
Documentation=https://k3s.io
Wants=network-online.target
After=network-online.target

[Install]
WantedBy=multi-user.target

[Service]
Type=notify
EnvironmentFile=-/etc/default/%N
EnvironmentFile=-/etc/sysconfig/%N
EnvironmentFile=-/etc/systemd/system/k3s.service.env
KillMode=process
Delegate=yes
LimitNOFILE=1048576
LimitNPROC=infinity
LimitCORE=infinity
TasksMax=infinity
TimeoutStartSec=0
Restart=always
RestartSec=5s
ExecStartPre=/bin/sh -xc '! /usr/bin/systemctl is-enabled --quiet nm-cloud-setup.service'
ExecStartPre=-/sbin/modprobe br_netfilter
ExecStartPre=-/sbin/modprobe overlay
ExecStart=/usr/local/bin/k3s \
    server \
    '--cluster-init' \
    '--flannel-backend=none' \
    '--disable-network-policy' \
    '--disable=traefik' \
    '--node-ip=xx.xx.xx.xx' \
    '--etcd-snapshot-schedule-cron=0 */1 * * *' \

@brandond
Member

brandond commented Mar 25, 2022

If you stopped the K3s service and ran the cluster-reset command from your shell, all you should need to do is start the k3s service again? Did you somehow run the cluster-reset command with the k3s service still running?

@chinasoftgit
Author

After running the cluster-reset command, I started the k3s service again, but the k3s service cannot run; it is always in activating (start). Logs:

● k3s.service - Lightweight Kubernetes
Loaded: loaded (/etc/systemd/system/k3s.service; enabled; vendor preset: enabled)
Active: activating (start) since Mon 2022-03-28 09:47:51 CST; 1min 57s ago
Docs: https://k3s.io
Process: 19464 ExecStartPre=/bin/sh -xc ! /usr/bin/systemctl is-enabled --quiet nm-cloud-setup.service (code=exited, status=0/SUCCESS)
Process: 19466 ExecStartPre=/sbin/modprobe br_netfilter (code=exited, status=0/SUCCESS)
Process: 19467 ExecStartPre=/sbin/modprobe overlay (code=exited, status=0/SUCCESS)
Main PID: 19468 (k3s-server)
Tasks: 16
Memory: 150.2M
CGroup: /system.slice/k3s.service
├─19468 /usr/local/bin/k3s server
└─19479 containerd

Mar 28 09:49:40 server-VirtualBox k3s[19468]: time="2022-03-28T09:49:40+08:00" level=info msg="Connecting to proxy" url="wss://127.0.0.1:6443/v1-k3s/connect"
Mar 28 09:49:40 server-VirtualBox k3s[19468]: time="2022-03-28T09:49:40+08:00" level=error msg="Failed to connect to proxy" error="websocket: bad handshake"
Mar 28 09:49:40 server-VirtualBox k3s[19468]: time="2022-03-28T09:49:40+08:00" level=error msg="Remotedialer proxy error" error="websocket: bad handshake"
Mar 28 09:49:43 server-VirtualBox k3s[19468]: time="2022-03-28T09:49:43+08:00" level=info msg="Failed to test data store connection: this server is a not a member of the etcd cluster. Found [server-virtualbox-49709429=https://10.0.2.1>
Mar 28 09:49:43 server-VirtualBox k3s[19468]: time="2022-03-28T09:49:43+08:00" level=info msg="runtime is not yet initialized"
Mar 28 09:49:45 server-VirtualBox k3s[19468]: time="2022-03-28T09:49:45+08:00" level=info msg="Connecting to proxy" url="wss://127.0.0.1:6443/v1-k3s/connect"
Mar 28 09:49:45 server-VirtualBox k3s[19468]: time="2022-03-28T09:49:45+08:00" level=error msg="Failed to connect to proxy" error="websocket: bad handshake"
Mar 28 09:49:45 server-VirtualBox k3s[19468]: time="2022-03-28T09:49:45+08:00" level=error msg="Remotedialer proxy error" error="websocket: bad handshake"
Mar 28 09:49:48 server-VirtualBox k3s[19468]: time="2022-03-28T09:49:48+08:00" level=info msg="Failed to test data store connection: this server is a not a member of the etcd cluster. Found [server-virtualbox-49709429=https://10.0.2.1>
Mar 28 09:49:48 server-VirtualBox k3s[19468]: time="2022-03-28T09:49:48+08:00" level=info msg="runtime is not yet initialized"

@brandond
Member

brandond commented Mar 29, 2022

Did you change the VM's hostname or IP address at some point? The log line is truncated here but the logs indicate that the node's current name/IP cannot be found in the cluster. If you included the whole log line here it should be more clear what the node's current name/IP are.

Failed to test data store connection: this server is a not a member of the etcd cluster. Found [server-virtualbox-49709429=https://10.0.2.1
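The quoted line is cut off by the journal pager's line truncation; standard `journalctl` options can recover the full text:

```shell
# --no-pager avoids the pager's line truncation, and -o cat prints just the
# message text; grep pulls out the membership error with the full member list.
journalctl -u k3s --no-pager -o cat | grep "not a member of the etcd cluster"
```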

@chinasoftgit
Author

Thanks, my VM has multiple network interfaces. I ran cluster-reset with --node-ip and the restore succeeded. In addition, I ran the etcdctl snapshot restore command, which can also restore successfully.
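For anyone hitting the same multi-NIC problem, the fix reported here pins the node IP during the reset (the IP placeholder matches the one in the unit file above):

```shell
# Run on the first server while k3s is stopped. --node-ip should match the
# address this node was originally registered with, so the restored etcd
# member list agrees with the node's advertised address.
k3s server --cluster-reset \
  --cluster-reset-restore-path=/var/lib/rancher/k3s/server/db/snapshots/etcd-snapshot-serverXXXX \
  --node-ip=xx.xx.xx.xx
```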

@dberardo-com

TBH I think the documentation is misleading. After losing about 2 hours looking for a solution, I figured out that I had to remove not just the data_dir/server/db folder but the whole data_dir/server directory ... because otherwise k3s will ignore the --server option and try to recreate a sqlite database.

is this normal ?

note: i faced this problem so i had to restore the cluster: etcd-io/etcd#13766

@brandond
Member

otherwise k3s will ignore the --server option and try to recreate a sqlite database.

No, that's not right. All you should need to delete is the db dir. If there are existing etcd files on disk, it will ignore the --server flag and try to use the existing etcd datastore, but there is nothing that will cause it to ignore the --server flag and create a new sqlite datastore.
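To make the scope of that deletion concrete, a minimal sketch using a temp directory as a simplified stand-in for the real /var/lib/rancher/k3s layout:

```shell
# Hypothetical layout; $datadir stands in for /var/lib/rancher/k3s.
datadir=$(mktemp -d)
mkdir -p "$datadir/server/db" "$datadir/server/tls"

# Only the etcd datastore directory is removed before rejoining with
# --server; the rest of server/ (certificates, tokens) stays in place.
rm -rf "$datadir/server/db"
ls "$datadir/server"   # tls remains
```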
