
Restore cluster with embedded etcd datastore snapshot failed #5334

Closed
chinasoftgit opened this issue Mar 25, 2022 · 8 comments

Comments

@chinasoftgit

Environmental info:
K3s Version: v1.22.7+k3s1
K3s Cluster:
NAME STATUS ROLES AGE VERSION
server-virtualbox Ready control-plane,etcd,master 21m v1.22.7+k3s1
server2-virtualbox Ready control-plane,etcd,master 17m v1.22.7+k3s1
server3-virtualbox Ready control-plane,etcd,master 13m v1.22.7+k3s1

Restore Snapshot:

  1. stop the first master server:
    systemctl stop k3s
  2. restore the server from the snapshot:
    k3s server --cluster-reset --cluster-reset-restore-path=/var/lib/rancher/k3s/server/db/snapshots/etcd-snapshot-serverXXXX
  3. on each other master server, stop k3s, delete the etcd data, and start k3s:
    systemctl stop k3s
    rm -rf /var/lib/rancher/k3s/db
    systemctl start k3s
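For reference, the full flow described in these steps can be sketched as one procedure (the snapshot name is a placeholder from this report, and the data dir assumes a default install):

```shell
# On the first server: stop k3s and reset the cluster from the snapshot.
systemctl stop k3s
k3s server --cluster-reset \
  --cluster-reset-restore-path=/var/lib/rancher/k3s/server/db/snapshots/etcd-snapshot-serverXXXX
# After the reset completes, start k3s normally (without --cluster-reset).
systemctl start k3s

# On each remaining server: stop k3s, back up and remove the old etcd data
# (${datadir}/server/db, per the log message below), then start k3s so the
# node rejoins the restored cluster.
systemctl stop k3s
mv /var/lib/rancher/k3s/server/db /var/lib/rancher/k3s/server/db.bak
systemctl start k3s
```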

Expected behavior:
cluster is healthy

Describe the bug:
On the first master, the k3s server is not active/running. Logs:

INFO[0011] Failed to set etcd role label: failed to register CRDs: Get "https://127.0.0.1:6444/apis/apiextensions.k8s.io/v1/customresourcedefinitions": dial tcp 127.0.0.1:6444: connect: connection refused
INFO[0012] etcd data store connection OK
INFO[0012] ETCD server is now running
INFO[0012] k3s is up and running
WARN[0012] failed to unmarshal etcd key: unexpected end of JSON input
WARN[0012] bootstrap key already exists
INFO[0012] Reconciling etcd snapshot data in k3s-etcd-snapshots ConfigMap
INFO[0012] Reconciling bootstrap data between datastore and disk
INFO[0012] Cluster reset: backing up certificates directory to /var/lib/rancher/k3s/server/tls-1648174319
INFO[0012] Etcd is running, restart without --cluster-reset flag now. Backup and delete ${datadir}/server/db on each peer etcd server and rejoin the nodes

and the other master servers are not running. How can I correctly restore a healthy cluster from an embedded etcd datastore snapshot?

@brandond
Member

Did you see the instructions at the end of the log?

INFO[0012] Etcd is running, restart without --cluster-reset flag now. Backup and delete ${datadir}/server/db on each peer etcd server and rejoin the nodes

@chinasoftgit
Author

chinasoftgit commented Mar 25, 2022

Did you see the instructions at the end of the log?

INFO[0012] Etcd is running, restart without --cluster-reset flag now. Backup and delete ${datadir}/server/db on each peer etcd server and rejoin the nodes

How do I restart without the --cluster-reset flag? On the first master node, I executed “k3s server --cluster-reset --cluster-reset-restore-path=/var/lib/rancher/k3s/server/db/snapshots/etcd-snapshot-serverXXXX”, and then executed “systemctl restart k3s”, but the k3s server is inactive and not running.

cat /etc/systemd/system/k3s.service
[Unit]
Description=Lightweight Kubernetes
Documentation=https://k3s.io
Wants=network-online.target
After=network-online.target

[Install]
WantedBy=multi-user.target

[Service]
Type=notify
EnvironmentFile=-/etc/default/%N
EnvironmentFile=-/etc/sysconfig/%N
EnvironmentFile=-/etc/systemd/system/k3s.service.env
KillMode=process
Delegate=yes
LimitNOFILE=1048576
LimitNPROC=infinity
LimitCORE=infinity
TasksMax=infinity
TimeoutStartSec=0
Restart=always
RestartSec=5s
ExecStartPre=/bin/sh -xc '! /usr/bin/systemctl is-enabled --quiet nm-cloud-setup.service'
ExecStartPre=-/sbin/modprobe br_netfilter
ExecStartPre=-/sbin/modprobe overlay
ExecStart=/usr/local/bin/k3s \
    server \
    '--cluster-init' \
    '--flannel-backend=none' \
    '--disable-network-policy' \
    '--disable=traefik' \
    '--node-ip=xx.xx.xx.xx' \
    '--etcd-snapshot-schedule-cron=0 */1 * * *' \

@brandond
Member

brandond commented Mar 25, 2022

If you stopped the K3s service and ran the cluster-reset command from your shell, all you should need to do is start the k3s service again? Did you somehow run the cluster-reset command with the k3s service still running?

@chinasoftgit
Author

After running the cluster-reset command, I started the k3s service again, but the k3s service cannot run; it is always in activating (start). Logs:

● k3s.service - Lightweight Kubernetes
Loaded: loaded (/etc/systemd/system/k3s.service; enabled; vendor preset: enabled)
Active: activating (start) since Mon 2022-03-28 09:47:51 CST; 1min 57s ago
Docs: https://k3s.io
Process: 19464 ExecStartPre=/bin/sh -xc ! /usr/bin/systemctl is-enabled --quiet nm-cloud-setup.service (code=exited, status=0/SUCCESS)
Process: 19466 ExecStartPre=/sbin/modprobe br_netfilter (code=exited, status=0/SUCCESS)
Process: 19467 ExecStartPre=/sbin/modprobe overlay (code=exited, status=0/SUCCESS)
Main PID: 19468 (k3s-server)
Tasks: 16
Memory: 150.2M
CGroup: /system.slice/k3s.service
├─19468 /usr/local/bin/k3s server
└─19479 containerd

Mar 28 09:49:40 server-VirtualBox k3s[19468]: time="2022-03-28T09:49:40+08:00" level=info msg="Connecting to proxy" url="wss://127.0.0.1:6443/v1-k3s/connect"
Mar 28 09:49:40 server-VirtualBox k3s[19468]: time="2022-03-28T09:49:40+08:00" level=error msg="Failed to connect to proxy" error="websocket: bad handshake"
Mar 28 09:49:40 server-VirtualBox k3s[19468]: time="2022-03-28T09:49:40+08:00" level=error msg="Remotedialer proxy error" error="websocket: bad handshake"
Mar 28 09:49:43 server-VirtualBox k3s[19468]: time="2022-03-28T09:49:43+08:00" level=info msg="Failed to test data store connection: this server is a not a member of the etcd cluster. Found [server-virtualbox-49709429=https://10.0.2.1>
Mar 28 09:49:43 server-VirtualBox k3s[19468]: time="2022-03-28T09:49:43+08:00" level=info msg="runtime is not yet initialized"
Mar 28 09:49:45 server-VirtualBox k3s[19468]: time="2022-03-28T09:49:45+08:00" level=info msg="Connecting to proxy" url="wss://127.0.0.1:6443/v1-k3s/connect"
Mar 28 09:49:45 server-VirtualBox k3s[19468]: time="2022-03-28T09:49:45+08:00" level=error msg="Failed to connect to proxy" error="websocket: bad handshake"
Mar 28 09:49:45 server-VirtualBox k3s[19468]: time="2022-03-28T09:49:45+08:00" level=error msg="Remotedialer proxy error" error="websocket: bad handshake"
Mar 28 09:49:48 server-VirtualBox k3s[19468]: time="2022-03-28T09:49:48+08:00" level=info msg="Failed to test data store connection: this server is a not a member of the etcd cluster. Found [server-virtualbox-49709429=https://10.0.2.1>
Mar 28 09:49:48 server-VirtualBox k3s[19468]: time="2022-03-28T09:49:48+08:00" level=info msg="runtime is not yet initialized"

@brandond
Member

brandond commented Mar 29, 2022

Did you change the VM's hostname or IP address at some point? The log line is truncated here but the logs indicate that the node's current name/IP cannot be found in the cluster. If you included the whole log line here it should be more clear what the node's current name/IP are.

Failed to test data store connection: this server is a not a member of the etcd cluster. Found [server-virtualbox-49709429=https://10.0.2.1
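The quoted line is cut off by the journal pager's line truncation; standard `journalctl` options can recover the full text:

```shell
# --no-pager avoids the pager's line truncation, and -o cat prints just the
# message text; grep pulls out the membership error with the full member list.
journalctl -u k3s --no-pager -o cat | grep "not a member of the etcd cluster"
```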

@chinasoftgit
Author

Thanks, my VM has multiple network interfaces. I ran cluster-reset with --node-ip and the restore succeeded. In addition, I ran the etcdctl snapshot restore command, which can also restore successfully.
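For anyone hitting the same multi-NIC problem, the fix reported here pins the node IP during the reset (the IP placeholder matches the one in the unit file above):

```shell
# Run on the first server while k3s is stopped. --node-ip should match the
# address this node was originally registered with, so the restored etcd
# member list agrees with the node's advertised address.
k3s server --cluster-reset \
  --cluster-reset-restore-path=/var/lib/rancher/k3s/server/db/snapshots/etcd-snapshot-serverXXXX \
  --node-ip=xx.xx.xx.xx
```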

@dberardo-com

TBH I think the documentation is misleading. After losing about 2 hours looking for a solution, I figured out that I had to remove not just the data_dir/server/db folder but the whole data_dir/server directory ... because otherwise k3s will ignore the --server option and try to recreate a sqlite database.

is this normal ?

note: i faced this problem so i had to restore the cluster: etcd-io/etcd#13766

@brandond
Member

otherwise k3s will ignore the --server option and try to recreate a sqlite database.

No, that's not right. All you should need to delete is the db dir. If there are existing etcd files on disk, it will ignore the --server flag and try to use the existing etcd datastore, but there is nothing that will cause it to ignore the --server flag and create a new sqlite datastore.
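To make the scope of that deletion concrete, a minimal sketch using a temp directory as a simplified stand-in for the real /var/lib/rancher/k3s layout:

```shell
# Hypothetical layout; $datadir stands in for /var/lib/rancher/k3s.
datadir=$(mktemp -d)
mkdir -p "$datadir/server/db" "$datadir/server/tls"

# Only the etcd datastore directory is removed before rejoining with
# --server; the rest of server/ (certificates, tokens) stays in place.
rm -rf "$datadir/server/db"
ls "$datadir/server"   # tls remains
```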
