-
Notifications
You must be signed in to change notification settings - Fork 1.8k
Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
Recovering from etcd minority failure
- Loading branch information
1 parent
eaad2f9
commit 274682e
Showing
2 changed files
with
278 additions
and
0 deletions.
There are no files selected for viewing
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,277 @@ | ||
// Module included in the following assembly: | ||
// | ||
// * hcp-backup-restore-dr.adoc | ||
|
||
:_content-type: PROCEDURE | ||
[id="hosted-cluster-etcd-quorum-loss-recovery_{context}"] | ||
= Recovering an etcd cluster from a quorum loss | ||
|
||
If multiple members of the etcd cluster have lost data or return a `CrashLoopBackOff` status, it can cause an etcd quorum loss. You must restore your etcd cluster from a snapshot. | ||
|
||
[IMPORTANT] | ||
==== | ||
This procedure requires API downtime. | ||
==== | ||
|
||
.Prerequisites | ||
* The `oc` and `jq` binaries have been installed. | ||
.Procedure | ||
|
||
. First, set up your environment variables and scale down the API servers: | ||
|
||
.. Set up environment variables for your hosted cluster by entering the following commands, replacing values as necessary: | ||
+ | ||
[source,terminal] | ||
---- | ||
$ CLUSTER_NAME=my-cluster | ||
---- | ||
+ | ||
[source,terminal] | ||
---- | ||
$ HOSTED_CLUSTER_NAMESPACE=clusters | ||
---- | ||
+ | ||
[source,terminal] | ||
---- | ||
$ CONTROL_PLANE_NAMESPACE="${HOSTED_CLUSTER_NAMESPACE}-${CLUSTER_NAME}" | ||
---- | ||
|
||
.. Pause reconciliation of the hosted cluster by entering the following command, replacing values as necessary: | ||
+ | ||
[source,terminal] | ||
---- | ||
$ oc patch -n ${HOSTED_CLUSTER_NAMESPACE} hostedclusters/${CLUSTER_NAME} -p '{"spec":{"pausedUntil":"true"}}' --type=merge | ||
---- | ||
|
||
.. Scale down the API servers by entering the following commands: | ||
+ | ||
... Scale down the `kube-apiserver`: | ||
+ | ||
[source,terminal] | ||
---- | ||
$ oc scale -n ${CONTROL_PLANE_NAMESPACE} deployment/kube-apiserver --replicas=0 | ||
---- | ||
|
||
... Scale down the `openshift-apiserver`: | ||
+ | ||
[source,terminal] | ||
---- | ||
$ oc scale -n ${CONTROL_PLANE_NAMESPACE} deployment/openshift-apiserver --replicas=0 | ||
---- | ||
|
||
... Scale down the `openshift-oauth-apiserver`: | ||
+ | ||
[source,terminal] | ||
---- | ||
$ oc scale -n ${CONTROL_PLANE_NAMESPACE} deployment/openshift-oauth-apiserver --replicas=0 | ||
---- | ||
|
||
. Next, take a snapshot of etcd by using one of the following methods: | ||
|
||
.. Use a previously backed-up snapshot of etcd. | ||
.. If you have an available etcd pod, take a snapshot from the active etcd pod by completing the following steps: | ||
|
||
... List etcd pods by entering the following command: | ||
+ | ||
[source,terminal] | ||
---- | ||
$ oc get -n ${CONTROL_PLANE_NAMESPACE} pods -l app=etcd | ||
---- | ||
|
||
... Take a snapshot of the pod database and save it locally to your machine by entering the following commands: | ||
+ | ||
[source,terminal] | ||
---- | ||
$ ETCD_POD=etcd-0 | ||
---- | ||
+ | ||
[source,terminal] | ||
---- | ||
$ oc exec -n ${CONTROL_PLANE_NAMESPACE} -c etcd -t ${ETCD_POD} -- env ETCDCTL_API=3 /usr/bin/etcdctl \ | ||
--cacert /etc/etcd/tls/etcd-ca/ca.crt \ | ||
--cert /etc/etcd/tls/client/etcd-client.crt \ | ||
--key /etc/etcd/tls/client/etcd-client.key \ | ||
--endpoints=https://localhost:2379 \ | ||
snapshot save /var/lib/snapshot.db | ||
---- | ||
|
||
... Verify that the snapshot is successful by entering the following command: | ||
+ | ||
[source,terminal] | ||
---- | ||
$ oc exec -n ${CONTROL_PLANE_NAMESPACE} -c etcd -t ${ETCD_POD} -- env ETCDCTL_API=3 /usr/bin/etcdctl -w table snapshot status /var/lib/snapshot.db | ||
---- | ||
|
||
.. Make a local copy of the snapshot by entering the following command: | ||
+ | ||
[source,terminal] | ||
---- | ||
$ oc cp -c etcd ${CONTROL_PLANE_NAMESPACE}/${ETCD_POD}:/var/lib/snapshot.db /tmp/etcd.snapshot.db | ||
---- | ||
|
||
... Make a copy of the snapshot database from etcd persistent storage: | ||
+ | ||
.... List etcd pods by entering the following command: | ||
+ | ||
[source,terminal] | ||
---- | ||
$ oc get -n ${CONTROL_PLANE_NAMESPACE} pods -l app=etcd | ||
---- | ||
|
||
.... Find a pod that is running and set its name as the value of `ETCD_POD: ETCD_POD=etcd-0`, and then copy its snapshot database by entering the following command: | ||
+ | ||
[source,terminal] | ||
---- | ||
$ oc cp -c etcd ${CONTROL_PLANE_NAMESPACE}/${ETCD_POD}:/var/lib/data/member/snap/db /tmp/etcd.snapshot.db | ||
---- | ||
|
||
. Next, scale down the etcd statefulset by entering the following command: | ||
+ | ||
[source,terminal] | ||
---- | ||
$ oc scale -n ${CONTROL_PLANE_NAMESPACE} statefulset/etcd --replicas=0 | ||
---- | ||
|
||
.. Delete volumes for second and third members by entering the following command: | ||
+ | ||
[source,terminal] | ||
---- | ||
$ oc delete -n ${CONTROL_PLANE_NAMESPACE} pvc/data-etcd-1 pvc/data-etcd-2 | ||
---- | ||
|
||
.. Create a pod to access the first etcd member's data: | ||
|
||
... Get the etcd image by entering the following command: | ||
+ | ||
[source,terminal] | ||
---- | ||
$ ETCD_IMAGE=$(oc get -n ${CONTROL_PLANE_NAMESPACE} statefulset/etcd -o jsonpath='{ .spec.template.spec.containers[0].image }') | ||
---- | ||
+ | ||
... Create a pod that allows access to etcd data: | ||
+ | ||
[source,yaml] | ||
---- | ||
$ cat << EOF | oc apply -n ${CONTROL_PLANE_NAMESPACE} -f - | ||
apiVersion: apps/v1 | ||
kind: Deployment | ||
metadata: | ||
name: etcd-data | ||
spec: | ||
replicas: 1 | ||
selector: | ||
matchLabels: | ||
app: etcd-data | ||
template: | ||
metadata: | ||
labels: | ||
app: etcd-data | ||
spec: | ||
containers: | ||
- name: access | ||
image: $ETCD_IMAGE | ||
volumeMounts: | ||
- name: data | ||
mountPath: /var/lib | ||
command: | ||
- /usr/bin/bash | ||
args: | ||
- -c | ||
- |- | ||
while true; do | ||
sleep 1000 | ||
done | ||
volumes: | ||
- name: data | ||
persistentVolumeClaim: | ||
claimName: data-etcd-0 | ||
EOF | ||
---- | ||
|
||
... Check the status of the `etcd-data` pod and wait for it to be running by entering the following command: | ||
+ | ||
[source,terminal] | ||
---- | ||
$ oc get -n ${CONTROL_PLANE_NAMESPACE} pods -l app=etcd-data | ||
---- | ||
|
||
... Get the name of the `etcd-data` pod by entering the following command: | ||
+ | ||
[source,terminal] | ||
---- | ||
$ DATA_POD=$(oc get -n ${CONTROL_PLANE_NAMESPACE} pods --no-headers -l app=etcd-data -o name | cut -d/ -f2) | ||
---- | ||
|
||
.. Copy an etcd snapshot into the pod by entering the following command: | ||
+ | ||
[source,terminal] | ||
---- | ||
$ oc cp /tmp/etcd.snapshot.db ${CONTROL_PLANE_NAMESPACE}/${DATA_POD}:/var/lib/restored.snap.db | ||
---- | ||
|
||
.. Remove old data from the `etcd-data` pod by entering the following commands: | ||
+ | ||
[source,terminal] | ||
---- | ||
$ oc exec -n ${CONTROL_PLANE_NAMESPACE} ${DATA_POD} -- rm -rf /var/lib/data | ||
---- | ||
+ | ||
[source,terminal] | ||
---- | ||
$ oc exec -n ${CONTROL_PLANE_NAMESPACE} ${DATA_POD} -- mkdir -p /var/lib/data | ||
---- | ||
|
||
.. Restore the etcd snapshot by entering the following command: | ||
+ | ||
[source,terminal] | ||
---- | ||
$ oc exec -n ${CONTROL_PLANE_NAMESPACE} ${DATA_POD} -- etcdutl snapshot restore /var/lib/restored.snap.db \ | ||
--data-dir=/var/lib/data --skip-hash-check \ | ||
--name etcd-0 \ | ||
--initial-cluster-token=etcd-cluster \ | ||
--initial-cluster etcd-0=https://etcd-0.etcd-discovery.${CONTROL_PLANE_NAMESPACE}.svc:2380,etcd-1=https://etcd-1.etcd-discovery.${CONTROL_PLANE_NAMESPACE}.svc:2380,etcd-2=https://etcd-2.etcd-discovery.${CONTROL_PLANE_NAMESPACE}.svc:2380 \ | ||
--initial-advertise-peer-urls https://etcd-0.etcd-discovery.${CONTROL_PLANE_NAMESPACE}.svc:2380 | ||
---- | ||
|
||
.. Remove the temporary etcd snapshot from the pod by entering the following command: | ||
+ | ||
[source,terminal] | ||
---- | ||
$ oc exec -n ${CONTROL_PLANE_NAMESPACE} ${DATA_POD} -- rm /var/lib/restored.snap.db | ||
---- | ||
|
||
.. Delete data access deployment by entering the following command: | ||
+ | ||
[source,terminal] | ||
---- | ||
$ oc delete -n ${CONTROL_PLANE_NAMESPACE} deployment/etcd-data | ||
---- | ||
|
||
.. Scale up the etcd cluster by entering the following command: | ||
+ | ||
[source,terminal] | ||
---- | ||
$ oc scale -n ${CONTROL_PLANE_NAMESPACE} statefulset/etcd --replicas=3 | ||
---- | ||
|
||
.. Wait for the etcd member pods to return and report as available by entering the following command: | ||
+ | ||
[source,terminal] | ||
---- | ||
$ oc get -n ${CONTROL_PLANE_NAMESPACE} pods -l app=etcd -w | ||
---- | ||
|
||
.. Scale up all etcd-writer deployments by entering the following command: | ||
+ | ||
[source,terminal] | ||
---- | ||
$ oc scale deployment -n ${CONTROL_PLANE_NAMESPACE} --replicas=3 kube-apiserver openshift-apiserver openshift-oauth-apiserver | ||
---- | ||
|
||
. Restore reconciliation of the hosted cluster by entering the following command: | ||
+ | ||
[source,terminal] | ||
---- | ||
$ oc patch -n ${CLUSTER_NAMESPACE} hostedclusters/${CLUSTER_NAME} -p '{"spec":{"pausedUntil":""}}' --type=merge | ||
---- |