Have OpenShift and k8s etcd cluster details in separate dirs (kanisterio#739)

* Move OCP and k8s etcd details to different dirs; change the OpenShift blueprint to have the namespace as its subject
* Complete the OCP restore with disaster imitation; push the trimmed-down version of the restore script into kanister
* Fix typos

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

1 parent d085af6 · commit d33d9aa
Showing 7 changed files with 384 additions and 74 deletions.

File renamed without changes.
File renamed without changes.

New file (274 lines), the OpenShift ETCD backup and restore walkthrough:

This document shows how you can back up the ETCD of your OpenShift cluster. The
commands were run on an [OCP](https://www.openshift.com/products/container-platform) cluster, but they should work on any other OpenShift cluster.

## Prerequisites Details

* OpenShift (OCP) cluster
* PV support on the underlying infrastructure
* Kanister version 0.32.0 with the `profiles.cr.kanister.io` CRD, and the [`kanctl`](https://docs.kanister.io/tooling.html#install-the-tools) Kanister tool installed (a sample install is sketched below)
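
If the Kanister controller is not already running in your cluster, it is typically installed with Helm. This is only a sketch, not part of the original walkthrough: the chart repository URL and chart name come from the Kanister docs, while the release name, namespace, and the `--create-namespace` flag (Helm 3.2+) are assumptions.

```bash
# Add the Kanister Helm repository and refresh the local index
helm repo add kanister https://charts.kanister.io/
helm repo update

# Install the Kanister operator into the "kanister" namespace,
# pinning the image to the version used in this walkthrough
helm install kanister kanister/kanister-operator \
  --namespace kanister --create-namespace \
  --set image.tag=0.32.0
```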

# Integrating with Kanister

When we say integrating with Kanister, we mean creating some CRs, for example a Blueprint and an ActionSet,
that help perform the actions on the ETCD instance that we are running.
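
As a quick sanity check you can list the Kanister CRDs registered in the cluster. Only `profiles.cr.kanister.io` is named in this walkthrough; the ActionSet and Blueprint CRD names mentioned in the comment are taken from the Kanister project and are assumptions here.

```
# expect the actionsets, blueprints, and profiles CRDs in the cr.kanister.io group
» oc get crds | grep kanister.io
```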

## Create Profile resource

```bash
» kanctl create profile s3compliant --access-key <aws-access-key> \
        --secret-key <aws-secret-key> \
        --bucket <bucket-name> --region <region-name> \
        --namespace kanister
secret 's3-secret-7umv91' created
profile 's3-profile-nnvmm' created
```
This command creates a Profile which we will use later.

The command will configure a location where artifacts resulting from Kanister
data operations such as backup should go. This is stored as a `profiles.cr.kanister.io`
*CustomResource (CR)* which is then referenced in Kanister ActionSets. Every ActionSet
requires a Profile reference to complete the action. This CR (`profiles.cr.kanister.io`)
can be shared between Kanister-enabled application instances.
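
To confirm the Profile CR exists before moving on, you can list the Profiles in the Kanister namespace; the generated name and age will differ in your cluster:

```
» oc get profiles.cr.kanister.io -n kanister
NAME               AGE
s3-profile-nnvmm   5s
```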

## Create Blueprint

Before actually creating the Blueprint, we will have to create a secret in the same namespace where your ETCD pods are running. This
secret must have a name of the format `etcd-<etcd-pod-namespace>` and contain these fields:

- **endpoints** : ETCD server client listen URL, for example https://[127.0.0.1]:2379
- **labels** : labels that will be used to identify the running ETCD pods, for example `app=etcd,etcd=true`

The command below can be used to create the secret, assuming the ETCD pods are running in the `openshift-etcd` namespace:

```
» oc create secret generic etcd-openshift-etcd \
    --from-literal=endpoints=https://10.0.133.5:2379 \
    --from-literal=labels=app=etcd,etcd=true \
    --namespace openshift-etcd
secret/etcd-openshift-etcd created
```

Once the secret is created, the command below can be used to create the Blueprint:

```
» oc apply -f etcd-incluster-ocp-blueprint.yaml -n kanister
blueprint.cr.kanister.io/etcd-blueprint configured
```

## Protect the Application

Before actually taking the backup of ETCD, let's first create a dummy namespace and some resources in that namespace. After taking the ETCD backup we will delete
this namespace, so that we can check whether it has actually been brought back after restoring ETCD.

```
root@workmachine:/repo# oc create ns nginx
namespace/nginx created

root@workmachine:/repo# oc create deployment -n nginx nginx --image nginx
deployment.apps/nginx created

root@workmachine:/repo# oc get all -n nginx
NAME                        READY   STATUS             RESTARTS   AGE
pod/nginx-f89759699-k6f5n   0/1     CrashLoopBackOff   2          45s

NAME                    READY   UP-TO-DATE   AVAILABLE   AGE
deployment.apps/nginx   0/1     1            0           46s

NAME                              DESIRED   CURRENT   READY   AGE
replicaset.apps/nginx-f89759699   1         1         0       47s
```

We can now take a snapshot of the running ETCD server by creating a backup ActionSet, which executes the backup action from the Blueprint that we
created above.

**Note**

Please make sure to change the **profile-name**, **namespace-name** and **blueprint-name** placeholders in the `backup-actionset.yaml` manifest file, where `namespace-name` is the namespace in which the ETCD pods are running.
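
For reference, these are the placeholder fields inside `backup-actionset.yaml` (trimmed down from the full manifest included in this commit); the values in the comments are the ones used in this walkthrough:

```
spec:
  actions:
  - blueprint: <blueprint-name>   # e.g. etcd-blueprint
    name: backup
    object:
      name: <namespace-name>      # namespace running the ETCD pods, e.g. openshift-etcd
      resource: namespaces
    profile:
      name: <profile-name>        # e.g. s3-profile-2lhk8
      namespace: kanister
```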

```
# find the profile name
» oc get profiles.cr.kanister.io -n kanister
NAME               AGE
s3-profile-2lhk8   52s

# find the Blueprint name
» oc get blueprint -n kanister
NAME             AGE
etcd-blueprint   85m

# create actionset
» oc create -f backup-actionset.yaml --namespace kanister
actionset.cr.kanister.io/backup-4f6jn created

# you can check the status of the actionset to make sure it has been completed
» oc describe actionset -n kanister backup-4f6jn
Name:         backup-4f6jn
Namespace:    kanister
Labels:       <none>
...
...
Events:
  Type    Reason           Age  From                 Message
  ----    ------           ---- ----                 -------
  Normal  Started Action   3m   Kanister Controller  Executing action backup
  Normal  Started Phase    3m   Kanister Controller  Executing phase takeSnapshot
  Normal  Ended Phase      3m   Kanister Controller  Completed phase takeSnapshot
  Normal  Started Phase    3m   Kanister Controller  Executing phase uploadSnapshot
  Normal  Ended Phase      2m   Kanister Controller  Completed phase uploadSnapshot
  Normal  Started Phase    2m   Kanister Controller  Executing phase removeSnapshot
  Normal  Ended Phase      2m   Kanister Controller  Completed phase removeSnapshot
  Normal  Update Complete  2m   Kanister Controller  Updated ActionSet 'backup-4f6jn' Status->complete
```

## Imitate Disaster

After the backup has been taken successfully, let's go ahead and delete the dummy namespace that we created, to imitate a disaster:

```
root@workmachine:/repo# oc delete ns nginx
namespace "nginx" deleted

root@workmachine:/repo# oc get all -n nginx
No resources found.
```

## Restore ETCD cluster

To restore the ETCD cluster we can follow the [documentation](https://docs.openshift.com/container-platform/4.5/backup_and_restore/disaster_recovery/scenario-2-restoring-cluster-state.html) provided by the OpenShift team.
However, we will have to make some modifications to the restore script (`cluster-restore.sh`), because the default
restore script expects the static pod manifests as well, and in our case we did not back up the static pod manifests.

You can follow the steps below along with the documentation mentioned above; most of the steps here are either taken directly from that documentation or are modified versions of it. Among all the running leader nodes, choose one node to be the restore node, and make sure you have SSH connectivity to all of the leader nodes, including the one that you have chosen to be the restore node.
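
A quick way to confirm connectivity is to run a trivial command over SSH against each leader node, assuming the default `core` user on RHCOS hosts; the node addresses and key path below are placeholders:

```
# repeat for every leader node, including the chosen restore node
ssh -i <path-to-ssh-key> core@<leader-node-address> hostname
```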

You will need a command-line utility to download the ETCD snapshot that we took in the earlier step; which one depends on the object
storage that you used. For example, if you used AWS S3 as the object storage, you will need the `aws` CLI to download the ETCD snapshot. Once you have the CLI installed
on the restore host, the steps below can be followed to restore ETCD:
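
As an illustration of the download step, assuming an S3 bucket (the bucket name and object key are placeholders; the actual key depends on where the Blueprint uploaded the snapshot):

```
# create the target directory and download the ETCD snapshot onto the restore host
mkdir -p /var/home/core/etcd-backup
aws s3 cp s3://<bucket-name>/<path-to-etcd-snapshot> /var/home/core/etcd-backup/
```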

- Download the ETCD snapshot onto the restore host using the `aws` CLI to a specific path, let's say `/var/home/core/etcd-backup`, as illustrated above
- Stop the static pods on all other leader hosts (not the recovery host) by moving their manifests out of the static pod directory, i.e. `/etc/kubernetes/manifests`

```
# move etcd pod manifest
sudo mv /etc/kubernetes/manifests/etcd-pod.yaml /tmp

# make sure etcd pod has been stopped
sudo crictl ps | grep etcd

# move api server pod manifest
sudo mv /etc/kubernetes/manifests/kube-apiserver-pod.yaml /tmp
```

- Move the etcd data dir to a different location
```
sudo mv /var/lib/etcd/ /tmp
```

Repeat these steps on all other leader hosts that are not the restore host.

- Run the `cluster-ocp-restore.sh` script with the location where you downloaded the etcd snapshot, which in our case is `/var/home/core/etcd-backup`

```
sudo ./cluster-ocp-restore.sh /var/home/core/etcd-backup
```

- Restart the `kubelet` service on all the leader nodes

```
sudo systemctl restart kubelet.service
```

- Verify that a single ETCD node has been started; run the command below from the recovery host to check that the ETCD container is up

```
sudo crictl ps | grep etcd

# you can also verify that the ETCD pod is running now
root@workmachine:/repo# oc get pods -n openshift-etcd
NAME                                                           READY   STATUS      RESTARTS   AGE
etcd-ip-10-0-149-197.us-west-1.compute.internal                1/1     Running     0          3m57s
installer-2-ip-10-0-149-197.us-west-1.compute.internal         0/1     Completed   0          7h54m
installer-2-ip-10-0-166-99.us-west-1.compute.internal          0/1     Completed   0          7h53m
installer-2-ip-10-0-212-253.us-west-1.compute.internal         0/1     Completed   0          7h52m
revision-pruner-2-ip-10-0-149-197.us-west-1.compute.internal   0/1     Completed   0          7h51m
revision-pruner-2-ip-10-0-166-99.us-west-1.compute.internal    0/1     Completed   0          7h51m
revision-pruner-2-ip-10-0-212-253.us-west-1.compute.internal   0/1     Completed   0          7h51m
```

- Force an ETCD redeployment; you can run the commands below from a terminal where you have cluster access
```
oc patch etcd cluster -p='{"spec": {"forceRedeploymentReason": "recovery-'"$( date --rfc-3339=ns )"'"}}' --type=merge
# Verify all nodes are updated to the latest revision
oc get etcd -o=jsonpath='{range .items[0].status.conditions[?(@.type=="NodeInstallerProgressing")]}{.reason}{"\n"}{.message}{"\n"}'
```

You will get a message like

```
3 nodes are at revision 2; 0 nodes have achieved new revision 3
```

Wait for some time to make sure the component has been updated to the latest revision; the message will then look somewhat like this:

```
oc get etcd -o=jsonpath='{range .items[0].status.conditions[?(@.type=="NodeInstallerProgressing")]}{.reason}{"\n"}{.message}{"\n"}'
AllNodesAtLatestRevision
3 nodes are at revision 3
```

which indicates that all three nodes have been updated to the latest revision.

- Force a rollout for the control plane components

```
# API server
oc patch kubeapiserver cluster -p='{"spec": {"forceRedeploymentReason": "recovery-'"$( date --rfc-3339=ns )"'"}}' --type=merge
# wait for the revision to update
oc get kubeapiserver -o=jsonpath='{range .items[0].status.conditions[?(@.type=="NodeInstallerProgressing")]}{.reason}{"\n"}{.message}{"\n"}'
# again you will have to wait until you get a message like
# 3 nodes are at revision 6

# kube-controller-manager
oc patch kubecontrollermanager cluster -p='{"spec": {"forceRedeploymentReason": "recovery-'"$( date --rfc-3339=ns )"'"}}' --type=merge
# wait for the revision to update
oc get kubecontrollermanager -o=jsonpath='{range .items[0].status.conditions[?(@.type=="NodeInstallerProgressing")]}{.reason}{"\n"}{.message}{"\n"}'
# 3 nodes are at revision 9

# kube-scheduler
oc patch kubescheduler cluster -p='{"spec": {"forceRedeploymentReason": "recovery-'"$( date --rfc-3339=ns )"'"}}' --type=merge
# wait for the revision to update
oc get kubescheduler -o=jsonpath='{range .items[0].status.conditions[?(@.type=="NodeInstallerProgressing")]}{.reason}{"\n"}{.message}{"\n"}'
# 3 nodes are at revision 7; 0 nodes have achieved new revision 8
# 3 nodes are at revision 8
```

- Verify that all etcd pods are running fine

```
root@workmachine:/repo# oc get pods -n openshift-etcd | grep etcd
etcd-ip-10-0-149-197.us-west-1.compute.internal   4/4   Running   0   19m
etcd-ip-10-0-166-99.us-west-1.compute.internal    4/4   Running   0   20m
etcd-ip-10-0-212-253.us-west-1.compute.internal   4/4   Running   0   20m
```

- Now that we can see all the ETCD pods have been restored, we can check whether the dummy namespace that we created and then deleted has been restored as well

```
root@workmachine:/repo# oc get all -n nginx
NAME                        READY   STATUS             RESTARTS   AGE
pod/nginx-f89759699-k6f5n   0/1     CrashLoopBackOff   9          46m

NAME                    READY   UP-TO-DATE   AVAILABLE   AGE
deployment.apps/nginx   0/1     1            0           46m

NAME                              DESIRED   CURRENT   READY   AGE
replicaset.apps/nginx-f89759699   1         1         0       46m
```

As you can see, we have successfully restored the namespace that we deleted.

New file (28 lines), the backup ActionSet manifest `backup-actionset.yaml`:

apiVersion: cr.kanister.io/v1alpha1
kind: ActionSet
metadata:
  creationTimestamp: null
  generateName: backup-
  namespace: kanister
spec:
  actions:
  - blueprint: <blueprint-name>
    configMaps: {}
    name: backup
    object:
      apiVersion: v1
      group: ""
      kind: ""
      name: <namespace-name>
      namespace: ""
      resource: namespaces
    options: {}
    preferredVersion: ""
    profile:
      apiVersion: ""
      group: ""
      kind: ""
      name: <profile-name>
      namespace: kanister
      resource: ""
    secrets: {}