OSDOCS-429: Adding disaster recovery docs #14859

bergerhoffer · 2019-05-15T12:52:41Z

Preview: http://file.rdu.redhat.com/~ahoffer/2019/disaster-recovery/disaster_recovery/scenario-1-infra-recovery.html

bergerhoffer · 2019-05-17T02:43:10Z

@tnozicka Can you please review: http://file.rdu.redhat.com/~ahoffer/2019/disaster-recovery/disaster_recovery/scenario-3-expired-certs.html

@hexfusion Can you please review: http://file.rdu.redhat.com/~ahoffer/2019/disaster-recovery/disaster_recovery/scenario-1-infra-recovery.html

bergerhoffer · 2019-05-20T01:24:43Z

@tnozicka @hexfusion @patrickdillon Docs have been updated with the latest changes for all 3 scenarios, if you can please review. Thanks!

vikram-redhat · 2019-05-20T10:06:47Z

@bergerhoffer - @geliu2016 pointed me to this bug that affects these docs: https://bugzilla.redhat.com/show_bug.cgi?id=1711879

hexfusion

Looks good. Added a few notes; will keep digging.

modules/dr-recover-lost-control-plane-hosts.adoc

modules/dr-restoring-cluster-state.adoc

hexfusion · 2019-05-20T16:01:45Z

> @bergerhoffer - @geliu2016 pointed me to this bug that affects these docs: https://bugzilla.redhat.com/show_bug.cgi?id=1711879

@vikram-redhat we have a pending update that will resolve this.

bergerhoffer · 2019-05-20T17:38:31Z

@hexfusion @patrickdillon Thanks for reviewing, doc is updated w/ the feedback (and some from google docs) if you can take another look.

modules/dr-recover-lost-control-plane-hosts.adoc

bergerhoffer · 2019-05-21T00:43:03Z

@hexfusion PR is updated with the latest, can you look at the updates?

hexfusion

Overall this looks great, thanks for the hard work and working with us through all of the iterations.

/lgtm

bergerhoffer · 2019-05-22T01:12:07Z

Thanks for the feedback @tnozicka - updates are in the PR if you can double check them.

bergerhoffer · 2019-05-22T19:51:50Z

@openshift/team-documentation Can I get a peer review started of this please?

tnozicka · 2019-05-24T15:10:46Z

modules/dr-recover-expired-control-plane-certs.adoc

@rphillips is this a long lived token?

I'm not sure

@deads2k do you know if this token is long lived?

modules/dr-recover-expired-control-plane-certs.adoc

bergerhoffer · 2019-05-24T16:18:03Z

Hi @geliu2016, @xingxingxia - could someone QE review these disaster recovery docs? Thanks!

modules/dr-restoring-cluster-state.adoc

hexfusion · 2019-05-24T17:29:26Z

Small nit above otherwise LGTM for etcd sections

geliu2016 · 2019-05-27T11:14:13Z

@bergerhoffer, it seems that this bug has not been fixed in the doc: https://bugzilla.redhat.com/show_bug.cgi?id=1713219#c5. We're verifying the doc on bare metal and vSphere clusters, so we may close it after testing is done on all supported platforms.

bergerhoffer · 2019-05-27T21:54:40Z

Great, thanks @geliu2016. I've reached out to @hexfusion and @tnozicka to help me get the right steps added to the doc for approving the CSRs. I will hopefully update this PR tomorrow to fix that BZ.

Let me know if you find any other issues with doc during your testing. Thanks!

geliu2016 · 2019-05-29T03:11:48Z

This doc works well for the AWS platform now that this bug is fixed: https://bugzilla.redhat.com/show_bug.cgi?id=1713219#c5, so we can close it for now. If a platform-specific issue comes up on another platform, Dev can adjust the script or the doc, so nothing blocks the doc from moving forward. Thanks.

modules/dr-recover-expired-control-plane-certs.adoc

modules/dr-recover-lost-control-plane-hosts.adoc

bergerhoffer · 2019-05-30T18:06:24Z

Got the go ahead to merge this. If we need to add additional steps for anything else, we can do so in a separate PR.

kalexand-rh

@bergerhoffer, sorry the review's so late, but here are some suggestions.

kalexand-rh · 2019-06-12T17:55:13Z

disaster_recovery/scenario-1-infra-recovery.adoc

+
+toc::[]
+
+This document describes the process to recover from a complete loss of a master host. This includes


s/This document describes the process to/You can
s/host. This includes/host, including

kalexand-rh · 2019-06-12T17:56:01Z

disaster_recovery/scenario-1-infra-recovery.adoc

+. Correct DNS and load balancer entries.
+. Grow etcd to full membership.
+
+If the majority of master hosts have been lost, you will need a xref:../disaster_recovery/backing-up-etcd.html#backing-up-etcd-data_backup-etcd[backed up etcd snapshot] to restore etcd quorum on the remaining master host.


kalexand-rh · 2019-06-12T17:57:20Z

modules/backup-etcd.adoc

+[id="backing-up-etcd-data_{context}"]
+= Backing up etcd data
+
+Follow these steps to back up etcd data by creating a snapshot. This snapshot can be saved and used at a later time if you need to restore etcd.


~~Follow these steps to~~

kalexand-rh · 2019-06-12T18:02:07Z

modules/backup-etcd.adoc

+
+. Access a master host as the root user.
+
+. Run the `etcd-snapshot-backup.sh` script and pass in the location to save the etcd snapshot to.


So, is the script available by default on all master hosts in that location?
s/to./to:
s/pass in/specify
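The availability question could be checked on a host before running the backup. The following is a minimal sketch, not part of the PR: the `script_available` helper name is hypothetical, and the path in the usage comment is the one shown in the module's example command.

```shell
# Hypothetical pre-flight check (not from the PR): succeeds only if the given
# path exists and is executable by the current user.
script_available() {
  [ -x "$1" ]
}

# Example usage, with the path taken from the module's example command:
#   script_available /usr/local/bin/etcd-snapshot-backup.sh \
#     || echo "backup script not found on this host" >&2
```

This only confirms that the file is present and executable on the host where it runs; it says nothing about whether every master host ships the script.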

kalexand-rh · 2019-06-12T18:03:45Z

modules/backup-etcd.adoc

+$ sudo /usr/local/bin/etcd-snapshot-backup.sh ./assets/backup/snapshot.db
+----
+
+In this example, the snapshot is saved to `./assets/backup/snapshot.db` on the master host.


Is this where we want to keep the snapshot?

kalexand-rh · 2019-06-12T19:29:25Z

modules/dr-recover-expired-control-plane-certs.adoc

+# ./restore_kubeconfig.sh > kubeconfig
+----
+
+.. Copy the `kubeconfig` file to all master hosts and move it to `/etc/kubernetes/kubeconfig`.


s/move it to /etc/kubernetes/kubeconfig./move it to the /etc/kubernetes/kubeconfig directory.

kalexand-rh · 2019-06-12T19:30:20Z

modules/dr-recover-expired-control-plane-certs.adoc

+.. Get the list of current CSRs.
+
+----
+# oc get csr


Do you need to run these commands as root? If not, s/#/$

kalexand-rh · 2019-06-12T19:30:31Z

modules/dr-recover-expired-control-plane-certs.adoc

+# oc get csr
+----
+
+.. Review the details of a CSR to verify it is valid.


s/verify it/verify that it
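For the CSR steps, one way to pick out the requests that still need approval is to filter the `oc get csr` listing on its last column. The sketch below is not part of the PR. The `pending_csrs` helper name is hypothetical, and it assumes the default table output has a header row, the CSR name in the first column, and a CONDITION column last whose value is `Pending` for unapproved requests.

```shell
# Hypothetical helper (not from the PR): reads `oc get csr` table output on
# stdin and prints the names of CSRs whose last column is "Pending".
# Assumes the default table layout: a header row, NAME first, CONDITION last.
pending_csrs() {
  awk 'NR > 1 && $NF == "Pending" { print $1 }'
}

# Example usage on a master host (requires a working kubeconfig):
#   oc get csr | pending_csrs
# After reviewing each request with `oc describe csr <name>`, approve it with:
#   oc adm certificate approve <name>
```

The helper deliberately stops at listing names so that each CSR can still be reviewed before it is approved.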

kalexand-rh · 2019-06-12T19:30:54Z

modules/dr-recover-expired-control-plane-certs.adoc

+
+Be sure to approve all pending `node-bootstrapper` CSRs.
+
+. Destroy the recovery API server because it is no longer needed.


~~because it is no longer needed~~

kalexand-rh · 2019-06-12T19:31:31Z

modules/dr-recover-expired-control-plane-certs.adoc

+# podman run -it --network=host -v /etc/kubernetes/:/etc/kubernetes/:Z --entrypoint=/usr/bin/cluster-kube-apiserver-operator "${KAO_IMAGE}" recovery-apiserver destroy
+----
+
+Wait for the control plane to restart and pick up the new certificates. This might take up to 10 minutes.


How do you know if it's restarted? This might need to be a separate step.

s/This/This process
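If the wait becomes its own step, it could be expressed as a bounded polling loop. This is a minimal sketch under stated assumptions, not something from the PR: the `wait_until` helper is hypothetical, and the thread does not say which check actually signals that the control plane has restarted, so the command in the usage comment is only a placeholder.

```shell
# Hypothetical helper (not from the PR).
# Usage: wait_until TIMEOUT_SECONDS INTERVAL_SECONDS COMMAND [ARGS...]
# Re-runs COMMAND every INTERVAL_SECONDS until it succeeds (returns 0) or
# until at least TIMEOUT_SECONDS have been spent sleeping (returns 1).
wait_until() {
  timeout=$1
  interval=$2
  shift 2
  elapsed=0
  while ! "$@" >/dev/null 2>&1; do
    if [ "$elapsed" -ge "$timeout" ]; then
      return 1
    fi
    sleep "$interval"
    elapsed=$((elapsed + interval))
  done
  return 0
}

# Example usage with the 10-minute bound mentioned in the module text.
# The readiness check itself is an assumption; substitute the real one:
#   wait_until 600 15 oc get nodes || echo "control plane not ready" >&2
```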

bergerhoffer added the branch/enterprise-4.1 label May 15, 2019

bergerhoffer added this to the Future Release milestone May 15, 2019

openshift-ci-robot added the do-not-merge/work-in-progress and size/L labels May 15, 2019

bergerhoffer force-pushed the disaster-recovery branch 2 times, most recently from 9b15c24 to 9810272 Compare May 17, 2019 02:34

bergerhoffer force-pushed the disaster-recovery branch from 9810272 to 61092a9 Compare May 20, 2019 01:14

openshift-ci-robot added the size/XL label and removed the size/L label May 20, 2019

bergerhoffer force-pushed the disaster-recovery branch from 61092a9 to d99ab7d Compare May 20, 2019 01:30

hexfusion reviewed May 20, 2019

modules/dr-recover-lost-control-plane-hosts.adoc

modules/dr-recover-lost-control-plane-hosts.adoc

hexfusion reviewed May 20, 2019

modules/dr-recover-lost-control-plane-hosts.adoc

patrickdillon reviewed May 20, 2019

modules/dr-restoring-cluster-state.adoc

bergerhoffer force-pushed the disaster-recovery branch 2 times, most recently from fb21687 to 50a4c33 Compare May 20, 2019 17:35

hexfusion reviewed May 20, 2019

modules/dr-recover-lost-control-plane-hosts.adoc

hexfusion reviewed May 20, 2019

modules/dr-recover-lost-control-plane-hosts.adoc

hexfusion reviewed May 20, 2019

modules/dr-recover-lost-control-plane-hosts.adoc

hexfusion reviewed May 20, 2019

modules/dr-recover-lost-control-plane-hosts.adoc

bergerhoffer commented May 20, 2019

modules/dr-recover-lost-control-plane-hosts.adoc

bergerhoffer force-pushed the disaster-recovery branch from 50a4c33 to 4eed2b4 Compare May 21, 2019 00:41

hexfusion approved these changes May 21, 2019

openshift-ci-robot assigned hexfusion May 21, 2019

openshift-ci-robot added the lgtm label May 21, 2019

bergerhoffer added the peer-review-needed label May 22, 2019

bergerhoffer force-pushed the disaster-recovery branch 2 times, most recently from 40c4891 to cc5c982 Compare May 24, 2019 14:46

tnozicka reviewed May 24, 2019

bergerhoffer force-pushed the disaster-recovery branch from cc5c982 to 03e4828 Compare May 24, 2019 15:31

bergerhoffer changed the title ~~[WIP] OSDOCS-429: Adding disaster recovery docs~~ OSDOCS-429: Adding disaster recovery docs May 24, 2019

openshift-ci-robot removed the do-not-merge/work-in-progress label May 24, 2019

hexfusion reviewed May 24, 2019

modules/dr-restoring-cluster-state.adoc

bergerhoffer force-pushed the disaster-recovery branch from 03e4828 to abdbd2d Compare May 24, 2019 18:08

bergerhoffer force-pushed the disaster-recovery branch from abdbd2d to e358f02 Compare May 29, 2019 15:43

tnozicka reviewed May 29, 2019

modules/dr-recover-expired-control-plane-certs.adoc

bergerhoffer force-pushed the disaster-recovery branch 2 times, most recently from 4f2c243 to da19c67 Compare May 30, 2019 01:15

hexfusion reviewed May 30, 2019

modules/dr-recover-lost-control-plane-hosts.adoc

bergerhoffer force-pushed the disaster-recovery branch from da19c67 to c90aff9 Compare May 30, 2019 13:25

OSDOCS-429: Adding disaster recovery docs

095417d

bergerhoffer force-pushed the disaster-recovery branch from c90aff9 to 095417d Compare May 30, 2019 16:12

bergerhoffer merged commit ab0e5bd into openshift:enterprise-4.1 May 30, 2019

vikram-redhat modified the milestones: Future Release, OCP 4.1 GA Jun 3, 2019

kalexand-rh reviewed Jun 12, 2019

bergerhoffer deleted the disaster-recovery branch January 22, 2020 18:19

