@bergerhoffer
Contributor

@bergerhoffer added this to the Future Release milestone May 15, 2019
@openshift-ci-robot added the do-not-merge/work-in-progress and size/L labels May 15, 2019
@bergerhoffer force-pushed the disaster-recovery branch 2 times, most recently from 9b15c24 to 9810272 May 17, 2019 02:34
@bergerhoffer
Contributor Author

@openshift-ci-robot added the size/XL label and removed the size/L label May 20, 2019
@bergerhoffer
Contributor Author

@tnozicka @hexfusion @patrickdillon Docs have been updated with the latest changes for all 3 scenarios, if you can please review. Thanks!

@vikram-redhat
Contributor

@bergerhoffer - @geliu2016 pointed me to this bug that affects these docs: https://bugzilla.redhat.com/show_bug.cgi?id=1711879

Contributor

@hexfusion left a comment

Looks good, added a few notes, will keep digging.

@hexfusion
Contributor

> @bergerhoffer - @geliu2016 pointed me to this bug that affects these docs: https://bugzilla.redhat.com/show_bug.cgi?id=1711879

@vikram-redhat we have a pending update that will resolve this.

@bergerhoffer force-pushed the disaster-recovery branch 2 times, most recently from fb21687 to 50a4c33 May 20, 2019 17:35
@bergerhoffer
Contributor Author

@hexfusion @patrickdillon Thanks for reviewing, doc is updated w/ the feedback (and some from google docs) if you can take another look.

@bergerhoffer
Contributor Author

@hexfusion PR is updated with the latest, can you look at the updates?

Contributor

@hexfusion left a comment

Overall this looks great, thanks for the hard work and working with us through all of the iterations.

/lgtm

@openshift-ci-robot added the lgtm label May 21, 2019
@bergerhoffer
Contributor Author

Thanks for the feedback @tnozicka - updates are in the PR if you can double check them.

@bergerhoffer
Contributor Author

@openshift/team-documentation Can I get a peer review started of this please?

@bergerhoffer added the peer-review-needed label May 22, 2019
@bergerhoffer force-pushed the disaster-recovery branch 2 times, most recently from 40c4891 to cc5c982 May 24, 2019 14:46
Contributor

@rphillips is this a long lived token?

Contributor

I'm not sure

@deads2k do you know if this token is long lived?

@bergerhoffer changed the title from "[WIP] OSDOCS-429: Adding disaster recovery docs" to "OSDOCS-429: Adding disaster recovery docs" May 24, 2019
@openshift-ci-robot removed the do-not-merge/work-in-progress label May 24, 2019
@bergerhoffer
Contributor Author

Hi @geliu2016, @xingxingxia - could someone QE review these disaster recovery docs? Thanks!

@hexfusion
Contributor

Small nit above otherwise LGTM for etcd sections

@geliu2016

@bergerhoffer, it seems that this bug has not been fixed in the doc: https://bugzilla.redhat.com/show_bug.cgi?id=1713219#c5. We're verifying the doc on bare-metal and vSphere clusters, so we may close it after testing is done on all supported platforms.

@bergerhoffer
Contributor Author

Great thanks @geliu2016. I've reached out to @hexfusion and @tnozicka to help me get the right steps added to the doc for approving the CSRs. I will hopefully update this PR tomorrow to fix that BZ.

Let me know if you find any other issues with doc during your testing. Thanks!

@geliu2016

This doc works well for the AWS platform now that the bug is fixed: https://bugzilla.redhat.com/show_bug.cgi?id=1713219#c5, so we may close it for now. If there is any special issue on another platform, Dev may have options to adjust the script or doc, so there is no blocking issue for doc processing. Thx

@bergerhoffer force-pushed the disaster-recovery branch 2 times, most recently from 4f2c243 to da19c67 May 30, 2019 01:15
@bergerhoffer
Contributor Author

Got the go ahead to merge this. If we need to add additional steps for anything else, we can do so in a separate PR.

@bergerhoffer merged commit ab0e5bd into openshift:enterprise-4.1 May 30, 2019
Contributor

@kalexand-rh left a comment

@bergerhoffer, sorry the review's so late, but here are some suggestions.


toc::[]

This document describes the process to recover from a complete loss of a master host. This includes
Contributor

s/This document describes the process to/You can
s/host. This includes/host, including

. Correct DNS and load balancer entries.
. Grow etcd to full membership.

If the majority of master hosts have been lost, you will need a xref:../disaster_recovery/backing-up-etcd.html#backing-up-etcd-data_backup-etcd[backed up etcd snapshot] to restore etcd quorum on the remaining master host.
Contributor

will

[id="backing-up-etcd-data_{context}"]
= Backing up etcd data

Follow these steps to back up etcd data by creating a snapshot. This snapshot can be saved and used at a later time if you need to restore etcd.
Contributor

Follow these steps to


. Access a master host as the root user.

. Run the `etcd-snapshot-backup.sh` script and pass in the location to save the etcd snapshot to.
Contributor

So, is the script available by default on all master hosts in that location?
s/to./to:
s/pass in/specify

$ sudo /usr/local/bin/etcd-snapshot-backup.sh ./assets/backup/snapshot.db
----
+
In this example, the snapshot is saved to `./assets/backup/snapshot.db` on the master host.
Contributor

Is this where we want to keep the snapshot?

# ./restore_kubeconfig.sh > kubeconfig
----

.. Copy the `kubeconfig` file to all master hosts and move it to `/etc/kubernetes/kubeconfig`.
Contributor

s/move it to /etc/kubernetes/kubeconfig./move it to the /etc/kubernetes/kubeconfig directory.

.. Get the list of current CSRs.
+
----
# oc get csr
Contributor

Do you need to run these commands as root? If not, s/#/$

# oc get csr
----

.. Review the details of a CSR to verify it is valid.
Contributor

s/verify it/verify that it

+
Be sure to approve all pending `node-bootstrapper` CSRs.

. Destroy the recovery API server because it is no longer needed.
Contributor

because it is no longer needed

# podman run -it --network=host -v /etc/kubernetes/:/etc/kubernetes/:Z --entrypoint=/usr/bin/cluster-kube-apiserver-operator "${KAO_IMAGE}" recovery-apiserver destroy
----
+
Wait for the control plane to restart and pick up the new certificates. This might take up to 10 minutes.
Contributor

How do you know if it's restarted? This might need to be a separate step.

s/This/This process

@bergerhoffer deleted the disaster-recovery branch January 22, 2020 18:19