OSDOCS-429: Adding disaster recovery docs #14859
9b15c24 to 9810272
9810272 to 61092a9
@tnozicka @hexfusion @patrickdillon Docs have been updated with the latest changes for all 3 scenarios, if you can please review. Thanks!
61092a9 to d99ab7d
@bergerhoffer - @geliu2016 pointed me to this bug that affects these docs: https://bugzilla.redhat.com/show_bug.cgi?id=1711879
hexfusion
left a comment
looks good added a few notes, will keep digging
@vikram-redhat we have a pending update that will resolve this.
fb21687 to 50a4c33
@hexfusion @patrickdillon Thanks for reviewing, doc is updated w/ the feedback (and some from google docs) if you can take another look.
50a4c33 to 4eed2b4
@hexfusion PR is updated with the latest, can you look at the updates?
hexfusion
left a comment
Overall this looks great, thanks for the hard work and working with us through all of the iterations.
/lgtm
Thanks for the feedback @tnozicka - updates are in the PR if you can double check them.
@openshift/team-documentation Can I get a peer review started of this please?
40c4891 to cc5c982
@rphillips is this a long lived token?
I'm not sure
@deads2k do you know if this token is long lived?
cc5c982 to 03e4828
Hi @geliu2016, @xingxingxia - could someone QE review these disaster recovery docs? Thanks!
Small nit above otherwise LGTM for etcd sections
03e4828 to abdbd2d
@bergerhoffer, it seems that this bug has not been fixed in the doc: https://bugzilla.redhat.com/show_bug.cgi?id=1713219#c5. We're verifying the doc on bare metal and vSphere clusters, so we can close it after testing is done on all supported platforms.
Great thanks @geliu2016. I've reached out to @hexfusion and @tnozicka to help me get the right steps added to the doc for approving the CSRs. I will hopefully update this PR tomorrow to fix that BZ. Let me know if you find any other issues with the doc during your testing. Thanks!
This doc works well for the AWS platform after the bug fix: https://bugzilla.redhat.com/show_bug.cgi?id=1713219#c5, so we can close it for now. If there is any special issue on another platform, Dev can adjust the script or doc, so there is no blocking issue for the doc process. Thanks.
abdbd2d to e358f02
4f2c243 to da19c67
da19c67 to c90aff9
c90aff9 to 095417d
Got the go ahead to merge this. If we need to add additional steps for anything else, we can do so in a separate PR.
kalexand-rh
left a comment
@bergerhoffer, sorry the review's so late, but here are some suggestions.
toc::[]

This document describes the process to recover from a complete loss of a master host. This includes
s/This document describes the process to/You can
s/host. This includes/host, including
. Correct DNS and load balancer entries.
. Grow etcd to full membership.

If the majority of master hosts have been lost, you will need a xref:../disaster_recovery/backing-up-etcd.html#backing-up-etcd-data_backup-etcd[backed up etcd snapshot] to restore etcd quorum on the remaining master host.
will
[id="backing-up-etcd-data_{context}"]
= Backing up etcd data

Follow these steps to back up etcd data by creating a snapshot. This snapshot can be saved and used at a later time if you need to restore etcd.
Follow these steps to
. Access a master host as the root user.

. Run the `etcd-snapshot-backup.sh` script and pass in the location to save the etcd snapshot to.
So, is the script available by default on all master hosts in that location?
s/to./to:
s/pass in/specify
$ sudo /usr/local/bin/etcd-snapshot-backup.sh ./assets/backup/snapshot.db
----
+
In this example, the snapshot is saved to `./assets/backup/snapshot.db` on the master host.
Is this where we want to keep the snapshot?
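On the question of where the snapshot is kept: one option, shown only as a rough sketch, is to give each snapshot a timestamped file name so that repeated backups don't overwrite each other. The `snapshot_path` helper and its naming scheme are hypothetical and not part of the docs under review; only the script path and the `./assets/backup` directory come from the quoted example.

```shell
# Hypothetical helper: build a timestamped snapshot file name under a
# given directory. The naming scheme is an illustrative assumption.
snapshot_path() {
  backup_dir=$1
  printf '%s/snapshot_%s.db\n' "$backup_dir" "$(date +%Y-%m-%d_%H%M%S)"
}

# Example invocation on a master host (shown as a comment, not run here):
#   sudo /usr/local/bin/etcd-snapshot-backup.sh "$(snapshot_path ./assets/backup)"
```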
# ./restore_kubeconfig.sh > kubeconfig
----

.. Copy the `kubeconfig` file to all master hosts and move it to `/etc/kubernetes/kubeconfig`.
s/move it to /etc/kubernetes/kubeconfig./move it to the /etc/kubernetes/kubeconfig directory.
.. Get the list of current CSRs.
+
----
# oc get csr
Do you need to run these commands as root? If not, s/#/$
# oc get csr
----

.. Review the details of a CSR to verify it is valid.
s/verify it/verify that it
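For the CSR steps in this procedure, a small filter can help pick out the pending `node-bootstrapper` requests from the `oc get csr` listing before each one is reviewed. This is a sketch only: the `pending_bootstrapper_csrs` name is hypothetical, and it assumes that the CONDITION value is the last column of the listing and that the requestor field contains the string `node-bootstrapper`.

```shell
# Hypothetical helper: read `oc get csr` output on stdin and print the names
# of CSRs that are Pending and mention node-bootstrapper.
# Assumes the header is on line 1 and CONDITION is the last column.
pending_bootstrapper_csrs() {
  awk 'NR > 1 && $NF == "Pending" && /node-bootstrapper/ { print $1 }'
}

# Example use on a master host (shown as comments, not run here):
#   oc get csr | pending_bootstrapper_csrs
#   oc describe csr <csr_name>             # review the details first
#   oc adm certificate approve <csr_name>  # approve once verified
```

The sketch deliberately stops at listing names, so each CSR can still be reviewed before it is approved.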
+
Be sure to approve all pending `node-bootstrapper` CSRs.

. Destroy the recovery API server because it is no longer needed.
because it is no longer needed
# podman run -it --network=host -v /etc/kubernetes/:/etc/kubernetes/:Z --entrypoint=/usr/bin/cluster-kube-apiserver-operator "${KAO_IMAGE}" recovery-apiserver destroy
----
+
Wait for the control plane to restart and pick up the new certificates. This might take up to 10 minutes.
How do you know if it's restarted? This might need to be a separate step.
s/This/This process
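If the wait does become its own step, one way to make it checkable is a generic poll-with-timeout loop, sketched here. The `wait_until` helper is hypothetical, and this thread does not say which command actually shows that the control plane has restarted, so the `oc get nodes` check in the usage comment is only a placeholder assumption.

```shell
# Hypothetical helper: wait_until TIMEOUT_SECONDS INTERVAL_SECONDS COMMAND [ARGS...]
# Re-runs COMMAND until it succeeds. Returns 0 on success, or 1 if COMMAND
# still fails once at least TIMEOUT_SECONDS have been spent sleeping.
wait_until() {
  timeout_s=$1
  interval_s=$2
  shift 2
  elapsed=0
  while ! "$@" >/dev/null 2>&1; do
    if [ "$elapsed" -ge "$timeout_s" ]; then
      return 1
    fi
    sleep "$interval_s"
    elapsed=$((elapsed + interval_s))
  done
  return 0
}

# Example (placeholder check, not run here): poll for up to 10 minutes.
#   wait_until 600 10 oc get nodes
```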
Preview: http://file.rdu.redhat.com/~ahoffer/2019/disaster-recovery/disaster_recovery/scenario-1-infra-recovery.html