Refactor DTR disaster recovery (docker#357)
* Refactor DTR disaster recovery docs
* Introduce disaster recovery overview
* Introduce emergency-repair
* Add DTR offline backup option
joaofnfernandes authored and Jim Galasyn committed Apr 16, 2018
1 parent cc17571 commit 3d8bf74
Showing 19 changed files with 2,066 additions and 115 deletions.
14 changes: 12 additions & 2 deletions _data/toc.yaml
@@ -2373,8 +2373,18 @@ manuals:
title: Troubleshoot with logs
- path: /datacenter/dtr/2.5/guides/admin/monitor-and-troubleshoot/troubleshoot-batch-jobs/
title: Troubleshoot batch jobs
- path: /datacenter/dtr/2.5/guides/admin/backups-and-disaster-recovery/
title: Backups and disaster recovery
- sectiontitle: Disaster recovery
section:
- title: Overview
path: /datacenter/dtr/2.5/guides/admin/disaster-recovery/
- title: Repair a single replica
path: /datacenter/dtr/2.5/guides/admin/disaster-recovery/repair-a-single-replica/
- title: Repair a cluster
path: /datacenter/dtr/2.5/guides/admin/disaster-recovery/repair-a-cluster/
- title: Create a backup
path: /datacenter/dtr/2.5/guides/admin/disaster-recovery/create-a-backup/
- title: Restore from a backup
path: /datacenter/dtr/2.5/guides/admin/disaster-recovery/restore-from-backup/
- sectiontitle: User guides
section:
- sectiontitle: Access DTR
@@ -267,5 +267,4 @@ docker run --detach \

## Where to go next

* [Backups and disaster recovery](../backups-and-disaster-recovery.md)
* [Monitor and troubleshoot](../monitor-and-troubleshoot/index.md)
* [DTR architecture](../../architecture.md)
@@ -1,17 +1,12 @@
---
title: DTR backups and recovery
description: Learn how to back up your Docker Trusted Registry cluster, and to recover your cluster from an existing backup.
keywords: registry, high-availability, backup, recovery
title: Create a backup
description: Learn how to create a backup of Docker Trusted Registry, for disaster recovery.
keywords: dtr, disaster recovery
---

{% assign image_backup_file = "backup-images.tar" %}
{% assign metadata_backup_file = "backup-metadata.tar" %}
{% assign metadata_backup_file = "dtr-metadata-backup.tar" %}
{% assign image_backup_file = "dtr-image-backup.tar" %}

DTR requires that a majority (n/2 + 1) of its replicas are healthy at all times
for it to work. So if a majority of replicas is unhealthy or lost, the only
way to restore DTR to a working state is by recovering from a backup. This
is why it's important to keep replicas healthy and to take frequent
backups.

## Data managed by DTR

@@ -66,8 +61,8 @@ you can backup the images by using ssh to log into a node where DTR is running,
and creating a tar archive of the [dtr-registry volume](../architecture.md):

```none
{% raw %}
sudo tar -cf {{ image_backup_file }} \
{% raw %}
$(dirname $(docker volume inspect --format '{{.Mountpoint}}' dtr-registry-<replica-id>))
{% endraw %}
```
@@ -89,26 +84,32 @@ docker run --log-driver none -i --rm \
--ucp-url <ucp-url> \
--ucp-insecure-tls \
--ucp-username <ucp-username> \
--existing-replica-id <replica-id> > backup-metadata.tar
--existing-replica-id <replica-id> > {{ metadata_backup_file }}
```

Where:

* `<ucp-url>` is the url you use to access UCP
* `<ucp-username>` is the username of a UCP administrator
* `<replica-id>` is the id of the DTR replica to backup

* `<ucp-url>` is the url you use to access UCP.
* `<ucp-username>` is the username of a UCP administrator.
* `<replica-id>` is the id of the DTR replica to backup.

This prompts you for the UCP password, backs up the DTR metadata, and saves the
result in a tar archive. You can learn more about the supported flags in
the [reference documentation](/reference/dtr/2.5/cli/backup.md).
the [reference documentation](../../reference/cli/backup.md).

The backup command doesn't stop DTR, so that you can take frequent backups
without affecting your users. Also, the backup contains sensitive information
By default, the backup command doesn't stop the DTR replica being backed up.
This allows you to take backups without affecting your users. Since the replica
is not stopped, it's possible that writes that happen while the backup is
taking place won't be included in the backup.

You can use the `--offline-backup` option to stop the DTR replica while taking
the backup. If you do this, remove the replica from the load balancing pool.
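
For example, adapting the backup command from above, an offline backup might
look like this (a sketch; the exact image reference and flag placement are
assumptions, so check the reference documentation before relying on it):

```none
docker run --log-driver none -i --rm \
  {{ page.dtr_org }}/{{ page.dtr_repo }}:{{ page.dtr_version }} backup \
  --ucp-url <ucp-url> \
  --ucp-insecure-tls \
  --ucp-username <ucp-username> \
  --existing-replica-id <replica-id> \
  --offline-backup > {{ metadata_backup_file }}
```

Remember to add the replica back to the load balancing pool once the backup
finishes.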

Also, the backup contains sensitive information
like private keys, so you can encrypt the backup by running:

```none
gpg --symmetric {{ backup-metadata.tar }}
gpg --symmetric {{ metadata_backup_file }}
```

This prompts you for a password to encrypt the backup, copies the backup file
@@ -120,7 +121,7 @@ To validate that the backup was correctly performed, you can print the contents
of the tar file created. The backup of the images should look like:

```none
tar -tf {{ image_backup_file }}
tar -tf {{ metadata_backup_file }}
dtr-backup-v{{ page.dtr_version }}/
dtr-backup-v{{ page.dtr_version }}/rethink/
@@ -130,7 +131,7 @@ dtr-backup-v{{ page.dtr_version }}/rethink/layers/
And the backup of the DTR metadata should look like:

```none
tar -tf {{ backup-metadata.tar }}
tar -tf {{ metadata_backup_file }}
# The archive should look like this
dtr-backup-v{{ page.dtr_version }}/
@@ -142,96 +143,9 @@ dtr-backup-v{{ page.dtr_version }}/rethink/properties/0
If you've encrypted the metadata backup, you can use:

```none
gpg -d /tmp/backup.tar.gpg | tar -t
gpg -d {{ metadata_backup_file }} | tar -t
```
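
If you automate backups, the same validation can be scripted. A minimal
sketch, assuming an unencrypted metadata archive with the `rethink` directory
layout shown in the listings above (the archive name and the synthetic demo
layout are illustrative):

```shell
#!/bin/sh
# Sketch: check that a DTR metadata backup contains the RethinkDB dump.
# The expected 'rethink' path is an assumption based on the listings above.
check_backup() {
  tar -tf "$1" | grep -q 'rethink'
}

# Demo with a synthetic archive laid out like the listing above.
mkdir -p demo/dtr-backup-v2.5.0/rethink
touch demo/dtr-backup-v2.5.0/rethink/properties
tar -cf dtr-metadata-backup.tar -C demo .

if check_backup dtr-metadata-backup.tar; then
  echo "backup contains a rethink dump"
else
  echo "backup is missing the rethink dump" >&2
  exit 1
fi
```

In a real cron job you would point `check_backup` at the archive produced by
the backup command and alert on a non-zero exit code.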

You can also create a backup of a UCP cluster and restore it into a new
cluster. Then restore DTR on that new cluster to confirm that everything is
working as expected.

## Restore DTR data

If your DTR has a majority of unhealthy replicas, the only way to restore it to
a working state is by restoring from an existing backup.

To restore DTR, you need to:

1. Stop any DTR containers that might be running
2. Restore the images from a backup
3. Restore DTR metadata from a backup
4. Re-fetch the vulnerability database

You need to restore DTR on the same UCP cluster where you've created the
backup. If you restore on a different UCP cluster, all DTR resources will be
owned by users that don't exist, so you won't be able to manage the resources,
even though they're stored in the DTR data store.

When restoring, you need to use the same version of the `docker/dtr` image
that you've used when creating the backup. Other versions are not guaranteed
to work.

### Stop DTR containers

Start by removing any DTR container that is still running:

```none
docker run -it --rm \
{{ page.dtr_org }}/{{ page.dtr_repo }}:{{ page.dtr_version }} destroy \
--ucp-insecure-tls
```

### Restore images

If you had DTR configured to store images on the local filesystem, you can
extract your backup:

```none
sudo tar -xf {{ image_backup_file }} -C /var/lib/docker/volumes
```

If you're using a different storage backend, follow the best practices
recommended for that system. When restoring the DTR metadata, DTR will be
deployed with the same configurations it had when creating the backup.


### Restore DTR metadata

You can restore the DTR metadata with the `docker/dtr restore` command. This
performs a fresh installation of DTR, and reconfigures it with
the configuration created during a backup.

Load your UCP client bundle, and run the following command, replacing the
placeholders with real values:

```none
read -sp 'ucp password: ' UCP_PASSWORD; \
docker run -i --rm \
--env UCP_PASSWORD=$UCP_PASSWORD \
{{ page.dtr_org }}/{{ page.dtr_repo }}:{{ page.dtr_version }} restore \
--ucp-url <ucp-url> \
--ucp-insecure-tls \
--ucp-username <ucp-username> \
--ucp-node <hostname> \
--replica-id <replica-id> \
--dtr-external-url <dtr-external-url> < {{ metadata_backup_file }}
```

Where:

* `<ucp-url>` is the url you use to access UCP
* `<ucp-username>` is the username of a UCP administrator
* `<hostname>` is the hostname of the node where you've restored the images
* `<replica-id>` is the id of the replica you backed up
* `<dtr-external-url>` is the url that clients use to access DTR

### Re-fetch the vulnerability database

If you're scanning images, you now need to download the vulnerability database.
[Learn more](configure/set-up-vulnerability-scans.md).

After you successfully restore DTR, you can join new replicas the same way you
would after a fresh installation.

## Where to go next

* [Set up high availability](configure/set-up-high-availability.md)
* [DTR architecture](../architecture.md)
58 changes: 58 additions & 0 deletions datacenter/dtr/2.5/guides/admin/disaster-recovery/index.md
@@ -0,0 +1,58 @@
---
title: DTR disaster recovery overview
description: Learn the multiple disaster recovery strategies you can use with Docker Trusted Registry.
keywords: dtr, disaster recovery
---

Docker Trusted Registry is a clustered application. You can join multiple
replicas for high availability.
For a DTR cluster to be healthy, a majority of its replicas (n/2 + 1) need to
be healthy and be able to communicate with the other replicas. This is also
known as maintaining quorum.
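
The quorum arithmetic is easy to sketch: a cluster of n replicas needs
n/2 + 1 healthy replicas (integer division) and so tolerates (n - 1)/2
failures. For example:

```shell
# Quorum sizes for common cluster sizes.
# majority = n/2 + 1 (integer division); tolerated failures = (n - 1)/2
for n in 1 3 5 7; do
  echo "replicas=$n majority=$(( n / 2 + 1 )) tolerated_failures=$(( (n - 1) / 2 ))"
done
```

A five-replica cluster, for example, stays healthy with up to two failed
replicas.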

This means there are three possible failure scenarios.

## Replica is unhealthy but cluster maintains quorum

One or more replicas are unhealthy, but the overall majority (n/2 + 1) is still
healthy and able to communicate with one another.

![Failure scenario 1](../../images/dr-overview-1.svg)

In this example the DTR cluster has five replicas, but one of the nodes stopped
working and another has problems with the DTR overlay network.
Even though these two replicas are unhealthy, the DTR cluster still has a
majority of replicas working, which means the cluster is healthy.

In this case you should repair the unhealthy replicas, or remove them from
the cluster and join new ones.

[Learn how to repair a replica](repair-a-single-replica.md).

## The majority of replicas are unhealthy

A majority of replicas are unhealthy, making the cluster lose quorum, but at
least one replica is still healthy, or at least the data volumes for DTR are
accessible from that replica.

![Failure scenario 2](../../images/dr-overview-2.svg)

In this example the DTR cluster is unhealthy, but since one replica is still
running, it's possible to repair the cluster without having to restore from
a backup. This minimizes the amount of data loss.

[Learn how to do an emergency repair](repair-a-cluster.md).

## All replicas are unhealthy

This is a total disaster scenario in which all DTR replicas were lost and the
data volumes for all replicas are corrupted or gone.

![Failure scenario 3](../../images/dr-overview-3.svg)

In a disaster scenario like this, you'll have to restore DTR from an existing
backup. Restoring from a backup should only be used as a last resort, since
doing an emergency repair might prevent some data loss.

[Learn how to restore from a backup](restore-from-backup.md).
@@ -0,0 +1,81 @@
---
title: Repair a cluster
description: Learn how to repair DTR when the majority of replicas are unhealthy.
keywords: dtr, disaster recovery
---

For a DTR cluster to be healthy, a majority of its replicas (n/2 + 1) need to
be healthy and be able to communicate with the other replicas. This is known
as maintaining quorum.

In a scenario where quorum is lost, but at least one replica is still
accessible, you can use that replica to repair the cluster. That replica
doesn't need to be completely healthy; as long as its DTR data volumes are
persisted and accessible, the cluster can still be repaired.

![Unhealthy cluster](../../images/repair-cluster-1.svg)

Repairing the cluster from an existing replica minimizes the amount of data lost.
If this procedure doesn't work, you'll have to
[restore from an existing backup](restore-from-backup.md).

## Diagnose an unhealthy cluster

When a majority of replicas are unhealthy, causing the overall DTR cluster to
become unhealthy, operations like `docker login`, `docker pull`, and `docker push`
fail with `internal server error`.

Accessing the `/_ping` endpoint of any replica also returns the same error.
It's also possible that the DTR web UI is partially or fully unresponsive.
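
For example, you can probe a replica's health endpoint directly (a sketch;
replace `<replica-address>` with the address of the replica you want to check):

```none
curl -k https://<replica-address>/_ping
```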

## Perform an emergency repair

Use the `docker/dtr emergency-repair` command to try to repair an unhealthy
DTR cluster from an existing replica.

This command checks that the data volumes for the DTR replica are uncorrupted,
redeploys all internal DTR components, and reconfigures them to use the
existing volumes.

It also reconfigures DTR, removing all other nodes from the cluster and leaving
DTR as a single-replica cluster with the replica you chose.

Start by finding the ID of the DTR replica that you want to repair from.
You can find the list of replicas by navigating to the UCP web UI, or by using
a UCP client bundle to run:

```
{% raw %}
docker ps --format "{{.Names}}" | grep dtr
# The list of DTR containers with <node>/<component>-<replicaID>, e.g.
# node-1/dtr-api-a1640e1c15b6
{% endraw %}
```
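
The replica ID is the suffix of those container names. As a convenience, you
can strip it off with shell parameter expansion (the sample name below is
taken from the comment above):

```shell
# Extract the replica ID from a DTR container name such as
# "node-1/dtr-api-a1640e1c15b6": drop everything up to the last '-'.
name="node-1/dtr-api-a1640e1c15b6"
replica_id="${name##*-}"
echo "$replica_id"   # a1640e1c15b6
```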

Then, use your UCP client bundle to run the emergency repair command:

```
{% raw %}
docker run -it --rm {{ page.dtr_org }}/{{ page.dtr_repo }}:{{ page.dtr_version }} emergency-repair \
--ucp-insecure-tls \
--existing-replica-id <replica-id>
{% endraw %}
```

If the emergency repair procedure is successful, your DTR cluster now has a
single replica. You should now
[join more replicas for high availability](../configure/set-up-high-availability.md).

![Healthy cluster](../../images/repair-cluster-2.svg)

If the emergency repair command fails, try running it again using a different
replica ID. As a last resort, you can restore your cluster from an existing
backup.

## Where to go next

* [Create a backup](create-a-backup.md)
* [Restore from an existing backup](restore-from-backup.md)
