Refactor DTR disaster recovery (docker#357)
* Refactor DTR disaster recovery docs
* Introduce disaster recovery overview
* Introduce emergency-repair
* Add DTR offline backup option
joaofnfernandes authored and Jim Galasyn committed Apr 16, 2018
1 parent cc17571 commit 3d8bf74
Showing 19 changed files with 2,066 additions and 115 deletions.
14 changes: 12 additions & 2 deletions _data/toc.yaml
@@ -2373,8 +2373,18 @@ manuals:
title: Troubleshoot with logs
- path: /datacenter/dtr/2.5/guides/admin/monitor-and-troubleshoot/troubleshoot-batch-jobs/
title: Troubleshoot batch jobs
- path: /datacenter/dtr/2.5/guides/admin/backups-and-disaster-recovery/
title: Backups and disaster recovery
- sectiontitle: Disaster recovery
section:
- title: Overview
path: /datacenter/dtr/2.5/guides/admin/disaster-recovery/
- title: Repair a single replica
path: /datacenter/dtr/2.5/guides/admin/disaster-recovery/repair-a-single-replica/
- title: Repair a cluster
path: /datacenter/dtr/2.5/guides/admin/disaster-recovery/repair-a-cluster/
- title: Create a backup
path: /datacenter/dtr/2.5/guides/admin/disaster-recovery/create-a-backup/
- title: Restore from a backup
path: /datacenter/dtr/2.5/guides/admin/disaster-recovery/restore-from-backup/
- sectiontitle: User guides
section:
- sectiontitle: Access DTR
@@ -267,5 +267,4 @@ docker run --detach \

## Where to go next

* [Backups and disaster recovery](../backups-and-disaster-recovery.md)
* [Monitor and troubleshoot](../monitor-and-troubleshoot/index.md)
* [DTR architecture](../../architecture.md)
@@ -1,17 +1,12 @@
---
title: DTR backups and recovery
description: Learn how to back up your Docker Trusted Registry cluster, and to recover your cluster from an existing backup.
keywords: registry, high-availability, backup, recovery
title: Create a backup
description: Learn how to create a backup of Docker Trusted Registry, for disaster recovery.
keywords: dtr, disaster recovery
---

{% assign image_backup_file = "backup-images.tar" %}
{% assign metadata_backup_file = "backup-metadata.tar" %}
{% assign metadata_backup_file = "dtr-metadata-backup.tar" %}
{% assign image_backup_file = "dtr-image-backup.tar" %}

DTR requires that a majority (n/2 + 1) of its replicas are healthy at all times
for it to work. So if a majority of replicas is unhealthy or lost, the only
way to restore DTR to a working state is by recovering from a backup. This
is why it's important to keep replicas healthy and to take frequent
backups.

## Data managed by DTR

@@ -66,8 +61,8 @@ you can backup the images by using ssh to log into a node where DTR is running,
and creating a tar archive of the [dtr-registry volume](../architecture.md):

```none
{% raw %}
sudo tar -cf {{ image_backup_file }} \
{% raw %}
$(dirname $(docker volume inspect --format '{{.Mountpoint}}' dtr-registry-<replica-id>))
{% endraw %}
```
@@ -89,26 +84,32 @@ docker run --log-driver none -i --rm \
--ucp-url <ucp-url> \
--ucp-insecure-tls \
--ucp-username <ucp-username> \
--existing-replica-id <replica-id> > backup-metadata.tar
--existing-replica-id <replica-id> > {{ metadata_backup_file }}
```

Where:

* `<ucp-url>` is the url you use to access UCP
* `<ucp-username>` is the username of a UCP administrator
* `<replica-id>` is the id of the DTR replica to backup

* `<ucp-url>` is the url you use to access UCP.
* `<ucp-username>` is the username of a UCP administrator.
* `<replica-id>` is the id of the DTR replica to backup.

This prompts you for the UCP password, backs up the DTR metadata, and saves the
result in a tar archive. You can learn more about the supported flags in
the [reference documentation](/reference/dtr/2.5/cli/backup.md).
the [reference documentation](../../reference/cli/backup.md).

The backup command doesn't stop DTR, so that you can take frequent backups
without affecting your users. Also, the backup contains sensitive information
By default, the backup command doesn't stop the DTR replica being backed up.
This allows you to take backups without affecting your users. Since the replica
is not stopped, it's possible that writes that happen while the backup is
taking place won't be included in the backup.

You can use the `--offline-backup` option to stop the DTR replica while taking
the backup. If you do this, remove the replica from the load balancing pool.
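
For example, adapting the backup command from above, an offline backup might
look like this (a sketch; the exact image reference and flag placement are
assumptions, so check the reference documentation before relying on it):

```none
docker run --log-driver none -i --rm \
  {{ page.dtr_org }}/{{ page.dtr_repo }}:{{ page.dtr_version }} backup \
  --ucp-url <ucp-url> \
  --ucp-insecure-tls \
  --ucp-username <ucp-username> \
  --existing-replica-id <replica-id> \
  --offline-backup > {{ metadata_backup_file }}
```

Remember to add the replica back to the load balancing pool once the backup
finishes.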

Also, the backup contains sensitive information
like private keys, so you can encrypt the backup by running:

```none
gpg --symmetric {{ backup-metadata.tar }}
gpg --symmetric {{ metadata_backup_file }}
```

This prompts you for a password to encrypt the backup, copies the backup file
@@ -120,7 +121,7 @@ To validate that the backup was correctly performed, you can print the contents
of the tar file created. The backup of the images should look like:

```none
tar -tf {{ image_backup_file }}
tar -tf {{ metadata_backup_file }}
dtr-backup-v{{ page.dtr_version }}/
dtr-backup-v{{ page.dtr_version }}/rethink/
@@ -130,7 +131,7 @@ dtr-backup-v{{ page.dtr_version }}/rethink/layers/
And the backup of the DTR metadata should look like:

```none
tar -tf {{ backup-metadata.tar }}
tar -tf {{ metadata_backup_file }}
# The archive should look like this
dtr-backup-v{{ page.dtr_version }}/
@@ -142,96 +143,9 @@ dtr-backup-v{{ page.dtr_version }}/rethink/properties/0
If you've encrypted the metadata backup, you can use:

```none
gpg -d /tmp/backup.tar.gpg | tar -t
gpg -d {{ metadata_backup_file }} | tar -t
```
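
If you automate backups, the same validation can be scripted. A minimal
sketch, assuming an unencrypted metadata archive with the `rethink` directory
layout shown in the listings above (the archive name and the synthetic demo
layout are illustrative):

```shell
#!/bin/sh
# Sketch: check that a DTR metadata backup contains the RethinkDB dump.
# The expected 'rethink' path is an assumption based on the listings above.
check_backup() {
  tar -tf "$1" | grep -q 'rethink'
}

# Demo with a synthetic archive laid out like the listing above.
mkdir -p demo/dtr-backup-v2.5.0/rethink
touch demo/dtr-backup-v2.5.0/rethink/properties
tar -cf dtr-metadata-backup.tar -C demo .

if check_backup dtr-metadata-backup.tar; then
  echo "backup contains a rethink dump"
else
  echo "backup is missing the rethink dump" >&2
  exit 1
fi
```

In a real cron job you would point `check_backup` at the archive produced by
the backup command and alert on a non-zero exit code.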

You can also create a backup of a UCP cluster and restore it into a new
cluster. Then restore DTR on that new cluster to confirm that everything is
working as expected.

## Restore DTR data

If your DTR has a majority of unhealthy replicas, the only way to restore it to
a working state is by restoring from an existing backup.

To restore DTR, you need to:

1. Stop any DTR containers that might be running
2. Restore the images from a backup
3. Restore DTR metadata from a backup
4. Re-fetch the vulnerability database

You need to restore DTR on the same UCP cluster where you've created the
backup. If you restore on a different UCP cluster, all DTR resources will be
owned by users that don't exist, so you won't be able to manage the resources,
even though they're stored in the DTR data store.

When restoring, you need to use the same version of the `docker/dtr` image
that you've used when creating the backup. Other versions are not guaranteed
to work.

### Stop DTR containers

Start by removing any DTR container that is still running:

```none
docker run -it --rm \
{{ page.dtr_org }}/{{ page.dtr_repo }}:{{ page.dtr_version }} destroy \
--ucp-insecure-tls
```

### Restore images

If you had DTR configured to store images on the local filesystem, you can
extract your backup:

```none
sudo tar -xf {{ image_backup_file }} -C /var/lib/docker/volumes
```

If you're using a different storage backend, follow the best practices
recommended for that system. When restoring the DTR metadata, DTR will be
deployed with the same configurations it had when creating the backup.


### Restore DTR metadata

You can restore the DTR metadata with the `docker/dtr restore` command. This
performs a fresh installation of DTR, and reconfigures it with
the configuration created during a backup.

Load your UCP client bundle, and run the following command, replacing the
placeholders with real values:

```none
read -sp 'ucp password: ' UCP_PASSWORD; \
docker run -i --rm \
--env UCP_PASSWORD=$UCP_PASSWORD \
{{ page.dtr_org }}/{{ page.dtr_repo }}:{{ page.dtr_version }} restore \
--ucp-url <ucp-url> \
--ucp-insecure-tls \
--ucp-username <ucp-username> \
--ucp-node <hostname> \
--replica-id <replica-id> \
--dtr-external-url <dtr-external-url> < {{ metadata_backup_file }}
```

Where:

* `<ucp-url>` is the url you use to access UCP
* `<ucp-username>` is the username of a UCP administrator
* `<hostname>` is the hostname of the node where you've restored the images
* `<replica-id>` is the id of the replica you backed up
* `<dtr-external-url>` is the url that clients use to access DTR

### Re-fetch the vulnerability database

If you're scanning images, you now need to download the vulnerability database.
[Learn more](configure/set-up-vulnerability-scans.md).

After you successfully restore DTR, you can join new replicas the same way you
would after a fresh installation.

## Where to go next

* [Set up high availability](configure/set-up-high-availability.md)
* [DTR architecture](../architecture.md)
58 changes: 58 additions & 0 deletions datacenter/dtr/2.5/guides/admin/disaster-recovery/index.md
@@ -0,0 +1,58 @@
---
title: DTR disaster recovery overview
description: Learn the multiple disaster recovery strategies you can use with Docker Trusted Registry.
keywords: dtr, disaster recovery
---

Docker Trusted Registry is a clustered application. You can join multiple
replicas for high availability.
For a DTR cluster to be healthy, a majority of its replicas (n/2 + 1) need to
be healthy and be able to communicate with the other replicas. This is also
known as maintaining quorum.
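
The quorum arithmetic is easy to sketch: a cluster of n replicas needs
n/2 + 1 healthy replicas (integer division) and so tolerates (n - 1)/2
failures. For example:

```shell
# Quorum sizes for common cluster sizes.
# majority = n/2 + 1 (integer division); tolerated failures = (n - 1)/2
for n in 1 3 5 7; do
  echo "replicas=$n majority=$(( n / 2 + 1 )) tolerated_failures=$(( (n - 1) / 2 ))"
done
```

A five-replica cluster, for example, stays healthy with up to two failed
replicas.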

This means there are three possible failure scenarios.

## Replica is unhealthy but cluster maintains quorum

One or more replicas are unhealthy, but the overall majority (n/2 + 1) is still
healthy and able to communicate with one another.

![Failure scenario 1](../../images/dr-overview-1.svg)

In this example the DTR cluster has five replicas, but one of the nodes stopped
working and another has problems with the DTR overlay network.
Even though these two replicas are unhealthy, the DTR cluster still has a
majority of replicas working, which means the cluster is healthy.

In this case you should repair the unhealthy replicas, or remove them from
the cluster and join new ones.

[Learn how to repair a replica](repair-a-single-replica.md).

## The majority of replicas are unhealthy

A majority of replicas are unhealthy, making the cluster lose quorum, but at
least one replica is still healthy, or at least the data volumes for DTR are
accessible from that replica.

![Failure scenario 2](../../images/dr-overview-2.svg)

In this example the DTR cluster is unhealthy, but since one replica is still
running, it's possible to repair the cluster without having to restore from
a backup. This minimizes the amount of data loss.

[Learn how to do an emergency repair](repair-a-cluster.md).

## All replicas are unhealthy

This is a total disaster scenario in which all DTR replicas were lost and the
data volumes for all replicas are corrupted or gone.

![Failure scenario 3](../../images/dr-overview-3.svg)

In a disaster scenario like this, you'll have to restore DTR from an existing
backup. Restoring from a backup should only be used as a last resort, since
doing an emergency repair might prevent some data loss.

[Learn how to restore from a backup](restore-from-backup.md).
@@ -0,0 +1,81 @@
---
title: Repair a cluster
description: Learn how to repair DTR when the majority of replicas are unhealthy.
keywords: dtr, disaster recovery
---

For a DTR cluster to be healthy, a majority of its replicas (n/2 + 1) need to
be healthy and be able to communicate with the other replicas. This is known
as maintaining quorum.

In a scenario where quorum is lost, but at least one replica is still
accessible, you can use that replica to repair the cluster. That replica
doesn't need to be completely healthy; as long as its DTR data volumes are
persisted and accessible, the cluster can still be repaired.

![Unhealthy cluster](../../images/repair-cluster-1.svg)

Repairing the cluster from an existing replica minimizes the amount of data lost.
If this procedure doesn't work, you'll have to
[restore from an existing backup](restore-from-backup.md).

## Diagnose an unhealthy cluster

When a majority of replicas are unhealthy, causing the overall DTR cluster to
become unhealthy, operations like `docker login`, `docker pull`, and `docker push`
fail with `internal server error`.

Accessing the `/_ping` endpoint of any replica also returns the same error.
It's also possible that the DTR web UI is partially or fully unresponsive.
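
For example, you can probe a replica's health endpoint directly (a sketch;
replace `<replica-address>` with the address of the replica you want to check):

```none
curl -k https://<replica-address>/_ping
```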

## Perform an emergency repair

Use the `docker/dtr emergency-repair` command to try to repair an unhealthy
DTR cluster from an existing replica.

This command checks that the data volumes for the DTR replica are uncorrupted,
redeploys all internal DTR components, and reconfigures them to use the
existing volumes.

It also reconfigures DTR, removing all other nodes from the cluster and leaving
DTR as a single-replica cluster with the replica you chose.

Start by finding the ID of the DTR replica that you want to repair from.
You can find the list of replicas by navigating to the UCP web UI, or by using
a UCP client bundle to run:

```
{% raw %}
docker ps --format "{{.Names}}" | grep dtr
# The list of DTR containers with <node>/<component>-<replicaID>, e.g.
# node-1/dtr-api-a1640e1c15b6
{% endraw %}
```
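
The replica ID is the suffix of those container names. As a convenience, you
can strip it off with shell parameter expansion (the sample name below is
taken from the comment above):

```shell
# Extract the replica ID from a DTR container name such as
# "node-1/dtr-api-a1640e1c15b6": drop everything up to the last '-'.
name="node-1/dtr-api-a1640e1c15b6"
replica_id="${name##*-}"
echo "$replica_id"   # a1640e1c15b6
```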

Then, use your UCP client bundle to run the emergency repair command:

```
{% raw %}
docker run -it --rm {{ page.dtr_org }}/{{ page.dtr_repo }}:{{ page.dtr_version }} emergency-repair \
--ucp-insecure-tls \
--existing-replica-id <replica-id>
{% endraw %}
```

If the emergency repair procedure is successful, your DTR cluster now has a
single replica. You should now
[join more replicas for high availability](../configure/set-up-high-availability.md).

![Healthy cluster](../../images/repair-cluster-2.svg)

If the emergency repair command fails, try running it again using a different
replica ID. As a last resort, you can restore your cluster from an existing
backup.

## Where to go next

* [Create a backup](create-a-backup.md)
* [Restore from an existing backup](restore-from-backup.md)
