Refactor DTR disaster recovery (docker#357)

* Refactor DTR disaster recovery docs
* Introduce disaster recovery overview
* Introduce emergency-repair
* Add DTR offline backup option
1 parent cc17571, commit 3d8bf74. Showing 19 changed files with 2,066 additions and 115 deletions.
`datacenter/dtr/2.5/guides/admin/disaster-recovery/index.md` (58 additions, 0 deletions)
---
title: DTR disaster recovery overview
description: Learn the multiple disaster recovery strategies you can use with Docker Trusted Registry.
keywords: dtr, disaster recovery
---

Docker Trusted Registry is a clustered application. You can join multiple
replicas for high availability.
For a DTR cluster to be healthy, a majority of its replicas (n/2 + 1) need to
be healthy and able to communicate with the other replicas. This is also
known as maintaining quorum.
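As a quick illustration (a minimal sketch, not part of the DTR tooling), the quorum size for an n-replica cluster can be computed in shell:

```shell
# Quorum for an n-replica cluster is floor(n/2) + 1.
# With 5 replicas, 3 must stay healthy, so 2 failures are tolerated.
n=5
quorum=$(( n / 2 + 1 ))
tolerated=$(( n - quorum ))
echo "replicas=$n quorum=$quorum failures-tolerated=$tolerated"
```

Note that an even replica count doesn't buy extra fault tolerance: with 6 replicas the quorum is 4, so the cluster still only tolerates 2 failures, which is why odd cluster sizes are generally recommended.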

This means that there are three possible failure scenarios.
## Replica is unhealthy but cluster maintains quorum

One or more replicas are unhealthy, but the overall majority (n/2 + 1) is still
healthy and able to communicate with one another.



In this example the DTR cluster has five replicas, but one of the nodes stopped
working and another has problems with the DTR overlay network.
Even though these two replicas are unhealthy, the DTR cluster still has a majority
of healthy replicas, which means the cluster as a whole is healthy.

In this case you should repair the unhealthy replicas, or remove them from
the cluster and join new ones.

[Learn how to repair a replica](repair-a-single-replica.md).
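For reference, removing an unhealthy replica is done with the `docker/dtr remove` command; a sketch along these lines (verify the flags against your DTR version):

```
{% raw %}
docker run -it --rm {{ page.dtr_org }}/{{ page.dtr_repo }}:{{ page.dtr_version }} remove \
  --ucp-insecure-tls
{% endraw %}
```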

## The majority of replicas are unhealthy

A majority of replicas are unhealthy, making the cluster lose quorum, but at
least one replica is still healthy, or at least the data volumes for DTR are
accessible from that replica.



In this example the DTR cluster is unhealthy, but since one replica is still
running it's possible to repair the cluster without having to restore from
a backup. This minimizes the amount of data loss.

[Learn how to do an emergency repair](repair-a-cluster.md).

## All replicas are unhealthy

This is a total disaster scenario where all DTR replicas were lost, causing
the data volumes for all DTR replicas to get corrupted or lost.



In a disaster scenario like this, you'll have to restore DTR from an existing
backup. Restoring from a backup should only be used as a last resort, since
doing an emergency repair might prevent some data loss.

[Learn how to restore from a backup](restore-from-backup.md).
`datacenter/dtr/2.5/guides/admin/disaster-recovery/repair-a-cluster.md` (81 additions, 0 deletions)

---
title: Repair a cluster
description: Learn how to repair DTR when the majority of replicas are unhealthy.
keywords: dtr, disaster recovery
---

For a DTR cluster to be healthy, a majority of its replicas (n/2 + 1) need to
be healthy and able to communicate with the other replicas. This is known
as maintaining quorum.

In a scenario where quorum is lost, but at least one replica is still
accessible, you can use that replica to repair the cluster. That replica
doesn't need to be completely healthy; the cluster can still be repaired as
long as the DTR data volumes are persisted and accessible.



Repairing the cluster from an existing replica minimizes the amount of data lost.
If this procedure doesn't work, you'll have to
[restore from an existing backup](restore-from-backup.md).

## Diagnose an unhealthy cluster

When a majority of replicas are unhealthy, causing the overall DTR cluster to
become unhealthy, operations like `docker login`, `docker pull`, and `docker push`
fail with an `internal server error`.

Accessing the `/_ping` endpoint of any replica also returns the same error.
It's also possible that the DTR web UI is partially or fully unresponsive.
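One way to check this from the command line (a sketch; `dtr-replica-1.example.org` is a placeholder for one of your replica addresses) is to query each replica's `/_ping` endpoint directly:

```
curl --insecure https://dtr-replica-1.example.org/_ping
```

A healthy replica should respond with HTTP 200; an unhealthy one returns an error.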

## Perform an emergency repair

Use the `docker/dtr emergency-repair` command to try to repair an unhealthy
DTR cluster from an existing replica.

This command checks that the data volumes for the DTR replica are uncorrupted,
redeploys all internal DTR components, and reconfigures them to use the existing
volumes.

It also reconfigures DTR, removing all other nodes from the cluster and leaving
DTR as a single-replica cluster with the replica you chose.

Start by finding the ID of the DTR replica that you want to repair from.
You can find the list of replicas by navigating to the UCP web UI, or by using
a UCP client bundle to run:

```
{% raw %}
docker ps --format "{{.Names}}" | grep dtr
# The list of DTR containers with <node>/<component>-<replicaID>, e.g.
# node-1/dtr-api-a1640e1c15b6
{% endraw %}
```

Then, use your UCP client bundle to run the emergency repair command:

```
{% raw %}
docker run -it --rm {{ page.dtr_org }}/{{ page.dtr_repo }}:{{ page.dtr_version }} emergency-repair \
  --ucp-insecure-tls \
  --existing-replica-id <replica-id>
{% endraw %}
```

If the emergency repair procedure is successful, your DTR cluster now has a
single replica. You should now
[join more replicas for high availability](../configure/set-up-high-availability.md).



If the emergency repair command fails, try running it again using a different
replica ID. As a last resort, you can restore your cluster from an existing
backup.

## Where to go next

* [Create a backup](create-a-backup.md)
* [Restore from an existing backup](restore-from-backup.md)