= Backup, restore, and disaster recovery for {hcp}
include::_attributes/common-attributes.adoc[]
:context: hcp-backup-restore-dr

If you need to back up and restore etcd on a hosted cluster or provide disaster recovery for a hosted cluster, see the following procedures.

== Recovering etcd pods for hosted clusters

In hosted clusters, etcd pods run as part of a stateful set. The stateful set relies on persistent storage to store etcd data for each member. In a highly available control plane, the stateful set has three pods, and each member has its own persistent volume claim.
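
For orientation, the following is a minimal sketch of how you might inspect the etcd stateful set and its persistent volume claims from the management cluster. The hosted control plane namespace pattern (`clusters-<hosted-cluster-name>`) and the `app=etcd` label are assumptions based on default naming and might differ in your environment.

.Example: Inspecting etcd in the hosted control plane namespace (sketch)
[source,terminal]
----
# Hosted control plane namespace; the default <clusters-namespace>-<cluster-name> pattern is assumed
HCP_NAMESPACE=clusters-<hosted-cluster-name>

# View the etcd stateful set and its pods
oc get statefulset etcd -n ${HCP_NAMESPACE}
oc get pods -n ${HCP_NAMESPACE} -l app=etcd

# Each etcd member stores its data on its own persistent volume claim
oc get pvc -n ${HCP_NAMESPACE}
----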

== Backing up and restoring etcd on a hosted cluster on {aws-short}

If you use {hcp} for {product-title}, the process to back up and restore etcd is different from the usual etcd backup process.

The following procedures are specific to {hcp} on {aws-short}.
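
As a rough illustration of how the hosted cluster case differs, the etcd snapshot is taken inside an etcd pod that runs in the hosted control plane namespace on the management cluster, not on the hosted cluster nodes. The following sketch assumes the default namespace pattern and default certificate and data paths, which might differ in your environment; it is not the complete documented procedure.

.Example: Taking an etcd snapshot for a hosted cluster (sketch)
[source,terminal]
----
# Hosted control plane namespace on the management cluster (assumed naming pattern)
HCP_NAMESPACE=clusters-<hosted-cluster-name>

# Take a snapshot inside one etcd member; the certificate and data paths are assumptions
oc exec -n ${HCP_NAMESPACE} etcd-0 -c etcd -- env ETCDCTL_API=3 /usr/bin/etcdctl \
  --cacert /etc/etcd/tls/etcd-ca/ca.crt \
  --cert /etc/etcd/tls/client/etcd-client.crt \
  --key /etc/etcd/tls/client/etcd-client.key \
  --endpoints=https://localhost:2379 \
  snapshot save /var/lib/snapshot.db

# Copy the snapshot out of the pod
oc cp -c etcd ${HCP_NAMESPACE}/etcd-0:/var/lib/snapshot.db ./snapshot.db
----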

== Disaster recovery for a hosted cluster within an {aws-short} region

When you need disaster recovery (DR) for a hosted cluster, you can recover the hosted cluster to the same region within {aws-short}. For example, you need DR when the upgrade of a management cluster fails and the hosted cluster is in a read-only state.

The DR process involves three main steps:

  1. Backing up the hosted cluster on the source management cluster

  2. Restoring the hosted cluster on a destination management cluster

  3. Deleting the hosted cluster from the source management cluster

Your workloads remain running during the process. The Cluster API might be unavailable for a period, but that will not affect the services that are running on the worker nodes.

[IMPORTANT]
====
Both the source management cluster and the destination management cluster must have the `--external-dns` flags to maintain the API server URL, as shown in this example:

.Example: External DNS flags
[source,terminal]
----
--external-dns-provider=aws \
--external-dns-credentials=<AWS Credentials location> \
--external-dns-domain-filter=<DNS Base Domain>
----

That way, the server URL ends with `https://api-sample-hosted.sample-hosted.aws.openshift.com`.

If you do not include the `--external-dns` flags to maintain the API server URL, the hosted cluster cannot be migrated.
====
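
For context, these flags are typically passed when you install the HyperShift Operator on each management cluster. The following sketch shows only the flags that this section discusses; your installation likely requires additional flags, and the variable values refer to the example environment variables that are defined later in this section.

.Example: Installing the HyperShift Operator with external DNS flags (sketch)
[source,terminal]
----
hypershift install \
  --external-dns-provider=aws \
  --external-dns-credentials=${AWS_CREDS} \
  --external-dns-domain-filter=${EXTERNAL_DNS_DOMAIN}
----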

=== Example environment and context

Consider a scenario where you have three clusters to restore. Two are management clusters, and one is a hosted cluster. You can restore either the control plane only or the control plane and the nodes. Before you begin, you need the following information:

* Source MGMT Namespace: The source management namespace
* Source MGMT ClusterName: The source management cluster name
* Source MGMT Kubeconfig: The source management kubeconfig file
* Destination MGMT Kubeconfig: The destination management kubeconfig file
* HC Kubeconfig: The hosted cluster kubeconfig file
* SSH key file: The SSH public key
* Pull secret: The pull secret file to access the release images
* {aws-short} credentials
* {aws-short} region
* Base domain: The DNS base domain to use as an external DNS
* S3 bucket name: The bucket in the {aws-short} region where you plan to upload the etcd backup

This information is shown in the following example environment variables.

.Example environment variables
[source,terminal]
----
SSH_KEY_FILE=${HOME}/.ssh/id_rsa.pub
BASE_PATH=${HOME}/hypershift
BASE_DOMAIN="aws.sample.com"
PULL_SECRET_FILE="${HOME}/pull_secret.json"
AWS_CREDS="${HOME}/.aws/credentials"
AWS_ZONE_ID="Z02718293M33QHDEQBROL"

CONTROL_PLANE_AVAILABILITY_POLICY=SingleReplica
HYPERSHIFT_PATH=${BASE_PATH}/src/hypershift
HYPERSHIFT_CLI=${HYPERSHIFT_PATH}/bin/hypershift
HYPERSHIFT_IMAGE=${HYPERSHIFT_IMAGE:-"quay.io/${USER}/hypershift:latest"}
NODE_POOL_REPLICAS=${NODE_POOL_REPLICAS:-2}

# MGMT Context
MGMT_REGION=us-west-1
MGMT_CLUSTER_NAME="${USER}-dev"
MGMT_CLUSTER_NS=${USER}
MGMT_CLUSTER_DIR="${BASE_PATH}/hosted_clusters/${MGMT_CLUSTER_NS}-${MGMT_CLUSTER_NAME}"
MGMT_KUBECONFIG="${MGMT_CLUSTER_DIR}/kubeconfig"

# MGMT2 Context
MGMT2_CLUSTER_NAME="${USER}-dest"
MGMT2_CLUSTER_NS=${USER}
MGMT2_CLUSTER_DIR="${BASE_PATH}/hosted_clusters/${MGMT2_CLUSTER_NS}-${MGMT2_CLUSTER_NAME}"
MGMT2_KUBECONFIG="${MGMT2_CLUSTER_DIR}/kubeconfig"

# Hosted Cluster Context
HC_CLUSTER_NS=clusters
HC_REGION=us-west-1
HC_CLUSTER_NAME="${USER}-hosted"
HC_CLUSTER_DIR="${BASE_PATH}/hosted_clusters/${HC_CLUSTER_NS}-${HC_CLUSTER_NAME}"
HC_KUBECONFIG="${HC_CLUSTER_DIR}/kubeconfig"
BACKUP_DIR=${HC_CLUSTER_DIR}/backup

BUCKET_NAME="${USER}-hosted-${MGMT_REGION}"

# DNS
AWS_ZONE_ID="Z07342811SH9AA102K1AC"
EXTERNAL_DNS_DOMAIN="hc.jpdv.aws.kerbeross.com"
----
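
If the S3 bucket for the etcd backup does not already exist, you can create it with the AWS CLI. The following is a sketch only; any bucket policies or permissions that your environment requires are not shown.

.Example: Creating the S3 bucket for the etcd backup (sketch)
[source,terminal]
----
# LocationConstraint is required for regions other than us-east-1
aws s3api create-bucket \
  --bucket ${BUCKET_NAME} \
  --create-bucket-configuration LocationConstraint=${MGMT_REGION} \
  --region ${MGMT_REGION}
----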

=== Overview of the backup and restore process

The backup and restore process works as follows:

  1. On management cluster 1, which you can think of as the source management cluster, the control plane and workers interact by using the external DNS API. The external DNS API is accessible, and a load balancer sits between the management clusters.

    (Diagram: The workers access the external DNS API, and the external DNS API points to the control plane through a load balancer.)
  2. You take a snapshot of the hosted cluster, which includes etcd, the control plane, and the worker nodes. During this process, the worker nodes continue to try to access the external DNS API even if it is not accessible, the workloads keep running, the control plane is saved in a local manifest file, and etcd is backed up to an S3 bucket, as shown in the sketch after this list. The data plane is active and the control plane is paused.

  3. On management cluster 2, which you can think of as the destination management cluster, you restore etcd from the S3 bucket and restore the control plane from the local manifest file. During this process, the external DNS API is stopped, the hosted cluster API becomes inaccessible, and any workers that use the API are unable to update their manifest files, but the workloads are still running.

  4. The external DNS API is accessible again, and the worker nodes use it to move to management cluster 2. The external DNS API can access the load balancer that points to the control plane.

  5. On management cluster 2, the control plane and worker nodes interact by using the external DNS API. The resources are deleted from management cluster 1, except for the S3 backup of etcd. If you try to set up the hosted cluster again on management cluster 1, it will not work.

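The following sketch illustrates the pause and etcd upload parts of step 2. It assumes the example environment variables from the previous section, the default hosted control plane namespace pattern, and that you already copied the etcd snapshot to `${BACKUP_DIR}` as described earlier; it is not the full procedure.

.Example: Pausing the hosted cluster and uploading the etcd backup (sketch)
[source,terminal]
----
# Pause reconciliation of the hosted cluster on the source management cluster
oc --kubeconfig=${MGMT_KUBECONFIG} patch hostedcluster ${HC_CLUSTER_NAME} \
  -n ${HC_CLUSTER_NS} --type merge -p '{"spec":{"pausedUntil":"true"}}'

# Upload the previously saved etcd snapshot to the S3 bucket
aws s3 cp ${BACKUP_DIR}/snapshot.db s3://${BUCKET_NAME}/${HC_CLUSTER_NAME}-snapshot.db
----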

You can manually back up and restore your hosted cluster, or you can run a script to complete the process. For more information about the script, see "Running a script to back up and restore a hosted cluster".
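
After the process completes, whether you ran it manually or with the script, you can verify that the hosted cluster is served from the destination management cluster and that the nodes are ready. The commands below are a sketch that reuses the example environment variables; resource names in your environment might differ.

.Example: Verifying the migrated hosted cluster (sketch)
[source,terminal]
----
# The hosted cluster must now exist on the destination management cluster
oc --kubeconfig=${MGMT2_KUBECONFIG} get hostedcluster ${HC_CLUSTER_NAME} -n ${HC_CLUSTER_NS}

# The worker nodes and workloads remain reachable through the maintained API server URL
oc --kubeconfig=${HC_KUBECONFIG} get nodes
oc --kubeconfig=${HC_KUBECONFIG} get pods -A
----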