RED-38248 Slave HA changes for 5.6 #708

Merged Apr 2, 2020 · 5 commits
63 changes: 31 additions & 32 deletions content/rs/administering/database-operations/slave-ha.md
@@ -5,44 +5,42 @@ weight: $weight
alwaysopen: false
categories: ["RS"]
---
When you enable [database replication]({{< relref "/rs/concepts/high-availability/replication.md" >}}) for your database,
RS replicates your data to a slave node to make sure that your data is highly available.
If the slave node fails or if the master node fails and the slave is promoted to master,
the remaining master node is a single point of failure.

You can configure high availability for slave shards (slave HA) so that the cluster automatically migrates the slave shards to another available node.
In practice, slave migration creates a new slave shard and replicates the data from the master shard to the new slave shard.
For example:

1. Node:2 has a master shard and node:3 has the corresponding slave shard.
1. Either:

- Node:2 fails and the slave shard on node:3 is promoted to master.
- Node:3 fails and the master shard is no longer replicated to the slave shard on the failed node.

1. If slave HA is enabled, a new slave shard is created on an available node that does not hold the master shard.

All of the constraints of shard migration apply, such as [rack-awareness]({{< relref "/rs/concepts/high-availability/rack-zone-awareness.md" >}}).

1. The data from the master shard is replicated to the new slave shard.
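
To see where master and slave shards are placed before and after a migration, you can list the shards in the cluster. This is only a minimal sketch; it assumes the `rladmin status shards` listing, which shows each shard's database, node, and role:

```src
rladmin status shards
```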

## Configuring High Availability for Slave Shards

Using rladmin or the REST API, slave HA is controlled on the database level and on the cluster level.
You can enable or disable slave HA for a database or for the entire cluster.

When slave HA is enabled for both the cluster and a database,
slave shards for that database are automatically migrated to another node in the event of a master or slave shard failure.
If slave HA is disabled at the cluster level,
slave HA will not migrate slave shards even if slave HA is enabled for a database.

By default, slave HA is enabled for the cluster and disabled for each database,
so that, to enable slave HA for a database, enable slave HA for that database.
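
For example, to turn slave HA on for one database you can tune that database directly. This is a sketch that assumes the standard `rladmin tune db` syntax with the `slave_ha` option; substitute your database ID for `<bdb_uid>`:

```src
rladmin tune db <bdb_uid> slave_ha enabled
```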

{{% note %}}
For Active-Active databases, slave HA is enabled for the database by default to make sure that slave shards are available for Active-Active replication.
{{% /note %}}

To enable slave HA for a cluster using rladmin, run:

@@ -58,22 +56,24 @@ You can see the current configuration options for slave HA with: `rladmin info cluster`
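
The same settings can be managed over the REST API. The following is a hypothetical sketch only: the `/v1/cluster/policy` and `/v1/bdbs/<bdb_uid>` endpoints and the `slave_ha` JSON field are assumptions based on the cluster policy and database objects, not commands confirmed on this page:

```src
# Sketch only -- endpoint paths and the slave_ha field are assumptions.
# Enable slave HA at the cluster level:
curl -k -u "<user>:<password>" -X PUT \
  -H "Content-Type: application/json" \
  -d '{"slave_ha": true}' \
  https://<cluster_fqdn>:9443/v1/cluster/policy

# Enable slave HA for a single database:
curl -k -u "<user>:<password>" -X PUT \
  -H "Content-Type: application/json" \
  -d '{"slave_ha": true}' \
  https://<cluster_fqdn>:9443/v1/bdbs/<bdb_uid>
```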

### Grace Period

By default, slave HA has a 10-minute grace period after node failure and before new slave shards are created.
To configure this grace period from rladmin, run:

rladmin tune cluster slave_ha_grace_period <time_in_seconds>
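
For example, the 10-minute default corresponds to 600 seconds; to extend the grace period to 20 minutes you would pass 1200. This is only an illustration of the command above with a concrete value:

```src
rladmin tune cluster slave_ha_grace_period 1200
```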

### Shard Priority

Slave shard migration is based on priority so that, in the case of limited memory resources,
the most important slave shards are migrated first.
Slave HA migrates slave shards for databases according to this order of priority:

1. slave_ha_priority - The slave shards of the database with the higher slave_ha_priority
integer value are migrated first.

To assign priority to a database, run:

```src
rladmin tune db <bdb_uid> slave_ha_priority <positive integer>
```

1. CRDBs - The CRDB synchronization uses slave shards to synchronize between the replicas.
@@ -82,26 +82,25 @@ Slave shard migration is based on priority so that, in the case of limited memory resources, the most important slave shards are migrated first.

### Cooldown Periods

Both the cluster and the database have cooldown periods.
After node failure, the cluster cooldown period prevents another slave migration due to another node failure for any
database in the cluster until the cooldown period ends (Default: 1 hour).

After a database is migrated with slave HA,
it cannot go through another slave migration due to another node failure until the cooldown period for the database ends (Default: 2 hours).

To configure these cooldown periods from rladmin, run:

- For the cluster:

```src
rladmin tune cluster slave_ha_cooldown_period <time_in_seconds>
```

- For all databases in the cluster:

```src
rladmin tune cluster slave_ha_bdb_cooldown_period <time_in_seconds>
```
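
As an illustration of the commands above with concrete values, the defaults correspond to 3600 seconds (1 hour) for the cluster and 7200 seconds (2 hours) per database; doubling both might look like this:

```src
rladmin tune cluster slave_ha_cooldown_period 7200
rladmin tune cluster slave_ha_bdb_cooldown_period 14400
```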

### Alerts