# {{ ydb-short-name }} cluster topology

A {{ ydb-short-name }} cluster consists of [storage](glossary.md#storage-node) and [database](glossary.md#database-node) nodes. As the data stored in {{ ydb-short-name }} is available only via queries and API calls, both types of nodes are essential for [database availability](#database-availability). However, [distributed storage](glossary.md#distributed-storage) consisting of storage nodes has the most impact on the cluster's fault tolerance and ability to persist data reliably. During the initial cluster deployment, an appropriate distributed storage [operating mode](#cluster-config) needs to be chosen according to the expected workload and [database availability](#database-availability) requirements. The operating mode cannot be changed after the initial cluster setup, making it one of the key decisions to consider when planning a new {{ ydb-short-name }} deployment.

## Cluster operating modes {#cluster-config}

Cluster topology is based on the chosen distributed storage operating mode, which needs to be determined according to the fault tolerance requirements. {{ ydb-short-name }}'s failure model is based on the concepts of [fail domain](glossary.md#fail-domain) and [fail realm](glossary.md#fail-realm).

The following {{ ydb-short-name }} distributed storage operating modes are available:

- `mirror-3-dc`. Data is replicated to 3 failure realms (usually availability zones or data centers) using 3 failure domains (usually racks) within each realm. The {{ ydb-short-name }} cluster remains available even if one failure realm fails completely; additionally, one failure domain in the remaining realms can fail at the same time. This mode is recommended for multi-datacenter clusters with high availability requirements.

  ![mirror-3-dc](_assets/mirror-3-dc.png)

- `block-4-2`. [Erasure coding](https://en.wikipedia.org/wiki/Erasure_code) is applied with two blocks of redundancy added to the four blocks of source data. Storage nodes are placed in at least 8 failure domains (usually racks). The {{ ydb-short-name }} cluster remains available if any two domains fail, continuing to write all 6 data parts to the remaining domains. This mode is recommended for clusters deployed within a single availability zone or data center.

  ![block-4-2](_assets/block-4-2.png)

- `none`. There is no redundancy. Any hardware failure causes data to become unavailable or permanently lost. This mode is only recommended for development and functional testing.

{% note info %}

Server failure refers to both total and partial unavailability. For example, the failure of a single disk is also considered a server failure in this context.

{% endnote %}

Fault-tolerant operating modes of distributed storage require a significant amount of hardware to provide the maximum level of high availability guarantees supported by {{ ydb-short-name }}. However, for some use cases, the upfront investment in hardware might be too high. Therefore, {{ ydb-short-name }} offers variations of these operating modes that require less hardware while still providing a reasonable level of fault tolerance. The requirements and guarantees of all these operating modes and their variants are shown in the table below, while the implications of choosing a particular mode are discussed further in this article.

| Mode | Storage<br>volume multiplier | Minimum<br>number<br>of nodes | Fail<br>domain | Fail<br>realm | Number of<br>data centers | Number of<br>server racks |
| --- | --- | --- | --- | --- | --- | --- |
| `mirror-3-dc`, can withstand a failure of a data center and 1 rack in one of the remaining data centers | 3 | 9 (12 recommended) | Rack | Data center | 3 | 3 in each data center |
| `mirror-3-dc` *(reduced)*, can withstand a failure of a data center and 1 server in one of the two other data centers | 3 | 12 | ½ a rack | Data center | 3 | 6 |
| `mirror-3-dc` *(3 nodes)*, can withstand a failure of a single server or of a data center | 3 | 3 | Server | Data center | 3 | Doesn't matter |
| `block-4-2`, can withstand a failure of 2 racks | 1.5 | 8 (10 recommended) | Rack | Data center | 1 | 8 |
| `block-4-2` *(reduced)*, can withstand a failure of 1 rack | 1.5 | 10 | ½ a rack | Data center | 1 | 5 |
| `none`, no fault tolerance | 1 | 1 | Node | Node | 1 | 1 |

{% note info %}

The storage volume multiplier specified above only applies to the fault tolerance factor. Other influencing factors (for example, [slot](glossary.md#slot) fragmentation and granularity) must be taken into account for storage capacity planning.

{% endnote %}
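
The storage volume multipliers in the table follow directly from the redundancy schemes described above: `block-4-2` writes two redundancy blocks for every four blocks of source data, while `mirror-3-dc` keeps a full copy of the data in each of the three fail realms:

$$
\text{block-4-2: } \frac{4 + 2}{4} = 1.5 \qquad\qquad \text{mirror-3-dc: } \frac{3}{1} = 3
$$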

For information about how to set the {{ ydb-short-name }} cluster topology, see [{#T}](../reference/configuration/index.md#domains-blob).
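
In practice, the chosen operating mode appears in the cluster configuration as the erasure species of the static group and of the storage pools, while the `location` of each host determines its fail realm and fail domain. The abbreviated sketch below follows the layout of the example configuration files linked at the end of this article; host names, labels, and disk types are placeholders, and the configuration reference linked above remains the authoritative description of the schema:

```yaml
# Abbreviated, illustrative sketch; not a complete cluster configuration.
static_erasure: mirror-3-dc              # operating mode of the static group
hosts:
  - host: ydb-node-dc1-1.example.net     # placeholder host name
    host_config_id: 1
    location:
      data_center: 'DC1'                 # fail realm
      rack: '1'                          # fail domain
  # ... at least 9 hosts spread over 3 data centers for mirror-3-dc
domains_config:
  domain:
    - name: Root
      storage_pool_types:
        - kind: ssd
          pool_config:
            box_id: 1
            erasure_species: mirror-3-dc # operating mode of this storage pool
            kind: ssd
            pdisk_filter:
              - property:
                  - type: SSD
```

For `block-4-2`, the same fields are used with `erasure_species: block-4-2` and `static_erasure: block-4-2`, with hosts spread over at least 8 racks within a single data center.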

### Reduced configurations {#reduced}

If it is impossible to use the [recommended amount](#cluster-config) of hardware, you can divide servers within a single rack into two dummy fail domains. In this configuration, the failure of one rack results in the failure of two domains instead of just one. In such reduced configurations, {{ ydb-short-name }} will continue to operate if two domains fail. The minimum number of racks in a cluster is five for `block-4-2` mode and two per data center (six in total) for `mirror-3-dc` mode.
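
One possible way to express such a split in the configuration, assuming that fail domains are derived from the `rack` label of each host's `location` as in the sketch above, is to give the two halves of a physical rack distinct labels. This is only an illustration of the idea; the host names and labels are placeholders, and the configuration reference describes the exact mechanism:

```yaml
# Illustrative only: one physical rack presented as two dummy fail domains.
hosts:
  - host: ydb-node-1.example.net
    host_config_id: 1
    location:
      data_center: 'DC1'
      rack: '1a'      # first half of physical rack 1
  - host: ydb-node-2.example.net
    host_config_id: 1
    location:
      data_center: 'DC1'
      rack: '1b'      # second half of physical rack 1
  # ... and so on for the remaining racks
```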

The minimal fault-tolerant configuration of a {{ ydb-short-name }} cluster uses the 3-node variant of the `mirror-3-dc` operating mode, which requires only three servers with three disks each. In this configuration, each server acts as both a fail domain and a fail realm, and the cluster can withstand the failure of only a single server. Each server must be located in an independent data center to provide reasonable fault tolerance.

{{ ydb-short-name }} clusters configured with one of these approaches can be used in production environments that do not require stronger fault tolerance guarantees.

## Capacity and performance considerations {#capacity}

## See also

* [Documentation for DevOps Engineers](../devops/index.md)
* [{#T}](../reference/configuration/index.md#domains-blob)
* [Example cluster configuration files](https://github.com/ydb-platform/ydb/tree/main/ydb/deploy/yaml_config_examples/)
* [{#T}](../contributor/distributed-storage.md)