ydb-platform
diff --git a/ydb/docs/en/core/devops/manual/maintenance-without-downtime.md b/ydb/docs/en/core/devops/manual/maintenance-without-downtime.md
Lines changed: 96 additions & 0 deletions
diff --git a/ydb/docs/en/core/devops/manual/toc_p.yaml b/ydb/docs/en/core/devops/manual/toc_p.yaml
Lines changed: 2 additions & 0 deletions
diff --git a/ydb/docs/en/core/maintenance/toc_i.yaml b/ydb/docs/en/core/maintenance/toc_i.yaml
Lines changed: 2 additions & 0 deletions
diff --git a/ydb/docs/ru/core/maintenance/maintenance-without-outages.md renamed to ydb/docs/ru/core/devops/manual/maintenance-without-downtime.md
Lines changed: 24 additions & 26 deletions
@@ -0,0 +1,96 @@
+# Maintenance without downtime
+
+A {{ ydb-short-name }} cluster periodically needs maintenance, such as upgrading its version or replacing broken disks. Maintenance can cause a cluster or its databases to become unavailable due to:
+- Exceeding the failure model of the affected [storage groups](../../concepts/databases.md#storage-groups).
+- Exceeding the failure model of [State Storage](../../deploy/configuration/config.md#domains-state).
+- A lack of computational resources caused by stopping too many [dynamic nodes](../../concepts/cluster/common_scheme_ydb.md#nodes).
+
+To avoid such situations, {{ ydb-short-name }} has a system [tablet](../../concepts/cluster/common_scheme_ydb.md#tablets) that monitors the state of the cluster: the *Cluster Management System (CMS)*. The CMS answers the question of whether a {{ ydb-short-name }} node, or a host running {{ ydb-short-name }} nodes, can be safely taken out for maintenance. To ask it, create a [maintenance task](#maintenance-task) in the CMS that requests exclusive locks on the nodes or hosts involved in the maintenance. Cluster components on which locks are acquired are considered unavailable from the CMS perspective and can be safely maintained. The CMS [checks](#checking-algorithm) the current state of the cluster and acquires the locks only if the maintenance complies with the [availability mode](#availability-mode) and the [unavailable node limits](#unavailable-node-limits).
+
+{% note warning "Failures during maintenance" %}
+
+During maintenance activities whose safety is guaranteed by the CMS, failures unrelated to those activities may occur in the cluster. If the failures threaten the cluster's availability, urgently aborting the maintenance can help mitigate the risk of cluster downtime.
+
+{% endnote %}
+
+## Maintenance task {#maintenance-task}
+
+A *maintenance task* is a set of *actions* that the user asks the CMS to perform for safe maintenance.
+
+Supported actions:
+- Acquiring an exclusive lock on a cluster component (node or host).
+
+Actions in a task are divided into groups. Actions from the same group are performed atomically. Currently, groups can consist of only one action.
+
+If an action cannot be performed at the time of the request, the CMS reports the reason, suggests a time at which it is worth *refreshing* the task, and sets the action status to *pending*. When the task is refreshed, the CMS attempts to perform the pending actions again.
+
+*Performed* actions have a deadline after which they are considered *completed* and stop affecting the cluster; for example, an exclusive lock is released. An action can also be completed early.
+
+{% note info "Protracted maintenance" %}
+
+If maintenance continues after the actions performed to make it safe have been completed, this is considered a failure in the cluster.
+
+{% endnote %}
+
+Completed actions are automatically removed from the task.
+
+### Availability mode {#availability-mode}
+
+In a maintenance task, you need to specify the cluster's availability mode to comply with when checking whether actions can be performed. The following modes are supported:
+- **Strong**: a mode that minimizes the risk of availability loss.
+    - No more than one unavailable [VDisk](../../concepts/cluster/distributed_storage.md#storage-groups) is allowed in each affected storage group.
+    - No more than one unavailable State Storage ring is allowed.
+- **Weak**: a mode that does not allow exceeding the failure model.
+    - For affected storage groups with the [block-4-2](../../deploy/configuration/config.md#reliability) scheme, no more than two unavailable VDisks are allowed.
+    - For affected storage groups with the [mirror-3-dc](../../deploy/configuration/config.md#reliability) scheme, no more than four unavailable VDisks are allowed, three of which must be in the same data center.
+    - No more than `(nto_select - 1) / 2` unavailable State Storage rings are allowed.
+- **Force**: a forced mode in which the failure model is ignored. *Not recommended for use.*
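The limits above can be read as simple predicates. The following Python sketch is illustrative only and is not the CMS implementation: the function names are hypothetical, `(nto_select - 1) / 2` is assumed to be integer division, and the mirror-3-dc rule is interpreted as "at most one fully unavailable data center plus one more VDisk in another data center".

```python
from typing import Optional, Sequence


def max_unavailable_rings(mode: str, nto_select: int) -> Optional[int]:
    """Return the State Storage ring limit for a mode; None means no limit."""
    if mode == "strong":
        return 1
    if mode == "weak":
        # Assumption: the formula in the text uses integer division.
        return (nto_select - 1) // 2
    if mode == "force":
        return None  # the failure model is ignored
    raise ValueError(f"unknown availability mode: {mode}")


def vdisks_within_limit(mode: str, scheme: str,
                        unavailable_per_dc: Sequence[int]) -> bool:
    """Check one storage group; unavailable_per_dc lists unavailable VDisks per data center."""
    total = sum(unavailable_per_dc)
    if mode == "force":
        return True
    if mode == "strong":
        return total <= 1
    if mode != "weak":
        raise ValueError(f"unknown availability mode: {mode}")
    if scheme == "block-4-2":
        return total <= 2
    if scheme == "mirror-3-dc":
        # Assumed interpretation: up to 3 VDisks in one data center
        # and at most 1 VDisk in one other data center.
        counts = sorted(unavailable_per_dc, reverse=True) + [0, 0, 0]
        return (counts[0] <= 3 and counts[1] <= 1
                and all(c == 0 for c in counts[2:]))
    raise ValueError(f"unknown scheme: {scheme}")
```

For example, under these assumptions `vdisks_within_limit("weak", "mirror-3-dc", [3, 1, 0])` holds, while `[2, 2, 0]` does not.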
+
+### Priority {#priority}
+
+You can specify the priority of a maintenance task. A lower value means a higher priority.
+
+The task's actions cannot be performed until all conflicting actions from tasks with a higher priority are completed. Tasks with the same priority have no advantage over each other.
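As a rough illustration of the priority rule, the sketch below lists the tasks that block a given task. It is not the CMS implementation: the `Task` type and `blocking_tasks` function are hypothetical, only tasks with uncompleted actions are assumed to be passed in, and "conflicting" is assumed to mean that two tasks target at least one common node or host.

```python
from dataclasses import dataclass
from typing import FrozenSet, List


@dataclass(frozen=True)
class Task:
    name: str
    priority: int            # a lower value means a higher priority
    targets: FrozenSet[str]  # nodes or hosts the task wants to lock


def blocking_tasks(task: Task, others: List[Task]) -> List[str]:
    """Names of tasks with a strictly higher priority that share a target with `task`."""
    return [
        other.name
        for other in others
        if other.priority < task.priority and other.targets & task.targets
    ]
```

Tasks with an equal priority value never appear in the result, which matches the statement that they have no advantage over each other.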
+
+## Unavailable node limits {#unavailable-node-limits}
+
+In the CMS configuration, you can configure limits on the number of unavailable nodes for a database (tenant) or the cluster as a whole. Relative and absolute limits are supported.
+
+By default, each database and the cluster as a whole are allowed to have no more than 10% unavailable nodes.
+
+## Checking algorithm {#checking-algorithm}
+
+To check if the actions of a maintenance task can be performed, the CMS sequentially goes through each action group in the task and checks the action from the group:
+- If the action's object is a host, the CMS checks whether the action can be performed on all nodes running on the host.
+- If the action's object is a node, the CMS checks:
+    - Whether there is a lock on the node.
+    - Whether it's possible to lock the node according to the limits of unavailable nodes.
+    - Whether it's possible to lock all VDisks of the node according to the availability mode.
+    - Whether it's possible to lock the State Storage ring of the node according to the availability mode.
+    - Whether it's possible to lock the node according to the limit of unavailable nodes on which cluster system tablets can run.
+
+If all checks succeed, the action can be performed, and the CMS acquires temporary locks on the checked nodes. The CMS then considers the next group of actions. Temporary locks make it possible to detect whether actions requested in different groups conflict with each other. Once the check is complete, the temporary locks are released.
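The role of temporary locks can be sketched as follows. This is a simplified model rather than the CMS code: `check_task`, its parameters, and the `node_allowed` callback (standing in for the node limit, VDisk, and State Storage checks) are all hypothetical names.

```python
from typing import Callable, Dict, List, Set, Tuple

# An action is ("node", node_name) or ("host", host_name).
Action = Tuple[str, str]


def check_task(
    actions: List[Action],
    host_nodes: Dict[str, List[str]],
    locked: Set[str],
    node_allowed: Callable[[str, Set[str]], bool],
) -> List[bool]:
    """Return, for each single-action group, whether the action can be performed."""
    temp_locks: Set[str] = set()
    results: List[bool] = []
    for kind, target in actions:
        nodes = host_nodes[target] if kind == "host" else [target]
        tentative = set(temp_locks)
        ok = True
        for node in nodes:
            # A node fails if it is already locked or breaks a limit,
            # taking into account locks taken earlier in this check.
            if node in locked or node in tentative or not node_allowed(node, tentative):
                ok = False
                break
            tentative.add(node)
        if ok:
            temp_locks = tentative  # keep temporary locks for later groups
        results.append(ok)
    return results  # temporary locks are dropped when the check ends
```

Because the temporary locks live only inside the function, a failed check leaves no trace, which mirrors the release of temporary locks once the check is complete.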
+
+## Examples {#examples}
+
+The [ydbops](https://github.com/ydb-platform/ydbops) utility uses the CMS for cluster maintenance without downtime. You can also use the CMS directly through the [gRPC API](https://github.com/ydb-platform/ydb/blob/main/ydb/public/api/grpc/draft/ydb_maintenance_v1.proto).
+
+### Rolling restart {#rolling-restart}
+
+To perform a rolling restart of the entire cluster, you can use the command:
+```
+$ ydbops restart --endpoint grpc://<cluster-fqdn> --availability-mode strong
+```
+If your systemd unit name differs from the default one, you may need to override it with the `--systemd-unit` flag.
+
+The `ydbops` utility automatically creates a maintenance task to restart the entire cluster using the given availability mode. As it progresses, `ydbops` refreshes the maintenance task and acquires exclusive locks on the nodes in the CMS until all nodes are restarted.
+
+### Take out a node for maintenance {#node-maintenance}
+
+{% note info "Functionality in development" %}
+
+This functionality is expected in upcoming versions of `ydbops`.
+
+{% endnote %}
+
+To take out a node for maintenance, you can use the `ydbops` utility. When taking a node out, `ydbops` acquires an exclusive lock on this node in the CMS.
@@ -23,4 +23,6 @@ items:
   href: ../../maintenance/manual/cms.md
 - name: System views
   href: system-views.md
+- name: Maintenance without downtime
+  href: maintenance-without-downtime.md
 
@@ -11,5 +11,7 @@ items:
     include: { mode: link, path: manual/toc_p.yaml }
   - name: Changing an actor system's configuration
     href: manual/change_actorsystem_configs.md
+  - name: Maintenance without downtime
+    href: manual/maintenance-without-downtime.md
   - name: Updating configurations via CMS
     href: manual/cms.md
@@ -1,11 +1,11 @@
 # Обслуживание кластера без потери доступности
 
 Периодически кластер {{ ydb-short-name }} необходимо обслуживать, например, обновлять его версию или заменять сломавшиеся диски. Работы по обслуживанию могут привести к недоступности кластера или имеющихся баз данных из-за:
-- Превышения модели отказа затронутых [групп хранения](../concepts/databases.md#storage-groups).
-- Превышения модели отказа [State Storage](../deploy/configuration/config.md#domains-state).
-- Недостатка вычислительных ресурсов вследствие остановки слишком большого количества [динамических узлов](../concepts/cluster/common_scheme_ydb.md#nodes).
+- Превышения модели отказа затронутых [групп хранения](../../concepts/databases.md#storage-groups).
+- Превышения модели отказа [State Storage](../../deploy/configuration/config.md#domains-state).
+- Недостатка вычислительных ресурсов вследствие остановки слишком большого количества [динамических узлов](../../concepts/cluster/common_scheme_ydb.md#nodes).
 
-Для избежания таких ситуаций в {{ ydb-short-name }} есть системная [таблетка](../concepts/cluster/common_scheme_ydb.md#tablets), которая следит за состоянием кластера — *Cluster Management System (CMS)*. CMS позволяет ответить на вопрос можно ли безопасно вывести в обслуживание узел {{ ydb-short-name }} или хост, на котором работают узлы {{ ydb-short-name }}. Для этого необходимо создать [задачу обслуживания](#maintenance-task) в CMS и указать в ней взятие эксклюзивных блокировок на узлы или хосты, которые будут задействованы в обслуживании. Компоненты кластера, на которые взяты блокировки, считаются недоступными с точки зрения CMS, и их можно безопасно обслуживать. CMS [проверит](#check-task-actions-algorithm) текущее состояние кластера и возьмет блокировки, только если работы по обслуживанию соответствуют ограничениям [режима доступности](#availability-mode) и [лимитам недоступных узлов](#unavailable-node-limits).
+Для избежания таких ситуаций в {{ ydb-short-name }} есть системная [таблетка](../../concepts/cluster/common_scheme_ydb.md#tablets), которая следит за состоянием кластера — *Cluster Management System (CMS)*. CMS позволяет ответить на вопрос можно ли безопасно вывести в обслуживание узел {{ ydb-short-name }} или хост, на котором работают узлы {{ ydb-short-name }}. Для этого необходимо создать [задачу обслуживания](#maintenance-task) в CMS и указать в ней взятие эксклюзивных блокировок на узлы или хосты, которые будут задействованы в обслуживании. Компоненты кластера, на которые взяты блокировки, считаются недоступными с точки зрения CMS, и их можно безопасно обслуживать. CMS [проверит](#checking-algorithm) текущее состояние кластера и возьмет блокировки, только если работы по обслуживанию соответствуют ограничениям [режима доступности](#availability-mode) и [лимитам недоступных узлов](#unavailable-node-limits).
 
 {% note warning "Поломки во время проведения работ" %}
 
@@ -18,7 +18,7 @@
 *Задача обслуживания* представляет собой набор *действий*, которые пользователь просит выполнить CMS для возможности проведения безопасного обслуживания.
 
 Поддерживаемые действия:
-- Взятие эксклюзивной блокировки на компонент кластера — узел или хост.
+- Взятие эксклюзивной блокировки на компонент кластера (узел или хост).
 
 В задаче действия делятся на группы. Действия из одной группы выполняются атомарно. На данный момент группы могут состоять только из одного действия.
 
@@ -37,14 +37,14 @@
 ### Режим доступности {#availability-mode}
 
 В задаче обслуживания необходимо указать режим доступности кластера, который должен соблюдаться при проверке возможности выполнения действий. Поддерживаются следующие режимы:
-- **Strong** — режим, минимизирующий риск потери доступности.
-    - Допускается не более одного недоступного [VDisk](../concepts/cluster/distributed_storage.md#storage-groups) в каждой из затрагиваемых групп хранения.
+- **Strong**: режим, минимизирующий риск потери доступности.
+    - Допускается не более одного недоступного [VDisk](../../concepts/cluster/distributed_storage.md#storage-groups) в каждой из затрагиваемых групп хранения.
     - Допускается не более одного недоступного кольца State Storage.
-- **Weak** — режим, не позволяющий превысить модель отказа.
-    - Допускается не более двух недоступных VDisk-ов для затрагиваемых групп хранения со схемой [block-4-2](../administration/production-storage-config.md#reliability).
-    - Допускается не более четырех недоступных VDisk-ов, три из которых должны находиться в одном датацентре, для затрагиваемых групп хранения со схемой [mirror-3-dc](../administration/production-storage-config.md#reliability). 
+- **Weak**: режим, не позволяющий превысить модель отказа.
+    - Допускается не более двух недоступных VDisk-ов для затрагиваемых групп хранения со схемой [block-4-2](../../deploy/configuration/config.md#reliability).
+    - Допускается не более четырех недоступных VDisk-ов, три из которых должны находиться в одном датацентре, для затрагиваемых групп хранения со схемой [mirror-3-dc](../../deploy/configuration/config.md#reliability). 
     - Допускается не более `(nto_select - 1) / 2` недоступных колец State Storage.
-- **Force** — принудительный режим, модель отказа игнорируется. Не рекомендуется к использованию.
+- **Force**: принудительный режим, модель отказа игнорируется. *Не рекомендуется к использованию*.
 
 ### Приоритет {#priority}
 
@@ -58,7 +58,7 @@
 
 По умолчанию допускается не более 10% недоступных узлов для каждой базы данных и кластера в целом.
 
-## Алгоритм проверки действий задачи {#check-task-actions-algorithm}
+## Алгоритм проверки {#checking-algorithm}
 
 Для того, чтобы проверить можно ли выполнить действия задачи обслуживания, CMS последовательно идет по каждой группе действий в задаче и проверяет действие из группы:
 - Если объектом действия является хост, то CMS проверяет можно ли выполнить действие со всеми узлами, запущенными на хосте. 
@@ -75,24 +75,22 @@
 
 Утилита [ydbops](https://github.com/ydb-platform/ydbops) использует CMS для проведения обслуживания кластера без потери доступности. Также CMS можно использовать напрямую через [gRPC API](https://github.com/ydb-platform/ydb/blob/main/ydb/public/api/grpc/draft/ydb_maintenance_v1.proto).
 
+### Rolling restart {#rolling-restart}
+
+Чтобы выполнить rolling restart всего кластера можно воспользоваться командой:
+```
+$ ydbops restart --endpoint grpc://<cluster-fqdn> --availability-mode strong
+```
+Если используемое имя systemd unit отличается от стандартного, его можно переопределить с помощью флага `--systemd-unit`.
+
+Утилита `ydbops` автоматически создаст задачу обслуживания на рестарт всего кластера, используя указанный режим доступности. По ходу продвижения `ydbops` будет обновлять задачу обслуживания и получать эксклюзивные блокировки на узлы в CMS, пока все узлы не будут перезапущены.
+
 ### Вывести узел для обслуживания {#node-maintenance}
 
 {% note info "Функциональность в разработке" %}
 
-Функциональность ожидается в ближайших версиях ydbops.
+Функциональность ожидается в ближайших версиях `ydbops`.
 
 {% endnote %}
 
-Для выведения узла для обслуживания можно воспользоваться командой:
-```
-$ ydbops node maintenance --host <node_fqdn>
-```
-При выполнении этой команды ydbops возьмет эксклюзивную блокировку на узел в CMS.
-
-### Rolling restart {#rolling-restart}
-
-Чтобы выполнить rolling restart всего кластера можно воспользоваться командой:
-```
-$ ydbops restart --endpoint grpc://<cluster-fqdn> --availability-mode strong
-```
-Утилита ydbops автоматически создаст задачу обслуживания на рестарт всего кластера, используя указанный режим доступности. По ходу продвижения ydbops будет обновлять задачу обслуживания и получать эксклюзивные блокировки на узлы в CMS, пока все узлы не будут перезапущены.
+Чтобы вывести узел для обслуживания можно воспользоваться утилитой `ydbops`. При выведении узла `ydbops` возьмет эксклюзивную блокировку на этот узел в CMS.