[doc](update) update concurrent control #1487

Merged
merged 2 commits into from
Dec 16, 2024
modify unique-update-transaction.md
zhannngchen committed Dec 13, 2024
commit a59680686bdbee3e6a7bf6b7d5257d80e91e2e5b
@@ -1,6 +1,6 @@
---
{
"title": "Updating Transaction",
"title": "Concurrency Control for Updates in the Primary Key Model",
"language": "en"
}
---
@@ -24,16 +24,28 @@ specific language governing permissions and limitations
under the License.
-->

## Update Concurrency Control
## Overview

By default, concurrent updates on the same table are not allowed in Doris.
Doris adopts a Multi-Version Concurrency Control (MVCC) mechanism to manage concurrent updates. Each data load operation is assigned a transaction, which ensures atomicity (i.e., the operation either fully succeeds or completely fails). Upon transaction commit, the system assigns a version number. When using the Unique Key model and loading multiple batches of data with duplicate primary keys, Doris determines the overwrite order based on the version number: data with a higher version number will overwrite data with a lower version number.

In certain scenarios, users may need to specify a sequence column in the table creation statement to customize the order in which data takes effect. For example, when synchronizing data into Doris using multiple concurrent processes, the data may arrive out of order. This could lead to older data overwriting newer data due to its delayed arrival. To address this, users can assign a lower sequence value to the older data and a higher sequence value to the newer data, enabling Doris to determine the update order based on the sequence values provided by the user.
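
The interaction between commit versions and sequence values described above can be illustrated with a small simulation. This is a minimal Python sketch, not Doris code: `apply_loads`, the tuple layout, and the tie-breaking rule (equal sequence values let the later load win) are assumptions made purely for illustration.

```python
def apply_loads(loads, use_sequence=False):
    """Replay loads in commit order and return the visible value per key.

    Each load is a (key, value, seq) tuple. The commit version is simply
    the position of the load in the list (1, 2, 3, ...).
    """
    table = {}  # key -> (value, version, seq)
    for version, (key, value, seq) in enumerate(loads, start=1):
        current = table.get(key)
        if current is None:
            table[key] = (value, version, seq)
            continue
        _, cur_version, cur_seq = current
        if use_sequence:
            # Assumption of this sketch: a tie lets the later load win.
            wins = seq >= cur_seq
        else:
            # Without a sequence column, the higher version always wins.
            wins = version > cur_version
        if wins:
            table[key] = (value, version, seq)
    return {key: entry[0] for key, entry in table.items()}


# The newer record (seq=20) is committed first; the older one (seq=10) arrives late.
loads = [("k1", "new", 20), ("k1", "old", 10)]

print(apply_loads(loads))                     # version order only
print(apply_loads(loads, use_sequence=True))  # sequence value decides
```

With version ordering alone, the late-arriving `"old"` row overwrites the newer one; when the sequence value decides, the `"new"` row stays visible regardless of arrival order.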

Additionally, `UPDATE` statements differ significantly from updates performed via data loads at the implementation level. An `UPDATE` operation involves two steps: reading the data to be updated from the database and writing the updated data back. By default, `UPDATE` statements use table-level locks to provide transaction capabilities with Serializable isolation, meaning multiple `UPDATE` statements must be executed serially. However, users can bypass this restriction by modifying the configuration. For detailed instructions, refer to the relevant section below.

## UPDATE Concurrency Control

By default, concurrent `UPDATE`s on the same table are not allowed in Doris.

The main reason is that Doris currently performs updates at row granularity: even if the statement changes only one column (e.g., `SET v2 = 1`), all other value columns are rewritten as well, although their values remain unchanged.

This poses a problem when multiple update operations are performed concurrently on the same row. The behavior becomes unpredictable, and it may lead to inconsistent or "dirty" data.
This poses a problem when multiple `UPDATE` operations are performed concurrently on the same row. The behavior becomes unpredictable, and it may lead to inconsistent or "dirty" data.
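
The hazard can be sketched as a lost update caused by two read-then-write operations that each rewrite the whole row. The following Python simulation is illustrative only: the function names and the in-memory dictionary are hypothetical and do not model Doris internals.

```python
import copy


def build_updated_row(snapshot, key, column, new_value):
    """Read step: copy every column of the row, then change one column."""
    row = dict(snapshot[key])
    row[column] = new_value
    return row


def run_serial(table):
    """Second update reads the result of the first (table-level lock)."""
    table = copy.deepcopy(table)
    table["k1"] = build_updated_row(table, "k1", "v1", 100)
    table["k1"] = build_updated_row(table, "k1", "v2", 1)
    return table


def run_concurrent(table):
    """Both updates read the same snapshot before either one writes."""
    table = copy.deepcopy(table)
    row_a = build_updated_row(table, "k1", "v1", 100)
    row_b = build_updated_row(table, "k1", "v2", 1)
    table["k1"] = row_a
    table["k1"] = row_b  # full-row write carries the stale v1 and discards v1 = 100
    return table


initial = {"k1": {"v1": 0, "v2": 0}}
print(run_serial(initial))
print(run_concurrent(initial))
```

In the serial run both changes survive. In the concurrent run the second write restores the stale `v1`, so the first update is silently lost even though the two statements touched different columns.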

However, in practical applications, if the user can ensure that concurrent updates will not affect the same row simultaneously, they can manually lift the concurrent update restriction by setting the FE (Frontend) configuration `enable_concurrent_update` to `true`. With this setting, the `UPDATE` command no longer provides transaction guarantees.

:::caution Note:
Enabling the `enable_concurrent_update` configuration may introduce certain performance risks.
:::

## Sequence Column

The Unique model primarily caters to scenarios that require unique primary keys, ensuring the uniqueness constraint. When loading data in the same batch or different batches, the replacement order is not guaranteed. The uncertainty in the replacement order results in ambiguity in the specific data loaded into the table.
@@ -1,6 +1,6 @@
---
{
"title": "Update Transactions in the Primary Key Model",
"title": "Concurrency Control for Updates in the Primary Key Model",
"language": "zh-CN"
}
---
@@ -24,18 +24,26 @@ specific language governing permissions and limitations
under the License.
-->

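## Update Concurrency Control
## Overview

By default, multiple Update operations cannot run concurrently on the same table.
Doris uses a Multi-Version Concurrency Control (MVCC) mechanism to manage concurrent updates. Every data write operation is assigned a write transaction, which guarantees the atomicity of the write (that is, the write either fully succeeds or fully fails). When the write transaction commits, the system assigns it a version number. When a user loads data multiple times into a Unique Key model table and duplicate primary keys exist, Doris determines the overwrite order by version number: data with a higher version number overwrites data with a lower version number.

In some scenarios, users may need to specify a sequence column in the table creation statement to adjust the order in which data takes effect. For example, when data is synchronized into Doris concurrently by multiple threads, data from different threads may arrive out of order. In that case, older data that arrives late may incorrectly overwrite newer data. To solve this, users can assign a lower sequence value to the older data and a higher sequence value to the newer data, so that Doris determines the update order correctly from the sequence values supplied by the user.

In addition, the `UPDATE` statement differs considerably from updates performed through data loads in its underlying mechanism. An `UPDATE` operation involves two steps: reading the data to be updated from the database, and writing the updated data. By default, the `UPDATE` statement uses a table-level lock to provide transactions with Serializable isolation, so multiple `UPDATE` operations can only run serially. Users can also bypass this restriction by adjusting the configuration; see the sections below for details.

## UPDATE Concurrency Control

By default, multiple `UPDATE` operations cannot run concurrently on the same table.

The main reason is that Doris currently supports only row-level updates: even if the user declares `SET v2 = 1`, all other Value columns are in fact rewritten as well (although their values do not change).

This creates a problem: if two Update operations update the same row at the same time, the behavior may be nondeterministic, that is, dirty data may result.
This creates a problem: if two `UPDATE` operations update the same row at the same time, the behavior may be nondeterministic, that is, dirty data may result.

In practice, however, if users can guarantee that concurrent updates never operate on the same row at the same time, they can manually lift the concurrency restriction by modifying the FE configuration `enable_concurrent_update`. When this configuration is set to `true`, the update command no longer provides transaction guarantees.

:::caution
Note: enabling the `enable_concurrent_update` configuration carries certain performance risks
:::caution Note:
Enabling the `enable_concurrent_update` configuration carries certain performance risks
:::

## Sequence Column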
2 changes: 1 addition & 1 deletion sidebars.json
@@ -184,7 +184,7 @@
"data-operate/update/unique-update",
"data-operate/update/update-of-unique-model",
"data-operate/update/update-of-aggregate-model",
"data-operate/update/unique-update-transaction"
"data-operate/update/unique-update-concurrent-control"
]
},
{
@@ -1,6 +1,6 @@
---
{
"title": "Updating Transaction",
"title": "Concurrency Control for Updates in the Primary Key Model",
"language": "en"
}
---
@@ -24,16 +24,28 @@ specific language governing permissions and limitations
under the License.
-->

## Overview

Doris adopts a Multi-Version Concurrency Control (MVCC) mechanism to manage concurrent updates. Each data load operation is assigned a transaction, which ensures atomicity (i.e., the operation either fully succeeds or completely fails). Upon transaction commit, the system assigns a version number. When using the Unique Key model and loading multiple batches of data with duplicate primary keys, Doris determines the overwrite order based on the version number: data with a higher version number will overwrite data with a lower version number.

In certain scenarios, users may need to specify a sequence column in the table creation statement to customize the order in which data takes effect. For example, when synchronizing data into Doris using multiple concurrent processes, the data may arrive out of order. This could lead to older data overwriting newer data due to its delayed arrival. To address this, users can assign a lower sequence value to the older data and a higher sequence value to the newer data, enabling Doris to determine the update order based on the sequence values provided by the user.

Additionally, `UPDATE` statements differ significantly from updates performed via data loads at the implementation level. An `UPDATE` operation involves two steps: reading the data to be updated from the database and writing the updated data back. By default, `UPDATE` statements use table-level locks to provide transaction capabilities with Serializable isolation, meaning multiple `UPDATE` statements must be executed serially. However, users can bypass this restriction by modifying the configuration. For detailed instructions, refer to the relevant section below.

## Update Concurrency Control

By default, concurrent updates on the same table are not allowed in Doris.
By default, concurrent `UPDATE`s on the same table are not allowed in Doris.

The main reason is that Doris currently performs updates at row granularity: even if the statement changes only one column (e.g., `SET v2 = 1`), all other value columns are rewritten as well, although their values remain unchanged.

This poses a problem when multiple update operations are performed concurrently on the same row. The behavior becomes unpredictable, and it may lead to inconsistent or "dirty" data.
This poses a problem when multiple `UPDATE` operations are performed concurrently on the same row. The behavior becomes unpredictable, and it may lead to inconsistent or "dirty" data.

However, in practical applications, if the user can ensure that concurrent updates will not affect the same row simultaneously, they can manually lift the concurrent update restriction by setting the FE (Frontend) configuration `enable_concurrent_update` to `true`. With this setting, the `UPDATE` command no longer provides transaction guarantees.

:::caution Note:
Enabling the `enable_concurrent_update` configuration may introduce certain performance risks.
:::

## Sequence Column

The Unique model primarily caters to scenarios that require unique primary keys, ensuring the uniqueness constraint. When loading data in the same batch or different batches, the replacement order is not guaranteed. The uncertainty in the replacement order results in ambiguity in the specific data loaded into the table.