[Doc] add job splitting info to BROKER LOAD sql doc (StarRocks#43435)
amber-create authored Apr 1, 2024
1 parent a87ce50 commit 1d2f582
Showing 2 changed files with 28 additions and 0 deletions.
@@ -691,6 +691,20 @@ In StarRocks v2.4 and earlier, if the total number of Broker Load jobs that are

Since StarRocks v2.5, if the total number of Broker Load jobs that are submitted within a specific period of time exceeds the maximum number, the excess jobs are queued and scheduled based on their priorities. You can specify a priority for a job by using the `priority` parameter described above. You can use [ALTER LOAD](../data-manipulation/ALTER_LOAD.md) to modify the priority of an existing job that is in the **QUEUEING** or **LOADING** state.

## Job splitting and concurrent running

A Broker Load job can be split into one or more tasks that run concurrently. The tasks within a load job run within a single transaction, so they must all succeed or all fail. StarRocks splits each load job based on how you declare `data_desc` in the `LOAD` statement:

- If you declare multiple `data_desc` parameters, each of which specifies a distinct table, a task is generated to load the data of each table.

- If you declare multiple `data_desc` parameters, each of which specifies a distinct partition for the same table, a task is generated to load the data of each partition.
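As an illustration of the first rule, the following sketch declares two `data_desc` clauses that load two distinct tables, so StarRocks would generate two tasks. The label, table names, HDFS paths, and broker credentials are placeholders, not values from this document:

```sql
LOAD LABEL test_db.label_two_tables
(
    -- data_desc 1: generates a task that loads table1
    DATA INFILE("hdfs://<hdfs_host>:<hdfs_port>/user/starrocks/file1.csv")
    INTO TABLE table1,
    -- data_desc 2: generates a second task that loads table2
    DATA INFILE("hdfs://<hdfs_host>:<hdfs_port>/user/starrocks/file2.csv")
    INTO TABLE table2
)
WITH BROKER
(
    "username" = "<hdfs_username>",
    "password" = "<hdfs_password>"
);
```

Because both tasks run within one transaction, if either task fails, neither table's data is published.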

Additionally, each task can be further split into one or more instances, which are evenly distributed to and concurrently run on the BEs or CNs of your StarRocks cluster. StarRocks splits each task based on the FE parameter [`min_bytes_per_broker_scanner`](../../../administration/management/FE_configuration.md) and the number of BE or CN nodes. You can use the following formula to calculate the number of instances in an individual task:

**Number of instances in an individual task = min(Amount of data to be loaded by an individual task/`min_bytes_per_broker_scanner`, Number of BE/CN nodes)**
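For example (a hypothetical calculation; the 64 MB threshold below is illustrative only, so check the actual `min_bytes_per_broker_scanner` value in your FE configuration, and note that rounding the division up is an assumption), a task that loads 1 GB of data on a cluster with 3 BE nodes is split into 3 instances:

```sql
-- Amount of data to be loaded by the task: 1 GB = 1073741824 bytes
-- min_bytes_per_broker_scanner:            67108864 bytes (64 MB, illustrative)
-- Number of BE nodes:                      3
--
-- Number of instances = min(1073741824 / 67108864, 3) = min(16, 3) = 3
SELECT LEAST(CEIL(1073741824 / 67108864), 3) AS instances_per_task;
```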

In most cases, only one `data_desc` is declared for each load job, so each load job is split into only one task, and the task is split into as many instances as there are BE or CN nodes.

## Examples

This section uses HDFS as an example to describe various load configurations.
@@ -745,6 +745,20 @@ In StarRocks v2.4 and earlier, if the total number of Broker Load jobs submitted within a specific period of time
Since StarRocks v2.5, if the total number of Broker Load jobs submitted within a specific period of time exceeds the maximum number, the excess jobs are queued and scheduled based on the priorities specified when the jobs were created. See the optional `priority` parameter described above. You can use the [ALTER LOAD](../data-manipulation/ALTER_LOAD.md) statement to modify the priority of a Broker Load job that is in the **QUEUEING** or **LOADING** state.

## Job splitting and concurrent running

A Broker Load job is split into one or more tasks that run concurrently, and all tasks of a job succeed or fail together as a single transaction. Job splitting is determined by how you declare the `data_desc` parameter in the `LOAD LABEL` statement:

- If you declare multiple `data_desc` parameters that load multiple distinct tables, loading the data of each table is split into a separate task.

- If you declare multiple `data_desc` parameters that load distinct partitions of the same table, loading the data of each partition is split into a separate task.

Each task is further split into one or more instances, which are evenly distributed to and run concurrently on the BE (or CN) nodes. Instance splitting is determined by the FE configuration parameter [`min_bytes_per_broker_scanner`](../../../administration/management/FE_configuration.md) and the number of BE (or CN) nodes. You can use the following formula to calculate the total number of instances in a single task:

**Number of instances in a single task = min(Total amount of data to be loaded by the task/`min_bytes_per_broker_scanner`, Number of BE/CN nodes)**

In most cases, a load job has only one `data_desc`, is split into only one task, and the task is split into as many instances as there are BE (or CN) nodes.

## Examples

This section uses an HDFS data source as an example to describe various load configurations.