Skip to content

Commit

Permalink
[Rebalancer] support partition rebalancer (apache#5010)
Browse files Browse the repository at this point in the history
RebalancerType could be configured via Config.rebalancer_type(BeLoad, Partition).
PartitionRebalancer is based on TwoDimensionalGreedyAlgo.
Two dims of Doris should be cluster & partition. And we only consider about the replica count, 
do not consider replica size.
apache#4845 for further details.
  • Loading branch information
vagetablechicken authored Dec 31, 2020
1 parent fd6fb90 commit d7a584a
Show file tree
Hide file tree
Showing 14 changed files with 1,668 additions and 43 deletions.
23 changes: 21 additions & 2 deletions docs/en/administrator-guide/operation/tablet-repair-and-balance.md
Original file line number Diff line number Diff line change
Expand Up @@ -214,7 +214,11 @@ At the same time, in order to ensure the weight of the initial priority, we stip

## Duplicate Equilibrium

Doris automatically balances replicas within the cluster. The main idea of balancing is to create a replica of some fragments on low-load nodes, and then delete the replicas of these fragments on high-load nodes. At the same time, because of the existence of different storage media, there may or may not exist one or two storage media on different BE nodes in the same cluster. We require that fragments of storage medium A be stored in storage medium A as far as possible after equalization. So we divide the BE nodes of the cluster according to the storage medium. Then load balancing scheduling is carried out for different BE node sets of storage media.
Doris automatically balances replicas within the cluster. Currently supports two rebalance strategies, BeLoad and Partition. BeLoad rebalance will consider about the disk usage and replica count for each BE. Partition rebalance just aim at replica count for each partition, this helps to avoid hot spots. If you want high read/write performance, you may need this. Note that Partition rebalance do not consider about the disk usage, pay more attention to it when you are using Partition rebalance. The strategy selection config is not mutable at runtime.

### BeLoad

The main idea of balancing is to create a replica of some fragments on low-load nodes, and then delete the replicas of these fragments on high-load nodes. At the same time, because of the existence of different storage media, there may or may not exist one or two storage media on different BE nodes in the same cluster. We require that fragments of storage medium A be stored in storage medium A as far as possible after equalization. So we divide the BE nodes of the cluster according to the storage medium. Then load balancing scheduling is carried out for different BE node sets of storage media.

Similarly, replica balancing ensures that a copy of the same table will not be deployed on the BE of the same host.

Expand All @@ -228,7 +232,22 @@ Disk usage and number of copies have a weight factor, which is **capacityCoeffic

The weight coefficient ensures that when disk utilization is too high, the backend load score will be higher to ensure that the BE load is reduced as soon as possible.

Tablet Scheduler updates CLS every 1 minute.
Tablet Scheduler updates CLS every 20 seconds.

### Partition

The main idea of `partition rebalancing` is to decrease the skew of partitions. The skew of the partition is defined as the difference between the maximum replica count of the partition over all bes and the minimum replica count over all bes.

So we only consider about the replica count, do not consider replica size(disk usage).
To fewer moves, we use TwoDimensionalGreedyAlgo which two dims are cluster & partition. It prefers a move that reduce the skew of the cluster when we want to rebalance a max skew partition.

#### Skew Info

The skew info is represented by `ClusterBalanceInfo`. `partitionInfoBySkew` is a multimap which key is the partition's skew, so we can get max skew partitions simply. `beByTotalReplicaCount` is a multimap which key is the total replica count of the backend.

`ClusterBalanceInfo` is in CLS, updated every 20 seconds.

When get more than one max skew partitions, we random select one partition to calculate the move.

### Equilibrium strategy

Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -215,11 +215,15 @@ TabletScheduler 里等待被调度的分片会根据状态不同,赋予不同

## 副本均衡

Doris 会自动进行集群内的副本均衡。均衡的主要思想,是对某些分片,先在低负载的节点上创建一个副本,然后再删除这些分片在高负载节点上的副本。同时,因为不同存储介质的存在,在同一个集群内的不同 BE 节点上,可能存在一种或两种存储介质。我们要求存储介质为 A 的分片在均衡后,尽量依然存储在存储介质 A 中。所以我们根据存储介质,对集群的 BE 节点进行划分。然后针对不同的存储介质的 BE 节点集合,进行负载均衡调度。
Doris 会自动进行集群内的副本均衡。目前支持两种均衡策略,负载/分区。负载均衡适合需要兼顾节点磁盘使用率和节点副本数量的场景;而分区均衡会使每个分区的副本都均匀分布在各个节点,避免热点,适合对分区读写要求比较高的场景。但是,分区均衡不考虑磁盘使用率,使用分区均衡时需要注意磁盘的使用情况。策略只能在fe启动前配置,不支持运行时切换。

### 负载均衡

负载均衡的主要思想是,对某些分片,先在低负载的节点上创建一个副本,然后再删除这些分片在高负载节点上的副本。同时,因为不同存储介质的存在,在同一个集群内的不同 BE 节点上,可能存在一种或两种存储介质。我们要求存储介质为 A 的分片在均衡后,尽量依然存储在存储介质 A 中。所以我们根据存储介质,对集群的 BE 节点进行划分。然后针对不同的存储介质的 BE 节点集合,进行负载均衡调度。

同样,副本均衡会保证不会将同一个 Tablet 的副本部署在同一个 host 的 BE 上。

### BE 节点负载
#### BE 节点负载

我们用 ClusterLoadStatistics(CLS)表示一个 cluster 中各个 Backend 的负载均衡情况。TabletScheduler 根据这个统计值,来触发集群均衡。我们当前通过 **磁盘使用率****副本数量** 两个指标,为每个BE计算一个 loadScore,作为 BE 的负载分数。分数越高,表示该 BE 的负载越重。

Expand All @@ -229,7 +233,18 @@ Doris 会自动进行集群内的副本均衡。均衡的主要思想,是对

该权重系数保证当磁盘使用率过高时,该 Backend 的负载分数会更高,以保证尽快降低这个 BE 的负载。

TabletScheduler 会每隔 1 分钟更新一次 CLS。
TabletScheduler 会每隔 20s 更新一次 CLS。

### 分区均衡

分区均衡的主要思想是,将每个分区的在各个 Backend 上的 replica 数量差(即 partition skew),减少到最小。因此只考虑副本个数,不考虑磁盘使用率。
为了尽量少的迁移次数,分区均衡使用二维贪心的策略,优先均衡partition skew最大的分区,均衡分区时会尽量选择,可以使整个 cluster 的在各个 Backend 上的 replica 数量差(即 cluster skew/total skew)减少的方向。

#### skew 统计

skew 统计信息由`ClusterBalanceInfo`表示,其中,`partitionInfoBySkew`以 partition skew 为key排序,便于找到max partition skew;`beByTotalReplicaCount`则是以 Backend 上的所有 replica 个数为key排序。`ClusterBalanceInfo`同样保持在CLS中, 同样 20s 更新一次。

max partition skew 的分区可能有多个,采用随机的方式选择一个分区计算。

### 均衡策略

Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -536,7 +536,7 @@ private Catalog(boolean isCheckpointCatalog) {
this.metaContext.setThreadLocalInfo();

this.stat = new TabletSchedulerStat();
this.tabletScheduler = new TabletScheduler(this, systemInfo, tabletInvertedIndex, stat);
this.tabletScheduler = new TabletScheduler(this, systemInfo, tabletInvertedIndex, stat, Config.tablet_rebalancer_type);
this.tabletChecker = new TabletChecker(this, systemInfo, tabletScheduler, stat);

this.pendingLoadTaskScheduler = new MasterTaskExecutor("pending_load_task_scheduler", Config.async_load_task_pool_size,
Expand Down
Loading

0 comments on commit d7a584a

Please sign in to comment.