rebalance command
`rebalance` is used for:
- targeted broker storage rebalancing*
- incremental scaling
*In contrast to storage rebalancing in `rebuild` (which requires that 100% of partitions for a targeted topic are relocated), `rebalance` performs partial partition rebalancing, moving partitions from the most to the least storage-utilized brokers.
Rebalance takes an input topic list (similar to `rebuild`: comma-delimited with regex support) and a broker list. Typically, the broker list would include all brokers that the target topic(s) currently occupy. Removing brokers is not allowed in rebalance; only adding additional, new brokers is permitted.
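As an illustration (the topic names here are hypothetical), the two input lists might look like:

```
--topics "events-.*,clickstream" --brokers 1200,1201,1202,1203
```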
Rebalance uses the same broker/topic metrics mechanism as `rebuild` (both of which can be supplemented with `metricsfetcher`). Rebalance works by examining the free storage on all referenced brokers and selecting as offload targets those that are more than 20% below the harmonic mean (configurable via the `--storage-threshold` parameter). Alternatively, brokers below a fixed amount of free storage in gigabytes can be targeted using the `--storage-threshold-gb` parameter. For each broker targeted for partition offloading, partitions are planned for relocation to the least-utilized destination. Relocations can be scoped by `rack.id` via the `--locality-scoped` flag. For instance, if `rack.id` values reflected physical data centers, performing a rebalance with a locality scope would rebalance partitions among brokers within each data center in isolation.
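As a rough illustration of the offload-target selection step, the following sketch computes the harmonic mean of broker free storage and flags brokers below the threshold. It is illustrative only, not kafka-kit's actual code; the broker IDs and values are made up:

```go
package main

import "fmt"

// harmonicMean returns the harmonic mean of vals.
func harmonicMean(vals []float64) float64 {
	var sum float64
	for _, v := range vals {
		sum += 1 / v
	}
	return float64(len(vals)) / sum
}

func main() {
	// Free storage (GB) per broker ID; values are made up.
	free := map[int]float64{1200: 3587.97, 1203: 1865.20, 1214: 1556.82, 1235: 3369.32}

	var vals []float64
	for _, v := range free {
		vals = append(vals, v)
	}
	hm := harmonicMean(vals)

	threshold := 0.20 // --storage-threshold default
	for id, v := range free {
		// Brokers more than 20% below the harmonic mean become offload targets.
		if v < hm*(1-threshold) {
			fmt.Printf("broker %d targeted for offloading (%.2fGB < %.2fGB)\n", id, v, hm*(1-threshold))
		}
	}
}
```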
Destination broker suitability is determined as either (a minimal sketch follows the list):
- (locality scoped) the least utilized broker with the same `rack.id` as the offload target
- (non locality scoped) the least utilized broker that wouldn't result in duplicate `rack.id` values in the resulting ISR
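A rough sketch of that destination choice under both modes, assuming a simple `broker` struct (illustrative only, not kafka-kit's actual types or logic):

```go
package main

import "fmt"

type broker struct {
	id   int
	rack string
	free float64 // free storage in GB
}

// pickDestination returns the least utilized (most free storage) candidate
// that satisfies the suitability rule for the current mode.
func pickDestination(cands []broker, targetRack string, isrRacks map[string]bool, localityScoped bool) *broker {
	var best *broker
	for i := range cands {
		c := &cands[i]
		if localityScoped && c.rack != targetRack {
			continue // must share the offload target's rack.id
		}
		if !localityScoped && isrRacks[c.rack] {
			continue // would duplicate a rack.id already in the ISR
		}
		if best == nil || c.free > best.free {
			best = c
		}
	}
	return best
}

func main() {
	cands := []broker{{1200, "a", 3587.97}, {1208, "b", 3224.55}, {1235, "c", 3369.32}}
	isrRacks := map[string]bool{"a": true} // racks already present in the partition's ISR
	if d := pickDestination(cands, "b", isrRacks, false); d != nil {
		fmt.Printf("destination: broker %d (%.2fGB free)\n", d.id, d.free)
	}
}
```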
The `--tolerance` flag specifies limits on how much data can be moved from offload targets and to destination targets, expressed as a distance (in percent) from the arithmetic mean of free storage. Using the default 10% and a mean free storage of 800GB, partition movement planning per target stops when (see the arithmetic sketched after this list):
- the target's free storage would exceed 880GB (mean + 10%)
- any partition movement would push the most suitable destination below 720GB (mean - 10%)
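The tolerance arithmetic from the example above, as a tiny sketch (the mean and tolerance values are the illustrative ones from the text):

```go
package main

import "fmt"

func main() {
	mean, tolerance := 800.0, 0.10 // mean free storage (GB), --tolerance default

	sourceCeiling := mean * (1 + tolerance) // stop draining a target once its free storage would exceed this
	destFloor := mean * (1 - tolerance)     // skip any move that would push a destination below this

	fmt.Printf("source ceiling: %.0fGB, destination floor: %.0fGB\n", sourceCeiling, destFloor) // 880GB, 720GB
}
```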
All partition movement planning halts when no offload target has any remaining relocations to schedule. A plan summary and the resulting partition map are then printed.
Fetching up-to-date metrics data with `metricsfetcher`:
```
$ metricsfetcher --broker-storage-query "avg:system.disk.free{cluster:kafka-test,device:/data}" --partition-size-query "max:kafka.log.partition.size{cluster:kafka-test} by {topic,partition}"
Submitting max:kafka.log.partition.size{cluster:kafka-test} by {topic,partition}.rollup(avg, 3600)
success
Submitting avg:system.disk.free{cluster:kafka-test,device:/data} by {broker_id}.rollup(avg, 3600)
success
Data written to ZooKeeper
```
Running rebalance for "test-topic" and providing all of the brokers that "test-topic" partitions reside on:
```
$ topicmappr rebalance --topics "test-topic" --brokers 1200,1201,1202,1203,1205,1208,1209,1211,1212,1213,1214,1215,1216,1217,1220,1223,1224,1225,1234,1235,1236,1247,1254,1255,1256,1267,1376 --storage-threshold 0.05 --tolerance 0.2 | grep -v no-op
Topics:
test-topic
Validating broker list:
OK
Rebalance parameters:
Free storage mean, harmonic mean: 2299.03GB, 2199.97GB
Broker free storage limits (with a 20.00% tolerance from mean):
Sources limited to <= 2758.83GB
Destinations limited to >= 1839.22GB
Brokers targeted for partition offloading (>= 5.00% threshold below hmean):
1203
1209
1211
1212
1214
1217
1224
1225
1247
1255
1256
1376
Broker 1203 relocations planned:
[800.20GB] test-topic p117 -> 1200
Broker 1209 relocations planned:
[827.74GB] test-topic p119 -> 1235
Broker 1211 relocations planned:
[602.12GB] test-topic p125 -> 1236
Broker 1212 relocations planned:
[825.81GB] test-topic p22 -> 1208
Broker 1214 relocations planned:
[678.96GB] test-topic p59 -> 1213
[510.32GB] test-topic p37 -> 1213
Broker 1217 relocations planned:
[none]
Broker 1224 relocations planned:
[692.60GB] test-topic p118 -> 1220
Broker 1225 relocations planned:
[255.21GB] test-topic p75 -> 1216
Broker 1247 relocations planned:
[none]
Broker 1255 relocations planned:
[660.11GB] test-topic p20 -> 1235
Broker 1256 relocations planned:
[none]
Broker 1376 relocations planned:
[none]
Partition map changes:
test-topic p20: [1255 1203] -> [1235 1203] replaced broker
test-topic p22: [1211 1212] -> [1211 1208] replaced broker
test-topic p37: [1217 1214] -> [1217 1213] replaced broker
test-topic p59: [1236 1214] -> [1236 1213] replaced broker
test-topic p75: [1225 1209] -> [1216 1209] replaced broker
test-topic p117: [1203 1247] -> [1200 1247] replaced broker
test-topic p118: [1247 1224] -> [1247 1220] replaced broker
test-topic p119: [1225 1209] -> [1225 1235] replaced broker
test-topic p125: [1212 1211] -> [1212 1236] replaced broker
Broker distribution:
degree [min/max/avg]: 2/7/4.30 -> 2/7/4.81
-
Broker 1200 - leader: 5, follower: 3, total: 8
Broker 1201 - leader: 4, follower: 4, total: 8
Broker 1202 - leader: 5, follower: 5, total: 10
Broker 1203 - leader: 4, follower: 5, total: 9
Broker 1205 - leader: 5, follower: 5, total: 10
Broker 1208 - leader: 4, follower: 5, total: 9
Broker 1209 - leader: 5, follower: 4, total: 9
Broker 1211 - leader: 5, follower: 4, total: 9
Broker 1212 - leader: 5, follower: 4, total: 9
Broker 1213 - leader: 4, follower: 6, total: 10
Broker 1214 - leader: 5, follower: 3, total: 8
Broker 1215 - leader: 5, follower: 5, total: 10
Broker 1216 - leader: 6, follower: 5, total: 11
Broker 1217 - leader: 5, follower: 5, total: 10
Broker 1220 - leader: 5, follower: 5, total: 10
Broker 1223 - leader: 5, follower: 5, total: 10
Broker 1224 - leader: 5, follower: 4, total: 9
Broker 1225 - leader: 4, follower: 5, total: 9
Broker 1234 - leader: 5, follower: 5, total: 10
Broker 1235 - leader: 4, follower: 6, total: 10
Broker 1236 - leader: 4, follower: 6, total: 10
Broker 1247 - leader: 5, follower: 5, total: 10
Broker 1254 - leader: 5, follower: 5, total: 10
Broker 1255 - leader: 4, follower: 5, total: 9
Broker 1256 - leader: 5, follower: 5, total: 10
Broker 1267 - leader: 5, follower: 4, total: 9
Broker 1376 - leader: 5, follower: 5, total: 10
Storage free change estimations:
range: 2031.15GB -> 971.02GB
range spread: 130.47% -> 53.45%
std. deviation: 521.41GB -> 305.21GB
-
Broker 1200: 3587.97 -> 2787.77 (-800.20GB, -22.30%)
Broker 1201: 2708.39 -> 2708.39 (+0.00GB, 0.00%)
Broker 1202: 2209.01 -> 2209.01 (+0.00GB, 0.00%)
Broker 1203: 1865.20 -> 2665.40 (+800.20GB, 42.90%)
Broker 1205: 2120.30 -> 2120.30 (+0.00GB, 0.00%)
Broker 1208: 3224.55 -> 2398.75 (-825.81GB, -25.61%)
Broker 1209: 1912.19 -> 2739.93 (+827.74GB, 43.29%)
Broker 1211: 1873.23 -> 2475.35 (+602.12GB, 32.14%)
Broker 1212: 1916.88 -> 2742.69 (+825.81GB, 43.08%)
Broker 1213: 3165.90 -> 1976.62 (-1189.28GB, -37.57%)
Broker 1214: 1556.82 -> 2746.10 (+1189.28GB, 76.39%)
Broker 1215: 2091.04 -> 2091.04 (+0.00GB, 0.00%)
Broker 1216: 2150.41 -> 1895.21 (-255.21GB, -11.87%)
Broker 1217: 1816.75 -> 1816.75 (+0.00GB, 0.00%)
Broker 1220: 2877.80 -> 2185.20 (-692.60GB, -24.07%)
Broker 1223: 2347.95 -> 2347.95 (+0.00GB, 0.00%)
Broker 1224: 1977.97 -> 2670.58 (+692.60GB, 35.02%)
Broker 1225: 1960.09 -> 2215.30 (+255.21GB, 13.02%)
Broker 1234: 2109.06 -> 2109.06 (+0.00GB, 0.00%)
Broker 1235: 3369.32 -> 1881.47 (-1487.85GB, -44.16%)
Broker 1236: 2656.35 -> 2054.22 (-602.12GB, -22.67%)
Broker 1247: 1956.20 -> 1956.20 (+0.00GB, 0.00%)
Broker 1254: 2416.52 -> 2416.52 (+0.00GB, 0.00%)
Broker 1255: 1850.83 -> 2510.94 (+660.11GB, 35.67%)
Broker 1256: 1986.07 -> 1986.07 (+0.00GB, 0.00%)
Broker 1267: 2301.33 -> 2301.33 (+0.00GB, 0.00%)
Broker 1376: 2065.64 -> 2065.64 (+0.00GB, 0.00%)
New partition maps:
test-topic.json
```

Results after applying test-topic.json (red bars indicate start and finish events from autothrottle).
The rebalance command can effectively be used to scale a topic incrementally (introducing new brokers in addition to existing brokers). This is done by providing the list of existing brokers hosting a topic along with the additional brokers.
The default `--storage-threshold` of 0.2 is best suited for targeting moderate to extreme outlier brokers in a normal rebalance scenario. In a scaling scenario, it's likely desirable to draw partitions from most or all of the original brokers for relocation to the newly provided brokers.
There are several ways to do this (see the example command after this list):
- setting an explicit `--storage-threshold-gb` value
- lowering the `--storage-threshold` value
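For instance, a hypothetical scale-up (broker IDs 1300 and 1301 are illustrative additions to the original set) might lower the threshold so that nearly all original brokers become offload targets:

```
$ topicmappr rebalance --topics "test-topic" --brokers 1200,1201,1202,1203,1300,1301 --storage-threshold 0.01
```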
If a scale-up is intended to target all original brokers, it's highly recommended to add an equal number of brokers per `rack.id` in use. Otherwise, brokers will not be able to schedule relocations unless `--locality-scoped` is set to false.
Enabling `--verbose` prints placement decision information per offload target, per partition.
An offload target may have no relocations planned ([none] in the plan output) for several reasons:
- It has few, large partitions, and even the smallest one available would free up too much storage on the source or consume too much on any destination.
- All partitions examined were too large to find an optimal relocation. Increasing the `--partition-limit` flag beyond the default of 30 increases the likelihood of finding a possible relocation (if the broker holds more than 30 partitions).
- No suitable destination brokers have enough free storage. Possible actions:
  - adding additional brokers to the congested `rack.id` locality
  - disabling locality scoping (`--locality-scoped=false`)
  - relaxing the `--tolerance` (this may result in poor storage free range spread)
The storage free range is a key metric for improving storage balance. A poor range can be a result of offload targets being unable to schedule relocations (see above). In other cases, changing the `--tolerance` up or down in 0.02 increments can improve results. This may require trial and error, because no single tolerance value (which sets the high/low storage limits for source and destination brokers) is universally optimal. Factors such as partition counts, distribution, sizes, broker counts, replica locality, and other constraints make this a difficult problem to optimize for.
Likewise, which brokers are targeted for offloading is an influencing factor. Larger `--storage-threshold` values (such as the default 20%) are intended to target outlier brokers. If balance is reasonably good to begin with, lower values (such as the 5% in the example) can be used to target more brokers, which opens more opportunity for improved balance. At some point, it may be best to use the rebuild command with its storage placement functionality and simply build a storage-optimal map from scratch on a new set of target brokers.