rebalance command

Rebalance

rebalance is used for:

  • targeted broker storage rebalancing*
  • incremental scaling

*In contrast to storage rebalancing in rebuild (which requires that 100% of partitions for a targeted topic are relocated), rebalance performs partial partition rebalancing, moving data from the most to the least storage-utilized brokers.

Rebalance takes an input topic list (similar to rebuild: comma delimited with regex support) and a broker list. Typically, the broker list would include all brokers that the target topic(s) currently occupy. Removing brokers is not allowed in rebalance; only adding additional, new brokers is permitted.

Rebalance uses the same broker/topic metrics mechanism as rebuild (both of which can be supplemented with metricsfetcher). Rebalance works by examining the free storage on all referenced brokers and selecting those that are more than 20% below the harmonic mean (configurable via the --storage-threshold parameter). Alternatively, brokers below a fixed amount of free storage in gigabytes can be targeted using the --storage-threshold-gb parameter. For each broker targeted for partition offloading, partitions are planned for relocation to the least-utilized destination. Relocations can be scoped by rack.id via the --locality-scoped flag. For instance, if rack.id values reflected physical data centers, performing a rebalance with a locality scope would rebalance partitions among brokers within each data center in isolation.
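
As an illustration of the offload-target selection just described, here's a minimal Go sketch; the broker IDs and free-storage values are hypothetical, and this is not topicmappr's actual implementation:

// Sketch: select offload targets as brokers whose free storage falls
// more than --storage-threshold below the harmonic mean.
package main

import "fmt"

func main() {
    // Hypothetical free storage per broker, in GB.
    free := map[int]float64{1200: 3500, 1203: 1865, 1213: 3165, 1214: 1556}

    // Harmonic mean of free storage across all referenced brokers.
    var recipSum float64
    for _, f := range free {
        recipSum += 1 / f
    }
    hmean := float64(len(free)) / recipSum

    // Brokers more than threshold (default 0.20) below the harmonic
    // mean become offload targets.
    threshold := 0.20
    for id, f := range free {
        if f < hmean*(1-threshold) {
            fmt.Printf("broker %d targeted for offloading (%.2fGB free, hmean %.2fGB)\n", id, f, hmean)
        }
    }
}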

Destination broker suitability is determined as either of the following (see the sketch after this list):

  • (locality scoped) the least utilized broker with the same rack.id as the offload target
  • (non locality scoped) the least utilized broker that wouldn't result in duplicate rack.id values in the resulting ISR
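
The destination choice can be sketched in the same way; the broker type, example values, and pickDestination helper below are hypothetical and only illustrate the two suitability rules above:

// Sketch: pick the least-utilized suitable destination for a partition
// being offloaded, under locality-scoped or non-scoped rules.
package main

import "fmt"

type broker struct {
    id     int
    rack   string
    freeGB float64
}

func pickDestination(cands []broker, target broker, isrRacks map[string]bool, localityScoped bool) (broker, bool) {
    var best broker
    found := false
    for _, b := range cands {
        if localityScoped {
            // Locality scoped: destination must share the offload
            // target's rack.id.
            if b.rack != target.rack {
                continue
            }
        } else if isrRacks[b.rack] {
            // Non locality scoped: skip any broker whose rack.id is
            // already present in the partition's ISR.
            continue
        }
        if !found || b.freeGB > best.freeGB {
            best, found = b, true // least utilized = most free storage
        }
    }
    return best, found
}

func main() {
    cands := []broker{{1200, "a", 3587}, {1213, "b", 3165}, {1235, "c", 3369}}
    target := broker{1214, "b", 1556}
    isrRacks := map[string]bool{"b": true} // racks already in the ISR
    if d, ok := pickDestination(cands, target, isrRacks, false); ok {
        fmt.Printf("destination: broker %d (%.0fGB free)\n", d.id, d.freeGB)
    }
}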

The --tolerance flag specifies limits on how much data can be moved from offload targets and to destination targets, as a distance (in percent) from the arithmetic mean of broker free storage. If using the default 10% and a mean free storage of 800GB, partition movement planning per target will stop when either of the following holds (see the sketch after this list):

  • the target free storage would exceed 880GB (mean+10%)
  • any partition movement would push the most suitable destination below 720GB (mean-10%)
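
These limits amount to a small calculation; a minimal sketch in Go using the numbers from the example above (not topicmappr code):

package main

import "fmt"

func main() {
    mean := 800.0     // arithmetic mean free storage, GB
    tolerance := 0.10 // --tolerance default

    // Offloading from a target stops once its free storage would
    // exceed mean+10%; a destination is never pushed below mean-10%.
    sourceCeiling := mean * (1 + tolerance)
    destFloor := mean * (1 - tolerance)

    fmt.Printf("offload targets stop at %.0fGB free\n", sourceCeiling) // 880GB
    fmt.Printf("destinations stay above %.0fGB free\n", destFloor)     // 720GB
}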

All partition movement planning halts when no offload target has any remaining relocations to schedule. A plan summary and partition map are then printed.

Rebalancing Example

Fetching up-to-date metrics data with metricsfetcher:

$ metricsfetcher --broker-storage-query "avg:system.disk.free{cluster:kafka-test,device:/data}" --partition-size-query "max:kafka.log.partition.size{cluster:kafka-test} by {topic,partition}"
Submitting max:kafka.log.partition.size{cluster:kafka-test} by {topic,partition}.rollup(avg, 3600)
success
Submitting avg:system.disk.free{cluster:kafka-test,device:/data} by {broker_id}.rollup(avg, 3600)
success

Data written to ZooKeeper

Running rebalance for "test-topic", providing all of the brokers that "test-topic" partitions reside on:

$ topicmappr rebalance --topics "test-topic" --brokers 1200,1201,1202,1203,1205,1208,1209,1211,1212,1213,1214,1215,1216,1217,1220,1223,1224,1225,1234,1235,1236,1247,1254,1255,1256,1267,1376 --storage-threshold 0.05 --tolerance 0.2 | grep -v no-op

Topics:
  test-topic

Validating broker list:
  OK

Rebalance parameters:
  Free storage mean, harmonic mean: 2299.03GB, 2199.97GB
  Broker free storage limits (with a 20.00% tolerance from mean):
    Sources limited to <= 2758.83GB
    Destinations limited to >= 1839.22GB

Brokers targeted for partition offloading (>= 5.00% threshold below hmean):
  1203
  1209
  1211
  1212
  1214
  1217
  1224
  1225
  1247
  1255
  1256
  1376

Broker 1203 relocations planned:
    [800.20GB] test-topic p117 -> 1200

Broker 1209 relocations planned:
    [827.74GB] test-topic p119 -> 1235

Broker 1211 relocations planned:
    [602.12GB] test-topic p125 -> 1236

Broker 1212 relocations planned:
    [825.81GB] test-topic p22 -> 1208

Broker 1214 relocations planned:
    [678.96GB] test-topic p59 -> 1213
    [510.32GB] test-topic p37 -> 1213

Broker 1217 relocations planned:
  [none]

Broker 1224 relocations planned:
    [692.60GB] test-topic p118 -> 1220

Broker 1225 relocations planned:
    [255.21GB] test-topic p75 -> 1216

Broker 1247 relocations planned:
  [none]

Broker 1255 relocations planned:
    [660.11GB] test-topic p20 -> 1235

Broker 1256 relocations planned:
  [none]

Broker 1376 relocations planned:
  [none]

Partition map changes:
  test-topic p20: [1255 1203] -> [1235 1203] replaced broker
  test-topic p22: [1211 1212] -> [1211 1208] replaced broker
  test-topic p37: [1217 1214] -> [1217 1213] replaced broker
  test-topic p59: [1236 1214] -> [1236 1213] replaced broker
  test-topic p75: [1225 1209] -> [1216 1209] replaced broker
  test-topic p117: [1203 1247] -> [1200 1247] replaced broker
  test-topic p118: [1247 1224] -> [1247 1220] replaced broker
  test-topic p119: [1225 1209] -> [1225 1235] replaced broker
  test-topic p125: [1212 1211] -> [1212 1236] replaced broker

Broker distribution:
  degree [min/max/avg]: 2/7/4.30 -> 2/7/4.81
  -
  Broker 1200 - leader: 5, follower: 3, total: 8
  Broker 1201 - leader: 4, follower: 4, total: 8
  Broker 1202 - leader: 5, follower: 5, total: 10
  Broker 1203 - leader: 4, follower: 5, total: 9
  Broker 1205 - leader: 5, follower: 5, total: 10
  Broker 1208 - leader: 4, follower: 5, total: 9
  Broker 1209 - leader: 5, follower: 4, total: 9
  Broker 1211 - leader: 5, follower: 4, total: 9
  Broker 1212 - leader: 5, follower: 4, total: 9
  Broker 1213 - leader: 4, follower: 6, total: 10
  Broker 1214 - leader: 5, follower: 3, total: 8
  Broker 1215 - leader: 5, follower: 5, total: 10
  Broker 1216 - leader: 6, follower: 5, total: 11
  Broker 1217 - leader: 5, follower: 5, total: 10
  Broker 1220 - leader: 5, follower: 5, total: 10
  Broker 1223 - leader: 5, follower: 5, total: 10
  Broker 1224 - leader: 5, follower: 4, total: 9
  Broker 1225 - leader: 4, follower: 5, total: 9
  Broker 1234 - leader: 5, follower: 5, total: 10
  Broker 1235 - leader: 4, follower: 6, total: 10
  Broker 1236 - leader: 4, follower: 6, total: 10
  Broker 1247 - leader: 5, follower: 5, total: 10
  Broker 1254 - leader: 5, follower: 5, total: 10
  Broker 1255 - leader: 4, follower: 5, total: 9
  Broker 1256 - leader: 5, follower: 5, total: 10
  Broker 1267 - leader: 5, follower: 4, total: 9
  Broker 1376 - leader: 5, follower: 5, total: 10

Storage free change estimations:
  range: 2031.15GB -> 971.02GB
  range spread: 130.47% -> 53.45%
  std. deviation: 521.41GB -> 305.21GB
  -
  Broker 1200: 3587.97 -> 2787.77 (-800.20GB, -22.30%)
  Broker 1201: 2708.39 -> 2708.39 (+0.00GB, 0.00%)
  Broker 1202: 2209.01 -> 2209.01 (+0.00GB, 0.00%)
  Broker 1203: 1865.20 -> 2665.40 (+800.20GB, 42.90%)
  Broker 1205: 2120.30 -> 2120.30 (+0.00GB, 0.00%)
  Broker 1208: 3224.55 -> 2398.75 (-825.81GB, -25.61%)
  Broker 1209: 1912.19 -> 2739.93 (+827.74GB, 43.29%)
  Broker 1211: 1873.23 -> 2475.35 (+602.12GB, 32.14%)
  Broker 1212: 1916.88 -> 2742.69 (+825.81GB, 43.08%)
  Broker 1213: 3165.90 -> 1976.62 (-1189.28GB, -37.57%)
  Broker 1214: 1556.82 -> 2746.10 (+1189.28GB, 76.39%)
  Broker 1215: 2091.04 -> 2091.04 (+0.00GB, 0.00%)
  Broker 1216: 2150.41 -> 1895.21 (-255.21GB, -11.87%)
  Broker 1217: 1816.75 -> 1816.75 (+0.00GB, 0.00%)
  Broker 1220: 2877.80 -> 2185.20 (-692.60GB, -24.07%)
  Broker 1223: 2347.95 -> 2347.95 (+0.00GB, 0.00%)
  Broker 1224: 1977.97 -> 2670.58 (+692.60GB, 35.02%)
  Broker 1225: 1960.09 -> 2215.30 (+255.21GB, 13.02%)
  Broker 1234: 2109.06 -> 2109.06 (+0.00GB, 0.00%)
  Broker 1235: 3369.32 -> 1881.47 (-1487.85GB, -44.16%)
  Broker 1236: 2656.35 -> 2054.22 (-602.12GB, -22.67%)
  Broker 1247: 1956.20 -> 1956.20 (+0.00GB, 0.00%)
  Broker 1254: 2416.52 -> 2416.52 (+0.00GB, 0.00%)
  Broker 1255: 1850.83 -> 2510.94 (+660.11GB, 35.67%)
  Broker 1256: 1986.07 -> 1986.07 (+0.00GB, 0.00%)
  Broker 1267: 2301.33 -> 2301.33 (+0.00GB, 0.00%)
  Broker 1376: 2065.64 -> 2065.64 (+0.00GB, 0.00%)

New partition maps:
  test-topic.json

Results after applying test-topic.json (red bars indicate start and finish events from autothrottle):

[storage free graph]

Scaling Example

The rebalance command can also be used to scale a topic incrementally, introducing new brokers in addition to existing ones. This is done by providing the list of existing brokers hosting a topic along with the additional brokers.

The default --storage-threshold of 0.2 is best suited for targeting moderate to extreme outlier brokers in a normal rebalance scenario. In a scaling scenario, it's likely desirable to draw partitions from most or all of the original brokers and relocate them to the newly provided brokers.

There are a couple of ways to do this (example invocations follow the list):

  • setting an explicit --storage-threshold-gb value
  • lowering the --storage-threshold value
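
For example, assuming hypothetical new brokers 1300 and 1301 being added to an existing set (the threshold values here are illustrative, not recommendations):

$ topicmappr rebalance --topics "test-topic" --brokers 1200,1201,1202,1203,1300,1301 --storage-threshold-gb 2300

$ topicmappr rebalance --topics "test-topic" --brokers 1200,1201,1202,1203,1300,1301 --storage-threshold 0.05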

If a scale-up is intended to target all original brokers, it's highly recommended to add an equal number of new brokers per rack.id in use. Otherwise, some brokers will be unable to schedule relocations unless --locality-scoped is set to false.

Troubleshooting

Enabling --verbose prints placement decision information for each partition on each offload target.
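
For example (broker list abbreviated; IDs illustrative):

$ topicmappr rebalance --topics "test-topic" --brokers 1200,1201,1202 --verbose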

An offload target may list no partitions scheduled for relocation for any of the following reasons:

  • It has only a few large partitions, and even the smallest available would free up too much storage on the source or consume too much on any destination.
  • All partitions examined were too large to find an optimal relocation. Increasing the --partition-limit flag beyond the default of 30 increases the likelihood of finding a possible relocation (if the broker holds more than 30 partitions).
  • No suitable destination brokers have enough free storage. Possible actions:
    • adding additional brokers to the congested rack.id locality
    • disabling locality scoping (--locality-scoped=false)
    • relaxing the --tolerance (this may result in poor storage free range spread)

Storage utilization range isn't improving

The storage free range is a key metric of storage balance. Sometimes a stuck range is a result of offload targets being unable to schedule relocations (see above). In other cases, adjusting --tolerance up or down in 0.02 increments can improve results; a sweep like the sketch below can help. This can require trial and error because no single tolerance value (which sets the source and destination brokers' high/low storage limits) is universally optimal. Factors such as partition counts, distribution, sizes, broker counts, replica locality, and other constraints make this a difficult problem to optimize for.
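
A sketch of such a sweep (broker list abbreviated; tolerance values illustrative), comparing the reported spread and deviation across runs:

$ for t in 0.06 0.08 0.10 0.12; do
    echo "--tolerance $t"
    topicmappr rebalance --topics "test-topic" --brokers 1200,1201,1202 --tolerance $t | grep -E "range spread|std. deviation"
  done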

Likewise, which brokers are targeted for offloading is an influencing factor. Larger --storage-threshold values (such as the default 20%) are intended to target outlier brokers. If balance is somewhat good to begin with, lower values (such as the 5% in the example) can be used to target more brokers, which opens more opportunity for improved balance. At some point, it may be best to use the rebuild command with its storage placement functionality and simply build a storage-optimal map from scratch on a new set of target brokers.
