A significant shortcoming of Ceph's CRUSH algorithm is uniform data placement. The availble storage capacity of a Ceph cluster for new data is limited by how full the most-full OSD is at a given time. CRUSH assigns Placement Groups (PG)s to OSDs probabilistically, where probabilities are adjusted by storage device, node, rack, etc. (aggregate) sizes. This can result in non-uniform distribution of data, which means that some of your precious storage capacity cannot be used. The default mitigation is Ceph's built-in Manager balancer, but it only considers PG shard numbers, not their sizes.
This is an alternative Ceph balancer implementation.
- The upstream balancer optimizes for (weighted) equal number of PGs on each OSD for each pool.
- This balancer optimizes for equal OSD storage utilization and PG placement across all pools.
- This balancer yields better results when multiple CRUSH roots or a very large number of RADOS pools are present.
For most clusters, the mgr balancer works well.
For heterogeneous clusters with a lot of device and server capacity variance and many pools, placement may be very bad - the reason this balancer was created.
Each RADOS pool is split into placement groups (PGs).
The number of PGs for each pool is configured via the pg_num pool attribute, which may or may not be managed by the PG autoscaler.
When the PG autoscaler is used, it advised to increase the default values of mon_target_pg_per_osd and mon_max_pg_per_osd from their default values of 100 and 250 to 300 and 600, respectively.
This is important for balancing as the defaults result in PGs that each may grow to hold a relatively large amount of data, which confounds the balancer's ability to distribute them uniformly.
Clusters with OSD media larger than, say, 15TB may find it advantageous to also deploy more than one OSD per device.
Hence, one PG roughly has size pg_size = pool_size/pg_num.
PGs are placed on OSDs (a disk), using constraints defined via CRUSH.
Usually PGs are replicated across servers, so that when one server goes down, enough OSDs in other servers survive, and data remains available.
The number and kind of OSDs to be used for a given PG's shards is configured via the pool's size property and associated CRUSH rule.
It is best in most cases for each PG to store 200-300 shards: when the PG autoscaler is in use, the mon_target_pg_per_osd option sets a ceiling on its budget. This PG ratio is shown by the ceph osd df command in the PGS column.
There are two kind of pools: replicated and erasure coded.
The utilized space on one OSD is the sum of all PG shards that are stored on it.
A PG shard is the data of one of the participating OSDs in one PG.
For replica-pools, the shard size equals pg_size / pool_size - one full copy.
For EC-pools, the shard size equals pg_size / (pool_size * k), k is the number of data chunks in the EC profile.
CRUSH organizes all the datacenters, racks, servers, OSDs in a tree structure: the CRUSH topology. Each subtree usually has the weight of all the OSDs below it, and PGs are now distributed evenly, weighted by the sub-tree size at each tree level. That way servers/racks/datacenters/... with larger aggreate raw capacities get more data than small ones, but the relative amount is the same.
In theory, each OSD should thus be filled exactly the same relative amount, all are e.g. 30% full. In practice, not so much:
One cluster, which was the motivation to create this balancer, had devices (same device class, weighted at 1.0) ranging from 55% to 80% size utilization.
The reason is this: the cluster has many pools of different sizes (at time of writing 46), OSD sizes vary from 1T to 14T, 4 to 40 OSDs per server.
And the mgr balancer couldn't handle this.
Ceph's included balancer optimizes by PG count on devices. It does so by analyzing each pool independently, and then tries to move each pool's PGs so that each participating device has equal normalized PG counts. Normalized means placing double the PGs on a double-sized OSD.
Example: placing 600 PGs on a 2T and 4T OSD means each 1T gets 600PGs/(4T+2T) = 100PGs/T, thus the 2T OSD gets 200PGs, the 4T OSD 400.
PG counts are powers of two, so distributing them really equally will almost never work.
Because of this, the best possible solution is some OSDs having an offset of 1 PG to the ideal count.
As a PG-distribution-optimization is done per pool, without checking other pool's distribution at all, some devices will be the +1 more often than others.
At worst one OSD is the +1 for each pool in the cluster.
This OSD can then end up, for example, being 80% full, since it is a +1 for 20 times.
The one that never is the +1 is 50% full.
That's bad and not balanced at all.
Additionally, the shard sizes of PGs are not equal - they shrink and grow with pool usage, whereas the PG count will remain exactly the same, which the mgr-balancer uses, so it does nothing.
To make things worse, if there's a huge server in the cluster which is so big, CRUSH can't place data often enough on it to fill it to the same level as any other server, the balancer will fail moving PGs across servers that actually would have space. This happens since it sees only this server's OSDs as "underfull", but each PG has one shard on that server already, so no data can be moved to it. Even though there are likely other OSDs in the cluster (which may have variations from 90% to 50% full, and determine the available space because one of them is the fullest overall), these other OSDs are not balanced - the only considered balancing target is the big empty server.
So we have two main issues:
- If you have multiple bigger pools, the
+1-PG placement does not consider the global view (i.e. where other pools place the +1 PGs) - If there's too-empty buckets (which can't be filled more because of crush constraints), other buckets are no longer balanced
To solve this, the main optimization goal is equal OSD utilization:
Generate candidate PG movements and validate them against the crush constraints, PG counts, utilization estimations, ...
To get PG candidates, order all OSDs by utilization (optionally only for one crush root).
Utilization is estimated from all PGs where the OSD is in the up set (due to ongoing partial PG transfers).
From the fullest OSD, try to move a "suitable" PG shard on it to the least-utilized OSD.
If this violates constraints, try the next least-utilized OSD, and so on, or try a different PG.
Once a suitable OSD is found, check if the new placement decreases the cluster utilization variance. If this is the case, record the PG movement and try to move another PG with the same approach.
That way the balancer generates new "upmap items", i.e. movement instructions for a PG from some OSDs to better ones, which you can apply if you're satisfied with the results.
If this is done forever, all OSDs will have very little utilization variance, or CRUSH constraints prevent us from doing more PG movements.
Simplified pseudo code:
# given an OSD we want to empty, which pg do we select?
def get_pg_move_candidates(sourceosd):
def compare(pg, otherpg):
return (pg.remapped > otherpg.remapped or
pg.upmap_item_count < otherpg.upmap_item_count or
pg.size > otherpg.size)
return sort(pgs_on_osd(sourceosd), compare)
try_limit = 1
max_movements = 10
while True:
next_move:
if len(movements) >= max_movements:
stop()
# try to empty the fullest OSDs
for i, from_osd in enumerate(osds_by_utilization_asc(crushroot)):
if i > try_limit:
finish('could not empty {try_limit} fullest devices, stopping')
stop()
# try a suitable PG to move it away from a full osd
for pg in get_pg_move_candidates(from_osd):
# move it to an empty osd
for to_osd in osds_by_utilization_desc(candidate_osds_for(pg)):
# only move if constraints allow it
if (is_crush_move_valid(pg, from_osd, to_osd) and
respects_pool_balance(pg, from_osd, to_osd) and
cluster_utilization_variance_is_better(pg, from_osd, to_osd)):
movements.append((pg, from_osd, to_osd))
goto next_move
# resulting movements
for movement in movements:
print(f"ceph osd pg-upmap-items {generate_upmap(movement)}")Runtime:
- Worst-case (the fullest OSD can't be emptied more):
O(OSDs * PGs) - If, after that, we tried the second-to fullest, third, and all others, it would be:
O(OSDs²*PGs)
Likely this can be optimized further.
# to generate max 10 pg movements:
./placementoptimizer.py -v balance --max-pg-moves 10 | tee /tmp/balance-upmaps
# -> if you're satisfied, run: $ bash /tmp/balance-upmaps
# but it can do more than balance!
# there's some examples below.
./placementoptimizer.py --help
You can use the balancer's size predictions to view how much space your cluster actually has free. Ceph's integrated "free space" number is rather off, it doesn't consider the real pg placement and resulting space limits at all.
Pool utilization:
./placementoptimizer.py show # pool free sapace for the current cluster state (in the acting-state)
./placementoptimizer.py show --pgstate up # when all movements are done (i.e. the up-state is reached)
OSD utilization:
./placementoptimizer.py show --osds --sort-utilization
./placementoptimizer.py show --osds --per-pool-count
./placementoptimizer.py show --osds --per-pool-count --sort-utilization --only-crushclass hdd
Ongoing movement status
./placementoptimizer.py remap show
./placementoptimizer.py remap show --by-osd
./placementoptimizer.py remap show --by-osd --osds 13337,4242
For debugging (or archiving), you can store all needed cluster state in a .xz file.
./placementoptimizer.py -v gather /tmp/lolfile.xz
To use this file instead of the "live" cluster state, use --state in all the usual commands:
./placementoptimizer.py -v show --state /tmp/lolfile.xz
./placementoptimizer.py remap show --state /tmp/lolfile.xz
./placementoptimizer.py balance --state /tmp/lolfile.xz
The script is not the prettiest (yet), but produces balancing-improvement movements.
Ideally, with some further improvements and tuning, it could be integrated in upstream-Ceph as an alternative balancer implementation.
So if you have any idea and suggestion how to improve things, please submit issues and pull requests.
If you have questions, suggestions, encounter any problem, please join our chat and ask!
- Matrix Chat:
#sfttech:matrix.org
Of course, create issues and pull requests.
Released under the GNU General Public License version 3 or later, see COPYING and LICENSE for details.