
Constraints to de-prioritize nodes from becoming shard allocation targets #43350

@vigyasharma

Description


The Problem

In clusters where each node holds a reasonable number of shards, replacing a node (or adding a small number of nodes) causes all shards of newly created indexes to be allocated on the new, empty node(s). This puts a lot of stress on the single (or few) new node(s), degrading overall cluster performance.

This happens because the index-balance factor in the weight function cannot offset the shard-balance factor when the other nodes are already filled with shards. As a result, the new node always has the minimum weight and gets picked as the allocation target, until it holds approximately the mean number of shards per node in the cluster.

Further, index-balance only kicks in after the node is relatively full (compared to other nodes), at which point it moves out the recently allocated shards. Thus for the same shards, which are newly created and usually receiving indexing traffic, we end up doing the allocation work twice.

Current Workaround

A workaround is to set the total-shards-per-node index/cluster setting, which limits the number of shards of an index on a single node.

This, however, is a hard limit enforced by allocation deciders. If it is breached on all nodes, the shards remain unassigned, causing yellow/red clusters. Configuring this setting requires careful calculation around the number of nodes, and it must be revised whenever the cluster is scaled down.

Proposal

We propose an allocation constraint mechanism that de-prioritizes nodes from being picked as allocation targets when they breach certain constraints. Whenever an allocation constraint is breached for a shard (or index) on a node, we add a high positive constant to the node's weight. The increased weight makes the node less preferable for allocating that shard. Unlike deciders, however, this is not a hard filter: if no other nodes are eligible to accept shards (say, due to deciders like disk watermarks), the shard can still be allocated on a node with breached constraints.

[Figure: Constraint Based Weights - Step Function Diagram]

Index Shards Per Node Constraint

This constraint controls the number of shards of an index on a single node. indexShardsPerNodeConstraint is breached if the number of shards of an index allocated on a node exceeds the average number of shards per node for that index.

```java
// Shards of this index the node would hold if the shard under consideration lands here:
long expIndexShardsOnNode = node.numShards(index) + numAdditionalShards;
// Allowed limit: average shards per node for the index, rounded up:
long allowedShardsPerNode = Math.round(Math.ceil(balancer.avgShardsPerNode(index)));
boolean shardPerNodeLimitBreached = (expIndexShardsOnNode - allowedShardsPerNode) > 0;
```

When indexShardsPerNodeConstraint is breached, shards of the newly created index get assigned to other nodes, preventing indexing hot spots. After unassigned shard allocation, rebalancing fills up the node with shards of other indexes, without having to move out the already allocated shards. Since the constraint does not prevent allocation, breaching it never leaves shards unassigned.
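The check above can be illustrated as a minimal, self-contained sketch. The method name `isBreached` and its parameters are stand-ins for the balancer and node state in the real allocator, not Elasticsearch's actual API:

```java
public class IndexShardsPerNodeConstraint {

    /**
     * Returns true if placing one more shard of the index on this node
     * would exceed the (rounded-up) average shards per node for that index.
     */
    static boolean isBreached(int shardsOfIndexOnNode, int totalShardsOfIndex, int numNodes) {
        long expIndexShardsOnNode = shardsOfIndexOnNode + 1;  // the shard being allocated
        double avgShardsPerNode = (double) totalShardsOfIndex / numNodes;
        long allowedShardsPerNode = (long) Math.ceil(avgShardsPerNode);
        return expIndexShardsOnNode > allowedShardsPerNode;
    }

    public static void main(String[] args) {
        // 5-shard index on a 5-node cluster: average is 1 shard per node.
        System.out.println(isBreached(0, 5, 5)); // first shard on an empty node: false
        System.out.println(isBreached(1, 5, 5)); // a second shard on the same node: true
    }
}
```

So a new empty node can take its fair share of the index, but the constraint pushes further shards of the same index toward other nodes.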

The allocator flow now becomes:

deciders filter out ineligible nodes -> constraints de-prioritize certain nodes by increasing their weight -> the lowest-weight node is selected as the allocation target.
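The flow can be sketched as follows. The `Node` model, field names, and `HIGH_CONSTANT` value are illustrative assumptions, not the real BalancedShardsAllocator internals:

```java
import java.util.Comparator;
import java.util.List;
import java.util.Optional;

public class AllocatorFlowSketch {
    static final double HIGH_CONSTANT = 1_000_000.0;

    // Illustrative node model: a base weight plus decider/constraint outcomes.
    record Node(String name, double baseWeight, boolean deciderAllows, boolean constraintBreached) {}

    static Optional<Node> pickTarget(List<Node> nodes) {
        return nodes.stream()
                .filter(Node::deciderAllows)                 // 1. deciders: hard filter
                .min(Comparator.comparingDouble(n ->         // 3. lowest weight wins
                        n.baseWeight()
                        + (n.constraintBreached() ? HIGH_CONSTANT : 0))); // 2. constraints: soft penalty
    }

    public static void main(String[] args) {
        List<Node> nodes = List.of(
                new Node("new-empty-node", 0.0, true, true),   // lowest base weight, but breached
                new Node("node-a",         2.0, true, false),
                new Node("node-b",         9.0, false, false)); // filtered out by a decider
        System.out.println(pickTarget(nodes).orElseThrow().name()); // node-a
    }
}
```

Note that if the only decider-eligible node has a breached constraint, it is still returned; the constraint only reorders candidates, it never empties the candidate set.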

Extension

This framework can be extended with other rules by modeling them as boolean constraints. One possible example is primary shard density on a node, as required by issue #41543. A constraint requiring #primaries on a node <= avg-primaries per node could prevent scenarios where all primaries land on a few nodes and all replicas on others.
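A hypothetical sketch of such a primary-density constraint, in the same shape as the shards-per-node check (the method and its parameters are illustrative, not part of the proposal's code):

```java
public class PrimaryDensityConstraint {

    /** Breached if placing one more primary would exceed the average primaries per node. */
    static boolean isBreached(int primariesOnNode, int totalPrimaries, int numNodes) {
        double avgPrimariesPerNode = (double) totalPrimaries / numNodes;
        return primariesOnNode + 1 > Math.ceil(avgPrimariesPerNode);
    }

    public static void main(String[] args) {
        // 6 primaries across 3 nodes: average 2 primaries per node.
        System.out.println(isBreached(0, 6, 3)); // false
        System.out.println(isBreached(2, 6, 3)); // true
    }
}
```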

Comparing Multiple Constraints

For every constraint breached, we add a high positive constant to the weight function. This is a step-function approach: all nodes on a lower step (lower weight) are considered more eligible than those on a higher step. The constant is chosen high enough that a node's base weight can never push it onto the next step; only breaching another constraint can.

Since the constant is added once per breached constraint, i.e. c * HIGH_CONSTANT for c breached constraints, nodes with one breached constraint are preferred over nodes with two, and so on. All nodes with the same number of breached constraints fall back to the first part of the weight function, which is based on shard count.
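The step function above can be sketched in a few lines; the value of `HIGH_CONSTANT` here is an assumption, standing in for any constant larger than the achievable base weight:

```java
public class StepFunctionWeight {
    static final double HIGH_CONSTANT = 1_000_000.0; // larger than any realistic base weight

    static double weight(double baseWeight, int constraintsBreached) {
        return baseWeight + constraintsBreached * HIGH_CONSTANT;
    }

    public static void main(String[] args) {
        // A node with no breaches beats a node with one breach, regardless of base weight...
        System.out.println(weight(50.0, 0) < weight(0.0, 1)); // true
        // ...and among equally-breached nodes, the base weight decides.
        System.out.println(weight(1.0, 1) < weight(2.0, 1));  // true
    }
}
```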

We could potentially assign different weights to different constraints, with some sort of ranking among them, but this quickly becomes a maintenance and extension nightmare: adding a new constraint would require revisiting every other constraint's weight to decide its place in the priority order.

Next Steps

If this idea makes sense and seems like a useful addition to the Elasticsearch community, we can raise a PR for these changes.

This change targets the problems listed in #17213, #37638, #41543, #12279, and #29437.
