Rework ILM to not Require Inspecting all Indices on every Cluster State Update  #80407

Open
@original-brownbear

At the moment, ILM scales somewhat poorly as clusters grow to very large numbers of indices. The reason is that org.elasticsearch.xpack.ilm.IndexLifecycleService#clusterChanged does a full inspection of all indices in the cluster state, on every cluster state update, to see if there is work to be done by ILM.

This inspection of all the indices is itself fairly expensive because it requires (repeatedly) parsing per-index metadata into LifecycleExecutionState and, more importantly, calls the expensive org.elasticsearch.xpack.ilm.IndexLifecycleRunner#getCurrentStep(org.elasticsearch.xpack.ilm.PolicyStepsRegistry, java.lang.String, org.elasticsearch.cluster.metadata.IndexMetadata, org.elasticsearch.xpack.core.ilm.LifecycleExecutionState) in a hot loop.
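
To make the cost concrete, here is a minimal, self-contained sketch of the scan-everything pattern described above. It is not the Elasticsearch code: FullScanSketch, putIndex, onClusterStateUpdate and parseAndResolveStep are hypothetical stand-ins, and the "parse" is a trivial string operation used only to count how often the per-index lookup runs.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for the current model; none of these names are Elasticsearch APIs.
public class FullScanSketch {
    // index name -> serialized ILM step (stands in for per-index custom metadata)
    private final Map<String, String> indexToRawStep = new HashMap<>();
    private long stepLookups = 0;

    public void putIndex(String index, String rawStep) {
        indexToRawStep.put(index, rawStep);
    }

    // Called once per cluster state update: visits every index, whether or not it changed.
    public void onClusterStateUpdate() {
        for (Map.Entry<String, String> entry : indexToRawStep.entrySet()) {
            parseAndResolveStep(entry.getValue());
        }
    }

    // Stands in for parsing the execution state and resolving the current step.
    private String parseAndResolveStep(String raw) {
        stepLookups++;
        return raw.trim();
    }

    public long stepLookups() {
        return stepLookups;
    }

    public static void main(String[] args) {
        FullScanSketch ilm = new FullScanSketch();
        for (int i = 0; i < 1000; i++) {
            ilm.putIndex("index-" + i, "hot/rollover");
        }
        for (int update = 0; update < 10; update++) {
            ilm.onClusterStateUpdate();
        }
        // 1000 indices x 10 updates = 10000 lookups, although no index changed
        System.out.println(ilm.stepLookups());
    }
}
```

The work grows with (number of indices) x (number of cluster state updates), independent of how many indices actually need an ILM action.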

Ideally, ILM should be refactored into something more similar to the SnapshotService, which only does a full inspection of all snapshots+shards on a master failover and otherwise keeps track of its internal state directly on the master node.
Concretely, this would mean that when an index moves from one state to another, the required actions would be chained logically through a series of callbacks (rollover-step -> do rollover -> next-step) instead of the current model, where step transitions are triggered by the cluster state changes that the previous step caused.
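
A rough sketch of what such callback chaining could look like, purely for illustration. ChainedStepsSketch, Step, named and run are hypothetical names and not part of ILM; real steps would be asynchronous and would invoke the callback once their cluster state update or API call has completed, whereas this sketch runs them synchronously.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of callback-chained step transitions; not Elasticsearch code.
public class ChainedStepsSketch {
    /** A step does its work, then invokes onDone to hand control to the next step. */
    public interface Step {
        void execute(String index, Runnable onDone);
    }

    private final List<String> log = new ArrayList<>();

    // Builds a step that records its execution and then signals completion.
    public Step named(String name) {
        return (index, onDone) -> {
            log.add(index + ":" + name);
            onDone.run();
        };
    }

    // Runs the steps starting at position 'from' for one index.
    // Each completion directly triggers the next step; no scan over other indices is needed.
    public void run(String index, List<Step> steps, int from) {
        if (from >= steps.size()) {
            return;
        }
        steps.get(from).execute(index, () -> run(index, steps, from + 1));
    }

    public List<String> log() {
        return log;
    }

    public static void main(String[] args) {
        ChainedStepsSketch ilm = new ChainedStepsSketch();
        List<Step> policy = List.of(
            ilm.named("rollover-step"),
            ilm.named("do-rollover"),
            ilm.named("next-step"));
        ilm.run("logs-000001", policy, 0);
        System.out.println(ilm.log());
    }
}
```

In this shape, the work done for a transition depends only on the index that is transitioning; a full pass over all indices would only be needed to rebuild the in-memory state, for example after a master failover.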

This would make ILM's per-update cost pretty much O(1) in the number of indices outside of the master-failover scenario, where a full inspection would still be needed.

Relates #77466
