Description
ILM rollover action can fail for a variety of reasons. The following are 2 examples:
- Cluster state publishing/master timeout
- Flood stage threshold/write blocks on indices
Without automatic retries, any failures in the rollover action can result in an undesirable behavior where the current index can continue to grow many times beyond the max_size set with shards end up being hundreds of Gbs in size - unless the administrator detects this in time to manually intervene.
Depending on the type of failure, manual intervention may also require more than just calling the _retry endpoint for ILM. For example, if the failure is failed to process cluster event (index-aliases) within 30s
, it will result in an index created without an update to the alias so that the index is not used. So the admin will have to realize this and do some cleanup of the index before attempting the _retry call.
It will be great if ILM can automatic retry (and perform any clean up actions required). For the first use case above, this means realizing that an index was created without a successful update of the alias and be able to clean up automatically before retrying. For the second use case above, this means after the admin has addressed flood stage watermark being hit (i.e., disk capacity and removing write blocks from the indices), be able to automatically retry without requiring admins to remember to go back to ILM and issue a retry after removing write blocks from Elasticsearch indices.
Discussed with @gwbrown. We decided to create this separate issue to special-case rollover for automatic retries given the implications of an unnoticed failed rollover causing other performance issues in the cluster (due to the unbounded growth of index size until hitting flood stage), before implementing a more generic retry strategy that works for all lLM actions.