Description
If an index is a CCR leader or follower index, then the delete and shrink actions should wait before proceeding with any operations. This avoids the problems described under the original problem description.
Leader indices
ILM needs to query the indices stats API and check the shard history retention leases to determine whether an index is a leader index.
If an index is a leader index then the delete and shrink actions first need to execute the following steps:
- Set the `index.lifecycle.indexing_complete` index setting to `true`.
- Periodically query the indices stats API and check whether there are no shard history retention leases for the leader index.
After this it is safe to proceed with the steps that are part of the ILM delete and shrink actions.
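The lease check in the leader steps can be sketched as follows. This is a minimal illustration, not ILM's actual implementation: it assumes `stats` is the parsed JSON body of `GET /<index>/_stats?level=shards` and that each shard copy carries a `retention_leases.leases` list. The function names and the optional `source` filter are hypothetical.

```python
from typing import Any, Dict, Optional


def count_retention_leases(stats: Dict[str, Any], index: str,
                           source: Optional[str] = None) -> int:
    """Count retention leases across all shard copies of `index`.

    Assumption: `stats` has the shape
    indices.<index>.shards.<shard number>[].retention_leases.leases[].
    If `source` is given, only leases with that `source` value are counted.
    """
    shards = stats.get("indices", {}).get(index, {}).get("shards", {})
    total = 0
    for copies in shards.values():      # one entry per shard number
        for copy in copies:             # primary and replica copies
            leases = copy.get("retention_leases", {}).get("leases", [])
            for lease in leases:
                if source is None or lease.get("source") == source:
                    total += 1
    return total


def safe_to_proceed(stats: Dict[str, Any], index: str,
                    source: Optional[str] = None) -> bool:
    """True when no matching retention leases remain on the leader index.

    Note: an index missing from `stats` also yields zero leases, so a real
    caller would want to verify the index is present first.
    """
    return count_retention_leases(stats, index, source) == 0
```

A caller would poll the stats API, pass each response to `safe_to_proceed`, and only continue with the delete or shrink steps once it returns `True`.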
Follower indices
ILM needs to inspect an index's custom index metadata to determine whether it is a follower index.
If an index is a follower index then the shrink action first needs to execute the following steps:
- Wait for the `index.lifecycle.indexing_complete` index setting to be replicated from the leader index.
- Then wait for the follower index's global checkpoint to equal the leader index's global checkpoint.
- Pause index following for the follower index. (This releases any shard history retention leases the follower index holds on its leader index.)
- Close the follower index.
- Unfollow the follower index. (Only closed indices can be unfollowed, because it changes the internal engine for all shards.)
- Open the unfollowed index.
After this it is safe to proceed with the steps that are part of the ILM shrink action.
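The follower steps above can be sketched as a wait condition followed by an ordered list of REST requests. This is only an illustration under stated assumptions: the function names are hypothetical, the checkpoint values are assumed to have been fetched separately, and the paths are the public pause follow, close, unfollow, and open index endpoints rather than the internal calls ILM makes.

```python
from typing import List, Tuple


def ready_to_unfollow(indexing_complete: bool,
                      follower_global_checkpoint: int,
                      leader_global_checkpoint: int) -> bool:
    """Wait condition: the setting has been replicated and the follower has caught up.

    `indexing_complete` is the follower's replicated value of
    index.lifecycle.indexing_complete.
    """
    return (indexing_complete
            and follower_global_checkpoint == leader_global_checkpoint)


def unfollow_requests(index: str) -> List[Tuple[str, str]]:
    """Requests that convert a follower index into a regular index, in order."""
    return [
        ("POST", f"/{index}/_ccr/pause_follow"),  # releases leases on the leader
        ("POST", f"/{index}/_close"),             # unfollow requires a closed index
        ("POST", f"/{index}/_ccr/unfollow"),      # switches the shard engines
        ("POST", f"/{index}/_open"),              # reopen as a regular index
    ]
```

The order matters: unfollowing before closing is rejected, and pausing first ensures the follower stops holding retention leases on the leader.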
Tasks
- Change the delete and shrink actions to safely handle CCR leader indices. (Ensure ILM policies run safely on leader indices #38140)
- Implement an Unfollow action. ([ILM] Add unfollow action #36970)
- Inject Unfollow action/steps before Shrink and Rollover actions. (Inject Unfollow before Rollover and Shrink #37625)
Original problem description
Currently, if a user wishes to use CCR and ILM together on the same index, they can run into problems. To help describe these problems, imagine we have two clusters (for this discussion I'm going to call them `leader` and `follower`) and we are using CCR's auto-follow on the follower cluster to follow any indices on the leader cluster matching `test-*`.
Now, because our scenario is a time series use case, it would also be good to have ILM manage the indices, so on the leader cluster we set up a policy which uses rollover, warm allocation, forcemerge, and shrink. Then we add the policy name to the index template for `test-*`, bootstrap ILM by creating the first index, and now we have ILM working on our leader cluster and managing the `test-*` indices.
Problem 1 - Setting up a policy for the following indices
Having the `test-*` indices managed by ILM on the leader cluster is great, but we would equally like ILM to manage the following indices on the follower cluster. However, we can't use the exact same policy on the follower cluster, because the following index will not have the write alias, and even if it did, we don't want the following index to roll over on its own criteria; we want it to mirror the leader index. This means the following index needs an indication that the leader has rolled over and moved to the warm phase, so that the following index also knows it can move to the warm phase.
Problem 2 - The leader index and the shrink action
In ILM, the shrink action allocates one copy of each shard to a single node, performs the shrink operation, deletes the original (un-shrunk) index, and then sets an alias on the new (shrunken) index with the same name as the original index. This allows the naive user to search the index as if it were still the same index, even though under the covers it is a different index.
The problem when combining this with CCR is that the following index may not be completely up to date with the leader index at the point the shrink action is performed. The follower may suddenly discover that the leader index no longer exists and be unable to progress, since it has no way of knowing that the shrunken index on the leader is equivalent to the index it was following. The follower and leader clusters are then indefinitely out of sync.
One solution would be for the un-shrunken following index to delete itself and for a separate auto-follow rule to sync the shrunken indices from the leader. The problem with this is that the shrunken follower index has to be synced from scratch, copying all the data the follower already had in the un-shrunken index, which is a waste of resources. More importantly, it means there is a period where the follower cluster is actually getting further out of sync with the leader, since it has thrown away the un-shrunken index and is waiting to fully sync the shrunken index from the leader.