
[ILM] Allow ILM and CCR to work well together #34648

Closed
@colings86

Description

If an index is a CCR leader or follower index, the delete and shrink actions should wait before proceeding with any operations. This avoids the problems described under "Original problem description" below.

Leader indices

ILM needs to query the indices stats api and check the shard history retention leases in order to determine whether an index is a leader index.

If an index is a leader index then the delete and shrink actions first need to execute the following steps:

  • Set the index.lifecycle.indexing_complete index setting to true.
  • Periodically query the indices stats API until no shard history retention leases remain for the leader index.

After this it is safe to proceed with the steps that are part of the ILM delete and shrink actions.
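The leader-side gate described above can be sketched as a pure function over already-fetched API responses. This is a minimal sketch, not ILM's implementation. It assumes the shard-level indices stats response exposes leases under `indices.<index>.shards.<n>[].retention_leases.leases`, and that settings arrive as a flat dict with string values. The names `count_retention_leases` and `leader_ready` are hypothetical.

```python
from typing import Any, Dict


def count_retention_leases(stats: Dict[str, Any], index: str) -> int:
    """Count retention leases across all shard copies of `index`.

    Assumed response shape (an assumption of this sketch):
    stats["indices"][index]["shards"][shard_id] is a list of shard copies,
    each with copy["retention_leases"]["leases"] as a list.
    """
    shards = stats.get("indices", {}).get(index, {}).get("shards", {})
    total = 0
    for copies in shards.values():
        for copy in copies:
            total += len(copy.get("retention_leases", {}).get("leases", []))
    return total


def leader_ready(settings: Dict[str, str], stats: Dict[str, Any], index: str) -> bool:
    """True once indexing_complete is set and no retention leases remain."""
    indexing_complete = settings.get("index.lifecycle.indexing_complete") == "true"
    return indexing_complete and count_retention_leases(stats, index) == 0
```

A caller would poll the stats API, pass each response to `leader_ready`, and only start the delete or shrink steps once it returns `True`.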

Follower indices

ILM needs to inspect an index's custom index metadata to determine whether it is a follower index.
If an index is a follower index, the shrink action first needs to execute the following steps:

  • Wait for the index.lifecycle.indexing_complete index setting to be replicated from the leader index.
  • Wait for the follower index's global checkpoint to equal the leader index's global checkpoint.
  • Pause index following for the follower index. (This releases any shard history retention leases the follower index holds on its leader index.)
  • Close the follower index.
  • Unfollow the follower index. (Only closed indices can be unfollowed, because it changes the internal engine for all shards.)
  • Open the unfollowed index.

After this it is safe to proceed with the steps that are part of the ILM shrink action.
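The ordering of the follower-side steps can be sketched as a small state machine that, given what is currently known about the follower index, returns the next step to run. This is only an illustration of the sequence listed above. The `FollowerState` fields, the `Step` enum, and `next_step` are hypothetical names, and no Elasticsearch API is called.

```python
from dataclasses import dataclass
from enum import Enum, auto


class Step(Enum):
    WAIT_INDEXING_COMPLETE = auto()
    WAIT_CHECKPOINT = auto()
    PAUSE_FOLLOW = auto()
    CLOSE = auto()
    UNFOLLOW = auto()
    OPEN = auto()
    DONE = auto()  # safe to start the shrink action proper


@dataclass
class FollowerState:
    indexing_complete: bool      # setting replicated from the leader
    follower_checkpoint: int     # follower global checkpoint
    leader_checkpoint: int       # leader global checkpoint
    paused: bool = False         # index following paused
    closed: bool = False         # follower index closed
    unfollowed: bool = False     # converted to a regular index


def next_step(s: FollowerState) -> Step:
    """Return the next step in the order given in the list above."""
    if s.unfollowed:
        # Only reopening is left after the unfollow.
        return Step.OPEN if s.closed else Step.DONE
    if not s.indexing_complete:
        return Step.WAIT_INDEXING_COMPLETE
    if not s.paused:
        # Checkpoints only matter before following is paused.
        if s.follower_checkpoint != s.leader_checkpoint:
            return Step.WAIT_CHECKPOINT
        return Step.PAUSE_FOLLOW
    if not s.closed:
        return Step.CLOSE
    return Step.UNFOLLOW
```

Deriving the step from observed state, rather than from a stored counter, means the sequence can resume at the right place after a restart.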

Tasks

Original problem description

Currently, if a user wishes to use CCR and ILM together on the same index, they can run into problems. To help describe these problems, imagine we have two clusters (for this discussion I'm going to call them leader and follower) and we are using CCR's auto-follow on the follower cluster to follow any indices on the leader cluster matching test-*.

Now, because our scenario is a time series use case, it would also be good to have ILM manage the indices, so on the leader cluster we set up a policy which uses rollover, warm allocation, forcemerge, and shrink. Then we add the policy name to the index template for test-*, bootstrap ILM by creating the first index, and now we have ILM working on our leader cluster and managing the test-* indices.
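For concreteness, a leader-side policy of the kind described could look roughly like the body built below. This is a sketch under stated assumptions: the rollover threshold, the `box_type` allocation attribute, and the shard count are placeholder values that do not come from this issue, and `build_leader_policy` is a hypothetical helper.

```python
import json
from typing import Any, Dict


def build_leader_policy(
    max_size: str = "50gb",        # assumed rollover threshold
    warm_attr: str = "box_type",   # assumed node attribute name
    warm_value: str = "warm",      # assumed node attribute value
    shards: int = 1,               # assumed target shard count
) -> Dict[str, Any]:
    """Build an ILM policy body with rollover, allocate, forcemerge, shrink."""
    return {
        "policy": {
            "phases": {
                "hot": {
                    "actions": {"rollover": {"max_size": max_size}},
                },
                "warm": {
                    "actions": {
                        "allocate": {"require": {warm_attr: warm_value}},
                        "forcemerge": {"max_num_segments": 1},
                        "shrink": {"number_of_shards": shards},
                    },
                },
            }
        }
    }


if __name__ == "__main__":
    print(json.dumps(build_leader_policy(), indent=2))
```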

Problem 1 - Setting up a policy for the following indices

Having the test-* indices managed by ILM on the leader cluster is great, but equally we would like ILM to manage the following indices on the follower cluster too. However, we can't use the exact same policy on the follower cluster, because the following index will not have the write alias. Even if it did, we don't want the following index to roll over on its own criteria; we want it to mirror the leader index. This means the following index needs an indication that the leader has rolled over and moved to the warm phase, so that the following index knows it can move to the warm phase too.

Problem 2 - The leader index and the shrink action

In ILM the shrink action allocates one copy of each shard to a single node, performs the shrink operation, deletes the original (un-shrunk) index, and then sets an alias on the new (shrunken) index with the same name as the original index. This allows a user to search the index as if it were still the same index, although under the covers it is a different index.

The problem when combining this with CCR is that the following index may not be completely up to date with the leader index at the point the shrink action is performed. It may suddenly discover that the leader index no longer exists and be unable to progress, since it has no way to know that the shrunken index on the leader is equivalent to the index it was following. The follower and leader clusters are then indefinitely out of sync.

One solution to this would be for the un-shrunken following index to delete itself and for there to be a separate auto-follow rule to sync the shrunken indices from the leader. The problem with this is that the follower's shrunken index has to be synced from scratch, copying all the same data it already had in the un-shrunken index, which is a waste of resources. More importantly, there is a period where the follower cluster will actually be getting further out of sync with the leader, since it has thrown away the un-shrunken index and is waiting to fully sync the shrunken index from the leader.
