Enrich indices can be deleted if cluster leader fails over during policy execution

Enrich policy execution can take a significant amount of time because of its reindexing step. For this reason, each enrich policy is limited to one execution at a time. This limit is enforced by a set of locks on the master node.
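To make the locking model concrete, here is a minimal sketch of a per-policy lock registry that lives only in the memory of the acting master. The class and method names (`PolicyLocks`, `try_lock`, `release`, `is_locked`) are hypothetical and are not taken from the Elasticsearch code base; the sketch only illustrates the "one execution at a time per policy" rule and the fact that a freshly elected master starts with no lock state.

```python
import threading


class PolicyLocks:
    """Hypothetical in-memory lock registry, held only by the acting master."""

    def __init__(self):
        self._guard = threading.Lock()
        self._running = set()  # names of policies with an execution in progress

    def try_lock(self, policy):
        """Return True if the lock was obtained, False if the policy is already running."""
        with self._guard:
            if policy in self._running:
                return False
            self._running.add(policy)
            return True

    def release(self, policy):
        with self._guard:
            self._running.discard(policy)

    def is_locked(self, policy):
        with self._guard:
            return policy in self._running


locks = PolicyLocks()
first = locks.try_lock("my_policy")    # first execution obtains the lock
second = locks.try_lock("my_policy")   # a concurrent execution is rejected

# A newly elected master builds its registry from scratch (assumption of this sketch),
# so it has no record of executions started under the previous master.
locks_after_failover = PolicyLocks()
```

Because nothing in this sketch is persisted or shared, losing the master loses every lock, which is the precondition for the failure described below.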

Enrich indices are cleaned up by a background process rather than deleted immediately, to avoid search shards disappearing randomly during ingest. This background maintenance process runs on the master node so that it can reference the lock states for all policies. This allows a long-running policy to execute without the risk of the maintenance task deleting the new index while it is still being built.
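The cleanup rule can be sketched as a small function. This is a simplified model inferred from the behavior described in this issue, not the actual maintenance code: the function name `abandoned_indices` and the dictionary fields `policy`, `is_current`, and `has_enrich_meta` are assumptions made for illustration.

```python
def abandoned_indices(indices, locked_policies):
    """Return the index names this simplified maintenance rule would delete.

    indices: dict mapping index name -> {"policy", "is_current", "has_enrich_meta"}
    locked_policies: set of policy names with an execution in progress
    """
    doomed = []
    for name, info in indices.items():
        if info["policy"] in locked_policies:
            continue  # an execution is in progress, leave all of its indices alone
        if info["is_current"] and info["has_enrich_meta"]:
            continue  # valid enrich index that holds the "current" alias
        doomed.append(name)
    return doomed


indices = {
    ".enrich-my_policy-0000000000001": {
        "policy": "my_policy", "is_current": True, "has_enrich_meta": True,
    },
    ".enrich-my_policy-0000000000002": {
        "policy": "my_policy", "is_current": False, "has_enrich_meta": True,
    },
}

# While the lock is held, the index under construction is protected.
during_execution = abandoned_indices(indices, {"my_policy"})

# Without the lock, the same index looks abandoned.
without_lock = abandoned_indices(indices, set())
```

The lock is the only thing that distinguishes an index under construction from an abandoned one in this model.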

There is an edge case: if a policy takes sufficiently long to execute and the acting master fails over during that execution, all enrich indices for the policy can be deleted.

Assume a cluster with an enrich policy `my_policy` that has been executed previously and therefore has an enrich index `.enrich-my_policy-0000000000001`. The index holds an alias named `.enrich-my_policy`, which marks it as the "current" enrich index for the policy.

Scenario:
1. Execute `my_policy` enrich policy.
2. The master node obtains the `my_policy` lock and submits the execution request to a data node.
3. Data node creates new index `.enrich-my_policy-0000000000002` and configures the mapping with the required meta fields.
4. The data node begins reindexing data into the new index. This process must take longer than 15 minutes for this bug to surface.
5. The master node is shut down suddenly. All enrich lock state is lost.
6. The policy execution request likely times out or fails for the client, but the policy is still executing on the data node.
7. A new master node takes control of the cluster.
8. Fifteen minutes after becoming the cluster leader, the new master node executes the enrich maintenance task. The task locates `.enrich-my_policy-0000000000002`, sees no lock indicating that an execution is in progress for it, and assumes the index is abandoned. The maintenance task removes the index.
9. The policy that is still executing the reindex operation continues to submit bulk indexing requests. The bulk index operation automatically recreates `.enrich-my_policy-0000000000002`, but without the metadata in the index mapping that identifies it as an enrich index.
10. The policy execution completes and the interloper enrich index is promoted to the "current" one by setting its alias.
11. Some time later, the maintenance task executes again. This time, it removes the previous valid enrich index `.enrich-my_policy-0000000000001` because it does not hold the "current" alias for the policy. The maintenance task also removes the `.enrich-my_policy-0000000000002` index again, because even though it has the current alias, it does not have the metadata in its mappings to identify it as an enrich index.
12. At this point, the `my_policy` enrichment policy is left with no valid enrich indices.
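
The scenario can be replayed against a toy model to check that the sequence ends with no indices left. This is a sketch under stated assumptions, not a reproduction against a real cluster: the `sweep` function and the `alias` and `enrich_meta` fields are hypothetical stand-ins for the maintenance task and the index state described in the steps above.

```python
OLD = ".enrich-my_policy-0000000000001"
NEW = ".enrich-my_policy-0000000000002"


def sweep(indices, locked_policies):
    """Simplified maintenance pass: delete every index that looks abandoned."""
    for name in list(indices):
        idx = indices[name]
        in_progress = "my_policy" in locked_policies
        valid_current = idx["alias"] and idx["enrich_meta"]
        if not in_progress and not valid_current:
            del indices[name]


# Starting state: one valid enrich index holding the "current" alias.
indices = {OLD: {"alias": True, "enrich_meta": True}}

locks = {"my_policy"}                                  # steps 1-2: lock obtained
indices[NEW] = {"alias": False, "enrich_meta": True}   # steps 3-4: new index, reindex starts
locks = set()                                          # steps 5-7: failover, lock state lost

sweep(indices, locks)                                  # step 8: NEW looks abandoned
after_first_sweep = sorted(indices)

indices[NEW] = {"alias": False, "enrich_meta": False}  # step 9: bulk request auto-creates NEW
indices[OLD]["alias"] = False                          # step 10: alias moves to NEW
indices[NEW]["alias"] = True

sweep(indices, locks)                                  # step 11: both indices are removed
after_second_sweep = sorted(indices)                   # step 12: nothing left
```

In this model the first sweep leaves only the old index, and the second sweep leaves none, matching steps 8 through 12.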

Enrich indices can be deleted if cluster leader fails over during policy execution #99725
