Enrich indices can be deleted if cluster leader fails over during policy execution #99725

@jbaiera

Description

Enrich policy execution can take a significant amount of time due to the reindexing step. Because of this, each enrich policy is limited to one execution at a time. This is managed by a set of locks held on the master node.
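
As a rough illustration of the one-execution-per-policy rule, here is a minimal Python sketch of an in-memory lock table. The class and method names (`PolicyLocks`, `try_acquire`, `release`) are hypothetical and are not taken from the Elasticsearch code base; the point is only that the state lives in the memory of a single node.

```python
import threading


class PolicyLocks:
    """Toy in-memory lock table (hypothetical, not the Elasticsearch class)."""

    def __init__(self):
        self._guard = threading.Lock()
        self._running = set()  # policies with an execution in progress

    def try_acquire(self, policy):
        """Return True if the caller may start executing the policy."""
        with self._guard:
            if policy in self._running:
                return False
            self._running.add(policy)
            return True

    def release(self, policy):
        """Mark the policy execution as finished."""
        with self._guard:
            self._running.discard(policy)


locks = PolicyLocks()
first = locks.try_acquire("my_policy")   # first execution is admitted
second = locks.try_acquire("my_policy")  # concurrent execution is rejected
```

Because a table like this exists only in one process, nothing survives if that process goes away, which is the property the scenario below depends on.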

Enrich indices are cleaned up by a background process to avoid search shards disappearing randomly during ingest. This background maintenance process runs on the master node so that it can reference the lock states for all policies. This allows a long-running policy to execute without the risk of the maintenance task deleting the in-progress index before it has been fully built.
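
The interaction between the locks and the maintenance pass can be sketched as follows. This is a simplified model under stated assumptions, not the actual maintenance code: `EnrichIndex` and `abandoned` are hypothetical names, and the rule "skip locked policies, otherwise remove anything that is not current" is inferred from the description above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EnrichIndex:
    name: str
    policy: str
    is_current: bool  # True if the index holds the policy's "current" alias


def abandoned(indices, locked_policies):
    """Return names of indices the toy maintenance pass would remove."""
    return [
        ix.name
        for ix in indices
        if ix.policy not in locked_policies and not ix.is_current
    ]


indices = [
    EnrichIndex(".enrich-my_policy-0000000000001", "my_policy", True),
    EnrichIndex(".enrich-my_policy-0000000000002", "my_policy", False),
]
while_locked = abandoned(indices, {"my_policy"})  # lock held: nothing removed
lock_lost = abandoned(indices, set())             # lock missing: new index removed
```

With the lock present the in-progress index is left alone; with the lock absent the same index looks abandoned.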

There is an edge case in which a policy that takes sufficiently long to execute can have all of its enrich indices deleted if the acting master fails over during the execution.

Assume a cluster with an enrich policy my_policy that has previously been executed and thus has an enrich index .enrich-my_policy-0000000000001. The index holds an alias named .enrich-my_policy, which marks it as the "current" enrich index for the policy.

Scenario:

  1. Execute my_policy enrich policy.
  2. Master node obtains the my_policy lock and submits the request to a data node to perform.
  3. Data node creates new index .enrich-my_policy-0000000000002 and configures the mapping with the required meta fields.
  4. The data node begins reindexing data into the new index. This process must take longer than 15 minutes for this bug to surface.
  5. The master node is shut down suddenly. All enrich lock state is lost.
  6. The policy execution request likely times out/fails for the client, but the policy is still executing on the data node.
  7. A new master node takes control of the cluster.
  8. Fifteen minutes after becoming the cluster leader, the new master node executes the enrich maintenance task. The task locates .enrich-my_policy-0000000000002, sees no lock indicating that an execution is in progress for the policy, and assumes the index is abandoned. The maintenance task removes the index.
  9. The policy that is still executing the reindex operation continues to submit bulk indexing requests. The bulk index operation automatically recreates .enrich-my_policy-0000000000002, but without the metadata in the index mapping that identifies it as an enrich index.
  10. The policy execution completes and the interloper enrich index is promoted to the "current" one by setting its alias.
  11. Some time later, the maintenance task executes again. This time, it removes the previously valid enrich index .enrich-my_policy-0000000000001 because it no longer holds the "current" alias for the policy. The maintenance task also removes the .enrich-my_policy-0000000000002 index again because, even though it has the current alias, it lacks the metadata in its mappings that identifies it as an enrich index.
  12. At this point, the my_policy enrichment policy is left with no valid enrich indices.
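
The steps above can be replayed with a small toy model. This is only a sketch of the sequence as described in this issue, not a reproduction against a real cluster: the dictionary layout, the `maintenance` rule, and `run_scenario` are all hypothetical, and the model assumes that an index created implicitly by a bulk request carries no enrich metadata.

```python
# Toy replay of the scenario; all names and data structures are hypothetical.
OLD = ".enrich-my_policy-0000000000001"
NEW = ".enrich-my_policy-0000000000002"


def maintenance(indices, alias_target, locked):
    """Remove indices the toy rule treats as stale; return the removed names."""
    removed = []
    for name in sorted(indices):
        meta = indices[name]
        if meta["policy"] in locked:
            continue  # execution in progress: leave the policy's indices alone
        if not meta["enrich_meta"] or alias_target != name:
            removed.append(name)
    for name in removed:
        del indices[name]
    return removed


def run_scenario():
    # Initial state: one valid enrich index holding the "current" alias.
    indices = {OLD: {"policy": "my_policy", "enrich_meta": True}}
    alias_target = OLD
    log = []

    # Steps 1-4: lock taken on the master, new index created with metadata.
    locked = {"my_policy"}
    indices[NEW] = {"policy": "my_policy", "enrich_meta": True}

    # Steps 5-7: master fails over; the in-memory lock state is gone.
    locked = set()

    # Step 8: first maintenance run on the new master.
    log.append(maintenance(indices, alias_target, locked))

    # Step 9: a bulk request recreates the index without enrich metadata.
    indices[NEW] = {"policy": "my_policy", "enrich_meta": False}

    # Step 10: execution completes and the alias moves to the recreated index.
    alias_target = NEW

    # Step 11: second maintenance run.
    log.append(maintenance(indices, alias_target, locked))
    return indices, log


remaining, removals = run_scenario()
```

In this model the first maintenance run removes only the in-progress index, the second removes both indices, and `remaining` ends up empty, matching step 12.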
