Description
Enrich policy execution can take a significant amount of time due to the reindexing step. Because of this, each enrich policy is limited to one execution at a time. This is managed by a set of locks on the master node.
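To illustrate the locking described above, here is a minimal sketch of a per-policy lock registry held only in memory. The class and method names (`PolicyLocks`, `try_acquire`, `release`, `is_locked`) are hypothetical and are not taken from the Elasticsearch code base.

```python
import threading


class PolicyLocks:
    """Hypothetical in-memory registry allowing one execution per policy."""

    def __init__(self):
        self._guard = threading.Lock()  # protects the set below
        self._running = set()           # names of policies currently executing

    def try_acquire(self, policy: str) -> bool:
        """Return True if the lock was taken, False if the policy is already running."""
        with self._guard:
            if policy in self._running:
                return False
            self._running.add(policy)
            return True

    def release(self, policy: str) -> None:
        """Mark the policy as no longer executing."""
        with self._guard:
            self._running.discard(policy)

    def is_locked(self, policy: str) -> bool:
        """Report whether an execution is in progress for the policy."""
        with self._guard:
            return policy in self._running
```

Because the state lives only in the process that created it, a freshly constructed registry (as on a newly elected master) reports every policy as unlocked, even if an execution is still running elsewhere.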
Enrich indices are cleaned up by a background process to avoid search shards disappearing randomly during ingest. This background maintenance process runs on the master node so that it can reference the lock states for all policies. This allows a long-running policy to execute without the risk of the maintenance task deleting the new index before the policy finishes creating it.
There is an edge case: if a policy takes sufficiently long to execute and the acting master fails over during the execution, all enrich indices for that policy can be deleted.
Assume a cluster with an enrich policy my_policy that has been executed previously and thus has an enrich index .enrich-my_policy-0000000000001. The index holds an alias named .enrich-my_policy, which marks it as the "current" enrich index for the policy.
Scenario:
- Execute the my_policy enrich policy.
- Master node obtains the my_policy lock and submits the request to a data node to perform.
- Data node creates new index .enrich-my_policy-0000000000002 and configures the mapping with the required meta fields.
- The data node begins reindexing data into the new index. This process must take longer than 15 minutes for this bug to surface.
- The master node is shut down suddenly. All enrich lock state is lost.
- The policy execution request likely times out/fails for the client, but the policy is still executing on the data node.
- A new master node takes control of the cluster.
- Fifteen minutes after becoming the cluster leader, the master node executes the enrich maintenance task. The task locates .enrich-my_policy-0000000000002. The task sees that there are no locks indicating that an execution is in progress for this policy and assumes the index is abandoned. The maintenance task removes the index.
- The policy that is still executing the reindex operation continues to submit bulk indexing requests. The bulk index operation automatically recreates .enrich-my_policy-0000000000002, but without the metadata in the index mapping that identifies it as an enrich index.
- The policy execution completes and the interloper enrich index is promoted to the "current" one by setting its alias.
- Some time later, the maintenance task executes again. This time, it removes the previous valid enrich index .enrich-my_policy-0000000000001 because it does not hold the "current" alias for the policy. The maintenance task also removes the .enrich-my_policy-0000000000002 index again because, even though it has the current alias, it does not have the metadata in its mappings to identify it as an enrich index.
- At this point, the my_policy enrichment policy is left with no valid enrich indices.
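The scenario can be replayed with a small in-memory model. This is only a sketch: the cleanup rule in `maintenance` is inferred from the steps above (skip locked policies, otherwise delete any enrich-named index that is not current or lacks the enrich metadata), and the dictionary layout, the `has_enrich_meta` flag, and the function names are hypothetical rather than the actual Elasticsearch implementation.

```python
def maintenance(cluster, policy):
    """Hypothetical cleanup pass; returns the names of the indices it removed."""
    if policy in cluster["locks"]:
        return []  # an execution is known to be in progress, leave indices alone
    prefix = f".enrich-{policy}-"
    alias = f".enrich-{policy}"
    current = cluster["aliases"].get(alias)
    removed = [
        name
        for name in sorted(cluster["indices"])
        if name.startswith(prefix)
        and (name != current or not cluster["indices"][name]["has_enrich_meta"])
    ]
    for name in removed:
        del cluster["indices"][name]
        if current == name:
            del cluster["aliases"][alias]  # an alias disappears with its index
    return removed


def simulate():
    """Replay the failover scenario and report what each maintenance run deletes."""
    policy = "my_policy"
    alias = ".enrich-my_policy"
    old = ".enrich-my_policy-0000000000001"
    new = ".enrich-my_policy-0000000000002"
    cluster = {
        "indices": {old: {"has_enrich_meta": True}},
        "aliases": {alias: old},
        "locks": set(),
    }
    # Execution starts: the master takes the lock, the data node creates the index.
    cluster["locks"].add(policy)
    cluster["indices"][new] = {"has_enrich_meta": True}
    # Master failover: the new master starts with no lock state.
    cluster["locks"] = set()
    first_run = maintenance(cluster, policy)
    # An in-flight bulk request auto-creates the index without the enrich metadata.
    cluster["indices"].setdefault(new, {"has_enrich_meta": False})
    # Execution completes and promotes the recreated index.
    cluster["aliases"][alias] = new
    second_run = maintenance(cluster, policy)
    return first_run, second_run, sorted(cluster["indices"])


first_run, second_run, remaining = simulate()
print(first_run)   # ['.enrich-my_policy-0000000000002']
print(second_run)  # ['.enrich-my_policy-0000000000001', '.enrich-my_policy-0000000000002']
print(remaining)   # []
```

In this model, keeping the lock in place across the failover (that is, skipping the line that resets `cluster["locks"]`) makes the first maintenance run return an empty list, which is the behavior the locks are meant to guarantee.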