[BUG] Eliminate False Positive Notifications in Manual Snapshot Policy #1371

@skumawat2025

What is the bug?
When a manual snapshot policy runs, it creates and deletes snapshots according to the configured cron schedules. Each of these actions updates the policy's state in a system index (the .ism-config index). Because of a race condition, this state update can fail. The failure occurs when a snapshot deletion is still in progress and a snapshot creation starts, holding the lock on the system index. When the deletion completes, its attempt to update the metadata in the system index fails.

        // creation, deletion workflow have to be executed sequentially,
        // because they are sharing the same metadata document.
        SMStateMachine(client, job, metadata, settings, threadPool, indicesManager)
            .handlePolicyChange()
            .currentState(metadata.creation.currentState)
            .next(creationTransitions)
            .apply {
                val deleteMetadata = metadata.deletion
                if (deleteMetadata != null) {
                    this.currentState(deleteMetadata.currentState)
                        .next(deletionTransitions)
                }
            }
    } finally {
        if (!releaseLockForScheduledJob(context, lock)) {
            log.error("Could not release lock [${lock.lockId}] for ${job.id}.")
        }
    }
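
To make the failure mode concrete, here is a minimal, self-contained sketch of a sequence-number-checked write to a shared document. It is a simplified model under stated assumptions, not the plugin's code: `MetadataStore`, `MetadataDoc`, `VersionConflictException`, `simulateRace`, and the state names are all hypothetical, and the real check happens inside the index engine rather than in application code.

```kotlin
// Hypothetical, simplified model of a metadata document guarded by a sequence number.
class VersionConflictException(message: String) : Exception(message)

data class MetadataDoc(val state: String, val seqNo: Long)

class MetadataStore(initial: MetadataDoc) {
    var current: MetadataDoc = initial
        private set

    // Accept the write only if the caller read the latest seqNo.
    @Synchronized
    fun update(newState: String, expectedSeqNo: Long): MetadataDoc {
        if (expectedSeqNo != current.seqNo) {
            throw VersionConflictException(
                "version conflict, required seqNo [$expectedSeqNo], " +
                    "current document has seqNo [${current.seqNo}]"
            )
        }
        current = MetadataDoc(newState, current.seqNo + 1)
        return current
    }
}

fun simulateRace(): String {
    val store = MetadataStore(MetadataDoc("DELETING", seqNo = 1))
    // The deletion workflow reads the document, then keeps working.
    val seenByDeletion = store.current.seqNo
    // Meanwhile a creation run updates the same document.
    store.update("CREATING", expectedSeqNo = store.current.seqNo)
    // The deletion workflow finishes and writes with its stale seqNo.
    return try {
        store.update("DELETION_FINISHED", expectedSeqNo = seenByDeletion)
        "updated"
    } catch (ex: VersionConflictException) {
        "conflict: ${ex.message}"
    }
}

fun main() {
    println(simulateRace())
}
```

In this model the second write is rejected because the document changed after the deletion workflow read it, which has the same shape as the version conflict reported in the log output in this issue.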

Currently, we send a notification to users whenever a metadata update fails. This is a false alarm: the failure is an internal error, not a user-facing issue that the user can act on and fix.

    } catch (ex: Exception) {
        val message = "There was an exception at ${now()} while executing Snapshot Management policy ${job.policyName}, please check logs."
        job.notificationConfig?.sendFailureNotification(client, job.policyName, message, job.user, log)
        @Suppress("InstanceOfCheckForException")

How can one reproduce the bug?
1. Set up a manual snapshot policy with both creation and deletion operations.
2. Configure a notification channel.
3. Run the policy and observe the notifications.

What is the expected behavior?
The system should not send false positive notifications to users for internal metadata update failures.
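
One possible shape for this behavior is sketched below: classify the exception before notifying, and only log failures that are internal. This is an illustration under assumptions, not a proposed patch. `MetadataVersionConflict`, `SnapshotOperationFailed`, `RecordingNotifier`, `isUserActionable`, and `runPolicy` are hypothetical names, and the plugin's real exception hierarchy and notification API may differ.

```kotlin
// Hypothetical exception types; the plugin's real hierarchy may differ.
class MetadataVersionConflict(message: String) : Exception(message)
class SnapshotOperationFailed(message: String) : Exception(message)

// Hypothetical stand-in for a notification channel; it only records messages.
class RecordingNotifier {
    val sent = mutableListOf<String>()

    fun sendFailureNotification(policyName: String, message: String) {
        sent += "[$policyName] $message"
    }
}

// Internal bookkeeping failures are not something the user can fix.
fun isUserActionable(ex: Exception): Boolean = ex !is MetadataVersionConflict

fun runPolicy(policyName: String, notifier: RecordingNotifier, work: () -> Unit) {
    try {
        work()
    } catch (ex: Exception) {
        if (isUserActionable(ex)) {
            notifier.sendFailureNotification(policyName, "Policy run failed: ${ex.message}")
        } else {
            // Log only; no notification is sent for internal failures.
            println("Internal metadata update failure for $policyName: ${ex.message}")
        }
    }
}

fun main() {
    val notifier = RecordingNotifier()
    runPolicy("daily-snapshots", notifier) { throw MetadataVersionConflict("version conflict") }
    runPolicy("daily-snapshots", notifier) { throw SnapshotOperationFailed("repository unreachable") }
    println(notifier.sent)
}
```

With this split, the version conflict is still visible in the logs for operators, while the notification channel only receives failures the user can do something about.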

Do you have any screenshots?

[2024-12-19T02:49:12,259][ERROR][c.o.i.s.e.SMStateMachine [xxxxxx]] [c15aefb119d1092fc32d73e9e5ef8c22] Failed to update metadata.
[.ism-config/QHnWuqpwS46e7r0qCLwuNQ][[.ism-config][4]] VersionConflictEngineException[[xxxxxx-sm-metadata]: version conflict, required seqNo [754565], primary term [1]. current document has seqNo [754720] and primary term [1]]
[2024-12-19T02:49:12,259][INFO ][o.o.n.s.SendMessageActionHelper] [c15aefb119d1092fc32d73e9e5ef8c22] notifications:getSingleConfig-get snapshot-error-notification
[2024-12-19T02:49:12,939][INFO ][o.o.n.s.SendMessageActionHelper] [c15aefb119d1092fc32d73e9e5ef8c22] notifications:sendMessage:statusCode=200, statusText=Success, message id: a3db63d7-295e-5608-b188-3c0aa2b6a1c2
[2024-12-19T02:49:12,941][WARN ][o.o.i.u.JobSchedulerUtils] [c15aefb119d1092fc32d73e9e5ef8c22] Could not release lock for job xxxxxx-sm-policy
[2024-12-19T02:49:12,941][ERROR][o.o.i.s.SMRunner         ] [c15aefb119d1092fc32d73e9e5ef8c22] Could not release lock [.ism-config-xxxxxx-sm-policy] for xxxxxx-sm-policy.
