Description
What is the bug?
When a snapshot management policy runs, it creates and deletes snapshots on the cron schedules configured in the policy. Both workflows record their state in a shared metadata document in a system index (the .ism-config index), and a race condition can make that state update fail: while a snapshot deletion is in progress, a snapshot creation run starts while holding a lock on the system index and advances the metadata document, so when the deletion completes, its metadata update is rejected because it is based on the now-stale document.
index-management/src/main/kotlin/org/opensearch/indexmanagement/snapshotmanagement/SMRunner.kt
Lines 104 to 120 in eb6afa8
```kotlin
// creation, deletion workflow have to be executed sequentially,
// because they are sharing the same metadata document.
SMStateMachine(client, job, metadata, settings, threadPool, indicesManager)
    .handlePolicyChange()
    .currentState(metadata.creation.currentState)
    .next(creationTransitions)
    .apply {
        val deleteMetadata = metadata.deletion
        if (deleteMetadata != null) {
            this.currentState(deleteMetadata.currentState)
                .next(deletionTransitions)
        }
    }
} finally {
    if (!releaseLockForScheduledJob(context, lock)) {
        log.error("Could not release lock [${lock.lockId}] for ${job.id}.")
    }
```
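For context on why the deletion's update fails outright instead of overwriting: the shared metadata document is written with optimistic concurrency control, so a write carrying a stale sequence number is rejected with a VersionConflictEngineException (visible in the logs below). A minimal sketch of that pattern, assuming the OpenSearch node client; the document id and field values here are hypothetical:

```kotlin
import org.opensearch.action.index.IndexRequest
import org.opensearch.client.Client

// Minimal sketch of an optimistic-concurrency metadata write. seqNo and
// primaryTerm come from the metadata read at the start of the workflow; if a
// concurrent workflow (e.g. snapshot creation) has written the document in
// the meantime, this request fails with VersionConflictEngineException
// instead of silently overwriting the newer state.
fun writeMetadata(client: Client, seqNo: Long, primaryTerm: Long) {
    val request = IndexRequest(".ism-config")
        .id("example-sm-policy-sm-metadata") // hypothetical document id
        .source(mapOf("creation" to mapOf("current_state" to "CREATION_FINISHED")))
        .setIfSeqNo(seqNo)         // stale if another workflow wrote first
        .setIfPrimaryTerm(primaryTerm)
    client.index(request).actionGet()
}
```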
Currently, we send a notification to users on metadata update failures. This is a false alarm, as it's an internal error rather than a user-facing issue that requires action.
Lines 124 to 127 in eb6afa8
```kotlin
} catch (ex: Exception) {
    val message = "There was an exception at ${now()} while executing Snapshot Management policy ${job.policyName}, please check logs."
    job.notificationConfig?.sendFailureNotification(client, job.policyName, message, job.user, log)
    @Suppress("InstanceOfCheckForException")
```
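One possible direction for a fix (a sketch only, not the plugin's actual change): unwrap the exception and suppress the user-facing notification when the root cause is a version conflict on the metadata document. ExceptionsHelper.unwrapCause and VersionConflictEngineException are existing OpenSearch classes; the helper function itself is hypothetical:

```kotlin
import org.opensearch.ExceptionsHelper
import org.opensearch.index.engine.VersionConflictEngineException

// Hypothetical guard: a version conflict on the shared metadata document is
// an internal race between the creation and deletion workflows, not a
// failure the user can act on, so it should not trigger a notification.
fun isInternalMetadataConflict(ex: Exception): Boolean =
    ExceptionsHelper.unwrapCause(ex) is VersionConflictEngineException
```

The catch block above could then consult this guard before calling sendFailureNotification, while still logging the conflict so the retry behavior stays observable.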
How can one reproduce the bug?
1. Set up a snapshot management policy with both creation and deletion operations (a policy sketch follows below).
2. Configure a notification channel for the policy.
3. Run the policy and observe that a failure notification is sent whenever the internal metadata update hits a version conflict.
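Hitting the race reliably requires the creation and deletion workflows to overlap. A sketch of a policy that schedules both aggressively, assuming the Snapshot Management REST API (POST _plugins/_sm/policies/&lt;policy_name&gt;); the repository name and channel id are placeholders:

```json
POST _plugins/_sm/policies/example-sm-policy
{
  "description": "Creation and deletion on overlapping schedules",
  "creation": {
    "schedule": { "cron": { "expression": "* * * * *", "timezone": "UTC" } }
  },
  "deletion": {
    "schedule": { "cron": { "expression": "* * * * *", "timezone": "UTC" } },
    "condition": { "max_count": 5 }
  },
  "snapshot_config": { "repository": "example-repo", "indices": "*" },
  "notification": {
    "channel": { "id": "example-channel-id" },
    "conditions": { "failure": true }
  }
}
```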
What is the expected behavior?
The system should not send false positive notifications to users for internal metadata update failures.
Do you have any screenshots?
```
[2024-12-19T02:49:12,259][ERROR][c.o.i.s.e.SMStateMachine [xxxxxx]] [c15aefb119d1092fc32d73e9e5ef8c22] Failed to update metadata.
[.ism-config/QHnWuqpwS46e7r0qCLwuNQ][[.ism-config][4]] VersionConflictEngineException[[xxxxxx-sm-metadata]: version conflict, required seqNo [754565], primary term [1]. current document has seqNo [754720] and primary term [1]]
[2024-12-19T02:49:12,259][INFO ][o.o.n.s.SendMessageActionHelper] [c15aefb119d1092fc32d73e9e5ef8c22] notifications:getSingleConfig-get snapshot-error-notification
[2024-12-19T02:49:12,939][INFO ][o.o.n.s.SendMessageActionHelper] [c15aefb119d1092fc32d73e9e5ef8c22] notifications:sendMessage:statusCode=200, statusText=Success, message id: a3db63d7-295e-5608-b188-3c0aa2b6a1c2
[2024-12-19T02:49:12,941][WARN ][o.o.i.u.JobSchedulerUtils] [c15aefb119d1092fc32d73e9e5ef8c22] Could not release lock for job xxxxxx-sm-policy
[2024-12-19T02:49:12,941][ERROR][o.o.i.s.SMRunner ] [c15aefb119d1092fc32d73e9e5ef8c22] Could not release lock [.ism-config-xxxxxx-sm-policy] for xxxxxx-sm-policy.
```
Do you have any additional context?