Description
What is the bug?
When a snapshot management policy runs, it creates and deletes snapshots on the cron schedules configured in the policy. Both workflows record their state in a shared metadata document in a system index (the .ism-config index), and a race condition can make that state update fail: while a snapshot deletion is in progress, a snapshot creation run starts while holding a lock on the system index and advances the metadata document, so when the deletion completes, its metadata update is rejected because it is based on the now-stale document.
index-management/src/main/kotlin/org/opensearch/indexmanagement/snapshotmanagement/SMRunner.kt
Lines 104 to 120 in eb6afa8
```kotlin
// creation, deletion workflow have to be executed sequentially,
// because they are sharing the same metadata document.
SMStateMachine(client, job, metadata, settings, threadPool, indicesManager)
    .handlePolicyChange()
    .currentState(metadata.creation.currentState)
    .next(creationTransitions)
    .apply {
        val deleteMetadata = metadata.deletion
        if (deleteMetadata != null) {
            this.currentState(deleteMetadata.currentState)
                .next(deletionTransitions)
        }
    }
} finally {
    if (!releaseLockForScheduledJob(context, lock)) {
        log.error("Could not release lock [${lock.lockId}] for ${job.id}.")
    }
```
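For context on why the deletion's update fails outright instead of overwriting: the shared metadata document is written with optimistic concurrency control, so a write carrying a stale sequence number is rejected with a VersionConflictEngineException (visible in the logs below). A minimal sketch of that pattern, assuming the OpenSearch node client; the document id and field values here are hypothetical:

```kotlin
import org.opensearch.action.index.IndexRequest
import org.opensearch.client.Client

// Minimal sketch of an optimistic-concurrency metadata write. seqNo and
// primaryTerm come from the metadata read at the start of the workflow; if a
// concurrent workflow (e.g. snapshot creation) has written the document in
// the meantime, this request fails with VersionConflictEngineException
// instead of silently overwriting the newer state.
fun writeMetadata(client: Client, seqNo: Long, primaryTerm: Long) {
    val request = IndexRequest(".ism-config")
        .id("example-sm-policy-sm-metadata") // hypothetical document id
        .source(mapOf("creation" to mapOf("current_state" to "CREATION_FINISHED")))
        .setIfSeqNo(seqNo)         // stale if another workflow wrote first
        .setIfPrimaryTerm(primaryTerm)
    client.index(request).actionGet()
}
```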
Currently, we send a notification to users on metadata update failures. This is a false alarm, as it's an internal error rather than a user-facing issue that requires action.
Lines 124 to 127 in eb6afa8
```kotlin
} catch (ex: Exception) {
    val message = "There was an exception at ${now()} while executing Snapshot Management policy ${job.policyName}, please check logs."
    job.notificationConfig?.sendFailureNotification(client, job.policyName, message, job.user, log)
    @Suppress("InstanceOfCheckForException")
```
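One possible direction for a fix (a sketch only, not the plugin's actual change): unwrap the exception and suppress the user-facing notification when the root cause is a version conflict on the metadata document. ExceptionsHelper.unwrapCause and VersionConflictEngineException are existing OpenSearch classes; the helper function itself is hypothetical:

```kotlin
import org.opensearch.ExceptionsHelper
import org.opensearch.index.engine.VersionConflictEngineException

// Hypothetical guard: a version conflict on the shared metadata document is
// an internal race between the creation and deletion workflows, not a
// failure the user can act on, so it should not trigger a notification.
fun isInternalMetadataConflict(ex: Exception): Boolean =
    ExceptionsHelper.unwrapCause(ex) is VersionConflictEngineException
```

The catch block above could then consult this guard before calling sendFailureNotification, while still logging the conflict so the retry behavior stays observable.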
How can one reproduce the bug?
1. Set up a snapshot management policy with both creation and deletion operations (a policy sketch follows below).
2. Configure a notification channel for the policy.
3. Run the policy and observe that a failure notification is sent whenever the internal metadata update hits a version conflict.
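Hitting the race reliably requires the creation and deletion workflows to overlap. A sketch of a policy that schedules both aggressively, assuming the Snapshot Management REST API (POST _plugins/_sm/policies/&lt;policy_name&gt;); the repository name and channel id are placeholders:

```json
POST _plugins/_sm/policies/example-sm-policy
{
  "description": "Creation and deletion on overlapping schedules",
  "creation": {
    "schedule": { "cron": { "expression": "* * * * *", "timezone": "UTC" } }
  },
  "deletion": {
    "schedule": { "cron": { "expression": "* * * * *", "timezone": "UTC" } },
    "condition": { "max_count": 5 }
  },
  "snapshot_config": { "repository": "example-repo", "indices": "*" },
  "notification": {
    "channel": { "id": "example-channel-id" },
    "conditions": { "failure": true }
  }
}
```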
What is the expected behavior?
The system should not send false positive notifications to users for internal metadata update failures.
Do you have any screenshots?
```
[2024-12-19T02:49:12,259][ERROR][c.o.i.s.e.SMStateMachine [xxxxxx]] [c15aefb119d1092fc32d73e9e5ef8c22] Failed to update metadata.
[.ism-config/QHnWuqpwS46e7r0qCLwuNQ][[.ism-config][4]] VersionConflictEngineException[[xxxxxx-sm-metadata]: version conflict, required seqNo [754565], primary term [1]. current document has seqNo [754720] and primary term [1]]
[2024-12-19T02:49:12,259][INFO ][o.o.n.s.SendMessageActionHelper] [c15aefb119d1092fc32d73e9e5ef8c22] notifications:getSingleConfig-get snapshot-error-notification
[2024-12-19T02:49:12,939][INFO ][o.o.n.s.SendMessageActionHelper] [c15aefb119d1092fc32d73e9e5ef8c22] notifications:sendMessage:statusCode=200, statusText=Success, message id: a3db63d7-295e-5608-b188-3c0aa2b6a1c2
[2024-12-19T02:49:12,941][WARN ][o.o.i.u.JobSchedulerUtils] [c15aefb119d1092fc32d73e9e5ef8c22] Could not release lock for job xxxxxx-sm-policy
[2024-12-19T02:49:12,941][ERROR][o.o.i.s.SMRunner ] [c15aefb119d1092fc32d73e9e5ef8c22] Could not release lock [.ism-config-xxxxxx-sm-policy] for xxxxxx-sm-policy.
```
Do you have any additional context?