-
Notifications
You must be signed in to change notification settings - Fork 365
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Snapshot controller cannot recover from missing volume snapshot class error #333
Comments
Thoughts which solution is preferred? |
/assign |
I'm thinking that if a contentObj is nil, we should always try to create a new content. I think we can get rid of this check here ( @yuxiangqian what do you think? |
@xing-yang the error status will still not be cleared even if we remove the check and proceed with VS Content object creation. That may mislead user to think volume object has an vsc missing error? |
@saikat-royc I think we need to continue to work on this PR which is to update snapshot status based on the content status: #284 |
On a success, can we clear the error? |
Yes. We should handle that in updateSnapshotStatus. |
Yeah, that might make sense. Or at least retry unless we know it's an unrecoverable error (if such an error exists?) rather than retrying only on certain known errors as the current code does in syncUnreadySnapshot. |
+1 on always retrying no matter what the error is |
Thanks for all the input. I think the next steps is to make #284 handle clearing out errors from volume snapshot object, and as part of this issue #333 remove the if check and retry unconditionally. |
@saikat-royc #284 is merged now. Do you want to work on the 2nd part that is to remove the if check and retry unconditionally? |
Yes will do @xing-yang |
In the current implementation of the snapshot controller, in checkAndUpdateSnapshotClass()
if a missing volume snapshot class is detected, an error status is stamped on the volume snapshot object.
Periodic sync, does not clear the error status. Side effect of this is that, even if the volume snapshot class is detected in the subsequent resyncs, syncUnreadySnapshot() never triggers the snapshot content creation object. Because following condition never evaluates to true (snapshot.Status == nil || snapshot.Status.Error == nil || isControllerUpdateFailError(snapshot.Status.Error)), and the volume snapshot workflow is stuck.
Possible fixes:
Do not update any error status on the volume object i.e skip calling updateSnapshotErrorStatusWithEvent() from checkAndUpdateSnapshotClass(), and only log an error message. The state machine would fail gracefully while creating a VS content object.
Do not update the error status on volume object, but generate an event. (this needs additional changes to ensure that only 1 event is generated, maybe stamp an annotation of missing VSC on the volume object, before generating event)
Update the error status as it is done today, but when we detect a VSC in subsequent resync clear the error status (this needs to ensure we check the error msg reason and clear only the VSC missing error status)
Update error status for missing VSC as it is done today, but handle this VSC missing error in syncUnreadySnapshot() and proceed with VS content creation. VS content creation would fail gracefully if the VSC is still missing.
The text was updated successfully, but these errors were encountered: