Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Snapshot controller cannot recover from missing volume snapshot class error #333

Closed
saikat-royc opened this issue Jul 15, 2020 · 13 comments · Fixed by #335
Closed

Snapshot controller cannot recover from missing volume snapshot class error #333

saikat-royc opened this issue Jul 15, 2020 · 13 comments · Fixed by #335
Assignees

Comments

@saikat-royc
Copy link
Contributor

saikat-royc commented Jul 15, 2020

In the current implementation of the snapshot controller, in checkAndUpdateSnapshotClass()
if a missing volume snapshot class is detected, an error status is stamped on the volume snapshot object.
Periodic sync, does not clear the error status. Side effect of this is that, even if the volume snapshot class is detected in the subsequent resyncs, syncUnreadySnapshot() never triggers the snapshot content creation object. Because following condition never evaluates to true (snapshot.Status == nil || snapshot.Status.Error == nil || isControllerUpdateFailError(snapshot.Status.Error)), and the volume snapshot workflow is stuck.

Possible fixes:

  1. Do not update any error status on the volume object i.e skip calling updateSnapshotErrorStatusWithEvent() from checkAndUpdateSnapshotClass(), and only log an error message. The state machine would fail gracefully while creating a VS content object.

  2. Do not update the error status on volume object, but generate an event. (this needs additional changes to ensure that only 1 event is generated, maybe stamp an annotation of missing VSC on the volume object, before generating event)

  3. Update the error status as it is done today, but when we detect a VSC in subsequent resync clear the error status (this needs to ensure we check the error msg reason and clear only the VSC missing error status)

  4. Update error status for missing VSC as it is done today, but handle this VSC missing error in syncUnreadySnapshot() and proceed with VS content creation. VS content creation would fail gracefully if the VSC is still missing.

@saikat-royc
Copy link
Contributor Author

Thoughts which solution is preferred?
@msau42 @mattcary @xing-yang

@saikat-royc
Copy link
Contributor Author

/assign

@mattcary
Copy link
Contributor

With #4 would the error status stay on the VS? If so that seems wrong, if not it seems equivalent to #3 (?)

Anyway #3 seems best to me with my limited knowledge of the situation.

@xing-yang
Copy link
Collaborator

I'm thinking that if a contentObj is nil, we should always try to create a new content. I think we can get rid of this check here (
https://github.com/kubernetes-csi/external-snapshotter/blob/v2.2.0-rc1/pkg/common-controller/snapshot_controller.go#L456). At this point, we already know it is dynamic provisioning but a contentObj is not created yet. So I think we should always retry to create a new content here.

@yuxiangqian what do you think?

@saikat-royc
Copy link
Contributor Author

@xing-yang the error status will still not be cleared even if we remove the check and proceed with VS Content object creation. That may mislead user to think volume object has an vsc missing error?

@xing-yang
Copy link
Collaborator

@saikat-royc I think we need to continue to work on this PR which is to update snapshot status based on the content status: #284

@msau42
Copy link
Collaborator

msau42 commented Jul 15, 2020

On a success, can we clear the error?

@xing-yang
Copy link
Collaborator

xing-yang commented Jul 15, 2020

Yes. We should handle that in updateSnapshotStatus.

@mattcary
Copy link
Contributor

Yeah, that might make sense. Or at least retry unless we know it's an unrecoverable error (if such an error exists?) rather than retrying only on certain known errors as the current code does in syncUnreadySnapshot.

@msau42
Copy link
Collaborator

msau42 commented Jul 15, 2020

+1 on always retrying no matter what the error is

@saikat-royc
Copy link
Contributor Author

saikat-royc commented Jul 15, 2020

Thanks for all the input. I think the next steps is to make #284 handle clearing out errors from volume snapshot object, and as part of this issue #333 remove the if check and retry unconditionally.
@xing-yang I have put my comments in your patch 284.

@xing-yang
Copy link
Collaborator

@saikat-royc #284 is merged now. Do you want to work on the 2nd part that is to remove the if check and retry unconditionally?

@saikat-royc
Copy link
Contributor Author

Yes will do @xing-yang

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants