Snapshot controller cannot recover from missing volume snapshot class error #333

saikat-royc · 2020-07-15T16:36:34Z

In the current implementation of the snapshot controller, if checkAndUpdateSnapshotClass() detects a missing volume snapshot class, it stamps an error status on the volume snapshot object.
The periodic sync does not clear this error status. As a side effect, even if the volume snapshot class is found in a subsequent resync, syncUnreadySnapshot() never triggers creation of the snapshot content object, because the following condition never evaluates to true: (snapshot.Status == nil || snapshot.Status.Error == nil || isControllerUpdateFailError(snapshot.Status.Error)). As a result, the volume snapshot workflow is stuck.
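A minimal sketch of why the gate stays closed, assuming simplified stand-in types rather than the real API structs. The updateFailMsg constant, the error message text, and the mayCreateContent helper are hypothetical names used only for illustration; only the shape of the condition comes from the text above.

```go
package main

import (
	"fmt"
	"strings"
)

// SnapshotError, SnapshotStatus and Snapshot are simplified stand-ins
// for the real API types (assumption, not the actual structs).
type SnapshotError struct {
	Message string
}

type SnapshotStatus struct {
	Error *SnapshotError
}

type Snapshot struct {
	Status *SnapshotStatus
}

// updateFailMsg is a hypothetical marker for "controller failed to update"
// errors; the real controller may identify them differently.
const updateFailMsg = "snapshot controller failed to update"

func isControllerUpdateFailError(e *SnapshotError) bool {
	return e != nil && strings.Contains(e.Message, updateFailMsg)
}

// mayCreateContent mirrors the condition quoted in the issue.
func mayCreateContent(s *Snapshot) bool {
	return s.Status == nil || s.Status.Error == nil ||
		isControllerUpdateFailError(s.Status.Error)
}

func main() {
	fresh := &Snapshot{}
	stuck := &Snapshot{Status: &SnapshotStatus{
		Error: &SnapshotError{Message: "failed to get volume snapshot class"},
	}}
	retry := &Snapshot{Status: &SnapshotStatus{
		Error: &SnapshotError{Message: updateFailMsg + ": conflict"},
	}}

	fmt.Println(mayCreateContent(fresh)) // true: no status yet
	fmt.Println(mayCreateContent(stuck)) // false: stale class error blocks creation
	fmt.Println(mayCreateContent(retry)) // true: update-fail errors are retried
}
```

Once the class-missing error is stamped, nothing in this path resets Status.Error, so the second case stays false on every resync.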

Possible fixes:

1. Do not update any error status on the volume object, i.e. skip calling updateSnapshotErrorStatusWithEvent() from checkAndUpdateSnapshotClass() and only log an error message. The state machine would fail gracefully while creating a VS content object.
2. Do not update the error status on the volume object, but generate an event. (This needs additional changes to ensure that only one event is generated, maybe by stamping an annotation for the missing VSC on the volume object before generating the event.)
3. Update the error status as is done today, but clear it when a VSC is detected in a subsequent resync. (This needs to check the error message reason and clear only the VSC-missing error status.)
4. Update the error status for a missing VSC as is done today, but handle this VSC-missing error in syncUnreadySnapshot() and proceed with VS content creation. VS content creation would fail gracefully if the VSC is still missing.
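The third option (clearing only the VSC-missing error once the class shows up) could be sketched roughly as follows. The types are simplified stand-ins, and classMissingMsg and clearClassMissingError are hypothetical names; the real controller may tag or match this error differently.

```go
package main

import (
	"fmt"
	"strings"
)

// Simplified stand-ins for the real API types (assumption).
type SnapshotError struct {
	Message string
}

type SnapshotStatus struct {
	Error *SnapshotError
}

type Snapshot struct {
	Status *SnapshotStatus
}

// classMissingMsg is a hypothetical marker for the VSC-missing error.
const classMissingMsg = "volume snapshot class not found"

// clearClassMissingError removes the stamped error only when the class
// has been found and the error is the VSC-missing one. It reports
// whether the status was changed.
func clearClassMissingError(s *Snapshot, classFound bool) bool {
	if !classFound || s.Status == nil || s.Status.Error == nil {
		return false
	}
	if !strings.Contains(s.Status.Error.Message, classMissingMsg) {
		return false // leave unrelated errors untouched
	}
	s.Status.Error = nil
	return true
}

func main() {
	s := &Snapshot{Status: &SnapshotStatus{
		Error: &SnapshotError{Message: classMissingMsg + ": demo-class"},
	}}
	fmt.Println(clearClassMissingError(s, false)) // false: class still missing
	fmt.Println(clearClassMissingError(s, true))  // true: error cleared
	fmt.Println(s.Status.Error == nil)            // true
}
```

Matching on message text is fragile, which is one reason a dedicated reason field or annotation might be preferable in a real implementation.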

saikat-royc · 2020-07-15T16:37:30Z

Thoughts on which solution is preferred?
@msau42 @mattcary @xing-yang

saikat-royc · 2020-07-15T16:37:37Z

/assign

mattcary · 2020-07-15T16:42:36Z

With #4 would the error status stay on the VS? If so that seems wrong, if not it seems equivalent to #3 (?)

Anyway #3 seems best to me with my limited knowledge of the situation.

xing-yang · 2020-07-15T18:19:21Z

I'm thinking that if a contentObj is nil, we should always try to create a new content. I think we can get rid of this check here (https://github.com/kubernetes-csi/external-snapshotter/blob/v2.2.0-rc1/pkg/common-controller/snapshot_controller.go#L456). At this point, we already know it is dynamic provisioning but a contentObj is not created yet. So I think we should always retry to create a new content here.

@yuxiangqian what do you think?

saikat-royc · 2020-07-15T18:31:00Z

@xing-yang the error status will still not be cleared even if we remove the check and proceed with VS content object creation. That may mislead the user into thinking the volume object has a VSC-missing error?

xing-yang · 2020-07-15T18:37:45Z

@saikat-royc I think we need to continue to work on this PR which is to update snapshot status based on the content status: #284

msau42 · 2020-07-15T18:37:58Z

On a success, can we clear the error?

xing-yang · 2020-07-15T18:41:44Z

Yes. We should handle that in updateSnapshotStatus.

mattcary · 2020-07-15T18:46:49Z

Yeah, that might make sense. Or at least retry unless we know it's an unrecoverable error (if such an error exists?) rather than retrying only on certain known errors as the current code does in syncUnreadySnapshot.

msau42 · 2020-07-15T19:00:08Z

+1 on always retrying no matter what the error is

saikat-royc · 2020-07-15T19:20:53Z

Thanks for all the input. I think the next step is to make #284 handle clearing out errors from the volume snapshot object, and, as part of this issue (#333), to remove the if check and retry unconditionally.
@xing-yang I have put my comments in your patch 284.
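A rough sketch of the agreed direction, collapsing both parts into one function for illustration: always attempt content creation when no content exists yet, and clear any stale error on success. The Snapshot struct, syncUnreadySketch, and the two create helpers are hypothetical stand-ins, not the controller's actual code.

```go
package main

import (
	"errors"
	"fmt"
)

// Snapshot is a simplified stand-in for the real API type (assumption).
type Snapshot struct {
	ContentName string  // empty until a content object exists
	Error       *string // last stamped error, if any
}

// createFunc stands in for dynamic content creation.
type createFunc func() (string, error)

// syncUnreadySketch retries creation unconditionally, regardless of any
// error already stamped on the snapshot, and clears the error on success.
func syncUnreadySketch(s *Snapshot, create createFunc) error {
	if s.ContentName != "" {
		return nil // content already exists
	}
	name, err := create()
	if err != nil {
		msg := err.Error()
		s.Error = &msg // fail gracefully; the next resync retries
		return err
	}
	s.ContentName = name
	s.Error = nil
	return nil
}

// classMissing simulates creation failing because the class is absent.
func classMissing() (string, error) {
	return "", errors.New("volume snapshot class not found")
}

// classPresent simulates creation succeeding once the class exists.
func classPresent() (string, error) {
	return "snapcontent-demo", nil
}

func main() {
	s := &Snapshot{}
	_ = syncUnreadySketch(s, classMissing)
	fmt.Println(s.Error != nil, s.ContentName == "") // true true
	_ = syncUnreadySketch(s, classPresent)
	fmt.Println(s.Error == nil, s.ContentName) // true snapcontent-demo
}
```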

xing-yang · 2020-07-22T12:58:41Z

@saikat-royc #284 is merged now. Do you want to work on the 2nd part that is to remove the if check and retry unconditionally?

saikat-royc · 2020-07-22T14:38:18Z

Yes will do @xing-yang

k8s-ci-robot assigned saikat-royc Jul 15, 2020

saikat-royc mentioned this issue Jul 15, 2020

Update Error in Snapshot Status #284

Merged

mattcary mentioned this issue Jul 17, 2020

REQUEST: New membership for mattcary kubernetes/org#2043

Closed

6 tasks

saikat-royc mentioned this issue Jul 22, 2020

Call dynamic VS content creation unconditionally #335

Merged

k8s-ci-robot closed this as completed in #335 Jul 24, 2020
