
Provisioner fails with "error syncing claim: node not found" after "final error received, removing pvc" #152

Open
judemars opened this issue Aug 15, 2023 · 20 comments
Labels
  • kind/bug: Categorizes issue or PR as related to a bug.
  • lifecycle/frozen: Indicates that an issue or PR should not be auto-closed due to staleness.
  • sig/storage: Categorizes an issue or PR as relevant to SIG Storage.
  • triage/accepted: Indicates an issue or PR is ready to be actively worked on.

Comments

@judemars

This is a follow-on from #121.

We are still seeing "error syncing claim: node not found" together with "final error received, removing pvc x from claims in progress". @songsunny made a fix for this; however, the fix does not distinguish between a final and a non-final error, so it can cause the PD to leak in this scenario: #139 (comment)

@sunnylovestiramisu
Contributor

/sig storage
/kind bug

@k8s-ci-robot k8s-ci-robot added sig/storage Categorizes an issue or PR as relevant to SIG Storage. kind/bug Categorizes issue or PR as related to a bug. labels Aug 16, 2023
@sunnylovestiramisu
Contributor

Let's continue the conversation here. @msau42 @jsafrane

If the provisioning has started, shouldn't the state change to ProvisioningInBackground?

If it is ProvisioningInBackground and the node is not found, we do not go down this code path: return ctrl.provisionVolumeErrorHandling(ctx, ProvisioningReschedule, err, claim, operation)

Instead, we continue to this code:

err = fmt.Errorf("failed to get target node: %v", err)
ctrl.eventRecorder.Event(claim, v1.EventTypeWarning, "ProvisioningFailed", err.Error())
return ProvisioningNoChange, err
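
A minimal sketch of the conditional being proposed, assuming the library can tell from its claimsInProgress map whether provisioning already started; the inProgress check is the suggested addition, not existing library code:

selectedNode, err := ctrl.nodeLister.Get(nodeName)
if err != nil {
    _, inProgress := ctrl.claimsInProgress.Load(string(claim.UID))
    if apierrs.IsNotFound(err) && !inProgress {
        // Node is gone and nothing is in flight: remove the claim and reschedule.
        return ctrl.provisionVolumeErrorHandling(ctx, ProvisioningReschedule, err, claim, operation)
    }
    // Provisioning already started on the backend (or a non-NotFound error occurred):
    // keep the claim and retry later instead of rescheduling.
    err = fmt.Errorf("failed to get target node: %v", err)
    ctrl.eventRecorder.Event(claim, v1.EventTypeWarning, "ProvisioningFailed", err.Error())
    return ProvisioningNoChange, err
}
// ... provisioning continues with selectedNode ...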

@jsafrane
Contributor

@sunnylovestiramisu that will only work when we expect the missing node to come back online. Is this the case we're trying to solve here? Because if the node was permanently deleted, the provisioner will end up in an endless loop.

To fix the issue for all possible library users, the library should save the node somewhere (a PVC annotation?!) and give it to the provisioner, so it can retry until it gets a final success / error without getting the Node from the API server.

But a whole Node in PVC annotations is really ugly. The CSI provisioner needs just the node name from it, to get the CSINode and the driver's topology labels*. So... should there be an extra call, say ExtractNodeInfo, where the provisioner would take whatever it wants from the SelectedNode (and perhaps even the CSINode) and return a shorter string that the library would save in the PVC? Then the library could give the string to every Provision() call until a final success/error. This is a change of the library API. Is it worth the effort?

*) There is another question: what should happen when the CSINode is missing here: https://github.com/kubernetes-csi/external-provisioner/blob/3739170578f68aaf0594f631cae6d270bbfdc83e/pkg/controller/topology.go#L273
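
For illustration, one possible shape of that extra call; the interface name, method signature, and behavior are assumptions sketched from this comment, not an existing library API:

package controller // sketch only

import (
    "context"

    v1 "k8s.io/api/core/v1"
)

// NodeInfoExtractor is a hypothetical optional interface a provisioner could
// implement in addition to Provisioner.
type NodeInfoExtractor interface {
    // ExtractNodeInfo is called while the selected Node still exists. The
    // provisioner distills whatever it needs (node name, CSINode topology
    // keys, ...) into a short opaque string that the library would persist
    // on the PVC and pass back on every retry of Provision().
    ExtractNodeInfo(ctx context.Context, node *v1.Node) (string, error)
}

For the CSI external-provisioner, the returned string could be as small as the node name plus whatever topology information it needs to rebuild the request later.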

@sunnylovestiramisu
Contributor

@jsafrane we are trying to solve the case where provisioning has already started on the backend but the node got deleted afterwards.

@jsafrane
Contributor

The question is: do you expect the node to come back, and should we optimize for that case? That case is manageable.

@sunnylovestiramisu
Contributor

sunnylovestiramisu commented Aug 24, 2023

A node with the same node name will not come back if it gets deleted, but a node with the same node name and a different UID may come back. Or there may be other cases that I am missing.

@msau42

msau42 commented Aug 24, 2023

There are two main reasons why a Node object may disappear:

  1. The node got autoscaled down. In this case the node is not expected to come back.
  2. The node got preempted and recreated. The node may come back, and often it comes back quickly.

The original motivation for #121 was for the autoscaling case where the node never comes back.

@jsafrane
Contributor

The original motivation for #121 was for the autoscaling case where the node never comes back

That's the hard case. The provisioner lib then needs to store the whole node somewhere, so it can reconstruct it and give it to the provisioner in the next Provision() calls, until it succeeds / fails with a final error.

Since storing the whole node is ugly, we could extend the provisioner API, so the provisioner can distill a shorter string from the Node and then the library would store it (in PVC annotations?) and give it to the provisioner in all subsequent Provision calls.
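
A minimal sketch of the library-side round-trip, assuming a hypothetical annotation key and helper name; this is not existing library code:

package controller // sketch only

import (
    "context"

    v1 "k8s.io/api/core/v1"
    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    "k8s.io/client-go/kubernetes"
)

// annSelectedNodeInfo is an assumed annotation key for the distilled node info.
const annSelectedNodeInfo = "volume.kubernetes.io/selected-node-info"

// saveNodeInfo persists the provisioner-supplied string on the claim so that
// later Provision() retries can run even after the Node object is gone.
func saveNodeInfo(ctx context.Context, client kubernetes.Interface, claim *v1.PersistentVolumeClaim, info string) (*v1.PersistentVolumeClaim, error) {
    newClaim := claim.DeepCopy()
    if newClaim.Annotations == nil {
        newClaim.Annotations = map[string]string{}
    }
    newClaim.Annotations[annSelectedNodeInfo] = info
    return client.CoreV1().PersistentVolumeClaims(newClaim.Namespace).Update(ctx, newClaim, metav1.UpdateOptions{})
}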

@sunnylovestiramisu
Contributor

What happens if a node with the exact same name but a different UID comes back? Does this shorter string provide enough information for the provisioner to distinguish the node and decide what action to take?

@jsafrane
Contributor

The question is whether the selectedNode annotation is still valid when someone deletes a node and creates a new one with the same name. Assuming yes, the new node is still the right one, then I think the shorter string could be used only as a fallback, when the Node object is not in the API server (or in the informer).
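
In other words, the fallback could look roughly like this, reusing the assumed annSelectedNodeInfo annotation from the sketch above; provisionWithStoredNodeInfo is a hypothetical helper:

selectedNode, err := ctrl.nodeLister.Get(nodeName)
if apierrs.IsNotFound(err) {
    // The Node is gone: fall back to the info previously stored on the PVC
    // instead of rescheduling a provisioning that may already be in progress.
    if info, ok := claim.Annotations[annSelectedNodeInfo]; ok {
        return ctrl.provisionWithStoredNodeInfo(ctx, claim, info)
    }
    return ctrl.provisionVolumeErrorHandling(ctx, ProvisioningReschedule, err, claim, operation)
}
// Otherwise the Node exists (same name, possibly a new UID) and is used as before.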

@sunnylovestiramisu
Contributor

sunnylovestiramisu commented Aug 28, 2023

But how do we know whether a node will come back or not? How do we tell from the symptoms whether it will be recreated? Let's say we store the nodeName in annotations.

  1. The node got autoscaled down. In this case the node is not expected to come back.
    <-- We will always see apierrs.IsNotFound(err) return true, and we still have the nodeName in annotations. What do we do next?

  2. The node got pre-emped and recreated. The node may come back, and often it comes back quickly.
    <-- We will see apierrs.IsNotFound(err) return true for a short period of time, and we still have the nodeName in annotations. Do we then use that nodeName to skip ctrl.provisionVolumeErrorHandling(ctx, ProvisioningReschedule, err, claim, operation)?

@msau42

msau42 commented Aug 28, 2023

I don't think it matters whether the node will come back or not. What we return depends on whether the operation returned a final or a non-final error. Whether or not the node exists determines how we retry in the non-final case.

And so the question for the non-final case is: should we retry the CreateVolume() call with the old node (Jan's proposal), or get a new node from the scheduler (what the fix in #139 was trying to do)?

I think the challenge with getting a new node is: what happens if the topology also changes? Then a fix would involve:

  • The driver being able to handle a change in requested topology for an existing operation
  • The provisioner being able to detect that the returned topology doesn't match the request and coordinating a deletion + retry (a rough sketch of such a check follows this list)
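
A rough sketch of that detection step, assuming the CSI spec Go types (github.com/container-storage-interface/spec/lib/go/csi); the helper names are illustrative:

package controller // sketch only

import (
    "github.com/container-storage-interface/spec/lib/go/csi"
)

// topologyMatchesRequisite reports whether the topology the driver actually
// returned still satisfies one of the requisite topologies that were requested;
// if it does not, the provisioner would have to coordinate a deletion + retry.
func topologyMatchesRequisite(vol *csi.Volume, req *csi.TopologyRequirement) bool {
    if req == nil || len(req.GetRequisite()) == 0 {
        return true // nothing was requested, so anything matches
    }
    for _, accessible := range vol.GetAccessibleTopology() {
        for _, requisite := range req.GetRequisite() {
            if containsSegments(accessible.GetSegments(), requisite.GetSegments()) {
                return true
            }
        }
    }
    return false
}

// containsSegments reports whether every requisite segment is present in the
// accessible segments with the same value.
func containsSegments(accessible, requisite map[string]string) bool {
    for k, v := range requisite {
        if accessible[k] != v {
            return false
        }
    }
    return true
}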

@jsafrane
Contributor

jsafrane commented Sep 6, 2023

And so the question is for the non-final case, should we retry the CreateVolume() call with the old node (Jan's proposal), or get a new node from the scheduler (What the fix in #139 was trying to do).

What if there is no node? Someone has deleted the Pod and PVC and provisioning is in progress. The provisioning should continue and eventually succeed or fail, but it needs some node to continue.

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jan 27, 2024
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle rotten
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Feb 26, 2024
@msau42

msau42 commented Feb 27, 2024

/remove-lifecycle rotten

@k8s-ci-robot k8s-ci-robot removed the lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. label Feb 27, 2024
@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label May 27, 2024
@xing-yang

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jun 5, 2024
@xing-yang

/triage accepted

@k8s-ci-robot k8s-ci-robot added the triage/accepted Indicates an issue or PR is ready to be actively worked on. label Jun 5, 2024
@msau42

msau42 commented Aug 2, 2024

/lifecycle frozen

@k8s-ci-robot k8s-ci-robot added the lifecycle/frozen Indicates that an issue or PR should not be auto-closed due to staleness. label Aug 2, 2024