PV remapClaimRefNS was being skipped when there was no snapshot #3708
Conversation
c37b04e to d3b4234 (compare)
I think it looks ok, but from what I can tell it's mostly copy-paste. Is it possible to move the remap logic to a shared function?
pkg/restore/restore.go (outdated)
```
@@ -1035,6 +1035,37 @@ func (ctx *restoreContext) restoreItem(obj *unstructured.Unstructured, groupReso
	default:
		ctx.log.Infof("Restoring persistent volume as-is because it doesn't have a snapshot and its reclaim policy is not Delete.")

		shouldRenamePV, err := shouldRenamePV(ctx, obj, resourceClient)
```
Hmm, since this is the same logic as the first case in the `switch`, I'm wondering if we should consolidate this into a function, or maybe do some of the logic before the `switch`.
The downside of moving it before is that it may be more work even in cases where there is no mapping to take place.
@nrb So I looked into this, but it looks like it would be difficult to drop all of the logic here into one function. Basically the logic added to the default case does 3 things:
1. Check whether we should rename the PV
2. Rename the PV if 1) is true
3. Remap the namespace if we need to

So 1+2 and 3 are somewhat independent of each other. Currently 2) and 3) are done in a different order in the two code paths, but there's really no need for that. However, the part that makes putting all of this code into one func difficult is that in the `hasSnapshot` case, there is some logic required between 1) and 2) which is not there in the default case. One thing I could consolidate is the block of code inside the `if shouldRenamePV` block -- I could put that into a new func `renamePV(...)` which would be called inside the if block in both cases, if that would be enough of an improvement in terms of avoiding duplication.
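For illustration, a rough sketch of what such an extracted rename helper could look like; the function name, signature, and the plain map standing in for Velero's restore-context state are assumptions, not the actual restore.go code:

```go
package main

import (
	"fmt"

	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
)

// renamePV is a hypothetical shared helper: it renames the restored PV object
// and records the old->new name mapping so the matching PVC's volumeName can
// be rewritten later. The map argument stands in for state that the real code
// would keep on the restore context.
func renamePV(pv *unstructured.Unstructured, newName string, renamedPVs map[string]string) {
	oldName := pv.GetName()
	pv.SetName(newName)
	renamedPVs[oldName] = newName
}

func main() {
	pv := &unstructured.Unstructured{Object: map[string]interface{}{
		"apiVersion": "v1",
		"kind":       "PersistentVolume",
		"metadata":   map[string]interface{}{"name": "pv-original"},
	}}

	renamed := map[string]string{}
	renamePV(pv, "pv-original-renamed", renamed)
	fmt.Println(pv.GetName(), renamed)
}
```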
pkg/restore/restore.go (outdated)
```go
		errs.Add(namespace, err)
		return warnings, errs
	}
	if shouldRenamePV {
```
While reviewing this change with others on the team, @dsu-igeek pointed out that although we are renaming the PV, remapping the ClaimRef namespace, and resetting the volume binding info, this doesn't clear the PV source, which means that the new renamed PV will still be pointing to the original backing storage currently in use by the original PV.
I experimented with this locally and saw that there doesn't seem to be anything to prevent multiple PVs from using the same backing storage (if they are on the same node). This means that data can be written by a pod in the original namespace and this can be read by a pod in the remapped namespace (as well as being able to perform other destructive operations).
I don't think it's possible to "clear" the source, as it is necessary to reference a volume type. Should we be provisioning a new PV in this case to go with the remapped namespace and PVC?
Without this, do we get an error or does the PVC get mapped to the original PV? I like the "principle of least surprise". To me, when I do a restore, I expect that I will get my original data back. In the case where we are restoring from a snapshot we get that. If we reconnect to the existing volume then we get that. Those are our existing cases. If we reconnect to the same volume but it's also in use somewhere else, that seems like a recipe for trouble when it works and a weird error when it doesn't (Bridget told me that she initially tested with the pods on the same worker node; in that case the restored pod/PVC/PV was connected to the same directory as described above, but if the pods wound up on different nodes, the restored pod would get an error that it couldn't attach to the storage). If we make a new empty volume, that's probably not what is expected.
@dsu-igeek So without this, the following was the observed behavior (from the issue filed):
What steps did you take and what happened:
Restored a PV (without a snapshot) and the PVC it's bound to into a different namespace. The PVC was restored but "pending", with an error that the PV was bound to another claim. The reason for this is that the ClaimRef was not updated to the new namespace. The ClaimRef remapping is only currently done in restore/restore.go when there's a snapshot. It's also needed in the non-snapshot case.
So basically the PVC post-restore was not usable.
It was bound "to another claim" because its claimRef pointed to a different claim than the PVC that was restored with it.
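To make the fix concrete: the remap boils down to rewriting `spec.claimRef.namespace` on the restored PV when the restore has a namespace mapping for the original namespace. A minimal, self-contained sketch of that idea using the `unstructured` helpers (the function name and example values are illustrative, not the exact Velero code):

```go
package main

import (
	"fmt"

	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
)

// remapClaimRefNamespace rewrites spec.claimRef.namespace on a PV object if
// the restore's namespace mapping contains the PV's original claim namespace.
// This is a sketch of the idea, not the actual velero implementation.
func remapClaimRefNamespace(pv *unstructured.Unstructured, namespaceMapping map[string]string) error {
	ns, found, err := unstructured.NestedString(pv.Object, "spec", "claimRef", "namespace")
	if err != nil || !found {
		return err
	}
	if target, ok := namespaceMapping[ns]; ok {
		return unstructured.SetNestedField(pv.Object, target, "spec", "claimRef", "namespace")
	}
	return nil
}

func main() {
	pv := &unstructured.Unstructured{Object: map[string]interface{}{
		"apiVersion": "v1",
		"kind":       "PersistentVolume",
		"metadata":   map[string]interface{}{"name": "nfs-pv"},
		"spec": map[string]interface{}{
			"claimRef": map[string]interface{}{"namespace": "source-ns", "name": "data-pvc"},
		},
	}}

	if err := remapClaimRefNamespace(pv, map[string]string{"source-ns": "target-ns"}); err != nil {
		panic(err)
	}
	fmt.Println(pv.Object["spec"]) // claimRef now points at target-ns/data-pvc
}
```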
@zubron Regarding the "not clearing the PV source": how is that different from a normal restore (without a namespace change) where the PV and PVC are restored? Since we're not using restic, a new PV isn't provisioned. Doesn't the "no namespace change" use case also result in the same scenario with two PVs using the same backing storage?
The particular use case where we hit this actually relies on the PV not changing its backing store data. We have a "Move PV/PVC" use case where in a migration, we do a backup of PV and PVC without restic or snapshots. Then after shutting down the application using the volume on the source cluster, we restore the backup to the destination cluster, and the transferred PV/PVC definitions allow the destination cluster to access the original (NFS) source volume. This works just fine in the "no namespace change" use case. The bug fixed by this PR was preventing it from working in the namespace change on restore scenario.
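For context, a sketch of the kind of PV object involved in this "move" scenario, assuming an external NFS share; the server, path, and names are placeholders. The point is that the volume source and the claimRef travel with the backed-up object, so the destination cluster reattaches to the same share:

```go
package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

func main() {
	// Illustrative PV for the "move PV/PVC" migration scenario: no snapshot,
	// no restic, and the backing storage is an external NFS share that both
	// the source and destination clusters can reach.
	pv := corev1.PersistentVolume{
		ObjectMeta: metav1.ObjectMeta{Name: "app-data"},
		Spec: corev1.PersistentVolumeSpec{
			PersistentVolumeSource: corev1.PersistentVolumeSource{
				NFS: &corev1.NFSVolumeSource{
					Server: "nfs.example.internal",
					Path:   "/exports/app-data",
				},
			},
			// The claimRef is what this PR's fix rewrites when the restore
			// maps the PVC into a different namespace.
			ClaimRef: &corev1.ObjectReference{
				Namespace: "source-ns",
				Name:      "app-data-pvc",
			},
		},
	}
	fmt.Printf("%s -> %s/%s\n", pv.Name, pv.Spec.ClaimRef.Namespace, pv.Spec.ClaimRef.Name)
}
```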
@sseago If we're restoring into the same namespace then the assumption would be that the PVC in that namespace had been removed or was still there (and we wouldn't overwrite it) and we wouldn't wind up sharing. I think it's still kind of dangerous and we should figure out these scenarios properly. Let's discuss in community meeting how to handle your case.
@dsu-igeek Sounds like a plan. So here are the two scenarios that we're dealing with (one works without this fix, one does not):
1. Backup a namespace with PVCs and PVs in one cluster, no restic or snapshot copy; shut down the applications using the volumes; restore to a different cluster. The PV/PVC pair in the new cluster points to the same underlying storage as in the old cluster, so the PVC was effectively "moved" and no data copy was needed. The new cluster points to the same (external) NFS share the old cluster was using.
2. Same as 1) but with a namespace change on restore. With this fix, everything works just as in 1). Without this fix, the PVC remains "pending" forever. I suspect that the danger being discussed with this fix is still there in the same-namespace case, so I'm not certain that any new danger is introduced with this fix, but we can discuss it on the call next week.
d3b4234 to 03ddf12 (compare)
@dsu-igeek @zubron I've updated the PR to remove the PV rename, so we're just fixing the PV/PVC mapping to handle namespace remapping. Unit tests are updated accordingly, and we're also checking for the expected restore warning in the case where the PV already exists (since velero will discard it).
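As a sketch of the behavior described above (not Velero's actual code or warning text), the "PV already exists" path amounts to checking the target cluster and surfacing a warning instead of creating or renaming the restored copy:

```go
package pvcheck

import (
	"context"
	"fmt"

	apierrors "k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// warnIfPVExists returns a warning string when a PV with the given name is
// already present in the cluster, since the restore keeps the in-cluster
// object and discards the restored copy. The wording is illustrative only.
func warnIfPVExists(ctx context.Context, client kubernetes.Interface, pvName string) (string, error) {
	_, err := client.CoreV1().PersistentVolumes().Get(ctx, pvName, metav1.GetOptions{})
	if apierrors.IsNotFound(err) {
		return "", nil // PV does not exist yet; safe to create the restored copy
	}
	if err != nil {
		return "", err
	}
	return fmt.Sprintf("persistent volume %q already exists in the cluster; the restored copy will be discarded", pvName), nil
}
```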
Signed-off-by: Scott Seago <sseago@redhat.com>
03ddf12 to 8f21ac4 (compare)
LGTM
Thanks, @sseago!
…re-tanzu#3708) Signed-off-by: Scott Seago <sseago@redhat.com>
Signed-off-by: Scott Seago <sseago@redhat.com>
Thank you for contributing to Velero!
Please add a summary of your change
Previously (#3007) a fix was submitted that handles PV ClaimRef remapping and PV renaming when there is a snapshot, but it did not handle the default case of no snapshot, no restic, but still restoring the PV. This extends the same behavior to the default case (no snapshot, no restic, no Delete reclaim policy): the ClaimRef gets its namespace remapped if necessary, and the PV gets renamed if necessary.
Tests included to confirm the fix.
Does your change fix a particular issue?
Fixes #3707
Please indicate you've done the following:
/kind changelog-not-required