
Implement distributed snapshotting #585

Conversation

@nearora-msft (Contributor) commented Aug 30, 2021

What type of PR is this?

/kind feature

What this PR does / why we need it:
This PR allows the snapshot sidecar controller to be deployed on every node alongside CSI drivers that handle local volumes:

  1. Added a "--enable-distributed-snapshotting" command line option to the snapshot controller, which must be set to true to enable distributed snapshotting.
  2. Added a "--node-deployment" command line option to the snapshotter sidecar, which must be set to true when the sidecar is deployed on a per-node basis.
  3. For these changes to work, the NODE_NAME environment variable must also be set when deploying the sidecar controller.
  4. With these changes, the common snapshot controller checks the node affinity of the PV backing the given VolumeSnapshot, matches it against the list of nodes to find the matching node, and adds that node's name to the VolumeSnapshotContent as the label "snapshot.storage.kubernetes.io/managed-by=nodeName" (see the sketch after this list).
  5. The locally deployed sidecar controller filters VolumeSnapshotContent objects on this label and only processes objects whose label matches the node the sidecar is deployed on.
  6. This change also adds more rules to the RBAC settings for the common snapshot-controller so that it can get/list node-related information.
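
A minimal Go sketch of the matching and labeling flow from items 4 and 5 (illustrative only, not the PR's exact code; the function name findManagedByNode and the use of the nodeaffinity helper are assumptions):

```go
package main

import (
	v1 "k8s.io/api/core/v1"
	"k8s.io/component-helpers/scheduling/corev1/nodeaffinity"
)

// VolumeSnapshotContentManagedByLabel is the label key described above.
const VolumeSnapshotContentManagedByLabel = "snapshot.storage.kubernetes.io/managed-by"

// findManagedByNode returns the name of the first node whose labels satisfy the
// PV's required node affinity, or "" if the PV carries no node affinity.
func findManagedByNode(pv *v1.PersistentVolume, nodes []v1.Node) (string, error) {
	if pv.Spec.NodeAffinity == nil || pv.Spec.NodeAffinity.Required == nil {
		return "", nil
	}
	selector, err := nodeaffinity.NewNodeSelector(pv.Spec.NodeAffinity.Required)
	if err != nil {
		return "", err
	}
	for _, node := range nodes {
		if selector.Match(&node) {
			// The controller would set this name as the value of the
			// managed-by label on the VolumeSnapshotContent object.
			return node.Name, nil
		}
	}
	return "", nil
}
```

Since a local volume's required node affinity normally selects exactly one node, returning the first match is effectively returning the only match (this point comes up again in the review discussion below).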

Which issue(s) this PR fixes:

Fixes #484

Special notes for your reviewer:
Pending changes: Unit tests and documentation.

Does this PR introduce a user-facing change?:

Adds support for distributed snapshotting.

@k8s-ci-robot (Contributor)

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@k8s-ci-robot k8s-ci-robot added release-note Denotes a PR that will be considered when it comes time to generate release notes. do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. kind/feature Categorizes issue or PR as related to a new feature. labels Aug 30, 2021
@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. labels Aug 30, 2021
@nearora-msft (Contributor, Author)

/test all

@nearora-msft nearora-msft marked this pull request as ready for review August 30, 2021 20:58
@k8s-ci-robot k8s-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Aug 30, 2021
@xing-yang (Collaborator)

/assign @yuxiangqian

@xing-yang (Collaborator)

Does csi hostpath driver support distributed snapshotting?

@nearora-msft (Contributor, Author)

Does csi hostpath driver support distributed snapshotting?

I tested these changes with csi hostpath driver and it works as expected.

README.md Outdated
@@ -165,6 +165,8 @@ Read more about how to install the example webhook [here](deploy/kubernetes/webh

* `--worker-threads`: Number of worker threads for running create snapshot and delete snapshot operations. Default value is 10.

* `--node-deployment`: Enables deploying the sidecar controller together with a CSI driver on nodes to manage node-local volumes. Off by default.

Collaborator:
Can you add more instructions in the "Usage" section on how to use this properly.
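
As background for this thread, a minimal sketch (assumed, not code from this PR) of how a sidecar running with --node-deployment can restrict itself to the VolumeSnapshotContent objects labeled for its node; NODE_NAME and the label key are the ones described in the PR description:

```go
package main

import (
	"os"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/labels"
)

// managedByListOptions builds ListOptions that select only the content objects
// assigned to this node by the common snapshot controller.
func managedByListOptions() metav1.ListOptions {
	node := os.Getenv("NODE_NAME") // must be set when --node-deployment is used
	selector := labels.Set{"snapshot.storage.kubernetes.io/managed-by": node}.String()
	return metav1.ListOptions{LabelSelector: selector}
}
```

In practice such a selector would be plugged into the sidecar's VolumeSnapshotContent informer (for example through a tweak-list-options hook), so the sidecar never sees content objects belonging to other nodes.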

@xing-yang (Collaborator)

I tested these changes with csi hostpath driver and it works as expected.

@nearora-msft Can you please provide more details on how you deployed it and how you tested it? Some testing results would be helpful as well.

@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Oct 6, 2021
@yuxiangqian (Contributor) left a comment

I assume this is just part of the CL?

cmd/csi-snapshotter/main.go (resolved)
if *enableNodeDeployment {
    node := os.Getenv("NODE_NAME")
    if node == "" {
        klog.Fatal("The NODE_NAME environment variable must be set when using --enable-node-deployment.")

Contributor:
should it just exit?

@nearora-msft (Contributor, Author) commented Nov 1, 2021

Should there be a specific exit statement? klog.Fatal() logs the error and then calls os.Exit(255).

pkg/common-controller/snapshot_controller.go (outdated, resolved)
return "", nil
}

for _, node := range nodes.Items {

Contributor:

The fact that the first matching node will ALWAYS be picked to be labeled is a bit concerning, though TBH I do not have a solution. It might result in a situation where all snapshot contents are labeled with the same node name, and the sidecar running on that node becomes a hot spot.

@nearora-msft (Contributor, Author) commented Nov 1, 2021

Can there be multiple nodes that match the node affinity for a local volume?

Reply:
AFAIK there can be only a single node that matches the node affinity
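
For context, the required node affinity on a local PV typically pins the volume to a single hostname, which is why at most one node can match. An illustrative (made-up) example of such an affinity:

```go
package main

import v1 "k8s.io/api/core/v1"

// Typical node affinity on a PV for a local volume: it selects exactly one node
// by hostname (the node name here is made up).
var localPVNodeAffinity = &v1.VolumeNodeAffinity{
	Required: &v1.NodeSelector{
		NodeSelectorTerms: []v1.NodeSelectorTerm{{
			MatchExpressions: []v1.NodeSelectorRequirement{{
				Key:      "kubernetes.io/hostname",
				Operator: v1.NodeSelectorOpIn,
				Values:   []string{"worker-1"},
			}},
		}},
	},
}
```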

pkg/common-controller/snapshot_controller.go (outdated, resolved)
pkg/utils/util.go (outdated, resolved)
@xing-yang (Collaborator)

Hi @nearora-msft Can you please address the comments? Thanks.

@zhucan (Member) commented Oct 25, 2021

@nearora-msft Can you rebase it? I will test it.

@awels commented Nov 1, 2021

@nearora-msft Are you still working on this? If not, I would be willing to complete the work.

@nearora-msft (Contributor, Author)

@nearora-msft Are you still working on this? If not, I would be willing to complete the work.

Yes, sorry about the delay. Still working on it, and planning to address the comments today.

@nearora-msft (Contributor, Author)

Hi @nearora-msft Can you please address the comments? Thanks.

Yes, will do

@k8s-ci-robot k8s-ci-robot added size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. and removed needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. labels Nov 1, 2021
@nearora-msft nearora-msft force-pushed the implement-distributed-snapshotting branch from 0db781e to b94354f Compare November 1, 2021 16:32
@nearora-msft (Contributor, Author)

/retest

@nearora-msft nearora-msft force-pushed the implement-distributed-snapshotting branch from 21ef58a to b5680b9 Compare December 21, 2021 03:51
@nearora-msft (Contributor, Author)

@nearora-msft Can you address the review comments? I'd like to get this in the 5.0.0 release. Thanks.

Addressed the comments and tested with the latest revision.

@nearora-msft nearora-msft force-pushed the implement-distributed-snapshotting branch from 0ce0da3 to 73543dc Compare December 22, 2021 17:45
@xing-yang (Collaborator)

/test pull-kubernetes-csi-external-snapshotter-1-23-on-kubernetes-master

@xing-yang (Collaborator)

/test pull-kubernetes-csi-external-snapshotter-alpha-1-22-on-kubernetes-1-22

@@ -106,7 +106,8 @@ const (
VolumeSnapshotContentInvalidLabel = "snapshot.storage.kubernetes.io/invalid-snapshot-content-resource"
// VolumeSnapshotInvalidLabel is applied to invalid snapshot as a label key. The value does not matter.
// See https://github.com/kubernetes/enhancements/blob/master/keps/sig-storage/177-volume-snapshot/tighten-validation-webhook-crd.md#automatic-labelling-of-invalid-objects
VolumeSnapshotInvalidLabel          = "snapshot.storage.kubernetes.io/invalid-snapshot-resource"
VolumeSnapshotContentManagedByLabel = "snapshot.storage.kubernetes.io/managed-by"

Collaborator:

Can you add a comment to explain what this label is for?

Collaborator:

Please add something like this to clarify what is in this label: It specifies the node name which handles the snapshot for the volume local to that node.

@nearora-msft (Contributor, Author)

Done
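
The added doc comment presumably reads along these lines (wording approximate, not necessarily the exact merged text; the package name below is assumed):

```go
package utils

const (
	// VolumeSnapshotContentManagedByLabel is applied by the common snapshot controller
	// when distributed snapshotting is enabled. Its value is the name of the node that
	// handles the snapshot for the volume local to that node; the sidecar deployed on
	// that node only processes content objects carrying its own node name.
	VolumeSnapshotContentManagedByLabel = "snapshot.storage.kubernetes.io/managed-by"
)
```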

@nearora-msft nearora-msft force-pushed the implement-distributed-snapshotting branch from 73543dc to fa1110c Compare December 23, 2021 02:08
metricsPath = flag.String("metrics-path", "/metrics", "The HTTP path where prometheus metrics will be exposed. Default is `/metrics`.")
retryIntervalStart = flag.Duration("retry-interval-start", time.Second, "Initial retry interval of failed volume snapshot creation or deletion. It doubles with each failure, up to retry-interval-max. Default is 1 second.")
retryIntervalMax = flag.Duration("retry-interval-max", 5*time.Minute, "Maximum retry interval of failed volume snapshot creation or deletion. Default is 5 minutes.")
enableNodeDeployment = flag.Bool("node-deployment", false, "Enables deploying the sidecar controller together with a CSI driver on nodes to manage snapshots for node-local volumes.")

Contributor:

Does it make sense to run the snapshotter with --node-deployment and without --enable-distributed-snapshotting?

If no, then --node-deployment can be removed and it can be enabled together with --enable-distributed-snapshotting.

Collaborator:

We need to have a flag to determine whether the sidecar is enabled for distributed snapshotting and it needs to be disabled by default. With a new feature like this, we can't enable it by default.

@nearora-msft, can you add a section in README with the title "Distributed Snapshotting", and clarify both flags are needed for this to work?
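
A minimal sketch (assumed wiring, not the PR's exact code) of how the two flags divide responsibilities: the common snapshot-controller only watches Nodes and applies the managed-by label when --enable-distributed-snapshotting is set, while the sidecar's --node-deployment flag turns on the per-node filtering shown earlier:

```go
package main

import (
	"flag"
	"time"

	"k8s.io/client-go/informers"
	coreinformers "k8s.io/client-go/informers/core/v1"
	"k8s.io/client-go/kubernetes"
)

var enableDistributedSnapshotting = flag.Bool("enable-distributed-snapshotting", false,
	"Enables the per-node sidecars to handle snapshots of the volumes local to their nodes. Off by default.")

// maybeNodeInformer returns a Node informer only when distributed snapshotting is
// enabled; otherwise the controller never lists/watches Nodes and never sets the
// managed-by label, so a centrally deployed sidecar keeps working unchanged.
func maybeNodeInformer(client kubernetes.Interface) coreinformers.NodeInformer {
	if !*enableDistributedSnapshotting {
		return nil
	}
	factory := informers.NewSharedInformerFactory(client, 15*time.Minute)
	return factory.Core().V1().Nodes()
}
```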


@nearora-msft nearora-msft force-pushed the implement-distributed-snapshotting branch from fa1110c to bd102f6 Compare December 24, 2021 19:31
@nearora-msft nearora-msft force-pushed the implement-distributed-snapshotting branch 2 times, most recently from ddafcc5 to 21fc337 Compare December 24, 2021 20:17
@xing-yang (Collaborator)

/test pull-kubernetes-csi-external-snapshotter-1-23-on-kubernetes-master

@xing-yang (Collaborator)

/test pull-kubernetes-csi-external-snapshotter-alpha-1-22-on-kubernetes-1-22

@xing-yang (Collaborator)

/lgtm
/approve

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Dec 25, 2021
@k8s-ci-robot (Contributor)

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: nearora-msft, xing-yang

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Dec 25, 2021
@k8s-ci-robot k8s-ci-robot merged commit 7bc7d91 into kubernetes-csi:master Dec 25, 2021
@xing-yang (Collaborator)

@nearora-msft I'm merging the PR. Thanks for your hard work!

Can you add some description for this enable-distributed-snapshotting flag in snapshot controller in your PR description?

Please also backport this PR to release-5.0 branch.

@nearora-msft (Contributor, Author)

@nearora-msft I'm merging the PR. Thanks for your hard work!

Can you add some description for this enable-distributed-snapshotting flag in snapshot controller in your PR description?

Please also backport this PR to release-5.0 branch.

Sure

k8s-ci-robot added a commit that referenced this pull request Dec 28, 2021
@awels commented Jan 4, 2022

Yes @nearora-msft, thank you for the work. Have you thought about the reverse problem? Now that we can properly take snapshots, we need to restore them. In my testing with WFFC storage I found that we have a similar problem on restore (the same problem exists for CSI clone): what happens if the scheduler schedules the pod that uses the new PVC onto a node that is not the node where the snapshot resides?

There are two options for solving it, but only one makes good sense IMO:

  1. The CSI driver has some mechanism for copying data from other nodes on restore. I don't think CSI drivers want to be in the business of copying data from node to node.
  2. Make the k8s scheduler aware of distributed snapshotting/CSI clone: implement a node filter based on the dataSource of the PVC so that the scheduler only schedules the pod onto the node that contains the dataSource.

Option 2 makes the most sense to me, but I am writing this just in case I completely missed something obvious that solves the restore problem.

@nearora-msft (Contributor, Author) commented Jan 4, 2022

@awels Yes, 2 makes more sense to me as well.

As per this implementation of distributed provisioning, for local volumes, if the binding mode is set to WaitForFirstConsumer, the PVC will contain the name of the node in the selectedNode annotation and the volume will be created on that node. I can see this annotation being set here.

We could probably do something similar for snapshot restore: set the same annotation regardless of the binding mode. The annotation would contain the node that is handling the snapshot, and the external-provisioner would make sure that the volume is restored on that node. We'd probably have to make changes in the external-provisioner as well.

@pohly please correct me if I didn't interpret this correctly.
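
A speculative sketch of the idea above (nothing here is implemented in this PR; the helper below is hypothetical, while volume.kubernetes.io/selected-node is the annotation external-provisioner already honors for WaitForFirstConsumer):

```go
package main

import v1 "k8s.io/api/core/v1"

// annSelectedNode is the selected-node annotation used for delayed binding.
const annSelectedNode = "volume.kubernetes.io/selected-node"

// pinRestoreToSnapshotNode (hypothetical) copies the node that handles the source
// snapshot into the restoring PVC's selected-node annotation, so provisioning and
// scheduling land on the node where the snapshot data actually lives.
func pinRestoreToSnapshotNode(pvc *v1.PersistentVolumeClaim, managedByNode string) {
	if managedByNode == "" {
		return
	}
	if pvc.Annotations == nil {
		pvc.Annotations = map[string]string{}
	}
	pvc.Annotations[annSelectedNode] = managedByNode
}
```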

@awels commented Jan 4, 2022

As per this implementation of distributed provisioning, for local volumes, if the binding mode is set to WaitForFirstConsumer, the PVC will contain the name of the node in the selectedNode annotation and the volume will be created on that node. I can see this annotation being set here.

Yes, but AFAICT there is no mechanism to ensure that the nodeName being set by the scheduler is the same node as the dataSource (PVC or snapshot), which is where my filter comment came from. IMO we need to solve the problem in the scheduler, which will then put the right node in the annotation, and from there everything should work. Making a scheduler extension that does this doesn't seem terribly hard; the problem is getting all the pods being created to use the extension when a CSI driver that needs it is in the cluster. That is the part I am struggling with.

@nearora-msft (Contributor, Author)

As per this implementation of distributed provisioning, for local volumes, if the binding mode is set to WaitForFirstConsumer, the PVC will contain the name of the node in the selectedNode annotation and the volume will be created on that node. I can see this annotation being set here.

Yes, but AFAICT there is no mechanism to ensure that the nodeName being set by the scheduler is the same node as the dataSource (PVC or snapshot), which is where my filter comment came from. IMO we need to solve the problem in the scheduler, which will then put the right node in the annotation, and from there everything should work. Making a scheduler extension that does this doesn't seem terribly hard; the problem is getting all the pods being created to use the extension when a CSI driver that needs it is in the cluster. That is the part I am struggling with.

Ah, ok, got it. I can't think of a solution off the top of my head, but I will update the thread if I come up with something.

Labels
approved (Indicates a PR has been approved by an approver from all required OWNERS files.)
cncf-cla: yes (Indicates the PR's author has signed the CNCF CLA.)
kind/feature (Categorizes issue or PR as related to a new feature.)
lgtm ("Looks good to me", indicates that a PR is ready to be merged.)
release-note (Denotes a PR that will be considered when it comes time to generate release notes.)
size/XL (Denotes a PR that changes 500-999 lines, ignoring generated files.)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

distributed snapshotting
7 participants