
Implement distributed snapshotting #585

Conversation

@nearora-msft (Contributor) commented Aug 30, 2021

What type of PR is this?

/kind feature

What this PR does / why we need it:
This PR allows the snapshot sidecar controller to be deployed on every node alongside CSI drivers that handle local volumes:

  1. Added a "--enable-distributed-snapshotting" command line option to the snapshot controller, which must be set to true to enable distributed snapshotting.
  2. Added a "--node-deployment" command line option to the snapshotter sidecar, which must be set to true when the sidecar is deployed on a per-node basis.
  3. For these changes to work, the NODE_NAME environment variable must also be set when deploying the sidecar controller.
  4. With these changes, the common snapshot controller checks the node affinity of the PV backing the given VolumeSnapshot, matches it against the list of nodes to find the matching node, and adds that node's name to the VolumeSnapshotContent as the label "snapshot.storage.kubernetes.io/managed-by=nodeName" (see the sketch after this list).
  5. The locally deployed sidecar controller filters VolumeSnapshotContent objects on this label and only processes objects whose label matches the node the sidecar is deployed on.
  6. This change also adds more rules to the RBAC settings for the common snapshot-controller so that it can get/list node-related information.
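
A minimal Go sketch of the matching and labeling flow from items 4 and 5 (illustrative only, not the PR's exact code; the function name findManagedByNode and the use of the nodeaffinity helper are assumptions):

```go
package main

import (
	v1 "k8s.io/api/core/v1"
	"k8s.io/component-helpers/scheduling/corev1/nodeaffinity"
)

// VolumeSnapshotContentManagedByLabel is the label key described above.
const VolumeSnapshotContentManagedByLabel = "snapshot.storage.kubernetes.io/managed-by"

// findManagedByNode returns the name of the first node whose labels satisfy the
// PV's required node affinity, or "" if the PV carries no node affinity.
func findManagedByNode(pv *v1.PersistentVolume, nodes []v1.Node) (string, error) {
	if pv.Spec.NodeAffinity == nil || pv.Spec.NodeAffinity.Required == nil {
		return "", nil
	}
	selector, err := nodeaffinity.NewNodeSelector(pv.Spec.NodeAffinity.Required)
	if err != nil {
		return "", err
	}
	for _, node := range nodes {
		if selector.Match(&node) {
			// The controller would set this name as the value of the
			// managed-by label on the VolumeSnapshotContent object.
			return node.Name, nil
		}
	}
	return "", nil
}
```

Since a local volume's required node affinity normally selects exactly one node, returning the first match is effectively returning the only match (this point comes up again in the review discussion below).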

Which issue(s) this PR fixes:

Fixes #484

Special notes for your reviewer:
Pending changes: Unit tests and documentation.

Does this PR introduce a user-facing change?:

Adds support for distributed snapshotting.

@k8s-ci-robot (Contributor)

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@k8s-ci-robot k8s-ci-robot added release-note Denotes a PR that will be considered when it comes time to generate release notes. do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. kind/feature Categorizes issue or PR as related to a new feature. labels Aug 30, 2021
@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. labels Aug 30, 2021
@nearora-msft (Contributor, Author)

/test all

@nearora-msft nearora-msft marked this pull request as ready for review August 30, 2021 20:58
@k8s-ci-robot k8s-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Aug 30, 2021
@xing-yang (Collaborator)

/assign @yuxiangqian

@xing-yang (Collaborator)

Does csi hostpath driver support distributed snapshotting?

@nearora-msft (Contributor, Author)

Does csi hostpath driver support distributed snapshotting?

I tested these changes with csi hostpath driver and it works as expected.

README.md Outdated
@@ -165,6 +165,8 @@ Read more about how to install the example webhook [here](deploy/kubernetes/webh

* `--worker-threads`: Number of worker threads for running create snapshot and delete snapshot operations. Default value is 10.

* `--node-deployment`: Enables deploying the sidecar controller together with a CSI driver on nodes to manage node-local volumes. Off by default.

Collaborator:
Can you add more instructions in the "Usage" section on how to use this properly.
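
As background for this thread, a minimal sketch (assumed, not code from this PR) of how a sidecar running with --node-deployment can restrict itself to the VolumeSnapshotContent objects labeled for its node; NODE_NAME and the label key are the ones described in the PR description:

```go
package main

import (
	"os"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/labels"
)

// managedByListOptions builds ListOptions that select only the content objects
// assigned to this node by the common snapshot controller.
func managedByListOptions() metav1.ListOptions {
	node := os.Getenv("NODE_NAME") // must be set when --node-deployment is used
	selector := labels.Set{"snapshot.storage.kubernetes.io/managed-by": node}.String()
	return metav1.ListOptions{LabelSelector: selector}
}
```

In practice such a selector would be plugged into the sidecar's VolumeSnapshotContent informer (for example through a tweak-list-options hook), so the sidecar never sees content objects belonging to other nodes.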

@xing-yang (Collaborator)

I tested these changes with csi hostpath driver and it works as expected.

@nearora-msft Can you please provide more details on how you deployed it and how you tested it? Some testing results would be helpful as well.

@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Oct 6, 2021
@yuxiangqian (Contributor) left a comment

I assume this is just part of the CL?

cmd/csi-snapshotter/main.go (resolved)
if *enableNodeDeployment {
    node := os.Getenv("NODE_NAME")
    if node == "" {
        klog.Fatal("The NODE_NAME environment variable must be set when using --enable-node-deployment.")

Contributor:
should it just exit?

@nearora-msft (Contributor, Author) commented Nov 1, 2021

Should there be a specific exit statement? klog.Fatal() logs the error and then calls os.Exit(255).

pkg/common-controller/snapshot_controller.go (outdated, resolved)
return "", nil
}

for _, node := range nodes.Items {

Contributor:

The fact that the first matching node will ALWAYS be picked to be labeled is a bit concerning, though TBH I do not have a solution. It might result in a situation where all snapshot contents are labeled with the same node name, and the sidecar running on that node becomes a hot spot.

@nearora-msft (Contributor, Author) commented Nov 1, 2021

Can there be multiple nodes that match the node affinity for a local volume?

Reply:
AFAIK there can be only a single node that matches the node affinity
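
For context, the required node affinity on a local PV typically pins the volume to a single hostname, which is why at most one node can match. An illustrative (made-up) example of such an affinity:

```go
package main

import v1 "k8s.io/api/core/v1"

// Typical node affinity on a PV for a local volume: it selects exactly one node
// by hostname (the node name here is made up).
var localPVNodeAffinity = &v1.VolumeNodeAffinity{
	Required: &v1.NodeSelector{
		NodeSelectorTerms: []v1.NodeSelectorTerm{{
			MatchExpressions: []v1.NodeSelectorRequirement{{
				Key:      "kubernetes.io/hostname",
				Operator: v1.NodeSelectorOpIn,
				Values:   []string{"worker-1"},
			}},
		}},
	},
}
```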

pkg/common-controller/snapshot_controller.go (outdated, resolved)
pkg/utils/util.go (outdated, resolved)
@xing-yang (Collaborator)

Hi @nearora-msft Can you please address the comments? Thanks.

@zhucan (Member) commented Oct 25, 2021

@nearora-msft Can you rebase it? I will test it.

@awels commented Nov 1, 2021

@nearora-msft Are you still working on this? If not, I would be willing to complete the work.

@nearora-msft (Contributor, Author)

@nearora-msft Are you still working on this? If not, I would be willing to complete the work.

Yes, sorry about the delay. Still working on it, and planning to address the comments today.

@nearora-msft (Contributor, Author)

Hi @nearora-msft Can you please address the comments? Thanks.

Yes, will do

@k8s-ci-robot k8s-ci-robot added size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. and removed needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. labels Nov 1, 2021
@nearora-msft nearora-msft force-pushed the implement-distributed-snapshotting branch from 0db781e to b94354f Compare November 1, 2021 16:32
@nearora-msft (Contributor, Author)

/retest

@nearora-msft nearora-msft force-pushed the implement-distributed-snapshotting branch from 21ef58a to b5680b9 Compare December 21, 2021 03:51
@nearora-msft (Contributor, Author)

@nearora-msft Can you address the review comments? I'd like to get this in the 5.0.0 release. Thanks.

Addressed the comments and tested with the latest revision.

@nearora-msft nearora-msft force-pushed the implement-distributed-snapshotting branch from 0ce0da3 to 73543dc Compare December 22, 2021 17:45
@xing-yang (Collaborator)

/test pull-kubernetes-csi-external-snapshotter-1-23-on-kubernetes-master

@xing-yang (Collaborator)

/test pull-kubernetes-csi-external-snapshotter-alpha-1-22-on-kubernetes-1-22

@@ -106,7 +106,8 @@ const (
VolumeSnapshotContentInvalidLabel = "snapshot.storage.kubernetes.io/invalid-snapshot-content-resource"
// VolumeSnapshotInvalidLabel is applied to invalid snapshot as a label key. The value does not matter.
// See https://github.com/kubernetes/enhancements/blob/master/keps/sig-storage/177-volume-snapshot/tighten-validation-webhook-crd.md#automatic-labelling-of-invalid-objects
VolumeSnapshotInvalidLabel          = "snapshot.storage.kubernetes.io/invalid-snapshot-resource"
VolumeSnapshotContentManagedByLabel = "snapshot.storage.kubernetes.io/managed-by"

Collaborator:

Can you add a comment to explain what this label is for?

Collaborator:

Please add something like this to clarify what is in this label: It specifies the node name which handles the snapshot for the volume local to that node.

@nearora-msft (Contributor, Author)

Done
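
The added doc comment presumably reads along these lines (wording approximate, not necessarily the exact merged text; the package name below is assumed):

```go
package utils

const (
	// VolumeSnapshotContentManagedByLabel is applied by the common snapshot controller
	// when distributed snapshotting is enabled. Its value is the name of the node that
	// handles the snapshot for the volume local to that node; the sidecar deployed on
	// that node only processes content objects carrying its own node name.
	VolumeSnapshotContentManagedByLabel = "snapshot.storage.kubernetes.io/managed-by"
)
```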

@nearora-msft nearora-msft force-pushed the implement-distributed-snapshotting branch from 73543dc to fa1110c Compare December 23, 2021 02:08
metricsPath = flag.String("metrics-path", "/metrics", "The HTTP path where prometheus metrics will be exposed. Default is `/metrics`.")
retryIntervalStart = flag.Duration("retry-interval-start", time.Second, "Initial retry interval of failed volume snapshot creation or deletion. It doubles with each failure, up to retry-interval-max. Default is 1 second.")
retryIntervalMax = flag.Duration("retry-interval-max", 5*time.Minute, "Maximum retry interval of failed volume snapshot creation or deletion. Default is 5 minutes.")
enableNodeDeployment = flag.Bool("node-deployment", false, "Enables deploying the sidecar controller together with a CSI driver on nodes to manage snapshots for node-local volumes.")

Contributor:

Does it make sense to run the snapshotter with --node-deployment and without --enable-distributed-snapshotting?

If no, then --node-deployment can be removed and it can be enabled together with --enable-distributed-snapshotting.

Collaborator:

We need to have a flag to determine whether the sidecar is enabled for distributed snapshotting and it needs to be disabled by default. With a new feature like this, we can't enable it by default.

@nearora-msft, can you add a section in README with the title "Distributed Snapshotting", and clarify both flags are needed for this to work?
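
A minimal sketch (assumed wiring, not the PR's exact code) of how the two flags divide responsibilities: the common snapshot-controller only watches Nodes and applies the managed-by label when --enable-distributed-snapshotting is set, while the sidecar's --node-deployment flag turns on the per-node filtering shown earlier:

```go
package main

import (
	"flag"
	"time"

	"k8s.io/client-go/informers"
	coreinformers "k8s.io/client-go/informers/core/v1"
	"k8s.io/client-go/kubernetes"
)

var enableDistributedSnapshotting = flag.Bool("enable-distributed-snapshotting", false,
	"Enables the per-node sidecars to handle snapshots of the volumes local to their nodes. Off by default.")

// maybeNodeInformer returns a Node informer only when distributed snapshotting is
// enabled; otherwise the controller never lists/watches Nodes and never sets the
// managed-by label, so a centrally deployed sidecar keeps working unchanged.
func maybeNodeInformer(client kubernetes.Interface) coreinformers.NodeInformer {
	if !*enableDistributedSnapshotting {
		return nil
	}
	factory := informers.NewSharedInformerFactory(client, 15*time.Minute)
	return factory.Core().V1().Nodes()
}
```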


@nearora-msft nearora-msft force-pushed the implement-distributed-snapshotting branch from fa1110c to bd102f6 Compare December 24, 2021 19:31
@nearora-msft nearora-msft force-pushed the implement-distributed-snapshotting branch 2 times, most recently from ddafcc5 to 21fc337 Compare December 24, 2021 20:17
@xing-yang (Collaborator)

/test pull-kubernetes-csi-external-snapshotter-1-23-on-kubernetes-master

@xing-yang (Collaborator)

/test pull-kubernetes-csi-external-snapshotter-alpha-1-22-on-kubernetes-1-22

@xing-yang (Collaborator)

/lgtm
/approve

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Dec 25, 2021
@k8s-ci-robot (Contributor)

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: nearora-msft, xing-yang

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Dec 25, 2021
@k8s-ci-robot k8s-ci-robot merged commit 7bc7d91 into kubernetes-csi:master Dec 25, 2021
@xing-yang (Collaborator)

@nearora-msft I'm merging the PR. Thanks for your hard work!

Can you add some description for this enable-distributed-snapshotting flag in snapshot controller in your PR description?

Please also backport this PR to release-5.0 branch.

@nearora-msft (Contributor, Author)

@nearora-msft I'm merging the PR. Thanks for your hard work!

Can you add some description for this enable-distributed-snapshotting flag in snapshot controller in your PR description?

Please also backport this PR to release-5.0 branch.

Sure

k8s-ci-robot added a commit that referenced this pull request Dec 28, 2021
@awels commented Jan 4, 2022

Yes @nearora-msft, thank you for the work. Have you thought about the reverse problem? Now that we can properly take snapshots, we need to restore them. In my testing with WFFC storage I found that we have a similar problem on restore (the same problem exists for CSI clone): what happens if the scheduler schedules the pod that uses the new PVC onto a node that is not the node where the snapshot resides?

There are two options for solving it, but only one makes good sense IMO:

  1. The CSI driver has some mechanism for copying data from other nodes on restore. I don't think CSI drivers want to be in the business of copying data from node to node.
  2. Make the k8s scheduler aware of distributed snapshotting/CSI clone: implement a node filter based on the dataSource of the PVC so that the scheduler only schedules the pod onto the node that contains the dataSource.

Option 2 makes the most sense to me, but I am writing this just in case I completely missed something obvious that solves the restore problem.

@nearora-msft (Contributor, Author) commented Jan 4, 2022

@awels Yes, 2 makes more sense to me as well.

As per this implementation of distributed provisioning, for local volumes, if the binding mode is set to WaitForFirstConsumer, the PVC will contain the name of the node in the selectedNode annotation and the volume will be created on that node. I can see this annotation being set here.

We could probably do something similar for snapshot restore: set the same annotation regardless of the binding mode. The annotation would contain the node that is handling the snapshot, and the external-provisioner would make sure that the volume is restored on that node. We'd probably have to make changes in the external-provisioner as well.

@pohly please correct me if I didn't interpret this correctly.
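
A speculative sketch of the idea above (nothing here is implemented in this PR; the helper below is hypothetical, while volume.kubernetes.io/selected-node is the annotation external-provisioner already honors for WaitForFirstConsumer):

```go
package main

import v1 "k8s.io/api/core/v1"

// annSelectedNode is the selected-node annotation used for delayed binding.
const annSelectedNode = "volume.kubernetes.io/selected-node"

// pinRestoreToSnapshotNode (hypothetical) copies the node that handles the source
// snapshot into the restoring PVC's selected-node annotation, so provisioning and
// scheduling land on the node where the snapshot data actually lives.
func pinRestoreToSnapshotNode(pvc *v1.PersistentVolumeClaim, managedByNode string) {
	if managedByNode == "" {
		return
	}
	if pvc.Annotations == nil {
		pvc.Annotations = map[string]string{}
	}
	pvc.Annotations[annSelectedNode] = managedByNode
}
```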

@awels commented Jan 4, 2022

As per this implementation of distributed provisioning, for local volumes, if the binding mode is set to WaitForFirstConsumer, the PVC will contain the name of the node in the selectedNode annotation and the volume will be created on that node. I can see this annotation being set here.

Yes, but AFAICT there is no mechanism to ensure that the nodeName being set by the scheduler is the same node as the dataSource (PVC or snapshot), which is where my filter comment came from. IMO we need to solve the problem in the scheduler, which will then put the right node in the annotation, and from there everything should work. Making a scheduler extension that does this doesn't seem terribly hard; the problem is getting all the pods being created to use the extension when a CSI driver that needs it is in the cluster. That is the part I am struggling with.

@nearora-msft (Contributor, Author)

As per this implementation of distributed provisioning, for local volumes, if the binding mode is set to WaitForFirstConsumer, the PVC will contain the name of the node in the selectedNode annotation and the volume will be created on that node. I can see this annotation being set here.

Yes, but AFAICT there is no mechanism to ensure that the nodeName being set by the scheduler is the same node as the dataSource (PVC or snapshot), which is where my filter comment came from. IMO we need to solve the problem in the scheduler, which will then put the right node in the annotation, and from there everything should work. Making a scheduler extension that does this doesn't seem terribly hard; the problem is getting all the pods being created to use the extension when a CSI driver that needs it is in the cluster. That is the part I am struggling with.

Ah, ok, got it. I can't think of a solution off the top of my head, but I will update the thread if I come up with something.

Labels
approved (Indicates a PR has been approved by an approver from all required OWNERS files.)
cncf-cla: yes (Indicates the PR's author has signed the CNCF CLA.)
kind/feature (Categorizes issue or PR as related to a new feature.)
lgtm ("Looks good to me", indicates that a PR is ready to be merged.)
release-note (Denotes a PR that will be considered when it comes time to generate release notes.)
size/XL (Denotes a PR that changes 500-999 lines, ignoring generated files.)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

distributed snapshotting
7 participants