Dynamic Flexvolume plugin discovery proposal. #833

Merged
merged 4 commits into from
Aug 22, 2017
140 changes: 140 additions & 0 deletions contributors/design-proposals/flexvolume-deployment.md
@@ -0,0 +1,140 @@
# **Dynamic Flexvolume Plugin Discovery**

## **Objective**

Allow kubelet and controller-manager to recognize new Flexvolume plugins without being restarted manually.

## **Background**

Beginning in version 1.8, the Kubernetes Storage SIG will stop accepting in-tree volume plugins and advises all storage providers to implement out-of-tree plugins. Currently, there are two recommended implementations: Container Storage Interface (CSI) and Flexvolume.

[CSI](https://github.com/container-storage-interface/spec/blob/master/spec.md) provides a single interface that storage vendors can implement in order for their storage solutions to work across many different container orchestrators, and volume plugins are out-of-tree by design. However, this is a large effort and the full implementation of CSI is several quarters away, so storage vendors need an immediate solution to continue adding volume plugins.

[Flexvolume](https://github.com/kubernetes/community/blob/master/contributors/devel/flexvolume.md) is an in-tree plugin that can run any storage solution by executing volume commands against a user-provided driver on the Kubernetes host, and it exists today. However, setting up Flexvolume is a very manual process, which pushes it out of consideration for many users. Problems include having to copy the driver to a specific location on each node, having to manually restart kubelet, and users' limited access to machines.

An automated deployment technique is discussed in [Recommended Deployment Method](#recommended-driver-deployment-method). The crucial change required to enable this method is allowing kubelet and controller manager to dynamically discover plugin changes.


## **Overview**

When the driver directory is modified, a notification is sent to the filesystem watch set up by kubelet or controller-manager. When kubelet or controller-manager then searches for plugins (such as when a volume needs to be mounted) and the watch has signaled a change, it probes the driver directory and loads the currently installed drivers as volume plugins.


## **Detailed Design**
Copy link
Member

@jsafrane jsafrane Jul 28, 2017

(a random point in the proposal to start a new separate thread so the discussion is organized)

Mount propagation

I am preparing mount propagation for release 1.8 + possibility to run mount utilities (mount.glusterfs, /usr/bin/rbd, flex drivers, ...) in pods instead of running them on the host. Initially, I planned not to support flex volumes in the first release and get the implementation solid first, but now that there is someone else preparing dynamic probing of flex plugin it should not be hard to extend it.

See #589 for full details.

tl;dr, the proposal expects that pod with mount utilities will put a unix domain socket into /var/lib/kubelet/plugin-sockets/<plugin name>, e.g. /var/lib/kubelet/plugin-sockets/kubernetes.io/glusterfs. That can be easily extended to flex volumes:

  • Flex volume author puts all necessary utilities, usual /usr/libexec/kubernetes/kubelet-plugins/volume/exec/<vendor>~<driver>/<driver> and a volume-exec daemon (shipped by Kubernetes) into an image.
  • System admin runs the image as a DaemonSet with privileged pods that have shared mount-propagation on /var/lib/kubelet from the pod to the host. Pods in the daemon set run volume-exec daemon with proper parameters (namely the volume plugin name).
    • volume-exec daemon puts a unix socket into /var/lib/kubelet/plugin-sockets/<vendor>~<driver>/<driver> on the host.
  • Probe in kubelet scans plugin-sockets and registers a new flex volume plugin for every socket it finds there. This is the same as discovery of new drivers as designed in this proposal.
  • When kubelet wants to call the driver, it checks if plugin-sockets/<vendor>~<driver>/<driver> socket exists.
    • If the socket does not exist, it uses plain old os.Exec to execute the driver.
    • If the socket exists, it uses a gRPC API provided by volume-exec to execute stuff in the pod that runs volume-exec. As result, /usr/libexec/kubernetes/kubelet-plugins/volume/exec/<vendor>~<driver>/<driver> is executed in the pod. Due to shared mount propagation, the driver can mount stuff and kubelet will see it.
    • (all this is already part of Proposal: containerized mount utilities in pods #589, there will be very little changes to flex volume implementation).

Installation of a driver is quite simple, no need to copy the driver from the pod to the host, creating one socket is enough. Upgrade of the daemon set is more complicated though, as any fuse daemons now run in the pod and not on the host. If the pod is killed during the update all volumes that it served are unmounted. See #589 for details.

To sum it up, if the probe as suggested in this proposal is implemented, it should be fairly easy to extend it to scan also for the sockets. That's the only necessary change and we get completely containerized flex volumes. Question is if we then need to mess with copying drivers from pods to the host as proposed here. On the other hand, #589 is just a proposal and it may be changed and who knows if it catches 1.8 and in what shape.

Copy link
Member

This would be cool. Can consider it for 1.9


In the volume plugin code, introduce a `PluginStub` interface containing a single method `Init()`, and have `VolumePlugin` extend it. Create a `PluginProber` type which extends `PluginStub` and includes methods `Init()` and `Probe()`. Change the type of plugins inside the volume plugin manager's plugin list to `PluginStub`.
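The interface layout described above might look roughly like the following Go sketch. The method signatures (`Init() error`, the three return values of `Probe()`), the `GetPluginName` accessor, and the `fakePlugin`/`fakeProber`/`countKinds` helpers are assumptions made for illustration; the proposal only names the types and methods.

```go
package main

import "fmt"

// PluginStub is the common interface: anything that can sit in the
// plugin manager's list. The error return is an assumption.
type PluginStub interface {
	Init() error
}

// VolumePlugin extends PluginStub. Only a name accessor is shown; the
// real interface has many more methods.
type VolumePlugin interface {
	PluginStub
	GetPluginName() string
}

// PluginProber extends PluginStub with Probe, which reports whether the
// set of plugins changed and, if so, returns the new set.
type PluginProber interface {
	PluginStub
	Probe() (updated bool, plugins []VolumePlugin, err error)
}

// fakePlugin and fakeProber are hypothetical implementations.
type fakePlugin struct{ name string }

func (p *fakePlugin) Init() error           { return nil }
func (p *fakePlugin) GetPluginName() string { return p.name }

type fakeProber struct{ pending []VolumePlugin }

func (p *fakeProber) Init() error { return nil }
func (p *fakeProber) Probe() (bool, []VolumePlugin, error) {
	if p.pending == nil {
		return false, nil, nil
	}
	found := p.pending
	p.pending = nil
	return true, found, nil
}

// countKinds walks a heterogeneous plugin list and tells probers apart
// from ordinary plugins with a type switch.
func countKinds(stubs []PluginStub) (probers, plugins int) {
	for _, s := range stubs {
		switch s.(type) {
		case PluginProber:
			probers++
		case VolumePlugin:
			plugins++
		}
	}
	return probers, plugins
}

func main() {
	stubs := []PluginStub{&fakePlugin{name: "example/static"}, &fakeProber{}}
	probers, plugins := countKinds(stubs)
	fmt.Printf("probers=%d plugins=%d\n", probers, plugins)
}
```

Keeping both kinds behind `PluginStub` lets the existing list machinery hold the prober, at the cost of a type switch wherever the two must be treated differently.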

`Init()` initializes fsnotify, creates a watch on the driver directory as well as its subdirectories (if any), and spawns a goroutine that listens for signals from the watch. When the goroutine receives a signal that a new directory has been created, it creates a watch for that directory so that driver changes can be seen.
Copy link
Member

why use notify, if you are just going to cache the results?

Copy link
Contributor Author

This is discussed in Alternative Design (3).

Copy link
Member

If probe() were called a reasonable amount of times per second, would you reconsider this point? Just trying to bound complexity.

Copy link
Contributor Author

Yes definitely. Unfortunately I was seeing bursts of tens of milliseconds between calls. If we change the Find*() logic so that if there's already a match then don't check Flex, then we don't need a watch.


`Probe()` scans the driver directory only when the goroutine sets a flag. If the flag is set, return true (indicating that new plugins are available) and the list of plugins. Otherwise, return false and nil. After the scan, the watch is refreshed to include the new list of subdirectories. The goroutine should only record a signal if there has been a 1-second delay since the last signal (see [Security Considerations](#security-considerations)). Because inotify (used by fsnotify) can only be used to watch an existing directory, the goroutine needs to maintain the invariant that the driver directory always exists.
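A minimal sketch of the flag-gated `Probe()` described above, using only the Go standard library. The fsnotify watch is replaced by a `markDirty` method that the watch goroutine would call; `dirProber`, `markDirty`, and `probeDemo` are hypothetical names, and returning directory names instead of plugin objects is a simplification.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"sync"
)

// dirProber scans its driver directory only when a change was flagged.
type dirProber struct {
	dir   string
	mu    sync.Mutex
	dirty bool // set by the watch goroutine on a filesystem event
}

// markDirty is what the watch goroutine would call when it records a signal.
func (p *dirProber) markDirty() {
	p.mu.Lock()
	p.dirty = true
	p.mu.Unlock()
}

// Probe returns (false, nil) when nothing was flagged. Otherwise it scans
// the directory, clears the flag, and returns the subdirectory names.
func (p *dirProber) Probe() (bool, []string, error) {
	p.mu.Lock()
	defer p.mu.Unlock()
	if !p.dirty {
		return false, nil, nil
	}
	// Keep the invariant that the driver directory always exists, so a
	// watch can be (re)placed on it.
	if err := os.MkdirAll(p.dir, 0o755); err != nil {
		return false, nil, err
	}
	entries, err := os.ReadDir(p.dir)
	if err != nil {
		return false, nil, err
	}
	var drivers []string
	for _, e := range entries {
		if e.IsDir() {
			drivers = append(drivers, e.Name())
		}
	}
	p.dirty = false
	return true, drivers, nil
}

// probeDemo flags one change, probes twice, and reports what each call saw.
func probeDemo() ([]string, bool, error) {
	dir, err := os.MkdirTemp("", "flex-probe-")
	if err != nil {
		return nil, false, err
	}
	defer os.RemoveAll(dir)
	if err := os.Mkdir(filepath.Join(dir, "vendor~driver"), 0o755); err != nil {
		return nil, false, err
	}
	p := &dirProber{dir: dir}
	p.markDirty()
	_, first, err := p.Probe()
	if err != nil {
		return nil, false, err
	}
	secondUpdated, _, err := p.Probe()
	return first, secondUpdated, err
}

func main() {
	first, secondUpdated, err := probeDemo()
	fmt.Println(first, secondUpdated, err)
}
```

The second call returns without touching the disk, which is the point of the flag: frequent `Probe()` calls stay cheap unless the watch has seen a change.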
Copy link
Member

So, mere presence of a new directory means that there is a new flex driver there? File copy is not an atomic operation and it may happen that only part of the driver script was copied there. You should wait until whole driver has been copied there. And big question is how do you recognize that...

IMO, installation of a new driver (or driver update) should be atomic operation on the fs, e.g. link(2).

Copy link
Member

Jan, excellent point. The only atomic file op is rename, so we should be sure that flex EXPLICITLY ignores files that start with a `.`. The installer must copy to `flex/.../.mydriver` and then rename that to `mydriver`. Let's make this as explicit as possible.

Copy link
Member

That assumes that the driver is a single file. The installer must rename multiple files one by one (and rename does not work well with directories).

I am ok with requiring the driver to be a single file, with no helper scripts around, but it must be clearly defined somewhere in flex documentation and a release note.
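The copy-then-rename install discussed in this thread could be sketched as follows, assuming a single-file driver. `installDriver`, `visibleDrivers`, and `installDemo` are hypothetical helpers, not existing Flexvolume code; the sketch relies only on `os.Rename` being atomic within one filesystem.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"strings"
)

// installDriver writes the driver to a dot-prefixed file in the same
// directory and then renames it into place, so a prober never sees a
// partially written driver under its final name.
func installDriver(driverDir, name string, content []byte) error {
	if err := os.MkdirAll(driverDir, 0o755); err != nil {
		return err
	}
	tmp := filepath.Join(driverDir, "."+name)
	if err := os.WriteFile(tmp, content, 0o755); err != nil {
		os.Remove(tmp) // best-effort cleanup of the partial file
		return err
	}
	return os.Rename(tmp, filepath.Join(driverDir, name))
}

// visibleDrivers lists what a prober should consider: every entry whose
// name does not start with a dot.
func visibleDrivers(driverDir string) ([]string, error) {
	entries, err := os.ReadDir(driverDir)
	if err != nil {
		return nil, err
	}
	var names []string
	for _, e := range entries {
		if strings.HasPrefix(e.Name(), ".") {
			continue
		}
		names = append(names, e.Name())
	}
	return names, nil
}

// installDemo installs one driver next to a stray dot-file and returns
// the names a prober would see.
func installDemo() ([]string, error) {
	dir, err := os.MkdirTemp("", "flex-install-")
	if err != nil {
		return nil, err
	}
	defer os.RemoveAll(dir)
	// A half-copied file left behind by another installer stays invisible.
	if err := os.WriteFile(filepath.Join(dir, ".partial"), []byte("#!/bin"), 0o755); err != nil {
		return nil, err
	}
	if err := installDriver(dir, "mydriver", []byte("#!/bin/sh\necho ok\n")); err != nil {
		return nil, err
	}
	return visibleDrivers(dir)
}

func main() {
	names, err := installDemo()
	fmt.Println(names, err)
}
```

As noted above, this only covers drivers that consist of a single file; helper scripts alongside the driver would each need their own rename and would not appear atomically as a group.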


Iterating through the list of plugins inside `InitPlugins()` from `volume/plugins.go`, if the plugin is an instance of `PluginProber`, only call its `Init()` and nothing else. Add an additional field, `flexVolumePluginList`, in `VolumePluginMgr` as a cache. For every iteration of the plugin list, call `Probe()` and update `flexVolumePluginList` if true is returned, and iterate through the new plugin list. If the return value is false, iterate through the existing `flexVolumePluginList`.
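The cache handling in the plugin manager can be illustrated with a stripped-down sketch. `pluginMgr`, `findByName`, and `scriptedProber` are hypothetical stand-ins (the real `VolumePluginMgr` lookups also reject multiple matches, which is omitted here); only the `flexVolumePluginList` field name comes from the proposal.

```go
package main

import "fmt"

type plugin struct{ name string }

// prober stands in for PluginProber; only Probe matters here.
type prober interface {
	Probe() (updated bool, plugins []plugin)
}

// pluginMgr is a simplified stand-in for VolumePluginMgr.
type pluginMgr struct {
	plugins              []plugin // statically registered plugins
	prober               prober
	flexVolumePluginList []plugin // cache of the last successful probe
}

// findByName probes first, refreshes the cache only if the prober reports
// a change, then searches the static plugins followed by the cached ones.
func (m *pluginMgr) findByName(name string) (plugin, bool) {
	if updated, fresh := m.prober.Probe(); updated {
		m.flexVolumePluginList = fresh
	}
	for _, p := range m.plugins {
		if p.name == name {
			return p, true
		}
	}
	for _, p := range m.flexVolumePluginList {
		if p.name == name {
			return p, true
		}
	}
	return plugin{}, false
}

type probeResult struct {
	updated bool
	plugins []plugin
}

// scriptedProber replays canned results, then reports "no change" forever.
type scriptedProber struct{ results []probeResult }

func (s *scriptedProber) Probe() (bool, []plugin) {
	if len(s.results) == 0 {
		return false, nil
	}
	r := s.results[0]
	s.results = s.results[1:]
	return r.updated, r.plugins
}

// cacheDemo looks up the same driver before it is installed, right after
// the prober reports it, and once more when only the cache can answer.
func cacheDemo() (before, afterInstall, fromCache bool) {
	m := &pluginMgr{
		plugins: []plugin{{name: "example/static"}},
		prober: &scriptedProber{results: []probeResult{
			{updated: false},
			{updated: true, plugins: []plugin{{name: "vendor/driver"}}},
		}},
	}
	_, before = m.findByName("vendor/driver")
	_, afterInstall = m.findByName("vendor/driver")
	_, fromCache = m.findByName("vendor/driver")
	return before, afterInstall, fromCache
}

func main() {
	fmt.Println(cacheDemo())
}
```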

Because Flexvolume has two separate plugin instantiations (attachable and non-attachable), it's worth considering the case when a driver that implements attach/detach is replaced with a driver that does not, or vice versa. This does not cause an issue because plugins are recreated every time the driver directory is changed.
Copy link

If the flexvolume implements attach/detach interface, how are you going to extend the controller-manager (assuming you are using a daemonset)?

Copy link
Contributor Author

What do you mean by "extend"?

If you are asking how the drivers get deployed through DaemonSet to a master: DaemonSet will add a pod to the master node regardless of whether the node is set to schedulable.

If you are asking how the controller-manager picks up the newly added attach/detach procedures when the driver is replaced: during plugin probe a FlexVolumeAttachablePlugin is created, which replaces the previous plugin. The attachable plugin will enable attach/detach calls from AttachDetachController.

Copy link
Member

> DaemonSet will add a pod to the master node

Not all masters run kubelet. E.g. GKE.


There is a possibility that a probe occurs at the same time the DaemonSet updates the driver, in which case the prober's view of drivers may be inconsistent. However, this is very rare, and when it does occur, the next `Probe()` call, which occurs shortly after, will be consistent.
Copy link
Contributor

+1 for this approach.



## **Alternative Designs**

1) Make `PluginProber` a separate component, and pass it around as a dependency.

Pros: Avoids the common `PluginStub` interface. There isn't much shared functionality between `VolumePlugin` and `PluginProber`. The only purpose this shared abstraction serves is for `PluginProber` to reuse the existing machinery of plugins list.

Cons: Would have to increase dependency surface area, notably `KubeletDeps`.

I'm currently undecided whether to use this design or the `PluginStub` design.

2) Use a polling model instead of a watch for probing for driver changes.

Pros: Simpler to implement.

Cons: Kubelet or controller manager iterates through the plugin list many times, so `Probe()` is called very frequently. Using this model would cause unnecessary disk reads. This issue is mitigated if we guarantee that `PluginProber` is the last `PluginStub` in the iteration and only call `Probe()` if no other plugin is matched, but this logic adds complexity.
Copy link

I agree. Flexvolume drivers are not something that will be added or modified often.

Copy link
Member

Why is probe called so often?

Copy link
Contributor Author

It's called for every FindPluginBy*(), because currently it iterates through all plugins and errors out when multiple plugins are found. As for why FindPluginBy*() is called often, I don't know. I can only tell that from logs.

Copy link
Member

we should track that backwards - it is surprising to me.

Copy link
Contributor Author

Seems like there are (at least) two reasons:

  1. In every DSW populator loop, it's called for volumes of every existing pod, in order to re-associate each pod with its required volumes.

  2. Certain volume types are set to constantly remount, triggering a Probe every time the remount occurs.


3) Use a polling model + cache. Poll every x seconds/minutes.

Pros: Mostly mitigates issues with the previous approach.

Cons: Depending on the polling period, either it's needlessly frequent, or it's too infrequent to pick up driver updates quickly.

4) Have the `flexVolumePluginList` cache live in `PluginProber` instead of `VolumePluginMgr`.

Pros: `VolumePluginMgr` doesn't need to treat Flexvolume plugins any differently from other plugins.

Cons: `PluginProber` doesn't have the function to validate a plugin. This function lives in `VolumePluginMgr`. Alternatively, the function can be passed into `PluginProber`.


## **Security Considerations**
Copy link
Contributor

@php-coder php-coder Jul 27, 2017

Could we also mention how it's supposed to work with Pod Security Policy? Which actions will be required? Do we need to create a special policy maybe?

Copy link
Member

@liggitt liggitt Jul 27, 2017

Not sure that level of detail is required... you'd need whatever permissions are required to mount and write files to a directory in the kubelet's flexvolume driver dir. Again, I think this proposal should focus on the kubelet/controllermanager aspects, and the mechanism for ensuring atomic load of drivers (write to dot-prefixed dir or file, rename or symlink+rename, etc), over delivery mechanisms.

> Do we need to create a special policy maybe?

PSP setup is going to vary by install. Documenting how to grant permission to use hostPath volume mounts is already included in https://kubernetes.io/docs/concepts/policy/pod-security-policy/#controlling-volumes

Copy link
Contributor

> you'd need whatever permissions are required to mount and write files to a directory in the kubelet's flexvolume driver dir

That's what I wanted to see here. If it would be explicitly mentioned, it will be good.

Copy link
Contributor Author

I added the necessary security policy in the example Deployment DaemonSet spec.


The Flexvolume driver directory can be continuously modified (accidentally or maliciously), making every `Probe()` call trigger a disk read, and `Probe()` calls could happen every couple of milliseconds and in bursts (i.e. lots of calls at first and then silence for some time). This may increase kubelet's or controller manager's disk I/O usage, impacting the performance of other system operations.

As a safety measure, add a 1-second minimum delay between the processing of filesystem watch signals.
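One way to read the 1-second minimum delay is as a simple limiter in front of the code that records watch signals. The sketch below is an assumption about how that could be done; `eventLimiter`, `allow`, and `countProcessed` are hypothetical names, and timestamps are passed in explicitly so the behavior does not depend on the wall clock.

```go
package main

import (
	"fmt"
	"time"
)

// eventLimiter lets a signal through only if at least minDelay has
// passed since the last signal that was let through.
type eventLimiter struct {
	minDelay time.Duration
	last     time.Time
	seen     bool
}

// allow reports whether a signal arriving at time now should be processed.
func (l *eventLimiter) allow(now time.Time) bool {
	if l.seen && now.Sub(l.last) < l.minDelay {
		return false
	}
	l.seen, l.last = true, now
	return true
}

// countProcessed feeds signals at the given millisecond offsets through a
// 1-second limiter and returns how many were processed.
func countProcessed(offsetsMs []int) int {
	l := &eventLimiter{minDelay: time.Second}
	base := time.Unix(0, 0)
	n := 0
	for _, ms := range offsetsMs {
		if l.allow(base.Add(time.Duration(ms) * time.Millisecond)) {
			n++
		}
	}
	return n
}

func main() {
	// A burst at the start, then signals spaced around the 1-second mark.
	fmt.Println(countProcessed([]int{0, 5, 10, 999, 1000, 1500, 2000}))
}
```

This version drops signals that arrive inside the window. An implementation might instead defer them, so that a change made shortly after a processed signal is not missed until the next modification.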


## **Testing Plan**

Add new unit tests in `plugin_tests.go` to cover new probing functionality and the heterogeneous plugin types in the plugins list.

Add e2e tests that follow the user story. Write one for initial driver installation, one for an update for the same driver, one for adding another driver, and one for removing a driver.

## **Recommended Driver Deployment Method**

This section describes one possible method to automatically deploy Flexvolume drivers. The goal is that drivers must be deployed on nodes (and master when attach is required) without having to manually access any machine instance.

Driver Installation:

* Alice is a storage plugin author and would like to deploy a Flexvolume driver on all node instances. She
  1. prepares her Flexvolume driver directory, with driver names in `[vendor~]driver/driver` format (e.g. `k8s~nfs/nfs`, see [Flexvolume documentation](https://github.com/kubernetes/community/blob/master/contributors/devel/flexvolume.md#prerequisites)).
  2. creates an image by copying her driver and the [deployment script](#driver-deployment-script) to a busybox base image.
  3. makes her image available to Bob, a cluster admin.
* Bob modifies the existing deployment DaemonSet spec with the name of the given image, and creates the DaemonSet.
* Charlie, an end user, creates volumes using the installed plugin.

The user story for driver update is similar: Alice creates a new image with her new drivers, and Bob deploys it using the DaemonSet spec.

Note that the `/flexvolume` directory must look exactly like what is desired in the Flexvolume directory on the host (as described in the [Flexvolume documentation](https://github.com/kubernetes/community/blob/master/contributors/devel/flexvolume.md#prerequisites)). The deployment will replace the existing driver directory on the host with contents in `/flexvolume`. Thus, in order to add a new driver without removing existing ones, existing drivers must also appear in `/flexvolume`.

### Driver Deployment Script

The script will copy the existing content of `/flexvolume` on the host to a location in `/tmp`, and then attempt to copy user-provided drivers to that directory. If the copy fails, the original drivers are restored. This script will not perform any driver validation.
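The backup, copy, and restore-on-failure behavior described above could look like the following sketch. The actual deployment script is not shown in this proposal, so everything here (`copyTree`, `clearDir`, `deploy`, `deployDemo`, and the choice of Go rather than a shell script) is an assumption made for illustration.

```go
package main

import (
	"fmt"
	"io/fs"
	"os"
	"path/filepath"
)

// copyTree recursively copies src into dst, creating dst if needed.
func copyTree(src, dst string) error {
	return filepath.WalkDir(src, func(path string, d fs.DirEntry, err error) error {
		if err != nil {
			return err
		}
		rel, err := filepath.Rel(src, path)
		if err != nil {
			return err
		}
		target := filepath.Join(dst, rel)
		if d.IsDir() {
			return os.MkdirAll(target, 0o755)
		}
		data, err := os.ReadFile(path)
		if err != nil {
			return err
		}
		return os.WriteFile(target, data, 0o755)
	})
}

// clearDir removes everything inside dir but keeps dir itself.
func clearDir(dir string) error {
	entries, err := os.ReadDir(dir)
	if err != nil {
		return err
	}
	for _, e := range entries {
		if err := os.RemoveAll(filepath.Join(dir, e.Name())); err != nil {
			return err
		}
	}
	return nil
}

// deploy replaces the contents of hostDir with those of newDrivers. The
// old contents are backed up to a temporary directory first and restored
// if the copy fails. No driver validation is performed.
func deploy(newDrivers, hostDir string) error {
	backup, err := os.MkdirTemp("", "flex-backup-")
	if err != nil {
		return err
	}
	defer os.RemoveAll(backup)
	if err := copyTree(hostDir, backup); err != nil {
		return err
	}
	if err := clearDir(hostDir); err != nil {
		return err
	}
	if err := copyTree(newDrivers, hostDir); err != nil {
		// Roll back: drop any partial copy and put the old drivers back.
		if rerr := clearDir(hostDir); rerr != nil {
			return rerr
		}
		if rerr := copyTree(backup, hostDir); rerr != nil {
			return rerr
		}
		return err
	}
	return nil
}

// names lists the entries of dir (errors ignored; demo helper only).
func names(dir string) []string {
	entries, _ := os.ReadDir(dir)
	var out []string
	for _, e := range entries {
		out = append(out, e.Name())
	}
	return out
}

// deployDemo runs one successful deployment and one that fails because
// the source directory is missing, returning the host contents after each.
func deployDemo() (afterOK, afterFail []string, failErr error) {
	root, err := os.MkdirTemp("", "flex-demo-")
	if err != nil {
		return nil, nil, nil
	}
	defer os.RemoveAll(root)
	host, img := filepath.Join(root, "host"), filepath.Join(root, "flexvolume")
	os.MkdirAll(filepath.Join(host, "old~drv"), 0o755)
	os.MkdirAll(filepath.Join(img, "vendor~new"), 0o755)
	os.WriteFile(filepath.Join(img, "vendor~new", "new"), []byte("#!/bin/sh\n"), 0o755)
	if err := deploy(img, host); err != nil {
		return nil, nil, nil
	}
	afterOK = names(host)
	failErr = deploy(filepath.Join(root, "missing"), host)
	return afterOK, names(host), failErr
}

func main() {
	fmt.Println(deployDemo())
}
```

Because the host directory is emptied before the new drivers are copied in, existing drivers survive only if they are also present in the image, which matches the note above about `/flexvolume`.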

### Deployment DaemonSet
``` yaml
apiVersion: extensions/v1beta1
kind: DaemonSet
metadata:
  name: flex-set
spec:
  template:
    metadata:
      name: flex-deploy
      labels:
        app: flex-deploy
    spec:
      containers:
        - image: <deployment_image>
          name: flex-deploy
          securityContext:
            privileged: true
          volumeMounts:
            - mountPath: /flexmnt
              name: flexvolume-mount
      volumes:
        - name: flexvolume-mount
          hostPath:
            path: <host_driver_directory>
```

Copy link

will this daemonset run on master nodes too by default?

Copy link
Member

Not unless the master node is registered as schedulable.

Copy link
Member

I'm not sure if this is actually a bug, but I noticed that my local storage daemonset does attempt to get scheduled on the master node, even though it is unschedulable. And it fails to run on master because there are not enough resources.

Copy link

@verult can you confirm? I think this PR made daemonset respect taint. kubernetes/kubernetes#41172

Copy link
Contributor Author

Tested the DaemonSet with a master node having more memory than usual, and yes it does run on master by default, even though it's marked as SchedulingDisabled.

Copy link

@bassam bassam Aug 1, 2017

It would be good if the author of this daemonset does not need to make assumptions about the host environment. For example, if --volume-plugin-dir varies for the nodes in the cluster it would still be good to author a single daemonset that installs the flex volume correctly.

Could we make --volume-plugin-dir available in the downward api so that it can be used as an env variable in the daemonset spec?

Copy link
Member

Could be a useful improvement for a subsequent iteration.

Copy link

without this I'm worried that the flex volume will end up in a different place, and there will be little in terms of diagnosis for what actually happened. Note also that this could help the kubelet in container problem.

Copy link
Contributor Author

@verult verult Aug 7, 2017

Do you have a specific example in mind where --volume-plugin-dir has to be set differently between nodes?

Copy link

@verult no specific example, just pointing out that if it is set then the daemonset would break. I'm hoping we can arrive at a "universal" daemonset that can install the flex volume independent of how the host is configured.

Copy link
Contributor Author

Yes that's a good point. For most OS distros it's fine to use the default plugin path. --volume-plugin-dir is usually set to a non-default value due to filesystem limitations. I think this scenario is rare enough that if it does occur, the cluster admin could resort to deploying individual pods, or pods + Daemonset with taints.

AFAIK downward API only exposes pod and container attributes, not kubelet options.

Copy link

We should consider versioning the flex volume drivers. It's likely that the flex volume interface would change in future versions, or that the flex volume author has different implementations. Is it possible to get the version of K8S running through downward api? This could be passed to the flex volume deploy script for it to install the correct version.

Similarly when kubernetes is upgraded, I'm not sure I understand how flex volumes are upgraded. Consider the case where we have a 1.8 cluster with a flex volume designed for 1.8. If in 1.9, K8S adds a new method to flex, and a cluster is upgraded from 1.8-->1.9 how does the new flex volume get deployed?

Copy link
Contributor

@bassam Versioning came up in new flex-volume proposal which never materialized. I will look into adding support to negotiate versioning with the existing API.

### Alternatives
Copy link

I wonder if init containers could be used too? For rook we plan on deploying a daemonset for a rook agent on each node, could we use an init container to deploy the flex volume? this would avoid the infinite loop.

Copy link

I haven't used init-containers. Are init-containers spawned and run on the same host as the parent daemonset?

Copy link
Member

Ya, theoretically that should be possible!


* Using Jobs instead of DaemonSets to deploy.

Pros: Designed for containers that eventually terminate. No need to have the container go into an infinite loop.

Cons: Does not guarantee every node has a pod running. Pod anti-affinity can be used to ensure no more than one pod runs on the same node, but since the Job spec requests a constant number of pods to run to completion, Jobs cannot ensure that pods are scheduled on new nodes.

## **Open Questions**

* How does this system work with containerized kubelet?
* If DaemonSet deployment fails, how are errors shown to the user?
* Are there any SELinux implications?