# Background

As part of increasing the security of a cluster, we are planning to limit a given
kubelet (in general: a node) to reading only the resources associated with it.
These resources are, in particular: secrets, configmaps and persistentvolumeclaims.
This is needed to avoid a situation where compromising a node de facto means
compromising the cluster. For more details and discussion see
https://github.com/kubernetes/kubernetes/issues/40476.

As an extension of this effort, we would also like to improve the scalability of
the system by significantly reducing the number of API calls coming from kubelets.
As of now, to avoid the situation where a kubelet watches all secrets/configmaps/...
in the system, the kubelet does not use watch. Instead, it retrieves individual objects
by sending individual GET requests. Moreover, it sends those requests periodically
to enable automatic updates of mounted secrets/configmaps/... In large clusters,
this generates a huge amount of unnecessary load which in principle should be
watch-based. We would like to address this together with solving the authorization issue.

# Proposal

In this proposal I'm not focusing on how exactly the security should be implemented -
I'm just sketching a very high-level approach; the exact authorization mechanism should
be discussed separately.

At the high level, what we want to achieve is to enable LIST and WATCH requests to
support filtering of "objects attached only to pods bound to a given node". We obviously
also want to be able to authorize other types of requests (in particular GETs), so the
design has to be consistent with that.

To solve this, I propose to introduce a new filtering mechanism (next to label selector
and field selector): `node selector` (we probably need a better name though). Its
semantics will be to filter only objects that are attached to pods bound to a given node,
and it will be supported for some predefined set of object types.

# Detailed design

We will introduce the following `node selector` filtering mechanism:

```
// NodeSelector filters objects down to those referenced by pods bound to the
// given node.
// TODO: Consider making an interface for it.
type NodeSelector struct {
  // TODO: Should this be a repeated field, to allow for some fancy controllers
  // that have access to multiple nodes?
  NodeName string
}
```

The NodeSelector field will be added to `ListOptions` (next to the label and field
selectors) and will be supported only by LIST and WATCH requests.
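
For illustration, a minimal sketch of how `ListOptions` could be extended; the field
name, type and placement are assumptions for the purpose of this sketch, not a final
API decision:

```
// ListOptions as used for list/watch calls (simplified; only the relevant
// fields are shown).
type ListOptions struct {
  // LabelSelector and FieldSelector already exist today.
  LabelSelector string
  FieldSelector string

  // NodeSelector, if set, restricts the result to objects referenced by
  // pods bound to the given node. Hypothetical field; the exact
  // representation is to be decided.
  NodeSelector *NodeSelector
}
```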

With that mechanism in place, all LIST/WATCH requests coming from kubelets will have
to have this field set correctly. We will create a dedicated admission plugin that will
be responsible for checking whether a given `node selector` is allowed for a given
client (the exact mechanism for doing this is out of scope for this doc) and either
rejecting the request or letting it through. Note that, from an implementation point
of view, doing this may require modifying admission attributes.
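
As a rough illustration only - the real admission plugin interface and the way a node
identity is derived from the client are out of scope here - such a check could look
roughly as follows; all types and helpers below are hypothetical stand-ins:

```
import "fmt"

// admissionAttributes is a hypothetical stand-in for the attributes an
// admission plugin sees for a request.
type admissionAttributes struct {
  userName     string
  nodeSelector *NodeSelector
}

// nodeNameForUser maps a client identity to the node it runs on. Assumption
// (for illustration): kubelets authenticate as "system:node:<nodeName>".
func nodeNameForUser(userName string) (string, bool) {
  const prefix = "system:node:"
  if len(userName) > len(prefix) && userName[:len(prefix)] == prefix {
    return userName[len(prefix):], true
  }
  return "", false
}

// admitNodeSelector rejects requests whose node selector does not match the
// node the requesting client is bound to.
func admitNodeSelector(attrs admissionAttributes) error {
  if attrs.nodeSelector == nil {
    // No node selector set; nothing to check here.
    return nil
  }
  nodeName, ok := nodeNameForUser(attrs.userName)
  if !ok || nodeName != attrs.nodeSelector.NodeName {
    return fmt.Errorf("node selector %q not allowed for user %q",
      attrs.nodeSelector.NodeName, attrs.userName)
  }
  return nil
}
```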

TODO: Consider adding NodeSelector to `GetOptions` - if we did that,
we could have a unified pattern for authorizing all requests coming from nodes and
the admission plugin could be relatively simple.


### Semantics of watch

To make all our existing "list-watch-related" frameworks work correctly for this
kind of watch, we would like to preserve all the crucial invariants of watch. This
in particular means:

1. There is at most one watch event for a given resource version.
2. Watch events are delivered in increasing order of resource versions.

Ideally, if a new pod referencing object X appears (and we are watching objects
of that type), we would send an "add" event for object X.
However, that would break the above invariants, because:

1. There can be more objects referenced by a given pod (so we can't send all of
them with a resource version corresponding to that pod add/update/delete).
2. If we decided to send them with their original resource versions, then we
could potentially go back in time.

As a result, we propose the following semantics:

1. If a pod is created/updated/deleted, no watch events are delivered.
It is the responsibility of the user to grab the current version of all objects
referenced by this pod.
2. From that point on, events for all objects referenced by this pod (e.g.
add/update/delete of a secret referenced by this pod) are delivered
to the watcher for as long as the pod exists on the node.

With these semantics we will be able to reuse e.g. the reflector framework with
only minor modifications.

TODO: Describe how this should be used.
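
As a starting point, here is a minimal sketch of one possible usage pattern on the
kubelet side under the semantics above; the types and functions are made up for
illustration and stand in for a local cache, a GET call and the filtered watch:

```
// Hypothetical types standing in for the real kubelet machinery.
type Pod struct {
  Namespace         string
  ReferencedSecrets []string
}

type Secret struct {
  Namespace, Name string
  Data            map[string][]byte
}

type secretStore interface {
  Update(Secret)
}

// getSecret stands for a plain GET request against the apiserver.
func getSecret(namespace, name string) (Secret, error) {
  // ... issue the GET and decode the response ...
  return Secret{Namespace: namespace, Name: name}, nil
}

// onPodAdmitted shows the pattern implied by the semantics above: fetch on
// pod admission, then rely on the node-selector-filtered watch for updates.
func onPodAdmitted(pod Pod, store secretStore) error {
  // Rule 1: no watch event is delivered for the pod itself, so fetch the
  // current version of every referenced object explicitly.
  for _, name := range pod.ReferencedSecrets {
    secret, err := getSecret(pod.Namespace, name)
    if err != nil {
      return err
    }
    store.Update(secret)
  }
  // Rule 2: from now on, add/update/delete events for these secrets arrive
  // via the filtered watch for as long as the pod exists on the node.
  return nil
}
```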


### Determining whether an object is referenced by pods from a given node

The tricky part in determining whether an object X is referenced by any pods
bound to a given node is to avoid different kinds of race conditions and to do it
in a deterministic way.

The crucial requirements here are:

1. Whenever a "list" request returns a list of objects and a resource version "rv",
starting a watch from the returned "rv" will never drop any events.
2. For a given watch request (with resource version "rv"), the returned stream
of events is always the same (e.g. a very slow, lagging watch may not cause
dropped events).

To satisfy the above requirements, we can't really rely only on the existing
tools/frameworks that we have. Theoretically, we would be able to build the
mapping from an object to the list of nodes it is bound to using the standard
reflector/informer framework. But having separate informers for different kinds
of objects (which is the only possible way to use them) may result in races.
As an example, imagine a pod P referencing a secret S, with the binding of pod P
to node N happening right before the creation of secret S. Then the following
race may potentially happen:

1. The user observes the binding of pod P, tries to retrieve the current value of
the secret, and it still doesn't exist. From now on, the user knows that as soon
as the secret is created, it should be delivered to them via watch.
2. However, the pod informer in our code is lagging, and we first observe the
secret creation. The secret is not referenced yet (we haven't observed the pod
yet), so we don't send any event.
3. We observe the pod binding, but according to our semantics, we don't send
any event.

As a result, the expected "add secret" event is never sent to the watcher.

To solve the problem reliably, we need to be able to correctly serialize the
watch events between different object types in a deterministic way.

One potential solution would be to identify this watch by a resource version
combined from the resource versions of the different object kinds (e.g.
pods have rv = rv1, secrets have rv = rv2, ...). Then we could keep a history
of the objects processed to update the in-memory mapping, and that would be
deterministic. But the order might be different e.g. after restarting the
apiserver, which also means that it wouldn't work in HA setups (assuming we
don't have some external "serializing" component).

Fortunately, we can solve it in a much simpler way, with one additional assumption:

1. All object types necessary to determine the in-memory mapping share the same
resource version series.

With that assumption, we can have a "multi-object-type" watch that will serialize
events for different object types for us. Having exactly one watch responsible for
delivering all objects (pods, secrets, ...) guarantees that if we are currently at
resource version rv, we have processed objects of all types up to rv and nothing
with a resource version greater than rv. Which is exactly what we need.

NOTE: We are not going to implement a generic "multi-object-type" watch and
expose it in the API (which is a much bigger task). This will be purely an
implementation detail hidden in the code (see more details in the next section).
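
To make the idea concrete, here is a minimal sketch (assuming the shared resource
version series) of an in-memory mapping updated from a single, serialized stream of
pod events; the event shape is illustrative and the `Pod` type is the hypothetical
one from the earlier sketch:

```
// podEvent is a hypothetical event shape delivered by the single
// "multi-object-type" watch; only what the mapping needs is shown.
type podEvent struct {
  resourceVersion uint64
  deleted         bool
  nodeName        string
  pod             Pod
}

// objectToNodes maps the "namespace/name" of a referenced object (e.g. a
// secret) to the nodes that have at least one pod referencing it.
type objectToNodes struct {
  lastRV   uint64
  refCount map[string]map[string]int // object key -> node name -> #pods
}

// applyPod consumes pod events strictly in resource version order. Because all
// object types share one resource version series and arrive on one stream,
// after this call the mapping reflects everything up to resourceVersion and
// nothing beyond it. Pod updates would be handled as a delete of the old state
// followed by an add of the new one (omitted for brevity).
func (m *objectToNodes) applyPod(e podEvent) {
  m.lastRV = e.resourceVersion
  delta := 1
  if e.deleted {
    delta = -1
  }
  for _, name := range e.pod.ReferencedSecrets {
    key := e.pod.Namespace + "/" + name
    if m.refCount == nil {
      m.refCount = map[string]map[string]int{}
    }
    if m.refCount[key] == nil {
      m.refCount[key] = map[string]int{}
    }
    m.refCount[key][e.nodeName] += delta
    if m.refCount[key][e.nodeName] <= 0 {
      delete(m.refCount[key], e.nodeName)
    }
  }
}

// nodesFor answers "which nodes should see object key?" and is what the
// filtering described below is based on.
func (m *objectToNodes) nodesFor(key string) []string {
  var nodes []string
  for node := range m.refCount[key] {
    nodes = append(nodes, node)
  }
  return nodes
}
```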


### Implementation details

Once a request is authorized, it reaches the `core` part of the apiserver,
which needs to support the `node selector`. With the requirements and design
decisions made above, we would like to keep the changes as local as possible.
Thus, we will solve this at the apiserver storage layer, by making the
following changes:

1. Change `SelectionPredicate` to also contain `NodeSelector` and
propagate it from the generic registry.

2. Create an `ObjectToNodesMapping` class (TODO: come up with a better name)
that will maintain an in-memory mapping from an object (kind/namespace/name)
to the list of nodes to which at least one pod referencing this object is bound
(roughly along the lines of the mapping sketched in the previous section).
To achieve this, we will implement a simplified version of the "multi-object-type"
watch as follows (see also the sketch after this list):

- We will instantiate a separate etcd implementation of `storage.Interface`
with a different codec.

- The codec will work only on the newly created `etcdObject` type; the
`etcdObject` will contain the key and value coming directly from etcd as its
fields. We will need to inject some mechanism for setting the key into the
existing etcd-based implementation (similar to storage.Versioner).
TODO: Can we do it simpler?

- Using that, we will create a reflector that will be listing+watching everything.
By making SelectionPredicate an interface, we will create a very small and simple
implementation of it that filters the returned `etcdObject`s based on their etcd
key, filtering out all object types we are not interested in.

- We will create a dedicated store implementation that, from the incoming stream
of `etcdObject`s, will determine the object type based on the etcd key, decode
them into real objects using the original codec, and then trigger the
appropriate handler function.


3. Having that, we will create a new class `NodeSelectorFilterer` (TODO:
come up with a better name) that will implement `storage.Interface`
(see the sketch after this list).

4. `NodeSelectorFilterer` will be a wrapper around what we currently use as
storage (the etcd implementation of the interface plus the cacher; the cacher
is not strictly required though).

5. Every request other than LIST and WATCH requests with `NodeSelector` set
will be forwarded to the wrapped implementation.

6. LIST and WATCH requests with `NodeSelector` set will be served directly from
`NodeSelectorFilterer`, based on the contents of the store described above
(it will contain some limited cached history for objects that can be watched),
similarly to what we do in "cacher + watchCache".

7. Correctly initialize the storage for every registry by wrapping the already
existing one with `NodeSelectorFilterer`.
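
To ground points 2-6 above, here is a rough sketch of the raw `etcdObject` and of the
`NodeSelectorFilterer` wrapper; the storage interface shown is heavily simplified, the
etcd key layout is assumed for illustration, and none of the names are the real
apiserver types:

```
import "strings"

// etcdObject is the raw, undecoded form in which the "multi-object-type"
// reflector sees every stored object (point 2 above).
type etcdObject struct {
  Key   string // e.g. "/registry/secrets/<namespace>/<name>" (assumed layout)
  Value []byte // raw bytes, decoded later with the original codec
}

// simplifiedStorage is a drastically simplified stand-in for storage.Interface;
// only the calls used in this sketch are listed and watch is omitted.
type simplifiedStorage interface {
  List(selector *NodeSelector) ([]etcdObject, error)
  Get(key string) (etcdObject, error)
}

// nodeSelectorFilterer wraps the existing storage and serves node-selector
// LIST/WATCH requests from the in-memory mapping; everything else is
// delegated unchanged (points 4-6 above).
type nodeSelectorFilterer struct {
  delegate simplifiedStorage
  mapping  *objectToNodes // the mapping sketched earlier
}

// Get shows the delegation path: requests without a node selector go straight
// to the wrapped storage.
func (f *nodeSelectorFilterer) Get(key string) (etcdObject, error) {
  return f.delegate.Get(key)
}

// List filters the result using the in-memory mapping when a node selector is
// set; otherwise it simply delegates.
func (f *nodeSelectorFilterer) List(selector *NodeSelector) ([]etcdObject, error) {
  all, err := f.delegate.List(nil)
  if err != nil || selector == nil {
    return all, err
  }
  var filtered []etcdObject
  for _, obj := range all {
    for _, node := range f.mapping.nodesFor(keyToNamespaceName(obj.Key)) {
      if node == selector.NodeName {
        filtered = append(filtered, obj)
        break
      }
    }
  }
  return filtered, nil
}

// keyToNamespaceName extracts "namespace/name" from an etcd key, given the
// key layout assumed above.
func keyToNamespaceName(key string) string {
  parts := strings.Split(key, "/")
  if len(parts) < 2 {
    return key
  }
  return parts[len(parts)-2] + "/" + parts[len(parts)-1]
}
```

In the real implementation, WATCH requests with the selector set would be served from
the limited cached history mentioned in point 6, similarly to "cacher + watchCache";
that part is omitted from the sketch.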