# Background

As part of increasing the security of a cluster, we are planning to limit a given
kubelet (in general: a node) to reading only the resources associated with it.
These resources are, in particular: secrets, configmaps and persistentvolumeclaims.
This is needed to avoid a situation where compromising a node de facto means
compromising the cluster. For more details and discussion see
https://github.com/kubernetes/kubernetes/issues/40476.

As an extension of this effort, we would also like to improve the scalability of
the system by significantly reducing the number of API calls coming from kubelets.
As of now, to avoid the situation where a kubelet watches all secrets/configmaps/...
in the system, the kubelet does not use watch. Instead, it retrieves individual objects
by sending individual GET requests. Moreover, it sends those requests periodically
to enable automatic updates of mounted secrets/configmaps/... In large clusters,
this generates a huge amount of unnecessary load which in principle should be
watch-based. We would like to address this together with solving the authorization issue.

# Proposal

In this proposal I'm not focusing on how exactly the security should be implemented -
I'm just sketching a very high-level approach; the exact authorization mechanism should
be discussed separately.

At the high level, what we want to achieve is to enable LIST and WATCH requests to
support filtering of "objects attached only to pods bound to a given node". We obviously
also want to be able to authorize other types of requests (in particular GETs), so the
design has to be consistent with that.

To solve this, I propose to introduce a new filtering mechanism (next to label selector
and field selector): `node selector` (we probably need a better name though). Its
semantics will be to filter only objects that are attached to pods bound to a given node,
and it will be supported for some predefined set of object types.

# Detailed design

We will introduce the following `node selector` filtering mechanism:

```
// NodeSelector filters objects down to those referenced by pods bound to the
// given node.
// TODO: Consider making an interface for it.
type NodeSelector struct {
  // TODO: Should this be a repeated field, to allow for some fancy controllers
  // that have access to multiple nodes?
  NodeName string
}
```

The NodeSelector field will be added to `ListOptions` (next to the label and field
selectors) and will be supported only by LIST and WATCH requests.
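
For illustration, a minimal sketch of how `ListOptions` could be extended; the field
name, type and placement are assumptions for the purpose of this sketch, not a final
API decision:

```
// ListOptions as used for list/watch calls (simplified; only the relevant
// fields are shown).
type ListOptions struct {
  // LabelSelector and FieldSelector already exist today.
  LabelSelector string
  FieldSelector string

  // NodeSelector, if set, restricts the result to objects referenced by
  // pods bound to the given node. Hypothetical field; the exact
  // representation is to be decided.
  NodeSelector *NodeSelector
}
```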

With that mechanism in place, all LIST/WATCH requests coming from kubelets will have
to have this field set correctly. We will create a dedicated admission plugin that will
be responsible for checking whether a given `node selector` is allowed for a given
client (the exact mechanism for doing this is out of scope for this doc) and either
rejecting the request or letting it through. Note that, from an implementation point
of view, doing this may require modifying admission attributes.
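
As a rough illustration only - the real admission plugin interface and the way a node
identity is derived from the client are out of scope here - such a check could look
roughly as follows; all types and helpers below are hypothetical stand-ins:

```
import "fmt"

// admissionAttributes is a hypothetical stand-in for the attributes an
// admission plugin sees for a request.
type admissionAttributes struct {
  userName     string
  nodeSelector *NodeSelector
}

// nodeNameForUser maps a client identity to the node it runs on. Assumption
// (for illustration): kubelets authenticate as "system:node:<nodeName>".
func nodeNameForUser(userName string) (string, bool) {
  const prefix = "system:node:"
  if len(userName) > len(prefix) && userName[:len(prefix)] == prefix {
    return userName[len(prefix):], true
  }
  return "", false
}

// admitNodeSelector rejects requests whose node selector does not match the
// node the requesting client is bound to.
func admitNodeSelector(attrs admissionAttributes) error {
  if attrs.nodeSelector == nil {
    // No node selector set; nothing to check here.
    return nil
  }
  nodeName, ok := nodeNameForUser(attrs.userName)
  if !ok || nodeName != attrs.nodeSelector.NodeName {
    return fmt.Errorf("node selector %q not allowed for user %q",
      attrs.nodeSelector.NodeName, attrs.userName)
  }
  return nil
}
```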

TODO: Consider adding NodeSelector to `GetOptions` - if we did that,
we could have a unified pattern for authorizing all requests coming from nodes and
the admission plugin could be relatively simple.


### Semantics of watch

To make all our existing "list-watch-related" frameworks work correctly for this
kind of watch, we would like to preserve all the crucial invariants of watch. This
in particular means:

1. There is at most one watch event for a given resource version.
2. Watch events are delivered in increasing order of resource versions.

Ideally, if a new pod referencing object X appears (and we are watching objects
of that type), we would send an "add" event for object X.
However, that would break the above invariants, because:

1. There can be more objects referenced by a given pod (so we can't send all of
them with a resource version corresponding to that pod add/update/delete).
2. If we decided to send them with their original resource versions, then we
could potentially go back in time.

As a result, we propose the following semantics:

1. If a pod is created/updated/deleted, no watch events are delivered.
It is the responsibility of the user to grab the current version of all objects
referenced by this pod.
2. From that point on, events for all objects referenced by this pod (e.g.
add/update/delete of a secret referenced by this pod) are delivered
to the watcher for as long as the pod exists on the node.

With these semantics we will be able to reuse e.g. the reflector framework with
only minor modifications.

TODO: Describe how this should be used.
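
As a starting point, here is a minimal sketch of one possible usage pattern on the
kubelet side under the semantics above; the types and functions are made up for
illustration and stand in for a local cache, a GET call and the filtered watch:

```
// Hypothetical types standing in for the real kubelet machinery.
type Pod struct {
  Namespace         string
  ReferencedSecrets []string
}

type Secret struct {
  Namespace, Name string
  Data            map[string][]byte
}

type secretStore interface {
  Update(Secret)
}

// getSecret stands for a plain GET request against the apiserver.
func getSecret(namespace, name string) (Secret, error) {
  // ... issue the GET and decode the response ...
  return Secret{Namespace: namespace, Name: name}, nil
}

// onPodAdmitted shows the pattern implied by the semantics above: fetch on
// pod admission, then rely on the node-selector-filtered watch for updates.
func onPodAdmitted(pod Pod, store secretStore) error {
  // Rule 1: no watch event is delivered for the pod itself, so fetch the
  // current version of every referenced object explicitly.
  for _, name := range pod.ReferencedSecrets {
    secret, err := getSecret(pod.Namespace, name)
    if err != nil {
      return err
    }
    store.Update(secret)
  }
  // Rule 2: from now on, add/update/delete events for these secrets arrive
  // via the filtered watch for as long as the pod exists on the node.
  return nil
}
```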


### Determining whether an object is referenced by pods from a given node

The tricky part in determining whether an object X is referenced by any pods
bound to a given node is to avoid different kinds of race conditions and to do it
in a deterministic way.

The crucial requirements here are:

1. Whenever a "list" request returns a list of objects and a resource version "rv",
starting a watch from the returned "rv" will never drop any events.
2. For a given watch request (with resource version "rv"), the returned stream
of events is always the same (e.g. a very slow, lagging watch may not cause
dropped events).

To satisfy the above requirements, we can't really rely only on the existing
tools/frameworks that we have. Theoretically, we would be able to build the
mapping from an object to the list of nodes it is bound to using the standard
reflector/informer framework. But having separate informers for different kinds
of objects (which is the only possible way to use them) may result in races.
As an example, imagine a pod P referencing a secret S, with the binding of pod P
to node N happening right before the creation of secret S. Then the following
race may potentially happen:

1. The user observes the binding of pod P, tries to retrieve the current value of
the secret, and it still doesn't exist. From now on, the user knows that as soon
as the secret is created, it should be delivered to them via watch.
2. However, the pod informer in our code is lagging, and we first observe the
secret creation. The secret is not referenced yet (we haven't observed the pod
yet), so we don't send any event.
3. We observe the pod binding, but according to our semantics, we don't send
any event.

As a result, the expected "add secret" event is never sent to the watcher.

To solve the problem reliably, we need to be able to correctly serialize the
watch events between different object types in a deterministic way.

One potential solution would be to identify this watch by a resource version
combined from the resource versions of the different object kinds (e.g.
pods have rv = rv1, secrets have rv = rv2, ...). Then we could keep a history
of the objects processed to update the in-memory mapping, and that would be
deterministic. But the order might be different e.g. after restarting the
apiserver, which also means that it wouldn't work in HA setups (assuming we
don't have some external "serializing" component).

Fortunately, we can solve it in a much simpler way, with one additional assumption:

1. All object types necessary to determine the in-memory mapping share the same
resource version series.

With that assumption, we can have a "multi-object-type" watch that will serialize
events for different object types for us. Having exactly one watch responsible for
delivering all objects (pods, secrets, ...) guarantees that if we are currently at
resource version rv, we have processed objects of all types up to rv and nothing
with a resource version greater than rv. Which is exactly what we need.

NOTE: We are not going to implement a generic "multi-object-type" watch and
expose it in the API (which is a much bigger task). This will be purely an
implementation detail hidden in the code (see more details in the next section).
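
To make the idea concrete, here is a minimal sketch (assuming the shared resource
version series) of an in-memory mapping updated from a single, serialized stream of
pod events; the event shape is illustrative and the `Pod` type is the hypothetical
one from the earlier sketch:

```
// podEvent is a hypothetical event shape delivered by the single
// "multi-object-type" watch; only what the mapping needs is shown.
type podEvent struct {
  resourceVersion uint64
  deleted         bool
  nodeName        string
  pod             Pod
}

// objectToNodes maps the "namespace/name" of a referenced object (e.g. a
// secret) to the nodes that have at least one pod referencing it.
type objectToNodes struct {
  lastRV   uint64
  refCount map[string]map[string]int // object key -> node name -> #pods
}

// applyPod consumes pod events strictly in resource version order. Because all
// object types share one resource version series and arrive on one stream,
// after this call the mapping reflects everything up to resourceVersion and
// nothing beyond it. Pod updates would be handled as a delete of the old state
// followed by an add of the new one (omitted for brevity).
func (m *objectToNodes) applyPod(e podEvent) {
  m.lastRV = e.resourceVersion
  delta := 1
  if e.deleted {
    delta = -1
  }
  for _, name := range e.pod.ReferencedSecrets {
    key := e.pod.Namespace + "/" + name
    if m.refCount == nil {
      m.refCount = map[string]map[string]int{}
    }
    if m.refCount[key] == nil {
      m.refCount[key] = map[string]int{}
    }
    m.refCount[key][e.nodeName] += delta
    if m.refCount[key][e.nodeName] <= 0 {
      delete(m.refCount[key], e.nodeName)
    }
  }
}

// nodesFor answers "which nodes should see object key?" and is what the
// filtering described below is based on.
func (m *objectToNodes) nodesFor(key string) []string {
  var nodes []string
  for node := range m.refCount[key] {
    nodes = append(nodes, node)
  }
  return nodes
}
```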


### Implementation details

Once a request is authorized, it reaches the `core` part of the apiserver,
which needs to support the `node selector`. With the requirements and design
decisions made above, we would like to keep the changes as local as possible.
Thus, we will solve this at the apiserver storage layer, by making the
following changes:

1. Change `SelectionPredicate` to also contain `NodeSelector` and
propagate it from the generic registry.

2. Create an `ObjectToNodesMapping` class (TODO: come up with a better name)
that will maintain an in-memory mapping from an object (kind/namespace/name)
to the list of nodes to which at least one pod referencing this object is bound
(roughly along the lines of the mapping sketched in the previous section).
To achieve this, we will implement a simplified version of the "multi-object-type"
watch as follows (see also the sketch after this list):

- We will instantiate a separate etcd implementation of `storage.Interface`
with a different codec.

- The codec will work only on the newly created `etcdObject` type; the
`etcdObject` will contain the key and value coming directly from etcd as its
fields. We will need to inject some mechanism for setting the key into the
existing etcd-based implementation (similar to storage.Versioner).
TODO: Can we do it simpler?

- Using that, we will create a reflector that will be listing+watching everything.
By making SelectionPredicate an interface, we will create a very small and simple
implementation of it that filters the returned `etcdObject`s based on their etcd
key, filtering out all object types we are not interested in.

- We will create a dedicated store implementation that, from the incoming stream
of `etcdObject`s, will determine the object type based on the etcd key, decode
them into real objects using the original codec, and then trigger the
appropriate handler function.


3. Having that, we will create a new class `NodeSelectorFilterer` (TODO:
come up with a better name) that will implement `storage.Interface`
(see the sketch after this list).

4. `NodeSelectorFilterer` will be a wrapper around what we currently use as
storage (the etcd implementation of the interface plus the cacher; the cacher
is not strictly required though).

5. Every request other than LIST and WATCH requests with `NodeSelector` set
will be forwarded to the wrapped implementation.

6. LIST and WATCH requests with `NodeSelector` set will be served directly from
`NodeSelectorFilterer`, based on the contents of the store described above
(it will contain some limited cached history for objects that can be watched),
similarly to what we do in "cacher + watchCache".

7. Correctly initialize the storage for every registry by wrapping the already
existing one with `NodeSelectorFilterer`.
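
To ground points 2-6 above, here is a rough sketch of the raw `etcdObject` and of the
`NodeSelectorFilterer` wrapper; the storage interface shown is heavily simplified, the
etcd key layout is assumed for illustration, and none of the names are the real
apiserver types:

```
import "strings"

// etcdObject is the raw, undecoded form in which the "multi-object-type"
// reflector sees every stored object (point 2 above).
type etcdObject struct {
  Key   string // e.g. "/registry/secrets/<namespace>/<name>" (assumed layout)
  Value []byte // raw bytes, decoded later with the original codec
}

// simplifiedStorage is a drastically simplified stand-in for storage.Interface;
// only the calls used in this sketch are listed and watch is omitted.
type simplifiedStorage interface {
  List(selector *NodeSelector) ([]etcdObject, error)
  Get(key string) (etcdObject, error)
}

// nodeSelectorFilterer wraps the existing storage and serves node-selector
// LIST/WATCH requests from the in-memory mapping; everything else is
// delegated unchanged (points 4-6 above).
type nodeSelectorFilterer struct {
  delegate simplifiedStorage
  mapping  *objectToNodes // the mapping sketched earlier
}

// Get shows the delegation path: requests without a node selector go straight
// to the wrapped storage.
func (f *nodeSelectorFilterer) Get(key string) (etcdObject, error) {
  return f.delegate.Get(key)
}

// List filters the result using the in-memory mapping when a node selector is
// set; otherwise it simply delegates.
func (f *nodeSelectorFilterer) List(selector *NodeSelector) ([]etcdObject, error) {
  all, err := f.delegate.List(nil)
  if err != nil || selector == nil {
    return all, err
  }
  var filtered []etcdObject
  for _, obj := range all {
    for _, node := range f.mapping.nodesFor(keyToNamespaceName(obj.Key)) {
      if node == selector.NodeName {
        filtered = append(filtered, obj)
        break
      }
    }
  }
  return filtered, nil
}

// keyToNamespaceName extracts "namespace/name" from an etcd key, given the
// key layout assumed above.
func keyToNamespaceName(key string) string {
  parts := strings.Split(key, "/")
  if len(parts) < 2 {
    return key
  }
  return parts[len(parts)-2] + "/" + parts[len(parts)-1]
}
```

In the real implementation, WATCH requests with the selector set would be served from
the limited cached history mentioned in point 6, similarly to "cacher + watchCache";
that part is omitted from the sketch.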