# Background

 As part of increasing the security of a cluster, we are planning to limit the ability
of a given Kubelet (more generally: node) so that it can read only the resources
associated with it - in particular secrets, configmaps & persistentvolumeclaims.
This is needed to avoid a situation in which compromising a node de facto means
compromising the whole cluster. For more details & discussion see
https://github.com/kubernetes/kubernetes/issues/40476.

 As a natural extension of this effort, we would also like to improve the scalability
of the system by significantly reducing the number of API calls coming from kubelets.
As of now, to avoid a situation in which the kubelet watches all secrets/configmaps/...
in the system, it does not use watch at all. Instead, it retrieves individual objects
by sending individual GET requests. However, it sends those requests periodically
to enable automatic updates of mounted secrets/configmaps/... In large clusters,
this generates a huge unnecessary load, as this traffic in principle should be
watch-based. We would like to address this together with solving the authorization issue.

# Proposal

 In this proposal, I'm not focusing on how exactly security should be done - I'm just
sketching a very high-level approach; the exact authorization mechanism should be
discussed separately.

 At the high level, what we want to achieve is to enable LIST and WATCH requests to
support filtering down to "only objects attached to pods bound to a given node". We
obviously also want to be able to authorize other types of requests (in particular
GETs), so the design has to be consistent with that.

 To solve this, I propose introducing a new filtering mechanism (next to label selector
and field selector): ```node selector``` (we probably need a better name though). Its
semantics will be to return only objects that are attached to pods bound to a given node,
and it will be supported for some predefined set of object types.

# Detailed design

 We will introduce the following ```node selector``` filtering mechanism:

```
// NodeSelector selects objects that are attached to pods bound to the given node.
// TODO: Consider making an interface for it.
type NodeSelector struct {
  // TODO: Should this be a repeated field to allow for some fancy controllers
  // that have access to multiple nodes?
  nodeName string
}
```

 The NodeSelector field will be added to ```ListOptions``` (next to the label & field
selectors) and will be supported only by LIST and WATCH requests.

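 For illustration, below is a minimal sketch of where the new selector could live in
```ListOptions```. These are not the actual Kubernetes API types - the existing selectors
are shown as plain strings for brevity and the field names are only placeholders.

```
// Simplified stand-in for the real ListOptions type, shown only to illustrate
// where the proposed selector would be added.
type ListOptions struct {
  // Existing selectors, shown as plain strings for brevity.
  LabelSelector string
  FieldSelector string
  // NodeSelector, if set, restricts the result to objects attached to pods
  // bound to the given node. Honored only for LIST and WATCH requests.
  NodeSelector *NodeSelector
}
```
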
 With that mechanism in place, all List/Watch requests coming from kubelets will have
to have this field correctly set. We will create a dedicated admission plugin that will
be responsible for checking whether a given ```node selector``` is allowed for a given
client (the exact mechanism for doing this is out of scope for this doc) and will either
reject the request or let it through. Note that, from the implementation point of view,
doing this may require modifying admission attributes.

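 A hypothetical sketch of the core check such a plugin would perform is below. The types
are simplified stand-ins for the real admission attributes, and how the client is mapped
to a node name is deliberately left open.

```
package proposal

import "fmt"

// Simplified stand-ins; none of these are the real admission types.
type NodeSelector struct{ nodeName string }

type requestAttributes struct {
  // Node on which the requesting kubelet runs (however we end up learning it).
  userNodeName string
  // Selector extracted from the request's ListOptions, if any.
  nodeSelector *NodeSelector
}

// admitNodeSelector sketches the core rule: a client may only use a node
// selector that points at its own node.
func admitNodeSelector(a requestAttributes) error {
  if a.nodeSelector == nil {
    // The request doesn't use the new selector; nothing to check here.
    return nil
  }
  if a.nodeSelector.nodeName != a.userNodeName {
    return fmt.Errorf("node selector %q not allowed for node %q",
      a.nodeSelector.nodeName, a.userNodeName)
  }
  return nil
}
```
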
 TODO: Consider adding NodeSelector to ```GetOptions``` - if we did that, we could
have a unified pattern for authorizing all requests from nodes and keep the
admission plugin relatively simple.

 Once the request is authorized, we need to modify the apiserver to be able to support
the ```node selector```. We would like to make the changes as local as possible, thus
we will solve it at the apiserver storage layer (a rough sketch of the wrapper follows
the list below). Going into details:

1. We will create a new class ```NodeSelectorFilterer``` (TODO: come up with a
better name) that will implement ```storage.Interface```.
2. ```NodeSelectorFilterer``` will be a wrapper around what we are currently
using as storage (which is the implementation of this interface for etcd plus,
for most resource kinds, the cacher).
3. List and Watch calls will be sent to the wrapped implementation; we will
catch the result, filter it based on the ```node selector``` and send the
filtered result back to the user. All other requests will simply be forwarded
to the wrapped implementation.
4. ```NodeSelectorFilterer``` will maintain an (in-memory) mapping from an object
(namespace/name) to the list of nodes to which at least one pod referencing
this object is bound. This mapping will be built using the standard
reflector/informer framework by self-looping into the Kubernetes API.
5. ```NodeSelectorFilterer``` will be a per-resource-type object (similarly to
how the cacher is), thus we will need to share e.g. the pod informer between them.
6. As an optimization, we should consider setting an appropriate trigger function
in the cacher (based on the mapping from above that we will already have).
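
 The following is a deliberately simplified sketch of the wrapper described above, with a
toy storage interface standing in for the real ```storage.Interface``` and only the List
path shown; all names and signatures here are illustrative, not existing code.

```
package proposal

// simplifiedStorage is a toy stand-in for storage.Interface; only the single
// call needed for this sketch is shown and the signature is not the real one.
type simplifiedStorage interface {
  List(namespace string) ([]object, error)
}

type object struct {
  namespace, name string
}

// nodeSelectorFilterer wraps an existing storage implementation and filters
// List results down to objects attached to pods bound to the requested node.
type nodeSelectorFilterer struct {
  delegate simplifiedStorage
  // objectToNodes maps "namespace/name" of an object to the set of nodes that
  // have at least one pod referencing that object bound to them. In the real
  // implementation this mapping would be maintained by a pod informer.
  objectToNodes map[string]map[string]bool
}

// List forwards to the wrapped storage and then drops every object that is not
// referenced by any pod bound to nodeName.
func (f *nodeSelectorFilterer) List(namespace, nodeName string) ([]object, error) {
  all, err := f.delegate.List(namespace)
  if err != nil {
    return nil, err
  }
  var filtered []object
  for _, obj := range all {
    key := obj.namespace + "/" + obj.name
    if f.objectToNodes[key][nodeName] {
      filtered = append(filtered, obj)
    }
  }
  return filtered, nil
}
```

 The Watch path would apply the same per-object check to each event before forwarding it
to the watcher, and the mapping would be kept up to date by a pod informer shared across
resource types (point 5 above).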

 Once we have the ```NodeSelectorFilterer``` implemented, the only changes that will
need to be done in the apiserver will be:

1. Change ```SelectionPredicate``` to also contain the ```NodeSelector``` and
propagate it from the generic registry.
2. Correctly initialize the storage for every registry by wrapping the already
existing one with the ```NodeSelectorFilterer``` (a rough sketch of this wiring
is below).

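 A hypothetical sketch of both changes together, using simplified stand-ins for
```SelectionPredicate``` and the registry wiring; none of these names or constructors
are existing code.

```
package proposal

// Simplified stand-in for the real SelectionPredicate, extended with the
// proposed selector next to the existing label & field selectors.
type SelectionPredicate struct {
  Label        string
  Field        string
  NodeSelector *NodeSelector
}

type NodeSelector struct{ nodeName string }

// storageBackend and the constructors below are illustrative only.
type storageBackend interface{}

type etcdStorage struct{ resource string }

type nodeSelectorFilterer struct{ delegate storageBackend }

func newNodeSelectorFilterer(delegate storageBackend) *nodeSelectorFilterer {
  return &nodeSelectorFilterer{delegate: delegate}
}

// newSecretStorage sketches the registry-side change: construct the storage
// exactly as today (etcd + cacher) and wrap it with the filterer.
func newSecretStorage() storageBackend {
  underlying := &etcdStorage{resource: "secrets"} // existing storage, unchanged
  return newNodeSelectorFilterer(underlying)
}
```

 This keeps the change local to registry initialization, which matches the goal of
making the changes as local as possible.
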
TODO: If we bind the first pod referencing a given object to a node (or delete
the last one), an ADD watch event for the object (or a DELETE one) should be sent
to the watcher. The ADD shouldn't be problematic; we need to ensure that DELETE
will not cause problems (I think it shouldn't, as deleting a pod means that either
it was already removed by the kubelet, or it is a non-graceful deletion and then it
doesn't matter that much).