Commit a19ba32: List/Watch/Get of objects associated with node

# Background

As part of increasing the security of a cluster, we are planning to limit a given
Kubelet (in general: a node) to reading only the resources associated with it.
Those resources are, in particular: secrets, configmaps &
persistentvolumeclaims. This is needed to avoid a situation where compromising a node
de facto means compromising the cluster. For more details & discussion see
https://github.com/kubernetes/kubernetes/issues/40476.

However, as an extension of this effort, we would also like to improve the scalability
of the system by significantly reducing the number of API calls coming from kubelets.
As of now, to avoid a situation where the kubelet is watching all secrets/configmaps/...
in the system, it is not using watch. Instead, it retrieves the individual objects it
needs by sending individual GET requests. Moreover, it sends those requests periodically
to enable automatic updates of mounted secrets/configmaps/... In large clusters,
this generates a huge amount of unnecessary load, as this load in principle should be
watch-based. We would like to address this together with solving the authorization issue.

# Proposal

In this proposal, I'm not focusing on how exactly security should be done - I'm just
sketching a very high-level approach; the exact authorization mechanism should be
discussed separately.

At the high level, what we want to achieve is to enable LIST and WATCH requests to
support filtering for "only the objects attached to pods bound to a given node". We
obviously also want to be able to authorize other types of requests (in particular
GETs), so the design has to be consistent with that.

To solve this, I propose to introduce a new filtering mechanism (next to label selector
and field selector): ```node selector``` (we probably need a better name though). Its
semantics will be to filter only objects that are attached to pods bound to a given node,
and it will be supported for some predefined set of object types.

# Detailed design

We will introduce the following ```node selector``` filtering mechanism:

```
// TODO: Consider making an interface for it.
type NodeSelector struct {
	// TODO: Should this be a repeated field, to allow for some fancy controllers
	// that will have access to multiple nodes?
	nodeName string
}
```

The NodeSelector field will be added to ```ListOptions``` (next to the label & field
selectors) and will be supported only by LIST and WATCH requests.

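For illustration only, an abridged sketch of what this could look like; the existing
```ListOptions``` fields are abbreviated and the ```NodeName``` field name is just a
placeholder:

```
// Abridged sketch of ListOptions with the proposed addition; the existing
// fields are elided/approximate, and NodeName is only a placeholder name.
type ListOptions struct {
	// Existing selectors (abridged).
	LabelSelector string `json:"labelSelector,omitempty"`
	FieldSelector string `json:"fieldSelector,omitempty"`
	Watch         bool   `json:"watch,omitempty"`

	// Proposed node selector: when set on a LIST or WATCH request, only
	// objects attached to pods bound to this node are returned.
	NodeName string `json:"nodeName,omitempty"`
}
```
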
With that mechanism in place, all List/Watch requests coming from kubelets will have
to have this field set correctly. We will create a dedicated admission plugin that will
be responsible for checking whether a given ```node selector``` is allowed for a given
client (the exact mechanism for doing this is out of scope for this doc) and for either
rejecting the request or letting it through. Note that, from the implementation point
of view, doing this may require modifying the admission attributes.

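A minimal sketch of such a check, assuming a simplified view of the admission
attributes; ```requestAttributes``` and ```nodeForUser``` are hypothetical stand-ins,
and how a client is mapped to "its" node is exactly the part that is out of scope here:

```
package main

import (
	"fmt"
	"strings"
)

// requestAttributes is a hypothetical, simplified view of the admission
// attributes of an incoming request; a real plugin would work with the
// apiserver's admission attributes instead.
type requestAttributes struct {
	userName          string // authenticated user, e.g. "system:node:node-1"
	requestedNodeName string // node selector carried by the request, if any
}

// nodeForUser is a placeholder for whatever mechanism maps a client to the
// node it is allowed to act as.
func nodeForUser(user string) (string, bool) {
	const prefix = "system:node:"
	if strings.HasPrefix(user, prefix) {
		return strings.TrimPrefix(user, prefix), true
	}
	return "", false
}

// admitNodeSelector rejects a request whose node selector does not match the
// node its client is allowed to act as; requests without a node selector are
// left to the other authorization rules.
func admitNodeSelector(attrs requestAttributes) error {
	if attrs.requestedNodeName == "" {
		return nil
	}
	node, ok := nodeForUser(attrs.userName)
	if !ok || node != attrs.requestedNodeName {
		return fmt.Errorf("node selector %q is not allowed for user %q",
			attrs.requestedNodeName, attrs.userName)
	}
	return nil
}

func main() {
	fmt.Println(admitNodeSelector(requestAttributes{
		userName:          "system:node:node-1",
		requestedNodeName: "node-1",
	})) // <nil>
	fmt.Println(admitNodeSelector(requestAttributes{
		userName:          "system:node:node-2",
		requestedNodeName: "node-1",
	})) // error
}
```
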
TODO: Consider adding NodeSelector to ```GetOptions``` as well - if we did that, we
could have a unified pattern for authorizing all requests coming from nodes and keep
the admission plugin relatively simple.

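Purely to illustrate that TODO (field name and placement are hypothetical):

```
// Hypothetical: GetOptions extended with the same node selector field, so that
// GET requests could go through the same admission check as LIST/WATCH.
type GetOptions struct {
	// ...existing fields elided...

	NodeName string `json:"nodeName,omitempty"`
}
```
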
Once the request is authorized, we need to modify the apiserver to be able to support
the ```node selector```. We would like to make the changes as local as possible,
thus we will solve it at the apiserver storage layer. Going into details (a rough code
sketch follows the list):

1. We will create a new class ```NodeSelectorFilterer``` (TODO: come up with a
better name) that will implement ```storage.Interface```.
2. ```NodeSelectorFilterer``` will be a wrapper around what we are currently
using as storage (which is the implementation of this interface for etcd plus,
in the case of most resource kinds, the cacher).
3. List and Watch calls will be sent to the wrapped implementation; we will catch
the result, filter it based on the ```node selector```, and send the filtered
result back to the user. All other requests will simply be forwarded to the
wrapped implementation.
4. ```NodeSelectorFilterer``` will maintain an (in-memory) mapping from an object
(namespace/name) to the list of nodes to which at least one pod referencing
this object is bound. This mapping will be built using the standard
reflector/informer framework by self-looping into the Kubernetes API.
5. ```NodeSelectorFilterer``` will be a per-resource-type object (similarly to
how the cacher is), thus we need to share e.g. the pod informer between those.
6. As an optimization, we should consider setting an appropriate trigger function
in the cacher (based on the mapping from above that we will already have).

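The rough sketch referenced above. It is simplified and self-contained:
```storageBackend```, ```objectRef``` and the other names are hypothetical stand-ins
for the real ```storage.Interface``` and watch machinery, and only the List path is
shown - Watch would filter the event stream analogously.

```
package main

import (
	"fmt"
	"sync"
)

// objectRef identifies an object by namespace/name.
type objectRef struct{ namespace, name string }

type object struct {
	ref objectRef
	// payload elided
}

// storageBackend is a simplified stand-in for the wrapped storage
// implementation (etcd + cacher in the real apiserver).
type storageBackend interface {
	List(namespace string) []object
}

// nodeSelectorFilterer wraps a backend and filters List results down to the
// objects referenced by at least one pod bound to the requested node.
type nodeSelectorFilterer struct {
	delegate storageBackend

	mu sync.RWMutex
	// refs maps an object to the nodes that have at least one pod referencing
	// it (with a reference count per node). In the real design this would be
	// kept up to date by a shared pod informer ("self-looping" into the API).
	refs map[objectRef]map[string]int
}

// onPodAdded records that a pod bound to node references the given objects.
func (f *nodeSelectorFilterer) onPodAdded(node string, referenced []objectRef) {
	f.mu.Lock()
	defer f.mu.Unlock()
	for _, ref := range referenced {
		if f.refs[ref] == nil {
			f.refs[ref] = map[string]int{}
		}
		f.refs[ref][node]++
	}
}

// onPodDeleted drops the pod's references; once the count reaches zero the
// node no longer "sees" the object.
func (f *nodeSelectorFilterer) onPodDeleted(node string, referenced []objectRef) {
	f.mu.Lock()
	defer f.mu.Unlock()
	for _, ref := range referenced {
		if nodes := f.refs[ref]; nodes != nil {
			if nodes[node]--; nodes[node] <= 0 {
				delete(nodes, node)
			}
		}
	}
}

// List forwards to the delegate and filters the result by the node selector.
func (f *nodeSelectorFilterer) List(namespace, nodeName string) []object {
	all := f.delegate.List(namespace)
	if nodeName == "" {
		return all // no node selector: behave exactly like the delegate
	}
	f.mu.RLock()
	defer f.mu.RUnlock()
	var filtered []object
	for _, obj := range all {
		if _, ok := f.refs[obj.ref][nodeName]; ok {
			filtered = append(filtered, obj)
		}
	}
	return filtered
}

// fakeBackend is a trivial in-memory backend used only to exercise the sketch.
type fakeBackend struct{ objects []object }

func (b fakeBackend) List(string) []object { return b.objects }

func main() {
	secretA := objectRef{"ns1", "secret-a"}
	secretB := objectRef{"ns1", "secret-b"}
	f := &nodeSelectorFilterer{
		delegate: fakeBackend{objects: []object{{ref: secretA}, {ref: secretB}}},
		refs:     map[objectRef]map[string]int{},
	}
	f.onPodAdded("node-1", []objectRef{secretA})
	fmt.Println(f.List("ns1", "node-1")) // only secret-a is visible to node-1
}
```

Keeping the filtering in a wrapper like this keeps the change local to the storage
layer, as intended above.
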
Once we have ```NodeSelectorFilterer``` implemented, the changes that will
need to be done in the apiserver will just be:

1. Change ```SelectionPredicate``` to also contain ```NodeSelector``` and
propagate it from the generic registry.
2. Correctly initialize the storage for every registry by wrapping the already
existing one with ```NodeSelectorFilterer```.

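An abridged sketch of the first change; the existing ```Label```/```Field``` fields
are shown only approximately, and ```NodeSelector``` is the proposed addition:

```
// Abridged sketch; existing fields are approximate.
type SelectionPredicate struct {
	Label labels.Selector
	Field fields.Selector

	// Proposed: carries the node selector from ListOptions down to the
	// storage layer, where NodeSelectorFilterer can act on it.
	NodeSelector *NodeSelector
}
```
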
TODO: If we bind the first pod referencing a given object to a node (or delete
the last one), an ADD watch event for the object (or a DELETE one) should be sent to
the watcher. The ADD shouldn't be problematic; we need to ensure that DELETE will not
cause problems (I think it shouldn't, as deleting a pod means that either it was
already removed by the kubelet, or it is a non-graceful deletion and then it doesn't
matter that much).
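
A tiny standalone sketch of that decision; the names are hypothetical, and the
per-node reference counts would come from the mapping that ```NodeSelectorFilterer```
already maintains:

```
package main

import "fmt"

// syntheticEvent returns the watch event ("ADDED", "DELETED" or "") that the
// filtering layer should send to a node's watchers when the number of pods on
// that node referencing a given object changes from oldRefs to newRefs.
func syntheticEvent(oldRefs, newRefs int) string {
	switch {
	case oldRefs == 0 && newRefs > 0:
		return "ADDED" // first referencing pod was bound to the node
	case oldRefs > 0 && newRefs == 0:
		return "DELETED" // last referencing pod was removed from the node
	default:
		return "" // the object's visibility for the node did not change
	}
}

func main() {
	fmt.Println(syntheticEvent(0, 1)) // ADDED
	fmt.Println(syntheticEvent(2, 1)) // (no event)
	fmt.Println(syntheticEvent(1, 0)) // DELETED
}
```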
