# Background

As part of increasing the security of a cluster, we are planning to restrict a given Kubelet (in general: a node) so that it can read only the resources associated with it. Those resources are, in particular: secrets, configmaps & persistentvolumeclaims. This is needed to avoid a situation where compromising a node de facto means compromising the cluster. For more details & discussions see https://github.com/kubernetes/kubernetes/issues/40476.

However, as an extension of this effort, we would also like to improve the scalability of the system by significantly reducing the amount of api calls coming from kubelets. As of now, to avoid a situation where the kubelet watches all secrets/configmaps/... in the system, it is not using watch. Instead, it retrieves individual objects by sending individual GET requests. Moreover, it sends those requests periodically to enable automatic updates of mounted secrets/configmaps/... In large clusters this generates a huge unnecessary load, as this traffic should in principle be watch-based. We would like to address this together with solving the authorization issue.

# Proposal

In this proposal I'm not focusing on how exactly security should be enforced - I'm only sketching a very high level approach; the exact authorization mechanism should be discussed separately.

At the high level, what we want to achieve is to enable LIST and WATCH requests to support filtering of "objects attached only to pods bound to a given node". We obviously want to be able to authorize other types of requests as well (in particular GETs), so the design has to be consistent with that.

To solve this, I propose to introduce a new filtering mechanism (next to label selector and field selector): ```node selector``` (we probably need a better name though). Its semantic will be to return only those objects that are attached to pods bound to a given node, and it will be supported for some predefined set of object types.

# Detailed design

We will introduce the following ```node selector``` filtering mechanism:

```
// TODO: Consider making an interface for it.
type NodeSelector struct {
	// TODO: Should this be a repeated field to allow for some fancy controllers
	// that will have access to multiple nodes?
	nodeName string
}
```


The NodeSelector field will be added to ```ListOptions``` (next to label & field selectors) and will be supported only by LIST and WATCH requests.

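
For illustration, here is a hedged sketch of how ```ListOptions``` could carry the new selector; the existing fields are abbreviated and the exact field name and placement are assumptions of this sketch (it reuses the ```NodeSelector``` type defined above):

```
// Sketch only: the real ListOptions has more fields and typed selectors.
type ListOptions struct {
	LabelSelector string
	FieldSelector string
	// NodeSelector, when set, restricts the result to objects attached to
	// pods bound to the given node.
	NodeSelector *NodeSelector
}

// Example: a kubelet building options for listing/watching only "its" objects.
func optionsForNode(nodeName string) ListOptions {
	return ListOptions{NodeSelector: &NodeSelector{nodeName: nodeName}}
}
```
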

With that mechanism in place, all List/Watch requests coming from kubelets will have to have this field correctly set. We will create a dedicated admission plugin that will be responsible for checking whether a given ```node selector``` is allowed for a given client (the exact mechanism for doing this is out of scope for this doc) and will either reject the request or let it through. Note that doing this may require modifying admission attributes from the implementation point of view.

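
A minimal sketch of what such a check could look like is given below; the attribute fields (in particular how the node identity of the client is derived) are hypothetical placeholders, not existing admission APIs:

```
import "fmt"

// Sketch only: nodeAdmissionAttributes is a hypothetical stand-in for the
// attributes the admission plugin would see for an incoming LIST/WATCH request.
type nodeAdmissionAttributes struct {
	// userNodeName is the node identity derived from the client credentials
	// (how it is derived is out of scope for this doc).
	userNodeName string
	// requestedNodeSelector is the NodeSelector set in ListOptions, if any.
	requestedNodeSelector *NodeSelector
}

// admitNodeSelector rejects requests whose node selector does not match the
// node identity of the requesting client.
func admitNodeSelector(attrs nodeAdmissionAttributes) error {
	if attrs.requestedNodeSelector == nil {
		// No node selector set - other authorization rules apply.
		return nil
	}
	if attrs.requestedNodeSelector.nodeName != attrs.userNodeName {
		return fmt.Errorf("node %q may not use node selector %q",
			attrs.userNodeName, attrs.requestedNodeSelector.nodeName)
	}
	return nil
}
```
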

TODO: Consider adding NodeSelector to ```GetOptions``` - if we did that, we could have a unified pattern for authorizing all requests from nodes and keep the admission plugin relatively simple.

### Semantic of watch

To make all our existing "list-watch-related" frameworks work correctly for this kind of watch, we would like to preserve all the crucial invariants of watch. This in particular means:

1. There is at most one watch event for a given resource version.
2. Watch events are delivered in increasing order of resource versions.


In the ideal situation, if a new pod referencing object X appears (and we are watching objects of that type), we would send an add event for object X. However, that would break the above invariants, because:

1. There can be multiple objects referenced by a given pod (so we can't send all of them with the rv corresponding to that pod add/update/delete).
2. If we decided to send them with their original rvs, then we could potentially go back in time.


As a result, we propose the following semantic:

1. If a pod is created/updated/deleted, no watch events are delivered. It is the responsibility of the user to grab the current versions of all objects that are referenced by this pod.
2. From that point on, events for all objects referenced by this pod (e.g. add/update/delete of a secret referenced by it) are delivered to the watcher for as long as the pod exists on the node.

With this semantic we will be able to reuse e.g. the reflector framework with just minor modifications.


TODO: Describe how this should be used.

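
As a starting point for that TODO, below is a hedged sketch of the consumer-side flow implied by the semantic above; all of the types and helpers here (```nodeClient```, ```secretManager```, ...) are hypothetical placeholders, not existing kubelet code:

```
// Sketch only: minimal stand-ins for the consumer side of the node-scoped watch.

type Secret struct {
	Namespace, Name, Value string
}

type Pod struct {
	Namespace  string
	SecretRefs []string // names of secrets the pod mounts
}

type secretEvent struct {
	Type   string // "ADDED", "MODIFIED" or "DELETED"
	Secret Secret
}

// nodeClient abstracts the two calls the consumer needs: individual GETs for
// seeding, and a node-scoped watch (LIST/WATCH with NodeSelector set).
type nodeClient interface {
	GetSecret(namespace, name string) (Secret, error)
	WatchSecretsForNode(nodeName string) <-chan secretEvent
}

type secretManager struct {
	client   nodeClient
	nodeName string
	cache    map[string]Secret // keyed by namespace/name
}

// OnPodAdded seeds the cache: per the semantic above, the watch does not replay
// events for objects a newly observed pod references, so the consumer fetches
// their current versions explicitly.
func (m *secretManager) OnPodAdded(pod Pod) {
	for _, name := range pod.SecretRefs {
		s, err := m.client.GetSecret(pod.Namespace, name)
		if err != nil {
			// A "not found" is fine: the secret will be delivered via the
			// watch once it is created.
			continue
		}
		m.cache[pod.Namespace+"/"+s.Name] = s
	}
}

// Run keeps the cache up to date: once the pod exists on the node, all
// add/update/delete events for its referenced secrets arrive on the watch.
func (m *secretManager) Run(stop <-chan struct{}) {
	events := m.client.WatchSecretsForNode(m.nodeName)
	for {
		select {
		case e := <-events:
			if e.Type == "DELETED" {
				delete(m.cache, e.Secret.Namespace+"/"+e.Secret.Name)
			} else {
				m.cache[e.Secret.Namespace+"/"+e.Secret.Name] = e.Secret
			}
		case <-stop:
			return
		}
	}
}
```
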
### Determining whether an object is referenced by pods from a given node

The tricky part in determining whether an object X is referenced by any pod bound to a given node is to avoid various kinds of race conditions and to do it in a deterministic way.

The crucial requirements here are:

1. Whenever a "list" request returns a list of objects and a resource version "rv", starting a watch from the returned "rv" will never drop any events.
2. For a given watch request (with resource version "rv"), the returned stream of events is always the same (e.g. a very slow, lagging watcher must not cause dropped events).


To satisfy the above requirements, we can't really rely only on the existing tools/frameworks that we have. Theoretically, we could build the mapping from an object to the list of nodes it is bound to using the standard reflector/informer framework. But having separate informers for different kinds of objects (which is the only possible way to use them) may result in races. As an example, imagine a pod P referencing a secret S, with the binding of pod P to node N happening right before the creation of secret S. Then the following race may potentially happen:

1. The user observes the binding of pod P and tries to retrieve the current value of the secret, but it doesn't exist yet. From that point she knows that as soon as it is created, it should be delivered to her via watch.
2. However, the pod informer in our code is lagging, and we first observe the secret creation. The secret is not referenced yet (we haven't observed the pod), so we don't send any event.
3. We then observe the pod binding, but according to our semantics, we don't send any event.

As a result, the expected "add secret" event is never sent to the watcher.

To solve the problem reliably, we need to be able to correctly serialize the watch events between different object types in a deterministic way.


One potential solution would be to identify this watch by a resource version combined from the resource versions of the different object kinds (e.g. pods have rv = rv1, secrets have rv = rv2, ...). We could then keep the history of objects being processed to update the in-memory mapping, and that would be deterministic. However, the order might be different e.g. after restarting the apiserver, which also means that it wouldn't work in HA setups (assuming we don't have some external "serializing" component).


Fortunately, we can solve it in a much simpler way, with one additional assumption:

1. All object types necessary to determine the in-memory mapping share the same resource version series.


With that assumption, we can have a "multi-object-type" watch that will serialize the events for different object types for us. Having exactly one watch responsible for delivering all objects (pods, secrets, ...) guarantees that if we are currently at resource version rv, we have processed objects of all types up to rv and nothing with a resource version greater than rv. Which is exactly what we need.

NOTE: We are not going to implement a generic "multi-object-type" watch and expose it in the api (which is a much bigger task). This will be purely an implementation detail hidden in the code (see more details in the next section).

### Implementation details

Once the request is authorized, it reaches the ```core``` part of the apiserver, which needs to support the ```node selector```. Given the requirements and design decisions made above, we would like to make the changes as local as possible. Thus, we will solve this at the apiserver storage layer, by making the following changes:

1. Change ```SelectionPredicate``` to also contain a ```NodeSelector``` and propagate it from the generic registry.

2. Create an ```ObjectToNodesMapping``` class (TODO: come up with a better name) that will maintain an (in-memory) mapping from an object (kind/namespace/name) to the list of nodes to which at least one pod referencing this object is bound. To achieve it, we will implement a simplified version of the "multi-object-type" watch as follows:

   - we will instantiate a separate etcd implementation of ```storage.Interface``` with a different codec;

   - the codec will work only on the newly created ```etcdObject``` type; the ```etcdObject``` will contain the key & value coming directly from etcd as its fields. We will need to inject some mechanism for setting the key into the existing etcd-based implementation (similar to storage.Versioner).
     TODO: Can we do it simpler?

   - using that, we will create a reflector that will be listing+watching everything. By making SelectionPredicate an interface, we will create a very small and simple implementation of it that filters the returned ```etcdObject```s based on their etcd key, to filter out all uninteresting object types;

   - we will create a dedicated store implementation that, for the incoming stream of ```etcdObjects```, determines the object type based on the etcd key, decodes them into real objects using the original codec, and then triggers the appropriate handler function.

3. Having that, we will create a new class ```NodeSelectorFilterer``` (TODO: come up with a better name) that will implement ```storage.Interface```.

4. ```NodeSelectorFilterer``` will be a wrapper around what we currently use as storage (the etcd implementation of the interface + the cacher (the cacher is not strictly required though)).

5. Every request except LIST and WATCH requests with a ```NodeSelector``` set will be forwarded to the wrapped implementation.

6. LIST and WATCH requests with a ```NodeSelector``` set will be served directly by ```NodeSelectorFilterer```, based on the contents of the store described above (it will contain some limited cached history for objects that can be watched), similarly to what we do in "cacher + watchCache". A hedged sketch of this wrapping is given after this list.

7. Correctly initialize the storage for every registry by wrapping the already existing one with the ```NodeSelectorFilterer```.
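
To make the wrapping concrete, below is a hedged sketch of ```NodeSelectorFilterer``` on top of a drastically simplified stand-in for ```storage.Interface```; the ```etcdObject``` plumbing is omitted, the ```ListOptions```/```NodeSelector``` shapes from the sketches above are reused, and everything else here is illustrative only:

```
// Sketch only: a drastically simplified stand-in for storage.Interface.
type Interface interface {
	List(key string, opts ListOptions) ([]Object, error)
	Watch(key string, opts ListOptions) (<-chan Object, error)
}

// Object is a minimal placeholder for a decoded API object.
type Object struct {
	Kind, Namespace, Name string
}

// nodeScopedStore stands in for the watch-cache-like store that is fed by the
// single "multi-object-type" watch and can answer node-scoped LISTs directly.
type nodeScopedStore struct {
	byNode map[string][]Object // node name -> objects referenced from its pods
}

func (s *nodeScopedStore) listForNode(node string) []Object {
	return s.byNode[node]
}

// NodeSelectorFilterer wraps the existing storage implementation.
type NodeSelectorFilterer struct {
	delegate Interface        // the existing etcd (+ cacher) storage
	store    *nodeScopedStore // backed by ObjectToNodesMapping
}

func (f *NodeSelectorFilterer) List(key string, opts ListOptions) ([]Object, error) {
	if opts.NodeSelector == nil {
		// Requests without a NodeSelector are forwarded untouched.
		return f.delegate.List(key, opts)
	}
	// Node-scoped LISTs are served from the in-memory store, similarly to
	// what "cacher + watchCache" does today. Watch is handled analogously
	// and omitted from this sketch.
	return f.store.listForNode(opts.NodeSelector.nodeName), nil
}
```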
