From 93c1c9721d58225da16d794da2cba4056647ad78 Mon Sep 17 00:00:00 2001
From: Wojciech Tyczynski
Date: Fri, 10 Mar 2017 15:14:38 +0100
Subject: [PATCH] List/Watch/Get of objects associated with node
---
 contributors/design-proposals/bulk_watch.md | 476 ++++++++++++++++++++
 1 file changed, 476 insertions(+)
 create mode 100644 contributors/design-proposals/bulk_watch.md

diff --git a/contributors/design-proposals/bulk_watch.md b/contributors/design-proposals/bulk_watch.md
new file mode 100644
index 00000000000..64db9053706
--- /dev/null
+++ b/contributors/design-proposals/bulk_watch.md
@@ -0,0 +1,476 @@
+# Background
+
+ As part of increasing the security of a cluster, we are planning to limit
+the ability of a given Kubelet (in general: a node) to read only the
+resources associated with it. Those resources are, in particular: secrets,
+configmaps and persistentvolumeclaims. This is needed to avoid a situation
+where compromising a node de facto means compromising the cluster. For more
+details and discussion see
+https://github.com/kubernetes/kubernetes/issues/40476.
+
+ As an extension of this effort, we would also like to improve the
+scalability of the system by significantly reducing the number of API calls
+coming from kubelets. As of now, to avoid a situation where the Kubelet
+watches all secrets/configmaps/... in the system, it is not using watch for
+this purpose. Instead, it retrieves individual objects by sending individual
+GET requests. However, to enable automatic updates of mounted
+secrets/configmaps/..., the Kubelet sends those GET requests periodically.
+In large clusters, this generates huge unnecessary load, as this load should
+in principle be watch-based. We would like to address this together with
+solving the authorization issue.
+
+
+# Proposal
+
+ In this proposal, we are not focusing on how exactly security should be
+done. We are just sketching a very high-level approach; the exact
+authorization mechanism should be discussed separately.
+
+ At the high level, what we would like to achieve is to enable LIST and
+WATCH requests to support more sophisticated filtering (e.g. we would like
+to be able to somehow ask for all Secrets attached to all pods bound to a
+given node in a single LIST or WATCH request). However, the design has to be
+consistent with authorization of other types of requests (in particular
+GETs).
+
+ To solve this problem, we propose to introduce the idea of `bulk watch`
+and `bulk get` (and in the future other bulk operations). This idea has
+already appeared in the community and now we have a good use case to proceed
+with it.
+
+ Once a bulk watch is set up, we also need to periodically verify its ACLs.
+Whenever a user (in our case the Kubelet) loses access to a given resource,
+the watch should be closed within some bounded time. The rationale behind
+this requirement is that by using (bulk) watch we still want to enforce ACLs
+similarly to how we enforce them for get operations (and that the Kubelet
+would eventually be unable to access a secret no matter whether it is
+watching or polling it).
+
+ That said, periodic verification of ACLs isn't specific to bulk watch and
+needs to be solved for `regular` watch too (e.g. a user watching just a
+single secret may also lose access to it, and such a watch should also be
+closed in this case). So this requirement is common for both regular and
+bulk watch. We just need to solve this problem at a low enough level that we
+can reuse the same mechanism in both cases - we will solve it by sending an
+error event to the watcher and then just closing this particular watch.
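+
+ For illustration only, here is a minimal sketch of what such a forced close
+could look like on the server side, reusing the existing watch event
+machinery; the helper name `closeForbidden` and the concrete status fields
+are assumptions of this sketch, not part of the design:
+
+```go
+package example
+
+import (
+	"fmt"
+	"net/http"
+
+	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
+	"k8s.io/apimachinery/pkg/watch"
+)
+
+// closeForbidden emits a terminal error event on the channel backing a
+// single watch and then closes it; the same helper would work for a
+// regular watch and for a single channel of a bulk watch.
+func closeForbidden(result chan<- watch.Event, resource string) {
+	result <- watch.Event{
+		Type: watch.Error,
+		Object: &metav1.Status{
+			Status:  metav1.StatusFailure,
+			Code:    http.StatusForbidden,
+			Reason:  metav1.StatusReasonForbidden,
+			Message: fmt.Sprintf("access to %s has been revoked", resource),
+		},
+	}
+	close(result)
+}
+```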
+
+ At the high level, we would like the API to be generic enough that it will
+be useful for many different use cases, not just this particular one
+(another example may be a controller that needs to deal with just a subset
+of namespaces). As a result, below we describe the requirements that the end
+solution has to meet to satisfy our needs:
+- a single bulk request has to support multiple resource types (e.g. get a
+node and all pods associated with it)
+- the wrappers for aggregating multiple objects (in the case of list we may
+return a number of objects of different kinds) should be `similar` to lists
+in the core API (by lists we mean e.g. the `PodList` object)
+- the API has to be implemented also in the aggregator so that bulk
+operations are supported even if different resource types are served by
+different apiservers
+- clients have to be able to alter their watch subscriptions incrementally
+(this may not be implemented in the initial version, but it has to be
+designed for)
+
+
+# Detailed design
+
+ As stated in the requirements above, we need to make bulk operations work
+across different resource types (e.g. watch pod P and secret S within a
+single watch call). Spanning multiple resources, resource types or
+conditions will become more and more important for a large number of
+watches. As an example, federation will be adding watches for every type it
+federates. With that in mind, bypassing aggregation at the resource type
+level and going to aggregation over objects with different resource types
+will allow us to optimize more aggressively in the future (it doesn't mean
+you have to watch resources of different types in a single watch, but we
+would like to make it possible).
+
+ That means that we need to implement bulk operations at the aggregation
+level. The implications of this are discussed below.
+
+ Moreover, our current REST API doesn't even offer an easy way to handle
+"multiple watches of a given type" within a single request. As a result,
+instead of inventing a new top-level pattern per type, we can introduce a
+new resource type that follows normal RESTful rules and solves the even more
+generic problem of spanning multiple different resource types.
+
+ We will start by introducing a new dedicated API group:
+ ```
+ /apis/bulk.k8s.io/
+ ```
+ that underneath will have a completely separate implementation.
+
+ In all the text below, we assume the v1 version of the API, but it will
+obviously go through alpha and beta stages first (it will start as
+v1alpha1).
+
+ In this design, we will focus only on bulk get (list) and watch
+operations. Later, we would like to introduce new resources to support bulk
+create, update and delete operations, but that's not part of this design.
+
+ We will start by introducing the `getoperation` resource and supporting
+the following operation:
+```
+POST /apis/bulk.k8s.io/v1/getoperation
+```
+ We can't simply make this an http GET request, due to GET's limitations on
+the size (length) of the URL (in which we would have to pass the filter
+options).
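+
+ To make the shape of such a request concrete, here is a sketch of how a
+client might build the body of a `getoperation` POST, using the
+`GetOperations`/`GetOperation` types defined in the Filtering definition
+section below; the JSON encoding and the helper name are illustrative
+assumptions, not the final wire format:
+
+```go
+package example
+
+import (
+	"bytes"
+	"encoding/json"
+	"net/http"
+
+	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
+	"k8s.io/apimachinery/pkg/runtime/schema"
+)
+
+// buildBulkGet asks, in a single request, for one secret by name and for
+// all configmaps matching a label selector. GetOperations/GetOperation are
+// the types from the Filtering definition section below.
+func buildBulkGet(server string) (*http.Request, error) {
+	ops := GetOperations{
+		Operations: []GetOperation{
+			{
+				Resource: schema.GroupVersionResource{Version: "v1", Resource: "secrets"},
+				Options:  metav1.ListOptions{FieldSelector: "metadata.name=db-password"},
+			},
+			{
+				Resource: schema.GroupVersionResource{Version: "v1", Resource: "configmaps"},
+				Options:  metav1.ListOptions{LabelSelector: "app=frontend"},
+			},
+		},
+	}
+	body, err := json.Marshal(ops)
+	if err != nil {
+		return nil, err
+	}
+	return http.NewRequest("POST", server+"/apis/bulk.k8s.io/v1/getoperation", bytes.NewReader(body))
+}
+```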
+
+ We could consider adding the `watch` operation using the same pattern with
+just a `?watch=1` parameter. However, the main drawback of this approach is
+that it won't allow for dynamic altering of watch subscriptions (which we
+definitely also need to support).
+As a result, we need another API for watch that will also support
+incremental subscriptions - it will look as follows:
+```
+websocket /apis/bulk.k8s.io/v1/getoperation?watch=1
+```
+
+*Note: For consistency, we also considered introducing a websocket API for
+handling LIST requests, where the client first sends a filter definition
+over the channel and then the server sends back the response, but we dropped
+this for now.*
+
+*Note: We also considered implementing a POST-based watch handler that
+doesn't allow for altering subscriptions, which should be very simple once
+we have list implemented. But since the websocket API is needed anyway, we
+dropped that as well.*
+
+
+### Filtering definition
+
+ We need our filtering mechanism to support different resource types at the
+same time. On the other hand, we would like it to be as consistent with all
+other Kubernetes APIs as possible. So we define the selector for bulk
+operations as follows:
+
+```
+type GetOperations struct {
+	Operations []GetOperation
+}
+
+type GetOperation struct {
+	Resource GroupVersionResource
+
+	// We would like to reuse the same ListOptions definition as we are using
+	// in regular APIs.
+	// TODO: We may consider supporting multiple ListOptions for a single
+	// GetOperation.
+	Options ListOptions
+}
+```
+
+ We need to be able to detect whether a given user is allowed to get/list
+the objects requested by a given "GetOperation". For that purpose, we will
+create a dedicated admission plugin (or potentially reuse an already
+existing one). It will not support partial rejections and will simply allow
+or reject the whole request (see the sketch below).
+
+For watch operations, as described in the requirements, we also need to
+periodically (or maybe lazily) verify that their ACLs haven't changed (and
+that the user is still allowed to watch the requested objects). However, as
+also mentioned above, this periodic checking isn't specific to bulk
+operations and we need to support the same mechanism for regular watch too.
+We will just ensure that this mechanism for tracking and periodically
+verifying ACLs is implemented low enough in the apiserver machinery that we
+can reuse exactly the same one for bulk watch operations.
+For watch requests, we will support partial rejections. The exact details
+are described together with the dynamic watch description below.
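+
+ Coming back to the whole-request check for bulk get, here is a minimal
+sketch of it; the `Authorizer` interface is a hypothetical stand-in for the
+apiserver's authorization stack, introduced only for this example:
+
+```go
+package example
+
+import "fmt"
+
+// Authorizer is a hypothetical stand-in for the apiserver's authorization
+// stack; GetOperation is the type from the Filtering definition above.
+type Authorizer interface {
+	// Allowed reports whether the user may perform verb on the resource.
+	Allowed(user, verb, group, resource string) bool
+}
+
+// authorizeBulkGet rejects the whole request if any single operation is
+// not permitted - partial rejections are not supported for bulk get.
+func authorizeBulkGet(authz Authorizer, user string, ops []GetOperation) error {
+	for _, op := range ops {
+		if !authz.Allowed(user, "list", op.Resource.Group, op.Resource.Resource) {
+			return fmt.Errorf("user %q may not list %v", user, op.Resource)
+		}
+	}
+	return nil
+}
+```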
+
+
+### Dynamic watch
+
+ As mentioned in the Proposal section, we will implement a bulk watch that
+allows for dynamic subscription/unsubscription for (sets of) objects on top
+of the websocket protocol.
+
+ Note that we already support websockets in the regular Kubernetes API for
+watch requests (in addition to regular http requests), so for the purpose of
+bulk watch we will be extending websocket support.
+
+ At the high level, the protocol will look as follows:
+1. the client opens a new websocket connection to the bulk watch endpoint of
+the server via an http GET
+1. this results in creating a single channel that is used only to handle
+the communication for subscribing/unsubscribing to watches; no watch events
+are delivered via this particular channel.
+*TODO: Consider switching to two channels, one for incoming and one for
+outgoing communication.*
+1. to subscribe to a watch of a given (set of) objects, the user sends a
+`Watch` object over the channel; in response a new channel is created and a
+message with the channel identifier is sent back to the user (we will be
+using integers as channel identifiers).
+*TODO: Check if integers are mandatory or if we can switch to something like
+ch1, ch2 ... .*
+1. once subscribed, all objects matching the given selector will be sent
+over the newly created channel
+1. to stop watching a given (set of) objects, the user sends a `CloseWatch`
+object over the channel; in response the corresponding watch is broken and
+the corresponding channel within the websocket is closed
+1. once done, the user should close the whole websocket connection (this
+results in breaking all still-open channels and the corresponding watches).
+
+ With that high-level protocol, there are still multiple details that need
+to be figured out. First, we need to define the `Watch` and `CloseWatch`
+messages. We will solve this with a single `Request` object:
+```
+type Request struct {
+	// Only one of those is set.
+	Watch      *Watch
+	CloseWatch *CloseWatch
+}
+
+type Identifier int64
+
+// Watch requests a watch of the objects matching a given selector.
+type Watch struct {
+	Selector GetOperation
+}
+
+// CloseWatch requests to stop watching objects from the watch identified
+// by the given channel.
+type CloseWatch struct {
+	Channel Identifier
+}
+
+// Response carries, depending on the request, the channel that was created
+// or deleted.
+type Response struct {
+	Channel Identifier
+}
+```
+With the above structure we can guarantee that we only send and receive
+objects of a single type over any given channel.
+
+We should also introduce some way of correlating responses with requests
+when a client sends multiple of them at the same time. To achieve this we
+will add a request identifier field to `Request` that the user can set and
+that will then be returned as part of `Response`. With this mechanism the
+user can set the identifier to increasing integers and will then be able to
+correlate responses with the requests sent before (a client-side sketch of
+this bookkeeping follows below). So the final structure will be as follows:
+```
+type Request struct {
+	ID Identifier
+	// Only one of those is set.
+	Watch      *Watch
+	CloseWatch *CloseWatch
+}
+
+// Response carries, depending on the request, the channel that was created
+// or deleted.
+type Response struct {
+	// Propagated from the Request.
+	RequestID Identifier
+	Channel   Identifier
+}
+```
+
+Another detail worth pointing out concerns semantics. If there are multiple
+selectors selecting exactly the same object, the object will be sent
+multiple times (once for every channel interested in that object).
+If we decide to also have an http-POST-based watch, such an object would be
+sent only once there, since there is basically a single channel (as in watch
+in the regular API). This semantic difference needs to be explicitly
+described.
+
+*TODO: Since the channels for individual watches are kind of independent,
+we need to decide whether they are supposed to be synchronized with each
+other. In particular, if we have two channels watching objects of the same
+type, are events guaranteed to be sent in increasing order of resource
+versions? I think this isn't necessary, but it needs to be made explicit.*
+
+Yet another thing to consider is what happens if the server wants to close
+the watch even though the user didn't request it (potential reasons may be a
+failed periodic ACL check or some kind of timeout).
+We will solve this by saying that in such a situation we send an error via
+the corresponding channel, and every error automatically closes the channel.
+It is the responsibility of the user to re-subscribe if that's possible.
+
+We will also reuse this mechanism for partial rejections: we will be able to
+reject (or close, after a failed periodic ACL check) any given channel
+separately from all other existing channels.
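+
+ For illustration, here is a sketch of the client-side bookkeeping
+mentioned above, using the `Request`/`Response`/`Identifier` types defined
+in this section; the `sendRequest` helper is assumed to write a `Request` to
+the websocket, and all names are placeholders:
+
+```go
+package example
+
+import "sync"
+
+// pendingResponses matches each Response coming back on the control
+// channel to the Request that caused it.
+type pendingResponses struct {
+	mu      sync.Mutex
+	nextID  Identifier
+	waiters map[Identifier]chan Response
+}
+
+// send assigns a fresh request identifier, registers a waiter and returns
+// a channel on which the matching Response will be delivered.
+func (p *pendingResponses) send(req Request, sendRequest func(Request) error) (<-chan Response, error) {
+	p.mu.Lock()
+	p.nextID++
+	req.ID = p.nextID
+	ch := make(chan Response, 1)
+	if p.waiters == nil {
+		p.waiters = map[Identifier]chan Response{}
+	}
+	p.waiters[req.ID] = ch
+	p.mu.Unlock()
+	if err := sendRequest(req); err != nil {
+		p.mu.Lock()
+		delete(p.waiters, req.ID)
+		p.mu.Unlock()
+		return nil, err
+	}
+	return ch, nil
+}
+
+// dispatch is called for every Response read from the control channel.
+func (p *pendingResponses) dispatch(resp Response) {
+	p.mu.Lock()
+	ch := p.waiters[resp.RequestID]
+	delete(p.waiters, resp.RequestID)
+	p.mu.Unlock()
+	if ch != nil {
+		ch <- resp
+	}
+}
+```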
+
+
+### Watch semantics
+
+ There are a lot of places in the code (including all our list/watch-related
+frameworks like reflector) that rely on two crucial watch invariants:
+1. watch events are delivered in increasing order of resource version
+1. there is at most one watch event delivered for any resource version
+
+ However, we have no guarantee that the resource version series is shared
+between different resource types (in fact, in the default GCE setup, events
+do not share the same series as they are stored in a separate etcd
+instance). That said, to avoid introducing too many assumptions (that
+already aren't really met), we can't guarantee exactly the same invariants.
+
+ With the above description of "dynamic watch", within a single channel you
+are only allowed to watch objects of a single type. So it is enough to
+introduce only the following assumption:
+1. within a single resource type, all objects share the same resource
+version series.
+
+ This means we can still shard etcd by resource type, but we can't really
+shard e.g. by namespace. Note that this doesn't introduce significant
+limitations compared to what we already have, because even now you can watch
+all objects of a single type and there is no mechanism to inject multiple
+resource versions into it. So this assumption is not making things worse.
+
+ To support multi-resource-type watches, we can build another framework on
+top of the frameworks we already have, as follows:
+- we will have a single reflector/informer/... per resource type
+- we will create an aggregator/dispatcher in front of them that will be
+responsible for aggregating requests from the underlying frameworks into a
+single one and then dispatching incoming watch events to the correct
+reflector/informer.
+This will obviously require changes to existing frameworks, but those should
+be local changes.
+
+ One more thing to mention is detecting the resource type of an object sent
+via watch. With the "dynamic watch" proposal, we already know it based on
+the channel from which the object came (only objects of a single type can be
+sent over a single channel).
+
+ Note that this wouldn't be true if we decided on a regular http-based
+watch, and as a result we would have to introduce a dedicated type for bulk
+watch events containing the object type. This is yet another argument for
+not implementing http-based bulk watch at all.
+
+
+### Implementation details
+
+ As already mentioned above, we need to support API aggregation. The API
+(endpoint) has to be implemented in kube-aggregator. For the implementation
+we have two alternatives:
+1. the aggregator forwards the request to all apiservers and aggregates the
+results
+2. based on discovery information, the aggregator knows which type is served
+by which apiserver, so it forwards to each apiserver a request with just the
+appropriate resource types and then aggregates the results.
+
+Neither of those is difficult, so we should proceed with the second
+(sketched below), which has an advantage for watch, because for a given
+subrequest only a single apiserver will be returning events. However, no
+matter which one we choose, the client will not see any difference between
+contacting the apiserver or the aggregator, which is a crucial requirement
+here.
+
+NOTE: For watch requests, as an initial step we can consider implementing
+this API only in the aggregator and simply starting an individual watch for
+every subrequest. With http2 we shouldn't run out of descriptors, and it can
+be enough as a proof of concept. However, with such an approach there would
+be a difference between sending a given request to the aggregator and to the
+apiserver, so we need to implement it properly in the apiserver before
+entering alpha anyway. This would just give us early results faster.
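+
+ A minimal sketch of the discovery-based routing (the second alternative
+above); the `resolve` function is a hypothetical helper built from the
+aggregator's discovery information:
+
+```go
+package example
+
+import "k8s.io/apimachinery/pkg/runtime/schema"
+
+// splitByServer groups the sub-operations of one bulk request by the
+// apiserver serving each resource type. Each group is then forwarded to
+// its apiserver and the partial responses are merged into a single
+// response for the client. GetOperation is the type defined earlier.
+func splitByServer(
+	resolve func(gvr schema.GroupVersionResource) (serverURL string, err error),
+	ops []GetOperation,
+) (map[string][]GetOperation, error) {
+	byServer := map[string][]GetOperation{}
+	for _, op := range ops {
+		url, err := resolve(op.Resource)
+		if err != nil {
+			return nil, err
+		}
+		byServer[url] = append(byServer[url], op)
+	}
+	return byServer, nil
+}
+```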
+
+ The implementation of bulk get and bulk watch in a single apiserver will
+also work as a kind of aggregator. Whenever a request comes in, it will:
+- check which resource type(s) are requested in this request
+- for every resource type, combine only the parts of the filter that are
+about this particular resource type and send the request down the stack
+- gather the responses for all those resource types and combine them into a
+single response to the user.
+
+ The only non-trivial operation above is sending the request for a single
+resource type down the stack. In order to implement it, we will need to
+slightly modify the interface of "Registry" in the apiserver. The
+modification will have to allow passing both what we are passing now and
+BulkListOptions (in some format) (this may e.g. mean changing the signature
+to accept BulkListOptions and translating ListOptions to BulkListOptions in
+the current code).
+
+ With those changes in place, there is a question of how to call this
+code. There are two main options:
+1. Make each registered resource type also register in BulkAggregator
+(or whatever we call it) and call those methods directly
+1. Also expose a per-resource bulk operation in the main API, e.g.:
+```
+POST /api/v1/namespaces/default/pods/_bulkwatch
+```
+and use the official apiserver API for delegating requests. However, this
+may collide with a resource named `_bulkwatch`, and detecting whether a
+given call is a bulk operation or a regular API operation doesn't seem
+worth pursuing.
+
+As a result, we will proceed with option 1 (a sketch follows below).
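+
+ A rough sketch of option 1; every name here, including `BulkListOptions`,
+`BulkLister` and `BulkAggregator`, is a placeholder introduced only for
+this illustration:
+
+```go
+package example
+
+import (
+	"sync"
+
+	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
+	"k8s.io/apimachinery/pkg/runtime"
+	"k8s.io/apimachinery/pkg/runtime/schema"
+)
+
+// BulkListOptions stands in for the extended options passed down the
+// stack; a regular ListOptions would translate into a one-element slice.
+type BulkListOptions struct {
+	Options []metav1.ListOptions
+}
+
+// BulkLister stands in for the slightly modified Registry interface.
+type BulkLister interface {
+	BulkList(opts BulkListOptions) (runtime.Object, error)
+}
+
+// BulkAggregator keeps the storages registered per resource type at
+// installation time, so a bulk request can be fanned out to them directly.
+type BulkAggregator struct {
+	mu       sync.RWMutex
+	storages map[schema.GroupVersionResource]BulkLister
+}
+
+// Register is called once per resource type when its API is installed.
+func (a *BulkAggregator) Register(gvr schema.GroupVersionResource, s BulkLister) {
+	a.mu.Lock()
+	defer a.mu.Unlock()
+	if a.storages == nil {
+		a.storages = map[schema.GroupVersionResource]BulkLister{}
+	}
+	a.storages[gvr] = s
+}
+```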
+
+
+## Considered alternatives
+
+ We considered introducing a different way of filtering that would
+basically be "filter only objects that the node is allowed to see". However,
+to make this kind of watch work correctly with our list/watch-related
+frameworks, it would need to preserve the crucial invariants of watch. This
+in particular means:
+
+1. There is at most one watch event for a given resource version.
+2. Watch events are delivered in increasing order of resource versions.
+
+ Ideally, whenever a new pod referencing object X is bound to a node, we
+would send an "add" event for object X to the watcher. However, that would
+break the above assumptions because:
+
+1. There can be multiple objects referenced by a given pod (so we can't
+send all of them with the rv corresponding to that pod add/update/delete).
+2. If we decided to send those events with their original resource
+versions, then we could potentially go back in time.
+
+ As a result, we considered the following alternatives to solve these
+problems:
+
+1. Don't send the event that is the result of pod creation/update/deletion.
+It would be the responsibility of the user to grab the current version of
+all objects that are referenced by this new pod. Only from that point on
+would events for all referenced objects be delivered to the watcher, for as
+long as the pod exists on the node.
+
+This approach in a way leaks the watch logic to the watcher, which needs to
+duplicate the logic of tracking which objects are referenced.
+
+2. Whenever a new pod is bound to a node (or an existing one is deleted),
+we send all add/delete events for the attached objects to the watcher, with
+the resource version being the one of the modified pod.
+
+In this approach, we violate the first assumption (this is a problem in the
+case of a broken watch, as we don't really know where to resume), and we
+also send events with fake resource versions, which might be misleading to
+watchers.
+
+3. Another potential option is to change the watch API so that, instead of
+sending a single object in a watch event, we would send a list of objects
+as part of a single watch event.
+
+This would solve all the problems of the previous two solutions, but it is
+a change in the API (we would need to introduce a new API for it) and would
+also require changes in our list/watch-related frameworks.
+
+ In all the above proposals, the tricky part of determining whether an
+object X is referenced by any pod bound to a given node is avoiding race
+conditions and doing it in a deterministic way. The crucial requirements
+are:
+
+1. Whenever a "list" request returns a list of objects and a resource
+version "rv", starting a watch from the returned "rv" will never drop any
+events.
+2. For a given watch request (with resource version "rv"), the returned
+stream of events is always the same (e.g. a very slow, lagging watch may
+not cause dropped events).
+
+We can't really satisfy these conditions using the existing machinery. To
+solve this problem reliably we need to be able to serialize events between
+different object types in a deterministic way.
+This could be done via resource version, but it would require the
+assumption that:
+
+1. all object types necessary to determine the in-memory mapping share the
+same resource version series.
+
+With that assumption, we can have a "multi-object-type" watch that will
+serialize events for different object types for us. Having exactly one
+watch responsible for delivering all objects (pods, secrets, ...) will
+guarantee that if we are currently at resource version "rv", we have
+processed objects of all types up to rv and nothing with a resource version
+greater than rv - which is exactly what we need.
+
+
+## Other Notes
+
+ We seriously considered implementing the http-POST-based approach as an
+additional, simpler-to-implement watch (one not supporting altering of
+subscriptions). In this approach, whenever a user wants to start watching
+another object (or set of objects) or drop one (or a set), they need to
+break the watch and initiate a new one with a different filter. This
+approach isn't perfect, but it can solve the different use case of naive
+clients and is much simpler to implement.
+
+ However, it has multiple drawbacks, including:
+- it is a second mechanism for doing the same thing (we need "dynamic
+watch" no matter whether we implement it or not)
+- there would be some semantic differences between the two approaches (e.g.
+if there are a few selectors selecting the same object, in the dynamic
+approach it will be sent multiple times, once over each channel, while here
+it would be sent once)
+- we would have to introduce a dedicated "BulkWatchEvent" type to
+incorporate the resource type. This would make the two incompatible even at
+the output format level.
+
+ With all of this in mind, even though the implementation would be much
+simpler (and could potentially be a first step that would probably solve
+the original "kubelet watching secrets" problem well enough), we decided
+not to proceed with it at all.