KEP-3157: allow informers for getting a stream of data instead of chunking #3142
Conversation
/assign @wojtek-t
I did a first pass on this.
Today informers are the primary source of LIST requests.
The LIST is used to get a consistent snapshot of data currently in etcd to build up a client-side in-memory cache.
The primary issue with LIST requests is unpredictable memory consumption.
It's worth mentioning that LIST can be served from two places:
- from etcd (default)
- from kube-apiserver cache (watchcache) - if explicitly requested by setting ResourceVersion param of the list (e.g. ResourceVersion="0").
FWIW, the second is what informers actually do by default now (they can fall back to the former, but it's actually a fallback).
I think we should make this more explicit, as it's actually an important context for this KEP too.
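For readers following along, a minimal sketch (assuming a client-go clientset `cs`) of the second path described above; setting `ResourceVersion: "0"` lets the kube-apiserver answer from its watch cache instead of doing a quorum read from etcd, which is what informers do by default today:

```go
import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// listFromWatchCache illustrates the watch-cache-backed LIST: "0" means
// "any version at least as fresh as RV 0", so the apiserver may serve the
// response from its in-memory cache. Leaving ResourceVersion unset forces
// a quorum read from etcd instead.
func listFromWatchCache(ctx context.Context, cs kubernetes.Interface) error {
	_, err := cs.CoreV1().Pods("").List(ctx, metav1.ListOptions{ResourceVersion: "0"})
	return err
}
```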
updated the Appendix section and added a reference to it. PTAL
This might lead to thrashing, starving, and finally losing other processes running on the same node, including the kubelet.
Stopping the kubelet has serious issues as it leads to workload disruption and a much bigger blast radius.
Note that in that scenario even clusters in an HA setup are affected.
Recovery of large clusters, which have many kubelets and hence many informers for pods, secrets, and configmaps, can lead to this storm of LISTs.
Those aren't really a problem if informers are not falling back to listing from etcd. If they are listing from the watchcache, then we actually have proper indices in the watchcache, and hence it all amortizes well.
The problem is when they fall back to listing from etcd: e.g. to list the pods of a given node from etcd, we actually need to list all pods, unmarshal them, and only then filter out the unneeded ones. And we do fall back, e.g., when the watchcache is not yet initialized, which happens often after a kube-apiserver restart.
It's another piece of context that can be useful to add.
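To illustrate the fallback cost described above, a hedged sketch of the kubelet-style query (function and parameter names are illustrative): the field selector is applied server-side, but when the request is served from etcd the apiserver still has to read and decode every pod before filtering.

```go
import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/fields"
	"k8s.io/client-go/kubernetes"
)

// listPodsOnNode lists only the pods bound to one node. Served from the watch
// cache this can use an index; served from etcd it degenerates into "list all
// pods, decode, then filter", which is the expensive fallback discussed above.
func listPodsOnNode(ctx context.Context, cs kubernetes.Interface, nodeName string) error {
	_, err := cs.CoreV1().Pods("").List(ctx, metav1.ListOptions{
		FieldSelector:   fields.OneTermEqualSelector("spec.nodeName", nodeName).String(),
		ResourceVersion: "0", // prefer the watch cache when it is initialized
	})
	return err
}
```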
good point, updated this bit and the Appendix section PTAL
Initially, the proposed changes will be applied to informers as they are usually the heaviest users of LIST requests (see [Appendix](#appendix) section for more details on how informers operate today).
The primary idea is to use standard WATCH request mechanics for getting a stream of individual objects, but to use it for LISTs.
This would allow us to keep memory allocations constant.
The server is bounded by the maximum allowed size of an object of 2 MB in etcd plus a few additional allocations that will be explained later in this document.
IIRC, the etcd (default) limit is 1.5MB.
That said, this is the serialized object - by default we store protobufs in etcd, so the object in memory can be much bigger (potentially even an order of magnitude).
clarified
The rough idea/plan is as follows:

- step 1: change the informers to establish a WATCH request with a new query parameter instead of a LIST request.
- step 2: upon receiving the request from an informer, contact etcd to get the latest RV. It will be used to make sure the watch cache has seen objects up to the received RV. This step is necessary and ensures we will serve consistent data, even from the cache.
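As a rough illustration of step 1, a hedged sketch of what the informer-issued request could look like; the `MostRecent` value is the `ResourceVersionMatch` constant proposed later in this review, and the final parameter shape may still change.

```go
import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/watch"
	"k8s.io/client-go/kubernetes"
)

// streamList opens a single WATCH that is expected to first stream the current
// state (synthetic ADDED events) and then continue with regular watch events,
// replacing today's LIST + WATCH pair.
func streamList(ctx context.Context, cs kubernetes.Interface) (watch.Interface, error) {
	return cs.CoreV1().Pods("").Watch(ctx, metav1.ListOptions{
		ResourceVersionMatch: metav1.ResourceVersionMatch("MostRecent"), // proposed value
	})
}
```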
Commenting here to prevent me from forgetting about it (it's only partially related to this point).
We should ensure that in case watchcache is not yet initialized, we won't be falling back to etcd:
https://github.com/kubernetes/kubernetes/blob/master/staging/src/k8s.io/apiserver/pkg/storage/cacher/cacher.go#L653-L657
but rather something like "wait with timeout and in case of timeout return an error of not-yet-initialized" or sth like that.
If we don't do that, we risk that in case of a storm of requests against a not-yet-initialized kube-apiserver, we will face the same failure mode we experience now.
it makes sense, but we should ensure it won't deadlock - in case of the timeout the informers should be able to recover, i.e. fall back to LIST/WATCH semantics.
- step 1: change the informers to establish a WATCH request with a new query parameter instead of a LIST request.
- step 2: upon receiving the request from an informer, contact etcd to get the latest RV. It will be used to make sure the watch cache has seen objects up to the received RV. This step is necessary and ensures we will serve consistent data, even from the cache.
- step 2a: send all objects currently stored in memory for the given resource.
- step 2b: propagate any updates that might have happened meanwhile until the watch cache catches up to the latest RV received in step 2.
It's worth mentioning that RV in etcd is global (i.e. across all types).
Whereas watchcache is watching only for changes for objects of a given type.
So if you have a case where the watchcache of pods is synchronized to RV=100,
but the next operation was a secret creation with RV=101, the pods watchcache RV will not get updated.
This was to a significant extent addressed by https://github.com/kubernetes/enhancements/tree/master/keps/sig-api-machinery/1904-efficient-watch-resumption and reusing the progress-notify feature from etcd.
That said, the frequency of progress-notify events we're using now is 5s, which means that your LIST is expected to take on average at least 2.5s before getting the bookmark mentioned below.
This sounds fine to me, but I think we should be very explicit about this.
clarified.
The cacheWatcher.incoming is a buffered channel and has a different size for different Resources (10 or 1000).
Since the cacheWatcher starts processing the cacheWatcher.incoming channel only after sending all initial events, it might block once its buffered channel fills up.
In that case, it will be added to the list of blockedWatchers and will be given another chance to deliver an event after all nonblocking watchers have sent the event.
All watchers that have failed to deliver the event will be closed.
Where we will effectively land is that if we aren't able to send the "initial state" before more than buffer-size+1 events are delivered to the watcher in the meantime, the watcher will get closed.
We may try to play a bit with the buffer size, to mitigate the impact of that.
I think we should be fine. The size of the buffer is different for different resources but in the range of 10 or 1000. In the worst-case scenario, we restart the WATCH request.
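For readers unfamiliar with the dispatch logic being discussed, a small self-contained model (all names are local to this sketch, not the actual k8s.io/apiserver code) of the non-blocking delivery and the blockedWatchers second chance:

```go
package main

import "fmt"

// watcher models a cacheWatcher with a buffered incoming channel.
type watcher struct {
	incoming chan string // 10 or 1000 in the apiserver; tiny here for the demo
}

// nonblockingAdd succeeds only if the buffer still has room, mirroring the
// select-with-default pattern described above.
func (w *watcher) nonblockingAdd(event string) bool {
	select {
	case w.incoming <- event:
		return true
	default:
		return false
	}
}

func main() {
	watchers := []*watcher{
		{incoming: make(chan string, 2)},
		{incoming: make(chan string, 2)},
	}
	var blocked []*watcher
	for _, w := range watchers {
		if !w.nonblockingAdd("event-1") {
			blocked = append(blocked, w) // gets one more chance to deliver
		}
	}
	// Watchers still unable to accept the event after the second chance are
	// closed, which is what forces their clients to retry the whole stream.
	fmt.Println("blocked watchers:", len(blocked))
}
```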
First of all, the primary purpose of bookmark events is to deliver the current resourceVersion to watchers, continuously even without regular events happening.
There are two sources of resourceVersions.
The first one is regular events that contain RVs besides objects.
The second one is a special type of etcd event called progressNotification delivering the most up-to-date revision with the given interval.
those aren't delivered to the end client at all - they are only consumed by kube-apiserver and not forwarded further
clarified
At regular intervals, the cacher checks expired watchers and tries to deliver a bookmark event.
As of today, the interval is set to 1 second.
The bookmark event contains an empty object and the current resourceVersion.
By default, a watchCache expires roughly every 1 minute.
watcher?
updated
The bookmark event contains an empty object and the current resourceVersion.
By default, a watchCache expires roughly every 1 minute.

The expiry interval initially will be decreased to 1 second in this feature's code-path.
This won't scale. Delivering a bookmark every 1s to every watcher will blow us up (unless I misunderstood what you want to do here).
I think that the default delivery (expiry) time for bookmark events is 1 minute. That means an initial request (RV=0 and the resourceVersionMatch set) would have to wait up to 60s. I propose to decrease that time only for watchers in that special state.
That means initial data is not wrapped into cachingObject and hence not subject to this existing optimization.

Before sending objects any further the cacheWatcher does a DeepCopy of every object that has not been wrapped into the cachingObject.
Making a copy of every object is both CPU and memory intensive. It is a serious issue that needs to be addressed.

With RemoveSelfLink [graduating](https://github.com/kubernetes/kubernetes/blob/956dfb5196c2884b61804e94810e29ca077301ee/staging/src/k8s.io/apiserver/pkg/features/kube_features.go#L138) to GA (and already disabled by default) we are able to safely avoid this copying.
This seems unrelated to this KEP (I mean, we should do that, but we should drive that from RemoveSelfLink) - do we have to mention it here?
removed
2. Reduce the number of allocations in the WatchServer

The WatchServer is largely responsible for streaming data received from the storage layer (in our case from the cacher) back to clients.
It turns out that sending a single event per consumer requires 4 memory allocations, visualized in the following image.
Two of these, allocations 1 and 3, deserve special attention because they won't reuse memory and rely on the GC for cleanup.
Did you try running your experiment with JSON as an output format (as opposed to protobuf)?
JSON is obviously more verbose, but I'm wondering if it is doing that many allocations too (from a quick glance at the code it seems to behave better).
@wojtek-t many thanks for taking the time to read and review the KEP. PTAL
Sorry - just a quick next pass, I will look at it in more detail early next week.
Today informers are the primary source of LIST requests.
The LIST is used to get a consistent snapshot of data currently in etcd to build up a client-side in-memory cache.
nit: "currently in etcd" is not necessarily true for lists served from the watchcache.
I would just remove this part of the sentence.
- reduce etcd load by serving from watch cache
- get a replacement for paginated lists from watch-cache, which is not feasible without major investment
- enforce consistency in the sense of freshness of the returned list
- be backward compatible with new client -> old server
I would like to discuss one more thing, i.e. overhead on the client side.
There is a significant overhead from relisting on the client-side too.
In the ideal world, I would like to have this addressed too, but in either case I would like to see it added explicitly to either goals or non-goals.
Clarifying this one - sorry for mental shortcut.
The problem I have in mind is that in the happy case, when the component initializes its cache once (LIST at the beginning) and then watches, we can predict its resource usage quite well. This is how people often set requests/limits for components.
However, when your component (using an informer/reflector/...) needs to relist, what happens is that it effectively LISTs stuff from kube-apiserver and for a moment has both the old copy of the state and the new one.
Which means that in many cases we're effectively doubling its footprint.
For the largest (and thus most problematic) clusters, generally the diff between the old and new isn't that huge, but we still have two copies of objects even for those that didn't change (one in the cache and the second that we got from the LIST).
However, if we're going to switch to watching, at least in theory we wouldn't have to keep the whole second copy of the state. What we could do is:
- for objects that didn't change, just mark them as fresh in the cache
- for objects that changed, store their new version in a buffer
- once we have the whole state (via this watch stream), then atomically:
(a) sweep over the cache and remove objects not marked as fresh
(b) put the objects from that buffer into the cache
This would help a lot with the client-side memory spikes.
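A hedged sketch of that idea on the client side (`store` stands in for the informer's indexer; all names are illustrative, and the end-of-initial-list bookmark is the one proposed in this KEP):

```go
import (
	v1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/watch"
)

// applyStreamedList reuses cached objects that did not change, buffers only the
// ones that did, and swaps atomically once the bookmark ends the initial list.
func applyStreamedList(store map[string]*v1.Pod, w watch.Interface) {
	fresh := map[string]struct{}{}
	changed := map[string]*v1.Pod{}

	for event := range w.ResultChan() {
		if event.Type == watch.Bookmark {
			break // initial state complete (per this proposal)
		}
		pod, ok := event.Object.(*v1.Pod)
		if !ok {
			continue
		}
		key := pod.Namespace + "/" + pod.Name
		if old, exists := store[key]; exists && old.ResourceVersion == pod.ResourceVersion {
			fresh[key] = struct{}{} // unchanged: keep the object already in memory
		} else {
			changed[key] = pod // new or updated: buffer only the new version
		}
	}

	// Atomic switch-over: drop entries that are neither fresh nor re-sent,
	// then apply the buffered updates.
	for key := range store {
		if _, isFresh := fresh[key]; !isFresh {
			if _, isChanged := changed[key]; !isChanged {
				delete(store, key)
			}
		}
	}
	for key, pod := range changed {
		store[key] = pod
	}
}
```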
So if I understand you correctly, you would – without me knowing exactly the factoring of the informers/reflectors right now – basically make use of the individual incoming events and compare those with the existing store of the informer. If they differ, you keep them for the later, atomic switch-over. If they don't differ, you throw away the events, but just take a reference on the existing object for your new snapshot.
In other words, streaming opens a door for optimizing away this double memory consumption of a giant unpaged watch-cache backed LIST request.
I.e. this is some additional value proposition of the enhancement because we can do these kinds of optimizations later on.
Yes - this is what I meant.
And yes - it allows us to also optimize client side - I was asking if this is in-scope of the enhancement, or we should be treating it as "future work".
[I would really like this to happen too, but for the sake of making progress, I can live with "future work" too :) ]
In order to lower memory consumption while getting a list of data and make it more predictable, we propose to use consistent streaming from the watch-cache instead of paging from etcd.
What does "consistent streaming" really mean?
it means that the returned data is consistent - served from etcd via a quorum read.
In order to lower memory consumption while getting a list of data and make it more predictable, we propose to use consistent streaming from the watch-cache instead of paging from etcd.
Initially, the proposed changes will be applied to informers as they are usually the heaviest users of LIST requests (see [Appendix](#appendix) section for more details on how informers operate today).
The primary idea is to use standard WATCH request mechanics for getting a stream of individual objects, but to use it for LISTs.
It's probably worth mentioning that watching with RV=0 is already relatively close to what we need, and describing its current semantics.
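For reference, a hedged sketch of those current RV="0" watch semantics (assuming the `cs`, `ctx`, and imports from the earlier sketches): the server starts with synthetic ADDED events for whatever its cache currently holds, but gives no freshness guarantee and no marker for where that initial state ends.

```go
// Legacy behavior, supported by every kube-apiserver: "0" means "any state at
// least as fresh as RV 0", i.e. whatever the watch cache currently holds.
func legacyCacheWatch(ctx context.Context, cs kubernetes.Interface) (watch.Interface, error) {
	return cs.CoreV1().Pods("").Watch(ctx, metav1.ListOptions{ResourceVersion: "0"})
}
```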
clarified in the Note section (a few lines down)
Initially, the proposed changes will be applied to informers as they are usually the heaviest users of LIST requests (see [Appendix](#appendix) section for more details on how informers operate today).
The primary idea is to use standard WATCH request mechanics for getting a stream of individual objects, but to use it for LISTs.
This would allow us to keep memory allocations constant.
The server is bounded by the maximum allowed size of an object of 1.5 MB in etcd (note that the same object in memory can be much bigger, even by an order of magnitude)
I think I don't understand the question - we don't have a strict limit for number of objects, so technically we're not bounded, right?
the memory usage is bounded by the max size of an object plus all the allocations that are needed to handle the requests.
This isn't strictly true.
Basically, to serve the current state (even via watch), you still need to copy (sure - not whole objects, but the pointers to them) [well - technically, we could be doing a copy-on-write tree to store the current state, but that's not what we're doing]. So we're actually allocating a pointer per object to store that:
https://github.com/kubernetes/kubernetes/blob/master/staging/src/k8s.io/apiserver/pkg/storage/cacher/watch_cache.go#L598
Also, it's not like for watch we wait with processing the next event until all data from the previous one is sent and the memory buffer is released. So I think this sentence is a bit misleading.
The rough idea/plan is as follows:

- step 1: change the informers to establish a WATCH request with a new query parameter instead of a LIST request.
- step 2: upon receiving the request from an informer, contact etcd to get the latest RV. It will be used to make sure the watch cache has seen objects up to the received RV. This step is necessary and ensures we will serve consistent data, even from the cache.
It seems my comment got lost, so copying here:
I think we should try to make the new semantic as similar to the current one as possible.
What I have in mind in particular is that currently, when the reflector has to re-list, it first sets the RV to the version of the last observed change.
This is extremely important to avoid going back in time.
I think we should maintain it in the new API:
- if you don't specify the RV, then indeed you contact etcd to get the current one
- but if you specify one, you don't have to fall back to etcd; just use the provided one
I captured it in the Design Details section, see if you like it.
It is used to make sure the cacher is up to date (has seen data stored in etcd) and to let the reflector know it has seen all initial data.
There are ways to do that cheaply, e.g. we could issue a count request against the datastore.
Next, the cacher creates a new cacheWatcher (implements watch.Interface) passing the given bookmarkAfterResourceVersion, and gets initial data from the watchCache.
After sending initial data the cacheWatcher starts listening on an incoming channel for new events, including a bookmark event.
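To make the "cheap count request" concrete, a hedged sketch against raw etcd (the apiserver would go through its storage interface instead; names here are illustrative):

```go
import (
	"context"

	clientv3 "go.etcd.io/etcd/client/v3"
)

// latestRevision performs a COUNT-only range read: no keys or values are
// returned, but the response header carries the current etcd revision, which
// could serve as the bookmarkAfterResourceVersion.
func latestRevision(ctx context.Context, c *clientv3.Client, prefix string) (int64, error) {
	resp, err := c.Get(ctx, prefix, clientv3.WithPrefix(), clientv3.WithCountOnly())
	if err != nil {
		return 0, err
	}
	return resp.Header.Revision, nil
}
```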
nit: technically it starts watching on the incoming channel immediately; it just buffers those events until the initial state is sent out
The cacheWatcher.incoming is a buffered channel and has a different size for different Resources (10 or 1000).
Since the cacheWatcher starts processing the cacheWatcher.incoming channel only after sending all initial events, it might block once its buffered channel fills up.
In that case, it will be added to the list of blockedWatchers and will be given another chance to deliver an event after all nonblocking watchers have sent the event.
All watchers that have failed to deliver the event will be closed.
This effectively opens us for a new failure mode.
The scenario that I have in mind is a resource type where we observe high churn (let's say 100 object changes per second) with a huge number of objects.
Currently, if we need to relist, we just:
- LIST (let's assume it can finish successfully within its 1m timeout)
- starts watching from that
Given that watchcache keeps at least 75s of history, watching should just work, and worst case, in case of a bug where it lost its history, we at least made progress (by updating the cache in the component) via the LIST request.
In the new proposed design:
- We open a watch, and start streaming the whole state
- At the same time we accumulate in the incoming buffer the watch events happening in the meantime
- Soon after, the buffer fills up (before we even managed to stream the initial state)
- And we close the watch
On the client side we need to retry, but we didn't make any progress (because we didn't even get the whole initial state, so we couldn't update the cache).
I think it's solvable (we just don't close the watch immediately, but wait until it has streamed the data and close it around then), but it requires some changes, which we should definitely do as part of these changes.
This is a good point. I think that simply delaying closure might break the consistency guarantees we are trying to provide here. I think that delaying closure only if the buffer contains data up to the desired RV (bookmarkAfterResourceVersion) would preserve consistency, assuming we would send that data right after the initial events. Thoughts?
Note: kube-apiserver already follows the proposed watch-list semantics (without the bookmark event and without the consistency guarantee) for RV="0" watches.
The mode is not used in informers today but is supported by every kube-apiserver for legacy, compatibility reasons.

Note 2: informers need consistent lists to avoid time-travel when switching to another HA instance of kube-apiserver with an outdated/lagging watch cache.
Copying:
I had it on my plate to reply there - this is addressed within a single incarnation (i.e. if a binary doesn't restart, the reflector will not time-travel).
This wasn't addressed only across component restarts (so basically only for the initial list - relists work fine, worst case they just fallback to etcd).
@p0lyn0mial:
should we put it as a goal of this KEP? (Ensure the watch cache is fresh when RV=0).
I think it should be easy once we have this KEP implemented.
Partially. I think that we shouldn't change the default behavior of RV=0 (it by definition means at least from RV=0, which is trivially always satisfied). And that's fine.
I think that we should address that by defining some new "ResourceVersionMatch" value that will be able to ensure that (or sth like that).
I see, we could use the same value used for the "new" WATCH request. I proposed MostRecent.
## Design Details
What I would like to see in this section is the exact changes in the API that you're also proposing (the new ResourceVersionMatch option that you're proposing, the exact semantics, etc.).
updated, see if it makes more sense now.
I would like to see it being more explicit - see https://github.com/kubernetes/enhancements/tree/master/keps/sig-api-machinery/2523-consistent-resource-versions-semantics#add-a-resourceversionmatch-query-parameter as an example for somewhat related changes.
done.
- protect kube-apiserver and its node against list-based OOM attacks
it doesn't sound like you can protect against attacks (someone intentionally listing). Your proposal appears to be about allowing well-behaved clients (informers) to change their client code to issue a watch instead.
To a degree also against attacks. If you know your system does not need many lists, you can heavily throttle (p&f). Today we cannot because everybody has to list.
In order to lower memory consumption while getting a list of data and make it more predictable, we propose to use consistent streaming from the watch-cache instead of paging from etcd.
Initially, the proposed changes will be applied to informers as they are usually the heaviest users of LIST requests (see [Appendix](#appendix) section for more details on how informers operate today).
The primary idea is to use standard WATCH request mechanics for getting a stream of individual objects, but to use it for LISTs.
Are you actually changing LIST or are you changing informers (client) to use a watch and then tweaking watches?
we are going to change informers to use a single WATCH request instead of a LIST followed by a WATCH request.
If the primary problem is that apiserver can run out of memory, fixing clients is not going to solve that problem. (I haven't read this yet)
The rough idea/plan is as follows:

- step 1: change the informers to establish a WATCH request with a new query parameter instead of a LIST request.
- step 2: upon receiving the request from an informer, contact etcd to get the latest RV. It will be used to make sure the watch cache has seen objects up to the received RV. This step is necessary and ensures we will serve consistent data, even from the cache.
does our client actually care about getting the latest RV, or does it care about knowing which RV it has gotten?
It does, that's the whole point here and a long-known problem we have today of time-travel under certain conditions. Behaviour then is undefined. See @smarterclayton's kubernetes/kubernetes#59848.
- step 2: upon receiving the request from an informer, contact etcd to get the latest RV. It will be used to make sure the watch cache has seen objects up to the received RV. This step is necessary and ensures we will serve consistent data, even from the cache.
- step 2a: send all objects currently stored in memory for the given resource.
- step 2b: propagate any updates that might have happened meanwhile until the watch cache catches up to the latest RV received in step 2.
- step 2c: send a bookmark event to the informer with the given RV.
Are you trying to use this to indicate that the initial list is finished?
yes, per @wojtek-t's proposal the first bookmark event marks the beginning of normal, non-initial events.
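A hedged sketch of a client-side loop that treats the first bookmark this way (the helper name is illustrative):

```go
import (
	"k8s.io/apimachinery/pkg/api/meta"
	"k8s.io/apimachinery/pkg/watch"
)

// syncUntilBookmark consumes the synthetic initial events and returns the RV
// carried by the first bookmark, i.e. the version the cache is now consistent
// with; subsequent events on the same stream are regular watch updates.
func syncUntilBookmark(w watch.Interface, onInitial func(watch.Event)) string {
	for event := range w.ResultChan() {
		if event.Type != watch.Bookmark {
			onInitial(event) // initial state, delivered as ADDED events
			continue
		}
		if acc, err := meta.Accessor(event.Object); err == nil {
			return acc.GetResourceVersion()
		}
		return ""
	}
	return ""
}
```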
Moreover, around 16:40 we lost the server after running 16 informers. During an investigation, we realized that the server allocates a lot of memory for handling LIST requests.
In short, it needs to bring data from the database, unmarshal it, do some conversions and prepare the final response for the client.
The bottom line is around O(5*the_response_from_etcd) of temporary memory consumption.
Neither priority and fairness nor Golang garbage collection is able to protect the system from exhausting memory.
APF should be able to limit a collection to 1 concurrent LIST (i.e., serialize all list calls). You're saying there's a condition where apiserver cannot handle a single LIST call?
- get rid of list or list pagination
- rewrite the list storage stack to allow streaming, but rather use the existing streaming infrastructure (watches).

## Proposal
Here's an alternative:
a) Force everyone to use pagination (e.g. serialize non-paginated lists)
b) The server may decide to reduce the page size based on the size of the response from etcd.
c) The continue token is of course changed to point at the next unseen item.
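For comparison, a hedged sketch of the paginated LIST ("chunking") that this alternative builds on, using the existing Limit/Continue API:

```go
import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// pagedList fetches pods in chunks; the continue token returned with each page
// points at the next unseen item.
func pagedList(ctx context.Context, cs kubernetes.Interface) error {
	opts := metav1.ListOptions{Limit: 500}
	for {
		page, err := cs.CoreV1().Pods("").List(ctx, opts)
		if err != nil {
			return err
		}
		_ = page.Items // process this chunk before fetching the next one
		if page.Continue == "" {
			return nil // last page
		}
		opts.Continue = page.Continue
	}
}
```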
See also: kubernetes/kubernetes#98541
Past discussion: kubernetes/kubernetes#90179
- [ ] Metrics
  - Metric name:
Given the intent of the KEP, it's appropriate to have metrics for the alpha release.
This KEP intends to lower the memory consumption needed to handle "list" requests. The best metric for measuring memory consumption of an API server I have found so far is container_memory_working_set_bytes (already exposed). The sig-scalability team offered to help test memory consumption during some load tests (e.g. on 5K nodes). I think these tests, along with the metric, are a good starting point during the alpha phase.
For the issue with the full buffer we could have the following metrics:
- apiserver_cache_watcher_buffer_length (histogram: what was the buffer size)
- apiserver_watch_cache_lag (histogram: how far the cache is behind the expected RV)
- apiserver_terminated_watchers_total (counter, already defined; needs to be extended with an attribute so that we count watch requests closed due to an overfull buffer in the new mode)
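A hedged sketch of how those three metrics could be declared (plain Prometheus types rather than the apiserver's component-base wrappers; final names, labels, and buckets may differ, and the last metric already exists in kube-apiserver):

```go
import "github.com/prometheus/client_golang/prometheus"

var (
	cacheWatcherBufferLength = prometheus.NewHistogram(prometheus.HistogramOpts{
		Name:    "apiserver_cache_watcher_buffer_length",
		Help:    "Size of the cacheWatcher buffer observed when the initial list finished.",
		Buckets: prometheus.ExponentialBuckets(1, 2, 12),
	})
	watchCacheLag = prometheus.NewHistogram(prometheus.HistogramOpts{
		Name: "apiserver_watch_cache_lag",
		Help: "How far (in resource versions) the watch cache is behind the expected RV.",
	})
	terminatedWatchersTotal = prometheus.NewCounterVec(prometheus.CounterOpts{
		Name: "apiserver_terminated_watchers_total",
		Help: "Watch requests closed because of an overfull buffer, partitioned by mode.",
	}, []string{"mode"})
)
```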
done
My thoughts: I don't like that this solution forces clients to upgrade to fix the problem. I think API machinery has a bit of a justly-deserved reputation for doing this and this makes it worse. It will be especially hard for clients in other languages. I don't like that this solution leaves us with two ways of getting a list - and the easy and intuitive one will also break the server sometimes. I do think that streaming a list is a superior way of getting it to the client. Have we brainstormed ways to fix LIST? (Imagine changing the server such that an initial unpaginated list does something different under the hood: call watch, use a serializer that omits the watch event envelope, and stop after it is caught up.) Finally, since we're changing clients anyway, I don't like that the solution doesn't optimize things as much as possible. There's a class of design where the server only sends data the client doesn't have cached; one example is written up here: kubernetes/kubernetes#90339 Ultimately, although I'm conflicted, I don't really think these reasons are quite compelling enough to block this KEP, so I guess I won't.
This design provides the ability for well-behaved clients to reduce the load they place on the server and improve their own deserialization memory cost. Providing this capability and using it in reflectors may make it possible to improve the list-via-watch-cache-pagination problem afterwards by reducing use. @wojtek-t has implementation concerns that must be addressed before code in k/k is changed, but if he doesn't get back to reviewing this before KEP freeze, I think it is implementable. @p0lyn0mial please update with alternatives suggested and considered and add a means of measuring the usage of this feature.
@deads2k I have updated the KEP with a potential solution. Although I don't understand why it should block alpha. After all, in the worst-case scenario informers will simply retry requests asking for the entire set again. It is not efficient but a similar situation can happen today with regular LIST requests. Currently, there is a timeout for LIST requests of 60 seconds. That means a slow reflector might fail synchronization as well and would have to re-establish the connection again and again.
Closing the watchers would make the clients retry the requests and download the entire dataset again even though they might have received a complete list before.

For an alpha version, we might simply delay closing the watch request until all data is sent to the client.
We could try collecting some metrics for measuring how far the cache is behind the expected RV,
We will collect metrics
Since the cacheWatcher starts processing the cacheWatcher.incoming channel only after sending all initial events, it might block once its buffered channel fills up.
In that case, it will be added to the list of blockedWatchers and will be given another chance to deliver an event after all nonblocking watchers have sent the event.
All watchers that have failed to deliver the event will be closed.

Closing the watchers would make the clients retry the requests and download the entire dataset again even though they might have received a complete list before.

For an alpha version, we might simply delay closing the watch request until all data is sent to the client.
we will delay
We expect this to behave well even in heavily loaded clusters. To increase confidence in the approach, we will collect metrics ...
We could try collecting some metrics for measuring how far the cache is behind the expected RV,
what's the average buffer size, and a counter for closed watch requests due to an overfull buffer.

For a beta version, we could explore this area further. One viable option worth considering would be,
For a beta version, we have further options if they turn out to be necessary:
2. make the buffer dynamic - especially when the difference between RVs is > 1000
3. inject new events directly into the initial list, i.e. have the initial list loop consume the channel directly and avoid waiting for the whole initial list to be processed first.
4. cap the size of the buffer (cannot allocate more than X MB of memory)
5. maybe even apply some compression techniques to the buffer
like only storing a low-memory shallow reference and taking the actual objects for the event from the store.
…nking. The kube-apiserver is vulnerable to memory explosion. The issue is apparent in larger clusters, where only a few LIST requests might cause serious disruption. Uncontrolled and unbounded memory consumption of the servers does not only affect clusters that operate in an HA mode but also other programs that share the same machine. In this KEP we propose a potential solution to this issue.
ordinarily I'd wait for @wojtek-t, but that's not possible at this moment before freeze. He would have a chance to hold if he wished. If he chooses to exercise that option later, I think these circumstances are extenuating enough to accommodate that. I think this KEP is well described enough that I suspect he'd be biased toward merge, so I'm doing that. I also think this is a good approach, but @wojtek-t's knowledge of the watch cache mechanics exceeds mine. /approve PS, the PRR is fine too.
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: deads2k, p0lyn0mial.
Closing the watchers would make the clients retry the requests and download the entire dataset again even though they might have received a complete list before.

For an alpha version, we will delay closing the watch request until all data is sent to the client.
You mean "all initial data", right?
Closing the watchers would make the clients retry the requests and download the entire dataset again even though they might have received a complete list before.

For an alpha version, we will delay closing the watch request until all data is sent to the client.
We expect this to behave well even in heavily loaded clusters.
I don't have data to back it up, but I'm afraid it may not be enough. Let me think about it more.
But (let me read the text below) assuming we have an option to improve for Beta, I would be fine with this for Alpha.
For an alpha version, we will delay closing the watch request until all data is sent to the client.
We expect this to behave well even in heavily loaded clusters.
To increase confidence in the approach, we will collect metrics for measuring how far the cache is behind the expected RV,
what's the average buffer size, and a counter for closed watch requests due to an overfull buffer.
I think I'm not fully following what exactly we want to measure here, but we can probably clarify that during code review.
what's the average buffer size, and a counter for closed watch requests due to an overfull buffer.

For a beta version, we have further options if they turn out to be necessary:
1. comparing the bookmarkAfterResourceVersion (from Step 2) with the current RV the watchCache is on
This seem to be focusing on the case where watchCache is lagging compared to etcd.
This isn't really the case I'm afraid of.
If we're lagging significantly, I think it's perfectly ok to just wait (without doing anything) until we catch up (with timeout) and if we don't, just fail the request. Then it was super cheap and it's fine (and not really a regression).
The case I'm worried about is:
- say watchcache is perfectly keeping up with etcd
- suddenly we're opening a gazillions of watches to stream a huge amount of data (say e.g. all kube-proxies in 5k-node cluster were restarted at exactly the same time)
- we're starting streaming the data to all of them
- in the meantime there is high churn of other events
- all buffers are getting full and we're triggering closing all the watches without finishing streaming the initial state
If we really close this watch, then the watcher didn't make any progress. So we can go into this kind of crashloop.
But now let's say that, as suggested for Alpha, we will not close the watch immediately, but rather stream the initial state and close the watch only after that. This seems fine as long as it will not take ages.
But thinking about this more, the "watch initialization support in P&F" should be giving this to us. We should just double check if the support I added in kubernetes/kubernetes#103660 (+ the previous PR for it) is actually ok, or whether we should tweak it further.
So I think I'm fine with this.
What I think we should do though (maybe not for Alpha) is not to start with sending the "current watch state", but rather ensure that the watchcache is already up-to-date before we start sending anything.
I would like to avoid comparing things like "RV difference < 1000" - I think we should just do the pattern of "waitUntilFresh + start sending".
The problem with this approach is that RV returned from etcd is a global one, so in many cases we would be forced to wait for "progress notify" event from etcd. Which isn't perfect because (with 5s we're using by default), it would effectively be adding 2.5s latency on avg.
But maybe what we can do then is to effectively force "RequestProgress()" calls: https://pkg.go.dev/go.etcd.io/etcd/client/v3#section-readme
with some exponential backoff - we would just have to coordinate it across different resources somehow, but that sounds doable.
I guess the bottom line is that I believe we have a reasonable path forward here - it's more a matter of carefully doing the work and the amount of that work.
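A hedged sketch of that "force RequestProgress with backoff" idea against the raw etcd clientv3 API (the real implementation would sit behind the apiserver storage layer; `caughtUp` is an illustrative callback):

```go
import (
	"context"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
)

// waitUntilFresh repeatedly asks etcd to emit progress-notify events on the
// open watch channels instead of waiting for the periodic (default 5s) one,
// until the caller-supplied predicate reports that the watch cache caught up.
func waitUntilFresh(ctx context.Context, c *clientv3.Client, caughtUp func() bool) error {
	backoff := 100 * time.Millisecond
	for !caughtUp() {
		if err := c.Watcher.RequestProgress(ctx); err != nil {
			return err
		}
		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-time.After(backoff):
			backoff *= 2 // exponential backoff between forced notifications
		}
	}
	return nil
}
```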
@deads2k - FYI
// For watch calls, it begins with synthetic "Added" events of all resources up to the most recent ResourceVersion.
// It ends with a synthetic "Bookmark" event containing the most recent ResourceVersion.
// For list calls, it has the same semantics as leaving ResourceVersion and ResourceVersionMatch unset.
ResourceVersionMatchMostRecent ResourceVersionMatch = "MostRecent"
Hmm - I started wondering if we really need that.
"MostRecent" is effectively the semantic of resourceVersion="". We are using that semantic for List.
So instead of introducing yet another constant, I think we should just:
- set RV="" for watch calls
- handle this from watchcache (if it is enabled)
We need to be very careful to ensure that we will be backward compatible here, but I think it's doable.
Hi @p0lyn0mial, could you give some references about this issue?
hey, please have a look at https://github.com/kubernetes/enhancements/tree/master/keps/sig-api-machinery/3157-watch-list#manual-testing-without-the-changes-in-place. In addition to that, you might be interested in the synthetic perf tests that have been prepared for this feature (i.e. http://perf-dash.k8s.io/#/?jobname=watch-list-off&metriccategoryname=E2E&metricname=LoadResources&PodName=kube-apiserver-bootstrap-e2e-master%2Fkube-apiserver&Resource=memory)