Limit impact of reliable backend startup routine on Redis load #73
Conversation
As the usage has grown we've started seeing some impact from `KEYS` being blocking and holding up the Redis instance. This is made worse by the fact that on startup many workers will be attempting this at the same time. This change moves to using the non-blocking `SCAN` instead. The main concern here is that the collection of keys can change during iteration; however, for this use case I think this is fine as the two changes we can expect are:

- A key goes away: this would be another startup routine removing a worker key after re-enqueuing its job, so should be safe.
- A key appearing: this would be a new worker coming online and should not impact the recovery mechanism, as it should also be part of `expected_processing_queue_keys`.
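For illustration, a minimal sketch of the kind of change described, assuming redis-py; the client setup and the `worker:*:processing` key pattern are placeholders rather than the project's actual code:

```python
import redis

r = redis.Redis()

PATTERN = "worker:*:processing"  # hypothetical pattern for worker processing-queue keys

# Before: KEYS walks the entire keyspace in a single blocking command,
# holding up the Redis instance (and every other client) while it runs.
processing_queue_keys = r.keys(PATTERN)

# After: SCAN iterates the keyspace in small, non-blocking steps;
# scan_iter hides the cursor bookkeeping and yields matching keys.
processing_queue_keys = list(r.scan_iter(match=PATTERN))
```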
Is it guaranteed that changes in the underlying keys cannot result in a client missing a key which was present at the time of their initial request? I know that this can happen to certain iterators in Python (and other languages) if the underlying collection is modified while iterating over it. A naive (and contrived) implementation showing the issue I mean is something like:

```python
In [1]: l = list(range(10))
   ...: for idx in range(len(l)):
   ...:     print(l[idx])
   ...:     l.remove(idx)
   ...:
0
2
4
6
8
IndexError
```

Ignoring the `IndexError` (an artefact of the contrived example), the point is that the odd values are never printed even though they were present when the iteration started.
Yes, it's possible for keys to go away while we're iterating through `SCAN` here. While this is generally unsafe, I think that under normal operations this is fine, as a worker key going away should only occur if another startup routine comes along and drops it, which we'd have done anyway. This could possibly become an issue with multiple conflicting sets of workers running their startup routine, but I'm not sure we should optimise for that case.
Sorry, I realise that keys actually going away during a scan is fine. What I'm asking is whether it's possible (perhaps due to a race condition in the scanning) for a key to not be observed by the client even though it remains in the system.
Aha, now that I've had a bit more time I've found the relevant part of the docs:

> A full iteration always retrieves all the elements that were present in the collection from the start to the end of a full iteration. This means that if a given element is inside the collection when an iteration is started, and is still there when the iteration terminates, then at some point SCAN returned it to the user.

This is the guarantee I was looking for -- ensuring that we can't miss out on keys that remain present.
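As a rough illustration of what a "full iteration" means in practice (not the project's code; redis-py and the `pattern` argument are assumptions), the guarantee applies once the cursor has been driven all the way back to 0:

```python
import redis

r = redis.Redis()

def full_scan(pattern):
    """Collect all keys matching pattern over one complete SCAN iteration."""
    keys = set()  # SCAN may return a key more than once, so deduplicate
    cursor = 0
    while True:
        # Each call returns the next cursor plus a (possibly empty) batch of keys.
        cursor, batch = r.scan(cursor=cursor, match=pattern)
        keys.update(batch)
        if cursor == 0:
            # Cursor back at 0 marks the end of a full iteration; any key that
            # stayed present throughout is guaranteed to have been returned.
            return keys
```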
For visibility even if this was merged: not specifying a count means `SCAN` uses its default of 10 keys per call. We're experimenting and validating with higher values and, depending on results, should either update this again and make the count configurable with sensible defaults, or revert this wholesale.
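A minimal sketch of what making the count configurable might look like; the `REDIS_SCAN_COUNT` environment variable, its default, and the helper are hypothetical, not the project's actual configuration:

```python
import os

import redis

r = redis.Redis()

# Hypothetical knob: how much work Redis does per SCAN call. Larger values
# mean fewer round trips per full iteration, at the cost of each individual
# call holding the server slightly longer.
SCAN_COUNT = int(os.environ.get("REDIS_SCAN_COUNT", "1000"))

def matching_keys(pattern):
    # scan_iter passes COUNT through to each underlying SCAN call.
    return list(r.scan_iter(match=pattern, count=SCAN_COUNT))
```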
This reverts commit 04b985f, reversing changes made to 6ff4b90. While using SCAN instead of KEYS is the recommended approach, it turns out to have worse real-world behaviour in the scenario the change was intended to improve. Specifically, queue start-up times were observed to be much longer, while the load on the Redis instance was not meaningfully reduced, impacting other clients more severely than when KEYS was used.
As the usage has grown we've started seeing some impact from `KEYS` being blocking and holding up the Redis instance. This is made worse by the fact that on startup many workers will be attempting this at the same time.

This change moves to using the non-blocking `SCAN` instead. The main concern here is that the collection of keys can change during iteration; however, for this use case I think this is fine as the two changes we can expect are:

- A key goes away: this would be another startup routine removing a worker key after re-enqueuing its job, so should be safe.
- A key appearing: this would be a new worker coming online and should not impact the recovery mechanism, as it should also be part of `expected_processing_queue_keys`.

Relying on small scope + existing tests here.

We've discussed some other things we could do here, specifically ways of doing this less often. But for now let's see what the SCAN change brings and iterate from there.