Limit impact of reliable backend startup routine on Redis load #73

lirsacc · 2022-08-19T08:59:46Z

As the usage has grown we've started seeing some impact form KEYS being
blocking and holding up the Redis instance. This is made worse by the
fact than on startup many workers will be attempting this at the same
time.

This change moves to using the non blocking SCAN instead. The main
concern here is that the collection of keys can change during iteration
however for this use case I think this is fine as the 2 changes we can
expect are:

A key goes away: this would be another startup routine removing a
worker key after re-enqueuing its job so should be safe.
A key appearing: this would be a new worker coming online and should
not be impacting the recovery mechanism as it should also be part of
expected_processing_queue_keys.

Relying on small scope + existing tests here.

We've discussed some other things we could do here; specifically doing this less which could be:

Only run this on worker 0 of any given queue
Do some locking to avoid thundering all machines doing this at once

But for now let's see what the SCAN change brings and iterate from there.

As the usage has grown we've started seeing some impact form `KEYS` being blocking and holding up the Redis instance. This is made worse by the fact than on startup many workers will be attempting this at the same time. This change moves to using the non blocking `SCAN` instead. The main concern here is that the collection of keys can change during iteration however for this use case I think this is fine as the 2 changes we can expect are: - A key goes away: this would be another startup routine removing a worker key after re-enqueuing its job so should be safe. - A key appearing: this would be a new worker coming online and should not be impacting the recovery mechanism as it should also be part of `expected_processing_queue_keys`.

PeterJCLaw · 2022-08-22T12:11:32Z

This change moves to using the non blocking SCAN instead. The main concern here is that the collection of keys can change during iteration however for this use case I think this is fine as the 2 changes we can expect are:

Is it guaranteed that changes in the underlying keys cannot result in a client missing out a key which that present at the time of their initial request? I know that this can happen to certain iterators in Python (and other languages) if the underlying collection is modified while iterating over it.

A naive (and contrived) implementation showing the issue I mean is something like:

In [1]: l = list(range(10))
   ...: for idx in range(len(l)):
   ...:     print(l[idx])
   ...:     l.remove(idx)
   ...: 
0
2
4
6
8
IndexError

Ignoring the IndexError, note the omission of values which are still present in the list.

lirsacc · 2022-08-22T13:05:09Z

Yes it's possible for keys to go away while we're iterating through scan here.

While this is generally unsafe, i think that under normal operations this is fine as a worker key going away should only occur if another startup routine comes along and drops it which we'd have done anyway. This could possibly become an issue with multiple conflicting sets of workers running their startup routine but I am not sure we should optimise for this case.

PeterJCLaw · 2022-08-22T13:24:44Z

Yes it's possible for keys to go away while we're iterating through scan here.

Sorry, I realise that keys actually going away during a scan is fine. What I'm asking is whether it's possible (perhaps due to a race condition in the scanning) for a key to not be observed by the client even though it remains in the system.

PeterJCLaw · 2022-08-22T13:30:08Z

Aha, now I've had a bit more time I've found the relevant part of the docs:

A full iteration always retrieves all the elements that were present in the collection from the start to the end of a full iteration. This means that if a given element is inside the collection when an iteration is started, and is still there when an iteration terminates, then at some point SCAN returned it to the user.

This is the guarantee I was looking for -- ensuring that we can't miss out on keys that remain present.

lirsacc · 2022-08-23T16:32:38Z

For visibility even if this was merged: not specifying a count in scan_iter can lead to many small queries and take significantly longer the the previous implementation.

We're experimenting and validating with higher values and depending on results should either update this again and make the count configurable with sensible defaults or revert this wholesale.

This reverts commit 04b985f, reversing changes made to 6ff4b90. While using SCAN instead of KEYS is the recommended approach, it turns out that it has worse real-world behaviour in the scenario that the change was intended to make. Specifically the queue start-up times were observed to be much longer while the load on the redis instance was not reduced meaningfully, negatively impacting other clients worse than when KEYS was used.

lirsacc marked this pull request as ready for review August 22, 2022 09:18

lirsacc requested a review from PeterJCLaw August 22, 2022 09:18

PeterJCLaw approved these changes Aug 22, 2022

View reviewed changes

lirsacc merged commit 04b985f into master Aug 23, 2022

lirsacc deleted the limit-keys-calls-on-startup branch August 23, 2022 09:21

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Limit impact of reliable backend startup routine on Redis load #73

Limit impact of reliable backend startup routine on Redis load #73

Uh oh!

lirsacc commented Aug 19, 2022 •

edited

Loading

Uh oh!

PeterJCLaw commented Aug 22, 2022

Uh oh!

lirsacc commented Aug 22, 2022

Uh oh!

PeterJCLaw commented Aug 22, 2022

Uh oh!

PeterJCLaw commented Aug 22, 2022

Uh oh!

lirsacc commented Aug 23, 2022

Uh oh!

Reviewers

Assignees

Labels

Milestone

Development

Uh oh!

3 participants

Limit impact of reliable backend startup routine on Redis load #73

Limit impact of reliable backend startup routine on Redis load #73

Uh oh!

Conversation

lirsacc commented Aug 19, 2022 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

PeterJCLaw commented Aug 22, 2022

Uh oh!

lirsacc commented Aug 22, 2022

Uh oh!

PeterJCLaw commented Aug 22, 2022

Uh oh!

PeterJCLaw commented Aug 22, 2022

Uh oh!

lirsacc commented Aug 23, 2022

Uh oh!

Reviewers

Assignees

Labels

Milestone

Development

Uh oh!

3 participants

lirsacc commented Aug 19, 2022 •

edited

Loading