Skip to content

Conversation

@lirsacc
Copy link
Contributor

@lirsacc lirsacc commented Aug 19, 2022

As the usage has grown we've started seeing some impact form KEYS being
blocking and holding up the Redis instance. This is made worse by the
fact than on startup many workers will be attempting this at the same
time.

This change moves to using the non blocking SCAN instead. The main
concern here is that the collection of keys can change during iteration
however for this use case I think this is fine as the 2 changes we can
expect are:

  • A key goes away: this would be another startup routine removing a
    worker key after re-enqueuing its job so should be safe.
  • A key appearing: this would be a new worker coming online and should
    not be impacting the recovery mechanism as it should also be part of
    expected_processing_queue_keys.

Relying on small scope + existing tests here.


We've discussed some other things we could do here; specifically doing this less which could be:

  • Only run this on worker 0 of any given queue
  • Do some locking to avoid thundering all machines doing this at once

But for now let's see what the SCAN change brings and iterate from there.

As the usage has grown we've started seeing some impact form `KEYS` being
blocking and holding up the Redis instance. This is made worse by the
fact than on startup many workers will be attempting this at the same
time.

This change moves to using the non blocking `SCAN` instead. The main
concern here is that the collection of keys can change during iteration
however for this use case I think this is fine as the 2 changes we can
expect are:

- A key goes away: this would be another startup routine removing a
  worker key after re-enqueuing its job so should be safe.
- A key appearing: this would be a new worker coming online and should
  not be impacting the recovery mechanism as it should also be part of
  `expected_processing_queue_keys`.
@lirsacc lirsacc marked this pull request as ready for review August 22, 2022 09:18
@lirsacc lirsacc requested a review from PeterJCLaw August 22, 2022 09:18
@PeterJCLaw
Copy link
Contributor

This change moves to using the non blocking SCAN instead. The main concern here is that the collection of keys can change during iteration however for this use case I think this is fine as the 2 changes we can expect are:

Is it guaranteed that changes in the underlying keys cannot result in a client missing out a key which that present at the time of their initial request? I know that this can happen to certain iterators in Python (and other languages) if the underlying collection is modified while iterating over it.

A naive (and contrived) implementation showing the issue I mean is something like:

In [1]: l = list(range(10))
   ...: for idx in range(len(l)):
   ...:     print(l[idx])
   ...:     l.remove(idx)
   ...: 
0
2
4
6
8
IndexError

Ignoring the IndexError, note the omission of values which are still present in the list.

@lirsacc
Copy link
Contributor Author

lirsacc commented Aug 22, 2022

Yes it's possible for keys to go away while we're iterating through scan here.

While this is generally unsafe, i think that under normal operations this is fine as a worker key going away should only occur if another startup routine comes along and drops it which we'd have done anyway. This could possibly become an issue with multiple conflicting sets of workers running their startup routine but I am not sure we should optimise for this case.

@PeterJCLaw
Copy link
Contributor

Yes it's possible for keys to go away while we're iterating through scan here.

Sorry, I realise that keys actually going away during a scan is fine. What I'm asking is whether it's possible (perhaps due to a race condition in the scanning) for a key to not be observed by the client even though it remains in the system.

@PeterJCLaw
Copy link
Contributor

Aha, now I've had a bit more time I've found the relevant part of the docs:

A full iteration always retrieves all the elements that were present in the collection from the start to the end of a full iteration. This means that if a given element is inside the collection when an iteration is started, and is still there when an iteration terminates, then at some point SCAN returned it to the user.

This is the guarantee I was looking for -- ensuring that we can't miss out on keys that remain present.

@lirsacc lirsacc merged commit 04b985f into master Aug 23, 2022
@lirsacc lirsacc deleted the limit-keys-calls-on-startup branch August 23, 2022 09:21
@lirsacc
Copy link
Contributor Author

lirsacc commented Aug 23, 2022

For visibility even if this was merged: not specifying a count in scan_iter can lead to many small queries and take significantly longer the the previous implementation.

We're experimenting and validating with higher values and depending on results should either update this again and make the count configurable with sensible defaults or revert this wholesale.

PeterJCLaw added a commit that referenced this pull request Sep 7, 2022
This reverts commit 04b985f, reversing
changes made to 6ff4b90.

While using SCAN instead of KEYS is the recommended approach, it
turns out that it has worse real-world behaviour in the scenario
that the change was intended to make. Specifically the queue start-up
times were observed to be much longer while the load on the redis
instance was not reduced meaningfully, negatively impacting other
clients worse than when KEYS was used.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Development

Successfully merging this pull request may close these issues.

3 participants