Race condition with retries and multiple workers #482

### Symptoms

I'm observing jobs occasionally being run sooner than scheduled. As far as I can tell, this occurs when multiple workers are running and the job is being retried.

### Investigation

I believe I see a race condition in the code.

Disclaimer: I haven't verified this particular path, as it's extremely difficult to reproduce. I did, however, verify that scaling down to 1 worker makes the issue go away.

Consider this scenario:

1. Worker A fetches the job X:

https://github.com/python-arq/arq/blob/3914e48c530f0e5acf02cb340b1f8f3e201ba895/arq/worker.py#L386

2. Worker B fetches multiple jobs, including the job X. Worker B goes ahead and iterates through them, making its way to the job X:

https://github.com/python-arq/arq/blob/3914e48c530f0e5acf02cb340b1f8f3e201ba895/arq/worker.py#L435

3. In the meantime, Worker A finishes the job X and catches a `Retry`. It increments the job score:

https://github.com/python-arq/arq/blob/3914e48c530f0e5acf02cb340b1f8f3e201ba895/arq/worker.py#L701

4. Now, Worker B gets a chance to run X. It reads the score again:

https://github.com/python-arq/arq/blob/3914e48c530f0e5acf02cb340b1f8f3e201ba895/arq/worker.py#L449

The score is now in the future, yet Worker B continues normally :boom: As far as I can tell, steps 3 and 4 are not protected by any synchronization primitive. Does this sound plausible?
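
To make the interleaving concrete, here is a minimal in-memory sketch of the four steps. It does not use arq or Redis: the `queue` dict, the `in_progress` set, and `fetch_due` are hypothetical stand-ins, and the sketch assumes that the check in step 4 only tests that a score exists and that no other worker currently holds the job.

```python
# In-memory stand-ins for the Redis structures; all names are hypothetical.
queue = {"X": 1_000}   # job id -> score (ms timestamp at which the job is due)
in_progress = set()    # ids of jobs currently being run by some worker


def fetch_due(now_ms):
    """Return the ids of jobs whose score is not in the future."""
    return [job_id for job_id, score in queue.items() if score <= now_ms]


now_ms = 2_000

# Steps 1 and 2: both workers poll, and both see X as due.
worker_a_jobs = fetch_due(now_ms)
worker_b_jobs = fetch_due(now_ms)

# Worker A gets to X first and marks it as in progress.
in_progress.add("X")

# Step 3: A's run ends in a Retry; A defers X by 60 s and clears its marker.
queue["X"] += 60_000
in_progress.discard("X")

# Step 4: B reaches X, re-reads the score, and only checks that it exists.
score = queue.get("X")
b_starts_job = "X" not in in_progress and score is not None
ran_early_by_ms = score - now_ms if b_starts_job else 0

print(b_starts_job, ran_early_by_ms)  # True 59000
```

In this model Worker B starts X 59 seconds before its new due time, which matches the symptom described above.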

### Possible fix

I haven't studied the code well enough to be sure, but it looks like an additional check, such as `score > timestamp_ms()`, could be added around here to prevent a retry scheduled for the future from being executed:

https://github.com/python-arq/arq/blob/3914e48c530f0e5acf02cb340b1f8f3e201ba895/arq/worker.py#L450
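
A rough sketch of what such a guard might look like, written as a standalone function rather than a patch against arq. `should_start` and its parameters are hypothetical names; `timestamp_ms` is reimplemented locally so the snippet runs on its own, and the real check would have to live inside the worker's existing pipeline logic.

```python
import time


def timestamp_ms() -> int:
    """Current Unix time in milliseconds (local stand-in for illustration)."""
    return int(time.time() * 1000)


def should_start(score, ongoing_exists, now_ms=None):
    """Decide whether a worker may start a job it fetched earlier.

    score: the job's score as re-read from the queue, or None if it is gone.
    ongoing_exists: True if another worker currently holds the job.
    """
    if now_ms is None:
        now_ms = timestamp_ms()
    if ongoing_exists or score is None:
        # Job is already running elsewhere or was removed from the queue.
        return False
    if score > now_ms:
        # Proposed extra check: the job was deferred (e.g. by a retry) after
        # this worker fetched it, so it is not due yet.
        return False
    return True


# Score pushed into the future by a retry on another worker: skip the job.
print(should_start(score=61_000, ongoing_exists=False, now_ms=2_000))  # False
# Job that is genuinely due: run it.
print(should_start(score=1_000, ongoing_exists=False, now_ms=2_000))   # True
```

The check is only meaningful if the score is re-read at the same point where the in-progress marker is checked; otherwise the window between the read and the start of the job remains open.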
