Description
Symptoms
I'm observing jobs randomly being run sooner than scheduled. As far as I can tell, this occurs with multiple workers and retried jobs.
Investigation
I believe I see a race condition in the code.
Disclaimer: I haven't verified this particular path, as it's extremely difficult to reproduce it. I did, however, verified that scaling down to 1 worker makes the issue go away.
Consider this scenario:
- Worker A fetches the job X:
Line 386 in 3914e48
- Worker B fetches multiple jobs, including the job X. Worker B goes ahead and iterates through them, making its way to the job X:
Line 435 in 3914e48
- In the meantime, Worker A finishes the job X and catches a
Retry
. It increments the job score:
Line 701 in 3914e48
- Now, Worker B gets a chance to run X. It reads the score again:
Line 449 in 3914e48
And now, it’s in the future yet worker B continues normally 💥 As far as I can tell, steps 3 and 4 are not protected by the sync primitives. Does this sound plausible?
Possible fix
I haven't studied the code well enough. It looks like an additional check, like score > timestamp_ms()
could be added around here to prevent the execution of a future retry:
Line 450 in 3914e48