Don't page for long running jobs #2434
Merged
This changes our pageable event for the background queue from "a job is
more than `$MAX_JOB_TIME` minutes old" to "a job is not currently being
run and is more than `$MAX_JOB_TIME` minutes old", meaning we will no
longer page for a single job taking longer than `$MAX_JOB_TIME` minutes to
run.
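
To make the new condition concrete, here's a minimal sketch of what the
monitor's check could look like with diesel. This is illustrative, not the
actual crates.io code: the `background_jobs` table definition is trimmed
down, and it assumes (as in swirl-style queues) that a worker holds a
`FOR UPDATE` row lock on a job while running it, so `SKIP LOCKED` filters
out jobs that are actively in progress.

```rust
use diesel::dsl::{now, IntervalDsl};
use diesel::prelude::*;

// Trimmed-down table definition for illustration; the real table has
// more columns.
diesel::table! {
    background_jobs (id) {
        id -> BigInt,
        created_at -> Timestamp,
    }
}

/// Counts jobs that are older than `max_job_time` minutes *and* not
/// currently locked by a worker. A long-running job that is still
/// making progress stays locked, so it is skipped and won't page us.
fn stalled_job_count(conn: &PgConnection, max_job_time: i32) -> QueryResult<usize> {
    use self::background_jobs::dsl::*;

    let stalled_jobs: Vec<i64> = background_jobs
        .select(id)
        .filter(created_at.lt(now - max_job_time.minutes()))
        .for_update()
        .skip_locked()
        .load(conn)?;

    // `FOR UPDATE` can't be combined with an aggregate, so the count
    // happens on the Rust side (see below).
    Ok(stalled_jobs.len())
}
```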
I've been getting paged a lot lately because `update_downloads` has
taken longer than our threshold to complete, meaning "a job has been in
the queue for too long". While I'd love to know if any other job is
stuck in an infinite loop, this alert has never triggered for a bug like
that, and any other job with that bug would clog the entire queue pretty
quickly anyway.

If a job is making progress, I don't need to be paged. For
`update_downloads`, there's really nothing I can do about it unless
there are two instances running concurrently (in which case I can
restart the worker and delete one of them while it's unlocked), but even
if I didn't do that, it would eventually resolve itself (unless we're
continuously being hammered by bots, in which case I will get paged when
this inevitably clogs the entire queue).
The worst case scenario for `update_downloads` is that download counts
are out of sync for a while. Frankly, if it's not blocking index
updates, I do not need to be woken up for it.
I've had to change the query slightly so the count happens on the Rust
side, since we can't do `FOR UPDATE` on an aggregate query. We could do
this with a subselect, but we'd need to drop to raw SQL for that, which
is a bit of a pain here for very little gain.
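
For reference, the aggregate shape of the query would look roughly like
the sketch below (same illustrative names as above). Postgres rejects
`FOR UPDATE` on it because row locks can't be taken on the output of an
aggregate, which is why the version above loads the matching ids and
counts them with `.len()` instead.

```rust
// Old shape (sketch): counting in SQL. Adding `.for_update()` to this
// query fails, since Postgres doesn't allow FOR UPDATE with aggregate
// functions.
let stalled_jobs: i64 = background_jobs
    .filter(created_at.lt(now - max_job_time.minutes()))
    .count()
    .get_result(conn)?;
```

The raw-SQL subselect alternative would wrap the locking query in an
outer `SELECT COUNT(*)`, which diesel's query builder can't express
without dropping to `sql_query`.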