Only enqueue 1 update_downloads job at a time #2539
Conversation
This ensures that if an `update_downloads` job is already running, a duplicate job will not be enqueued. Currently, when multiple jobs are running in parallel, they end up doing duplicate work resulting in temporary overcounts that must be corrected in the next run. The concurrent tasks also slow down the overall process and can result in runaway performance problems as further jobs are spawned. This commit also updates the monitoring to specifically check if the update downloads job runs for too long (120 minutes by default). The main check for stalled jobs will not trigger for `update_downloads` as the row is locked for the duration of the job (and `skip_locked` is used in that query).
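In rough terms, the guard added to the `enqueue-job` binary amounts to the following (a minimal sketch only; the module paths and the `enqueue_update_downloads` wrapper are illustrative stand-ins, and the real diff hunks are quoted below):

```rust
use diesel::prelude::*;
use diesel::PgConnection;

// Illustrative imports: the schema module path and `tasks` helper are
// assumptions standing in for the real crates.io modules.
use cargo_registry::schema::background_jobs::dsl::*;
use cargo_registry::tasks;

fn enqueue_update_downloads(conn: &PgConnection) -> Result<(), Box<dyn std::error::Error>> {
    // How many update_downloads jobs (queued or currently running) already exist?
    let count: i64 = background_jobs
        .filter(job_type.eq("update_downloads"))
        .count()
        .get_result(conn)?;

    if count > 0 {
        // A job already exists; enqueueing another would only duplicate work.
        println!("Did not enqueue update_downloads, existing job already in progress");
        Ok(())
    } else {
        Ok(tasks::update_downloads().enqueue(conn)?)
    }
}
```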
This is a different approach to addressing the same issue as #2433, without the 2 drawbacks. The approach there can still be pursued, but in particular I think we need to address the 2nd drawback before merging it. Earlier this week I happened to catch an instance of this issue in progress while looking at the metrics dashboard, presumably before the monitoring generated a page. This PR prevents the cascading performance problem by not enqueuing duplicate jobs. Once I removed the extra jobs and restarted the background worker, the
```rust
let start_time = background_jobs
    .filter(job_type.eq("update_downloads"))
    .select(created_at)
```
I'm not familiar with the crates.io schema that much: is this when the job was queued or started? If this is when it's queued, it might get false alerts.
This is the time the job was first enqueued. Given the long (2 hour) default, even if there is a delay in starting the job after it is enqueued, there should not be false alerts. The job typically completes in <10 minutes, and is only expected to run longer when someone downloads all versions of all crates in quick succession. The long default is intended to give the job plenty of time to complete before alerting, even in this occasional extreme case.
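In other words, the alert condition is just "time since the job was enqueued exceeds the configured maximum", along these lines (a sketch under those assumptions, not the exact crates.io code):

```rust
use chrono::{Duration, NaiveDateTime, Utc};

/// Sketch of the alert condition: measure from when the job was enqueued
/// (`created_at`) and only page once that age exceeds the configured
/// maximum (120 minutes by default).
fn update_downloads_ran_too_long(enqueued_at: NaiveDateTime, max_job_minutes: i64) -> bool {
    Utc::now().naive_utc() - enqueued_at > Duration::minutes(max_job_minutes)
}
```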
```rust
let max_job_time = dotenv::var("MONITOR_MAX_UPDATE_DOWNLOADS_TIME")
    .map(|s| s.parse::<u32>().unwrap() as i64)
    .unwrap_or(120);
```
What's the unit for this? Could you clarify that in comments?
Looking at the code below you compare it with minutes, so it would be 2 hours. If that's the case I don't see the point of this monitoring, as `check_stalled_background_jobs` fires if a job was in the queue for more than 15 minutes, alerting us 1hr45min before this check fires.
Yes, the unit is in minutes. Added a new commit to add comments.
The existing `check_stalled_background_jobs` check includes `skip_locked`, so tasks that are currently running are not caught by that check. In that context, "stalled" means that the job has failed (and likely been retried several times) but is not currently running. I've renamed the function to be a bit more clear and have added more context in the doc comment.
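To make the distinction concrete, here is a rough sketch of the query shape (not the actual crates.io code; the 15-minute threshold and columns are illustrative). Because swirl keeps a job's row locked while a worker is running it, a `SKIP LOCKED` query never sees the in-progress `update_downloads` row, only unlocked jobs that have been waiting too long:

```rust
use diesel::prelude::*;
use diesel::sql_query;

// Rows locked by a running worker are skipped, so this only finds jobs that
// are sitting in the queue unlocked ("stalled"), not jobs that are merely
// long-running.
let stalled_rows = sql_query(
    "SELECT id FROM background_jobs \
     WHERE created_at < NOW() - INTERVAL '15 minutes' \
     FOR UPDATE SKIP LOCKED",
)
.execute(&conn)?;
println!("{} stalled background jobs", stalled_rows);
```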
```rust
if count > 0 {
    println!("Did not enqueue update_downloads, existing job already in progress");
    Ok(())
} else {
    Ok(tasks::update_downloads().enqueue(&conn)?)
}
```
First getting the count and then adding the job might end up not adding the job, even though the queue is empty, if a race condition happens. I guess it doesn't matter much though; the next job will only have more work to do.
Yeah, if a job is already in the queue at the beginning of `enqueue-job` then it is fine if the background job is not enqueued. It might be possible to consolidate the actions into a single atomic query, but then we would be effectively re-implementing `enqueue` from `swirl` and that would probably be harder to maintain.
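For reference, that single atomic query alternative would look something like the following (purely hypothetical sketch; the real `background_jobs` columns and the job's serialized `data` are managed by `swirl`, which is exactly the knowledge this would duplicate):

```rust
use diesel::prelude::*;
use diesel::sql_query;

// Hypothetical: enqueue only if no update_downloads job already exists, in a
// single statement, avoiding the check-then-insert race. This hard-codes
// swirl's table layout and job payload, which is why the PR keeps the
// simpler count-then-enqueue approach instead.
sql_query(
    "INSERT INTO background_jobs (job_type, data) \
     SELECT 'update_downloads', '{}'::jsonb \
     WHERE NOT EXISTS ( \
         SELECT 1 FROM background_jobs WHERE job_type = 'update_downloads' \
     )",
)
.execute(&conn)?;
```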
@pietroalbini I've added a commit to address your review comments and clarify some parts of the logic (in particular for how the two similar jobs don't conflict with each other).
@bors r+
📌 Commit a0b3e7b has been approved by
☀️ Test successful - checks-travis
Strange, it appears this was stuck in bors for 5 days, and finally triggered after I
r? @pietroalbini