
Only enqueue 1 update_downloads job at a time #2539


Merged
merged 2 commits into rust-lang:master on Jul 5, 2020

Conversation

jtgeibel (Member)

This ensures that if an `update_downloads` job is already running, a
duplicate job will not be enqueued. Currently, when multiple jobs are
running in parallel, they end up doing duplicate work, resulting in
temporary overcounts that must be corrected in the next run. The
concurrent tasks also slow down the overall process and can result in
runaway performance problems as further jobs are spawned.

This commit also updates the monitoring to specifically check if the
update downloads job runs for too long (120 minutes by default). The
main check for stalled jobs will not trigger for `update_downloads` as
the row is locked for the duration of the job (and `skip_locked` is used
in that query).

r? @pietroalbini

jtgeibel (Member, Author)

This is a different approach to addressing the same issue as #2433, without the two drawbacks described there. The approach there can still be pursued, but in particular I think we need to address the second drawback before merging it.

Earlier this week I happened to catch an instance of this issue in progress while looking at the metrics dashboard, presumably before the monitoring generated a page. This PR prevents the cascading performance problems by not enqueuing duplicate jobs. Once I removed the extra jobs and restarted the background worker, the `update_downloads` job was able to finish relatively quickly.


let start_time = background_jobs
.filter(job_type.eq("update_downloads"))
.select(created_at)
pietroalbini (Member)

I'm not that familiar with the crates.io schema: is this the time the job was queued, or when it started? If it's the time it was queued, this might produce false alerts.

jtgeibel (Member, Author)

This is the time the job was first enqueued. Given the long (2 hour) default, even if there is a delay in starting the job after it is enqueued, there should not be false alerts. The job typically completes in <10 minutes and is only expected to run longer when someone downloads all versions of all crates in quick succession. The long default is intended to give the job plenty of time to complete before alerting, even in this occasional extreme case.
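
For reference, a minimal sketch of how a check along these lines could look with diesel and chrono (the function name and the `schema::background_jobs` path are illustrative assumptions, not copied verbatim from the monitor code):

use chrono::Utc;
use diesel::prelude::*;

// Sketch: alert if an `update_downloads` row has been sitting in
// `background_jobs` for longer than `max_job_time` minutes. The column
// holds the enqueue time, so the threshold is deliberately generous.
fn check_update_downloads_age(conn: &PgConnection, max_job_time: i64) -> QueryResult<()> {
    use crate::schema::background_jobs::dsl::*;

    let start_time = background_jobs
        .filter(job_type.eq("update_downloads"))
        .select(created_at)
        .first::<chrono::NaiveDateTime>(conn)
        .optional()?;

    if let Some(start_time) = start_time {
        let minutes = Utc::now()
            .naive_utc()
            .signed_duration_since(start_time)
            .num_minutes();
        if minutes > max_job_time {
            println!("update_downloads has been enqueued for {} minutes", minutes);
            // this is where the on-call alert would be triggered
        }
    }
    Ok(())
}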


let max_job_time = dotenv::var("MONITOR_MAX_UPDATE_DOWNLOADS_TIME")
.map(|s| s.parse::<u32>().unwrap() as i64)
.unwrap_or(120);
pietroalbini (Member)

What's the unit for this? Could you clarify that in comments?

Looking at the code below, you compare it against minutes, so it would be 2 hours. If that's the case, I don't see the point of this monitoring: `check_stalled_background_jobs` fires if a job has been in the queue for more than 15 minutes, alerting us 1 hr 45 min before this check would fire.

jtgeibel (Member, Author)

Yes, the unit is minutes; I've added a new commit with clarifying comments.

The existing `check_stalled_background_jobs` check includes `skip_locked`, so tasks that are currently running are not caught by that check. In that context, "stalled" means that the job has failed (and likely been retried several times) but is not currently running. I've renamed the function to be a bit clearer and have added more context in the doc comment.
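
To make the distinction concrete, here is a rough sketch of the shape of that query in diesel (illustrative only; column names and types are assumed rather than copied from the actual check). The `SKIP LOCKED` clause is what makes a row held by a running `update_downloads` job invisible to it:

use chrono::{NaiveDateTime, Utc};
use diesel::prelude::*;

// Sketch: find jobs that were enqueued more than `max_age_minutes` ago and
// are NOT currently locked by a worker. A job that is actively running holds
// a row lock, so `skip_locked()` filters it out of the result.
fn failed_job_ids(conn: &PgConnection, max_age_minutes: i64) -> QueryResult<Vec<i64>> {
    use crate::schema::background_jobs::dsl::*;

    let cutoff: NaiveDateTime =
        Utc::now().naive_utc() - chrono::Duration::minutes(max_age_minutes);

    background_jobs
        .filter(created_at.lt(cutoff))
        .select(id)
        .for_update()
        .skip_locked()
        .load(conn)
}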

Comment on lines +23 to +28
if count > 0 {
    println!("Did not enqueue update_downloads, existing job already in progress");
    Ok(())
} else {
    Ok(tasks::update_downloads().enqueue(&conn)?)
}
pietroalbini (Member)

Getting the count first and then adding the job might end up not adding the job even though the queue is empty, if a race condition happens. I guess it doesn't matter much though; the next job will just have more work to do.

jtgeibel (Member, Author)

Yeah, if a job is already in the queue at the beginning of `enqueue-job` then it is fine if the background job is not enqueued. It might be possible to consolidate the actions into a single atomic query, but then we would effectively be re-implementing `enqueue` from swirl, and that would probably be harder to maintain.
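
For the record, an atomic version would look roughly like the following raw-SQL sketch (the column list and JSON payload are assumptions about swirl's `background_jobs` table, and even this shape would need a unique index or stronger isolation to be fully race-free under concurrent enqueuers):

use diesel::prelude::*;
use diesel::sql_query;

// Sketch: insert the job only if no `update_downloads` row already exists,
// in a single statement, avoiding the separate count-then-enqueue steps.
// Doing this would effectively re-implement swirl's `enqueue`, which is why
// the PR sticks with the simpler approach.
fn enqueue_update_downloads_if_absent(conn: &PgConnection) -> QueryResult<usize> {
    sql_query(
        "INSERT INTO background_jobs (job_type, data) \
         SELECT 'update_downloads', '{}'::jsonb \
         WHERE NOT EXISTS ( \
             SELECT 1 FROM background_jobs WHERE job_type = 'update_downloads' \
         )",
    )
    .execute(conn)
}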

jtgeibel (Member, Author)

@pietroalbini I've added a commit to address your review comments and clarify some parts of the logic (in particular, how the two similar jobs don't conflict with each other).

pietroalbini (Member)

@bors r+

bors (Contributor) commented on Jun 30, 2020

📌 Commit a0b3e7b has been approved by pietroalbini

bors (Contributor) commented on Jul 5, 2020

⌛ Testing commit a0b3e7b with merge 8602f09...

bors (Contributor) commented on Jul 5, 2020

☀️ Test successful - checks-travis
Approved by: pietroalbini
Pushing 8602f09 to master...

bors merged commit 8602f09 into rust-lang:master on Jul 5, 2020
jtgeibel (Member, Author) commented on Jul 5, 2020

Strange, it appears this was stuck in bors for 5 days, and finally triggered after I r+ed another PR.
