
Only enqueue 1 update_downloads job at a time #2539


Merged
merged 2 commits into rust-lang:master on Jul 5, 2020

Conversation

jtgeibel (Member)

This ensures that if an `update_downloads` job is already running, a
duplicate job will not be enqueued. Currently, when multiple jobs are
running in parallel, they end up doing duplicate work, resulting in
temporary overcounts that must be corrected in the next run. The
concurrent tasks also slow down the overall process and can result in
runaway performance problems as further jobs are spawned.

This commit also updates the monitoring to specifically check if the
update downloads job runs for too long (120 minutes by default). The
main check for stalled jobs will not trigger for `update_downloads` as
the row is locked for the duration of the job (and `skip_locked` is used
in that query).

r? @pietroalbini

jtgeibel (Member, Author)

This is a different approach to addressing the same issue as #2433, without the two drawbacks described there. The approach there can still be pursued, but in particular I think we need to address the second drawback before merging it.

Earlier this week I happened to catch an instance of this issue in progress while looking at the metrics dashboard, presumably before the monitoring generated a page. This PR prevents the cascading performance problems by not enqueuing duplicate jobs. Once I removed the extra jobs and restarted the background worker, the `update_downloads` job was able to finish relatively quickly.


let start_time = background_jobs
.filter(job_type.eq("update_downloads"))
.select(created_at)
pietroalbini (Member)

I'm not that familiar with the crates.io schema: is this the time the job was queued, or when it started? If it's the time it was queued, this might produce false alerts.

jtgeibel (Member, Author)

This is the time the job was first enqueued. Given the long (2 hour) default, even if there is a delay in starting the job after it is enqueued, there should not be false alerts. The job typically completes in <10 minutes and is only expected to run longer when someone downloads all versions of all crates in quick succession. The long default is intended to give the job plenty of time to complete before alerting, even in this occasional extreme case.
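
For reference, a minimal sketch of how a check along these lines could look with diesel and chrono (the function name and the `schema::background_jobs` path are illustrative assumptions, not copied verbatim from the monitor code):

use chrono::Utc;
use diesel::prelude::*;

// Sketch: alert if an `update_downloads` row has been sitting in
// `background_jobs` for longer than `max_job_time` minutes. The column
// holds the enqueue time, so the threshold is deliberately generous.
fn check_update_downloads_age(conn: &PgConnection, max_job_time: i64) -> QueryResult<()> {
    use crate::schema::background_jobs::dsl::*;

    let start_time = background_jobs
        .filter(job_type.eq("update_downloads"))
        .select(created_at)
        .first::<chrono::NaiveDateTime>(conn)
        .optional()?;

    if let Some(start_time) = start_time {
        let minutes = Utc::now()
            .naive_utc()
            .signed_duration_since(start_time)
            .num_minutes();
        if minutes > max_job_time {
            println!("update_downloads has been enqueued for {} minutes", minutes);
            // this is where the on-call alert would be triggered
        }
    }
    Ok(())
}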


let max_job_time = dotenv::var("MONITOR_MAX_UPDATE_DOWNLOADS_TIME")
.map(|s| s.parse::<u32>().unwrap() as i64)
.unwrap_or(120);
pietroalbini (Member)

What's the unit for this? Could you clarify that in comments?

Looking at the code below, you compare it against minutes, so it would be 2 hours. If that's the case, I don't see the point of this monitoring: `check_stalled_background_jobs` fires if a job has been in the queue for more than 15 minutes, alerting us 1 hr 45 min before this check would fire.

jtgeibel (Member, Author)

Yes, the unit is minutes; I've added a new commit with clarifying comments.

The existing `check_stalled_background_jobs` check includes `skip_locked`, so tasks that are currently running are not caught by that check. In that context, "stalled" means that the job has failed (and likely been retried several times) but is not currently running. I've renamed the function to be a bit clearer and have added more context in the doc comment.
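
To make the distinction concrete, here is a rough sketch of the shape of that query in diesel (illustrative only; column names and types are assumed rather than copied from the actual check). The `SKIP LOCKED` clause is what makes a row held by a running `update_downloads` job invisible to it:

use chrono::{NaiveDateTime, Utc};
use diesel::prelude::*;

// Sketch: find jobs that were enqueued more than `max_age_minutes` ago and
// are NOT currently locked by a worker. A job that is actively running holds
// a row lock, so `skip_locked()` filters it out of the result.
fn failed_job_ids(conn: &PgConnection, max_age_minutes: i64) -> QueryResult<Vec<i64>> {
    use crate::schema::background_jobs::dsl::*;

    let cutoff: NaiveDateTime =
        Utc::now().naive_utc() - chrono::Duration::minutes(max_age_minutes);

    background_jobs
        .filter(created_at.lt(cutoff))
        .select(id)
        .for_update()
        .skip_locked()
        .load(conn)
}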

Comment on lines +23 to +28
if count > 0 {
    println!("Did not enqueue update_downloads, existing job already in progress");
    Ok(())
} else {
    Ok(tasks::update_downloads().enqueue(&conn)?)
}
pietroalbini (Member)

Getting the count first and then adding the job might end up not adding the job even though the queue is empty, if a race condition happens. I guess it doesn't matter much though; the next job will just have more work to do.

jtgeibel (Member, Author)

Yeah, if a job is already in the queue at the beginning of `enqueue-job` then it is fine if the background job is not enqueued. It might be possible to consolidate the actions into a single atomic query, but then we would effectively be re-implementing `enqueue` from swirl, and that would probably be harder to maintain.
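
For the record, an atomic version would look roughly like the following raw-SQL sketch (the column list and JSON payload are assumptions about swirl's `background_jobs` table, and even this shape would need a unique index or stronger isolation to be fully race-free under concurrent enqueuers):

use diesel::prelude::*;
use diesel::sql_query;

// Sketch: insert the job only if no `update_downloads` row already exists,
// in a single statement, avoiding the separate count-then-enqueue steps.
// Doing this would effectively re-implement swirl's `enqueue`, which is why
// the PR sticks with the simpler approach.
fn enqueue_update_downloads_if_absent(conn: &PgConnection) -> QueryResult<usize> {
    sql_query(
        "INSERT INTO background_jobs (job_type, data) \
         SELECT 'update_downloads', '{}'::jsonb \
         WHERE NOT EXISTS ( \
             SELECT 1 FROM background_jobs WHERE job_type = 'update_downloads' \
         )",
    )
    .execute(conn)
}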

jtgeibel (Member, Author)

@pietroalbini I've added a commit to address your review comments and clarify some parts of the logic (in particular, how the two similar jobs don't conflict with each other).

pietroalbini (Member)

@bors r+

bors (Contributor) commented on Jun 30, 2020

📌 Commit a0b3e7b has been approved by pietroalbini

bors (Contributor) commented on Jul 5, 2020

⌛ Testing commit a0b3e7b with merge 8602f09...

bors (Contributor) commented on Jul 5, 2020

☀️ Test successful - checks-travis
Approved by: pietroalbini
Pushing 8602f09 to master...

bors merged commit 8602f09 into rust-lang:master on Jul 5, 2020
jtgeibel (Member, Author) commented on Jul 5, 2020

Strange, it appears this was stuck in bors for 5 days, and finally triggered after I r+ed another PR.
