Break update_downloads into smaller jobs

sgrif · sgrif · commit 9d4b2a37e6e9 · 2020-04-16T15:10:50.000-06:00
This changes the behavior of the `update_downloads` background job from processing all rows serially to spawning a smaller job for each 1000 rows that need to be processed. This shortens the amount of time that any one job runs (making us less likely to hit timeouts in the runner and encounter issues that #2267 and #1804 addressed). More importantly, it means that we are able to do more in parallel, reducing the overall time it takes to count downloads.

About the Problem
===

There are two main thresholds we care about for how long this job takes to run:

- If it takes longer than the interval at which we enqueue this job (typically every 10 minutes, currently every hour due to the issues this PR addresses), we can end up with two instances of it running in parallel. This causes downloads to get double counted, and the jobs tend to contend for row locks and slow each other down. The double counting will be corrected the next time the job runs. This only tends to happen if a crawler downloads a large number of crates in rapid succession, causing the rows we have to process to increase from our normal volume of ~10k per hour to ~150k. When this occurs, we're likely to hit the second threshold.

- If it takes longer than `$MAX_JOB_TIME` (currently set to 60 for the reasons below, defaults to 15), I will be paged. This has been happening much more frequently as of late (which is why that env var is currently at 60 minutes). It's unclear if this is because crawlers are downloading large volumes of crates more frequently, or if we're just seeing normal volume push us over 15 minutes to process serially.

Splitting into smaller jobs doesn't directly help either of those thresholds, but being able to process rows in parallel does, since the overall time this takes to complete will go down dramatically (currently by a factor of 4, but we can probably set the number of threads to higher than CPU cores and still see benefits since we're I/O bound).

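The splitting described at the top of this message can be sketched without a database. This is a simplified illustration, not the code in this commit: `batch_ranges` is a hypothetical helper, and a sorted in-memory slice of IDs stands in for the `version_downloads` query. Each `(min, max)` range it returns corresponds to one small job that could run on any worker thread.

```rust
/// Splits already-sorted IDs into inclusive `(min, max)` ranges covering at
/// most `rows_per_batch` IDs each. Hypothetical helper for illustration only.
fn batch_ranges(sorted_ids: &[i32], rows_per_batch: usize) -> Vec<(i32, i32)> {
    assert!(rows_per_batch > 0, "batch size must be positive");
    sorted_ids
        .chunks(rows_per_batch)
        // `chunks` never yields an empty slice, so indexing is safe here.
        .map(|chunk| (chunk[0], chunk[chunk.len() - 1]))
        .collect()
}

fn main() {
    // Assumed input: 2500 pending IDs, batched 1000 at a time.
    let ids: Vec<i32> = (1..=2500).collect();
    for (min, max) in batch_ranges(&ids, 1000) {
        // In the real job, this is where a batch job would be enqueued.
        println!("enqueue batch covering ids {}..={}", min, max);
    }
}
```

Because every range is independent, no single job has to hold the full set of pending rows.
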
Based on extremely anecdotal, non-scientific measurements of "I ran `select count(*) from version_downloads where downloads != counted` while the job was churning through >100k rows roughly every minute a few times", we can process roughly 4k rows per minute, which seems about right for 6 queries per row. We can substantially increase throughput if we reduce this to one round trip, but for now we can expect this to take roughly 15 seconds per batch.

The longest I've ever seen this job take (and since I get paged if it takes too long, I've 100% seen the longest run times) is just over an hour. Since this should reduce it by *at least* a factor of 4, this will mean the time it takes to run if every version was downloaded at least once since the last run will be around 15 minutes. If we can bring this down to a single round trip per row, that should further reduce it to around 2.5 minutes.

Since this means we'll use all available worker threads in parallel, it also means that even if we have `update_downloads` queued again before the previous run completed, it's unlikely to ever be looking at the same rows in parallel, since the batches from the second run wouldn't be handled until all but worker_count - 1 batches from the first run have completed.

Drawbacks
===

There are two main drawbacks to this commit:

- Since we no longer process rows serially before running `update_recent_crate_downloads`, the data in `recent_crate_downloads` will reflect the *previous* run of `update_downloads`, meaning it's basically always 10-20 minutes behind. This is a regression over a few months ago, where it was typically 3-13 minutes behind, but an improvement over today, where it's 3-63 minutes behind.

- The entire background queue will be blocked while `update_downloads` runs. This was the case prior to #1804. At the time of that commit, we did not consider blocking publishes to be a problem.
We added the additional thread (assuming only one would be taken by `update_downloads` at any given time) to prevent the runner from crashing because it couldn't tell if progress was being made. That won't be an issue with this commit (since we're always going to make progress in relatively small chunks), but does mean that index updates will potentially be delayed by as much as 15 minutes in the worst case. (This number may be higher than is realistic since we've only observed >1 hour runs with the job set to queue hourly, meaning more rows to process per run.) Typically the delay will only be at most 30 seconds.

If I wasn't getting paged almost every day, I'd say this PR should be blocked on the second issue (which is resolved by adding queue priority to swirl). But given the operational load this issue is causing, I think increasing the worst case delay for index updates is a reasonable tradeoff for now.

Impl details
===

I've written the test in a sorta funky way, adding functions to get a connection in and out of a test DB pool. This was primarily so I could change the tests to queue the job, and then run any pending jobs, without too much churn (this would otherwise require having the runner own the connection, and putting any uses of the connection in braces since we'd have to fetch it from the pool each time).

This relies on an update to swirl (which is not in master at the time of writing this commit) for ease of testing. Testing `update_downloads` after this change requires actually running the background job. At the time of writing this, on master that would mean needing to construct a `background_jobs::Environment`, which involves cloning git indexes. The update to swirl means we can have the jobs take a connection directly, changing their environment type to `()`, making them much easier to test.

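The runtime estimates in "About the Problem" can be reproduced as back-of-the-envelope arithmetic. The sketch below only restates the figures quoted in this message (4k rows per minute, 1000 rows per batch, a worst case of roughly 60 minutes, 4 worker threads, 6 queries per row). It assumes run time scales linearly with both the worker count and the number of round trips, which is a simplification; the function names are hypothetical.

```rust
/// Seconds needed to process one batch at a given serial throughput.
fn seconds_per_batch(rows_per_batch: f64, rows_per_minute: f64) -> f64 {
    rows_per_batch / rows_per_minute * 60.0
}

/// Divides a duration by a speedup factor (assumes perfectly linear scaling).
fn scaled_minutes(minutes: f64, speedup: f64) -> f64 {
    minutes / speedup
}

fn main() {
    // One batch of 1000 rows at the observed ~4k rows per minute.
    let batch_secs = seconds_per_batch(1_000.0, 4_000.0);
    // Worst observed serial run (~60 minutes) spread over 4 workers.
    let worst_parallel = scaled_minutes(60.0, 4.0);
    // Same run if 6 queries per row became a single round trip.
    let worst_single_trip = scaled_minutes(worst_parallel, 6.0);

    println!("per batch: {} s", batch_secs);
    println!("worst case, 4 workers: {} min", worst_parallel);
    println!("worst case, one round trip: {} min", worst_single_trip);
}
```
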
diff --git a/src/db.rs b/src/db.rs
@@ -31,9 +31,18 @@ impl DieselPool {
         }
     }
 
-    fn test_conn(conn: PgConnection) -> Self {
+    pub fn test_conn(conn: PgConnection) -> Self {
         DieselPool::Test(Arc::new(ReentrantMutex::new(conn)))
     }
+
+    pub fn unwrap_test_conn(self) -> Result<PgConnection, Self> {
+        match self {
+            DieselPool::Test(shared_conn) => Arc::try_unwrap(shared_conn)
+                .map(|c| c.into_inner())
+                .map_err(Self::Test),
+            other => Err(other),
+        }
+    }
 }
 
 #[allow(missing_debug_implementations)]
diff --git a/src/tasks/update_downloads.rs b/src/tasks/update_downloads.rs
@@ -6,25 +6,34 @@ use crate::{
 use diesel::prelude::*;
 use swirl::PerformError;
 
+#[cfg(not(test))]
+const ROWS_PER_BATCH: i64 = 1000;
+
+#[cfg(test)]
+const ROWS_PER_BATCH: i64 = 1;
+
 #[swirl::background_job]
 pub fn update_downloads(conn: &PgConnection) -> Result<(), PerformError> {
-    update(&conn)?;
-    Ok(())
-}
-
-fn update(conn: &PgConnection) -> QueryResult<()> {
     use self::version_downloads::dsl::*;
     use diesel::dsl::now;
     use diesel::select;
 
-    let rows = version_downloads
-        .filter(processed.eq(false))
-        .filter(downloads.ne(counted))
-        .load(conn)?;
-
-    println!("Updating {} versions", rows.len());
-    collect(conn, &rows)?;
-    println!("Finished updating versions");
+    println!("Enqueuing jobs to count downloads");
+    let mut last_id = Some(0);
+    while let Some(id) = last_id {
+        let rows = version_downloads
+            .filter(processed.eq(false))
+            .filter(downloads.ne(counted))
+            .filter(version_id.gt(id))
+            .limit(ROWS_PER_BATCH)
+            .select(version_id)
+            .load(conn)?;
+        last_id = rows.last().copied();
+        if let Some(max_id) = last_id {
+            update_downloads_batch(id, max_id).enqueue(&conn)?;
+        }
+    }
+    println!("Finished enqueuing jobs");
 
     // Anything older than 24 hours ago will be frozen and will not be queried
     // against again.
@@ -43,6 +52,23 @@ fn update(conn: &PgConnection) -> QueryResult<()> {
     Ok(())
 }
 
+#[swirl::background_job]
+pub fn update_downloads_batch(
+    conn: &PgConnection,
+    min_version_id: i32,
+    max_version_id: i32,
+) -> Result<(), PerformError> {
+    use self::version_downloads::dsl::*;
+
+    let rows = version_downloads
+        .filter(processed.eq(false))
+        .filter(downloads.ne(counted))
+        .filter(version_id.between(min_version_id, max_version_id))
+        .load(conn)?;
+    collect(conn, &rows)?;
+    Ok(())
+}
+
 fn collect(conn: &PgConnection, rows: &[VersionDownload]) -> QueryResult<()> {
     use diesel::update;
 
@@ -89,6 +115,24 @@ mod test {
     };
     use std::collections::HashMap;
 
+    fn run_update(conn: PgConnection) -> PgConnection {
+        use crate::db::DieselPool;
+        use swirl::{Job, Runner};
+
+        super::update_downloads().enqueue(&conn).unwrap();
+        let pool = DieselPool::test_conn(conn);
+        {
+            let runner = Runner::builder(())
+                .thread_count(1)
+                .connection_pool(pool.clone())
+                .build();
+            runner.run_all_pending_jobs().unwrap();
+            runner.check_for_failed_jobs().unwrap();
+        }
+        pool.unwrap_test_conn()
+            .unwrap_or_else(|_| panic!("couldn't unwrap pool"))
+    }
+
     fn conn() -> PgConnection {
         let conn = PgConnection::establish(&env("TEST_DATABASE_URL")).unwrap();
         conn.begin_test_transaction().unwrap();
@@ -142,7 +186,7 @@ mod test {
             .execute(&conn)
             .unwrap();
 
-        super::update(&conn).unwrap();
+        let conn = run_update(conn);
         let version_downloads = versions::table
             .find(version.id)
             .select(versions::downloads)
@@ -153,7 +197,7 @@ mod test {
             .select(crates::downloads)
             .first(&conn);
         assert_eq!(Ok(1), crate_downloads);
-        super::update(&conn).unwrap();
+        let conn = run_update(conn);
         let version_downloads = versions::table
             .find(version.id)
             .select(versions::downloads)
@@ -178,7 +222,7 @@ mod test {
             ))
             .execute(&conn)
             .unwrap();
-        super::update(&conn).unwrap();
+        let conn = run_update(conn);
         let processed = version_downloads::table
             .filter(version_downloads::version_id.eq(version.id))
             .select(version_downloads::processed)
@@ -202,7 +246,7 @@ mod test {
             ))
             .execute(&conn)
             .unwrap();
-        super::update(&conn).unwrap();
+        let conn = run_update(conn);
         let processed = version_downloads::table
             .filter(version_downloads::version_id.eq(version.id))
             .select(version_downloads::processed)
@@ -252,7 +296,7 @@ mod test {
             .filter(crates::id.eq(krate.id))
             .first::<Crate>(&conn)
             .unwrap();
-        super::update(&conn).unwrap();
+        let conn = run_update(conn);
         let version2 = versions::table
             .find(version.id)
             .first::<Version>(&conn)
@@ -265,7 +309,7 @@ mod test {
             .unwrap();
         assert_eq!(krate2.downloads, 2);
         assert_eq!(krate2.updated_at, krate_before.updated_at);
-        super::update(&conn).unwrap();
+        let conn = run_update(conn);
         let version3 = versions::table
             .find(version.id)
             .first::<Version>(&conn)
@@ -300,7 +344,7 @@ mod test {
             .execute(&conn)
             .unwrap();
 
-        super::update(&conn).unwrap();
+        let conn = run_update(conn);
         let versions_changed = versions::table
             .select(versions::updated_at.ne(now - 2.days()))
             .get_result(&conn);
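
The `unwrap_test_conn` helper added in `src/db.rs` depends on `Arc::try_unwrap`, which only succeeds when no other handle to the shared value is alive. That is presumably why `run_update` builds the runner inside its own block before unwrapping the pool. A minimal sketch of that ownership pattern follows; it uses `std::sync::Mutex` in place of `ReentrantMutex`, and `TestPool` is a hypothetical stand-in, not the `DieselPool` type from this diff.

```rust
use std::sync::{Arc, Mutex};

/// Hypothetical test pool sharing a single connection of type `C`.
struct TestPool<C>(Arc<Mutex<C>>);

// Manual impl so that cloning the pool does not require `C: Clone`.
impl<C> Clone for TestPool<C> {
    fn clone(&self) -> Self {
        TestPool(Arc::clone(&self.0))
    }
}

impl<C> TestPool<C> {
    fn new(conn: C) -> Self {
        TestPool(Arc::new(Mutex::new(conn)))
    }

    /// Returns the connection if this is the last handle to the pool;
    /// otherwise hands the pool back unchanged.
    fn unwrap_conn(self) -> Result<C, Self> {
        Arc::try_unwrap(self.0)
            .map(|mutex| mutex.into_inner().expect("mutex poisoned"))
            .map_err(TestPool)
    }
}

fn main() {
    // A `String` plays the role of the database connection here.
    let pool = TestPool::new(String::from("connection"));
    {
        // Stand-in for the runner holding its own handle to the pool.
        let _runner_handle = pool.clone();
    } // The handle is dropped here, leaving `pool` as the sole owner.
    let conn = pool
        .unwrap_conn()
        .unwrap_or_else(|_| panic!("pool is still shared"));
    println!("recovered {}", conn);
}
```

If a second handle were still alive at the call to `unwrap_conn`, the `Err` branch would return the pool instead of the connection.
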
