Break update_downloads into smaller jobs #2433

Closed

Conversation

Contributor

@sgrif sgrif commented Apr 16, 2020

This changes the behavior of the update_downloads background job from
processing all rows serially to spawning a smaller job for each 1000
rows that need to be processed. This shortens the amount of time that
any one job runs (making us less likely to hit timeouts in the runner
and encounter issues that #2267 and #1804 addressed). More importantly,
it means that we are able to do more in parallel, reducing the overall
time it takes to count downloads.
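
To make the mechanism concrete, here is a rough sketch of the batch splitting
described above. This is illustrative only: enqueue_batch_job and the way the
unprocessed row IDs arrive are hypothetical stand-ins for the real Diesel
query and swirl enqueue call, not the actual crates.io code.

```rust
// Minimal sketch: the parent job splits the rows needing processing into
// 1000-row batches and enqueues one child job per batch, instead of
// processing every row itself.
const BATCH_SIZE: usize = 1000;

fn update_downloads(rows_needing_processing: Vec<i32>) {
    for batch in rows_needing_processing.chunks(BATCH_SIZE) {
        enqueue_batch_job(batch.to_vec());
    }
}

// Hypothetical stand-in for serializing the batch and inserting a row into
// the background jobs table.
fn enqueue_batch_job(version_download_ids: Vec<i32>) {
    println!("enqueued batch of {} rows", version_download_ids.len());
}

fn main() {
    // Pretend 2,500 version_downloads rows have uncounted downloads; this
    // enqueues three child jobs (1000 + 1000 + 500 rows).
    update_downloads((0..2_500).collect());
}
```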

About the Problem

There are two main thresholds we care about for how long this job takes
to run:

  • If it takes longer than the interval at which we enqueue this job
    (typically every 10 minutes, currently every hour due to the issues this
    PR addresses), we can end up with two instances of it running in
    parallel. This causes downloads to get double counted, and the jobs tend
    to contend for row locks and slow each other down. The double counting
    will be corrected the next time the job runs. This only tends to happen
    if a crawler downloads a large number of crates in rapid succession,
    causing the rows we have to process to increase from our normal volume
    of ~10k per hour to ~150k. When this occurs, we're likely to hit the
    second threshold.

  • If it takes longer than $MAX_JOB_TIME (currently set to 60 minutes for
    the reasons below; it defaults to 15), I will be paged. This has been
    happening much more frequently lately (which is why that env var is
    currently at 60 minutes). It's unclear whether this is because crawlers
    are downloading large volumes of crates more frequently, or whether
    normal volume is now enough to push us over 15 minutes when processing
    serially.

Splitting into smaller jobs doesn't directly help either of those
thresholds, but being able to process rows in parallel does, since the
overall time this takes to complete will go down dramatically (currently
by a factor of 4, but we can probably set the number of threads to
higher than CPU cores and still see benefits since we're I/O bound).

Based on extremely anecdotal, non-scientific measurements (running
select count(*) from version_downloads where downloads != counted
roughly every minute, a few times, while the job was churning through
>100k rows), we can process roughly ~4k rows per minute, which seems
about right for 6 queries per row. We could substantially increase
throughput by reducing this to one round trip, but for now we can expect
each 1000-row batch to take roughly 15 seconds. The longest I've ever
seen this job take (and since I get paged when it takes too long, I have
definitely seen the longest run times) is just over an hour. Since this
change should reduce that by at least a factor of 4, even a run where
every version was downloaded at least once since the last run should
take around 15 minutes. If we can bring this down to a single round trip
per row, that should further reduce it to around 2.5 minutes.
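
The sketch below simply re-derives those estimates from the numbers above
(~4k rows/minute, 1000-row batches, a ~4x win from parallelism, and a ~6x win
from collapsing the per-row queries); it is not new measurement data.

```rust
// Back-of-the-envelope check of the estimates above.
fn main() {
    let rows_per_minute = 4_000.0_f64;
    let batch_size = 1_000.0;

    let seconds_per_batch = batch_size / rows_per_minute * 60.0;
    println!("~{seconds_per_batch:.0} seconds per batch"); // ~15 seconds

    // Worst observed serial run is just over an hour; processing batches on
    // 4 worker threads in parallel cuts that by roughly a factor of 4.
    let worst_case_minutes = 60.0 / 4.0;
    println!("~{worst_case_minutes:.0} minutes worst case"); // ~15 minutes

    // Collapsing the 6 queries per row into a single round trip would give
    // roughly another 6x improvement.
    let single_round_trip_minutes = worst_case_minutes / 6.0;
    println!("~{single_round_trip_minutes:.1} minutes with one round trip per row"); // ~2.5
}
```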

Since this means we'll use all available worker threads in parallel, it
also means that even if we have update_downloads queued again before
the previous run completed, it's unlikely to ever be looking at the same
rows in parallel, since the batches from the second run wouldn't be
handled until all but worker_count - 1 batches from the first run have
completed.

Drawbacks

There are two main drawbacks to this commit:

  • Since we no longer process rows serially before running
    update_recent_crate_downloads, the data in recent_crate_downloads
    will reflect the previous run of update_downloads, meaning it's
    basically always 10-20 minutes behind. This is a regression compared to
    a few months ago, when it was typically 3-13 minutes behind, but an
    improvement over today, when it's 3-63 minutes behind.

  • The entire background queue will be blocked while update_downloads
    runs. This was the case prior to #1804 (Make the job runner a bit more
    resilient to slow jobs or other errors). At the time of that commit, we
    did not consider blocking publishes to be a problem. We added the
    additional thread (assuming only one would be taken by
    update_downloads at any given time) to prevent the runner from
    crashing because it couldn't tell if progress was being made. That won't
    be an issue with this commit (since we're always going to make progress
    in relatively small chunks), but does mean that index updates will
    potentially be delayed by as much as 15 minutes in the worst case.
    (This number may be higher than is realistic, since we've only observed
    >1 hour runs with the job set to queue hourly, meaning more rows to
    process per run.) Typically the delay will be at most 30 seconds.

If I weren't getting paged almost every day, I'd say this PR should be
blocked on the second issue (which is resolved by adding queue priority
to swirl). But given the operational load this issue is causing, I think
increasing the worst case delay for index updates is a reasonable
tradeoff for now.

Impl details

I've written the test in a sorta funky way, adding functions to get a
connection in and out of a test DB pool. This was primarily so I could
change the tests to queue the job, and then run any pending jobs,
without too much churn (this would otherwise require having the runner
own the connection, and putting any uses of the connection in braces
since we'd have to fetch it from the pool each time).

This relies on an update to swirl (which is not in master at the time of
writing this commit) for ease of testing. Testing update_downloads
after this change requires actually running the background job. At the
time of writing this, on master that would mean needing to construct a
background_jobs::Environment, which involves cloning git indexes. The
update to swirl means we can have the jobs take a connection directly,
changing their environment type to (), making them much easier to
test.
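
For readers who haven't seen this test setup, the sketch below models the
shape described above: check a connection out of a test pool, enqueue the
job, hand the connection back, then run any pending jobs. Every type and
helper here (TestPool, TestConnection, get, enqueue, run_pending_jobs) is an
illustrative stand-in, not the actual crates.io, Diesel, or swirl API.

```rust
use std::cell::RefCell;
use std::collections::VecDeque;

// Illustrative stand-ins only; the real code uses Diesel connections, an
// r2d2-style pool, and swirl's job runner.
struct TestPool {
    queued_jobs: RefCell<VecDeque<String>>,
}

struct TestConnection<'a> {
    pool: &'a TestPool,
}

impl TestPool {
    fn new() -> Self {
        Self {
            queued_jobs: RefCell::new(VecDeque::new()),
        }
    }

    // Stand-in for checking a connection out of the test DB pool.
    fn get(&self) -> TestConnection<'_> {
        TestConnection { pool: self }
    }

    // Stand-in for the background job runner draining the queue.
    fn run_pending_jobs(&self) {
        while let Some(job) = self.queued_jobs.borrow_mut().pop_front() {
            println!("running job: {job}");
        }
    }
}

impl TestConnection<'_> {
    // Stand-in for enqueueing the update_downloads job on a connection.
    fn enqueue(&self, job: &str) {
        self.pool.queued_jobs.borrow_mut().push_back(job.to_string());
    }
}

fn main() {
    let pool = TestPool::new();

    {
        // The connection is checked out only long enough to enqueue the job,
        // then returned so the runner can use the pool afterwards.
        let conn = pool.get();
        conn.enqueue("update_downloads");
    }

    pool.run_pending_jobs();
}
```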

r? @jtgeibel

@jtgeibel
Member

Note: The first commit has been merged in another PR, so this PR consists of only the 2nd commit, even though the GH diff view shows the full changes.

I've opened an alternative/supplemental approach in #2539. In particular I'm worried about the 2nd drawback described above. I think this approach could still work if index jobs could be given higher priority than the other jobs.

@bors
Contributor

bors commented Oct 29, 2020

☔ The latest upstream changes (presumably #2946) made this pull request unmergeable. Please resolve the merge conflicts.

Note that reviewers usually do not review pull requests until merge conflicts are resolved! Once you resolve the conflicts, you should change the labels applied by bors to indicate that your PR is ready for review. Post this as a comment to change the labels:

@rustbot modify labels: +S-waiting-on-review -S-waiting-on-author

@Turbo87 Turbo87 force-pushed the sg-run-update-downloads-in-batches branch from 9d4b2a3 to b972d8d on December 6, 2020, 14:32
@Turbo87
Member

Turbo87 commented Dec 6, 2020

Since #2539 was merged and deployed, I will close this PR now. If this is still considered relevant, please let me know and reopen 🙏

@Turbo87 Turbo87 closed this Dec 6, 2020