Reduce database writes for the download endpoint #3413


Merged: 4 commits into rust-lang:master on Mar 16, 2021

Conversation

pietroalbini (Member)

crates.io receives a lot of download requests, and we're reaching a point where executing a write query to PostgreSQL for each request is causing operational issues. Over the past few weeks we've been getting regular pages because download requests were slow to execute or outright failed for a couple of seconds at a time, as the database was overwhelmed.

This PR avoids that problem by collecting download counts in memory and periodically persisting them to the database through the new downloads_counter module, hopefully reducing the write load on our primary database.

The implementation is designed to minimize locking, to prevent requests from being unnecessarily delayed. Instead of storing the counters in a RwLock<HashMap<i32, AtomicUsize>>, which would require locking the whole map whenever a new version needs to be tracked, this uses the dashmap crate.

DashMap offers an API similar to the standard library's HashMap, but it stores the items in num_cpus()*4 individually locked RwLock<HashMap<K, V>>s (called "shards"), picking the shard based on the key's hash. Splitting the items into shards reduces how much of the DashMap is locked at any given time, so concurrent download requests only block each other when they hit the same shard.
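
As an illustration, here is a minimal sketch of the counter described above, using dashmap's Entry API; the names are illustrative, and the real type lives in the new downloads_counter module:

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use dashmap::DashMap;

/// Minimal sketch of an in-memory downloads counter backed by DashMap.
pub struct DownloadsCounter {
    counters: DashMap<i32, AtomicUsize>,
}

impl DownloadsCounter {
    pub fn new() -> Self {
        Self { counters: DashMap::new() }
    }

    /// Called on every download request: only the shard containing
    /// `version_id` is locked, and only for the duration of the lookup.
    pub fn increment(&self, version_id: i32) {
        self.counters
            .entry(version_id)
            .or_insert_with(|| AtomicUsize::new(0))
            .fetch_add(1, Ordering::Relaxed);
    }
}
```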

Persisting download counts also takes advantage of sharding: to avoid "stopping the world" when the background thread persists the counters, only one shard is persisted at a time. The current configuration persists all the shards over the span of one minute, and the interval is configurable at runtime with the DOWNLOADS_PERSIST_INTERVAL_MS environment variable.

To avoid losing downloads, the code also persists all the shards at once when the server gracefully shuts down. It's still possible to lose download counts if the server is killed without a chance to shut down gracefully (for example due to platform issues), but losing less than a minute of download counts is far less important than reducing database load.
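
To make the scheduling concrete, here is a hedged sketch of the background loop and shutdown path described above; persist_shard is a hypothetical helper standing in for the code that flushes one shard's counts to PostgreSQL:

```rust
use std::time::Duration;

/// Sketch of the background persistence loop (illustrative, not the
/// actual crates.io code). `interval` corresponds to the value of
/// DOWNLOADS_PERSIST_INTERVAL_MS, spread evenly across the shards so the
/// whole map is flushed once per interval while only ever locking one
/// shard at a time.
fn run_persist_loop(counter: &DownloadsCounter, num_shards: usize, interval: Duration) {
    let sleep_between_shards = interval / num_shards as u32;
    let mut shard = 0;
    loop {
        counter.persist_shard(shard); // hypothetical: flush one shard to the DB
        shard = (shard + 1) % num_shards;
        std::thread::sleep(sleep_between_shards);
    }
}

/// On graceful shutdown every shard is flushed immediately, so at most
/// the counts accumulated since the last per-shard flush can be lost.
fn persist_on_shutdown(counter: &DownloadsCounter, num_shards: usize) {
    for shard in 0..num_shards {
        counter.persist_shard(shard);
    }
}
```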

r? @jtgeibel

@pietroalbini (Member Author)

Opened the PR early so the review can start ahead of time, but this PR is still lacking tests (which we should add before deploying it).

```rust
// Review context: tail end of a println!() call logging persistence stats.
    counted_versions,
    counted_downloads,
    pending_downloads,
);
```
Member

Out of curiosity, is there a reason why we're using println!() everywhere instead of the log or tracing macros?

Member Author

I don't think there is really a reason, but I don't see value in switching to another library right now.

@jtgeibel (Member) left a comment

Looks great to me overall

@pietroalbini (Member Author)

The PR has been deployed to staging.crates.io and it seems to correctly persist download counts!

There are two things that are left to do:

  • Add automated tests for DownloadsCounter itself: it's currently exercised indirectly by the download test, but it'd be better to add in-depth tests for it, as it's going to be one of the systems in the critical path for download requests.
  • Ensure pending downloads are persisted when the application is rebooted: this works locally and the code should be implemented correctly, but a brief test on staging shows it might not work on Heroku.

As we're getting a lot of pages about this, I don't think it's critical to address those two issues before deploying to production, but we should make sure we land follow-up PRs for them.

@jtgeibel (Member)

Looks great to me! On my local setup, this improved the endpoint from 876 rps to 14,068 rps! (Though I don't have many versions, so this is mostly just testing atomics, and not any sort of contention on the DashMap.)

@bors r+

@bors (Contributor) commented Mar 16, 2021

📌 Commit f646e45 has been approved by jtgeibel

@bors (Contributor) commented Mar 16, 2021

⌛ Testing commit f646e45 with merge bc58bb7...

@bors (Contributor) commented Mar 16, 2021

☀️ Test successful - checks-actions
Approved by: jtgeibel
Pushing bc58bb7 to master...

@bors bors merged commit bc58bb7 into rust-lang:master Mar 16, 2021
@pietroalbini pietroalbini deleted the downloads-counter branch March 16, 2021 08:47
bors added a commit that referenced this pull request Mar 16, 2021
Prevent transaction deadlocks in DownloadsCounter

While analyzing the behavior of `DownloadsCounter` (introduced in #3413) in production, we noticed that there were a few cases of transaction deadlocks tonight in the background thread persisting the download counts.

I did some investigation on the cause and came up with this fix. As [recommended by the PostgreSQL docs](https://www.postgresql.org/docs/11/explicit-locking.html#LOCKING-DEADLOCKS), to avoid deadlocks when multiple transactions affect multiple rows in a table, the affected keys need to be modified in a consistent order across the transactions. We implement that by sorting the items we're going to insert by `version_id`.
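
A minimal sketch of that ordering fix, under the assumption that the drained counters are collected into a Vec before being inserted (illustrative names, not the exact crates.io code):

```rust
/// Sort the drained (version_id, downloads) pairs before building the
/// INSERT ... ON CONFLICT statement, so concurrent transactions always
/// touch the version_downloads rows in the same order and can't deadlock.
fn sort_for_insert(mut counted: Vec<(i32, u64)>) -> Vec<(i32, u64)> {
    counted.sort_unstable_by_key(|&(version_id, _)| version_id);
    counted
}
```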

<details><summary>Local reproduction instructions</summary>

The exact error we saw in production can be reproduced by opening two instances of `psql` locally and running these two sets of queries, one in each `psql` (without running `COMMIT`):

```sql
-- First instance of psql
BEGIN;
INSERT INTO version_downloads (version_id) VALUES (1) ON CONFLICT (version_id, date) DO UPDATE SET downloads = excluded.downloads + 1;

-- Second instance of psql
BEGIN;
INSERT INTO version_downloads (version_id) VALUES (2) ON CONFLICT (version_id, date) DO UPDATE SET downloads = excluded.downloads + 1;
```

Then, once the commands are sent to both instances of `psql`, sending these other commands will cause the deadlock:

```sql
-- First instance of psql
INSERT INTO version_downloads (version_id) VALUES (2) ON CONFLICT (version_id, date) DO UPDATE SET downloads = excluded.downloads + 1;

-- Second instance of psql
INSERT INTO version_downloads (version_id) VALUES (1) ON CONFLICT (version_id, date) DO UPDATE SET downloads = excluded.downloads + 1;
```

Instead, if both transactions try to insert first `1` and then `2`, the first `psql` will work fine, and the second `psql` will wait for the first one to `COMMIT` before applying its changes. That's the behavior we want, and it's the one this PR implements.

</details>

r? `@jtgeibel`
@pietroalbini (Member Author)

Opened #3415 and #3416 to handle the outstanding work on this.
