
Dead link validation Redis key size optimisation #2400

Open
sarayourfriend opened this issue Jun 14, 2023 · 2 comments
Labels
💻 aspect: code Concerns the software code in the repository 🧰 goal: internal improvement Improvement that benefits maintainers, not users 🟨 priority: medium Not blocking but should be addressed soon 🧱 stack: api Related to the Django API

Comments

@sarayourfriend
Contributor

sarayourfriend commented Jun 14, 2023

Problem

Dead link validation uses the following template for Redis keys: valid:{result.url}. The URLs for works can be very long, especially from providers like Wikimedia and Rawpixel (both of which are quite common). Even shorter, common URLs, like Flickr's, can easily be optimised.

These keys could be optimised to save an estimated 1 to 1.5 GB in our production Redis.

Description

Change dead link validation keys to use the UUID of the work with the hyphens stripped. This would reduce the relatively small Flickr keys from 120 B to 90 B, and Wikimedia keys from 300 B to 90 B.
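
For illustration, a minimal sketch of the two key formats (the helper names are hypothetical, not the API's actual code):

```python
# Hypothetical sketch of the current and proposed key formats; the real
# logic lives in the API's dead link validation module.

def old_key(url: str) -> str:
    # Current format: the full result URL is embedded in the key,
    # e.g. ~300 B for a typical Wikimedia URL.
    return f"valid:{url}"

def new_key(identifier: str) -> str:
    # Proposed format: the work's UUID with hyphens stripped
    # (32 hex characters), giving a fixed-size key for every provider.
    return f"valid:{identifier.replace('-', '')}"
```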

Cumulatively, across the roughly 26 million keys in production, this would yield the estimated 1 to 1.5 GB reduction.

Caveat: We need to avoid the performance penalty that would come from switching the key format without preserving existing validation data. There are two approaches to this: one is faster but more complicated, the other much slower but far simpler. We will follow the slower approach because:

  1. The performance management/throttling needs of the fast version are unclear to me, and I don't know how to estimate how "fast" processing 26 million keys in this way would actually be.
  2. Our Redis memory usage will go up as a result of Implementation Plan: Fine-grained API response cache management #1969. The fast version is only worth it if we can reduce data usage quickly enough to downgrade our Redis instance and save money before #1969 is implemented. After that IP is implemented, we will almost certainly need to upgrade our Redis instance size again.
  3. It isn't even that "slow" in the context of our project.
  4. It is significantly simpler and makes it much harder for us to introduce bugs or cause problems in our app.

Decided approach: Configure check_dead_links to read cached validation from both key formats but only save to the new one. 120 days after that change is deployed, all keys in the previous format will have expired. At that point we can remove any logic that queries from them. This avoids the complications of the fast version but takes 120 days.
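
A rough sketch of what the dual-read transition could look like, assuming a redis-py client and the hypothetical old_key/new_key helpers sketched above (the TTL constant and function names are also assumptions):

```python
import redis

DEAD_LINK_TTL = 60 * 60 * 24 * 120  # 120 days, matching the expiry window above

def get_cached_status(client: redis.Redis, result) -> bytes | None:
    # During the transition, prefer the new key format and fall back
    # to the legacy URL-based key for data cached before the change.
    status = client.get(new_key(result.identifier))
    if status is None:
        status = client.get(old_key(result.url))
    return status

def cache_status(client: redis.Redis, result, status: int) -> None:
    # Save only under the new format; legacy keys simply age out
    # once their original TTL elapses.
    client.set(new_key(result.identifier), status, ex=DEAD_LINK_TTL)
```

Note that the two formats can share the valid: prefix without colliding, since a URL always contains a scheme and a bare 32-character hex identifier never does.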

An alternative, potentially faster approach was considered but rejected for the reasons above:

Switching the key format without migrating existing data would come with a potentially significant search penalty, because links for which we already have validation information would be reprocessed. Porting existing keys to the new format would require:

- Querying Elasticsearch using url.keyword to resolve each URL to its work's identifier, since URL is not indexed in the API database.
- Iterating through the Redis keys with SCAN, which is necessary to prevent performance issues.
- Copying each key's TTL to the new key, using the TTL command to retrieve it.
- Throttling the operation to avoid a performance impact on production Redis and Elasticsearch.
- Using a new key prefix, to prevent reprocessing of already converted keys and to keep the SCAN from running forever. This would also allow the check_dead_links function to temporarily check both prefixes and key formats during the transition, and to start saving new validation data under the new format without suffering a penalty and without adding new data for the key re-write process to manage.
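
For reference, the rejected migration might have looked roughly like this sketch (the index name, new prefix, batch size, and throttle are all illustrative assumptions; error handling is omitted):

```python
import time

import redis
from elasticsearch import Elasticsearch

def migrate_validation_keys(client: redis.Redis, es: Elasticsearch) -> None:
    # SCAN (not KEYS) so iterating ~26 million keys never blocks Redis.
    for key in client.scan_iter(match="valid:*", count=500):
        url = key.decode().removeprefix("valid:")
        # URL is not indexed in the API database, so resolve the work's
        # identifier through Elasticsearch's url.keyword field.
        hits = es.search(
            index="image",  # illustrative index name
            query={"term": {"url.keyword": url}},
            size=1,
        )["hits"]["hits"]
        if not hits:
            continue  # leave unmatched keys to expire on their own TTL
        identifier = hits[0]["_source"]["identifier"].replace("-", "")
        ttl = client.ttl(key)  # carry the remaining TTL over to the new key
        if ttl > 0:
            # A distinct prefix ("valid-id:") keeps this SCAN from ever
            # revisiting keys that have already been converted.
            client.set(f"valid-id:{identifier}", client.get(key), ex=ttl)
        client.delete(key)
        time.sleep(0.01)  # crude throttle to protect production Redis and ES
```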

@sarayourfriend sarayourfriend added 🟧 priority: high Stalls work on the project or its dependents 💻 aspect: code Concerns the software code in the repository 🧰 goal: internal improvement Improvement that benefits maintainers, not users 🧱 stack: api Related to the Django API labels Jun 14, 2023
@zackkrida
Member

I think the slow version sounds perfectly acceptable, and not even that slow relative to this project. 👍

@obulat obulat added 🟨 priority: medium Not blocking but should be addressed soon and removed 🟧 priority: high Stalls work on the project or its dependents labels Jul 11, 2023
@obulat
Contributor

obulat commented Jul 11, 2023

Lowered the priority to medium because we presently have sufficient storage capacity (per the weekly developer chat discussion).
