
Dead link validation Redis key size optimisation #2400

Open
sarayourfriend opened this issue Jun 14, 2023 · 2 comments
Labels
💻 aspect: code Concerns the software code in the repository 🧰 goal: internal improvement Improvement that benefits maintainers, not users 🟨 priority: medium Not blocking but should be addressed soon 🧱 stack: api Related to the Django API

Comments

@sarayourfriend
Contributor

sarayourfriend commented Jun 14, 2023

Problem

Dead link validation uses the following template for Redis keys: valid:{result.url}. The URLs for works can be very long, especially from providers like Wikimedia and Rawpixel (both of which are quite common). Even shorter, common URLs, like Flickr's, can easily be optimised.

These keys could be optimised to save an estimated 1 to 1.5 GB in our production Redis.

Description

Change dead link validation keys to use the UUID of the work with the hyphens stripped. This would reduce the relatively small Flickr keys from 120 B to 90 B, and Wikimedia keys from 300 B to 90 B.
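
For illustration, a minimal sketch of the two key formats (the helper names are hypothetical, not the API's actual code):

```python
# Hypothetical sketch of the current and proposed key formats; the real
# logic lives in the API's dead link validation module.

def old_key(url: str) -> str:
    # Current format: the full result URL is embedded in the key,
    # e.g. ~300 B for a typical Wikimedia URL.
    return f"valid:{url}"

def new_key(identifier: str) -> str:
    # Proposed format: the work's UUID with hyphens stripped
    # (32 hex characters), giving a fixed-size key for every provider.
    return f"valid:{identifier.replace('-', '')}"
```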

Cumulatively, across the roughly 26 million keys in production, this would yield the estimated 1 to 1.5 GB reduction.

Caveat: We need to avoid the performance penalty that would come from switching the key format without preserving existing validation data. There are two approaches to this: one is faster but more complicated, the other much slower but far simpler. We will follow the slower approach because:

  1. The performance management/throttling needs of the fast version are unclear to me, and I don't know how to estimate how "fast" processing 26 million keys in this way would actually be.
  2. Our Redis memory usage will go up as a result of Implementation Plan: Fine-grained API response cache management #1969. The fast version is only worth it if we can reduce data usage quickly enough to downgrade our Redis instance and save money before #1969 is implemented. After that IP is implemented, we will almost certainly need to upgrade our Redis instance size again.
  3. It isn't even that "slow" in the context of our project.
  4. It is significantly simpler and makes it much harder for us to introduce bugs or cause problems in our app.

Decided approach: Configure check_dead_links to read cached validation from both key formats but only save to the new one. 120 days after that change is deployed, all keys in the previous format will have expired. At that point we can remove any logic that queries from them. This avoids the complications of the fast version but takes 120 days.
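
A rough sketch of what the dual-read transition could look like, assuming a redis-py client and the hypothetical old_key/new_key helpers sketched above (the TTL constant and function names are also assumptions):

```python
import redis

DEAD_LINK_TTL = 60 * 60 * 24 * 120  # 120 days, matching the expiry window above

def get_cached_status(client: redis.Redis, result) -> bytes | None:
    # During the transition, prefer the new key format and fall back
    # to the legacy URL-based key for data cached before the change.
    status = client.get(new_key(result.identifier))
    if status is None:
        status = client.get(old_key(result.url))
    return status

def cache_status(client: redis.Redis, result, status: int) -> None:
    # Save only under the new format; legacy keys simply age out
    # once their original TTL elapses.
    client.set(new_key(result.identifier), status, ex=DEAD_LINK_TTL)
```

Note that the two formats can share the valid: prefix without colliding, since a URL always contains a scheme and a bare 32-character hex identifier never does.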

An alternative, potentially faster approach was considered but rejected for the reasons above:

Switching the key format without migrating existing data would come with a potentially significant search penalty, because links for which we already have validation information would be reprocessed. Porting existing keys to the new format would require:

- Querying Elasticsearch using url.keyword to resolve each URL to its work's identifier, since URL is not indexed in the API database.
- Iterating through the Redis keys with SCAN, which is necessary to prevent performance issues.
- Copying each key's TTL to the new key, using the TTL command to retrieve it.
- Throttling the operation to avoid a performance impact on production Redis and Elasticsearch.
- Using a new key prefix, to prevent reprocessing of already converted keys and to keep the SCAN from running forever. This would also allow the check_dead_links function to temporarily check both prefixes and key formats during the transition, and to start saving new validation data under the new format without suffering a penalty and without adding new data for the key re-write process to manage.
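
For reference, the rejected migration might have looked roughly like this sketch (the index name, new prefix, batch size, and throttle are all illustrative assumptions; error handling is omitted):

```python
import time

import redis
from elasticsearch import Elasticsearch

def migrate_validation_keys(client: redis.Redis, es: Elasticsearch) -> None:
    # SCAN (not KEYS) so iterating ~26 million keys never blocks Redis.
    for key in client.scan_iter(match="valid:*", count=500):
        url = key.decode().removeprefix("valid:")
        # URL is not indexed in the API database, so resolve the work's
        # identifier through Elasticsearch's url.keyword field.
        hits = es.search(
            index="image",  # illustrative index name
            query={"term": {"url.keyword": url}},
            size=1,
        )["hits"]["hits"]
        if not hits:
            continue  # leave unmatched keys to expire on their own TTL
        identifier = hits[0]["_source"]["identifier"].replace("-", "")
        ttl = client.ttl(key)  # carry the remaining TTL over to the new key
        if ttl > 0:
            # A distinct prefix ("valid-id:") keeps this SCAN from ever
            # revisiting keys that have already been converted.
            client.set(f"valid-id:{identifier}", client.get(key), ex=ttl)
        client.delete(key)
        time.sleep(0.01)  # crude throttle to protect production Redis and ES
```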

@sarayourfriend sarayourfriend added 🟧 priority: high Stalls work on the project or its dependents 💻 aspect: code Concerns the software code in the repository 🧰 goal: internal improvement Improvement that benefits maintainers, not users 🧱 stack: api Related to the Django API labels Jun 14, 2023
@zackkrida
Member

I think the slow version sounds perfectly acceptable, and not even that slow relative to this project. 👍

@obulat obulat added 🟨 priority: medium Not blocking but should be addressed soon and removed 🟧 priority: high Stalls work on the project or its dependents labels Jul 11, 2023
@obulat
Contributor

obulat commented Jul 11, 2023

Lowered the priority to medium because we presently have sufficient storage capacity (per the weekly developer chat discussion).
