Summary
Develop and document a process for handling dead links in the Catalog, in order to make dead link validation in the API faster.
Details
Dead link validation is a critical part of the API, but it sometimes slows down API responses. Currently, we do not have a well-described process for removing dead links from the Catalog. We need:
- Well documented, clear criteria for when a link is considered dead:
  - Is one 404 response enough to consider a link dead, or should we define a threshold of recorded 404 responses that must be met? (A threshold-based check is sketched after this list.)
  - Should we validate only the direct URL, only the foreign landing URL, or both?
- A process/pipeline for handling dead links identified in the API, in the Catalog.
  - One option would be a daily DAG that saves the links marked as dead in the Redis cache to a file in S3 (see the DAG sketch after this list).
- Well documented process for how the Catalog should handle these records, once identified:
  - Should the records have `removed_from_source` set, perhaps using the `batched_update` DAG? (An illustrative conf is sketched after this list.)
  - Should the records be removed from the Catalog database entirely, or moved to a separate "dead_links" database?
  - Should the records be moved to the existing `DeletedMedia` tables?
  - Should we keep parquet files of the removed records instead of moving them to a different database or table?
  - How should the data refresh handle these records?
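To make the threshold question concrete, here is a minimal sketch of what a threshold-based check could look like, assuming the API tracks per-record 404 counts in Redis. The `dead_link_404s:` key prefix, the threshold value, and the TTL window are all hypothetical and not yet decided:

```python
"""Minimal sketch of threshold-based dead link detection.

Assumes the API increments a per-record 404 counter in Redis under a
hypothetical key scheme (`dead_link_404s:<identifier>`); the prefix,
threshold, and TTL below are illustrative, not decided.
"""

import redis

DEAD_LINK_404_THRESHOLD = 3  # hypothetical: N recorded 404s before a link is "dead"
COUNTER_TTL_SECONDS = 60 * 60 * 24 * 30  # hypothetical 30-day counting window

client = redis.Redis(host="localhost", port=6379, decode_responses=True)


def record_404(identifier: str) -> bool:
    """Record one 404 response for a record; return True once the
    threshold is met and the link should be treated as dead."""
    key = f"dead_link_404s:{identifier}"
    count = client.incr(key)
    # Refresh the TTL so counts only accumulate within the window.
    client.expire(key, COUNTER_TTL_SECONDS)
    return count >= DEAD_LINK_404_THRESHOLD
```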
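A rough sketch of the proposed daily export DAG follows, assuming dead links are flagged in Redis under a hypothetical `dead_link:*` key pattern and that an S3 bucket exists for these exports; the bucket name, key pattern, and connection details are illustrative:

```python
"""Sketch of a daily DAG exporting Redis-flagged dead links to S3.

The `dead_link:*` key pattern and `openverse-dead-links` bucket are
hypothetical placeholders, not existing infrastructure.
"""

import json
from datetime import datetime

import boto3
import redis
from airflow.decorators import dag, task


@dag(schedule="@daily", start_date=datetime(2024, 1, 1), catchup=False)
def export_dead_links():
    @task
    def export_to_s3():
        client = redis.Redis(host="localhost", port=6379, decode_responses=True)
        # Use SCAN (not KEYS) to avoid blocking Redis while collecting flags.
        identifiers = [
            key.removeprefix("dead_link:")
            for key in client.scan_iter(match="dead_link:*")
        ]
        boto3.client("s3").put_object(
            Bucket="openverse-dead-links",  # hypothetical bucket name
            Key=f"dead_links/{datetime.utcnow():%Y-%m-%d}.json",
            Body=json.dumps({"identifiers": identifiers}),
        )

    export_to_s3()


export_dead_links()
```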
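Finally, if we go the `removed_from_source` route, the update could be expressed as a `batched_update` run along these lines. The conf shape below follows the general pattern of the catalog's `batched_update` DAG but should be verified against its documentation, and the `dead_link_staging` table is a hypothetical intermediate loaded from the S3 export:

```python
"""Illustrative conf for a `batched_update` run flagging dead links.

Parameter names should be checked against the batched_update DAG's docs;
`dead_link_staging` is a hypothetical staging table loaded from the daily
dead-links export in S3.
"""

dead_link_update_conf = {
    "query_id": "mark_dead_links_2023_12",
    "table_name": "image",
    # Select rows whose identifiers appear in the staging table.
    "select_query": (
        "WHERE identifier IN (SELECT identifier FROM dead_link_staging)"
    ),
    "update_query": "SET removed_from_source = true",
    "dry_run": True,  # start with a dry run before touching production rows
}
```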
Documents
- Project Proposal
- Implementation Plan(s)
- Documentation for the criteria for identifying dead links
- Process for handling dead links in the Catalog, using information from the Redis cache
Milestones/Issues
Prior Art
This project combines the project ideas of "Establish Guidelines and Practices for Dead links" and "Set up Dead links Removal Pipeline Using Redis Cache".