Dead link data incorporation #3585

Open

Description

Summary

Develop and document a process for handling dead links in the Catalog, in order to make dead link validation in the API faster.

Details

Dead link validation is a critical part of the API, and it sometimes slows down API responses. Currently, we do not have a well-described process for removing dead links from the Catalog. We need:

  • Well-documented, clear criteria for when a link is considered dead:
    • Is one 404 response enough to consider a link dead, or should we define a threshold of recorded 404 responses that must be met?
    • Should we validate only the direct URL, or the foreign landing URL, or both?
  • A process/pipeline for handling, in the Catalog, the dead links identified by the API.
    • One option would be a daily DAG that saves the links marked as dead in the Redis cache to a file in S3.
  • A well-documented process for how the Catalog should handle these records once they are identified:
    • Should the records have removed_from_source set, perhaps using the batched_update DAG?
    • Should the records be removed from the Catalog database entirely, or moved to a separate "dead_links" database?
    • Should the records be moved to the existing DeletedMedia tables?
    • Should we keep parquet files of the removed records instead of moving them to a different database or table?
    • How should the data refresh handle these records?
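
To make the threshold question above concrete, here is a minimal sketch of one possible criterion, written in Python. Everything in it is an assumption for illustration: the threshold of 3, the choice to count only 404 responses, the function names `is_dead` and `dead_links_as_jsonl`, and the input shape (a mapping from URL to its recorded status codes) are hypothetical and do not describe how the API's Redis cache is actually laid out. The second function shows how the links that meet the criterion could be serialized as JSON lines, for example as the body of a file that a daily DAG writes to S3.

```python
import json
from collections.abc import Iterable, Mapping

# Hypothetical values: the issue leaves both the threshold and the set of
# status codes that count as "dead" as open questions.
DEAD_STATUSES = frozenset({404})
DEAD_THRESHOLD = 3


def is_dead(statuses: Iterable[int], threshold: int = DEAD_THRESHOLD) -> bool:
    """Return True when at least `threshold` recorded responses are dead statuses."""
    dead_count = sum(1 for status in statuses if status in DEAD_STATUSES)
    return dead_count >= threshold


def dead_links_as_jsonl(
    checks: Mapping[str, Iterable[int]], threshold: int = DEAD_THRESHOLD
) -> str:
    """Serialize the links that meet the criterion as JSON lines, sorted by URL."""
    lines = []
    for url in sorted(checks):
        statuses = list(checks[url])
        if is_dead(statuses, threshold):
            lines.append(json.dumps({"url": url, "statuses": statuses}))
    return "\n".join(lines)


if __name__ == "__main__":
    # Made-up sample data, not real Catalog records.
    sample = {
        "https://example.com/a.jpg": [404, 404, 404],
        "https://example.com/b.jpg": [404, 200, 404],
    }
    print(dead_links_as_jsonl(sample))
```

With a threshold of 1 the same function models the "one 404 is enough" option, so both alternatives raised in the criteria question can be compared against the same recorded data.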

Documents

  • Project Proposal
  • Implementation Plan(s)
    • Documentation for the criteria for identifying dead links
    • Process for handling dead links in the Catalog, using information from the Redis cache

Milestones/Issues

Prior Art

This project combines two earlier project ideas: "Establish Guidelines and Practices for Dead links" and "Set up Dead links Removal Pipeline Using Redis Cache".

Metadata

    Labels

    ✨ goal: improvement (Improvement to an existing user-facing feature)
    💻 aspect: code (Concerns the software code in the repository)
    🧭 project: thread (An issue used to track a project and its progress)
    🧱 stack: catalog (Related to the catalog and Airflow DAGs)