
Use the batched_update DAG with stored CSVs to update Catalog URLs #3415

Closed

Problem

We have generated some CSVs containing an identifier column and another column that we need to use to update the Catalog media table, but we don't have a way to run those media table updates efficiently.

Description

The batched update DAG is a reusable DAG that can be used to perform an arbitrary batched update on a Catalog media table while handling deadlocking and timeout concerns.
During the cleanup process in the data refresh, we generate CSVs that contain the item identifier and the cleaned-up version of another column (title, url, foreign_landing_url, creator_url, or tags). We need a DAG similar to the batched update DAG, but one that can use a table loaded from a CSV to select the items that need to be updated.
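
As a rough sketch of what the new DAG's update step could look like: stage the CSV into a temporary table, then apply it in batches joined on identifier. Everything below (table names, batch size, connection handling, the assumption that the CSVs have a header row) is illustrative only, not the batched_update DAG's actual interface:

```python
# Illustrative sketch only; the real batched_update DAG's interface may differ.
# Stages the CSV (identifier + cleaned column) into a temp table, then applies
# the update in small batches so each transaction stays short, which is what
# keeps deadlocks and statement timeouts at bay.
import psycopg2

BATCH_SIZE = 10_000  # assumed batch size; tune for lock duration


def batched_update_from_csv(conn_str, csv_path, media_table, column):
    # `media_table` and `column` are trusted, allow-listed names here, never
    # user input, since they are interpolated into the SQL directly.
    with psycopg2.connect(conn_str) as conn, conn.cursor() as cur:
        cur.execute(
            f"CREATE TEMP TABLE rows_to_update (identifier uuid, {column} text)"
        )
        with open(csv_path) as f:
            cur.copy_expert(
                f"COPY rows_to_update (identifier, {column}) "
                "FROM STDIN WITH CSV HEADER",
                f,
            )
        while True:
            # Consume one batch from the temp table and apply it to the
            # media table, matching on identifier.
            cur.execute(
                f"""
                WITH batch AS (
                    DELETE FROM rows_to_update
                    WHERE identifier IN (
                        SELECT identifier FROM rows_to_update LIMIT %s
                    )
                    RETURNING identifier, {column}
                )
                UPDATE {media_table} m
                SET {column} = batch.{column}
                FROM batch
                WHERE m.identifier = batch.identifier
                """,
                (BATCH_SIZE,),
            )
            conn.commit()  # end the transaction so locks are released
            cur.execute("SELECT EXISTS (SELECT 1 FROM rows_to_update)")
            if not cur.fetchone()[0]:
                break
```

The DELETE … RETURNING CTE is one way to make each batch consume its rows exactly once; the actual DAG would presumably express this as Airflow tasks with retries rather than a single loop.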

It is important that this work does not delete any tags. The tags column, while present in the CSVs, should not be used.
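
One way to enforce that, purely as an illustration, is to validate the target column against an allow-list before any SQL is built, so tags can never be selected:

```python
# Illustrative guard: only these columns may be updated from the CSVs.
# The `tags` column is deliberately absent, so it can never be written.
UPDATABLE_COLUMNS = {"title", "url", "foreign_landing_url", "creator_url"}


def validated_column(column: str) -> str:
    if column not in UPDATABLE_COLUMNS:
        raise ValueError(
            f"Refusing to update {column!r}; "
            f"only {sorted(UPDATABLE_COLUMNS)} are allowed"
        )
    return column
```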

Additional context

The CSV files are saved in the Docker container of the ingestion server when we run a data refresh.
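
For anyone who needs to get at those files, here is a hedged example of copying them out of the container with `docker cp`; the container name and in-container path below are assumptions that depend on the local setup:

```python
# Assumption-heavy sketch: copy the generated CSVs out of the ingestion
# server's container. Both names below are hypothetical.
import subprocess

CONTAINER = "ingestion_server"   # hypothetical container name
CSV_PATH_IN_CONTAINER = "/tmp"   # hypothetical location of the CSVs


def copy_csvs(destination: str = "./cleanup_csvs") -> None:
    subprocess.run(
        ["docker", "cp", f"{CONTAINER}:{CSV_PATH_IN_CONTAINER}", destination],
        check=True,
    )
```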


Labels

🌟 goal: addition · 💻 aspect: code · 🟧 priority: high · 🧱 stack: catalog
