Note: this document is auto-generated and should not be manually edited.

This document describes the available DAGs, along with pertinent information about each DAG and its documentation. The DAGs are shown in two forms: first grouped by their primary tag, then with the documentation associated with each DAG (where available).

## DAGs by Tag

The following are the DAGs grouped by their primary tag:
### Commoncrawl

| DAG ID | Schedule Interval |
| --- | --- |
| `commoncrawl_etl_workflow` | `0 0 * * 1` |
| `sync_commoncrawl_workflow` | `0 16 15 * *` |
### Data Refresh

| DAG ID | Schedule Interval |
| --- | --- |
| `audio_data_refresh` | `@weekly` |
| `image_data_refresh` | `@weekly` |
### Database

| DAG ID | Schedule Interval |
| --- | --- |
| `image_expiration_workflow` | `None` |
| `recreate_audio_popularity_calculation` | `None` |
| `recreate_image_popularity_calculation` | `None` |
| `report_pending_reported_media` | `@weekly` |
### Maintenance

| DAG ID | Schedule Interval |
| --- | --- |
| `airflow_log_cleanup` | `@weekly` |
| `check_silenced_dags` | `@weekly` |
| `pr_review_reminders` | `0 0 * * 1-5` |
### OAuth

| DAG ID | Schedule Interval |
| --- | --- |
| `oauth2_authorization` | `None` |
| `oauth2_token_refresh` | `0 */12 * * *` |
### Provider

| DAG ID | Schedule Interval | Dated | Media Type(s) |
| --- | --- | --- | --- |
| `brooklyn_museum_workflow` | `@monthly` | False | image |
| `cleveland_museum_workflow` | `@monthly` | False | image |
| `europeana_workflow` | `@daily` | True | image |
| `finnish_museums_workflow` | `@monthly` | False | image |
| `flickr_workflow` | `@daily` | True | image |
| `freesound_workflow` | `@monthly` | False | audio |
| `inaturalist_workflow` | `@monthly` | False | image |
| `jamendo_workflow` | `@monthly` | False | audio |
| `metropolitan_museum_workflow` | `@daily` | True | image |
| `museum_victoria_workflow` | `@monthly` | False | image |
| `nypl_workflow` | `@monthly` | False | image |
| `phylopic_workflow` | `@daily` | True | image |
| `rawpixel_workflow` | `@monthly` | False | image |
| `science_museum_workflow` | `@monthly` | False | image |
| `smithsonian_workflow` | `@weekly` | False | image |
| `smk_workflow` | `@monthly` | False | image |
| `stocksnap_workflow` | `@monthly` | False | image |
| `wikimedia_commons_workflow` | `@daily` | True | image, audio |
| `wordpress_workflow` | `@monthly` | False | image |
### Provider Reingestion

| DAG ID | Schedule Interval |
| --- | --- |
| `europeana_reingestion_workflow` | `@weekly` |
| `flickr_reingestion_workflow` | `@weekly` |
| `metropolitan_museum_reingestion_workflow` | `@weekly` |
| `phylopic_reingestion_workflow` | `@weekly` |
| `wikimedia_reingestion_workflow` | `@weekly` |
## DAG Documentation

The following is the documentation associated with each DAG (where available):

- `airflow_log_cleanup`
- `audio_data_refresh`
- `check_silenced_dags`
- `europeana_reingestion_workflow`
- `europeana_workflow`
- `flickr_reingestion_workflow`
- `flickr_workflow`
- `freesound_workflow`
- `image_data_refresh`
- `inaturalist_workflow`
- `jamendo_workflow`
- `metropolitan_museum_reingestion_workflow`
- `metropolitan_museum_workflow`
- `oauth2_authorization`
- `oauth2_token_refresh`
- `phylopic_reingestion_workflow`
- `phylopic_workflow`
- `pr_review_reminders`
- `rawpixel_workflow`
- `recreate_audio_popularity_calculation`
- `recreate_image_popularity_calculation`
- `report_pending_reported_media`
- `smithsonian_workflow`
- `stocksnap_workflow`
- `wikimedia_commons_workflow`
- `wikimedia_reingestion_workflow`
- `wordpress_workflow`
### `airflow_log_cleanup`

A maintenance workflow that you can deploy into Airflow to periodically clean out the task logs and prevent them from growing too large. By default, this will also clean child process logs from the 'scheduler' directory.

All log files can be removed by setting `maxLogAgeInDays` to `-1`. If you want to test the DAG in the Airflow Web UI, you can also set `enableDelete` to `false`; the DAG will then list the log folders that would be deleted, without actually deleting them.

This should all go on one line:

```
airflow dags trigger --conf '{"maxLogAgeInDays":-1, "enableDelete": "false"}' airflow_log_cleanup
```

`--conf` options:

- `maxLogAgeInDays`: optional; the maximum age, in days, of log files to keep (`-1` removes all log files).
- `enableDelete`: optional; whether to actually delete the log folders rather than only listing them.
### `audio_data_refresh`

This file generates our data refresh DAGs using a factory function. For the given media type, these DAGs will first refresh the popularity data, then initiate a data refresh on the data refresh server and await the success or failure of that task.

Popularity data for each media type is collated in a materialized view. Before initiating a data refresh, the DAG will first refresh the view in order to update popularity data for records that have been ingested since the last refresh. On the first run of the month, the DAG will also refresh the underlying tables, including the percentile values and any new popularity metrics. The DAG can also be run with the `force_refresh_metrics` option to run this refresh after the first of the month.

Once this step is complete, the data refresh can be initiated. A data refresh occurs on the data refresh server in the openverse-api project. This is a task which imports data from the upstream Catalog database into the API, copies contents to a new Elasticsearch index, and finally makes the index "live". This process is necessary to make new content added to the Catalog by our provider DAGs available to the API. You can read more in the README. Importantly, the data refresh TaskGroup is also configured to handle the concurrency requirements of the data refresh server.

You can find more background information on this process in the following issues and related PRs:
You can find more background information on this process in the following issues and related PRs:
- [Feature] Data refresh orchestration DAG
- [Feature] Merge popularity calculations and data refresh into a single DAG
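The factory pattern described above can be illustrated with a minimal sketch; the DAG structure, task IDs, and callables below are illustrative assumptions, not the actual Openverse implementation:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def create_data_refresh_dag(media_type: str) -> DAG:
    """Build a data refresh DAG for a single media type."""
    with DAG(
        dag_id=f"{media_type}_data_refresh",
        schedule_interval="@weekly",
        start_date=datetime(2021, 1, 1),
        catchup=False,
    ) as dag:
        # Refresh the popularity materialized view for this media type.
        refresh_popularity = PythonOperator(
            task_id="refresh_popularity_view",
            python_callable=lambda: None,  # placeholder for the SQL refresh
        )
        # Kick off the refresh on the data refresh server and await the result.
        trigger_data_refresh = PythonOperator(
            task_id="trigger_data_refresh",
            python_callable=lambda: None,  # placeholder for the server call
        )
        refresh_popularity >> trigger_data_refresh
    return dag


# One DAG per media type, exposed at module level so Airflow discovers them.
for media_type in ("audio", "image"):
    globals()[f"{media_type}_data_refresh"] = create_data_refresh_dag(media_type)
```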
### `check_silenced_dags`

Checks for DAGs that have silenced Slack alerts which may need to be turned back on.

When a DAG has known failures, it can be omitted from Slack error reporting by adding an entry to the `SILENCED_SLACK_NOTIFICATIONS` Airflow Variable. This is a dictionary where the key is the `dag_id` of the affected DAG, and the value is a list of `SilencedSlackNotifications` (which map silenced notifications to GitHub URLs) for that DAG.

The `check_silenced_dags` DAG iterates over the entries in the `SILENCED_SLACK_NOTIFICATIONS` configuration and verifies that the associated GitHub issues are still open. If an issue has been closed, it is assumed that the DAG should have Slack reporting re-enabled, and an alert is sent to prompt a manual update of the configuration. This prevents developers from forgetting to re-enable Slack reporting after the issue has been resolved.

The DAG runs weekly.
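For illustration, an entry in the Variable might look roughly like the following; the field names inside each notification are assumptions about the shape, not the exact schema:

```python
# Hypothetical contents of the SILENCED_SLACK_NOTIFICATIONS Airflow Variable.
SILENCED_SLACK_NOTIFICATIONS = {
    # The key is the dag_id of the affected DAG.
    "flickr_workflow": [
        {
            # GitHub issue tracking the known failure (placeholder URL).
            "issue": "https://github.com/WordPress/openverse-catalog/issues/1234",
            # Text identifying which notification is silenced (assumed field).
            "predicate": "Connection timed out",
        },
    ],
}
```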
### `europeana_reingestion_workflow`

Content Provider: Europeana

ETL Process: Use the API to identify all CC-licensed images.

Output: TSV file containing the images and their respective metadata.

Notes: https://www.europeana.eu/api/v2/search.json
### `europeana_workflow`

Content Provider: Europeana

ETL Process: Use the API to identify all CC-licensed images.

Output: TSV file containing the images and their respective metadata.

Notes: https://www.europeana.eu/api/v2/search.json
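As a hedged example of calling the search endpoint above, a request might look like the following; the API key is a placeholder and the parameter choices are assumptions rather than the DAG's exact query:

```python
import requests

resp = requests.get(
    "https://www.europeana.eu/api/v2/search.json",
    params={
        "wskey": "YOUR_API_KEY",  # placeholder; Europeana requires a key
        "query": "*",
        "reusability": "open",    # restrict to openly licensed items
        "media": "true",          # only records with direct media
        "rows": 100,
    },
    timeout=30,
)
resp.raise_for_status()
items = resp.json().get("items", [])
```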
### `flickr_reingestion_workflow`

Content Provider: Flickr

ETL Process: Use the API to identify all CC-licensed images.

Output: TSV file containing the images and their respective metadata.

Notes: https://www.flickr.com/help/terms/api Rate limit: 3600 requests per hour.
### `flickr_workflow`

Content Provider: Flickr

ETL Process: Use the API to identify all CC-licensed images.

Output: TSV file containing the images and their respective metadata.

Notes: https://www.flickr.com/help/terms/api Rate limit: 3600 requests per hour.
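The 3600 requests per hour budget works out to roughly one request per second. A generic throttling sketch (not the DAG's actual rate-limiting mechanism) might look like:

```python
import time

import requests

MIN_INTERVAL = 3600 / 3600  # one request per second stays within the budget
_last_request = 0.0


def throttled_get(url: str, **kwargs) -> requests.Response:
    """Issue a GET request, sleeping as needed to respect the hourly budget."""
    global _last_request
    wait = MIN_INTERVAL - (time.monotonic() - _last_request)
    if wait > 0:
        time.sleep(wait)
    _last_request = time.monotonic()
    return requests.get(url, **kwargs)
```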
### `freesound_workflow`

Content Provider: Freesound

ETL Process: Use the API to identify all CC-licensed audio.

Output: TSV file containing the audio and its respective metadata.

Notes: https://freesound.org/apiv2/search/text/ No rate limit specified. This script can be run either to ingest the full dataset or as a dated DAG.
### `image_data_refresh`

This file generates our data refresh DAGs using a factory function. For the given media type, these DAGs will first refresh the popularity data, then initiate a data refresh on the data refresh server and await the success or failure of that task.

Popularity data for each media type is collated in a materialized view. Before initiating a data refresh, the DAG will first refresh the view in order to update popularity data for records that have been ingested since the last refresh. On the first run of the month, the DAG will also refresh the underlying tables, including the percentile values and any new popularity metrics. The DAG can also be run with the `force_refresh_metrics` option to run this refresh after the first of the month.

Once this step is complete, the data refresh can be initiated. A data refresh occurs on the data refresh server in the openverse-api project. This is a task which imports data from the upstream Catalog database into the API, copies contents to a new Elasticsearch index, and finally makes the index "live". This process is necessary to make new content added to the Catalog by our provider DAGs available to the API. You can read more in the README. Importantly, the data refresh TaskGroup is also configured to handle the concurrency requirements of the data refresh server.

You can find more background information on this process in the following issues and related PRs:
You can find more background information on this process in the following issues and related PRs:
- [Feature] Data refresh orchestration DAG
- [Feature] Merge popularity calculations and data refresh into a single DAG
### `inaturalist_workflow`

Provider: iNaturalist

Output: TSV file containing the media metadata.

Notes: [The iNaturalist API is not intended for data scraping](https://api.inaturalist.org/v1/docs/), [but there is a full dump intended for sharing on S3](https://github.com/inaturalist/inaturalist-open-data/tree/documentation/Metadata).

Because these are very large normalized tables, as opposed to more document-oriented API responses, we found that bringing the data into Postgres first was the most effective approach. [More detail is available in Slack](https://wordpress.slack.com/archives/C02012JB00N/p1653145643080479?thread_ts=1653082292.714469&cid=C02012JB00N).

We use the table structure defined [here](https://github.com/inaturalist/inaturalist-open-data/blob/main/Metadata/structure.sql), except for adding ancestry tags to the taxa table.
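As a sketch of the first step, the dump files can be fetched anonymously from the public bucket before being loaded into Postgres; the bucket name comes from the documentation linked above, while the object key is an assumption:

```python
import boto3
from botocore import UNSIGNED
from botocore.config import Config

# Anonymous client, since the open-data bucket is publicly readable.
s3 = boto3.client("s3", config=Config(signature_version=UNSIGNED))

# Assumed key for the taxa table dump; the file can then be loaded into
# Postgres with COPY.
s3.download_file("inaturalist-open-data", "taxa.csv.gz", "/tmp/taxa.csv.gz")
```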
### `jamendo_workflow`

Content Provider: Jamendo

ETL Process: Use the API to identify all CC-licensed audio.

Output: TSV file containing the audio metadata.

Notes: https://api.jamendo.com/v3.0/tracks/ (35,000 requests per month for non-commercial apps). Jamendo Music has more than 500,000 tracks shared by 40,000 artists from over 150 countries all over the world. Audio quality: uploaded as WAV/FLAC/AIFF; bit depth: 16/24; sample rate: 44.1 or 48 kHz; channels: 1/2.
### `metropolitan_museum_reingestion_workflow`

Content Provider: Metropolitan Museum of Art

ETL Process: Use the API to identify all CC0 artworks.

Output: TSV file containing the images and their respective metadata.

Notes: https://metmuseum.github.io/#search "Please limit requests to 80 requests per second." The delay may need to be increased (e.g. to 3 seconds) to avoid being blocked during local development testing.

Some analysis to improve data quality was conducted using a separate CSV file here: https://github.com/metmuseum/openaccess

Get a list of object IDs:

https://collectionapi.metmuseum.org/public/collection/v1/objects?metadataDate=2022-08-10

Get a specific object:

https://collectionapi.metmuseum.org/public/collection/v1/objects/1027

The search functionality requires a specific query (term search) in addition to date and public domain; it does not appear to return results with only the date and license parameters:

https://collectionapi.metmuseum.org/public/collection/v1/search?isPublicDomain=true&metadataDate=2022-08-07
### `metropolitan_museum_workflow`

Content Provider: Metropolitan Museum of Art

ETL Process: Use the API to identify all CC0 artworks.

Output: TSV file containing the images and their respective metadata.

Notes: https://metmuseum.github.io/#search "Please limit requests to 80 requests per second." The delay may need to be increased (e.g. to 3 seconds) to avoid being blocked during local development testing.

Some analysis to improve data quality was conducted using a separate CSV file here: https://github.com/metmuseum/openaccess

Get a list of object IDs:

https://collectionapi.metmuseum.org/public/collection/v1/objects?metadataDate=2022-08-10

Get a specific object:

https://collectionapi.metmuseum.org/public/collection/v1/objects/1027

The search functionality requires a specific query (term search) in addition to date and public domain; it does not appear to return results with only the date and license parameters:

https://collectionapi.metmuseum.org/public/collection/v1/search?isPublicDomain=true&metadataDate=2022-08-07
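A quick sketch of the two-step flow these URLs describe: list the object IDs updated since a given metadata date, then fetch each object record individually. The response field names follow the public Met API documentation:

```python
import requests

BASE = "https://collectionapi.metmuseum.org/public/collection/v1"

# Step 1: list object IDs updated since the given metadata date.
listing = requests.get(
    f"{BASE}/objects", params={"metadataDate": "2022-08-10"}, timeout=30
).json()
object_ids = listing.get("objectIDs") or []

# Step 2: fetch each object record individually (first few shown here).
for object_id in object_ids[:5]:
    obj = requests.get(f"{BASE}/objects/{object_id}", timeout=30).json()
    print(obj.get("title"), obj.get("isPublicDomain"))
```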
### `oauth2_authorization`

Author: Madison Swain-Bowden

Created: 2021-10-13

Iterates through all the OAuth2 providers and attempts to authorize them using tokens found in the `OAUTH2_AUTH_KEYS` Variable. Once authorization has completed successfully, the auth token is removed from that Variable. The authorization will create an access/refresh token pair in the `OAUTH2_ACCESS_TOKENS` Variable.

Current Providers:

- Freesound
### `oauth2_token_refresh`

Author: Madison Swain-Bowden

Created: 2021-10-13

Iterates through all OAuth2 providers and attempts to refresh the access token using the refresh token stored in the `OAUTH2_ACCESS_TOKENS` Variable. This DAG will update the tokens stored in the Variable upon successful refresh.

Current Providers:

- Freesound
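For illustration, the refresh step might read and rewrite the Variable roughly as follows; the per-provider token fields are assumptions about the stored shape, and the provider-specific refresh call is elided:

```python
from airflow.models import Variable

# Read the stored access/refresh token pairs (shape is assumed), e.g.
# {"freesound": {"access_token": "...", "refresh_token": "..."}}.
tokens = Variable.get("OAUTH2_ACCESS_TOKENS", deserialize_json=True)

for provider, pair in tokens.items():
    # Exchange pair["refresh_token"] for a fresh pair with the provider here.
    tokens[provider] = pair  # placeholder for the refreshed pair

# Persist the refreshed tokens back to the Variable.
Variable.set("OAUTH2_ACCESS_TOKENS", tokens, serialize_json=True)
```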
### `phylopic_reingestion_workflow`

Content Provider: PhyloPic

ETL Process: Use the API to identify all CC-licensed images.

Output: TSV file containing the images and their respective metadata.

Notes: http://phylopic.org/api/ No rate limit specified.
### `phylopic_workflow`

Content Provider: PhyloPic

ETL Process: Use the API to identify all CC-licensed images.

Output: TSV file containing the images and their respective metadata.

Notes: http://phylopic.org/api/ No rate limit specified.
### `pr_review_reminders`

Iterates through open PRs in our repositories and pings assigned reviewers who have not yet approved the PR or explicitly requested changes.

This DAG runs daily and pings on the following schedule based on the PR's priority label:

| Priority | Days |
| --- | --- |
| critical | 1 day |
| high | >2 days |
| medium | >4 days |
| low | >7 days |

The DAG does not ping on Saturday and Sunday, and accounts for weekend days when determining how much time has passed since the review was requested.

Unfortunately the DAG does not know when someone is on vacation. It is up to the author of the PR to re-assign review if one of the randomly selected reviewers is unavailable for the time period during which the PR should be reviewed.
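A hypothetical helper illustrating the weekend-aware age check described above (not the DAG's actual implementation):

```python
from datetime import date, timedelta


def weekdays_since(assigned: date, today: date) -> int:
    """Count only the Monday-Friday days between the two dates."""
    current, days = assigned, 0
    while current < today:
        current += timedelta(days=1)
        if current.weekday() < 5:  # weekday() 0-4 is Monday-Friday
            days += 1
    return days


# A review requested on a Friday is only one weekday old the following Monday.
assert weekdays_since(date(2023, 1, 6), date(2023, 1, 9)) == 1
```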
### `rawpixel_workflow`

Content Provider: Rawpixel

ETL Process: Use the API to identify all CC-licensed images.

Output: TSV file containing the image metadata.

Notes: Rawpixel has given Openverse beta access to their API. This API is undocumented, and we will need to contact Rawpixel directly if we run into any issues. The public API limits result ranges to a maximum of 100,000 results, although the API key we've been given can circumvent this limit. https://www.rawpixel.com/api/v1/search?tags=$publicdomain&page=1&pagesize=100
### `recreate_audio_popularity_calculation`

This file generates Apache Airflow DAGs that, for the given media type, completely wipe out the PostgreSQL relations and functions involved in calculating our standardized popularity metric. It then recreates the relations and functions, and performs an initial calculation. The results are available in the materialized view for that media type.

These DAGs are not on a schedule, and should only be run manually when new SQL code is deployed for the calculation.
### `recreate_image_popularity_calculation`

This file generates Apache Airflow DAGs that, for the given media type, completely wipe out the PostgreSQL relations and functions involved in calculating our standardized popularity metric. It then recreates the relations and functions, and performs an initial calculation. The results are available in the materialized view for that media type.

These DAGs are not on a schedule, and should only be run manually when new SQL code is deployed for the calculation.
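As an outline only, the recreation amounts to a drop-and-rebuild sequence like the following; the relation and function names are assumptions, not the real schema:

```python
# Hypothetical SQL steps for one media type; object names are illustrative.
RECREATE_POPULARITY_SQL = [
    "DROP MATERIALIZED VIEW IF EXISTS audio_popularity_view;",
    "DROP FUNCTION IF EXISTS standardized_audio_popularity;",
    # ... recreate the functions, metrics, and view from the deployed SQL ...
    "REFRESH MATERIALIZED VIEW audio_popularity_view;",  # initial calculation
]
```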
### `report_pending_reported_media`

This DAG checks for any user-reported media pending manual review and sends an alert via Slack.

Media may be reported for mature content or copyright infringement, for example. Once reported, these require manual review through the Django Admin to determine whether further action (such as deindexing the record) needs to be taken. If a record has been reported multiple times, it only needs to be reviewed once, and so is only counted once in the reporting done by this DAG.
### `smithsonian_workflow`

Content Provider: Smithsonian

ETL Process: Use the API to identify all CC-licensed images.

Output: TSV file containing the images and their respective metadata.

Notes: https://api.si.edu/openaccess/api/v1.0/search
### `stocksnap_workflow`

Content Provider: StockSnap

ETL Process: Use the API to identify all CC-licensed images.

Output: TSV file containing the images and their respective metadata.

Notes: https://stocksnap.io/api/load-photos/date/desc/1 and https://stocksnap.io/faq. All images are licensed under CC0. No rate limits or authorization required; the API is undocumented.
### `wikimedia_commons_workflow`

Content Provider: Wikimedia Commons

ETL Process: Use the API to identify all CC-licensed images.

Output: TSV file containing the images and their respective metadata.

Notes: https://commons.wikimedia.org/wiki/API:Main_page No rate limit specified.
### `wikimedia_reingestion_workflow`

Content Provider: Wikimedia Commons

ETL Process: Use the API to identify all CC-licensed images.

Output: TSV file containing the images and their respective metadata.

Notes: https://commons.wikimedia.org/wiki/API:Main_page No rate limit specified.
### `wordpress_workflow`

Content Provider: WordPress Photo Directory

ETL Process: Use the API to identify all openly licensed media.

Output: TSV file containing the media metadata.

Notes: https://wordpress.org/photos/wp-json/wp/v2 The API provides photos, media, users, and other related resources. No rate limit specified.
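A hedged sketch of paging through the endpoint above; the `/photos` route and the `per_page` parameter follow standard WordPress REST conventions and are assumptions here:

```python
import requests

page = 1
while True:
    resp = requests.get(
        "https://wordpress.org/photos/wp-json/wp/v2/photos",  # assumed route
        params={"page": page, "per_page": 100},
        timeout=30,
    )
    if not resp.ok:  # the API returns an error once the pages run out
        break
    batch = resp.json()
    if not batch:
        break
    for item in batch:
        ...  # extract the media metadata destined for the TSV
    page += 1
```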