This repository has been archived by the owner on Aug 4, 2023. It is now read-only.
This repository has been archived by the owner on Aug 4, 2023. It is now read-only.
[Quality] Delay object creation until runtime #229
Closed
Description
Current Situation
Related to #199 - we have a number of objects (DelayedRequester
, MediaStore
, etc.) that are created at import time rather than runtime. This has a negative performance impact on DAG processing time. The DAG parsing does not need to have these object available, and initializing them every time the parsing is done can unnecessarily increase parsing time. This can be seen from the logging that's done during DAG parsing:
$ airflow dags list -v
[2021-10-07 21:37:40,207] {dagbag.py:496} INFO - Filling up the DagBag from /usr/local/airflow/dags
[2021-10-07 21:37:40,242] {media.py:61} INFO - Initialized image MediaStore with provider finnishmuseums
[2021-10-07 21:37:40,242] {media.py:162} INFO - No given output directory. Using OUTPUT_DIR from environment.
[2021-10-07 21:37:40,243] {media.py:180} INFO - Output path: /tmp/finnishmuseums_image_v001_20211007213740.tsv
{'owner': 'data-eng-admin', 'depends_on_past': False, 'start_date': datetime.datetime(2020, 9, 1, 0, 0), 'email_on_retry': False, 'retries': 3, 'retry_delay': datetime.timedelta(seconds=900)}
[2021-10-07 21:37:40,248] {media.py:61} INFO - Initialized image MediaStore with provider smithsonian
[2021-10-07 21:37:40,248] {media.py:162} INFO - No given output directory. Using OUTPUT_DIR from environment.
[2021-10-07 21:37:40,248] {media.py:180} INFO - Output path: /tmp/smithsonian_image_v001_20211007213740.tsv
{'owner': 'data-eng-admin', 'depends_on_past': False, 'start_date': datetime.datetime(1970, 1, 1, 0, 0), 'email_on_retry': False, 'retries': 3, 'retry_delay': datetime.timedelta(seconds=900)}
[2021-10-07 21:37:40,257] {media.py:61} INFO - Initialized image MediaStore with provider europeana
[2021-10-07 21:37:40,257] {media.py:162} INFO - No given output directory. Using OUTPUT_DIR from environment.
[2021-10-07 21:37:40,257] {media.py:180} INFO - Output path: /tmp/europeana_image_v001_20211007213740.tsv
[2021-10-07 21:37:40,316] {media.py:61} INFO - Initialized audio MediaStore with provider wikimedia_audio
[2021-10-07 21:37:40,316] {media.py:162} INFO - No given output directory. Using OUTPUT_DIR from environment.
[2021-10-07 21:37:40,316] {media.py:180} INFO - Output path: /tmp/wikimedia_audio_audio_v001_20211007213740.tsv
[2021-10-07 21:37:40,316] {media.py:61} INFO - Initialized image MediaStore with provider wikimedia
[2021-10-07 21:37:40,316] {media.py:162} INFO - No given output directory. Using OUTPUT_DIR from environment.
[2021-10-07 21:37:40,317] {media.py:180} INFO - Output path: /tmp/wikimedia_image_v001_20211007213740.tsv
{'owner': 'data-eng-admin', 'depends_on_past': False, 'start_date': datetime.datetime(1970, 1, 1, 0, 0), 'email_on_retry': False, 'retries': 3, 'retry_delay': datetime.timedelta(seconds=900)}
[2021-10-07 21:37:40,327] {media.py:61} INFO - Initialized image MediaStore with provider rawpixel
[2021-10-07 21:37:40,327] {media.py:162} INFO - No given output directory. Using OUTPUT_DIR from environment.
[2021-10-07 21:37:40,327] {media.py:180} INFO - Output path: /tmp/rawpixel_image_v001_20211007213740.tsv
{'owner': 'data-eng-admin', 'depends_on_past': False, 'start_date': datetime.datetime(1970, 1, 1, 0, 0), 'email_on_retry': False, 'retries': 3, 'retry_delay': datetime.timedelta(seconds=900)}
[2021-10-07 21:37:40,329] {media.py:61} INFO - Initialized image MediaStore with provider flickr
[2021-10-07 21:37:40,329] {media.py:162} INFO - No given output directory. Using OUTPUT_DIR from environment.
[2021-10-07 21:37:40,329] {media.py:180} INFO - Output path: /tmp/flickr_image_v001_20211007213740.tsv
{'owner': 'data-eng-admin', 'depends_on_past': False, 'start_date': datetime.datetime(1970, 1, 1, 0, 0), 'email_on_retry': False, 'retries': 3, 'retry_delay': datetime.timedelta(seconds=900)}
[2021-10-07 21:37:40,330] {media.py:61} INFO - Initialized image MediaStore with provider museumsvictoria
[2021-10-07 21:37:40,330] {media.py:162} INFO - No given output directory. Using OUTPUT_DIR from environment.
[2021-10-07 21:37:40,331] {media.py:180} INFO - Output path: /tmp/museumsvictoria_image_v001_20211007213740.tsv
{'owner': 'data-eng-admin', 'depends_on_past': False, 'start_date': datetime.datetime(2020, 1, 1, 0, 0), 'email_on_retry': False, 'retries': 3, 'retry_delay': datetime.timedelta(seconds=900)}
{'owner': 'data-eng-admin', 'depends_on_past': False, 'start_date': datetime.datetime(2020, 1, 1, 0, 0), 'email_on_retry': False, 'retries': 3, 'retry_delay': datetime.timedelta(seconds=900)}
[2021-10-07 21:37:40,351] {media.py:61} INFO - Initialized image MediaStore with provider stocksnap
[2021-10-07 21:37:40,351] {media.py:162} INFO - No given output directory. Using OUTPUT_DIR from environment.
[2021-10-07 21:37:40,352] {media.py:180} INFO - Output path: /tmp/stocksnap_image_v001_20211007213740.tsv
{'owner': 'data-eng-admin', 'depends_on_past': False, 'start_date': datetime.datetime(1970, 1, 1, 0, 0), 'email_on_retry': False, 'retries': 3, 'retry_delay': datetime.timedelta(seconds=900)}
[2021-10-07 21:37:40,494] {media.py:61} INFO - Initialized image MediaStore with provider met
[2021-10-07 21:37:40,494] {media.py:162} INFO - No given output directory. Using OUTPUT_DIR from environment.
[2021-10-07 21:37:40,495] {media.py:180} INFO - Output path: /tmp/met_image_v001_20211007213740.tsv
{'owner': 'data-eng-admin', 'depends_on_past': False, 'start_date': datetime.datetime(1970, 1, 1, 0, 0), 'email_on_retry': False, 'retries': 3, 'retry_delay': datetime.timedelta(seconds=900)}
{'owner': 'data-eng-admin', 'depends_on_past': False, 'start_date': datetime.datetime(1970, 1, 1, 0, 0), 'email_on_retry': False, 'retries': 3, 'retry_delay': datetime.timedelta(seconds=900)}
[2021-10-07 21:37:40,508] {media.py:61} INFO - Initialized image MediaStore with provider statensmuseum
[2021-10-07 21:37:40,508] {media.py:162} INFO - No given output directory. Using OUTPUT_DIR from environment.
[2021-10-07 21:37:40,509] {media.py:180} INFO - Output path: /tmp/statensmuseum_image_v001_20211007213740.tsv
{'owner': 'data-eng-admin', 'depends_on_past': False, 'start_date': datetime.datetime(2020, 1, 1, 0, 0), 'email_on_retry': False, 'retries': 3, 'retry_delay': datetime.timedelta(seconds=900)}
[2021-10-07 21:37:40,519] {media.py:61} INFO - Initialized image MediaStore with provider nypl
[2021-10-07 21:37:40,519] {media.py:162} INFO - No given output directory. Using OUTPUT_DIR from environment.
[2021-10-07 21:37:40,520] {media.py:180} INFO - Output path: /tmp/nypl_image_v001_20211007213740.tsv
{'owner': 'data-eng-admin', 'depends_on_past': False, 'start_date': datetime.datetime(2020, 1, 1, 0, 0), 'email_on_retry': False, 'retries': 3, 'retry_delay': datetime.timedelta(seconds=900)}
[2021-10-07 21:37:40,522] {media.py:61} INFO - Initialized image MediaStore with provider phylopic
[2021-10-07 21:37:40,522] {media.py:162} INFO - No given output directory. Using OUTPUT_DIR from environment.
[2021-10-07 21:37:40,522] {media.py:180} INFO - Output path: /tmp/phylopic_image_v001_20211007213740.tsv
{'owner': 'data-eng-admin', 'depends_on_past': False, 'start_date': datetime.datetime(1970, 1, 1, 0, 0), 'email_on_retry': False, 'retries': 3, 'retry_delay': datetime.timedelta(seconds=900)}
[2021-10-07 21:37:40,526] {media.py:61} INFO - Initialized image MediaStore with provider sciencemuseum
[2021-10-07 21:37:40,526] {media.py:162} INFO - No given output directory. Using OUTPUT_DIR from environment.
[2021-10-07 21:37:40,526] {media.py:180} INFO - Output path: /tmp/sciencemuseum_image_v001_20211007213740.tsv
{'owner': 'data-eng-admin', 'depends_on_past': False, 'start_date': datetime.datetime(2020, 1, 1, 0, 0), 'email_on_retry': False, 'retries': 3, 'retry_delay': datetime.timedelta(seconds=900)}
[2021-10-07 21:37:40,529] {media.py:61} INFO - Initialized audio MediaStore with provider jamendo
[2021-10-07 21:37:40,529] {media.py:162} INFO - No given output directory. Using OUTPUT_DIR from environment.
[2021-10-07 21:37:40,529] {media.py:180} INFO - Output path: /tmp/jamendo_audio_v001_20211007213740.tsv
{'owner': 'data-eng-admin', 'depends_on_past': False, 'start_date': datetime.datetime(1970, 1, 1, 0, 0), 'email_on_retry': False, 'retries': 3, 'retry_delay': datetime.timedelta(seconds=900)}
[2021-10-07 21:37:40,531] {media.py:61} INFO - Initialized image MediaStore with provider clevelandmuseum
[2021-10-07 21:37:40,531] {media.py:162} INFO - No given output directory. Using OUTPUT_DIR from environment.
[2021-10-07 21:37:40,531] {media.py:180} INFO - Output path: /tmp/clevelandmuseum_image_v001_20211007213740.tsv
{'owner': 'data-eng-admin', 'depends_on_past': False, 'start_date': datetime.datetime(2020, 1, 15, 0, 0), 'email_on_retry': False, 'retries': 3, 'retry_delay': datetime.timedelta(seconds=900)}
[2021-10-07 21:37:40,536] {media.py:61} INFO - Initialized image MediaStore with provider waltersartmuseum
[2021-10-07 21:37:40,536] {media.py:162} INFO - No given output directory. Using OUTPUT_DIR from environment.
[2021-10-07 21:37:40,536] {media.py:180} INFO - Output path: /tmp/waltersartmuseum_image_v001_20211007213740.tsv
{'owner': 'data-eng-admin', 'depends_on_past': False, 'start_date': datetime.datetime(2020, 9, 27, 0, 0), 'email_on_retry': False, 'retries': 3, 'retry_delay': datetime.timedelta(seconds=900)}
[2021-10-07 21:37:40,543] {media.py:61} INFO - Initialized image MediaStore with provider brooklynmuseum
[2021-10-07 21:37:40,543] {media.py:162} INFO - No given output directory. Using OUTPUT_DIR from environment.
[2021-10-07 21:37:40,543] {media.py:180} INFO - Output path: /tmp/brooklynmuseum_image_v001_20211007213740.tsv
{'owner': 'data-eng-admin', 'depends_on_past': False, 'start_date': datetime.datetime(2020, 1, 1, 0, 0), 'email_on_retry': False, 'retries': 3, 'retry_delay': datetime.timedelta(seconds=900)}
dag_id | filepath | owner | paused
==========================================+==============================================+================+=======
airflow_log_cleanup | airflow_log_cleanup_workflow.py | data-eng-admin | True
brooklyn_museum_workflow | brooklyn_museum_workflow.py | data-eng-admin | True
check_new_smithsonian_unit_codes_workflow | check_new_smithsonian_unit_codes_workflow.py | data-eng-admin | True
cleveland_museum_workflow | cleveland_museum_workflow.py | data-eng-admin | True
europeana_ingestion_workflow | europeana_ingestion_workflow.py | data-eng-admin | True
europeana_sub_provider_update_workflow | europeana_sub_provider_update_workflow.py | data-eng-admin | True
europeana_workflow | europeana_workflow.py | data-eng-admin | True
finnish_museums_workflow | finnish_museums_workflow.py | data-eng-admin | True
flickr_ingestion_workflow | flickr_ingestion_workflow.py | data-eng-admin | True
flickr_sub_provider_update_workflow | flickr_sub_provider_update_workflow.py | data-eng-admin | True
flickr_workflow | flickr_workflow.py | data-eng-admin | True
healthcheck_workflow | healthcheck_workflow.py | data-eng-admin | True
image_expiration_workflow | image_expiration_workflow.py | data-eng-admin | True
jamendo_workflow | jamendo_workflow.py | data-eng-admin | True
metropolitan_museum_workflow | metropolitan_museum_workflow.py | data-eng-admin | True
museum_victoria_workflow | museum_victoria_workflow.py | data-eng-admin | True
nypl_workflow | nypl_workflow.py | data-eng-admin | True
phylopic_workflow | phylopic_workflow.py | data-eng-admin | True
postgres_image_cleaner | cleaner_workflow.py | data-eng-admin | True
rawpixel_workflow | rawpixel_workflow.py | data-eng-admin | True
recreate_audio_popularity_calculation | recreate_audio_popularity_calculation.py | data-eng-admin | True
recreate_image_popularity_calculation | recreate_image_popularity_calculation.py | data-eng-admin | True
refresh_all_audio_popularity_data | refresh_all_audio_popularity_data.py | data-eng-admin | True
refresh_all_image_popularity_data | refresh_all_image_popularity_data.py | data-eng-admin | True
refresh_audio_view_data | refresh_audio_view_data.py | data-eng-admin | True
refresh_image_view_data | refresh_image_view_data.py | data-eng-admin | True
science_museum_workflow | science_museum_workflow.py | data-eng-admin | True
smithsonian_sub_provider_update_workflow | smithsonian_sub_provider_update_workflow.py | data-eng-admin | True
smithsonian_workflow | smithsonian_workflow.py | data-eng-admin | True
staten_museum_workflow | statens_museum_workflow.py | data-eng-admin | True
stocksnap_workflow | stocksnap_workflow.py | data-eng-admin | True
tsv_to_postgres_loader | loader_workflow.py | data-eng-admin | True
tsv_to_postgres_loader_overwrite | loader_workflow.py | data-eng-admin | True
walters_workflow | walters_workflow.py | data-eng-admin | True
wikimedia_commons_workflow | wikimedia_workflow.py | data-eng-admin | True
wikimedia_ingestion_workflow | wikimedia_ingestion_workflow.py | data-eng-admin | True
This is happening at every DAG parse interval, which by default is 30 seconds.
Suggested Improvement
Initialize these objects in the main
functions of each provider, and pass them into other functions as necessary.
Benefit
Decreased DAG processing time, CPU & memory usage.
Additional context
Implementation
- 🙋 I would be interested in implementing this feature.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment