This repository has been archived by the owner on Aug 4, 2023. It is now read-only.

[Quality] Delay object creation until runtime #229

Closed

Description

Current Situation

Related to #199: we have a number of objects (DelayedRequester, MediaStore, etc.) that are created at import time rather than at runtime. This slows down DAG processing. DAG parsing does not need these objects to be available, and initializing them on every parse unnecessarily increases parsing time. The logging emitted during DAG parsing shows this:

$ airflow dags list -v
[2021-10-07 21:37:40,207] {dagbag.py:496} INFO - Filling up the DagBag from /usr/local/airflow/dags
[2021-10-07 21:37:40,242] {media.py:61} INFO - Initialized image MediaStore with provider finnishmuseums
[2021-10-07 21:37:40,242] {media.py:162} INFO - No given output directory. Using OUTPUT_DIR from environment.
[2021-10-07 21:37:40,243] {media.py:180} INFO - Output path: /tmp/finnishmuseums_image_v001_20211007213740.tsv
{'owner': 'data-eng-admin', 'depends_on_past': False, 'start_date': datetime.datetime(2020, 9, 1, 0, 0), 'email_on_retry': False, 'retries': 3, 'retry_delay': datetime.timedelta(seconds=900)}
[2021-10-07 21:37:40,248] {media.py:61} INFO - Initialized image MediaStore with provider smithsonian
[2021-10-07 21:37:40,248] {media.py:162} INFO - No given output directory. Using OUTPUT_DIR from environment.
[2021-10-07 21:37:40,248] {media.py:180} INFO - Output path: /tmp/smithsonian_image_v001_20211007213740.tsv
{'owner': 'data-eng-admin', 'depends_on_past': False, 'start_date': datetime.datetime(1970, 1, 1, 0, 0), 'email_on_retry': False, 'retries': 3, 'retry_delay': datetime.timedelta(seconds=900)}
[2021-10-07 21:37:40,257] {media.py:61} INFO - Initialized image MediaStore with provider europeana
[2021-10-07 21:37:40,257] {media.py:162} INFO - No given output directory. Using OUTPUT_DIR from environment.
[2021-10-07 21:37:40,257] {media.py:180} INFO - Output path: /tmp/europeana_image_v001_20211007213740.tsv
[2021-10-07 21:37:40,316] {media.py:61} INFO - Initialized audio MediaStore with provider wikimedia_audio
[2021-10-07 21:37:40,316] {media.py:162} INFO - No given output directory. Using OUTPUT_DIR from environment.
[2021-10-07 21:37:40,316] {media.py:180} INFO - Output path: /tmp/wikimedia_audio_audio_v001_20211007213740.tsv
[2021-10-07 21:37:40,316] {media.py:61} INFO - Initialized image MediaStore with provider wikimedia
[2021-10-07 21:37:40,316] {media.py:162} INFO - No given output directory. Using OUTPUT_DIR from environment.
[2021-10-07 21:37:40,317] {media.py:180} INFO - Output path: /tmp/wikimedia_image_v001_20211007213740.tsv
{'owner': 'data-eng-admin', 'depends_on_past': False, 'start_date': datetime.datetime(1970, 1, 1, 0, 0), 'email_on_retry': False, 'retries': 3, 'retry_delay': datetime.timedelta(seconds=900)}
[2021-10-07 21:37:40,327] {media.py:61} INFO - Initialized image MediaStore with provider rawpixel
[2021-10-07 21:37:40,327] {media.py:162} INFO - No given output directory. Using OUTPUT_DIR from environment.
[2021-10-07 21:37:40,327] {media.py:180} INFO - Output path: /tmp/rawpixel_image_v001_20211007213740.tsv
{'owner': 'data-eng-admin', 'depends_on_past': False, 'start_date': datetime.datetime(1970, 1, 1, 0, 0), 'email_on_retry': False, 'retries': 3, 'retry_delay': datetime.timedelta(seconds=900)}
[2021-10-07 21:37:40,329] {media.py:61} INFO - Initialized image MediaStore with provider flickr
[2021-10-07 21:37:40,329] {media.py:162} INFO - No given output directory. Using OUTPUT_DIR from environment.
[2021-10-07 21:37:40,329] {media.py:180} INFO - Output path: /tmp/flickr_image_v001_20211007213740.tsv
{'owner': 'data-eng-admin', 'depends_on_past': False, 'start_date': datetime.datetime(1970, 1, 1, 0, 0), 'email_on_retry': False, 'retries': 3, 'retry_delay': datetime.timedelta(seconds=900)}
[2021-10-07 21:37:40,330] {media.py:61} INFO - Initialized image MediaStore with provider museumsvictoria
[2021-10-07 21:37:40,330] {media.py:162} INFO - No given output directory. Using OUTPUT_DIR from environment.
[2021-10-07 21:37:40,331] {media.py:180} INFO - Output path: /tmp/museumsvictoria_image_v001_20211007213740.tsv
{'owner': 'data-eng-admin', 'depends_on_past': False, 'start_date': datetime.datetime(2020, 1, 1, 0, 0), 'email_on_retry': False, 'retries': 3, 'retry_delay': datetime.timedelta(seconds=900)}
{'owner': 'data-eng-admin', 'depends_on_past': False, 'start_date': datetime.datetime(2020, 1, 1, 0, 0), 'email_on_retry': False, 'retries': 3, 'retry_delay': datetime.timedelta(seconds=900)}
[2021-10-07 21:37:40,351] {media.py:61} INFO - Initialized image MediaStore with provider stocksnap
[2021-10-07 21:37:40,351] {media.py:162} INFO - No given output directory. Using OUTPUT_DIR from environment.
[2021-10-07 21:37:40,352] {media.py:180} INFO - Output path: /tmp/stocksnap_image_v001_20211007213740.tsv
{'owner': 'data-eng-admin', 'depends_on_past': False, 'start_date': datetime.datetime(1970, 1, 1, 0, 0), 'email_on_retry': False, 'retries': 3, 'retry_delay': datetime.timedelta(seconds=900)}
[2021-10-07 21:37:40,494] {media.py:61} INFO - Initialized image MediaStore with provider met
[2021-10-07 21:37:40,494] {media.py:162} INFO - No given output directory. Using OUTPUT_DIR from environment.
[2021-10-07 21:37:40,495] {media.py:180} INFO - Output path: /tmp/met_image_v001_20211007213740.tsv
{'owner': 'data-eng-admin', 'depends_on_past': False, 'start_date': datetime.datetime(1970, 1, 1, 0, 0), 'email_on_retry': False, 'retries': 3, 'retry_delay': datetime.timedelta(seconds=900)}
{'owner': 'data-eng-admin', 'depends_on_past': False, 'start_date': datetime.datetime(1970, 1, 1, 0, 0), 'email_on_retry': False, 'retries': 3, 'retry_delay': datetime.timedelta(seconds=900)}
[2021-10-07 21:37:40,508] {media.py:61} INFO - Initialized image MediaStore with provider statensmuseum
[2021-10-07 21:37:40,508] {media.py:162} INFO - No given output directory. Using OUTPUT_DIR from environment.
[2021-10-07 21:37:40,509] {media.py:180} INFO - Output path: /tmp/statensmuseum_image_v001_20211007213740.tsv
{'owner': 'data-eng-admin', 'depends_on_past': False, 'start_date': datetime.datetime(2020, 1, 1, 0, 0), 'email_on_retry': False, 'retries': 3, 'retry_delay': datetime.timedelta(seconds=900)}
[2021-10-07 21:37:40,519] {media.py:61} INFO - Initialized image MediaStore with provider nypl
[2021-10-07 21:37:40,519] {media.py:162} INFO - No given output directory. Using OUTPUT_DIR from environment.
[2021-10-07 21:37:40,520] {media.py:180} INFO - Output path: /tmp/nypl_image_v001_20211007213740.tsv
{'owner': 'data-eng-admin', 'depends_on_past': False, 'start_date': datetime.datetime(2020, 1, 1, 0, 0), 'email_on_retry': False, 'retries': 3, 'retry_delay': datetime.timedelta(seconds=900)}
[2021-10-07 21:37:40,522] {media.py:61} INFO - Initialized image MediaStore with provider phylopic
[2021-10-07 21:37:40,522] {media.py:162} INFO - No given output directory. Using OUTPUT_DIR from environment.
[2021-10-07 21:37:40,522] {media.py:180} INFO - Output path: /tmp/phylopic_image_v001_20211007213740.tsv
{'owner': 'data-eng-admin', 'depends_on_past': False, 'start_date': datetime.datetime(1970, 1, 1, 0, 0), 'email_on_retry': False, 'retries': 3, 'retry_delay': datetime.timedelta(seconds=900)}
[2021-10-07 21:37:40,526] {media.py:61} INFO - Initialized image MediaStore with provider sciencemuseum
[2021-10-07 21:37:40,526] {media.py:162} INFO - No given output directory. Using OUTPUT_DIR from environment.
[2021-10-07 21:37:40,526] {media.py:180} INFO - Output path: /tmp/sciencemuseum_image_v001_20211007213740.tsv
{'owner': 'data-eng-admin', 'depends_on_past': False, 'start_date': datetime.datetime(2020, 1, 1, 0, 0), 'email_on_retry': False, 'retries': 3, 'retry_delay': datetime.timedelta(seconds=900)}
[2021-10-07 21:37:40,529] {media.py:61} INFO - Initialized audio MediaStore with provider jamendo
[2021-10-07 21:37:40,529] {media.py:162} INFO - No given output directory. Using OUTPUT_DIR from environment.
[2021-10-07 21:37:40,529] {media.py:180} INFO - Output path: /tmp/jamendo_audio_v001_20211007213740.tsv
{'owner': 'data-eng-admin', 'depends_on_past': False, 'start_date': datetime.datetime(1970, 1, 1, 0, 0), 'email_on_retry': False, 'retries': 3, 'retry_delay': datetime.timedelta(seconds=900)}
[2021-10-07 21:37:40,531] {media.py:61} INFO - Initialized image MediaStore with provider clevelandmuseum
[2021-10-07 21:37:40,531] {media.py:162} INFO - No given output directory. Using OUTPUT_DIR from environment.
[2021-10-07 21:37:40,531] {media.py:180} INFO - Output path: /tmp/clevelandmuseum_image_v001_20211007213740.tsv
{'owner': 'data-eng-admin', 'depends_on_past': False, 'start_date': datetime.datetime(2020, 1, 15, 0, 0), 'email_on_retry': False, 'retries': 3, 'retry_delay': datetime.timedelta(seconds=900)}
[2021-10-07 21:37:40,536] {media.py:61} INFO - Initialized image MediaStore with provider waltersartmuseum
[2021-10-07 21:37:40,536] {media.py:162} INFO - No given output directory. Using OUTPUT_DIR from environment.
[2021-10-07 21:37:40,536] {media.py:180} INFO - Output path: /tmp/waltersartmuseum_image_v001_20211007213740.tsv
{'owner': 'data-eng-admin', 'depends_on_past': False, 'start_date': datetime.datetime(2020, 9, 27, 0, 0), 'email_on_retry': False, 'retries': 3, 'retry_delay': datetime.timedelta(seconds=900)}
[2021-10-07 21:37:40,543] {media.py:61} INFO - Initialized image MediaStore with provider brooklynmuseum
[2021-10-07 21:37:40,543] {media.py:162} INFO - No given output directory. Using OUTPUT_DIR from environment.
[2021-10-07 21:37:40,543] {media.py:180} INFO - Output path: /tmp/brooklynmuseum_image_v001_20211007213740.tsv
{'owner': 'data-eng-admin', 'depends_on_past': False, 'start_date': datetime.datetime(2020, 1, 1, 0, 0), 'email_on_retry': False, 'retries': 3, 'retry_delay': datetime.timedelta(seconds=900)}
dag_id                                    | filepath                                     | owner          | paused
==========================================+==============================================+================+=======
airflow_log_cleanup                       | airflow_log_cleanup_workflow.py              | data-eng-admin | True  
brooklyn_museum_workflow                  | brooklyn_museum_workflow.py                  | data-eng-admin | True  
check_new_smithsonian_unit_codes_workflow | check_new_smithsonian_unit_codes_workflow.py | data-eng-admin | True  
cleveland_museum_workflow                 | cleveland_museum_workflow.py                 | data-eng-admin | True  
europeana_ingestion_workflow              | europeana_ingestion_workflow.py              | data-eng-admin | True  
europeana_sub_provider_update_workflow    | europeana_sub_provider_update_workflow.py    | data-eng-admin | True  
europeana_workflow                        | europeana_workflow.py                        | data-eng-admin | True  
finnish_museums_workflow                  | finnish_museums_workflow.py                  | data-eng-admin | True  
flickr_ingestion_workflow                 | flickr_ingestion_workflow.py                 | data-eng-admin | True  
flickr_sub_provider_update_workflow       | flickr_sub_provider_update_workflow.py       | data-eng-admin | True  
flickr_workflow                           | flickr_workflow.py                           | data-eng-admin | True  
healthcheck_workflow                      | healthcheck_workflow.py                      | data-eng-admin | True  
image_expiration_workflow                 | image_expiration_workflow.py                 | data-eng-admin | True  
jamendo_workflow                          | jamendo_workflow.py                          | data-eng-admin | True  
metropolitan_museum_workflow              | metropolitan_museum_workflow.py              | data-eng-admin | True  
museum_victoria_workflow                  | museum_victoria_workflow.py                  | data-eng-admin | True  
nypl_workflow                             | nypl_workflow.py                             | data-eng-admin | True  
phylopic_workflow                         | phylopic_workflow.py                         | data-eng-admin | True  
postgres_image_cleaner                    | cleaner_workflow.py                          | data-eng-admin | True  
rawpixel_workflow                         | rawpixel_workflow.py                         | data-eng-admin | True  
recreate_audio_popularity_calculation     | recreate_audio_popularity_calculation.py     | data-eng-admin | True  
recreate_image_popularity_calculation     | recreate_image_popularity_calculation.py     | data-eng-admin | True  
refresh_all_audio_popularity_data         | refresh_all_audio_popularity_data.py         | data-eng-admin | True  
refresh_all_image_popularity_data         | refresh_all_image_popularity_data.py         | data-eng-admin | True  
refresh_audio_view_data                   | refresh_audio_view_data.py                   | data-eng-admin | True  
refresh_image_view_data                   | refresh_image_view_data.py                   | data-eng-admin | True  
science_museum_workflow                   | science_museum_workflow.py                   | data-eng-admin | True  
smithsonian_sub_provider_update_workflow  | smithsonian_sub_provider_update_workflow.py  | data-eng-admin | True  
smithsonian_workflow                      | smithsonian_workflow.py                      | data-eng-admin | True  
staten_museum_workflow                    | statens_museum_workflow.py                   | data-eng-admin | True  
stocksnap_workflow                        | stocksnap_workflow.py                        | data-eng-admin | True  
tsv_to_postgres_loader                    | loader_workflow.py                           | data-eng-admin | True  
tsv_to_postgres_loader_overwrite          | loader_workflow.py                           | data-eng-admin | True  
walters_workflow                          | walters_workflow.py                          | data-eng-admin | True  
wikimedia_commons_workflow                | wikimedia_workflow.py                        | data-eng-admin | True  
wikimedia_ingestion_workflow              | wikimedia_ingestion_workflow.py              | data-eng-admin | True  

This happens at every DAG parse interval, which by default is 30 seconds.

Suggested Improvement

Initialize these objects in the main functions of each provider, and pass them into other functions as necessary.
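A minimal sketch of what this could look like in a provider script. The `DelayedRequester` and `MediaStore` classes below are simplified stand-ins, and their constructor arguments, the `add_item` method, and the helper names `_fetch_batch` and `_process_batch` are assumptions made for illustration; they are not the real signatures from this repository.

```python
import logging

logger = logging.getLogger(__name__)


# Hypothetical stand-ins for the real classes. The constructor
# arguments and methods are assumptions for illustration only.
class DelayedRequester:
    def __init__(self, delay):
        self.delay = delay


class MediaStore:
    instances_created = 0  # counts how often a store is initialized

    def __init__(self, provider):
        MediaStore.instances_created += 1
        self.provider = provider
        self.items = []
        logger.info("Initialized MediaStore with provider %s", provider)

    def add_item(self, **kwargs):
        self.items.append(kwargs)
        return len(self.items)


# Plain constants are cheap and can stay at module level.
PROVIDER = "example_provider"
DELAY = 1.0

# Before: these ran on every import, i.e. on every DAG parse.
# delayed_requester = DelayedRequester(DELAY)
# image_store = MediaStore(provider=PROVIDER)


def _fetch_batch(delayed_requester):
    # Placeholder for an API call made through delayed_requester.
    return [{"foreign_identifier": "1", "title": "Example"}]


def _process_batch(batch, image_store):
    # The store is passed in rather than read from a module global.
    for record in batch:
        image_store.add_item(**record)


def main():
    # After: objects are created only when the task actually runs.
    delayed_requester = DelayedRequester(DELAY)
    image_store = MediaStore(provider=PROVIDER)
    batch = _fetch_batch(delayed_requester)
    _process_batch(batch, image_store)
    return image_store
```

With this layout, importing the module (which is all DAG parsing does) creates no objects and emits no "Initialized ... MediaStore" log lines; they appear only when `main()` is called by a running task.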

Benefit

Decreased DAG processing time and lower CPU and memory usage.

Additional context

Implementation

  • 🙋 I would be interested in implementing this feature.