[Bug] Sync commoncrawl workflow cannot be loaded if OUTPUT_DIR in env not set #1725

Closed
1 task done
obulat opened this issue Jun 30, 2021 · 0 comments · Fixed by WordPress/openverse-catalog#118
Labels
  • 💻 aspect: code (Concerns the software code in the repository)
  • 🛠 goal: fix (Bug fix)
  • 🟧 priority: high (Stalls work on the project or its dependents)
  • 🚦 status: awaiting triage (Has not been triaged & therefore, not ready for work)

obulat commented Jun 30, 2021

Description

The sync_commoncrawl_workflow.py workflow cannot be loaded in Airflow. It builds the output file path from the OUTPUT_DIR environment variable at module import time, and that variable is not set. Provider API scripts have a fallback chain for setting output paths: they use the OUTPUT_DIR env variable, or the 'output_dir' function parameter, or /tmp as a default. As a quick fix for the production environment (before we restructure the folder structure), we can either set the OUTPUT_DIR environment variable or provide a default for CRAWL_OUTPUT_DIR. Doing both would be best :)
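For illustration, the fallback chain described above could be sketched roughly as follows. This is not the provider scripts' actual code: the helper name `resolve_output_dir` and its `environ` parameter are hypothetical, and the precedence (environment variable first, then the function parameter, then /tmp) is an assumption based on the order listed above.

```python
import os

DEFAULT_OUTPUT_DIR = "/tmp"


def resolve_output_dir(output_dir=None, environ=None):
    """Pick an output directory using a three-step fallback chain.

    Hypothetical helper; the precedence is an assumption:
    1. the OUTPUT_DIR environment variable, if set and non-empty
    2. the 'output_dir' function parameter, if given
    3. /tmp as the last resort
    """
    # Allow a mapping to be injected so the chain can be exercised
    # without touching the real process environment.
    environ = os.environ if environ is None else environ
    return environ.get("OUTPUT_DIR") or output_dir or DEFAULT_OUTPUT_DIR
```

Because every step has a fallback, a lookup like this never raises when OUTPUT_DIR is missing, which is the behavior the commoncrawl workflow currently lacks.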

Reproduction

  1. Open the production environment :) or try running Airflow without a .env file and without the OUTPUT_DIR variable set in the environment.
  2. Open Airflow dashboard.

Expectation

Airflow should load sync_commoncrawl_workflow.py without errors.

Screenshots

All workflows load without errors except sync_commoncrawl_workflow.py. At the top of the screen, if you click the red arrow to open the details, you can see this error:

Broken DAG: [/usr/local/airflow/dags/sync_commoncrawl_workflow.py] Traceback (most recent call last):
  File "/usr/local/airflow/dags/sync_commoncrawl_workflow.py", line 26, in <module>
    CRAWL_OUTPUT_DIR = os.path.join(os.environ["OUTPUT_DIR"], TSV_SUBDIR)
  File "/usr/local/lib/python3.9/os.py", line 679, in __getitem__
    raise KeyError(key) from None
KeyError: 'OUTPUT_DIR'

Additional context

Instead of relying solely on the environment variable for setting the output directory:
CRAWL_OUTPUT_DIR = os.path.join(os.environ["OUTPUT_DIR"], TSV_SUBDIR)
we could add a default /tmp value, just like in provider API workflows:
CRAWL_OUTPUT_DIR = os.path.join(os.environ.get("OUTPUT_DIR", '/tmp'), TSV_SUBDIR)
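A small sketch of the difference between the two lookups, wrapped in functions so it can be run without changing the real environment. The function names and the `TSV_SUBDIR` value here are placeholders for illustration only; the real value is defined in sync_commoncrawl_workflow.py.

```python
import os

# Placeholder value for illustration; the real TSV_SUBDIR is defined
# in sync_commoncrawl_workflow.py.
TSV_SUBDIR = "example_tsv_subdir"


def crawl_output_dir_strict(environ):
    # Current behavior: indexing raises KeyError when OUTPUT_DIR is absent,
    # which is what breaks DAG loading.
    return os.path.join(environ["OUTPUT_DIR"], TSV_SUBDIR)


def crawl_output_dir_with_default(environ):
    # Proposed behavior: fall back to /tmp when OUTPUT_DIR is absent.
    return os.path.join(environ.get("OUTPUT_DIR", "/tmp"), TSV_SUBDIR)


if __name__ == "__main__":
    empty_env = {}  # stands in for an environment without OUTPUT_DIR
    try:
        crawl_output_dir_strict(empty_env)
    except KeyError as err:
        print(f"strict lookup failed: KeyError {err}")
    print(crawl_output_dir_with_default(empty_env))
```

One caveat with `.get("OUTPUT_DIR", "/tmp")`: the default applies only when the variable is absent. If OUTPUT_DIR is set to an empty string, the resulting path is relative rather than under /tmp.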

Resolution

  • 🙋 I would be interested in resolving this bug.
@obulat obulat added 🟧 priority: high Stalls work on the project or its dependents 🚦 status: awaiting triage Has not been triaged & therefore, not ready for work 🛠 goal: fix Bug fix 💻 aspect: code Concerns the software code in the repository labels Jun 30, 2021
@obulat obulat transferred this issue from WordPress/openverse-catalog Apr 17, 2023