[Bug] Sync commoncrawl workflow cannot be loaded if OUTPUT_DIR in env not set
#1725
Closed
1 task done
Labels
💻 aspect: code
Concerns the software code in the repository
🛠 goal: fix
Bug fix
🟧 priority: high
Stalls work on the project or its dependents
🚦 status: awaiting triage
Has not been triaged & therefore, not ready for work
Description
The sync_commoncrawl_workflow.py workflow in Airflow cannot be loaded. This happens because it tries to set the output file path using the OUTPUT_DIR environment variable, which is not set. Provider API scripts have a fall-back chain for setting output paths: they use the OUTPUT_DIR env variable, or the 'output_dir' function parameter, or /tmp as a default. As a quick fix for the production environment (before we restructure the folder structure), we can either set the OUTPUT_DIR environment variable, or provide a default for CRAWL_OUTPUT_DIR. Doing both would be best :)
Reproduction
.env file and OUTPUT_DIR variable set in the environment.
Expectation
Airflow should load sync_commoncrawl_workflow.py without errors.
Screenshots
All workflows load without errors except sync_commoncrawl_workflow.py. At the top of the screen, if you click the red arrow to open details, you can see the error.
Additional context
Instead of relying solely on the environment variable for setting the output directory:
CRAWL_OUTPUT_DIR = os.path.join(os.environ["OUTPUT_DIR"], TSV_SUBDIR)
we could add a default /tmp value, just like in provider API workflows:
CRAWL_OUTPUT_DIR = os.path.join(os.environ.get("OUTPUT_DIR", "/tmp"), TSV_SUBDIR)
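To illustrate the difference, here is a minimal sketch of the fall-back idea, not the code from the repository. The helper names resolve_output_dir and crawl_output_dir, the TSV_SUBDIR value, and the order in which the function parameter and the environment variable are checked are all assumptions made for the example.

```python
import os

# Hypothetical value; the real TSV_SUBDIR is defined in the workflow module.
TSV_SUBDIR = "common_crawl_tsvs/"


def resolve_output_dir(output_dir=None, environ=os.environ):
    """Pick an output directory: explicit argument, then OUTPUT_DIR, then /tmp.

    The precedence used here is an assumption for illustration.
    """
    if output_dir is not None:
        return output_dir
    # .get() returns the default instead of raising KeyError when unset.
    return environ.get("OUTPUT_DIR", "/tmp")


def crawl_output_dir(environ=os.environ):
    """Build the crawl output path without failing when OUTPUT_DIR is missing."""
    return os.path.join(resolve_output_dir(environ=environ), TSV_SUBDIR)


def strict_crawl_output_dir(environ=os.environ):
    """Mirror of the current behaviour: raises KeyError if OUTPUT_DIR is unset."""
    return os.path.join(environ["OUTPUT_DIR"], TSV_SUBDIR)
```

With an empty environment mapping, strict_crawl_output_dir({}) raises KeyError at import time, which would be enough to stop Airflow from loading the module, while crawl_output_dir({}) falls back to a path under /tmp.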
Resolution