Add docker build scripts #826
Conversation
1. settings.yml copied from the pudl-data-release repo
2. Dockerfile.git copies code from the catalyst-cooperative GitHub account
3. Dockerfile.local makes a shallow git clone of a local repo

Notes: Making a shallow copy of the local git repo may be unnecessary; we could simply copy the full state of the local repo. That would allow testing/installing uncommitted changes into the docker image, but might also increase the size of the resulting image. Perhaps there is a way to make pip install work without the git files also being present.
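As a rough illustration of the first approach, a Dockerfile that pulls the code from GitHub might look like the following. This is a hypothetical sketch, not the PR's actual contents: the base image, target path, and install mechanics are all assumptions.

```dockerfile
# Hypothetical sketch of Dockerfile.git: install pudl straight from GitHub.
FROM python:3.8-slim

# git is needed only to fetch the source.
RUN apt-get update \
    && apt-get install -y --no-install-recommends git \
    && rm -rf /var/lib/apt/lists/*

# --depth 1 makes a shallow clone: only the latest commit, for a smaller image.
RUN git clone --depth 1 https://github.com/catalyst-cooperative/pudl.git /pudl-src \
    && pip install --no-cache-dir /pudl-src
```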
This will be useful when we run the ETL in an automated container.
1. Add a data-release.sh script that will be run in the docker container
2. Add a PUDL_SETTINGS env variable used by the above script
3. Mark /pudl/inputs/data as a mountable volume (this is where the Zenodo files live)

Caveats: shallow copying of the git history (intended to reduce image size) also means ignoring uncommitted local changes (which is not necessarily what we want). We could build PyPI tarballs locally, but that has other problems (e.g. some build issues due to missing dependencies).
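The three pieces above might fit together in a Dockerfile fragment roughly like this. The script name, settings path, and volume path come from the commit message; everything else is an assumption:

```dockerfile
# Hypothetical fragment combining the pieces described above.
COPY release/data-release.sh /usr/local/bin/data-release.sh

# Settings file consumed by data-release.sh.
ENV PUDL_SETTINGS=/pudl/release/settings/settings.yml

# The Zenodo input archives live here; they are mounted at run time
# rather than baked into the image.
VOLUME /pudl/inputs/data

CMD ["/usr/local/bin/data-release.sh"]
```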
1. Moved configs under release/settings
2. Ensure that the release script runs within the conda env (this is somewhat fragile)
Instead of doing a shallow git clone, building a PyPI-style source tarball and installing it with pip seems to do the trick just fine, and may result in less clutter.
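A hedged sketch of that tarball approach as a multi-stage build (stage names and Python version are assumptions). Because pip installs from the sdist, the final image never needs the .git directory at all:

```dockerfile
# Hypothetical multi-stage sketch: build an sdist, then install it with pip.
FROM python:3.8-slim AS build
COPY . /src
WORKDIR /src
RUN python setup.py sdist     # produces dist/pudl-<version>.tar.gz

FROM python:3.8-slim
COPY --from=build /src/dist/ /tmp/dist/
RUN pip install --no-cache-dir /tmp/dist/*.tar.gz \
    && rm -rf /tmp/dist
```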
WIP: This is an attempt to build a small docker image with pudl. It installs pudl and its dependencies with pip and avoids conda (which takes up a lot of space).
This should run the tox tests on the output files, but right now it doesn't really do much. Pending: the output files will need to be located (/pudl/outputs/datapkg) and transformed into the right formats before running the tox tests on them.
1. Clean up the git repo copy from the build image
2. Add an entrypoint for the tiny docker image
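In a multi-stage build, "cleaning up the git repo copy" falls out naturally: only the installed artifacts are copied into the final stage. A hypothetical ending for Dockerfile.tiny (the entrypoint script name is assumed from the earlier commits, and the copied paths are illustrative):

```dockerfile
# Hypothetical final stage of Dockerfile.tiny. The build stage, together
# with its git checkout, is discarded; only installed artifacts survive.
FROM python:3.8-slim
COPY --from=build /usr/local/lib/python3.8/site-packages /usr/local/lib/python3.8/site-packages
COPY --from=build /usr/local/bin /usr/local/bin
ENTRYPOINT ["/usr/local/bin/data-release.sh"]
```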
I think the right approach here would be to use "Squash and merge" to collapse the large number of minor commits into one logically consistent commit that adds all these files in their final form.
Just a few questions on the files. Thank you for pulling all this together!
```yaml
# NOTE: coalmine_eia923 REQUIRES fuel_receipts_costs_eia923
- coalmine_eia923
- fuel_receipts_costs_eia923
eia923_years: [2018, 2019]
```
Were you having luck running with just 2 years? It often generates a bunch of internal consistency failures in the harvested tables (since any time there are 2 different values reported, it fails to meet the criteria for having a "consistent value" reported).
I can't remember anymore. I may have added this file well before I was able to successfully run the pipeline, so this test file may not be operational.
This adds data-release scripts, configurations, and Dockerfiles to build a variety of pudl docker images. This feature is still somewhat experimental, and changes are to be expected. A variety of docker flavors exist: some pull the code from the git repo, others from the local copy. The two most developed flavors are `Dockerfile.local`, which pulls the local copy and uses a miniconda environment, and `Dockerfile.tiny`, which uses `pip install` and does not rely on conda. This way the resulting docker image is smaller and also simpler to use (there is no trouble with ensuring that the right conda environment is activated).

An additional `--logfile` CLI argument is added to allow directing ETL output to a specific file (this may be useful when running the ETL in an automated fashion).
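The new flag could be wired up roughly as below. This is a minimal sketch using argparse and the standard logging module; the function and argument names here are hypothetical illustrations, not the PR's actual code.

```python
import argparse
import logging


def parse_args(argv=None):
    """Parse CLI args, including a hypothetical --logfile option."""
    parser = argparse.ArgumentParser(description="Run the PUDL ETL.")
    parser.add_argument("settings_file", help="Path to the ETL settings YAML.")
    parser.add_argument(
        "--logfile",
        default=None,
        help="If given, write log output to this file instead of stderr.",
    )
    return parser.parse_args(argv)


def configure_logging(logfile=None):
    """Send log records to the given file, or to stderr if none is given."""
    handler = logging.FileHandler(logfile) if logfile else logging.StreamHandler()
    logging.basicConfig(level=logging.INFO, handlers=[handler], force=True)
```

In an automated container run, the entrypoint script could then pass something like `--logfile /pudl/outputs/etl.log` so the log survives alongside the other outputs.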