Add docker build scripts #826

Merged
merged 33 commits into from
Nov 19, 2020


rousik
Collaborator

@rousik rousik commented Nov 18, 2020

This adds data-release scripts, configurations, and Dockerfiles for building a variety of pudl Docker images. The feature is still somewhat experimental, and changes are to be expected. Several Docker flavors exist: some pull the code from the git repo, others use a local copy. The two most developed flavors are Dockerfile.local, which uses the local copy and a miniconda environment, and Dockerfile.tiny, which uses pip install and does not rely on conda. The Dockerfile.tiny image is smaller and simpler to use, since there is no need to ensure that the right conda environment is activated.

An additional --logfile CLI argument allows directing ETL output to a specific file, which may be useful when running the ETL in an automated fashion.
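
For readers unfamiliar with the pattern, a `--logfile` option of this kind can be sketched roughly as follows. This is a minimal illustration and not the code in src/pudl/cli.py: the function names, the logger name `etl_sketch`, and the log format are all assumptions made for the example.

```python
import argparse
import logging


def parse_args(argv=None):
    """Parse command-line arguments for a hypothetical ETL entry point."""
    parser = argparse.ArgumentParser(description="Illustrative ETL runner.")
    parser.add_argument(
        "--logfile",
        default=None,
        help="If given, also write log output to this file.",
    )
    return parser.parse_args(argv)


def configure_logging(logfile=None):
    """Return a logger that writes to stderr and, optionally, to logfile."""
    # "etl_sketch" is a made-up logger name, not one used by pudl.
    logger = logging.getLogger("etl_sketch")
    logger.setLevel(logging.INFO)
    formatter = logging.Formatter("%(asctime)s [%(levelname)s] %(message)s")
    handlers = [logging.StreamHandler()]
    if logfile is not None:
        # Only attach a file handler when the user asked for one.
        handlers.append(logging.FileHandler(logfile))
    for handler in handlers:
        handler.setFormatter(formatter)
        logger.addHandler(handler)
    return logger


if __name__ == "__main__":
    args = parse_args()
    log = configure_logging(args.logfile)
    log.info("ETL started")
```

Keeping the stderr handler alongside the file handler means the output is still visible in `docker logs` while a persistent copy lands in the file.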

1. settings.yml copied from pudl-data-release repo
2. Dockerfile.git copies code from catalyst-cooperative github account
3. Dockerfile.local makes shallow git copy of a local repo

Notes:
Perhaps making a shallow copy of the local git repo is unnecessary and we can simply
copy the full state of the local repo. That may allow for testing/installing
uncommitted changes into the docker image, but may also increase the size of the
resultant image.

Perhaps there is a way to make pip install work w/o the need for git files
to be also present.
This will be useful for when we run the ETL in an automated container.
1. add data-release.sh script that will be run in the docker container
2. added PUDL_SETTINGS env variable used by the above script
3. mark /pudl/inputs/data as a mountable volume (this is where zenodo
   files live)
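
As a rough illustration of how an environment variable such as PUDL_SETTINGS can select the settings file, consider the sketch below. The actual logic lives in the data-release.sh shell script, which is not shown here; the helper name `resolve_settings_path` and the fallback behaviour are assumptions made for this example only.

```python
import os
from pathlib import Path


def resolve_settings_path(default, environ=None):
    """Return the settings file path, preferring the PUDL_SETTINGS variable.

    `default` is a caller-supplied fallback path (hypothetical); `environ`
    can be passed explicitly to make the function easy to test.
    """
    environ = os.environ if environ is None else environ
    value = environ.get("PUDL_SETTINGS", "").strip()
    # An unset or empty variable falls back to the default path.
    return Path(value) if value else Path(default)
```

With this shape, a container can be pointed at a different configuration at run time (for example via `docker run -e PUDL_SETTINGS=...`) without rebuilding the image.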

Caveats:
Shallow copying of the git history (intended to reduce image size) also
results in ignoring uncommitted local changes (which is not necessarily
what we want). We could build pypi tarballs locally, but that has other
problems (e.g. some build issues due to missing dependencies).
1. moved configs under release/settings
2. ensure that release script runs within conda env (this is kind of
   fragile)
Instead of doing shallow git clone, building pypi tarball and
installing that with pip seems to do the trick just fine and
may result in less clutter.
WIP: This is an attempt to build small docker image with
pudl. Installs pudl and its dependencies with pip and avoids
conda (which takes up a lot of space).
This should run tox tests on the output files, but right now it
doesn't really do much.

Pending:
output files will need to be located (/pudl/outputs/datapkg) and
transformed into the right formats before running tox tests on those.
1. clean up git repo copy from build image
2. add entrypoint for the tiny docker image
@rousik
Collaborator Author

rousik commented Nov 18, 2020

I think the right approach here would be to use "Squash and merge" to collapse the large number of minor commits into one logically consistent commit that adds all these files in their final form.

Member

@zaneselvans zaneselvans left a comment


Just a few questions on the files. Thank you for pulling all this together!

release/Dockerfile.git
release/Dockerfile.local
release/Dockerfile.git
release/settings/release.yml
# NOTE: coalmine_eia923 REQUIRES fuel_receipts_costs_eia923
- coalmine_eia923
- fuel_receipts_costs_eia923
eia923_years: [2018,2019]
Member


Were you having luck running with just 2 years? It often generates a bunch of internal consistency failures in the harvested tables (since any time there are 2 different values reported, it fails to meet the criteria for having a "consistent value" reported).
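
To make the failure mode concrete, here is a toy version of a "consistent value" check. It is a sketch under stated assumptions, not PUDL's harvesting code: the function name `consistent_value` and the 0.7 agreement threshold are chosen for illustration only.

```python
from collections import Counter


def consistent_value(values, strictness=0.7):
    """Return the most common value if enough records agree, else None.

    `strictness` is the minimum share of records that must report the same
    value; 0.7 is an assumed threshold used only for this illustration.
    """
    if not values:
        return None
    value, count = Counter(values).most_common(1)[0]
    return value if count / len(values) >= strictness else None


# Two years that agree pass the check.
two_years_agree = consistent_value(["Plant A", "Plant A"])

# Two years that disagree give a 50% share, which fails any threshold above 0.5.
two_years_disagree = consistent_value(["Plant A", "Plant A LLC"])

# With ten years, a single outlier still leaves a 90% share.
ten_years_one_outlier = consistent_value(["Plant A"] * 9 + ["Plant A LLC"])
```

With only two records per entity, any disagreement drops the majority share to one half, so no value can qualify; more years give the majority room to absorb an occasional outlier.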

Collaborator Author


I can't remember anymore. I may have added this file way before I was able to successfully run the pipeline so this test file may not be operational.

src/pudl/cli.py
release/data-release.sh
@zaneselvans zaneselvans merged commit 375dc01 into catalyst-cooperative:sprint27 Nov 19, 2020