Add docker build scripts #826

Merged

zaneselvans merged 33 commits into catalyst-cooperative:sprint27 from rousik:s27-docker

Nov 19, 2020

Collaborator

rousik commented Nov 18, 2020

This adds data-release scripts, configurations and Dockerfiles for building a variety of pudl Docker images. The feature is still somewhat experimental, so changes are to be expected. Several Docker flavors exist: some pull the code from the git repo, others pull it from a local copy. The two most developed flavors are Dockerfile.local, which pulls the local copy and uses a miniconda environment, and Dockerfile.tiny, which uses pip install and does not rely on conda. The image built from Dockerfile.tiny is smaller and simpler to use, since there is no trouble with ensuring that the right conda environment is activated.

An additional --logfile CLI argument allows directing ETL output into a specific file, which may be useful when running the ETL in an automated fashion.
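
For illustration, a flag like --logfile could be wired up roughly as follows. This is only a sketch using the standard argparse and logging modules; the function names, logger name and log format are assumptions made for the example, and the actual change in src/pudl/cli.py may look different.

```python
import argparse
import logging
import sys


def parse_args(argv=None):
    """Parse command-line arguments, including the optional --logfile flag."""
    parser = argparse.ArgumentParser(description="Illustrative ETL entry point.")
    parser.add_argument(
        "--logfile",
        default=None,
        help="If set, also write log output to this file.",
    )
    return parser.parse_args(argv)


def configure_logging(logfile=None, name="etl_sketch"):
    """Log to stderr, and additionally to `logfile` when one is given."""
    logger = logging.getLogger(name)  # hypothetical logger name
    logger.setLevel(logging.INFO)
    logger.handlers.clear()  # avoid duplicate handlers on repeated calls
    formatter = logging.Formatter("%(asctime)s [%(levelname)s] %(message)s")
    handlers = [logging.StreamHandler(sys.stderr)]
    if logfile is not None:
        handlers.append(logging.FileHandler(logfile))
    for handler in handlers:
        handler.setFormatter(formatter)
        logger.addHandler(handler)
    return logger
```

Calling configure_logging(parse_args(["--logfile", "etl.log"]).logfile) would then send every log record both to stderr and to etl.log, so an unattended container run leaves a file behind for later inspection.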

rousik added 30 commits

October 20, 2020 16:05


          Dockerfile for building pudl-dev conda env.

9077f00


          Add volumes and split input/output directories.

89b5d79


          Prep release/ directory with some basic docker work.

176c69b

1. settings.yml copied from pudl-data-release repo
2. Dockerfile.git copies code from catalyst-cooperative github account
3. Dockerfile.local makes shallow git copy of a local repo

Notes:
Perhaps making shallow copy of local git repo is unnecessary and we can simply
copy the full state of the local repo. That may allow for testing/installing
uncommitted changes into the docker image, but may also increase the size of the
resultant image.

Perhaps there is a way to make pip install work w/o the need for git files
to be also present.


          Switch ENV to ARGs, fix pudl_setup.

eed0d54


          WIP: Draft docker-compose file for dev purposes.

13d7ba1


          Hotfix: fix goodtables to a known good commit.

f428bbb


          Allow for logging into file.

474433e

This will be useful for when we run the ETL in an automated container.


          Few more changes to the containerization.

6b44122

1. add data-release.sh script that will be run in the docker container
2. added PUDL_SETTINGS env variable used by the above script
3. mark /pudl/inputs/data as a mountable volume (this is where zenodo
   files live)

Caveats:
shallow copying of git history (intended to reduce image size) also
results in ignoring unsubmitted local changes (which is not necessarily
what we want). We could build pypi tarballs locally but that has other
problems (e.g. some build issues due to missing dependencies)
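
The release script in this commit is a shell script, but the pattern of picking the settings file from the PUDL_SETTINGS environment variable can be sketched in Python. The helper name and the fallback path below are hypothetical and serve only to illustrate the idea.

```python
import os


def resolve_settings_path(environ=None, default="settings/release.yml"):
    """Return the settings file path from PUDL_SETTINGS, or a fallback.

    The default path is a placeholder for this sketch, not the path the
    real data-release.sh uses.
    """
    environ = os.environ if environ is None else environ
    # An empty value is treated the same as an unset variable.
    return environ.get("PUDL_SETTINGS") or default
```

Setting the variable when the container starts, for example with `docker run -e PUDL_SETTINGS=...`, would then choose the configuration without rebuilding the image.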


          Additional cleanup steps for running containers.

9ef8014

1. moved configs under release/settings
2. ensure that release script runs within conda env (this is kind of
   fragile)


          Install pudl with pypi tarball.

ad9f765

Instead of doing shallow git clone, building pypi tarball and
installing that with pip seems to do the trick just fine and
may result in less clutter.


          Rename artifacts generated with the test config.

ea5e2d1


          Lightweight docker image.

66fcc38

WIP: This is an attempt to build small docker image with
pudl. Installs pudl and its dependencies with pip and avoids
conda (which takes up a lot of space).


          WIP: Data validation docker image.

1a34d27

This should run tox tests on the output files, but right now it
doesn't really do much.

Pending:
output files will need to be located (/pudl/outputs/datapkg) and
transformed into the right formats before running tox tests on those.


          Add year 2019 to the release configuration.

0a7c3b4


          Some minor changes to the docker-compose.yml.

9cf5eba


          Update dockerfile instructions.

8a71a62

1. clean up git repo copy from build image
2. add entrypoint for the tiny docker image


          Dockerfile for building pudl-dev conda env.

8b44fab


          Add volumes and split input/output directories.

e4fb8e0


          Prep release/ directory with some basic docker work.

08ec55e

1. settings.yml copied from pudl-data-release repo
2. Dockerfile.git copies code from catalyst-cooperative github account
3. Dockerfile.local makes shallow git copy of a local repo

Notes:
Perhaps making shallow copy of local git repo is unnecessary and we can simply
copy the full state of the local repo. That may allow for testing/installing
uncommitted changes into the docker image, but may also increase the size of the
resultant image.

Perhaps there is a way to make pip install work w/o the need for git files
to be also present.


          Switch ENV to ARGs, fix pudl_setup.

d707cea


          WIP: Draft docker-compose file for dev purposes.

2724cfa


          Allow for logging into file.

This will be useful for when we run the ETL in an automated container.


          Few more changes to the containerization.

28e1941

1. add data-release.sh script that will be run in the docker container
2. added PUDL_SETTINGS env variable used by the above script
3. mark /pudl/inputs/data as a mountable volume (this is where zenodo
   files live)

Caveats:
shallow copying of git history (intended to reduce image size) also
results in ignoring unsubmitted local changes (which is not necessarily
what we want). We could build pypi tarballs locally but that has other
problems (e.g. some build issues due to missing dependencies)


          Additional cleanup steps for running containers.

a2233f5

1. moved configs under release/settings
2. ensure that release script runs within conda env (this is kind of
   fragile)


          Install pudl with pypi tarball.

ff4fd29

Instead of doing shallow git clone, building pypi tarball and
installing that with pip seems to do the trick just fine and
may result in less clutter.


          Rename artifacts generated with the test config.

f13ebaa


          Lightweight docker image.

65f4f28

WIP: This is an attempt to build small docker image with
pudl. Installs pudl and its dependencies with pip and avoids
conda (which takes up a lot of space).


          WIP: Data validation docker image.

9fe680c

This should run tox tests on the output files, but right now it
doesn't really do much.

Pending:
output files will need to be located (/pudl/outputs/datapkg) and
transformed into the right formats before running tox tests on those.


          Add year 2019 to the release configuration.

99607af


          Some minor changes to the docker-compose.yml.

33be97b

rousik added 3 commits

November 18, 2020 10:39


          Update dockerfile instructions.

24d75da

1. clean up git repo copy from build image
2. add entrypoint for the tiny docker image


          Merge branch 's26-docker' of github.com:rousik/pudl into s26-docker

7c4e7a3


          Remove an old experimental Dockerfile.

375dc01

rousik requested a review from zaneselvans

November 18, 2020 18:51

Collaborator Author

rousik commented Nov 18, 2020

I think the right approach here would be to use "Squash and merge" to collapse the large number of minor commits into one logically consistent commit that adds all these files in their final form.

zaneselvans reviewed

Member

zaneselvans left a comment

Just a few questions on the files. Thank you for pulling all this together!

release/Dockerfile.git (resolved)

release/Dockerfile.local (resolved)

release/Dockerfile.git (resolved)

release/settings/release.yml (resolved)

release/settings/test.yml

+                          # NOTE: coalmine_eia923 REQUIRES fuel_receipts_costs_eia923
+                          - coalmine_eia923
+                          - fuel_receipts_costs_eia923
+                        eia923_years: [2018,2019]

Member

zaneselvans Nov 19, 2020

Were you having luck running with just 2 years? It often generates a bunch of internal consistency failures in the harvested tables, since any time 2 different values are reported, neither one meets the criteria for being a "consistent value".

Collaborator Author

rousik Nov 19, 2020

I can't remember anymore. I may have added this file way before I was able to successfully run the pipeline so this test file may not be operational.
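
To see why two years of data can trip the consistency check, consider a simplified version of the criterion: a value is accepted only if it accounts for a large enough share of the reported values. The function below is a sketch under that assumption; the 0.7 threshold and the function name are illustrative and are not taken from the pudl code.

```python
from collections import Counter


def consistent_value(reported, threshold=0.7):
    """Return the most common reported value if its share meets the threshold.

    `threshold` is an assumed illustrative cutoff, not a confirmed pudl
    setting. Returns None when nothing was reported or when no single
    value is reported consistently enough.
    """
    if not reported:
        return None
    value, count = Counter(reported).most_common(1)[0]
    if count / len(reported) >= threshold:
        return value
    return None
```

With only two records, a single disagreement leaves each value with a 50% share, so no value can pass. With four records, three agreeing values give a 75% share and the majority value is accepted.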

src/pudl/cli.py (resolved)

release/data-release.sh (resolved)

zaneselvans merged commit 375dc01 into catalyst-cooperative:sprint27

rousik mentioned this pull request

Fix few issues surfaced in the previous PR #827

Merged
