Dev #898

zaneselvans · 2021-01-23T04:54:25Z

No description provided.

Previously the API_KEY_EIA value was being set as a global variable at the top of the module. This meant that the value of the constant reflected the environment at the time of module import. So if the user imports pudl (e.g. in a notebook) and then **later** sets the environment variable, the constant doesn't reflect that setting, meaning the API key is usually None. Obtaining the API key from `os.environ.get('API_KEY_EIA')` at the point where the key is used inside the functions avoids this issue.
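A minimal sketch of the difference, using only the standard library. The function name `api_key_at_call_time` and the key value are hypothetical and not taken from the pudl code; the snippet clears the variable first only so the demonstration is repeatable:

```python
import os

# Start from a clean state so the demonstration is repeatable.
os.environ.pop("API_KEY_EIA", None)

# Module-level constant: evaluated once, when the module is imported.
API_KEY_EIA = os.environ.get("API_KEY_EIA")


def api_key_at_call_time():
    """Hypothetical helper: read the key when it is actually needed."""
    return os.environ.get("API_KEY_EIA")


# The user sets the variable after the import has already happened.
os.environ["API_KEY_EIA"] = "example-key"

print(API_KEY_EIA)             # None: still the import-time value
print(api_key_at_call_time())  # example-key
```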

1. transpose test metadata files
2. add file_map.csv to the metadata
3. modify test calls to account for the changed API (using partition instead of fixed year)

@ezwelty

After seeing the speedups that @ezwelty was able to eke out of the FERC 714 ETL process and missing value imputations, I've become suspicious that the EPA CEMS ETL does not need to take anywhere near as long as it currently does. I did some poking around to see where the bottleneck is, and it seems to be the datastore -- not the extract or transform. While I haven't tracked down the issue, I did move some of the constants that pertain only to the EPA CEMS ETL into that module, and out of the constants.py graveyard.

Use abstract representations for resources and descriptors, and implement LocalFileCache, GoogleCloudStorageCache and LayeredCache. This is not guaranteed to be fully operational or correct. Also added support for setting up a GCS-based cache when run as a standalone script.
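The layering idea can be sketched roughly as follows. This is an illustration under stated assumptions, not the code from this pull request: `InMemoryCache` and `SketchLayeredCache` are hypothetical stand-ins, and the policy of writing to every layer is an assumption made for the example.

```python
from typing import Dict, List, Optional


class InMemoryCache:
    """Hypothetical stand-in for one layer (e.g. local files or a GCS bucket)."""

    def __init__(self) -> None:
        self._blobs: Dict[str, bytes] = {}

    def contains(self, key: str) -> bool:
        return key in self._blobs

    def get(self, key: str) -> Optional[bytes]:
        return self._blobs.get(key)

    def set(self, key: str, value: bytes) -> None:
        self._blobs[key] = value


class SketchLayeredCache:
    """Hypothetical layered cache: reads check layers in order."""

    def __init__(self, *layers: InMemoryCache) -> None:
        self._layers: List[InMemoryCache] = list(layers)

    def get(self, key: str) -> Optional[bytes]:
        # Return the value from the first layer that has the key.
        for layer in self._layers:
            if layer.contains(key):
                return layer.get(key)
        return None

    def set(self, key: str, value: bytes) -> None:
        # Assumption: writes go to every layer. With no layers, this is a no-op.
        for layer in self._layers:
            layer.set(key, value)


fast, slow = InMemoryCache(), InMemoryCache()
slow.set("a.zip", b"data")
cache = SketchLayeredCache(fast, slow)
print(cache.get("a.zip"))  # b'data', found in the second layer
```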

1. clean up some api method names
2. add type hinting to some methods
3. add tests for PudlFileReference

And fixed some oddities encountered during the testing.

1. replace PudlFileResource with PudlResourceKey, reduced functionality
2. simplified ZenodoFetcher logic
3. refreshed unit-test to cover the new simple api

This is WIP and more work needs to happen to make it actually work, specifically:
1. Integrate Datastore class with these simpler classes
2. Rework caching layers for local files and gcs
3. Figure out how to configure datastore caching parameters via command line flags (use local fs or not, use gcs and which bucket and/or prefix)

1. use utf-8 encoding when storing datapackage.json in cache
2. do not add /data subdirectory when initializing LocalFileCache. It is up to the caller to determine the right cache path.
3. add some debug statements to datastore code
4. add tests for name based resource lookups
5. bypass set/delete if there are no layers in the LayeredCache

This refactors the ETL part of the code to use the new Datastore API. Dataset specific implementations of datastore are now implemented as wrappers that provide extra functionality. This makes the passing of datastore somewhat easier because we can create one instance of generic datastore that can be passed to DatasetPipeline objects that may need it and they can optionally wrap it. epaipm datastore logic has been consolidated under extract/epaipm.py and some constants have been moved into this module.

1. when matching resource partitions, cast all values to strings so that we don't get false negatives when a year may be either a string or an integer.
2. add get_known_datasets method that will simplify the CLI code
3. Fix error handling in Datastore.get_unique_resource. Always raise KeyError and properly handle non-existing resources.
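The string cast in the first item can be illustrated with a small sketch. The helper `partitions_match` is hypothetical and only shows the comparison technique; it is not the datastore's actual matching code.

```python
from typing import Any, Dict


def partitions_match(resource_parts: Dict[str, Any], filters: Dict[str, Any]) -> bool:
    """Hypothetical helper: compare partition values as strings.

    Returns True when every filter key is present on the resource and the
    two values are equal after both are converted with str().
    """
    return all(
        key in resource_parts and str(resource_parts[key]) == str(value)
        for key, value in filters.items()
    )


# Metadata stores the year as a string, the caller passes an integer.
print(partitions_match({"year": "2019"}, {"year": 2019}))  # True
print("2019" == 2019)                                      # False without the cast
```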

1. expose datastore.cache so that it can be correctly initialized when --populate-gcs-cache is set
2. added --partition flag so that we can select which resources we want to retrieve
3. strip leading slashes from bucket paths so that we don't end up with mangled paths like gs://bucket//prefix/blob.zip
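A rough sketch of the slash handling in the last item. The function `blob_url` is hypothetical, and it strips slashes from both ends of each piece, which is slightly more than the commit message describes:

```python
def blob_url(bucket: str, prefix: str, name: str) -> str:
    """Hypothetical helper: build a gs:// URL without doubled slashes."""
    # Drop surrounding slashes from each piece, then skip empty pieces.
    parts = [piece.strip("/") for piece in (prefix, name)]
    path = "/".join(piece for piece in parts if piece)
    return f"gs://{bucket}/{path}"


print(blob_url("bucket", "/prefix", "blob.zip"))  # gs://bucket/prefix/blob.zip
```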

Consolidate interim ETL / output tests

Jupyterhub beta

1. adds `dfc.fanout()` task to split large DataFrameCollections
2. adds `__len__()` function to figure out how many tables are in the collection by calling `len(dfc)`
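For readers unfamiliar with the `__len__()` protocol, here is a minimal sketch. `TableCollection` is a hypothetical stand-in, not the DataFrameCollection class from this pull request, and it stores arbitrary objects rather than dataframes:

```python
from typing import Any, Dict


class TableCollection:
    """Hypothetical collection that keeps named tables in a dict."""

    def __init__(self) -> None:
        self._tables: Dict[str, Any] = {}

    def store(self, name: str, table: Any) -> None:
        self._tables[name] = table

    def __len__(self) -> int:
        # len(collection) reports how many tables are stored.
        return len(self._tables)


collection = TableCollection()
collection.store("plants", [1, 2, 3])
collection.store("generators", [])
print(len(collection))  # 2
```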

* Moved all unit tests into a hierarchy under tests/unit
* Removed extraneous pytest options from tox.ini that are already set in pytest.ini
* The `unit` and `etl` testenvs in tox.ini now both append to the coverage report.
* The CI testenv now runs `coverage erase` to start with a fresh slate
* Removed unused @pytest.mark.datapkg decorators / cli option
* PyTest will now show warnings (of which there are a few)

Closes #893

Clean up PyTest config, coverage generation, unit tests

Implementation of DataFrameCollection

…in CodeCov

Sprint29

review-notebook-app · 2021-01-23T04:54:29Z

Check out this pull request on ReviewNB.

See visual diffs & provide feedback on Jupyter Notebooks.

Powered by ReviewNB

codecov · 2021-01-23T05:08:14Z

Codecov Report

Merging #898 (1e52b3b) into main (dc258e0) will increase coverage by 11.08%.
The diff coverage is 84.38%.

@@             Coverage Diff             @@
##             main     #898       +/-   ##
===========================================
+ Coverage   67.13%   78.21%   +11.08%     
===========================================
  Files          44       45        +1     
  Lines        5556     5801      +245     
===========================================
+ Hits         3730     4537      +807     
+ Misses       1826     1264      -562

Impacted Files	Coverage Δ
src/pudl/convert/datapkg_to_sqlite.py	`40.28% <0.00%> (ø)`
src/pudl/helpers.py	`91.70% <ø> (ø)`
src/pudl/transform/eia860.py	`96.77% <ø> (ø)`
src/pudl/extract/epaipm.py	`47.22% <46.67%> (+14.87%)`	⬆️
src/pudl/output/pudltabl.py	`69.88% <66.67%> (+11.24%)`	⬆️
src/pudl/workspace/datastore.py	`71.03% <73.26%> (+22.61%)`	⬆️
src/pudl/workspace/resource_cache.py	`84.11% <84.11%> (ø)`
src/pudl/etl.py	`81.33% <92.31%> (+0.33%)`	⬆️
src/pudl/extract/ferc1.py	`78.63% <96.30%> (+0.16%)`	⬆️
src/pudl/extract/epacems.py	`97.56% <96.97%> (+3.81%)`	⬆️
... and 18 more

Continue to review full report at Codecov.

Legend:
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update dc258e0...1e52b3b. Read the comment docs.

zaneselvans and others added 30 commits December 14, 2020 14:24

Consolidate tutorial notebooks for new repo

19f9647

Merge branch 'sprint29' into jupyterhub-beta

ae30f17

Merge branch 'sprint29' into jupyterhub-beta

6d0dae6

Add API_KEY_EIA to error output in FRC table.

48072a2

Remove non-working jupyterlab extensions and pin jedi version.

6c74ee3

Merge branch 'sprint29' into jupyterhub-beta

ff22cbb

Add unit testenv for running unit tests under src/pudl

d66ffaf

Fix excel_test unit test.

c0187e6

1. transpose test metadata files
2. add file_map.csv to the metadata
3. modify test calls to account for the changed API (using partition instead of fixed year)

WIP: Refactor datastore for parallel execution.

2a05ab3

Refactoring of datastore.

bbca908

1. clean up some api method names
2. add type hinting to some methods
3. add tests for PudlFileReference

Added more datastore unit tests.

5e0ed4a

And fixed some oddities encountered during the testing.

Refactoring datastore, round 2.

b6a9e69

Add cache layer tests.

13feda9

Allow looking up datastore resources by name.

aeef7fa

Add read-only mode to resource caches.

13b6580

Split of caches to a separate module.

a58641b

Minor refactoring of the datastore fixture code.

d4f85e0

Add google-cloud-storage as install dependency.

6b5dd20

Add eia860m DOIs to the refactored datastore.

7f68cf6

Fix typos in the code.

f0de397

Fix test_ferc1_lost_data method (ds refactoring)

8fc3d9e

Hide internals cache of Datastore for cleaner design.

0a90c6c

zaneselvans and others added 27 commits January 20, 2021 15:49

pin bitarray version because their 1.6.2 release is b0rken

e25c0f6

Merge pull request #892 from catalyst-cooperative/epacems-sqlite-bugfix

67af640

Consolidate interim ETL / output tests

Merge branch 'sprint29' into jupyterhub-beta

dd1efcb

Merge branch 'sprint29' into dfc

6407682

Swap os.path.dirname for Path.parent

7308aa3

Unpin bitarray since it has been fixed upstream.

c16c1c1

Mark Zenodo integration tests as xfail on network issues.

c373755

Merge branch 'sprint29' into jupyterhub-beta

70cce4e

Remove jlab build script as it's obsolete w/ jlab 3

f633069

Merge pull request #894 from catalyst-cooperative/jupyterhub-beta

f0cdc8c

Jupyterhub beta

Tutorial notebooks have moved into the pudl-tutorials repo

2061995

Improvements to the DataFrameCollection code.

ce83563

1. adds `dfc.fanout()` task to split large DataFrameCollections
2. adds `__len__()` function to figure out how many tables are in the collection by calling `len(dfc)`

Update docs files for dfc module.

01a2e6d

Merge pull request #896 from catalyst-cooperative/tox-unit-tests-3

de943a3

Clean up PyTest config, coverage generation, unit tests

Merge branch 'sprint29' into dfc

af52ece

Moved dfc_test.py into test/unit

2796a8a

Merge pull request #887 from catalyst-cooperative/dfc

a46e05a

Implementation of DataFrameCollection

Move goodtables-pandas-py to conda from pip dependencies

df976f3

Set fetch-depth: 2 in github actions checkout for SHA identification …

df87297

…in CodeCov

Debug Docker build, ignore Zenodo network exceptions

e25d1bb

Debugging Docker build-push

b8655b9

Debugging Docker build

3b44402

Debugging Docker build & push

d36dad2

Giving up on debugging -- reverting to original-ish version

82db916

Only log in to Docker Hub if it's not a PR

8e3830b

Merge pull request #897 from catalyst-cooperative/sprint29

1e52b3b

Sprint29

zaneselvans merged commit 94ffc26 into main Jan 23, 2021
