Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Dev #898

Merged
merged 116 commits into from
Jan 23, 2021
Merged

Dev #898

merged 116 commits into from
Jan 23, 2021

Conversation

zaneselvans
Copy link
Member

No description provided.

zaneselvans and others added 30 commits December 14, 2020 14:24
Previously the API_KEY_EIA value was being set as a global variable at
the top of the module. This meant that the value of the constant
reflected the value at time of module import. So if the user imports
pudl (e.g.  in a notebook) and then **later** sets the environment
variable, the value of the constant doesn't reflect that setting,
meaning the API key is usually None.

Obtaining the API key from `os.environ.get('API_KEY_EIA') when the key
is used inside the functions avoids this issue.
1. transpose test metadata files
2. add file_map.csv to the metadata
3. modify test calls to accout for the changed API (using partition
instead of fixed year)
After seeing the speedups that @ezwelty was able to eek out of the FERC
714 ETL process and missing value imputations, I've become suspicious
that the EPA CEMS ETL does not need to take anywhere near as long as it
currently does.  I did some poking around to see where the bottleneck
is, and it seems to be the datastore -- not the extract or transform.
While I haven't tracked down the issue, I did move some of the constants
that pertain only to the EPA CEMS ETL into that module, and out of the
constants.py graveyard.
Use abstract representations for resources and descriptors, implement
LocalFileCache, GoogleCloudStorageCache and LayeredCache
implementations.

This is not guaranteed to be actually truly operational and correct.

Also added support for setting up GCS based cache when run as a
standalone script.
1. clean up some api method names
2. add type hinting to some methods
3. add tests for PudlFileReference
And fixed some oddities encountered during the testing.
1. replace PudlFileResource with PudlResourceKey, reduced functionality
2. simplified ZenodoFetcher logic
3. refreshed unit-test to cover the new simple api

This is WIP and more work needs to happen to make it actually work,
specifically:
1. Integrate Datastore class with these simpler classes
2. Rework caching layers for local files and gcs
3. Figure out how to configure datastore caching parameters via command
   line flags (use local fs or not, use gcs and which bucket and/or
   prefix)
1. use utf-8 encoding when storing datapackage.json in cache
2. do not add /data subdirectory when initializing LocalFileCache. It
   is up to the caller to determine the right cache path.
3. add some debug statements to datastore code
4. add tests for name based resource lookups
5. bypass set/delete if there are no layers in the LayeredCache
This refactors the ETL part of the code to use the new Datastore API.

Dataset specific implementations of datastore are now implemented as
wrappers that provide extra functionality. This makes the passing of
datastore somewhat easier because we can create one instance of generic
datastore that can be passed to DatasetPipeline objects that may need
it and they can optionally wrap it.

epaipm datastore logic has been consolidated under extract/epaipm.py and
some constants have been moved into this module.
1. whem matching resource partitions, cast all values to string so
   that we don't get false negatives when dealing with year that may
   be string or integer.
2. add get_known_datasets method that will simplify the CLI code
3. Fix error handling in Datastore.get_unique_resource. Always throw
   KeyError and properly handle non-existing resources.
1. expose datastore.cache so that it can be correctly initialized
   when --populate-gcs-cache is set
2. added --partition flag so that we can select which resources we
   want to retrieve
3. strip leading slashes from bucket paths so that we don't end up
   with mangled paths like gs://bucket//prefix/blob.zip
zaneselvans and others added 27 commits January 20, 2021 15:49
1. adds dfc.fanout() task to split large DataFrameCollections
2. adds __len__() function to figure out how many tables are in the
collection by calling len(dfc)
* Moved all unit tests into a hierarchy under tests/unit
* Removed extraneous pytest options in tox.ini being set in pytest.ini
* The `unit` and `etl` testenvs in tox.ini now both append to the coverager
  report.
* The CI testenv now runs `coverage erase` to start with a fresh slate
* Removed unused @pytest.mark.datapkg decorators / cli option
* PyTest will now show warnings (of which there are a few)

Closes #893
Clean up PyTest config, coverage generation, unit tests
Implementation of DataFrameCollection
@review-notebook-app
Copy link

Check out this pull request on  ReviewNB

See visual diffs & provide feedback on Jupyter Notebooks.


Powered by ReviewNB

@codecov
Copy link

codecov bot commented Jan 23, 2021

Codecov Report

Merging #898 (1e52b3b) into main (dc258e0) will increase coverage by 11.08%.
The diff coverage is 84.38%.

Impacted file tree graph

@@             Coverage Diff             @@
##             main     #898       +/-   ##
===========================================
+ Coverage   67.13%   78.21%   +11.08%     
===========================================
  Files          44       45        +1     
  Lines        5556     5801      +245     
===========================================
+ Hits         3730     4537      +807     
+ Misses       1826     1264      -562     
Impacted Files Coverage Δ
src/pudl/convert/datapkg_to_sqlite.py 40.28% <0.00%> (ø)
src/pudl/helpers.py 91.70% <ø> (ø)
src/pudl/transform/eia860.py 96.77% <ø> (ø)
src/pudl/extract/epaipm.py 47.22% <46.67%> (+14.87%) ⬆️
src/pudl/output/pudltabl.py 69.88% <66.67%> (+11.24%) ⬆️
src/pudl/workspace/datastore.py 71.03% <73.26%> (+22.61%) ⬆️
src/pudl/workspace/resource_cache.py 84.11% <84.11%> (ø)
src/pudl/etl.py 81.33% <92.31%> (+0.33%) ⬆️
src/pudl/extract/ferc1.py 78.63% <96.30%> (+0.16%) ⬆️
src/pudl/extract/epacems.py 97.56% <96.97%> (+3.81%) ⬆️
... and 18 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update dc258e0...1e52b3b. Read the comment docs.

@zaneselvans zaneselvans merged commit 94ffc26 into main Jan 23, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants