Dev #898
Conversation
Previously the `API_KEY_EIA` value was set as a global variable at the top of the module, so the constant reflected the environment at module import time. If the user imports pudl (e.g. in a notebook) and only **later** sets the environment variable, the constant doesn't pick up that value and the API key is usually `None`. Reading the key with `os.environ.get('API_KEY_EIA')` inside the functions that use it avoids this issue.
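A minimal sketch of the call-time lookup pattern described above; the function name is illustrative, not the actual PUDL API:

```python
import os

def get_eia_api_key() -> str:
    # Look the key up when it is needed, not at module import time, so an
    # environment variable exported after `import pudl` is still picked up.
    api_key = os.environ.get("API_KEY_EIA")
    if api_key is None:
        raise RuntimeError("API_KEY_EIA is not set; export it before calling the EIA API.")
    return api_key
```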
1. Transpose the test metadata files. 2. Add file_map.csv to the metadata. 3. Modify the test calls to account for the changed API (using partition instead of a fixed year).
After seeing the speedups that @ezwelty was able to eke out of the FERC 714 ETL process and missing value imputations, I've become suspicious that the EPA CEMS ETL does not need to take anywhere near as long as it currently does. I did some poking around to see where the bottleneck is, and it seems to be the datastore, not the extract or transform steps. While I haven't tracked down the issue, I did move some of the constants that pertain only to the EPA CEMS ETL into that module and out of the constants.py graveyard.
Use abstract representations for resources and descriptors, and implement LocalFileCache, GoogleCloudStorageCache, and LayeredCache. This is not yet guaranteed to be fully operational and correct. Also added support for setting up a GCS-based cache when run as a standalone script.
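A minimal sketch of the layered-cache idea, assuming a simple get/add/contains interface; the class and method names here are illustrative and may not match the actual PUDL datastore code:

```python
from abc import ABC, abstractmethod

class AbstractCache(ABC):
    """Cache interface: resources are addressed by a string key."""

    @abstractmethod
    def get(self, key: str) -> bytes: ...

    @abstractmethod
    def add(self, key: str, content: bytes) -> None: ...

    @abstractmethod
    def contains(self, key: str) -> bool: ...

class LayeredCache(AbstractCache):
    """Consults a sequence of caches in order; the first hit wins."""

    def __init__(self, *layers: AbstractCache):
        self._layers = list(layers)

    def get(self, key: str) -> bytes:
        for layer in self._layers:
            if layer.contains(key):
                return layer.get(key)
        raise KeyError(key)

    def add(self, key: str, content: bytes) -> None:
        # No-op when no layers are configured.
        for layer in self._layers:
            layer.add(key, content)

    def contains(self, key: str) -> bool:
        return any(layer.contains(key) for layer in self._layers)
```

In this sketch a local filesystem layer would sit in front of a GCS layer, so reads fall back to the remote cache only when the local copy is missing.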
1. Clean up some API method names. 2. Add type hints to some methods. 3. Add tests for PudlFileReference.
Also fixed some oddities encountered during testing.
1. Replace PudlFileResource with PudlResourceKey, which has reduced functionality. 2. Simplify the ZenodoFetcher logic. 3. Refresh the unit tests to cover the new, simpler API. This is WIP and more work is needed to make it actually work, specifically: 1. Integrate the Datastore class with these simpler classes. 2. Rework the caching layers for local files and GCS. 3. Figure out how to configure datastore caching parameters via command-line flags (whether to use the local filesystem, whether to use GCS, and which bucket and/or prefix).
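A rough sketch of what a reduced-functionality resource key could look like; the field names are assumptions based on the dataset/DOI/filename addressing mentioned in these commits, not the definitive implementation:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class PudlResourceKey:
    # Identifies a single resource with no behavior attached: which dataset
    # it belongs to, which Zenodo deposition (DOI) it came from, and the
    # file name within that deposition.
    dataset: str
    doi: str
    name: str
```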
1. Use UTF-8 encoding when storing datapackage.json in the cache. 2. Do not add a /data subdirectory when initializing LocalFileCache; it is up to the caller to determine the right cache path. 3. Add some debug statements to the datastore code. 4. Add tests for name-based resource lookups. 5. Bypass set/delete when there are no layers in the LayeredCache.
This refactors the ETL part of the code to use the new Datastore API. Dataset-specific datastores are now implemented as wrappers that provide extra functionality on top of a generic Datastore. This makes passing the datastore around easier: we can create one instance of the generic datastore, pass it to the DatasetPipeline objects that may need it, and let them optionally wrap it. The epaipm datastore logic has been consolidated under extract/epaipm.py and some constants have been moved into that module.
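A hypothetical sketch of the wrapper pattern: a dataset-specific datastore wraps a shared generic Datastore instance rather than subclassing it. The `get_resource(dataset, name) -> bytes` interface is an assumption for illustration, not necessarily the real PUDL Datastore API:

```python
import io
import pandas as pd

class EpaIpmDatastore:
    """Dataset-specific conveniences layered over a generic datastore."""

    def __init__(self, datastore):
        # The same generic datastore instance can be handed to every
        # DatasetPipeline; only pipelines that need extra behavior wrap it.
        self.datastore = datastore

    def get_dataframe(self, name: str) -> pd.DataFrame:
        raw = self.datastore.get_resource("epaipm", name)
        return pd.read_csv(io.BytesIO(raw))
```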
1. When matching resource partitions, cast all values to strings so that we don't get false negatives when a year may be either a string or an integer. 2. Add a get_known_datasets method that will simplify the CLI code. 3. Fix error handling in Datastore.get_unique_resource: always throw KeyError and properly handle non-existent resources.
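An illustrative helper showing the string-cast comparison from point 1; this is a sketch, not the actual PUDL matching code:

```python
def partition_matches(parts: dict, **filters) -> bool:
    # Casting both sides to str means year=2019 (int) still matches a
    # partition declared as "2019" (str), avoiding false negatives.
    return all(str(parts.get(k)) == str(v) for k, v in filters.items())

assert partition_matches({"year": "2019", "state": "co"}, year=2019)
```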
1. Expose datastore.cache so that it can be correctly initialized when --populate-gcs-cache is set. 2. Add a --partition flag so that we can select which resources to retrieve. 3. Strip leading slashes from bucket paths so that we don't end up with mangled paths like gs://bucket//prefix/blob.zip.
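An illustrative path normalization for point 3: strip stray slashes before joining so a bucket/prefix pair never yields a double slash. The function name and signature are assumptions, not the real implementation:

```python
def gcs_blob_path(bucket: str, prefix: str, name: str) -> str:
    # Drop leading/trailing slashes from each component, then rejoin.
    parts = [p.strip("/") for p in (prefix, name) if p and p.strip("/")]
    return f"gs://{bucket.strip('/')}/" + "/".join(parts)

assert gcs_blob_path("bucket", "/prefix/", "blob.zip") == "gs://bucket/prefix/blob.zip"
```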
Consolidate interim ETL / output tests
JupyterHub beta
1. Add a dfc.fanout() task to split large DataFrameCollections. 2. Add a __len__() method so the number of tables in the collection can be found by calling len(dfc).
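A hypothetical, simplified DataFrameCollection illustrating the two additions: __len__ reports how many named tables the collection holds, and fanout() splits one large collection into several smaller ones so downstream tasks can process them in parallel. The real PUDL class stores its frames differently; this is only a sketch:

```python
import pandas as pd

class DataFrameCollection:
    def __init__(self, **tables: pd.DataFrame):
        self._tables = dict(tables)

    def __len__(self) -> int:
        return len(self._tables)

    def fanout(self, chunk_size: int = 1):
        # Split the named tables into chunks of at most chunk_size each.
        items = list(self._tables.items())
        return [
            DataFrameCollection(**dict(items[i:i + chunk_size]))
            for i in range(0, len(items), chunk_size)
        ]

dfc = DataFrameCollection(a=pd.DataFrame(), b=pd.DataFrame(), c=pd.DataFrame())
assert len(dfc) == 3 and len(dfc.fanout()) == 3
```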
* Moved all unit tests into a hierarchy under tests/unit
* Removed extraneous pytest options from tox.ini that were already set in pytest.ini
* The `unit` and `etl` testenvs in tox.ini now both append to the coverage report
* The CI testenv now runs `coverage erase` to start with a fresh slate
* Removed the unused @pytest.mark.datapkg decorators / CLI option
* PyTest will now show warnings (of which there are a few)

Closes #893
Clean up PyTest config, coverage generation, unit tests
Implementation of DataFrameCollection
Codecov Report

| | main | #898 | +/- |
|----------|-------:|-------:|--------:|
| Coverage | 67.13% | 78.21% | +11.08% |
| Files | 44 | 45 | +1 |
| Lines | 5556 | 5801 | +245 |
| Hits | 3730 | 4537 | +807 |
| Misses | 1826 | 1264 | -562 |

Continue to review the full report at Codecov.