Integrate 2020 data for ferc1, eia860, eia923 #1297
Merged
* Replaced the ferc1-eia-glue.ipynb notebook with a script.
* Reduced the number of output files to reflect information we're actually paying attention to -- just the plants/utilities to be mapped, one file each for EIA and FERC 1
* Added assertions about having all the years loaded in your DBs
* Added assertions that check for lost plants and utilities
* Added `link_to_ferc1` boolean columns to the EIA plants/utils outputs that indicate which records should be considered for linkage to their FERC 1 counterparts.
* devtools/ferc1-eia-glue/find_unmapped_plants_utils.py script includes a help message (-h or --help) that explains how the outputs are being used.

In service of issue #1069 and others
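For readers who want a concrete picture of the checks listed above, here is a minimal sketch in plain Python. It is not the code in find_unmapped_plants_utils.py: the function names, arguments, and the `is_candidate` predicate are all hypothetical, and the real criteria for setting `link_to_ferc1` are not stated in this thread.

```python
def assert_years_loaded(loaded_years, expected_years, db_name):
    """Raise if any expected year is absent from the loaded data.

    All three arguments are hypothetical stand-ins for whatever the real
    script reads from the databases.
    """
    missing = sorted(set(expected_years) - set(loaded_years))
    if missing:
        raise AssertionError(f"{db_name} is missing data for years: {missing}")


def find_lost_ids(previously_mapped_ids, current_ids):
    """Return IDs that were mapped before but no longer show up in the data."""
    return sorted(set(previously_mapped_ids) - set(current_ids))


def flag_link_candidates(records, is_candidate):
    """Add a boolean link_to_ferc1 field to each record (a dict).

    The rule for deciding candidacy is an assumption supplied by the caller
    via is_candidate; this sketch does not know the real rule.
    """
    return [{**rec, "link_to_ferc1": bool(is_candidate(rec))} for rec in records]
```

Using set differences keeps both checks independent of row order and duplicates, which is usually what you want when comparing ID lists across years.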
* Moved find_new_ferc1_strings from the ferc1 new year notebook into helpers.py
* Added 2020 fuel type, unit, plant construction type, etc strings to the FERC 1 string maps in pudl.transform.ferc1
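The idea behind looking for new strings can be sketched roughly as follows. This is not the actual `find_new_ferc1_strings` helper, whose signature is not shown here; the function name, the shape of the string map, and the example values are all assumptions made for illustration.

```python
def find_unmapped_strings(observed, string_map):
    """Return cleaned strings that appear in the data but in no map entry.

    string_map is assumed to map a canonical category to a list of known
    free-form spellings, e.g. {"gas": ["gas", "natural gas"]}.
    """
    known = {s for variants in string_map.values() for s in variants}
    # Normalize the same way for every value: strip whitespace, lowercase,
    # and ignore missing or empty entries.
    cleaned = {s.strip().lower() for s in observed if s and s.strip()}
    return sorted(cleaned - known)


# Hypothetical example data, not taken from the real FERC 1 string maps.
example_map = {
    "gas": ["gas", "natural gas", "nat gas"],
    "coal": ["coal", "bit coal"],
}
new_strings = find_unmapped_strings(
    ["Natural Gas", "coal ", "ng-2020", "", None], example_map
)
```

Anything returned by a check like this would then be reviewed by hand and added to the appropriate category in the map.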
Only one unmapped EIA utility remains: "marathon oil co", with utility_id_eia 11631.

However, the ETL fails when trying to pull in the glue tables with the following error:

        # At this point there should be at most one row in each of these data
        # frames with NaN values after we drop_duplicates in each. This is because
        # there will be some plants and utilities that only exist in FERC, or only
        # exist in EIA, and while they will have PUDL IDs, they may not have
        # FERC/EIA info (and it'll get pulled in as NaN)
        for df, df_n in zip(
            [plants_eia, plants_ferc1, utilities_eia, utilities_ferc1],
            ['plants_eia', 'plants_ferc1', 'utilities_eia', 'utilities_ferc1']
        ):
            if df[pd.isnull(df).any(axis=1)].shape[0] > 1:
    >           raise AssertionError(
                    f"FERC to EIA glue breaking in {df_n}. There are too many null "
                    "fields. Check the mapping spreadhseet.")
    E           AssertionError: FERC to EIA glue breaking in plants_eia. There are too many null fields. Check the mapping spreadhseet.

There are a few pages of EIA plants that have plant_id_eia values but no plant names. Maybe that's the problem? If I log a warning instead of raising an assertion things seem to work...
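The "log a warning instead of raising" workaround mentioned above might look something like the sketch below. It is a plain-Python stand-in, not the pandas code from the traceback: tables are modeled as lists of dicts with `None` for missing values, and the function names and the `strict` switch are hypothetical.

```python
import logging

logger = logging.getLogger(__name__)


def count_rows_with_nulls(rows):
    """Count rows (dicts) in which at least one value is None."""
    return sum(1 for row in rows if any(v is None for v in row.values()))


def check_glue_nulls(tables, max_null_rows=1, strict=False):
    """Check each named table for too many rows containing nulls.

    Returns {table_name: null_row_count} for the tables over the limit.
    With strict=True the first offender raises, mirroring the original
    assertion; otherwise a warning is logged and the check continues.
    """
    offenders = {}
    for name, rows in tables.items():
        n_null = count_rows_with_nulls(rows)
        if n_null > max_null_rows:
            msg = (
                f"FERC to EIA glue: {n_null} rows with null fields in {name} "
                f"(limit is {max_null_rows}). Check the mapping spreadsheet."
            )
            if strict:
                raise AssertionError(msg)
            logger.warning(msg)
            offenders[name] = n_null
    return offenders
```

Returning the offending tables rather than only logging them makes it possible to keep the ETL running while still surfacing the problem in a test or report.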
zaneselvans commented on Oct 26, 2021
The full ETL with all FERC1 and EIA 860/923 data will run without obvious errors. There are still tests and validations that fail, but at least you can load the DB. This does *not* include eia860m or EPA CEMS data yet. FERC-714 and EIA-861 also remain to be updated for 2020.

Issues that remain:

* Something screwy is going on with FERC respondent 542 -- it shows up only in the `f1_respondent_id` table, and has all Null data there... and our unmapped utility finder script failed to identify it. See #1304
* A defensive assertion aimed at identifying human errors in the ID mapping sheet is failing because (probably?) we have a fair number of plants and utilities with IDs but no names in there now. See #1305 and also #1232
I ran the full ETL with eia860m integrated to get the next round of IDs for mapping, and found that there was only 1 additional unmapped EIA utility and no additional plants. I also created a new eia860m archive with the newest data (through 2021-08). It seems to "just work".

I also updated the eia861 Zenodo archive to include the 2020 data, and switched the DOIs over to the 2020-inclusive archives for ferc714 and eia861. For eia861 I've updated the filemap.csv for 2020, but the column maps and other per-table metadata files still need to be updated. For ferc714 I updated the filenames; it gets through "extract" and then raises an AssertionError in the transform step for the hourly demand by planning area, which will require some actual debugging.

I also created a new archive for the eia923, which saw updates on October 8th. Thankfully it didn't break anything big -- it just changed file names and added the balancing_authority_code_eia to the gf, bf, frc, and gen tables (it gets harvested out).
* Updated README & release notes to reflect state of 2020 data availability
* Switched back to using short-codes for the fuel transportation modes due to null columns (enum constraint was tied to labels, rather than codes, and this is going to get changed when we merge in the dev / metadata stuff anyway)
* Switched back to calling the unidentified fuel type codes in FERC 1 "unknown" rather than "other" -- this will also get overwritten by the new metadata stuff.
* Fixed EPA CEMS tests so that they work with --live-dbs
* Lowered the lower bound on natural gas price aggregations in the data validations to accommodate the insanely low gas prices from 2020 *le sigh*

The only known integration test and data validation failures which remain at this point are:

* Expected issues with the EIA-861 / FERC-714 interim ETL or derived outputs
* The expected number of rows in all of the tables has yet to be updated. However they all look like they're off by reasonable amounts (mid-single-digit percentages) given that we are adding one new year of data to 19 existing years of data.
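The "mid-single-digit percentages" reasoning above is simple arithmetic: if each year contributed roughly the same number of rows, adding 1 year to 19 would grow a table by about 1/19, or roughly 5.3%. A small sketch of that sanity check follows; the function names and the 10% tolerance are assumptions for illustration, not the thresholds used in the actual data validations.

```python
def expected_growth(existing_years, new_years=1):
    """Relative row growth if every year contributes equally many rows."""
    return new_years / existing_years


def row_count_change(old_rows, new_rows, max_rel_change=0.10):
    """Return (within_tolerance, relative_change) for a table's row count.

    max_rel_change is a hypothetical tolerance, not a value from the
    project's validation tests.
    """
    rel_change = (new_rows - old_rows) / old_rows
    return abs(rel_change) <= max_rel_change, rel_change


# One new year on top of 19 existing years.
growth = expected_growth(19)

# Hypothetical row counts for a single table.
ok, rel = row_count_change(old_rows=1000, new_rows=1053)
```

A table whose row count changes by far more or far less than this rough expectation is worth a closer look before the expected counts are updated.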
This was linked to issues on Oct 27, 2021
zaneselvans added labels on Oct 27, 2021:

* eia860: Anything having to do with EIA Form 860
* eia923: Anything having to do with EIA Form 923
* ferc1: Anything having to do with FERC Form 1
* new-data: Requests for integration of new data.
* testing: Writing tests, creating test data, automating testing, etc.
zaneselvans changed the title from "First draft of 2020 FERC1-EIA ID Mapping" to "Integrate 2020 data for ferc1, eia860, eia923" on Oct 28, 2021
cmgosnell requested changes on Oct 28, 2021
Everything looks great! I have some minor questions and a few suggestions for docs in comments.
For details see the individual commit messages, which should be pretty detailed.
At this point all of the integration and data validation tests are passing with the exception of:
dev
-- but all of them are off by single-digit percentages, which is what we expect to see given that we're adding 1 additional year to 19 existing years.