Integrate 2020 data for ferc1, eia860, eia923 #1297

Merged
zaneselvans merged 7 commits into 2020 from ferc1-eia-id-mapping-2020
Oct 28, 2021

@zaneselvans zaneselvans commented Oct 21, 2021

For details see the individual commit messages, which should be pretty detailed.

At this point all of the integration and data validation tests are passing with the exception of:

  • eia861 and ferc714 related stuff (as expected)
  • None of the expected numbers of records have been updated yet, since that's tedious and will change when we merge stuff in from dev -- but all of them are off by single-digit percentages, which is what we expect to see given that we're adding 1 additional year to 19 existing years.

* Replaced the ferc1-eia-glue.ipynb notebook with a script.
* Reduced the number of output files to reflect information we're
  actually paying attention to -- just the plants/utilities to be
  mapped, one file each for EIA and FERC 1
* Added assertions about having all the years loaded in your DBs
* Added assertions that check for lost plants and utilities
* Added `link_to_ferc1` boolean columns to the EIA plants/utils outputs
  that indicate which records should be considered for linkage to their
  FERC 1 counterparts.
* devtools/ferc1-eia-glue/find_unmapped_plants_utils.py script includes
  a help message (-h or --help) that explains how the outputs are being
  used.
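
The two kinds of assertions listed above (all years loaded, no lost plants or utilities) could be sketched roughly as follows. This is only an illustration, not the code in `find_unmapped_plants_utils.py`: the function names, the year range, and the example IDs are all hypothetical.

```python
def assert_all_years_loaded(loaded_years, expected_years):
    """Raise if any expected year of data is missing from the database."""
    missing = sorted(set(expected_years) - set(loaded_years))
    if missing:
        raise AssertionError(f"Missing years of data: {missing}")


def find_lost_ids(previously_mapped_ids, current_ids):
    """Return IDs that were mapped before but no longer appear in the data."""
    return sorted(set(previously_mapped_ids) - set(current_ids))


# Hypothetical inputs: years found in a local DB and a handful of made-up IDs.
expected = range(2001, 2021)
loaded = list(range(2001, 2021))
assert_all_years_loaded(loaded, expected)  # passes silently

lost = find_lost_ids(previously_mapped_ids={1, 2, 3}, current_ids={2, 3, 4})
if lost:
    print(f"Lost IDs that need attention: {lost}")
```

Using set differences keeps both checks independent of row order and of duplicate records in the inputs.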

In service of issue #1069 and others

* Moved find_new_ferc1_strings from the ferc1 new year notebook into helpers.py
* Added 2020 fuel type, unit, plant construction type, etc strings to the FERC 1 string
  maps in pudl.transform.ferc1
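
The idea behind a helper like `find_new_ferc1_strings` can be illustrated as follows. The function name, signature, and the shape of the string map below are assumptions made for this sketch, not the actual implementation in `helpers.py`: it assumes a dictionary mapping each canonical category to the list of free-form strings already assigned to it.

```python
def find_unmapped_strings(observed, string_map):
    """Return observed strings that are not yet assigned to any category.

    observed: iterable of raw, free-form strings reported in a new year.
    string_map: dict mapping a canonical label to a list of known raw strings.
    """
    known = {s.strip().lower() for variants in string_map.values() for s in variants}
    cleaned = {s.strip().lower() for s in observed if s and s.strip()}
    return sorted(cleaned - known)


# Hypothetical fuel type map and a few made-up reported values.
fuel_map = {"gas": ["gas", "natural gas", "nat gas"], "coal": ["coal", "bit coal"]}
reported = ["Natural Gas", "coal", "ng", " Bit Coal ", ""]
print(find_unmapped_strings(reported, fuel_map))
```

Any strings returned would then need to be assigned by hand to the appropriate category in the string maps.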

Only one unmapped EIA utility remains: "marathon oil co", with utility_id_eia 11631

However, the ETL fails when trying to pull in the glue tables with the following error:

        # At this point there should be at most one row in each of these data
        # frames with NaN values after we drop_duplicates in each. This is because
        # there will be some plants and utilities that only exist in FERC, or only
        # exist in EIA, and while they will have PUDL IDs, they may not have
        # FERC/EIA info (and it'll get pulled in as NaN)

        for df, df_n in zip(
            [plants_eia, plants_ferc1, utilities_eia, utilities_ferc1],
            ['plants_eia', 'plants_ferc1', 'utilities_eia', 'utilities_ferc1']
        ):
            if df[pd.isnull(df).any(axis=1)].shape[0] > 1:
>               raise AssertionError(
                    f"FERC to EIA glue breaking in {df_n}. There are too many null "
                    "fields. Check the mapping spreadhseet.")
E               AssertionError: FERC to EIA glue breaking in plants_eia. There are too many null fields. Check the mapping spreadhseet.

There are a few pages of EIA plants that have plant_id_eia values but no plant names.
Maybe that's the problem? If I log a warning instead of raising an assertion things seem
to work...
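
A minimal sketch of the "warn instead of raise" workaround described above, assuming pandas is available. The function name `check_null_rows`, its parameters, and the example data are hypothetical and are not part of `pudl.glue.ferc1_eia`; only the null-row test mirrors the check shown in the traceback.

```python
import logging

import pandas as pd

logger = logging.getLogger(__name__)


def check_null_rows(df, df_name, max_null_rows=1, strict=True):
    """Count rows containing any null value and either raise or warn.

    Returns the number of rows with at least one null field.
    """
    n_null_rows = int(df.isnull().any(axis=1).sum())
    if n_null_rows > max_null_rows:
        msg = (
            f"FERC to EIA glue breaking in {df_name}: "
            f"{n_null_rows} rows contain null fields."
        )
        if strict:
            raise AssertionError(msg)
        logger.warning(msg)
    return n_null_rows


# Made-up example: plants that have an ID but no name.
plants_eia = pd.DataFrame({
    "plant_id_eia": [1, 2, 3],
    "plant_name_eia": ["a plant", None, None],
})
check_null_rows(plants_eia, "plants_eia", strict=False)  # logs a warning
```

Keeping a `strict` switch means the defensive check can stay in place and be re-enabled once the nameless plants are accounted for.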

The full ETL with all FERC1 and EIA 860/923 data will run without
obvious errors. There are still tests and validations that fail, but at
least you can load the DB.

This does *not* include eia860m or EPA CEMS data yet. FERC-714 and
EIA-861 also remain to be updated for 2020.

Issues that remain:
* Something screwy is going on with FERC respondent 542 -- it shows up
  only in the `f1_respondent_id` table, and has all Null data there...
  and our unmapped utility finder script failed to identify it.  See
  #1304
* A defensive assertion aimed at identifying human errors in the ID
  mapping sheet is failing because (probably?) we have a fair number of
  plants and utilities with IDs but no names in there now. See #1305 and
  also #1232

I ran the full ETL with eia860m integrated to get the next round of IDs for mapping, and
found that there was only 1 additional unmapped EIA utility and no additional plants. I
also created a new eia860m archive with the newest data (through 2021-08). It seems to
"just work".

I also updated the eia861 Zenodo archive to include the 2020 data, and switched the DOIs
over to the 2020-inclusive archives for ferc714 and eia861.

For eia861 I've updated the filemap.csv for 2020, but the column maps and other
per-table metadata files still need to be updated.

For ferc714 I updated the filenames. It gets through "extract" but raises an
AssertionError in the transform step for the hourly demand by planning area, which will
require some actual debugging.

I also created a new archive for eia923, which saw updates on October 8th.
Thankfully it didn't break anything big -- it just changed file names and added the
balancing_authority_code_eia to the gf, bf, frc, and gen tables (it gets harvested out).

* Updated README & release notes to reflect state of 2020 data availability
* Switched back to using short-codes for the fuel transportation modes due to null
  columns (enum constraint was tied to labels, rather than codes, and this is going to
  get changed when we merge in the dev / metadata stuff anyway)
* Switched back to calling the unidentified fuel type codes in FERC 1 "unknown" rather
  than "other" -- this will also get overwritten by the new metadata stuff.
* Fixed EPA CEMS tests so that they work with --live-dbs
* Lowered the lower bound on natural gas price aggregations in the data validations to
  accommodate the insanely low gas prices from 2020 *le sigh*

The only known integration test and data validation failures which remain at this point
are:
  * Expected issues with the EIA-861 / FERC-714 interim ETL or derived outputs
  * The expected number of rows in all of the tables has yet to be updated. However they
    all look like they're off by reasonable amounts (mid-single-digit percentages) given
    that we are adding one new year of data to 19 existing years of data.
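
As a rough sanity check on that expectation: if every year contributed about the same number of records, going from 19 to 20 years would grow a table by 1/19, or about 5.3%. The sketch below applies that back-of-the-envelope reasoning. The uniform-rows-per-year assumption, the 10% tolerance, and the row counts are illustrative only and are not values from the PUDL validation tests.

```python
def expected_growth(existing_years, new_years=1):
    """Fractional growth in rows if every year has the same number of rows."""
    return new_years / existing_years


def within_tolerance(expected_rows, actual_rows, max_frac_diff=0.10):
    """True if actual_rows differs from expected_rows by at most max_frac_diff."""
    return abs(actual_rows - expected_rows) / expected_rows <= max_frac_diff


print(f"{expected_growth(19):.1%}")  # 5.3%

# Made-up counts: 1,000 rows per year, before and after adding one year.
old_expected = 19_000
new_actual = 20_000
print(within_tolerance(old_expected, new_actual))  # True
```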
@zaneselvans zaneselvans added labels Oct 27, 2021: eia860 (Anything having to do with EIA Form 860), eia923 (Anything having to do with EIA Form 923), ferc1 (Anything having to do with FERC Form 1), new-data (Requests for integration of new data.), testing (Writing tests, creating test data, automating testing, etc.)
@zaneselvans zaneselvans marked this pull request as ready for review October 27, 2021 17:15
@zaneselvans zaneselvans changed the title from "First draft of 2020 FERC1-EIA ID Mapping" to "Integrate 2020 data for ferc1, eia860, eia923" Oct 28, 2021

@cmgosnell cmgosnell left a comment

Everything looks great! I have some minor questions and a few suggestions for docs in the comments.

src/pudl/constants.py (resolved)
src/pudl/glue/ferc1_eia.py (resolved)
src/pudl/glue/ferc1_eia.py (resolved)
src/pudl/extract/ferc1.py (resolved)
@zaneselvans zaneselvans merged commit 05abe39 into 2020 Oct 28, 2021
@zaneselvans zaneselvans deleted the ferc1-eia-id-mapping-2020 branch October 28, 2021 20:50
@zaneselvans zaneselvans restored the ferc1-eia-id-mapping-2020 branch October 28, 2021 20:51