Move the EIA plant-parts list into PUDL outputs #1157
Conversation
Pulling over the relevant module from https://github.com/catalyst-cooperative/rmi-ferc1-eia/. Working on issue #952.
The EIA plant-parts list is now integrated into pudl_out. Three tables/methods were added, one for each major step along the way to the plant-parts list. I added one annual run of the plant-parts list to the CI tests, and one run of one of the test functions that checks the aggregations to the validation tests. I removed the caching from the plant_parts_eia.py objects because this is now taken care of in the pudl_out object.
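For orientation, a minimal usage sketch of the new output method, assuming the usual PudlTabl setup (the database path and constructor arguments here are illustrative, not prescribed by this PR):

```python
import sqlalchemy as sa
import pudl

# Hypothetical connection details; adjust to your local PUDL database.
pudl_engine = sa.create_engine("sqlite:////path/to/pudl.sqlite")
pudl_out = pudl.output.pudltabl.PudlTabl(pudl_engine, freq="AS")  # annual aggregation

# The new output method added in this PR builds the full plant-parts list.
plant_parts = pudl_out.plant_parts_eia()
```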
Codecov Report

@@            Coverage Diff            @@
##              dev    #1157     +/-   ##
=========================================
+ Coverage    83.2%    84.2%    +0.9%
=========================================
  Files          65       66       +1
  Lines        6886     7307     +421
=========================================
+ Hits         5733     6155     +422
+ Misses       1153     1152       -1

Continue to review full report at Codecov.
As of 9/7/21 this integration is in a PR, catalyst-cooperative/pudl#1157.
PR looks good! The code is stable and the outputs seem to be providing good record linkage results.
Running pudl_out.plant_parts_eia() took about 16 minutes on my computer, which I think is fast enough for our purposes. I did profile the code and found that loading the MCOE data took 25% of the total runtime and finding qualifying records took 31%.
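For reference, a hedged sketch of how such a profile might be collected (not the exact commands behind the numbers above; pudl_out setup omitted):

```python
import cProfile
import pstats

# Profile one full run of the plant-parts list build.
with cProfile.Profile() as profiler:
    pudl_out.plant_parts_eia()

# Show the 20 most expensive calls by cumulative time.
pstats.Stats(profiler).sort_stats("cumulative").print_stats(20)
```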
It took me a while to wrap my head around what each step in the process was doing and why. The code does a bunch of transformations and it is difficult to understand what a single record represents at the end of each transformation. As someone who is less familiar with energy systems and raw FERC data, I think providing some examples of inputs and expected outputs in unit tests would be helpful.
I would like to know more about the desired outcome of the plant parts process. Are the entities in PLANT_PARTS real entities in power plants, or are they just strange slices that exist in FERC? If they are real entities, how are they related to one another? If they are just weird slices from FERC, how did we identify plant_technology, plant_prime_mover, etc. as important slices to try to create from EIA data? I think answering these questions could help us potentially simplify the plant parts logic.
src/pudl/analysis/plant_parts_eia.py
Outdated
infrastructure as other plant-part list records. For example, if we have a
"plant_prime_mover" plant part record and a "plant_unit" plant part record
which were both cobbled together from the same two generators. We want to be
able to reduce the master unit list to only unique collections of generators,
Is it possible to provide some sample data for an example of a true granularity? As someone without much energy expertise, having specific examples of how these plant parts interact would be super helpful.
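As a possible starting point, a hand-built toy illustration of the overlap described in the excerpt above (column names and values invented, not actual PUDL data):

```python
import pandas as pd

# Hypothetical case: a "plant_unit" record and a "plant_prime_mover" record
# that were both assembled from the same two generators. Only one of them
# represents a distinct ("true") granularity.
records = pd.DataFrame({
    "plant_id_eia": [1, 1],
    "plant_part": ["plant_unit", "plant_prime_mover"],
    "generator_ids": [("gen_a", "gen_b"), ("gen_a", "gen_b")],
    "capacity_mw": [100.0, 100.0],
})

# Reducing to unique collections of generators keeps a single record
# per distinct set of underlying generators.
unique = records.drop_duplicates(subset=["plant_id_eia", "generator_ids"])
```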
src/pudl/analysis/plant_parts_eia.py
Outdated
In order to accumulate every possible version of how a generator could
be reported, this method generates two records for each generator's
reported owners: one of the portion of the plant part they own and one
for the plant-part as a whole. The portion records are labeled in the
`ownership` column as "owned" and the total records are labeled as
"total".
Is it possible to add an example of a FERC record that splits a plant part into ownership slices?
great idea!
src/pudl/output/pudltabl.py
Outdated
Note: This table is entirely intended for use within
``PudlTabl.plant_parts_eia`` and would be difficult to work with
outside of that context.
Does each step of the plant parts process need to live in pudltabl.py? plant_parts_eia() could create each plant part object.
Yeah, I was wondering this myself. I mostly set it up this way for debugging, because I have very often needed to generate the final output but not these two interim outputs. I think with the deep_update argument of plant_parts_eia this workflow could stay enabled without the interim tables having their own output methods.
This enables checking of consistency across time. I think the record_id_eia column is not the cleanest as it is; hoping to get some guidance on that in general.
This has to get merged in because it's being used and the logistics of having it live in a separate repo are unworkable, but I don't really understand what's going on. Most of the comments I've made are small localized issues, but architecturally I think we have to find a way for this to be much more understandable and ideally also faster. I don't see any fundamental reason why these operations should have to take tens of minutes to run.
The identification of distinct record groups ("true granularities") feels particularly opaque.
I was also completely lost in the process of aggregating scaled generator+ownership slices.
I was unable to keep track of the nature of the tables being manipulated at various points in the process -- what to expect their natural keys to be, whether they represented individual plant parts for a single plant, or the whole dataset. I don't understand what the record IDs are.
I suspect some of the confusion could be helped by abstracting out operations that work on a given "plant part" or column that's used for scaling (capacity and ownership fraction) and separating the ownership-related operations from the plant-attribute related operations, and also from the operational status operations. Maybe setting up several different dataframes each of which is simpler to think about, which are then combined at the end into the output that's actually used for matching with FERC 1.
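To make the suggestion concrete, a structural sketch of that kind of decomposition (every function name here is hypothetical, operational-status handling is omitted, and this is not the PR's actual code):

```python
import pandas as pd

def scale_by_ownership(gens):
    """Only the ownership expansion: an 'owned' and a 'total' record per owner."""
    owned = gens.assign(
        ownership="owned", capacity_mw=gens.capacity_mw * gens.fraction_owned
    )
    total = gens.assign(ownership="total")
    return pd.concat([owned, total], ignore_index=True)

def aggregate_plant_part(gens, part_cols):
    """Only the per-plant-part attribute aggregation, keyed by ``part_cols``."""
    return gens.groupby(part_cols + ["ownership"], as_index=False)["capacity_mw"].sum()

def build_plant_parts(gens):
    """Combine the simpler pieces into the output used for FERC 1 matching."""
    scaled = scale_by_ownership(gens)
    parts = {
        "plant": ["plant_id_eia"],
        "plant_prime_mover": ["plant_id_eia", "prime_mover_code"],
    }
    return pd.concat(
        [
            aggregate_plant_part(scaled, cols).assign(plant_part=name)
            for name, cols in parts.items()
        ],
        ignore_index=True,
    )
```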
I also wasn't able to follow the flow of the program. Eventually it just became a bunch of different free-floating functions, with notes about where they were expected to be called, but I lost track of the high-level order of operations.
If we do get the CCAI grant and move to using a data structure like this as a fundamental building block of many parts of our overall data linkages, I think we need to take what you've learned from this process and do a new iteration.
src/pudl/helpers.py
Outdated
return result.to_frame(name=data_col).reset_index()


def agg_cols(df_in, id_cols, sum_cols, wtavg_dict):
agg_cols isn't descriptive of what this function does -- it's a very specific kind of aggregation of columns. The one-line docstring also doesn't really explain what's happening, other than that it involves dataframes, summing, and weighted averages. Those operations could be combined in lots of different ways to do different things.
Very fair point. I've renamed it to sum_and_weighted_average_agg. Does that sound okay?
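Based on the signature above, a hedged sketch of what a helper like this presumably does (not the actual PUDL implementation; assumes the weighted-average data columns are distinct from sum_cols):

```python
import pandas as pd

def sum_and_weighted_average_agg(df_in, id_cols, sum_cols, wtavg_dict):
    """Group by ``id_cols``; sum ``sum_cols``; weighted-average ``wtavg_dict`` keys.

    ``wtavg_dict`` maps each data column to the column used as its weight.
    """
    out = df_in.groupby(id_cols, as_index=False)[sum_cols].sum()
    for data_col, weight_col in wtavg_dict.items():
        weighted = df_in.assign(_wtd=df_in[data_col] * df_in[weight_col])
        wtavg = (
            weighted.groupby(id_cols)["_wtd"].sum()
            / weighted.groupby(id_cols)[weight_col].sum()
        ).rename(data_col).reset_index()
        out = out.merge(wtavg, on=id_cols, how="left")
    return out
```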
""" | ||
plant_parts_eia = ( | ||
pudl.helpers.calc_capacity_factor( |
Helpers seems like a weird place to have this analytical function.
I put it here because it is now being used in both analysis.mcoe and this plant_parts_eia model. I suppose I could move it back into mcoe, but it is now used in both places.
Hey @ezwelty, we're having a doctest issue here that makes absolutely no sense to us. We've spent hours trying to debug it, and wonder if it might be clear to you what's happening. On @cmgosnell's computer the doctests for this new module pass.

But on my (Linux) machine, and in the CI, some of the doctests are failing and claiming that the Python code embedded in the docstrings is producing no output, resulting in an "Expected [dataframe] got nothing" error. Examples of the build failure can be found here. This is happening in the doctests embedded within some of the class-level docstrings.

The methods that are being exercised clearly produce output when used in other contexts, and the tests are passing in her MacOS environment. Other doctests that construct dataframes by hand pass fine. It seems bizarre and we are stumped.
@zaneselvans Strange indeed. I see no way, reading through the code, for the functions being tested to return nothing. Some things to try: making the failing tests more like the working ones:

>>> df = PlantPart(part_name='plant').ag_part_by_own_slice(gens_mega, sum_cols=['capacity_mw'], wtavg_dict={})
>>> df
...

Or even explicitly calling print():

>>> print(PlantPart(part_name='plant').ag_part_by_own_slice(gens_mega, sum_cols=['capacity_mw'], wtavg_dict={}))
...
Task list to move doctests over to unit tests:
In the latest commits I merged in dev. The examples @cmgosnell included in the doctests made sense to me, so I just migrated them over directly. Questions:
I only looked at the doctest-to-unit-test migration portion of the PR for this go-around.
Seemed like there was a leftover formatting fixture (from the too-wide doctests), and like it might be preferable to break the one large test function that's checking several different things into several smaller more specific tests, unless I missed some cross-dependency between them.
test/conftest.py
Outdated
@pytest.fixture(autouse=True, scope='session')
def pandas_terminal_width():
    """Set pandas displays for doctests."""
    pd.options.display.width = 1000
    pd.options.display.max_columns = 1000
Should this still be here, or was it only for the doctests, which have now been removed?
import pudl.analysis.plant_parts_eia


def test_plant_part():
Is there a reason that all of these individual unit tests need to be combined in a single test function? Or can they easily be broken out into several more specific unit tests, which would make it easier to identify exactly which test has failed when something goes wrong? I see gens_mega shows up in many of the checks, but that could be pulled out into a GENS_MEGA constant that's used in many of the individual tests.
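A hedged sketch of that structure (test bodies and data invented purely for illustration):

```python
import pandas as pd

# Hypothetical shared input, pulled up to module scope so each small test
# can reuse it (columns invented for illustration):
GENS_MEGA = pd.DataFrame({
    "plant_id_eia": [1, 1],
    "generator_id": ["gen_a", "gen_b"],
    "capacity_mw": [50.0, 50.0],
})

def test_plant_capacity_aggregation():
    """Check only the plant-level capacity sum."""
    out = GENS_MEGA.groupby("plant_id_eia", as_index=False)["capacity_mw"].sum()
    assert out.loc[0, "capacity_mw"] == 100.0

def test_generator_records_preserved():
    """Check only that the generator-level records are intact."""
    assert len(GENS_MEGA) == 2
```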
Also, there are now some merge conflicts.
# I tried adding this into __init__.py.
# I tried adding this into the module docstring.
pd.options.display.width = 1000
pd.options.display.max_columns = 1000
Was this the final decision on where to put this?
I took out that formatting fixture, but the pandas formatting (the pd.options.display settings shown above) is still set at the top of the test module. I broke up the different plant part aggregations into smaller, more specific tests and made the global GENS_MEGA constant. I also merged in dev.
Hookay. This PR is an attempt to move the EIA plant-parts list into PUDL. The bulk of the code is in pudl.analysis.plant_parts_eia.py. I added the little standard wrappers for the three main outputs of the plant-parts list into the pudl_out object, a small collection of helper functions, one "run-it" CI test, and one data validation test.

This is a biggie. There is a fair amount of documentation in pudl.analysis.plant_parts_eia.py. I highly recommend reading the module docstring before looking at any of the code.

There are two classes in there that still feel like problem children to me because they are conceptually challenging and a bit slow (LabelTrueGranularities and AddQualifier).

Desires from this PR:

- LabelTrueGranularities (this is my least fav part... it is conceptually challenging and slow)
- AddQualifier (this part is just slow because it is doing a ton of groupby().count()'s)